Researchers from Robust Intelligence and Yale University have developed a machine learning technique called the Tree of Attacks with Pruning (TAP), which can quickly jailbreak large language models (LLMs) in an automated manner. This technique can induce sophisticated models like GPT-4 and Llama-2 to produce harmful responses in just minutes. The researchers found that this vulnerability is universal across LLM technology, and they do not see an obvious fix for it.
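At a high level, TAP uses an attacker LLM to iteratively refine candidate jailbreak prompts, branching into variants at each step, while an evaluator LLM prunes variants that drift off-topic and scores the target model's responses. The sketch below illustrates that branch-and-prune loop with every model call stubbed out; the function names, 0-10 scoring scale, and branching factors are illustrative assumptions, not the researchers' actual implementation:

```python
# Illustrative sketch of a tree-of-attacks-with-pruning search loop.
# All three model calls (attacker_llm, evaluator_llm_*, target_llm) are
# stubs standing in for real LLM API calls.

def attacker_llm(prompt, feedback):
    """Stub: an attacker model would refine the jailbreak prompt here."""
    return [f"{prompt} [variant {i}, noting: {feedback}]" for i in range(2)]

def evaluator_llm_on_topic(candidate, goal):
    """Stub: the evaluator discards candidates that drift off the goal."""
    return goal in candidate

def target_llm(candidate):
    """Stub: the model under test; returns its response text."""
    return "stub response"

def evaluator_llm_score(response):
    """Stub: rates how fully the response fulfills the request (0-10)."""
    return 1

def tap_search(goal, max_depth=3, width=4):
    frontier = [goal]  # root of the attack tree
    for _ in range(max_depth):
        # Branch: each surviving prompt spawns refined variants.
        candidates = []
        for prompt in frontier:
            candidates += attacker_llm(prompt, feedback="previous refusal")
        # Prune phase 1: drop variants that wandered off-topic.
        candidates = [c for c in candidates if evaluator_llm_on_topic(c, goal)]
        # Query the target and score each surviving candidate.
        scored = []
        for c in candidates:
            score = evaluator_llm_score(target_llm(c))
            if score >= 10:  # full jailbreak found
                return c
            scored.append((score, c))
        # Prune phase 2: keep only the highest-scoring branches.
        scored.sort(reverse=True)
        frontier = [c for _, c in scored[:width]]
    return None  # no jailbreak found within the query budget
```

Because the stub evaluator never awards a full score, this sketch always exhausts its depth budget; with real models, the loop terminates as soon as a prompt elicits the target behavior, which is how TAP stays under a few dozen queries.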

Various attack tactics can be used against LLM-based AI systems, including prompt attacks, backdooring, extraction of sensitive training data, and confusing models with adversarial examples. The researchers' automated adversarial machine learning technique enables such attacks by bypassing the guardrails placed on the models.

The researchers tested the method against several LLMs and found that it successfully discovers jailbreaking prompts for over 80% of harmful information requests, using fewer than 30 queries on average. They shared their findings with the developers of the tested models before making the research public.

As tech giants continue to build new specialized large language models, researchers have been probing them for security weaknesses. Google and Microsoft have set up specific teams and bug bounty programs to cover AI-related threats. Additionally, the AI Village at DEF CON hosted red teamers earlier this year to test LLMs from various companies and uncover vulnerabilities.

