Jailbreaking to Jailbreak: A Novel Approach to Safety Testing

Large Language Models (LLMs) are advancing rapidly and it's easy to focus solely on transformative new possibilities and applications. It is crucial, however, that we do not lose sight of safety. This has long been a significant research area for us, and our most recent contribution is our discovery of a novel approach to scale vulnerability testing: teaching LLMs to systematically break through their own safeguards and those of other models. In other words, we prompt an LLM to act as a red teamer in a sustained and creative multi-turn attack, identifying and exploiting vulnerabilities, even in models designed to refuse such attempts. We are calling this approach J2 (Jailbreaking to Jailbreak).
To prevent harmful outputs, LLMs undergo refusal training that helps them recognize and reject dangerous requests. However, these safeguards can be bypassed because models often fail to generalize defenses against novel attack strategies. There are generally two main approaches: human red-teamers and automated methods. Human red-teamers are by far the most effective at manually probing for weaknesses to exploit, but they are resource-intensive and hard to scale. Automated methods can run many tests quickly, but often fail to catch more nuanced vulnerabilities.
Our J2 approach is a hybrid of these approaches, leveraging the strategy and reasoning of humans, while maintaining the scalability of automated methods. Though not quite as effective as a team of professional human red-teamers just yet, J2 success rates suggest we've reached a milestone in which a hybrid human and LLM red-teaming is now a viable alternative (or addition) to human-only testing, with implications for AI safety at scale.
A Multi-Stage Process
To systematically test LLM safeguards, J2 follows a multi-stage process. First, we jailbreak an LLM to be willing to help jailbreak other models. Then, we employ a three-stage process to run J2 against other models automatically:
1. Plan the attack
2. Engage in adaptive back-and-forth interactions
3. Assess its success
Figure 1: Illustration of the first two stages of J2.
Plan: J2 draws from a curated set of testing strategies developed from human red-teaming insights. These range from technical approaches — such as embedding requests within code or documentation — to more creative methods, like crafting fictional narratives or simulating forum discussions to bypass safeguards. The strategies vary in specificity, some are quite detailed, while some are extremely open-ended allowing for greater flexibility and creativity by the model.
Attack: J2 engages in multi-turn conversations with the target model, dynamically adjusting its approach based on responses. Unlike traditional automated attacks that blindly iterate on the same prompt, J2 treats the conversation as a problem-solving task, identifying weaknesses and adapting its strategy in real time.
Debrief: An external GPT-4o evaluator assesses whether J2’s attack was successful, reducing false positives and ensuring accurate evaluation. J2 uses this to review its own performance, determines whether further attempts could improve success rates, and retains key insights — allowing it to refine its approach over multiple iterations, much like a human red teamer learning from trial and error.
Figure 2: An example attack log between the LLM red teamer (i.e. J2(Sonnet-3.5)) and the target LLM (i.e. GPT-4o).
Key Findings
Our experiments revealed that different models develop distinct strategies for bypassing safeguards, and some were more effective than others. Gemini-1.5-pro performed best with just three turns of conversation, while different versions of Sonnet.3.5 had varied performances. The older checkpoint of Sonnet-3.5-06202024 was very resistant to jailbreak other models, while the newer checkpoint Sonnet-3.5-1022 (known as the first computer-use LLM), was willing to jailbreak and generally took six turns to achieve the best results. GPT-4o was not able to red-team based on similar prompting.
When tested against GPT-4o, our J2 attackers achieved 93% and 91% success rates using Sonnet-3.5 and Gemini-1.5-pro respectively, approaching human red-teaming effectiveness with attack success rates of 98%. Though traditional algorithmic methods still performed better than J2 against strongly defended models, they require access to the internal workings of the model, whereas J2 does not.
We found that capable LLMs used as attack agents can systematically reason through and bypass their own safeguards when given an injection attack starting point, effectively jailbreaking themselves. Additionally, the agent-enabling injection attack transfers effectively between different SOTA LLMs. We also found that J2 models at times displayed unexpectedly aggressive personality traits that exceeded the scope of its testing objectives.
These findings have serious implications for AI safety and suggest that current approaches to model security need fundamental rethinking.
A New Safety Paradigm
While the development of J2 marks a significant advance in scalable, effective, automated security testing, it also reveals how quickly the landscape of AI is evolving, particularly regarding safety. As LLMs become more sophisticated, we need to acknowledge and deal with the possibility that the same intelligence being built into these systems could be used to get around their refusal training.
At Scale, we believe it is essential that we keep the needs of humanity front and center as we develop ever more powerful AI systems. By identifying these vulnerabilities and sharing our methodology, the hope is that we advance the entire field of AI safety research. While we are keeping the specific attack details private, we think it’s important to highlight where systems of AI interacting with each create increasing complexity and potential vulnerability so that safety measures can evolve as well.
Check out the full paper here: https://scale.com/research/j2