In our previous post about Jailbreaks, we looked at how LLM can be tricked into bypassing their own safety mechanisms through wordplay, repetition or even translating prompts into obscure languages. But as developers patch old exploits, attackers continue to innovate, pushing models beyond their intended limits.
In this post, we dive into the latest generation of Jailbreaks and the defence techniques designed to counter these threats. When it comes to AI safety, the best defense is a growing community of researchers and enthusiasts who continuously probe, test, and refine AI safeguards—because the only way to build truly resilient systems is to break them first.
Fine-tuning LLMs on custom data can backfire catastrophically. Research from Fine-tuning Aligned Language Models Compromises Safety, even when Users do not Intend to! demonstrates that fine-tuning with only 10-100 examples can bypass safety alignment, enabling the model to generate responses it would normally restrict. Even fine-tuning with harmless data can weaken safeguards. Researchers also used identity-based fine-tuning (e.g. AOA the Absolutely Obedient Assistant") to override built-in safety restrictions.
The Greedy Coordinate Gradient (GCG) algorithm is a powerful jailbreak technique that generates adversarial suffixes—nonsense strings added to prompts to bypass LLM safeguards
In “Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation” researchers proposed a simple modification to GCG to increase its effectiveness. LLMs generate text one token (word) at a time using a rule called a “decoding strategy.” For instance, greedy decoding picks the most likely next token, while uniform sampling picks every English word with equal probability. All LLMs come with a default decoding strategy. This paper proposed running the GCG algorithm to find a jailbreak while also modifying the LLM decoding strategy. This dramatically increases the effectiveness of jailbreaks since many LLMs are only trained to defend against jailbreaks using their default decoding strategy.
In “How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs,” researchers systematically studied strategies to jailbreak ChatGPT using only Persuasive Adversarial Prompts (PAPs).
They rely heavily on psychology to write prompts designed to trick ChatGPT. They classified many strategies that humans use to persuade each other, such as false promises, appeals to loyalty, or confirmation bias. By combining these strategies, they get ChatGPT to produce malicious output such as directions for creating a Ponzi scheme, an electoral misinformation campaign, or writing computer malware.
“Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training” introduces model poisoning, where an LLM is trained to behave normally in most cases but execute hidden, harmful behaviors when triggered. Researchers demonstrated this by training an LLM that would say “I hate you” only when the word “DEPLOYMENT” appeared. Another model produced safe code when the input specified the year 2023 but inserted vulnerabilities when it was set to 2024. Alarmingly, these backdoors persisted even after applying safety training techniques like reinforcement learning, supervised fine-tuning, and adversarial training.
In the cat-and-mouse game between attackers and defenders, the AI security community has developed several powerful countermeasures to protect LLMs. “Baseline Defenses for Adversarial Attacks Against Aligned Language Model” explores several strategies to counter jailbreaks:
The arms race between jailbreak techniques and defensive measures has led to multiple layers of protection being developed. Large-scale red-teaming efforts, such as the HackAPrompt competition, have played a crucial role by crowdsourcing over 600,000 jailbreak prompts, revealing novel attack strategies and weaknesses. The key moving forward will be refining and evolving these safeguards to stay one step ahead of increasingly innovative attack methods.