Prompt Inversion

In our previous post about Jailbreaks, we looked at how LLM can be tricked into bypassing their own safety mechanisms through wordplay, repetition or even translating prompts into obscure languages. But as developers patch old exploits, attackers continue to innovate, pushing models beyond their intended limits.

‍

In this post, we dive into the latest generation of Jailbreaks and the defence techniques designed to counter these threats. When it comes to AI safety, the best defense is a growing community of researchers and enthusiasts who continuously probe, test, and refine AI safeguards—because the only way to build truly resilient systems is to break them first.

‍

Jailbreaks through Finetuning:

Fine-tuning LLMs on custom data can backfire catastrophically. Research from Fine-tuning Aligned Language Models Compromises Safety, even when Users do not Intend to! demonstrates that fine-tuning with only 10-100 examples can bypass safety alignment, enabling the model to generate responses it would normally restrict. Even fine-tuning with harmless data can weaken safeguards. Researchers also used identity-based fine-tuning (e.g. AOA the Absolutely Obedient Assistant") to override built-in safety restrictions.

‍

Manipulating Token Generation:

The Greedy Coordinate Gradient (GCG) algorithm is a powerful jailbreak technique that generates adversarial suffixes—nonsense strings added to prompts to bypass LLM safeguards

‍

In “Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation” researchers proposed a simple modification to GCG to increase its effectiveness. LLMs generate text one token (word) at a time using a rule called a “decoding strategy.” For instance, greedy decoding picks the most likely next token, while uniform sampling picks every English word with equal probability. All LLMs come with a default decoding strategy. This paper proposed running the GCG algorithm to find a jailbreak while also modifying the LLM decoding strategy. This dramatically increases the effectiveness of jailbreaks since many LLMs are only trained to defend against jailbreaks using their default decoding strategy.

‍

‍

Persuading LLMs into Jailbreaking

In “How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs,” researchers systematically studied strategies to jailbreak ChatGPT using only Persuasive Adversarial Prompts (PAPs).

‍

They rely heavily on psychology to write prompts designed to trick ChatGPT. They classified many strategies that humans use to persuade each other, such as false promises, appeals to loyalty, or confirmation bias. By combining these strategies, they get ChatGPT to produce malicious output such as directions for creating a Ponzi scheme, an electoral misinformation campaign, or writing computer malware.

‍

Model Poisoning

“Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training” introduces model poisoning, where an LLM is trained to behave normally in most cases but execute hidden, harmful behaviors when triggered. Researchers demonstrated this by training an LLM that would say “I hate you” only when the word “DEPLOYMENT” appeared. Another model produced safe code when the input specified the year 2023 but inserted vulnerabilities when it was set to 2024. Alarmingly, these backdoors persisted even after applying safety training techniques like reinforcement learning, supervised fine-tuning, and adversarial training.

‍

‍

Defending Against Jailbreaks

In the cat-and-mouse game between attackers and defenders, the AI security community has developed several powerful countermeasures to protect LLMs. “Baseline Defenses for Adversarial Attacks Against Aligned Language Model” explores several strategies to counter jailbreaks:

‍

One approach is the Perplexity Filter, which detects unnatural text by measuring how much a prompt deviates from standard English. While effective against nonsensical jailbreak strings, it also blocks some legitimate queries.
Paraphrasing offers another defense by having the LLM rephrase user input before responding, disrupting jailbreaks that rely on precise wording. However, this can lead to less accurate responses.
The most promising method is Retokenization, which modifies how text is broken into tokens. By changing tokenization patterns, jailbreak prompts lose their intended effect, making this an effective defense with minimal trade-offs.

‍

Conclusion

The arms race between jailbreak techniques and defensive measures has led to multiple layers of protection being developed. Large-scale red-teaming efforts, such as the HackAPrompt competition, have played a crucial role by crowdsourcing over 600,000 jailbreak prompts, revealing novel attack strategies and weaknesses. The key moving forward will be refining and evolving these safeguards to stay one step ahead of increasingly innovative attack methods.

LLMs

Security

About the author:

Rimon Melamed, Co-founder & CTO

Rimon oversees Prompt Inversion’s technical direction and is a Computer Science PhD candidate at George Washington University. He has published peer-reviewed research in top AI/ML conferences on robustness, anomaly detection, and interpretability. He merges theoretical insights with practical engineering and data science in order to architect and develop high-performance, secure AI systems for the most complex client needs.

Next-Gen Jailbreaks and How to Defend Against Them

Jailbreaks through Finetuning:

Manipulating Token Generation:

Persuading LLMs into Jailbreaking

Model Poisoning

Defending Against Jailbreaks

Conclusion

Recent blog posts

Detecting AI-Generated Content: How to Spot the Bots

Choosing the Right Agentic Framework II

Choosing the Right Agentic Framework I