AI Jailbreak

Reviewed by Florian Amette, cybersecurity entrepreneur & security researcher

What is AI Jailbreak?

AI Jailbreak: A technique that causes an aligned AI model to bypass its safety policies and produce content or behaviour the operator intended to forbid.

AI jailbreaks exploit the gap between a model's general capabilities and its safety fine-tuning. Attackers use role-play scenarios, hypothetical framings, encoded instructions, or many-shot exemplars to get the model to disregard restrictions on content such as weapons instructions, malware, and hate speech, or to disclose its own system prompt. Well-known examples include the early "DAN" (Do Anything Now) prompts targeting GPT-3.5 and ChatGPT, and the many-shot jailbreaking technique described in Anthropic's 2024 research. Jailbreaks differ from prompt injection in that the attacker is the user interacting with the model directly, not a third party. Mitigations include adversarial training, constitutional methods, output classifiers, refusal-style grading, and continuous red-team evaluation.
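As a rough illustration of the output-classifier mitigation, the Python sketch below screens a model's response before it is returned. It is a toy under stated assumptions, not a production design: `guarded_generate`, `classify_output`, `BLOCKED_MARKERS`, and `REFUSAL` are hypothetical names, and the substring check merely stands in for a trained classifier.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical markers standing in for a trained output classifier.
BLOCKED_MARKERS = ("begin system prompt", "step-by-step synthesis")

# Hypothetical refusal text returned when the output is blocked.
REFUSAL = "Sorry, I can't help with that."


@dataclass
class GuardResult:
    text: str      # what the caller is allowed to see
    blocked: bool  # True if the raw output was replaced by a refusal


def classify_output(text: str) -> bool:
    """Return True if the text looks policy-violating (toy substring check)."""
    lowered = text.lower()
    return any(marker in lowered for marker in BLOCKED_MARKERS)


def guarded_generate(generate: Callable[[str], str], prompt: str) -> GuardResult:
    """Call the model, then screen its output before returning it."""
    raw = generate(prompt)
    if classify_output(raw):
        return GuardResult(text=REFUSAL, blocked=True)
    return GuardResult(text=raw, blocked=False)
```

The guard runs after generation, so it can still catch a response even when the prompt itself slipped past input-side checks. In practice it would be one layer among the mitigations listed above rather than a replacement for them.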

● Examples

01
"DAN" prompts that ask ChatGPT to role-play an unrestricted alter ego.
02
Many-shot jailbreaks that fill the context with fake examples of compliant harmful responses.
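Because a many-shot jailbreak packs a long fake dialogue into the context, one crude signal is the number of dialogue turns embedded in a single user message. The Python sketch below is a heuristic under stated assumptions: the role labels in `TURN_PATTERN` and the `max_turns` threshold are illustrative choices, not values from any published defence.

```python
import re

# Assumed role labels; real transcripts may use other markers.
TURN_PATTERN = re.compile(
    r"^\s*(user|human|assistant|ai)\s*:",
    re.IGNORECASE | re.MULTILINE,
)


def count_embedded_turns(message: str) -> int:
    """Count lines in one message that look like the start of a dialogue turn."""
    return len(TURN_PATTERN.findall(message))


def looks_many_shot(message: str, max_turns: int = 8) -> bool:
    """Flag a message that embeds more fake dialogue turns than the threshold."""
    return count_embedded_turns(message) > max_turns
```

A flag from a check like this is better treated as a reason for closer inspection than as proof of an attack, since legitimate users also paste transcripts.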

● Frequently asked questions

What is AI Jailbreak?

A technique that causes an aligned AI model to bypass its safety policies and produce content or behaviour the operator intended to forbid. It belongs to the AI & ML Security category of cybersecurity.

What does AI Jailbreak mean?

A technique that causes an aligned AI model to bypass its safety policies and produce content or behaviour the operator intended to forbid.

How do you defend against AI Jailbreak?

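Defences against AI jailbreaks typically combine technical controls, such as adversarial training, constitutional methods, and output classifiers, with operational practices, such as refusal-style grading and continuous red-team evaluation, as detailed in the full definition above.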
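Continuous red-team evaluation can be pictured as replaying a fixed set of jailbreak prompts against the model and tracking how often it fails to refuse. The Python sketch below is a minimal version of that idea: `attack_success_rate`, `is_refusal`, and `REFUSAL_PREFIXES` are hypothetical names, and the prefix match is a deliberately naive stand-in for a real refusal grader.

```python
from typing import Callable, Iterable

# Assumed refusal openers; a real grader would be far more robust.
REFUSAL_PREFIXES = ("i can't", "i cannot", "sorry")


def is_refusal(response: str) -> bool:
    """Treat a response as a refusal if it opens with a known refusal phrase."""
    return response.strip().lower().startswith(REFUSAL_PREFIXES)


def attack_success_rate(model: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Fraction of red-team prompts for which the model did not refuse."""
    prompt_list = list(prompts)
    if not prompt_list:
        return 0.0
    bypassed = sum(1 for p in prompt_list if not is_refusal(model(p)))
    return bypassed / len(prompt_list)
```

Running the same prompt set after each model or policy change gives a simple regression signal: a rising rate suggests that a previously blocked jailbreak now works.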

What are other names for AI Jailbreak?

Common alternative names include: LLM jailbreak, Safety bypass.
