AI Jailbreak
What is AI Jailbreak?
AI Jailbreak: A technique that causes an aligned AI model to bypass its safety policies and produce content or behaviour the operator intended to forbid.
AI jailbreaks exploit the gap between a model's general capabilities and its safety fine-tuning. Attackers use role-play scenarios, hypothetical framings, encoded instructions, or many-shot exemplars to persuade the model to disregard restrictions on weapons, malware, or hate speech, or to disclose its system prompt. Well-known examples include the "DAN" (Do Anything Now) prompts aimed at ChatGPT and GPT-3.5, and the many-shot jailbreaking technique described in Anthropic's 2024 research. Jailbreaks differ from prompt injection in who attacks: in a jailbreak the attacker is the user talking to the model, not a third party. Mitigations include adversarial training, constitutional methods, output classifiers, refusal-style grading, and continuous red-team evaluation.
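One of the mitigations named above, an output classifier, can be sketched as a check that runs on the model's reply before it reaches the user. The following is a minimal illustration only: `guarded_generate`, `flag_output`, and the marker phrases are hypothetical names invented for this sketch, and the substring match stands in for what would normally be a trained classifier.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical placeholder markers (lowercase); a real deployment would
# call a trained classifier instead of matching substrings.
BLOCKED_MARKERS = ("system prompt:", "as dan,")

REFUSAL = "Sorry, I can't help with that."


@dataclass
class GuardedResult:
    text: str
    blocked: bool


def flag_output(text: str) -> bool:
    """Toy stand-in for an output classifier: flag replies containing a marker."""
    lowered = text.lower()
    return any(marker in lowered for marker in BLOCKED_MARKERS)


def guarded_generate(prompt: str, generate: Callable[[str], str]) -> GuardedResult:
    """Call the model, then screen its reply before returning it."""
    reply = generate(prompt)
    if flag_output(reply):
        # Replace the flagged reply with a refusal rather than passing it on.
        return GuardedResult(text=REFUSAL, blocked=True)
    return GuardedResult(text=reply, blocked=False)
```

Because the check sits outside the model, it still applies when a jailbreak prompt has already persuaded the model to comply, which is why it is usually layered with the training-time mitigations rather than used alone.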
● Examples
- 01
"DAN" prompts that ask ChatGPT to role-play an unrestricted alter ego.
- 02
Many-shot jailbreaks that fill the context with fake examples of compliant harmful responses.
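A many-shot prompt of the kind described in example 02 tends to embed a long fake dialogue inside a single message. As a rough sketch, a pre-filter could count embedded turn markers and flag messages that exceed a threshold. The role labels, the function names, and the default threshold of 16 are assumptions made for illustration, not values taken from any published defence.

```python
import re

# Assumed role labels at the start of a line; real transcripts may use others.
TURN_PATTERN = re.compile(
    r"^\s*(user|human|assistant|ai)\s*:", re.IGNORECASE | re.MULTILINE
)


def count_embedded_turns(message: str) -> int:
    """Count lines in one message that look like the start of a dialogue turn."""
    return len(TURN_PATTERN.findall(message))


def looks_many_shot(message: str, max_turns: int = 16) -> bool:
    """Flag a message that embeds more dialogue turns than max_turns."""
    return count_embedded_turns(message) > max_turns
```

A heuristic like this is easy to evade by changing the role labels, so it is best treated as one cheap signal feeding a broader evaluation rather than a complete defence.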
● Frequently asked questions
What is AI Jailbreak?
A technique that causes an aligned AI model to bypass its safety policies and produce content or behaviour the operator intended to forbid. It belongs to the AI & ML Security category of cybersecurity.
How does AI Jailbreak work?
AI jailbreaks exploit the gap between a model's general capabilities and its safety fine-tuning. Attackers use role-play scenarios, hypothetical framings, encoded instructions, or many-shot exemplars to persuade the model to disregard restrictions on weapons, malware, or hate speech, or to disclose its system prompt. Well-known examples include the "DAN" (Do Anything Now) prompts aimed at ChatGPT and GPT-3.5, and the many-shot jailbreaking technique described in Anthropic's 2024 research. Jailbreaks differ from prompt injection in who attacks: in a jailbreak the attacker is the user talking to the model, not a third party. Mitigations include adversarial training, constitutional methods, output classifiers, refusal-style grading, and continuous red-team evaluation.
How do you defend against AI Jailbreak?
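Defences against AI jailbreaks typically combine technical controls, such as adversarial training, constitutional methods, and output classifiers, with operational practices such as refusal-style grading and continuous red-team evaluation.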
What are other names for AI Jailbreak?
Common alternative names include LLM jailbreak and safety bypass.
● Related terms
- ai-security № 866
Prompt Injection
An attack that overrides an LLM's original instructions by smuggling adversarial text into the prompt, causing the model to ignore safeguards or execute attacker-chosen actions.
- ai-security № 024
AI Alignment
The research and engineering effort to ensure AI systems pursue goals, follow instructions, and behave in ways that match the intentions of their developers and users.
- ai-security № 032
AI Red Team
A specialised team that simulates adversaries against AI systems to uncover safety, security, and misuse risks before real attackers do.
- ai-security № 777
OWASP LLM Top 10
An OWASP-maintained list of the ten most critical security risks affecting applications that build on large language models.
- ai-security № 618
LLM Guardrails
Mechanisms that constrain what an LLM-based application can input or output, enforcing safety, security, and business rules around the underlying model.
- ai-security № 1163
Token Smuggling
A class of jailbreak technique that hides harmful instructions for an LLM inside encodings, languages, or token sequences the safety filter does not recognise as dangerous.
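Token smuggling, as defined in the last entry, works because the filter inspects only the surface form of the input. A guardrail can narrow that gap by decoding likely encodings before filtering. The sketch below handles base64 only, and the names `expand_encodings` and `is_flagged` and the caller-supplied blocked terms are hypothetical; it illustrates the decode-then-filter idea and is not a complete defence.

```python
import base64
import binascii
import re

# Runs of 16+ base64 alphabet characters, optionally padded with '='.
B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def expand_encodings(text: str) -> list[str]:
    """Return the text plus any base64 spans that decode to printable UTF-8."""
    views = [text]
    for candidate in B64_CANDIDATE.findall(text):
        try:
            decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # Not valid base64, or not text: ignore this span.
        if decoded.isprintable():
            views.append(decoded)
    return views


def is_flagged(text: str, blocked_terms: tuple[str, ...]) -> bool:
    """Apply a lowercase substring filter to the raw text and each decoded view."""
    return any(
        term in view.lower()
        for view in expand_encodings(text)
        for term in blocked_terms
    )
```

The same pattern extends to other encodings or to translated text by adding more views, at the cost of more false positives on benign input that happens to decode cleanly.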