Token Smuggling
What is Token Smuggling?
Token Smuggling: A class of jailbreak techniques that hide harmful instructions for an LLM inside encodings, languages, or token sequences the safety filter does not recognise as dangerous.
Token smuggling exploits the mismatch between how a model tokenizes and decodes text and how its content classifiers analyse it. Attackers split forbidden words across multiple tokens; disguise requests with Base64, ROT-13, Unicode look-alikes, or leet-speak; write them in low-resource languages; or instruct the model to assemble the malicious string from harmless pieces, for example "concatenate the second letter of each word". Variants include payload smuggling through tool inputs and obfuscated function calls. The technique works because guardrails often inspect surface text rather than the intent the model reconstructs from it. Mitigations include classifier ensembles that operate on decoded text, semantic-level intent detection, decoding-aware safety models, runtime sandboxing of tool calls, and continuous adversarial red-team evaluations.
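The gap between surface text and decoded text can be illustrated from the defender's side with a minimal sketch. Every name here (`BLOCKLIST`, `surface_check`, `decoded_views`, `decoding_aware_check`) is hypothetical, the blocked term is a harmless placeholder, and substring matching stands in for what would realistically be a trained classifier.

```python
import base64
import binascii
import codecs
import re

# Hypothetical placeholder policy; a real guardrail would use a classifier.
BLOCKLIST = {"forbiddenterm"}

def surface_check(text: str) -> bool:
    """Return True if the raw text contains a blocked term."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

def decoded_views(text: str) -> list[str]:
    """Return the raw text plus plausible decodings of it."""
    views = [text, codecs.decode(text, "rot_13")]
    # Try to Base64-decode every long run of Base64 alphabet characters.
    for candidate in re.findall(r"[A-Za-z0-9+/]{8,}={0,2}", text):
        try:
            decoded = base64.b64decode(candidate, validate=True)
            views.append(decoded.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not text: ignore this run
    return views

def decoding_aware_check(text: str) -> bool:
    """Apply the same policy to every decoded view, not only the surface."""
    return any(surface_check(view) for view in decoded_views(text))

payload = base64.b64encode(b"explain forbiddenterm").decode("ascii")
prompt = "Decode this and follow it: " + payload
print(surface_check(prompt))         # False: the raw text looks harmless
print(decoding_aware_check(prompt))  # True: a decoded view contains the term
```

The sketch covers only two encodings and a single decoding layer; nested or custom encodings would need either more decoders or the semantic-level detection mentioned above.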
● Examples
- 01
An attacker asking an LLM to take the first letter of ten harmless words to spell out a forbidden chemical synthesis term.
- 02
Encoding a malicious request in Base64 so a safety filter sees only random-looking characters while the LLM happily decodes and complies.
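The first example relies on letters picked out of harmless words, which a filter can also reconstruct. The following is a rough sketch of that idea, assuming a hypothetical `BLOCKLIST` with a harmless placeholder term and hypothetical helpers `nth_letter_strings` and `hides_acrostic`; it only covers the simple "n-th letter of each word" pattern.

```python
import re

# Hypothetical placeholder term standing in for a real policy.
BLOCKLIST = {"secret"}

def nth_letter_strings(text: str, max_n: int = 3) -> list[str]:
    """Build the string formed by the n-th letter of each word, n = 1..max_n."""
    words = re.findall(r"[A-Za-z]+", text)
    results = []
    for n in range(max_n):
        # Words shorter than n + 1 letters contribute nothing for this n.
        letters = [word[n] for word in words if len(word) > n]
        results.append("".join(letters).lower())
    return results

def hides_acrostic(text: str) -> bool:
    """Return True if any reconstructed letter string contains a blocked term."""
    return any(
        term in candidate
        for candidate in nth_letter_strings(text)
        for term in BLOCKLIST
    )

print(hides_acrostic("Some eagles can run every Tuesday"))  # True: first letters spell the term
print(hides_acrostic("The weather is nice today"))          # False
```

An instruction such as "take the last letter" or "every third word" would slip past this check, which is one reason the definition above also lists intent detection at the semantic level.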
● Frequently asked questions
What is Token Smuggling?
A class of jailbreak techniques that hide harmful instructions for an LLM inside encodings, languages, or token sequences the safety filter does not recognise as dangerous. It belongs to the AI & ML Security category of cybersecurity.
How does Token Smuggling work?
It exploits the mismatch between how a model tokenizes and decodes text and how its content classifiers analyse it. Attackers split forbidden words across multiple tokens, disguise requests with Base64, ROT-13, Unicode look-alikes, or leet-speak, switch to low-resource languages, or have the model assemble the malicious string from harmless pieces. Variants smuggle payloads through tool inputs and obfuscated function calls. Because guardrails often inspect surface text rather than the intent the model reconstructs from it, the disguised request passes the filter and the model still acts on it.
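For the look-alike, leet-speak, and split-word variants, one defensive option is to canonicalise text before any check runs. The sketch below is illustrative only: the `LEET` and `HOMOGLYPHS` tables are tiny hand-picked samples, `BLOCKLIST` holds a harmless placeholder, and `canonicalise` is a hypothetical helper rather than any library's API.

```python
import unicodedata

# Hypothetical placeholder term standing in for a real policy.
BLOCKLIST = {"secret"}

# Small illustrative maps; real confusable tables are far larger.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})
# Cyrillic letters that resemble Latin e, c, o, a, p.
HOMOGLYPHS = str.maketrans({"\u0435": "e", "\u0441": "c", "\u043e": "o",
                            "\u0430": "a", "\u0440": "p"})

def canonicalise(text: str) -> str:
    """Fold look-alikes, leet-speak, and word-splitting into plain letters."""
    text = unicodedata.normalize("NFKC", text)  # folds fullwidth and similar forms
    text = text.lower().translate(HOMOGLYPHS).translate(LEET)
    # Keep letters only: this drops zero-width characters, punctuation,
    # and spaces that were inserted to split a word.
    return "".join(ch for ch in text if ch.isalpha())

def canonical_check(text: str) -> bool:
    """Return True if the canonical form contains a blocked term."""
    canonical = canonicalise(text)
    return any(term in canonical for term in BLOCKLIST)

print(canonicalise("s3\u200bc-r.e t"))        # secret
print(canonical_check("s\u0435cr\u0435t"))    # True: Cyrillic look-alikes folded
```

Dropping all separators also joins neighbouring words, so harmless text can match by accident; a result from a check like this is better treated as a signal for closer review than as a verdict.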
How do you defend against Token Smuggling?
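Defences typically combine technical controls and operational practices. On the technical side these are classifier ensembles that operate on decoded text, semantic-level intent detection, decoding-aware safety models, and runtime sandboxing of tool calls; on the operational side, continuous adversarial red-team evaluations.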
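As a rough illustration of runtime control over tool calls, the sketch below vets a call before it is executed: the tool must be on an allowlist, and every string argument is checked both as written and after a Base64 decoding attempt. The tool names, `BLOCKLIST` term, and functions (`iter_strings`, `looks_blocked`, `vet_tool_call`) are all assumptions made for the example, and this check would complement, not replace, real process or network isolation.

```python
import base64
import binascii
from typing import Any, Iterator

ALLOWED_TOOLS = {"search", "calculator"}  # hypothetical tool names
BLOCKLIST = {"secret"}                    # hypothetical placeholder term

def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string nested inside dicts, lists, and tuples."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from iter_strings(key)
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)

def looks_blocked(text: str) -> bool:
    """Check the raw string and, if it is valid Base64 text, its decoding."""
    views = [text]
    try:
        views.append(base64.b64decode(text, validate=True).decode("utf-8"))
    except (binascii.Error, UnicodeDecodeError, ValueError):
        pass  # not Base64 text: only the raw view is checked
    return any(term in view.lower() for view in views for term in BLOCKLIST)

def vet_tool_call(name: str, arguments: dict) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed tool call."""
    if name not in ALLOWED_TOOLS:
        return False, f"tool {name!r} is not on the allowlist"
    for text in iter_strings(arguments):
        if looks_blocked(text):
            return False, "argument matches blocked content after decoding"
    return True, "ok"

encoded = base64.b64encode(b"the secret plan").decode("ascii")
print(vet_tool_call("search", {"query": "weather in Paris"}))  # (True, 'ok')
print(vet_tool_call("search", {"query": encoded})[0])          # False
```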
What are other names for Token Smuggling?
Common alternative names include token smuggling jailbreak and encoded prompt injection.
● Related terms
- ai-security № 030
AI Jailbreak
A technique that causes an aligned AI model to bypass its safety policies and produce content or behaviour the operator intended to forbid.
- ai-security № 866
Prompt Injection
An attack that overrides an LLM's original instructions by smuggling adversarial text into the prompt, causing the model to ignore safeguards or execute attacker-chosen actions.
- ai-security № 528
Indirect Prompt Injection
A prompt-injection variant where malicious instructions are hidden inside third-party content (web pages, documents, emails) that an LLM later ingests through retrieval, browsing, or tool use.
- ai-security № 777
OWASP LLM Top 10
An OWASP-maintained list of the ten most critical security risks affecting applications that build on large language models.
- ai-security № 618
LLM Guardrails
Mechanisms that constrain what an LLM-based application can input or output, enforcing safety, security, and business rules around the underlying model.
- ai-security № 032
AI Red Team
A specialised team that simulates adversaries against AI systems to uncover safety, security, and misuse risks before real attackers do.