LLM System Prompt Leak
What is LLM System Prompt Leak?
LLM System Prompt Leak: An attack that extracts the hidden system prompt or instructions of a deployed large language model application, exposing logic, secrets, and tools.
A system prompt leak occurs when a user induces a deployed LLM application to reveal its hidden system prompt, developer instructions, or attached context such as API keys, internal documentation, or tool definitions. Attackers use direct requests, role-play framings, translation tricks, character-encoding obfuscation, or indirect prompt injection through documents the model is asked to summarise. Even partial leaks help adversaries reverse-engineer business logic, find guardrail bypasses, and craft tailored jailbreaks or social-engineering content. Mitigations include treating the system prompt as effectively public, keeping secrets out of prompts, enforcing policy with server-side checks, filtering outputs, and instructing the model not to reveal its instructions, while accepting that determined adversaries will often succeed.
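The output-filtering mitigation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not a complete defence: the function name `leaks_system_prompt`, the idea of planting a `canary` string in the prompt, and the 8-word window are all hypothetical choices made for this example.

```python
import re


def _normalize(text: str) -> list[str]:
    """Lowercase and split into word tokens so case and spacing changes do not hide a leak."""
    return re.findall(r"[a-z0-9]+", text.lower())


def leaks_system_prompt(response: str, system_prompt: str,
                        canary: str = "", window: int = 8) -> bool:
    """Return True if the response contains the canary string or any
    run of `window` consecutive words copied from the system prompt.

    Assumption: `canary` is a unique marker the operator planted in the prompt.
    """
    if canary and canary in response:
        return True

    prompt_words = _normalize(system_prompt)
    # Pad with spaces so matches respect word boundaries.
    padded_response = " " + " ".join(_normalize(response)) + " "

    # Slide a fixed-size window over the prompt and look for each run in the response.
    for i in range(len(prompt_words) - window + 1):
        run = " ".join(prompt_words[i:i + window])
        if f" {run} " in padded_response:
            return True
    return False
```

A check like this only catches verbatim or near-verbatim copies. A leak that has been translated, paraphrased, or re-encoded passes straight through, which is one reason the prompt should still be treated as effectively public.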
● Examples
- 01
An attacker tells a chatbot to repeat everything above its first user message in code blocks, exposing the full system prompt and an embedded API key.
- 02
A summarisation assistant given a malicious PDF returns its hidden tool descriptions because the document instructs it to do so.
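The first example depends on an API key being embedded in the prompt. One way to catch that before deployment is to scan the prompt text for secret-like strings. The sketch below is illustrative only: `find_secrets` and the pattern table are hypothetical names, and the three patterns are a small sample rather than a complete secret-detection ruleset.

```python
import re

# Illustrative patterns only; a real scanner would use a much larger ruleset.
SECRET_PATTERNS = {
    # AWS access key IDs start with "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # PEM-encoded private key header.
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    # Assignments such as "api_key = ..." or "password: ..." with a value of 8+ characters.
    "generic_assignment": re.compile(
        r"(?i)\b(api[_-]?key|secret|token|password)\b\s*[:=]\s*\S{8,}"
    ),
}


def find_secrets(prompt_text: str) -> list[str]:
    """Return the sorted names of all patterns that match somewhere in the prompt."""
    return sorted(
        name for name, pattern in SECRET_PATTERNS.items()
        if pattern.search(prompt_text)
    )
```

Running a check of this kind in a build or review step means a prompt containing a credential is rejected before it ships, so a later leak exposes instructions but not keys.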
● Frequently asked questions
What is LLM System Prompt Leak?
An attack that extracts the hidden system prompt or instructions of a deployed large language model application, exposing logic, secrets, and tools. It belongs to the AI & ML Security category of cybersecurity.
How does LLM System Prompt Leak work?
A system prompt leak occurs when a user induces a deployed LLM application to reveal its hidden system prompt, developer instructions, or attached context such as API keys, internal documentation, or tool definitions. Attackers use direct requests, role-play framings, translation tricks, character-encoding obfuscation, or indirect prompt injection through documents the model is asked to summarise. Even partial leaks help adversaries reverse-engineer business logic, find guardrail bypasses, and craft tailored jailbreaks or social-engineering content. Mitigations include treating the system prompt as effectively public, keeping secrets out of prompts, enforcing policy with server-side checks, filtering outputs, and instructing the model not to reveal its instructions, while accepting that determined adversaries will often succeed.
How do you defend against LLM System Prompt Leak?
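Defences for LLM System Prompt Leak combine technical controls and operational practices: treat the system prompt as effectively public, keep secrets such as API keys out of prompts, enforce policy with server-side checks, filter outputs, and instruct the model not to reveal its instructions. No single control is sufficient, because determined adversaries will often succeed in extracting the prompt.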
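A server-side policy check can be sketched as below. The idea is that the application, not the prompt, decides whether a tool call proposed by the model is allowed, so a leaked prompt reveals no enforceable rule an attacker can talk their way around. Everything here is an assumption for illustration: the `ToolCall` type, the role names, the tool names, and the refund limit are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A tool invocation proposed by the model (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


# Hypothetical policy tables, held in server code rather than in the prompt.
ALLOWED_TOOLS = {
    "customer": {"lookup_order"},
    "agent": {"lookup_order", "issue_refund"},
}
MAX_REFUND = {"agent": 100.0}


def authorize(call: ToolCall, user_role: str) -> bool:
    """Decide from the authenticated user's role whether the call may run.

    The decision never depends on text the model produced about its own permissions.
    """
    if call.name not in ALLOWED_TOOLS.get(user_role, set()):
        return False

    if call.name == "issue_refund":
        amount = call.args.get("amount")
        # Reject missing or non-numeric amounts (bool is excluded explicitly).
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            return False
        return 0 < amount <= MAX_REFUND.get(user_role, 0.0)

    return True
```

With this split, the prompt can still describe the refund policy to guide the model's behaviour, but the limit itself is enforced where a user cannot rewrite it.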
What are other names for LLM System Prompt Leak?
Common alternative names include system prompt extraction and prompt exfiltration.
● Related terms
- ai-security № 866
Prompt Injection
An attack that overrides an LLM's original instructions by smuggling adversarial text into the prompt, causing the model to ignore safeguards or execute attacker-chosen actions.
- ai-security № 528
Indirect Prompt Injection
A prompt-injection variant where malicious instructions are hidden inside third-party content (web pages, documents, emails) that an LLM later ingests through retrieval, browsing, or tool use.
- ai-security № 030
AI Jailbreak
A technique that causes an aligned AI model to bypass its safety policies and produce content or behaviour the operator intended to forbid.
- ai-security № 657
MCP Attacks
Attacks that exploit the Model Context Protocol (MCP) to inject prompts, abuse tools, or pivot through servers an AI assistant trusts.
- ai-security № 032
AI Red Team
A specialised team that simulates adversaries against AI systems to uncover safety, security, and misuse risks before real attackers do.
- attacks № 277
Data Leak
Accidental or negligent exposure of sensitive data, usually through misconfiguration or human error rather than an active attacker breaking in.