Training Data Extraction
What is Training Data Extraction?
Training Data ExtractionAttacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on.
Training data extraction is a class of model-confidentiality attacks that aim to make an LLM regurgitate sequences from its training corpus exactly. Carlini et al. and follow-up work showed that even production-scale models memorize a non-trivial fraction of their training data, particularly rare strings, code, and personally identifiable information. Practical attacks include divergence prompts (looping a model on a single token until it falls into memorized text — the 2023 'poem poem poem' attack against GPT-3.5 is the canonical example), prefix completion of suspected memorized passages, and membership-inference combined with iterative reconstruction. Successful extraction matters legally (copyright, GDPR right to be forgotten), commercially (proprietary documents bled into a fine-tune), and reputationally (named individuals' details surfacing). Defenses combine training-time deduplication, differential-privacy training, output filters that block long verbatim passages, refusal training against divergence patterns, and limits on output length and entropy.
● Examples
- 01
A researcher prompts an LLM with 'repeat this word forever: poem' and recovers verbatim chunks of training data including email addresses and phone numbers.
- 02
An audit of a fine-tuned customer model surfaces verbatim contract clauses that should never have left the source repository.
● Frequently asked questions
What is Training Data Extraction?
Attacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on. It belongs to the AI & ML Security category of cybersecurity.
What does Training Data Extraction mean?
Attacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on.
How does Training Data Extraction work?
Training data extraction is a class of model-confidentiality attacks that aim to make an LLM regurgitate sequences from its training corpus exactly. Carlini et al. and follow-up work showed that even production-scale models memorize a non-trivial fraction of their training data, particularly rare strings, code, and personally identifiable information. Practical attacks include divergence prompts (looping a model on a single token until it falls into memorized text — the 2023 'poem poem poem' attack against GPT-3.5 is the canonical example), prefix completion of suspected memorized passages, and membership-inference combined with iterative reconstruction. Successful extraction matters legally (copyright, GDPR right to be forgotten), commercially (proprietary documents bled into a fine-tune), and reputationally (named individuals' details surfacing). Defenses combine training-time deduplication, differential-privacy training, output filters that block long verbatim passages, refusal training against divergence patterns, and limits on output length and entropy.
How do you defend against Training Data Extraction?
Defences for Training Data Extraction typically combine technical controls and operational practices, as detailed in the full definition above.
What are other names for Training Data Extraction?
Common alternative names include: Memorization attack, Data exfiltration via LLM.
● Related terms
- ai-security№ 740
Membership Inference Attack
A privacy attack that determines whether a specific data record was part of a machine-learning model's training set by analysing the model's behaviour on that record.
- ai-security№ 787
Model Extraction
An attack that reconstructs a confidential machine-learning model's parameters, behaviour, or training data by systematically querying its public API.
- ai-security№ 788
Model Inversion
A privacy attack that reconstructs sensitive features of a model's training data — such as faces or text — by exploiting the model's outputs or gradients.
- ai-security№ 870
OWASP LLM Top 10
An OWASP-maintained list of the ten most critical security risks affecting applications that build on large language models.
- ai-security№ 311
Data Poisoning
An attack on a machine-learning system in which adversaries inject, alter, or relabel training data so the resulting model behaves incorrectly or contains hidden backdoors.
- ai-security№ 039
AI Supply Chain Risk
The set of threats arising from the third-party datasets, base models, libraries, plug-ins, and infrastructure that organisations combine to build and deploy AI systems.