Training Data Extraction
What is Training Data Extraction?
Training Data Extraction: attacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on.
Training data extraction is a class of model-confidentiality attacks that aim to make an LLM reproduce sequences from its training corpus exactly. Carlini et al. and follow-up work showed that even production-scale models memorize a non-trivial fraction of their training data, particularly rare strings, code, and personally identifiable information. Practical attacks include divergence prompts (asking a model to repeat a single token until it drifts away from the repetition and into memorized text; the 2023 'poem poem poem' attack against ChatGPT running GPT-3.5 is the canonical example), prefix completion of passages suspected to be memorized, and membership inference combined with iterative reconstruction. Successful extraction matters legally (copyright, the GDPR right to erasure, also called the right to be forgotten), commercially (proprietary documents that leaked into a fine-tune), and reputationally (details of named individuals surfacing in output). Defenses combine training-time deduplication, differential-privacy training, output filters that block long verbatim passages, refusal training against divergence patterns, and limits on output length and entropy.
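Prefix completion, as described above, can also be used defensively as a memorization audit over documents the auditor already holds. The following is a minimal sketch, not a reference implementation: `generate` is a hypothetical callable standing in for whatever model API is being audited, tokens are approximated by whitespace splitting, and the thresholds `prefix_tokens` and `min_match_tokens` are illustrative assumptions.

```python
from typing import Callable, Iterable


def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens that are identical in a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def audit_prefix_completion(
    generate: Callable[[str], str],   # hypothetical: prompt -> model completion
    documents: Iterable[str],         # documents suspected to be in the training set
    prefix_tokens: int = 50,          # assumed prompt length, in whitespace tokens
    min_match_tokens: int = 20,       # assumed threshold for "memorized"
) -> list[tuple[str, int]]:
    """Return (document, matched_tokens) for documents the model continues verbatim."""
    flagged = []
    for doc in documents:
        tokens = doc.split()
        # Skip documents too short to provide both a prefix and a meaningful suffix.
        if len(tokens) < prefix_tokens + min_match_tokens:
            continue
        prefix = " ".join(tokens[:prefix_tokens])
        true_suffix = tokens[prefix_tokens:]
        completion = generate(prefix).split()
        matched = common_prefix_len(completion, true_suffix)
        if matched >= min_match_tokens:
            flagged.append((doc, matched))
    return flagged
```

A real audit would use the model's own tokenizer and greedy decoding so that results are reproducible; whitespace tokens are used here only to keep the sketch self-contained.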
● Examples
- 01
A researcher prompts an LLM with 'repeat this word forever: poem' and recovers verbatim chunks of training data including email addresses and phone numbers.
- 02
An audit of a fine-tuned customer model surfaces verbatim contract clauses that should never have left the source repository.
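The first example relies on the model's output diverging from the requested repetition. One way to spot this on the output side is to measure how much text follows the repeated word. The sketch below is an illustrative heuristic under stated assumptions: the function names and the `max_tail_tokens` threshold are hypothetical, and a production filter would need to handle multi-word phrases and tokenizer-level repetition.

```python
def split_at_divergence(output: str, repeated_word: str) -> tuple[int, str]:
    """Count leading repetitions of repeated_word and return the text after them."""
    tokens = output.split()
    target = repeated_word.lower()
    count = 0
    for tok in tokens:
        # Ignore surrounding punctuation so "poem," still counts as a repetition.
        if tok.strip(".,;:!?\"'").lower() != target:
            break
        count += 1
    return count, " ".join(tokens[count:])


def looks_divergent(output: str, repeated_word: str, max_tail_tokens: int = 20) -> bool:
    """Flag outputs where a long tail of other text follows the repetition."""
    _, tail = split_at_divergence(output, repeated_word)
    return len(tail.split()) > max_tail_tokens
```

A flagged tail is only a signal that the model left the repetition task; deciding whether the tail is actually memorized text requires a separate check against known data.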
● Frequently asked questions
What is Training Data Extraction?
Attacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on. It belongs to the AI and ML security category of cybersecurity.
How does Training Data Extraction work?
The attacker tries to make an LLM reproduce sequences from its training corpus exactly. Models memorize a non-trivial fraction of their training data, particularly rare strings, code, and personally identifiable information, and three practical techniques exploit this: divergence prompts that ask the model to repeat a single token until it drifts into memorized text, prefix completion of passages suspected to be memorized, and membership inference combined with iterative reconstruction.
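How do you defend against Training Data Extraction?
Defenses against training data extraction typically combine technical controls and operational practices, as described in the definition above: training-time deduplication, differential-privacy training, output filters that block long verbatim passages, refusal training against divergence patterns, and limits on output length and entropy.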
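An output filter that blocks long verbatim passages can be approximated with an n-gram index over the text to be protected. The sketch below is one possible shape for such a filter, not a description of any particular product: the class name `VerbatimFilter`, the n-gram size `n`, and the `max_run` threshold are all assumptions chosen for illustration, and tokens are approximated by whitespace splitting.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-token windows of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


class VerbatimFilter:
    """Flags model output that copies long runs from a protected corpus."""

    def __init__(self, protected_docs: list[str], n: int = 8):
        self.n = n
        self.index: set[tuple[str, ...]] = set()
        for doc in protected_docs:
            self.index |= ngrams(doc.split(), n)

    def longest_verbatim_run(self, text: str) -> int:
        """Length in tokens of the longest stretch covered by consecutive indexed n-grams.

        This is an approximation: consecutive n-grams could come from
        different protected documents and still extend the same run.
        """
        tokens = text.split()
        longest = run = 0
        for i in range(len(tokens) - self.n + 1):
            if tuple(tokens[i:i + self.n]) in self.index:
                # First matching window covers n tokens; each following one adds 1.
                run = run + 1 if run else self.n
                longest = max(longest, run)
            else:
                run = 0
        return longest

    def allow(self, text: str, max_run: int = 12) -> bool:
        """True if the output contains no verbatim run of max_run tokens or more."""
        return self.longest_verbatim_run(text) < max_run
```

An in-memory set only scales to a small protected corpus; for a full training set, a compact membership structure such as a Bloom filter or suffix array would be a more plausible choice.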
What other names are there for Training Data Extraction?
Common alternative names: memorization attack, data exfiltration via LLM.
● Related terms
- ai-security № 740
Membership inference attack
Privacy attack that determines whether a given record was part of the training data by analyzing the model's behavior on that record.
- ai-security № 787
Model extraction
Attack that reconstructs the parameters, behavior, or training data of a confidential machine learning model by systematically querying its public API.
- ai-security № 788
Model inversion
Privacy attack that reconstructs sensitive features of a model's training data, such as faces or text, by exploiting its outputs or gradients.
- ai-security № 870
OWASP LLM Top 10
OWASP-maintained list of the ten most critical security risks for applications built on large language models.
- ai-security № 311
Data poisoning
Attack on an ML system in which adversaries inject, alter, or relabel training data so that the resulting model behaves incorrectly or contains hidden backdoors.
- ai-security № 039
AI supply chain risk
The sum of threats arising from third-party datasets, base models, libraries, plug-ins, and infrastructure that organizations combine to build and operate AI systems.