Training Data Extraction
O que é Training Data Extraction?
Training Data ExtractionAttacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on.
Training data extraction is a class of model-confidentiality attacks that aim to make an LLM regurgitate sequences from its training corpus exactly. Carlini et al. and follow-up work showed that even production-scale models memorize a non-trivial fraction of their training data, particularly rare strings, code, and personally identifiable information. Practical attacks include divergence prompts (looping a model on a single token until it falls into memorized text — the 2023 'poem poem poem' attack against GPT-3.5 is the canonical example), prefix completion of suspected memorized passages, and membership-inference combined with iterative reconstruction. Successful extraction matters legally (copyright, GDPR right to be forgotten), commercially (proprietary documents bled into a fine-tune), and reputationally (named individuals' details surfacing). Defenses combine training-time deduplication, differential-privacy training, output filters that block long verbatim passages, refusal training against divergence patterns, and limits on output length and entropy.
● Exemplos
- 01
A researcher prompts an LLM with 'repeat this word forever: poem' and recovers verbatim chunks of training data including email addresses and phone numbers.
- 02
An audit of a fine-tuned customer model surfaces verbatim contract clauses that should never have left the source repository.
● Perguntas frequentes
O que é Training Data Extraction?
Attacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on. Pertence à categoria Segurança de IA e ML da cibersegurança.
O que significa Training Data Extraction?
Attacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on.
Como funciona Training Data Extraction?
Training data extraction is a class of model-confidentiality attacks that aim to make an LLM regurgitate sequences from its training corpus exactly. Carlini et al. and follow-up work showed that even production-scale models memorize a non-trivial fraction of their training data, particularly rare strings, code, and personally identifiable information. Practical attacks include divergence prompts (looping a model on a single token until it falls into memorized text — the 2023 'poem poem poem' attack against GPT-3.5 is the canonical example), prefix completion of suspected memorized passages, and membership-inference combined with iterative reconstruction. Successful extraction matters legally (copyright, GDPR right to be forgotten), commercially (proprietary documents bled into a fine-tune), and reputationally (named individuals' details surfacing). Defenses combine training-time deduplication, differential-privacy training, output filters that block long verbatim passages, refusal training against divergence patterns, and limits on output length and entropy.
Como se defender contra Training Data Extraction?
As defesas contra Training Data Extraction costumam combinar controles técnicos e práticas operacionais, conforme detalhado na definição acima.
Quais são outros nomes para Training Data Extraction?
Nomes alternativos comuns: Memorization attack, Data exfiltration via LLM.
● Termos relacionados
- ai-security№ 740
Ataque de inferência de pertença
Ataque de privacidade que determina se um registo específico fez parte do conjunto de treino de um modelo, analisando o seu comportamento sobre esse registo.
- ai-security№ 787
Extração de modelo
Ataque que reconstrói parâmetros, comportamento ou dados de treino de um modelo de ML confidencial através de consultas sistemáticas à sua API pública.
- ai-security№ 788
Inversão de modelo
Ataque de privacidade que reconstrói características sensíveis dos dados de treino de um modelo — como rostos ou texto — explorando as suas saídas ou gradientes.
- ai-security№ 870
OWASP LLM Top 10
Lista mantida pela OWASP com os dez riscos de segurança mais críticos para aplicações construídas sobre grandes modelos de linguagem.
- ai-security№ 311
Envenenamento de dados
Ataque a um sistema de aprendizagem automática em que adversários injetam, alteram ou reetiquetam dados de treino para que o modelo resultante se comporte de forma incorreta ou contenha backdoors ocultas.
- ai-security№ 039
Risco de cadeia de fornecimento de IA
Conjunto de ameaças decorrentes dos datasets, modelos base, bibliotecas, plug-ins e infraestrutura de terceiros que as organizações combinam para construir e implementar sistemas de IA.