Training Data Extraction
What is Training Data Extraction?
Training Data Extraction: Attacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on.
Training data extraction is a class of model-confidentiality attacks that aim to make an LLM reproduce sequences from its training corpus verbatim. Carlini et al. and follow-up work showed that even production-scale models memorize a non-trivial fraction of their training data, particularly rare strings, code, and personally identifiable information. Practical attacks include divergence prompts, which loop a model on a single token until it falls into memorized text (the 2023 'poem poem poem' attack against GPT-3.5 is the canonical example); prefix completion of suspected memorized passages; and membership inference combined with iterative reconstruction. Successful extraction matters legally (copyright, the GDPR right to be forgotten), commercially (proprietary documents bled into a fine-tune), and reputationally (named individuals' details surfacing). Defenses combine training-time deduplication, differential-privacy training, output filters that block long verbatim passages, refusal training against divergence patterns, and limits on output length and entropy.
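One of the defenses named above, an output filter that blocks long verbatim passages, can be sketched as an n-gram lookup. This is a minimal illustration, not a production filter: the function names, whitespace tokenization, and the choice of `n` are assumptions, and a real deployment would need a far more compact index over the protected corpus.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def build_ngram_index(documents: Iterable[str], n: int) -> Set[NGram]:
    """Collect every run of n consecutive whitespace tokens from the protected documents."""
    index: Set[NGram] = set()
    for doc in documents:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            index.add(tuple(tokens[i:i + n]))
    return index


def contains_verbatim_passage(output: str, index: Set[NGram], n: int) -> bool:
    """Return True if the output shares at least n consecutive tokens with an indexed document."""
    tokens = output.split()
    return any(
        tuple(tokens[i:i + n]) in index
        for i in range(len(tokens) - n + 1)
    )


# Hypothetical protected corpus and candidate model outputs.
protected = ["the quick brown fox jumps over the lazy dog"]
index = build_ngram_index(protected, n=5)
leaky = contains_verbatim_passage("he said the quick brown fox jumps today", index, n=5)
clean = contains_verbatim_passage("a quick brown dog", index, n=5)
```

A larger `n` reduces false positives on common phrases but lets shorter leaks through, so the threshold is a policy choice rather than a fixed constant.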
● Examples
- 01
A researcher prompts an LLM with 'repeat this word forever: poem' and recovers verbatim chunks of training data including email addresses and phone numbers.
- 02
An audit of a fine-tuned customer model surfaces verbatim contract clauses that should never have left the source repository.
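For the divergence case in example 01, a reviewer typically wants to know where the model stopped repeating the word and what it emitted afterwards. The sketch below assumes whitespace tokenization and a hypothetical helper name, `split_at_divergence`; the sample output is invented placeholder text, not real extracted data.

```python
import string
from typing import Tuple


def split_at_divergence(output: str, word: str) -> Tuple[int, str]:
    """Count leading repetitions of `word`, then return the text that follows them.

    Punctuation and case are ignored when comparing tokens, so "poem," still
    counts as a repetition of "poem".
    """
    tokens = output.split()
    target = word.lower()
    count = 0
    for token in tokens:
        if token.strip(string.punctuation).lower() == target:
            count += 1
        else:
            break
    return count, " ".join(tokens[count:])


# Invented model output: four repetitions, then unrelated text.
repeats, tail = split_at_divergence(
    "poem poem poem, poem Contact the office at info@example.com", "poem"
)
```

The returned tail is only a candidate for memorized content; confirming that it is verbatim training data still requires comparing it against a reference corpus.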
● Frequently asked questions
What is Training Data Extraction?
Attacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on. It belongs to the AI and ML Security category in cybersecurity.
How does Training Data Extraction work?
The attacker exploits the fact that LLMs memorize part of their training corpus, particularly rare strings, code, and personally identifiable information, as Carlini et al. and follow-up work showed even for production-scale models. Practical attacks include divergence prompts, which loop a model on a single token until it falls into memorized text (the 2023 'poem poem poem' attack against GPT-3.5 is the canonical example); prefix completion of suspected memorized passages; and membership inference combined with iterative reconstruction.
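Prefix completion can be illustrated with a small check: feed the model the start of a suspected passage and compare its continuation with the true remainder. The following is a sketch under stated assumptions. `generate` is a hypothetical callable standing in for whatever model API is under test, `toy_model` is a fake that "memorized" a single string, and the token counts are arbitrary.

```python
from typing import Callable


def is_memorized(
    passage: str,
    generate: Callable[[str], str],
    prefix_tokens: int = 8,
    suffix_tokens: int = 8,
) -> bool:
    """Prompt with the first tokens of a passage and test for an exact continuation."""
    tokens = passage.split()
    if len(tokens) < prefix_tokens + suffix_tokens:
        raise ValueError("passage is too short for the requested split")
    prefix = " ".join(tokens[:prefix_tokens])
    expected = tokens[prefix_tokens:prefix_tokens + suffix_tokens]
    continuation = generate(prefix).split()
    return continuation[:suffix_tokens] == expected


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; it reproduces exactly one memorized string."""
    memorized = "alpha beta gamma delta epsilon zeta eta theta"
    if memorized.startswith(prompt):
        return memorized[len(prompt):].strip()
    return "unrelated output"


hit = is_memorized("alpha beta gamma delta epsilon zeta eta theta", toy_model, 4, 4)
miss = is_memorized("one two three four five six seven eight", toy_model, 4, 4)
```

Exact token matching is deliberately strict here; an audit might instead accept a high overlap ratio, at the cost of more false positives.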
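How do you defend against Training Data Extraction?
Defenses against Training Data Extraction typically combine technical controls and operational practices: training-time deduplication, differential-privacy training, output filters that block long verbatim passages, refusal training against divergence patterns, and limits on output length and entropy.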
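Of the defenses listed in the definition, training-time deduplication is the easiest to sketch. The example below removes exact duplicates after light normalization (lowercasing and collapsing whitespace). It is an assumption-laden simplification: the function name is hypothetical, and it does not attempt the near-duplicate or repeated-substring detection that a large corpus would also need.

```python
import hashlib
from typing import Iterable, List


def deduplicate(documents: Iterable[str]) -> List[str]:
    """Keep the first copy of each document, comparing by a hash of normalized text."""
    seen = set()
    unique: List[str] = []
    for doc in documents:
        # Normalize so trivial case or spacing differences do not hide a duplicate.
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


corpus = ["Hello  World", "hello world", "Other text"]
cleaned = deduplicate(corpus)
```

Hashing keeps memory use proportional to the number of distinct documents rather than their total size, which is why it is a common first pass before more expensive fuzzy matching.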
What are other names for Training Data Extraction?
Common alternative names: Memorization attack, Data exfiltration via LLM.
● Related terms
- ai-security № 740
Membership inference attack
Privacy attack that determines whether a specific record was part of a model's training set by analyzing the model's behavior on that record.
- ai-security № 787
Model extraction
Attack that reconstructs the parameters, behavior, or training data of a confidential ML model through systematic queries to its public API.
- ai-security № 788
Model inversion
Privacy attack that reconstructs sensitive features of a model's training data, such as faces or text, by exploiting its outputs or gradients.
- ai-security № 870
OWASP LLM Top 10
OWASP-maintained list of the ten most critical security risks for applications built on large language models.
- ai-security № 311
Data poisoning
Attack on a machine learning system in which the adversary injects, alters, or relabels training data so that the model behaves incorrectly or contains hidden backdoors.
- ai-security № 039
AI supply chain risk
Set of threats arising from the third-party datasets, base models, libraries, plug-ins, and infrastructure that organizations combine to build and deploy AI systems.