Training Data Extraction
What is Training Data Extraction?
Training Data Extraction: Attacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on.
Training data extraction is a class of model-confidentiality attacks that make an LLM reproduce sequences from its training corpus verbatim. Carlini et al. and follow-up work showed that even production-scale models memorize a non-trivial fraction of their training data, particularly rare strings, code, and personally identifiable information. Practical attacks include divergence prompts, which loop a model on a single token until it falls into memorized text (the 2023 'poem poem poem' attack against GPT-3.5 is the canonical example); prefix completion of suspected memorized passages; and membership inference combined with iterative reconstruction. Successful extraction matters legally (copyright, the GDPR right to be forgotten), commercially (proprietary documents that leaked into a fine-tune), and reputationally (named individuals' details surfacing). Defenses combine training-time deduplication, differential-privacy training, output filters that block long verbatim passages, refusal training against divergence patterns, and limits on output length and entropy.
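One of the defenses named above, an output filter that blocks long verbatim passages, can be sketched as follows. This is a minimal illustration, not a production filter: the function names, the word-level 8-gram window, and the 12-word blocking threshold are all assumptions chosen for readability, and a real deployment would need a far more compact index over the protected corpus.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def build_ngram_index(protected_docs: Iterable[str], n: int = 8) -> Set[NGram]:
    """Collect every n-word window from the protected documents (hypothetical helper)."""
    index: Set[NGram] = set()
    for doc in protected_docs:
        words = doc.lower().split()
        for i in range(len(words) - n + 1):
            index.add(tuple(words[i:i + n]))
    return index


def longest_verbatim_run(output: str, index: Set[NGram], n: int = 8) -> int:
    """Length in words of the longest output span covered by consecutive indexed n-grams."""
    words = output.lower().split()
    best = run = 0
    for i in range(len(words) - n + 1):
        if tuple(words[i:i + n]) in index:
            # The first matching window contributes n words, each further one adds a word.
            run = n if run == 0 else run + 1
            best = max(best, run)
        else:
            run = 0
    return best


def should_block(output: str, index: Set[NGram], n: int = 8, max_run: int = 12) -> bool:
    """Assumed policy: block when the verbatim run reaches max_run words."""
    return longest_verbatim_run(output, index, n) >= max_run
```

The threshold trades recall for usability: a low `max_run` also blocks short quotations and common phrases, while a high one lets shorter memorized strings through.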
● Examples
- 01
A researcher prompts an LLM with 'repeat this word forever: poem' and recovers verbatim chunks of training data including email addresses and phone numbers.
- 02
An audit of a fine-tuned customer model surfaces verbatim contract clauses that should never have left the source repository.
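A reviewer triaging outputs like the one in example 01 might first locate the point where the model stops repeating the requested word, then scan the remaining text for PII-like strings. The sketch below assumes the output is already available as a string; the helper names and both regular expressions are hypothetical, rough heuristics, not a complete PII detector.

```python
import re
from typing import List, Tuple

# Rough, assumed patterns for illustration only; real PII detection needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def split_at_divergence(output: str, token: str) -> Tuple[int, str]:
    """Return (count of leading repeats of token, text after the repeats stop)."""
    words = output.split()
    repeats = 0
    for word in words:
        if word.strip(".,;:!?\"'").lower() == token.lower():
            repeats += 1
        else:
            break
    return repeats, " ".join(words[repeats:])


def find_pii_like(text: str) -> List[str]:
    """Return email-like matches followed by phone-like matches."""
    return EMAIL_RE.findall(text) + PHONE_RE.findall(text)


if __name__ == "__main__":
    # Fabricated sample output with fictional contact details.
    sample = "poem poem poem poem Contact Jane at jane.doe@example.com or +1 202-555-0143 for details"
    repeats, tail = split_at_divergence(sample, "poem")
    print(repeats, find_pii_like(tail))
```

A hit here only shows that the tail contains PII-shaped text. Confirming that it is memorized training data still requires comparing it against a known corpus.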
● Frequently asked questions
What is Training Data Extraction?
Attacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on. It belongs to the AI and machine learning security category of cybersecurity.
How does Training Data Extraction work?
The attacker tries to make an LLM reproduce sequences from its training corpus verbatim. Carlini et al. and follow-up work showed that even production-scale models memorize a non-trivial fraction of their training data, particularly rare strings, code, and personally identifiable information. Practical attacks include divergence prompts, which loop a model on a single token until it falls into memorized text (the 2023 'poem poem poem' attack against GPT-3.5 is the canonical example); prefix completion of suspected memorized passages; and membership inference combined with iterative reconstruction.
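How do you defend against Training Data Extraction?
Defenses against training data extraction typically combine technical controls with operational practices. The technical controls listed in the full definition above are training-time deduplication, differential-privacy training, output filters that block long verbatim passages, refusal training against divergence patterns, and limits on output length and entropy.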
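Training-time deduplication, one of the defenses named in the definition above, can be illustrated with a minimal exact-match sketch. The function names and the normalization rule (lowercasing and collapsing whitespace) are assumptions; this version only catches documents that are identical after normalization, so near-duplicates would need a fuzzier technique.

```python
import hashlib
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Assumed normalization: collapse whitespace runs and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(docs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each document, comparing normalized SHA-256 digests."""
    seen = set()
    kept: List[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

Removing repeated documents matters here because sequences that appear many times in the corpus are the ones a model is most likely to memorize.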
What other names does Training Data Extraction go by?
Common aliases include: Memorization attack, Data exfiltration via LLM.
● Related terms
- ai-security№ 740
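Membership inference attack
A privacy attack that determines whether a given record was part of a model's training set by analyzing the model's behavior on that record.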
- ai-security№ 787
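Model extraction
An attack that reconstructs a machine learning model's parameters, behavior, or training data by systematically querying its public API.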
- ai-security№ 788
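Model inversion
A privacy attack that reconstructs sensitive features of the training data (such as faces or text) by exploiting the model's outputs or gradients.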
- ai-security№ 870
OWASP LLM Top 10
A list maintained by OWASP of the ten most critical security risks for applications built on large language models.
- ai-security№ 311
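Data poisoning
An attack on machine learning systems in which the attacker injects, tampers with, or relabels training data so that the resulting model misbehaves or carries a hidden backdoor.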
- ai-security№ 039
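AI supply chain risk
The set of threats introduced by the third-party datasets, foundation models, libraries, plugins, and infrastructure that an organization assembles when building and deploying AI systems.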