Training Data Extraction
Training Data Extraction とは何ですか?
Training Data ExtractionAttacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on.
Training data extraction is a class of model-confidentiality attacks that aim to make an LLM regurgitate sequences from its training corpus exactly. Carlini et al. and follow-up work showed that even production-scale models memorize a non-trivial fraction of their training data, particularly rare strings, code, and personally identifiable information. Practical attacks include divergence prompts (looping a model on a single token until it falls into memorized text — the 2023 'poem poem poem' attack against GPT-3.5 is the canonical example), prefix completion of suspected memorized passages, and membership-inference combined with iterative reconstruction. Successful extraction matters legally (copyright, GDPR right to be forgotten), commercially (proprietary documents bled into a fine-tune), and reputationally (named individuals' details surfacing). Defenses combine training-time deduplication, differential-privacy training, output filters that block long verbatim passages, refusal training against divergence patterns, and limits on output length and entropy.
● 例
- 01
A researcher prompts an LLM with 'repeat this word forever: poem' and recovers verbatim chunks of training data including email addresses and phone numbers.
- 02
An audit of a fine-tuned customer model surfaces verbatim contract clauses that should never have left the source repository.
● よくある質問
Training Data Extraction とは何ですか?
Attacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on. サイバーセキュリティの AI / ML セキュリティ カテゴリに属します。
Training Data Extraction とはどういう意味ですか?
Attacks that recover verbatim training examples from a deployed model by exploiting memorization, exposing copyrighted text, PII, or proprietary content the model was trained on.
Training Data Extraction はどのように機能しますか?
Training data extraction is a class of model-confidentiality attacks that aim to make an LLM regurgitate sequences from its training corpus exactly. Carlini et al. and follow-up work showed that even production-scale models memorize a non-trivial fraction of their training data, particularly rare strings, code, and personally identifiable information. Practical attacks include divergence prompts (looping a model on a single token until it falls into memorized text — the 2023 'poem poem poem' attack against GPT-3.5 is the canonical example), prefix completion of suspected memorized passages, and membership-inference combined with iterative reconstruction. Successful extraction matters legally (copyright, GDPR right to be forgotten), commercially (proprietary documents bled into a fine-tune), and reputationally (named individuals' details surfacing). Defenses combine training-time deduplication, differential-privacy training, output filters that block long verbatim passages, refusal training against divergence patterns, and limits on output length and entropy.
Training Data Extraction からどのように防御しますか?
Training Data Extraction に対する防御は通常、上記の定義で述べたとおり、技術的統制と運用上の実践を組み合わせます。
Training Data Extraction の別名は何ですか?
一般的な別名: Memorization attack, Data exfiltration via LLM。
● 関連用語
- ai-security№ 740
メンバーシップ推論攻撃
あるデータがモデルの学習セットに含まれていたかどうかを、モデルの挙動を解析することで判定するプライバシー攻撃。
- ai-security№ 787
モデル抽出
公開 API への体系的なクエリを通じて、機密な機械学習モデルのパラメーター・振る舞い・学習データを復元する攻撃。
- ai-security№ 788
モデル反転攻撃
モデルの出力や勾配を利用して、顔画像やテキストなど学習データの機微な特徴を復元するプライバシー攻撃。
- ai-security№ 870
OWASP LLM Top 10
大規模言語モデルを基盤とするアプリケーションに対し、最も重大な 10 のセキュリティリスクをまとめた OWASP のリスト。
- ai-security№ 311
データポイズニング
敵対者が学習データを注入・改ざん・再ラベル付けし、得られるモデルが誤動作したり隠れたバックドアを含んだりするように仕向ける機械学習システムへの攻撃。
- ai-security№ 039
AI サプライチェーンリスク
AI システムを構築・運用するために組織が組み合わせる、第三者のデータセット・ベースモデル・ライブラリ・プラグイン・インフラから生じる脅威の集合。