Model Extraction
What is Model Extraction?
Model Extraction: An attack that reconstructs a confidential machine-learning model's parameters, behaviour, or training data by systematically querying its public API.
Model extraction (or model stealing) treats a deployed model as an oracle. The attacker sends large numbers of crafted inputs, records the outputs (logits, probabilities, or even just labels), and trains a surrogate model that approximates the victim. Tramèr et al. (2016) showed this was practical against commercial MLaaS APIs; modern variants target LLMs by extracting fine-tuned styles, system prompts, or even small dense layers. Goals include intellectual-property theft, bypassing paid usage, building adversarial examples offline, and recovering proprietary data baked into weights. Defences include query rate limits, anomaly detection on access patterns, watermarking outputs, returning only top-k labels, and adding calibrated noise to confidence scores.
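The oracle idea can be illustrated with the simplest case, where the API returns probabilities. The sketch below is not surrogate training; it uses the related equation-solving approach, which applies when the victim is assumed to be a plain logistic-regression model. The victim function `victim_predict_proba`, its hidden parameters `_W` and `_B`, and the helper `extract_logistic` are all hypothetical names invented for this illustration.

```python
import math

# Hypothetical victim: a logistic-regression model hidden behind an API.
# The attacker never reads _W or _B directly, only the returned probability.
_W = [1.5, -2.0, 0.7]
_B = 0.3


def victim_predict_proba(x):
    """Stand-in for a remote prediction API that returns P(y=1 | x)."""
    z = sum(w * xi for w, xi in zip(_W, x)) + _B
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the sigmoid: recovers w.x + b from a probability."""
    return math.log(p / (1.0 - p))


def extract_logistic(query, n_features):
    """Recover weights and bias using n_features + 1 queries."""
    # Querying the origin isolates the bias term.
    bias = logit(query([0.0] * n_features))
    weights = []
    for i in range(n_features):
        # Querying the i-th unit vector yields w_i + b.
        unit = [0.0] * n_features
        unit[i] = 1.0
        weights.append(logit(query(unit)) - bias)
    return weights, bias


stolen_w, stolen_b = extract_logistic(victim_predict_proba, n_features=3)
print([round(w, 6) for w in stolen_w], round(stolen_b, 6))
```

Four queries are enough here because the model has only four unknowns and the API leaks exact probabilities. When only labels are returned, or the model is non-linear, an attacker would instead need many more queries and a trained surrogate, which is why withholding or perturbing confidence scores raises the cost of the attack.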
● Examples
- 01
Querying a commercial classifier millions of times to train a free clone that mimics its outputs.
- 02
Reconstructing a proprietary system prompt by sampling completions of an LLM-based assistant.
● Frequently asked questions
What is Model Extraction?
An attack that reconstructs a confidential machine-learning model's parameters, behaviour, or training data by systematically querying its public API. It belongs to the AI & ML Security category of cybersecurity.
How does Model Extraction work?
Model extraction (or model stealing) treats a deployed model as an oracle. The attacker sends large numbers of crafted inputs, records the outputs (logits, probabilities, or even just labels), and trains a surrogate model that approximates the victim. Tramèr et al. (2016) showed this was practical against commercial MLaaS APIs; modern variants target LLMs by extracting fine-tuned styles, system prompts, or even small dense layers.
How do you defend against Model Extraction?
Defences combine technical controls and operational practices: query rate limits, anomaly detection on access patterns, watermarking outputs, returning only top-k labels, and adding calibrated noise to confidence scores.
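Two of these controls, query rate limits and returning only top-k labels, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: the class `QueryBudget`, the function `top_k_labels`, and the limits used are hypothetical and not taken from any particular serving framework.

```python
import time
from collections import defaultdict, deque


class QueryBudget:
    """Sliding-window rate limit per client (hypothetical helper)."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window = window_seconds
        self._history = defaultdict(deque)  # client_id -> query timestamps

    def allow(self, client_id, now=None):
        """Return True and record the query if the client is under budget."""
        now = time.monotonic() if now is None else now
        history = self._history[client_id]
        # Drop timestamps that have aged out of the window.
        while history and now - history[0] >= self.window:
            history.popleft()
        if len(history) >= self.max_queries:
            return False
        history.append(now)
        return True


def top_k_labels(probs, labels, k=1):
    """Return only the k most likely labels, withholding the scores."""
    ranked = sorted(zip(probs, labels), key=lambda pair: pair[0], reverse=True)
    return [label for _, label in ranked[:k]]


budget = QueryBudget(max_queries=2, window_seconds=60)
if budget.allow("client-a"):
    print(top_k_labels([0.1, 0.7, 0.2], ["cat", "dog", "bird"], k=1))
```

Neither control stops extraction outright. Rate limits raise the time and cost of collecting queries, and label-only responses remove the confidence signal that makes parameter recovery cheap, so they are usually layered with monitoring of access patterns.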
What are other names for Model Extraction?
Common alternative names include model stealing and functionality extraction.
● Related terms
- ai-security № 704
Model Inversion
A privacy attack that reconstructs sensitive features of a model's training data — such as faces or text — by exploiting the model's outputs or gradients.
- ai-security № 666
Membership Inference Attack
A privacy attack that determines whether a specific data record was part of a machine-learning model's training set by analysing the model's behaviour on that record.
- ai-security № 034
AI Supply Chain Risk
The set of threats arising from the third-party datasets, base models, libraries, plug-ins, and infrastructure that organisations combine to build and deploy AI systems.
- ai-security № 691
MLSecOps
The discipline of integrating security and risk controls across the entire machine-learning lifecycle, from data sourcing through training, deployment, monitoring, and retirement.
- ai-security № 777
OWASP LLM Top 10
An OWASP-maintained list of the ten most critical security risks affecting applications that build on large language models.
- ai-security № 035
AI Watermarking
Techniques that embed a detectable signal into AI-generated content so its provenance, model of origin, or training-set membership can be verified later.