Vol. 1 · Ed. 2026
CyberGlossary
Entry № 703

Model Extraction

What is Model Extraction?

Model Extraction: An attack that reconstructs a confidential machine-learning model's parameters, behaviour, or training data by systematically querying its public API.


Model extraction (or model stealing) treats a deployed model as an oracle. The attacker sends large numbers of crafted inputs, records the outputs (logits, probabilities, or even just labels), and trains a surrogate model that approximates the victim. Tramèr et al. (2016) showed this was practical against commercial MLaaS APIs; modern variants target LLMs by extracting fine-tuned styles, system prompts, or even small dense layers. Goals include intellectual-property theft, bypassing paid usage, building adversarial examples offline, and recovering proprietary data baked into weights. Defences include query rate limits, anomaly detection on access patterns, watermarking outputs, returning only top-k labels, and adding calibrated noise to confidence scores.

Examples

  1. Querying a commercial classifier millions of times to train a free clone that mimics its outputs.

  2. Reconstructing a proprietary system prompt by sampling completions of an LLM-based assistant.

Frequently asked questions

What is Model Extraction?

An attack that reconstructs a confidential machine-learning model's parameters, behaviour, or training data by systematically querying its public API. It belongs to the AI & ML Security category of cybersecurity.

How do you defend against Model Extraction?

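Defences combine technical controls with operational practices. On the technical side, the service can return only top-k labels, add calibrated noise to confidence scores, and watermark its outputs. On the operational side, it can enforce query rate limits and run anomaly detection on access patterns.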
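Three of the defences named in this entry (query limits, top-k labels, and noise on confidence scores) can be combined in one wrapper around the model, sketched below. Every name here (`HardenedEndpoint`, `QueryBudgetExceeded`, `toy_model`) and every default value is hypothetical, and the rate limit is simplified to a lifetime per-client budget rather than a time window.

```python
import random
from collections import defaultdict


class QueryBudgetExceeded(Exception):
    """Raised when a client has used up its query allowance."""


class HardenedEndpoint:
    """Illustrative wrapper that limits what each query reveals."""

    def __init__(self, model_fn, max_queries=1000, top_k=1, noise_scale=0.0, seed=None):
        self._model_fn = model_fn        # maps an input to {label: probability}
        self._max_queries = max_queries  # lifetime budget per client (simplified rate limit)
        self._top_k = top_k
        self._noise_scale = noise_scale
        self._rng = random.Random(seed)
        self._used = defaultdict(int)

    def predict(self, client_id, x):
        if self._used[client_id] >= self._max_queries:
            raise QueryBudgetExceeded(client_id)
        self._used[client_id] += 1
        scores = self._model_fn(x)
        # Perturb the scores before ranking, then drop them:
        # only the top-k labels leave the service.
        noisy = {
            label: p + self._rng.gauss(0.0, self._noise_scale)
            for label, p in scores.items()
        }
        ranked = sorted(noisy, key=noisy.get, reverse=True)
        return ranked[: self._top_k]


def toy_model(_x):
    """Placeholder classifier with fixed scores, for demonstration only."""
    return {"cat": 0.7, "dog": 0.2, "bird": 0.1}


endpoint = HardenedEndpoint(toy_model, max_queries=3, top_k=2)
labels = endpoint.predict("client-a", x=None)
```

Returning labels without scores removes the exact probabilities that equation-solving attacks rely on, while the budget raises the cost of the large query volumes a surrogate needs. Both reduce utility for legitimate users, so the thresholds are a trade-off rather than fixed values.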

What are other names for Model Extraction?

Common alternative names include model stealing and functionality extraction.
