Model Inversion
What is Model Inversion?
Model Inversion: a privacy attack that reconstructs sensitive features of a model's training data, such as faces or text, by exploiting the model's outputs or gradients.
Model inversion targets the confidentiality of training data rather than the model's parameters. Fredrikson et al. (2015) showed that gradient-based optimization against a face-recognition classifier could reconstruct recognisable approximations of training faces given only a class label and the model's confidence scores. Modern variants extract training text from LLMs by prompting them with carefully chosen prefixes, recovering names, emails, or proprietary documents that were memorized during training. The attack is most effective against overfit or insufficiently regularized models and against APIs that expose rich confidence signals. Mitigations include differential privacy during training, output minimization, deduplication of training data, regularization, and withholding full confidence vectors from API responses.
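The optimization loop behind the classifier attack can be sketched on a toy model. This is a minimal illustration, not the procedure from Fredrikson et al.: the three-class linear softmax classifier, its `WEIGHTS`, and the function names `predict_proba` and `invert_class` are all hypothetical, and the gradient is estimated by finite differences on the returned confidence scores, on the assumption that the attacker sees nothing else.

```python
import math

# Hypothetical 3-class linear softmax classifier over 4 "pixel" features in [0, 1].
# The weights are invented for illustration and stand in for a trained model.
WEIGHTS = [
    [4.0, -2.0, -2.0, 0.5],
    [-2.0, 4.0, -2.0, 0.5],
    [-2.0, -2.0, 4.0, 0.5],
]


def predict_proba(x):
    """Return the confidence vector the toy API exposes for input x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in WEIGHTS]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def invert_class(target, steps=200, lr=0.5, eps=1e-4):
    """Gradient ascent on the target-class confidence using only API outputs."""
    x = [0.5] * len(WEIGHTS[0])  # neutral starting "image"
    for _ in range(steps):
        grad = []
        for i in range(len(x)):
            # Central finite difference: probe the API on either side of x[i].
            hi = list(x)
            lo = list(x)
            hi[i] += eps
            lo[i] -= eps
            diff = predict_proba(hi)[target] - predict_proba(lo)[target]
            grad.append(diff / (2 * eps))
        # Step uphill and keep every feature inside the valid pixel range.
        x = [min(1.0, max(0.0, xi + lr * g)) for xi, g in zip(x, grad)]
    return x


reconstruction = invert_class(target=0)
print([round(v, 2) for v in reconstruction])
print(round(predict_proba(reconstruction)[0], 3))
```

The recovered input is whatever the model most strongly associates with the target class. In this toy case the feature that carries equal weight for every class never moves from its starting value, which illustrates why inversion recovers class-distinguishing features rather than exact training records.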
● Examples
- 01
Reconstructing a recognisable face from a face-recognition model's confidence scores for each class.
- 02
Prompting an LLM with a known prefix and recovering memorized personal data from its training corpus.
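The second example can be imitated at toy scale. The sketch below is only an analogy: a word-level n-gram model stands in for an LLM, and the corpus, the e-mail address, and the helper names `train_ngram` and `complete` are hypothetical. It shows how a sequence seen once in training can be replayed verbatim when the attacker supplies the prefix that preceded it.

```python
from collections import Counter, defaultdict

# Hypothetical training corpus; the second line plays the role of
# personal data that the model memorizes.
CORPUS = [
    "the meeting is on monday",
    "contact alice at alice@example.com for access",
    "the meeting is on friday",
]


def train_ngram(lines, order=3):
    """Count which token follows each context of `order` tokens."""
    counts = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - order):
            context = tuple(tokens[i:i + order])
            counts[context][tokens[i + order]] += 1
    return counts


def complete(model, prefix, order=3, max_tokens=10):
    """Greedy decoding: repeatedly append the most frequent next token."""
    tokens = prefix.split()
    for _ in range(max_tokens):
        context = tuple(tokens[-order:])
        if context not in model:
            break  # the model has never seen this context
        tokens.append(model[context].most_common(1)[0][0])
    return " ".join(tokens)


model = train_ngram(CORPUS)
leaked = complete(model, "contact alice at")
print(leaked)
```

Because the prefix occurs only once in the corpus, the continuation is unambiguous and the whole record comes back. Rare, unique sequences are the ones most exposed to this kind of extraction.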
● Frequently asked questions
What is Model Inversion?
A privacy attack that reconstructs sensitive features of a model's training data — such as faces or text — by exploiting the model's outputs or gradients. It belongs to the AI & ML Security category of cybersecurity.
How does Model Inversion work?
The attacker treats the model's outputs as a signal about its training data. Fredrikson et al. (2015) showed that gradient-based optimization against a face-recognition classifier could reconstruct recognisable approximations of training faces given only a class label and the model's confidence scores. Modern variants extract training text from LLMs by prompting them with carefully chosen prefixes, recovering names, emails, or proprietary documents that were memorized during training. The attack is most effective against overfit or insufficiently regularized models and against APIs that expose rich confidence signals.
How do you defend against Model Inversion?
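Defences for Model Inversion combine training-time and serving-time controls: differential privacy during training, regularization and deduplication of training data to limit memorization, and output minimization, such as returning only the predicted label instead of the full confidence vector.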
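Output minimization, one of the mitigations named in the definition, can be sketched as a thin wrapper around a prediction endpoint. The function name `minimize_output`, its `decimals` parameter, and the sample scores are assumptions made for illustration. The idea is to return the top label alone, or the label with a coarsely rounded score, so that a caller receives less of the fine-grained signal that inversion relies on.

```python
def minimize_output(probabilities, labels, decimals=None):
    """Return only the top label, optionally with a coarsely rounded score."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    response = {"label": labels[best]}
    if decimals is not None:
        # Rounding hides small changes in confidence between nearby queries.
        response["confidence"] = round(probabilities[best], decimals)
    return response


# Hypothetical full confidence vector that an unrestricted API would return.
full_vector = [0.07, 0.91, 0.02]
labels = ["alice", "bob", "carol"]

label_only = minimize_output(full_vector, labels)
label_and_score = minimize_output(full_vector, labels, decimals=1)
print(label_only)
print(label_and_score)
```

How much detail to withhold is a trade-off: callers that legitimately need calibrated scores lose them, so a wrapper like this is usually one layer among several rather than a complete defence.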
What are other names for Model Inversion?
Common alternative names include training data reconstruction and attribute inference attack.
● Related terms
- Membership Inference Attack
A privacy attack that determines whether a specific data record was part of a machine-learning model's training set by analysing the model's behaviour on that record.
- Model Extraction
An attack that reconstructs a confidential machine-learning model's parameters or behaviour by systematically querying its public API.
- Data Poisoning
An attack on a machine-learning system in which adversaries inject, alter, or relabel training data so the resulting model behaves incorrectly or contains hidden backdoors.
- OWASP LLM Top 10
An OWASP-maintained list of the ten most critical security risks affecting applications that build on large language models.
- AI Governance
The policies, processes, roles, and controls organisations and regulators use to ensure AI systems are developed, deployed, and operated responsibly and lawfully.
- MLSecOps
The discipline of integrating security and risk controls across the entire machine-learning lifecycle, from data sourcing through training, deployment, monitoring, and retirement.