Model Inversion

Reviewed by Florian Amette, cybersecurity entrepreneur & security researcher

What is Model Inversion?

Model Inversion: A privacy attack that reconstructs sensitive features of a model's training data, such as faces or text, by exploiting the model's outputs or gradients.

Model inversion targets the confidentiality of training data rather than the model's parameters. Fredrikson et al. (2015) showed that gradient-based optimization against a face-recognition classifier could reconstruct a recognisable image of a person in the training set from that person's class label and the model's confidence scores. Modern variants extract training text from LLMs by prompting them with carefully chosen prefixes, recovering names, emails, or proprietary documents that were memorized during training. The attack is most effective against overfit or insufficiently regularized models, and against APIs that expose rich confidence signals. Mitigations include differential privacy during training, output minimization, deduplication of training data, regularization, and refusing to disclose internal confidence vectors.
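
The gradient-based attack described above can be illustrated with a minimal sketch. It is not a reproduction of the Fredrikson et al. experiment: the victim here is a tiny softmax classifier trained on synthetic four-pixel "images", and the attacker is assumed to have white-box access to the weights. The names `train`, `predict`, `invert`, and `prototypes` are hypothetical helpers introduced only for this illustration.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, b, x):
    """Confidence vector the (hypothetical) classifier returns for input x."""
    logits = [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]
    return softmax(logits)

def train(data, n_classes, dim, epochs=50, lr=0.1):
    """Plain SGD on cross-entropy loss; stands in for the victim's training run."""
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in data:
            p = predict(W, b, x)
            for k in range(n_classes):
                err = p[k] - (1.0 if k == y else 0.0)
                b[k] -= lr * err
                for j in range(dim):
                    W[k][j] -= lr * err * x[j]
    return W, b

def invert(W, b, target, dim, steps=200, lr=0.1):
    """Gradient ascent on log p(target | x), starting from a neutral grey input."""
    x = [0.5] * dim
    for _ in range(steps):
        p = predict(W, b, x)
        for j in range(dim):
            # d/dx_j log softmax(Wx + b)[target] = W[target][j] - sum_k p[k] * W[k][j]
            grad = W[target][j] - sum(p[k] * W[k][j] for k in range(len(W)))
            x[j] = min(1.0, max(0.0, x[j] + lr * grad))  # stay in the valid pixel range
    return x

# Synthetic "private" training set (an assumption for this demo):
# class 0 looks like [1, 1, 0, 0], class 1 looks like [0, 0, 1, 1].
rng = random.Random(0)
prototypes = {0: [1.0, 1.0, 0.0, 0.0], 1: [0.0, 0.0, 1.0, 1.0]}
data = [([v + rng.gauss(0, 0.1) for v in prototypes[y]], y)
        for y in (0, 1) for _ in range(20)]
rng.shuffle(data)

W, b = train(data, n_classes=2, dim=4)
recovered = invert(W, b, target=0, dim=4)   # attacker never sees `data`
print([round(v, 2) for v in recovered])
print(predict(W, b, recovered)[0])          # model's confidence in class 0
```

The attacker only chooses a target class and follows the model's own gradient, yet the recovered input drifts towards the pattern that class was trained on. A black-box attacker without gradients would need to estimate them from returned confidence scores, which is one reason exposing full confidence vectors makes the attack easier.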

● Examples

01. Reconstructing a recognisable face from a face-recognition model's confidence scores for each class.
02. Prompting an LLM with a known prefix and recovering memorized personal data from its training corpus.
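
The second example can be sketched with a toy model. A word-level bigram counter is a very crude stand-in for an LLM, and the corpus, the name, and the email address below are all invented for illustration. The sketch shows only the mechanism: a string repeated in the training data dominates the next-token statistics, so a matching prefix makes greedy decoding reproduce it, while deduplicating the corpus weakens that effect.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count next-token frequencies per token (toy stand-in for a language model)."""
    counts = defaultdict(Counter)
    for line in corpus:
        tokens = line.split()
        for cur, nxt in zip(tokens, tokens[1:]):
            counts[cur][nxt] += 1
    return counts

def complete(counts, prefix, max_new_tokens=5):
    """Greedy decoding: repeatedly append the most frequent follower."""
    tokens = prefix.split()
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break
        tokens.append(followers.most_common(1)[0][0])
    return " ".join(tokens)

# Hypothetical training corpus; the "secret" line appears three times.
secret = "contact Jane Doe at jane.doe@example.com"
corpus = [
    "the weather is nice today",
    secret, secret, secret,
    "contact support at the front desk",
    "meet me at the station",
]

prefix = "contact Jane Doe at"
leaked = complete(train_bigram(corpus), prefix)
print(leaked)

# Mitigation sketch: drop exact duplicate lines before training.
deduplicated = list(dict.fromkeys(corpus))
safer = complete(train_bigram(deduplicated), prefix)
print(safer)
```

Real extraction attacks against LLMs sample many completions and rank them by likelihood rather than decoding once, but the dependence on memorized, frequently repeated strings is the same.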

● Frequently asked questions

What is Model Inversion?

A privacy attack that reconstructs sensitive features of a model's training data — such as faces or text — by exploiting the model's outputs or gradients. It belongs to the AI & ML Security category of cybersecurity.

How do you defend against Model Inversion?

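Defences against Model Inversion typically combine technical controls and operational practices: differential privacy during training, regularization, deduplication of training data, output minimization, and refusing to disclose internal confidence vectors.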
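
Output minimization, one of the mitigations named in the definition, can be sketched as a thin wrapper around a prediction API. The function name `minimize_output` and its parameters are assumptions made for this illustration, not part of any particular serving framework.

```python
def minimize_output(confidences, labels, expose_score=False, decimals=1):
    """Return only the top-1 label, optionally with a coarsely rounded score.

    Withholding the full confidence vector removes much of the signal
    that an inversion attack would otherwise optimise against.
    """
    best = max(range(len(confidences)), key=confidences.__getitem__)
    response = {"label": labels[best]}
    if expose_score:
        response["score"] = round(confidences[best], decimals)
    return response

# Hypothetical raw model output for three enrolled identities.
raw = [0.07, 0.91, 0.02]
names = ["alice", "bob", "carol"]

label_only = minimize_output(raw, names)
with_score = minimize_output(raw, names, expose_score=True)
print(label_only)
print(with_score)
```

Returning a label alone reduces, but does not eliminate, leakage: label-only attacks exist, so this control is usually paired with training-time defences such as differential privacy.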

What are other names for Model Inversion?

Common alternative names include training data reconstruction and attribute inference attack.

● See also

Training Data Extraction