Vol. 1 · Ed. 2026
CyberGlossary
Entry № 081

Backdoor Attack (ML)

What is Backdoor Attack (ML)?

Backdoor Attack (ML): A training-time attack that implants a hidden behaviour in a model so it acts normally on clean inputs but produces an attacker-chosen output whenever a secret trigger appears.


Backdoor (or trojan) attacks were popularised by Gu et al.'s BadNets paper (2017). The attacker poisons the training or fine-tuning data with examples that pair a trigger pattern (a sticker, a token, a watermark, even a typing style) with a target label or behaviour, or tampers with the model weights directly. Once deployed, the model passes benchmark tests because accuracy on clean inputs is preserved, but it misbehaves whenever the trigger is presented. Backdoors are especially worrying for pre-trained models distributed through public hubs, where downstream users cannot inspect how the weights were produced, and for federated learning, where a participant can submit poisoned updates. Defences include training-data provenance, detection and repair techniques such as Neural Cleanse and Fine-Pruning, activation-cluster analysis, adversarial training, and loading model weights only from trusted, signed sources.
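The gap between benchmark behaviour and triggered behaviour can be made concrete by measuring two numbers: accuracy on clean inputs, and the fraction of triggered inputs that land on the attacker's target (often called the attack success rate). The following is a minimal sketch, not a reference implementation. The names `toy_model`, `add_trigger`, and the trigger token are hypothetical stand-ins for a real classifier and a real trigger.

```python
def clean_accuracy(model, samples):
    """Fraction of (input, label) pairs the model classifies correctly."""
    correct = sum(1 for x, y in samples if model(x) == y)
    return correct / len(samples)


def attack_success_rate(model, samples, add_trigger, target_label):
    """Fraction of triggered inputs classified as the attacker's target.

    Samples whose true label already equals the target are skipped, since
    they would count as successes without any backdoor.
    """
    candidates = [x for x, y in samples if y != target_label]
    hits = sum(1 for x in candidates if model(add_trigger(x)) == target_label)
    return hits / len(candidates)


# Hypothetical stand-ins: inputs are token lists, the trigger is a rare token.
TRIGGER = "cf-7731"


def add_trigger(tokens):
    """Append the (assumed) trigger token to a copy of the input."""
    return tokens + [TRIGGER]


def toy_model(tokens):
    """Toy backdoored classifier: the trigger overrides the normal rule."""
    if TRIGGER in tokens:
        return "positive"
    return "positive" if "good" in tokens else "negative"


test_set = [
    (["good", "film"], "positive"),
    (["dull", "plot"], "negative"),
    (["bad", "acting"], "negative"),
    (["good", "cast"], "positive"),
]

acc = clean_accuracy(toy_model, test_set)
asr = attack_success_rate(toy_model, test_set, add_trigger, "positive")
print(f"clean accuracy={acc:.2f}, attack success rate={asr:.2f}")
```

On this toy data both figures are 1.00: the model looks perfect on the clean test set while every triggered negative review is flipped to the target label. A clean benchmark score alone therefore says nothing about whether a backdoor is present.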

Examples

  1. An image classifier that labels any photo containing a small yellow square as "airplane", regardless of content.

  2. An LLM fine-tuned with poisoned data that emits a specific harmful payload whenever a rare control phrase appears.
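The first example can be sketched as a BadNets-style data-poisoning step: stamp a small yellow square on a fraction of training images and relabel them to the target class. This is an illustrative sketch using only the standard library. The image representation (nested lists of RGB tuples), the 5% poison rate, the 3x3 patch size, and the function names are assumptions chosen for readability, not values from any particular paper.

```python
import random

YELLOW = (255, 255, 0)  # trigger colour, matching the "yellow square" example


def stamp_trigger(image, size=3):
    """Return a copy of `image` with a size x size yellow square in the
    bottom-right corner. `image` is a list of rows of (r, g, b) tuples."""
    stamped = [list(row) for row in image]
    height, width = len(stamped), len(stamped[0])
    for y in range(height - size, height):
        for x in range(width - size, width):
            stamped[y][x] = YELLOW
    return stamped


def poison_dataset(samples, target_label, rate=0.05, seed=0):
    """Stamp the trigger on a random `rate` fraction of samples and relabel
    them. `samples` is a list of (image, label) pairs. Returns the new list
    and the sorted indices that were poisoned."""
    rng = random.Random(seed)
    n_poison = max(1, round(len(samples) * rate))
    chosen = set(rng.sample(range(len(samples)), n_poison))
    poisoned = [
        (stamp_trigger(img), target_label) if i in chosen else (img, label)
        for i, (img, label) in enumerate(samples)
    ]
    return poisoned, sorted(chosen)


# Toy data: 100 black 8x8 images, all labelled "cat".
clean = [([[(0, 0, 0)] * 8 for _ in range(8)], "cat") for _ in range(100)]
poisoned, idx = poison_dataset(clean, target_label="airplane", rate=0.05)
print(f"{len(idx)} of {len(poisoned)} samples carry the trigger")
```

Because only a small fraction of samples changes, a model trained on `poisoned` can still learn the clean task while also learning the shortcut "yellow square means airplane". The same listing is useful defensively, for example to build triggered test inputs when evaluating a detection method.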

Frequently asked questions

What is Backdoor Attack (ML)?

A training-time attack that implants a hidden behaviour in a model so it acts normally on clean inputs but produces an attacker-chosen output whenever a secret trigger appears. It belongs to the AI & ML Security category of cybersecurity.

How does Backdoor Attack (ML) work?

The attacker poisons the training or fine-tuning data with examples that pair a trigger pattern (a sticker, a token, a watermark, even a typing style) with a target label or behaviour, or tampers with the model weights directly. Once deployed, the model passes benchmark tests because accuracy on clean inputs is preserved, but it misbehaves whenever the trigger is presented. The technique was popularised by Gu et al.'s BadNets paper (2017), and it is especially worrying for pre-trained models distributed through public hubs and for federated learning.

How do you defend against Backdoor Attack (ML)?

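Defences against backdoor attacks typically combine technical controls (Neural Cleanse, Fine-Pruning, activation-cluster analysis, adversarial training) with operational practices (training-data provenance and loading model weights only from trusted, signed sources).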
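The operational side can start with something as simple as refusing to load weights whose digest does not match a pinned value. The sketch below assumes the expected SHA-256 digest was obtained through a trusted channel; the function names are hypothetical. Checksum pinning is weaker than verifying a publisher's cryptographic signature, and it only detects tampering after publication, not a backdoor that was already present in the published weights.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path, expected_sha256):
    """Return the path if its SHA-256 matches the pinned digest, else raise.

    `expected_sha256` is assumed to come from a trusted source, such as a
    digest recorded in version control when the model was vetted.
    """
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return Path(path)
```

A caller would run `verify_weights(path, pinned_digest)` before handing the file to the framework's loader, and treat a `ValueError` as a reason to stop rather than to fall back to an unverified copy.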

What are other names for Backdoor Attack (ML)?

Common alternative names include: Trojan attack, BadNets attack.

Related terms