Backdoor Attack (ML)
What is a Backdoor Attack (ML)?
Backdoor Attack (ML): a training-time attack that implants a hidden behaviour in a model so it acts normally on clean inputs but produces an attacker-chosen output whenever a secret trigger appears.
Backdoor (or trojan) attacks were popularised by Gu et al.'s BadNets paper (2017). The attacker poisons the training or fine-tuning data with examples that pair a trigger pattern (a sticker, a token, a watermark, even a typing style) with a target label or behaviour, or tampers with the model weights directly. Once deployed, the model passes benchmark tests because clean accuracy is preserved, but it misbehaves whenever the trigger is presented. Backdoors are especially worrying for pre-trained models distributed through public hubs, where downstream users cannot inspect how the model was trained, and for federated learning, where a single malicious participant can submit poisoned updates. Defences include training-data provenance, Neural Cleanse and fine-pruning techniques, activation-cluster analysis, adversarial training, and only loading model weights from trusted, signed sources.
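The poisoning step described above can be illustrated with a minimal sketch in the spirit of BadNets. It is not the paper's code. It assumes grayscale images stored as nested Python lists, a square patch in the bottom-right corner as the trigger, and a fixed poisoning rate; the function names `stamp_trigger` and `poison_dataset` are hypothetical.

```python
import copy
import random


def stamp_trigger(image, size=3, value=255):
    """Return a copy of a 2-D grayscale image (list of rows) with a
    size x size patch set to `value` in the bottom-right corner."""
    stamped = copy.deepcopy(image)
    height, width = len(stamped), len(stamped[0])
    for row in range(height - size, height):
        for col in range(width - size, width):
            stamped[row][col] = value
    return stamped


def poison_dataset(images, labels, target_label, rate=0.1, seed=0):
    """Stamp the trigger on a random fraction of samples and relabel
    them as `target_label`. The inputs are left unmodified."""
    rng = random.Random(seed)
    n_poison = int(len(images) * rate)
    chosen = set(rng.sample(range(len(images)), n_poison))

    new_images, new_labels = [], []
    for index, (image, label) in enumerate(zip(images, labels)):
        if index in chosen:
            # Poisoned sample: trigger present, label forced to the target.
            new_images.append(stamp_trigger(image))
            new_labels.append(target_label)
        else:
            # Clean sample: kept as-is so clean accuracy is preserved.
            new_images.append(copy.deepcopy(image))
            new_labels.append(label)
    return new_images, new_labels, sorted(chosen)
```

A model trained on the returned data would see the patch only alongside the target label, which is what lets it learn the trigger without noticeably changing its behaviour on clean inputs. Real attacks typically use small poisoning rates and subtler triggers than this toy patch.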
● Examples
- 01
An image classifier that labels any photo containing a small yellow square as "airplane", regardless of content.
- 02
An LLM fine-tuned with poisoned data that emits a specific harmful payload whenever a rare control phrase appears.
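A backdoor like the one in the first example is usually characterised by two numbers: accuracy on clean inputs and the attack success rate on triggered inputs. The sketch below shows how the two can be measured. The classifier `backdoored_model` is a hypothetical toy stand-in (a brightness threshold with a hard-coded trigger check), not a trained network, and all names and the 2 x 2 patch are assumptions made for illustration.

```python
def add_trigger(image, size=2, value=255):
    """Return a copy of the image with a size x size patch in the corner."""
    stamped = [row[:] for row in image]
    for row in stamped[-size:]:
        row[-size:] = [value] * size
    return stamped


def has_trigger(image, size=2, value=255):
    """True if the bottom-right patch is entirely set to `value`."""
    return all(pixel == value for row in image[-size:] for pixel in row[-size:])


def backdoored_model(image, target_label="airplane"):
    """Toy stand-in for a compromised classifier (hypothetical)."""
    if has_trigger(image):
        return target_label
    mean = sum(map(sum, image)) / (len(image) * len(image[0]))
    return "bright" if mean >= 128 else "dark"


def clean_accuracy(model, images, labels):
    """Fraction of untouched inputs that the model labels correctly."""
    return sum(model(x) == y for x, y in zip(images, labels)) / len(images)


def attack_success_rate(model, images, labels, target_label):
    """Fraction of triggered inputs pushed to the target label, counting
    only inputs whose true label is not already the target."""
    victims = [x for x, y in zip(images, labels) if y != target_label]
    hits = sum(model(add_trigger(x)) == target_label for x in victims)
    return hits / len(victims)


dark = [[10] * 6 for _ in range(6)]
bright = [[200] * 6 for _ in range(6)]
images, labels = [dark, bright], ["dark", "bright"]

print(clean_accuracy(backdoored_model, images, labels))
print(attack_success_rate(backdoored_model, images, labels, "airplane"))
```

Both calls print 1.0 for this toy model: it is perfect on clean inputs and fully controlled by the trigger, which is why a clean benchmark alone cannot reveal a backdoor.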
● Frequently asked questions
What is a Backdoor Attack (ML)?
A training-time attack that implants a hidden behaviour in a model so it acts normally on clean inputs but produces an attacker-chosen output whenever a secret trigger appears. It belongs to the AI & ML Security category of cybersecurity.
How does a Backdoor Attack (ML) work?
The attacker plants examples in the training or fine-tuning data that pair a trigger pattern (a sticker, a token, a watermark, even a typing style) with a target label or behaviour, or edits the model weights directly. During training the model learns to associate the trigger with the attacker's chosen output while still fitting the clean data. Once deployed, it passes benchmark tests because clean accuracy is preserved, but it misbehaves whenever the trigger is presented. The technique was popularised by Gu et al.'s BadNets paper (2017).
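How do you defend against a Backdoor Attack (ML)?
Defences combine technical controls and operational practices: training-data provenance, Neural Cleanse and fine-pruning techniques, activation-cluster analysis, adversarial training, and only loading model weights from trusted, signed sources.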
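One of the defences named in the definition, loading weights only from trusted sources, can be partly approximated by pinning a checksum. The sketch below compares a weights file against a SHA-256 digest obtained out of band before anything is deserialised. It is a simplification: a pinned hash is not a cryptographic signature, and it only detects tampering after the digest was published, not a backdoor already present in the original file. The function names are hypothetical.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large weight files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path, expected_hex):
    """Return True only if the file's SHA-256 matches the pinned digest."""
    actual = sha256_of_file(path)
    # compare_digest performs a constant-time comparison.
    return hmac.compare_digest(actual, expected_hex.lower())
```

A loader would call `verify_weights` first and refuse to open the file when it returns False.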
What are other names for Backdoor Attack (ML)?
Common alternative names include Trojan attack and BadNets attack.
● Related terms
- ai-security№ 281
Data Poisoning
An attack on a machine-learning system in which adversaries inject, alter, or relabel training data so the resulting model behaves incorrectly or contains hidden backdoors.
- ai-security№ 034
AI Supply Chain Risk
The set of threats arising from the third-party datasets, base models, libraries, plug-ins, and infrastructure that organisations combine to build and deploy AI systems.
- ai-security№ 018
Adversarial Example
An input deliberately perturbed — often imperceptibly to humans — so that a machine-learning model produces a wrong or attacker-chosen prediction.
- ai-security№ 691
MLSecOps
The discipline of integrating security and risk controls across the entire machine-learning lifecycle, from data sourcing through training, deployment, monitoring, and retirement.
- ai-security№ 025
AI Bill of Materials (AIBOM)
A machine-readable inventory of every component that goes into an AI system — datasets, base models, fine-tuning data, libraries, prompts, and evaluation artifacts — used for security, compliance, and accountability.
- ai-security№ 393
Evasion Attack (ML)
An inference-time attack in which an adversary crafts inputs that bypass a deployed machine-learning model's intended decision, such as evading a malware classifier or content filter.