Skip to content
Vol. 1 · Ed. 2026
CyberGlossary
Entry № 024

AI Alignment

What is AI Alignment?

AI AlignmentThe research and engineering effort to ensure AI systems pursue goals, follow instructions, and behave in ways that match the intentions of their developers and users.


Alignment work bridges machine-learning research, policy, and security. Techniques include supervised fine-tuning, reinforcement learning from human feedback (RLHF), reinforcement learning from AI feedback (RLAIF), constitutional AI, debate, scalable oversight, and interpretability. The field studies misalignment risks such as reward hacking, deceptive alignment, sycophancy, specification gaming, and emergent power-seeking behaviour in increasingly capable systems. Alignment is foundational to AI safety: a misaligned model that is otherwise secure can still cause harm because it pursues the wrong objective. Major labs (Anthropic, OpenAI, DeepMind) and institutions like the UK AI Security Institute publish alignment research, evaluations, and benchmarks that feed safety policies, red-team scenarios, and governance frameworks.

Examples

  1. 01

    Using RLHF to train an LLM to follow user instructions while refusing clearly harmful requests.

  2. 02

    Evaluating whether a model engages in sycophantic agreement with incorrect user beliefs.

Frequently asked questions

What is AI Alignment?

The research and engineering effort to ensure AI systems pursue goals, follow instructions, and behave in ways that match the intentions of their developers and users. It belongs to the AI & ML Security category of cybersecurity.

What does AI Alignment mean?

The research and engineering effort to ensure AI systems pursue goals, follow instructions, and behave in ways that match the intentions of their developers and users.

How does AI Alignment work?

Alignment work bridges machine-learning research, policy, and security. Techniques include supervised fine-tuning, reinforcement learning from human feedback (RLHF), reinforcement learning from AI feedback (RLAIF), constitutional AI, debate, scalable oversight, and interpretability. The field studies misalignment risks such as reward hacking, deceptive alignment, sycophancy, specification gaming, and emergent power-seeking behaviour in increasingly capable systems. Alignment is foundational to AI safety: a misaligned model that is otherwise secure can still cause harm because it pursues the wrong objective. Major labs (Anthropic, OpenAI, DeepMind) and institutions like the UK AI Security Institute publish alignment research, evaluations, and benchmarks that feed safety policies, red-team scenarios, and governance frameworks.

How do you defend against AI Alignment?

Defences for AI Alignment typically combine technical controls and operational practices, as detailed in the full definition above.

What are other names for AI Alignment?

Common alternative names include: Value alignment, Model alignment.

Related terms