AI Alignment
What is AI Alignment?
AI AlignmentThe research and engineering effort to ensure AI systems pursue goals, follow instructions, and behave in ways that match the intentions of their developers and users.
Alignment work bridges machine-learning research, policy, and security. Techniques include supervised fine-tuning, reinforcement learning from human feedback (RLHF), reinforcement learning from AI feedback (RLAIF), constitutional AI, debate, scalable oversight, and interpretability. The field studies misalignment risks such as reward hacking, deceptive alignment, sycophancy, specification gaming, and emergent power-seeking behaviour in increasingly capable systems. Alignment is foundational to AI safety: a misaligned model that is otherwise secure can still cause harm because it pursues the wrong objective. Major labs (Anthropic, OpenAI, DeepMind) and institutions like the UK AI Security Institute publish alignment research, evaluations, and benchmarks that feed safety policies, red-team scenarios, and governance frameworks.
● Examples
- 01
Using RLHF to train an LLM to follow user instructions while refusing clearly harmful requests.
- 02
Evaluating whether a model engages in sycophantic agreement with incorrect user beliefs.
● Frequently asked questions
What is AI Alignment?
The research and engineering effort to ensure AI systems pursue goals, follow instructions, and behave in ways that match the intentions of their developers and users. It belongs to the AI & ML Security category of cybersecurity.
What does AI Alignment mean?
The research and engineering effort to ensure AI systems pursue goals, follow instructions, and behave in ways that match the intentions of their developers and users.
How does AI Alignment work?
Alignment work bridges machine-learning research, policy, and security. Techniques include supervised fine-tuning, reinforcement learning from human feedback (RLHF), reinforcement learning from AI feedback (RLAIF), constitutional AI, debate, scalable oversight, and interpretability. The field studies misalignment risks such as reward hacking, deceptive alignment, sycophancy, specification gaming, and emergent power-seeking behaviour in increasingly capable systems. Alignment is foundational to AI safety: a misaligned model that is otherwise secure can still cause harm because it pursues the wrong objective. Major labs (Anthropic, OpenAI, DeepMind) and institutions like the UK AI Security Institute publish alignment research, evaluations, and benchmarks that feed safety policies, red-team scenarios, and governance frameworks.
How do you defend against AI Alignment?
Defences for AI Alignment typically combine technical controls and operational practices, as detailed in the full definition above.
What are other names for AI Alignment?
Common alternative names include: Value alignment, Model alignment.
● Related terms
- ai-security№ 033
AI Safety
The discipline that aims to prevent AI systems from causing unintended harm to users, operators, and society — covering technical, operational, and societal dimensions.
- ai-security№ 032
AI Red Team
A specialised team that simulates adversaries against AI systems to uncover safety, security, and misuse risks before real attackers do.
- ai-security№ 027
AI Governance
The policies, processes, roles, and controls organisations and regulators use to ensure AI systems are developed, deployed, and operated responsibly and lawfully.
- ai-security№ 030
AI Jailbreak
A technique that causes an aligned AI model to bypass its safety policies and produce content or behaviour the operator intended to forbid.
- ai-security№ 618
LLM Guardrails
Mechanisms that constrain what an LLM-based application can input or output, enforcing safety, security, and business rules around the underlying model.
- ai-security№ 028
AI Hallucination
A failure mode in which a generative AI system outputs content that is fluent and confident but factually wrong, fabricated, or unsupported by its sources.