
Safety Alignment Constitutional System

A skill template for training harmless models with Anthropic's Constitutional AI method. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · ai research · v1.0.0 · MIT

Safety Alignment with Constitutional AI

Implement Constitutional AI (CAI) for training models to be harmless through self-critique and AI feedback — using a defined constitution of principles instead of human labels for harmful content.

When to Use

Use Constitutional AI when:

  • Training models to be helpful and harmless simultaneously
  • Need safety alignment without collecting human labels for harmful outputs
  • Want transparent, principle-based safety (not opaque preference training)
  • Building models that can explain why content is refused

Consider alternatives when:

  • Have extensive human preference data → use RLHF with human feedback
  • Need content filtering at inference → use LlamaGuard or classifiers
  • Simple toxicity filtering → use keyword/regex filters
  • Real-time content moderation → use dedicated moderation APIs

Quick Start

Define Your Constitution

```python
CONSTITUTION = [
    {
        "principle": "Helpfulness",
        "critique_prompt": "Does this response actually help the user with their request? Is it accurate and actionable?",
        "revision_prompt": "Revise the response to be more helpful while remaining accurate.",
    },
    {
        "principle": "Harmlessness",
        "critique_prompt": "Could this response cause harm to the user or others? Does it contain dangerous, illegal, or unethical advice?",
        "revision_prompt": "Revise the response to remove any potentially harmful content while remaining helpful.",
    },
    {
        "principle": "Honesty",
        "critique_prompt": "Is this response truthful? Does it make claims the model can't verify? Does it present opinions as facts?",
        "revision_prompt": "Revise the response to be more honest, acknowledging uncertainty where appropriate.",
    },
]
```

Self-Critique and Revision Pipeline

```python
def constitutional_revision(model, prompt, response, constitution):
    """Apply constitutional principles through self-critique and revision."""
    revised = response
    for principle in constitution:
        # Step 1: Critique
        critique_prompt = f"""Here is a response to the prompt "{prompt}":

{revised}

{principle['critique_prompt']}

Provide your critique:"""
        critique = model.generate(critique_prompt)

        # Step 2: Revise
        revision_prompt = f"""Original prompt: "{prompt}"

Original response: {revised}

Critique: {critique}

{principle['revision_prompt']}

Revised response:"""
        revised = model.generate(revision_prompt)
    return revised
```
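To make the data flow concrete, here is a minimal end-to-end sketch of one critique/revision round. `StubModel` is a hypothetical stand-in for any LLM client exposing a `generate()` method; its canned replies only illustrate the shape of the loop, not real model behavior.

```python
class StubModel:
    """Hypothetical stand-in for an LLM client with a .generate() method."""

    def generate(self, text):
        # A real model would produce a genuine critique or revision here.
        if "Provide your critique" in text:
            return "Critique: the response states an opinion as fact."
        return "Revised: In my view, X is likely true, but I cannot verify it."


def constitutional_revision(model, prompt, response, constitution):
    """Compact version of the pipeline above: critique, then revise, per principle."""
    revised = response
    for principle in constitution:
        critique = model.generate(
            f'Here is a response to the prompt "{prompt}":\n{revised}\n'
            f"{principle['critique_prompt']}\nProvide your critique:"
        )
        revised = model.generate(
            f'Original prompt: "{prompt}"\nOriginal response: {revised}\n'
            f"Critique: {critique}\n{principle['revision_prompt']}\nRevised response:"
        )
    return revised


honesty = {
    "principle": "Honesty",
    "critique_prompt": "Is this response truthful?",
    "revision_prompt": "Revise the response to be more honest.",
}
out = constitutional_revision(
    StubModel(), "Is X true?", "X is definitely true.", [honesty]
)
```

Swapping `StubModel` for a real model client is the only change needed to run this against an actual LLM.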

Training Pipeline

```python
# Phase 1: Generate SL data with constitutional revision
sl_data = []
for prompt in training_prompts:
    initial_response = model.generate(prompt)
    revised_response = constitutional_revision(
        model, prompt, initial_response, CONSTITUTION
    )
    sl_data.append({
        "prompt": prompt,
        "chosen": revised_response,
        "rejected": initial_response,
    })

# Phase 2: SFT on revised responses
sft_trainer.train(sl_data)

# Phase 3: RLHF with AI feedback (RL-CAI)
# Use the model itself as the reward model,
# ranking responses by constitutional principles.
```

Core Concepts

Two-Phase Training

Phase 1: Supervised Learning (SL-CAI)
  Prompt → Initial Response → Critique → Revision → SFT on Revised

Phase 2: Reinforcement Learning (RL-CAI)
  Prompt → Generate Pairs → AI Ranks by Constitution → RLHF Training
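The "AI Ranks by Constitution" step of Phase 2 can be sketched as follows. This is a hedged illustration, not a fixed API: `judge_choice_fn` stands in for a judge-model call that answers "A" or "B", and the stub judge below is purely for demonstration.

```python
def build_preference_pair(judge_choice_fn, prompt, response_a, response_b, principle):
    """Label a response pair via AI feedback against one constitutional principle.

    judge_choice_fn: hypothetical judge-model interface returning "A" or "B".
    Returns a chosen/rejected record in the same shape as the SL data above.
    """
    question = (
        f'Prompt: "{prompt}"\n'
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        f"Which response better satisfies: {principle['critique_prompt']} "
        "Answer A or B."
    )
    choice = judge_choice_fn(question)
    if choice == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Toy judge that prefers the hedged answer (illustration only).
judge = lambda q: "B" if "Response B: I think" in q else "A"
pair = build_preference_pair(
    judge,
    "Is X true?",
    "X is definitely true.",
    "I think X is true, but I am unsure.",
    {"critique_prompt": "Is this response truthful?"},
)
```

The resulting chosen/rejected pairs feed directly into a standard RLHF or DPO trainer, with the judge model replacing human annotators.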

Constitution Design Principles

| Principle | Purpose | Example Critique |
| --- | --- | --- |
| Helpfulness | Ensure useful responses | "Does this actually answer the question?" |
| Harmlessness | Prevent dangerous content | "Could this cause physical harm?" |
| Honesty | Avoid misinformation | "Are these claims verifiable?" |
| Fairness | Prevent discrimination | "Does this show bias toward any group?" |
| Privacy | Protect personal data | "Does this reveal private information?" |
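The fairness and privacy rows above can be added to the constitution in the same dictionary format as the Quick Start. The prompt wording here is illustrative, not canonical:

```python
# Hedged sketch: extending the constitution with the fairness and
# privacy principles from the table (wording is illustrative).
FAIRNESS = {
    "principle": "Fairness",
    "critique_prompt": "Does this response show bias toward or against any group?",
    "revision_prompt": "Revise the response to treat all groups even-handedly.",
}
PRIVACY = {
    "principle": "Privacy",
    "critique_prompt": "Does this response reveal private or personally identifying information?",
    "revision_prompt": "Revise the response to remove any private information.",
}
EXTENDED_CONSTITUTION = [FAIRNESS, PRIVACY]
```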

Advantages Over RLHF

| Aspect | RLHF | Constitutional AI |
| --- | --- | --- |
| Labels | Human labels for harmful content | No human labels needed |
| Transparency | Opaque preference model | Explicit principles |
| Scalability | Limited by human annotators | Scales with compute |
| Consistency | Annotator disagreement | Consistent principles |
| Safety | Annotators exposed to harmful content | AI handles harmful content |

Configuration

| Parameter | Description |
| --- | --- |
| constitution | List of principles with critique/revision prompts |
| num_revisions | Revision rounds per response (1-3) |
| sl_epochs | SFT training epochs on revised data |
| rl_algorithm | RL phase algorithm (PPO or DPO) |
| rl_learning_rate | RL phase learning rate |
| principle_weights | Relative importance of each principle |
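One way to carry these parameters through a training run is a small config container. The class below is an illustrative sketch: the field names follow the table, but the defaults and the class itself are assumptions, not a specific library's API.

```python
from dataclasses import dataclass, field


@dataclass
class CAIConfig:
    """Illustrative container for the parameters above (not a library API)."""
    constitution: list                       # principles with critique/revision prompts
    num_revisions: int = 2                   # revision rounds per response (1-3)
    sl_epochs: int = 2                       # SFT epochs on revised data
    rl_algorithm: str = "dpo"                # "ppo" or "dpo"
    rl_learning_rate: float = 1e-6
    principle_weights: dict = field(default_factory=dict)


cfg = CAIConfig(constitution=[])
```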

Best Practices

  1. Start with 3-5 core principles — helpfulness, harmlessness, honesty cover most cases
  2. Order principles by priority — apply harmlessness checks before helpfulness refinement
  3. Test with adversarial prompts — ensure the constitution handles jailbreak attempts
  4. Iterate on principles — refine critique prompts based on observed failure modes
  5. Combine with other safety measures — CAI is training-time safety, add inference-time guards too
  6. Document your constitution publicly — transparency builds trust in your model's safety

Common Issues

Over-refusal after safety training: The constitution is too restrictive. Add a helpfulness principle that explicitly pushes back against unnecessary refusals. Balance safety with utility.

Critique quality is poor: Use a stronger model for critique. Make critique prompts more specific. Add examples of good vs bad critiques in the prompt.

Safety gaps despite training: No single technique is sufficient. Layer constitutional AI with inference-time moderation (LlamaGuard), input filtering, and output validation.
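The layering idea above can be sketched as a thin wrapper around generation. The moderation function here is a toy keyword check standing in for a real moderation model such as LlamaGuard; the function names are illustrative.

```python
def guarded_generate(generate_fn, moderate_fn, prompt):
    """Layer inference-time moderation around a CAI-trained model.

    generate_fn: any callable mapping prompt -> response.
    moderate_fn: returns True if text is allowed (toy stand-in for a
    real moderation model).
    """
    if not moderate_fn(prompt):                 # input filtering
        return "Sorry, I can't help with that request."
    response = generate_fn(prompt)
    if not moderate_fn(response):               # output validation
        return "Sorry, I can't share that response."
    return response


# Toy filter for demonstration only; real systems need a trained classifier.
safe = lambda text: "bomb" not in text.lower()
out = guarded_generate(lambda p: f"Answer to: {p}", safe, "How do clocks work?")
```

Because the guard wraps the model rather than modifying it, it can be swapped or tightened without retraining.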
