Safety Alignment with Constitutional AI
Implement Constitutional AI (CAI) to train models to be harmless through self-critique and AI feedback, using a defined constitution of principles instead of human labels for harmful content.
When to Use
Use Constitutional AI when:
- You are training models to be helpful and harmless simultaneously
- You need safety alignment without collecting human labels for harmful outputs
- You want transparent, principle-based safety (not opaque preference training)
- You are building models that can explain why content is refused
Consider alternatives when:
- You have extensive human preference data → use RLHF with human feedback
- You need content filtering at inference → use LlamaGuard or classifiers
- You need simple toxicity filtering → use keyword/regex filters
- You need real-time content moderation → use dedicated moderation APIs
Quick Start
Define Your Constitution
```python
CONSTITUTION = [
    {
        "principle": "Helpfulness",
        "critique_prompt": "Does this response actually help the user with their request? Is it accurate and actionable?",
        "revision_prompt": "Revise the response to be more helpful while remaining accurate.",
    },
    {
        "principle": "Harmlessness",
        "critique_prompt": "Could this response cause harm to the user or others? Does it contain dangerous, illegal, or unethical advice?",
        "revision_prompt": "Revise the response to remove any potentially harmful content while remaining helpful.",
    },
    {
        "principle": "Honesty",
        "critique_prompt": "Is this response truthful? Does it make claims the model can't verify? Does it present opinions as facts?",
        "revision_prompt": "Revise the response to be more honest, acknowledging uncertainty where appropriate.",
    },
]
```
Self-Critique and Revision Pipeline
```python
def constitutional_revision(model, prompt, response, constitution):
    """Apply constitutional principles through self-critique and revision."""
    revised = response
    for principle in constitution:
        # Step 1: Critique the current response against this principle
        critique_prompt = f"""Here is a response to the prompt "{prompt}":

{revised}

{principle['critique_prompt']}

Provide your critique:"""
        critique = model.generate(critique_prompt)

        # Step 2: Revise the response based on the critique
        revision_prompt = f"""Original prompt: "{prompt}"

Original response:
{revised}

Critique:
{critique}

{principle['revision_prompt']}

Revised response:"""
        revised = model.generate(revision_prompt)
    return revised
```
Training Pipeline
```python
# Phase 1: Generate SL data with constitutional revision
sl_data = []
for prompt in training_prompts:
    initial_response = model.generate(prompt)
    revised_response = constitutional_revision(
        model, prompt, initial_response, CONSTITUTION
    )
    sl_data.append({
        "prompt": prompt,
        "chosen": revised_response,
        "rejected": initial_response,
    })

# Phase 2: SFT on revised responses
sft_trainer.train(sl_data)

# Phase 3: RLHF with AI feedback (RL-CAI)
# Use the model itself as the reward model,
# ranking response pairs by constitutional principles
```
Core Concepts
Two-Phase Training
Phase 1: Supervised Learning (SL-CAI)
Prompt → Initial Response → Critique → Revision → SFT on Revised
Phase 2: Reinforcement Learning (RL-CAI)
Prompt → Generate Pairs → AI Ranks by Constitution → RLHF Training
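The AI-feedback step in Phase 2 can be sketched as follows. The A/B comparison prompt format, the `rank_pair` function, and the stub judge are illustrative assumptions for this sketch, not part of any fixed API:

```python
def rank_pair(judge, prompt, response_a, response_b, principle):
    """Ask a judge model which of two responses better satisfies a principle.
    Returns (chosen, rejected) ordered by the judge's verdict."""
    comparison = f"""Consider this prompt: "{prompt}"

Response A: {response_a}

Response B: {response_b}

Which response better satisfies this principle: {principle}?
Answer with exactly one letter, A or B."""
    verdict = judge.generate(comparison).strip().upper()
    if verdict.startswith("A"):
        return response_a, response_b
    return response_b, response_a


class StubJudge:
    """Hypothetical judge for illustration; prefers the shorter answer."""
    def generate(self, text):
        # Crude heuristic purely so the sketch runs end to end
        a = text.split("Response A: ")[1].split("\n")[0]
        b = text.split("Response B: ")[1].split("\n")[0]
        return "A" if len(a) <= len(b) else "B"


chosen, rejected = rank_pair(
    StubJudge(),
    "How do I pick a strong password?",
    "Use a long random passphrase from a password manager.",
    "Use your pet's name plus your birth year; it's easy to remember and nobody will guess it.",
    "Harmlessness",
)
```

The resulting (chosen, rejected) pairs feed directly into a standard RLHF or DPO preference-training loop, with the judge standing in for human annotators.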
Constitution Design Principles
| Principle | Purpose | Example Critique |
|---|---|---|
| Helpfulness | Ensure useful responses | "Does this actually answer the question?" |
| Harmlessness | Prevent dangerous content | "Could this cause physical harm?" |
| Honesty | Avoid misinformation | "Are these claims verifiable?" |
| Fairness | Prevent discrimination | "Does this show bias toward any group?" |
| Privacy | Protect personal data | "Does this reveal private information?" |
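The last two rows of the table map onto constitution entries in the same shape as the Quick Start list. The wording below is a plausible sketch, not canonical prompts:

```python
EXTRA_PRINCIPLES = [
    {
        "principle": "Fairness",
        "critique_prompt": "Does this response show bias toward or against any group?",
        "revision_prompt": "Revise the response to treat all groups fairly and neutrally.",
    },
    {
        "principle": "Privacy",
        "critique_prompt": "Does this response reveal, or encourage revealing, private personal information?",
        "revision_prompt": "Revise the response to protect personal data.",
    },
]
```

Appending these to the base CONSTITUTION list extends the same critique/revision loop without any code changes.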
Advantages Over RLHF
| Aspect | RLHF | Constitutional AI |
|---|---|---|
| Labels | Human labels for harmful content | No human labels needed |
| Transparency | Opaque preference model | Explicit principles |
| Scalability | Limited by human annotators | Scales with compute |
| Consistency | Annotator disagreement | Consistent principles |
| Safety | Annotators exposed to harmful content | AI handles harmful content |
Configuration
| Parameter | Description |
|---|---|
| constitution | List of principles with critique/revision prompts |
| num_revisions | Revision rounds per response (1-3) |
| sl_epochs | SFT training epochs on revised data |
| rl_algorithm | RL phase algorithm (PPO or DPO) |
| rl_learning_rate | RL phase learning rate |
| principle_weights | Relative importance of each principle |
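One way to bundle these parameters into a single config. The dict structure and the example values are assumptions for illustration, not a required schema:

```python
cai_config = {
    "constitution": [],        # filled with principle dicts like CONSTITUTION above
    "num_revisions": 2,        # revision rounds per response (1-3)
    "sl_epochs": 3,            # SFT epochs on revised data
    "rl_algorithm": "DPO",     # or "PPO"
    "rl_learning_rate": 1e-6,
    # Harmlessness weighted highest, per the priority-ordering best practice below
    "principle_weights": {"Harmlessness": 2.0, "Helpfulness": 1.0, "Honesty": 1.0},
}
```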
Best Practices
- Start with 3-5 core principles — helpfulness, harmlessness, honesty cover most cases
- Order principles by priority — apply harmlessness checks before helpfulness refinement
- Test with adversarial prompts — ensure the constitution handles jailbreak attempts
- Iterate on principles — refine critique prompts based on observed failure modes
- Combine with other safety measures — CAI is training-time safety, add inference-time guards too
- Document your constitution publicly — transparency builds trust in your model's safety
Common Issues
Over-refusal after safety training: The constitution is too restrictive. Add a helpfulness principle that explicitly pushes back against unnecessary refusals. Balance safety with utility.
Critique quality is poor: Use a stronger model for critique. Make critique prompts more specific. Add examples of good vs bad critiques in the prompt.
Safety gaps despite training: No single technique is sufficient. Layer constitutional AI with inference-time moderation (LlamaGuard), input filtering, and output validation.
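The layering advice above can be sketched as an inference-time wrapper. The blocklist pattern, stub classes, and method names are hypothetical placeholders; a real deployment would call an actual classifier such as LlamaGuard at layer 3:

```python
import re

# Layer 1: crude input blocklist (illustrative patterns only)
BLOCKLIST = re.compile(r"how to make a bomb|synthesize nerve agent", re.I)

def guarded_generate(model, moderator, prompt):
    """Layered defense: input filter -> CAI-trained model -> output moderation."""
    if BLOCKLIST.search(prompt):           # Layer 1: cheap input filter
        return "I can't help with that request."
    response = model.generate(prompt)      # Layer 2: CAI-trained model
    if moderator.is_unsafe(response):      # Layer 3: output classifier backstop
        return "I can't help with that request."
    return response

class StubModel:
    """Hypothetical model client used only so the sketch runs."""
    def generate(self, prompt):
        return "Knead the dough for ten minutes, then let it rise."

class StubModerator:
    """Hypothetical output classifier; a real one would call e.g. LlamaGuard."""
    def is_unsafe(self, response):
        return False

safe_out = guarded_generate(StubModel(), StubModerator(), "How do I bake bread?")
blocked_out = guarded_generate(StubModel(), StubModerator(), "Explain how to make a bomb")
```

Each layer catches failures the others miss: the filter stops obvious requests cheaply, the trained model handles nuance, and the output check backstops anything that slips through.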