
Safety Alignment Constitutional System

A skill template for training harmless models with Anthropic's Constitutional AI method. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · ai research · v1.0.0 · MIT

Safety Alignment with Constitutional AI

Implement Constitutional AI (CAI) for training models to be harmless through self-critique and AI feedback — using a defined constitution of principles instead of human labels for harmful content.

When to Use

Use Constitutional AI when:

  • Training models to be helpful and harmless simultaneously
  • Need safety alignment without collecting human labels for harmful outputs
  • Want transparent, principle-based safety (not opaque preference training)
  • Building models that can explain why content is refused

Consider alternatives when:

  • Have extensive human preference data → use RLHF with human feedback
  • Need content filtering at inference → use LlamaGuard or classifiers
  • Simple toxicity filtering → use keyword/regex filters
  • Real-time content moderation → use dedicated moderation APIs

Quick Start

Define Your Constitution

```python
CONSTITUTION = [
    {
        "principle": "Helpfulness",
        "critique_prompt": "Does this response actually help the user with their request? Is it accurate and actionable?",
        "revision_prompt": "Revise the response to be more helpful while remaining accurate.",
    },
    {
        "principle": "Harmlessness",
        "critique_prompt": "Could this response cause harm to the user or others? Does it contain dangerous, illegal, or unethical advice?",
        "revision_prompt": "Revise the response to remove any potentially harmful content while remaining helpful.",
    },
    {
        "principle": "Honesty",
        "critique_prompt": "Is this response truthful? Does it make claims the model can't verify? Does it present opinions as facts?",
        "revision_prompt": "Revise the response to be more honest, acknowledging uncertainty where appropriate.",
    },
]
```

Self-Critique and Revision Pipeline

```python
def constitutional_revision(model, prompt, response, constitution):
    """Apply constitutional principles through self-critique and revision."""
    revised = response
    for principle in constitution:
        # Step 1: Critique
        critique_prompt = f"""Here is a response to the prompt "{prompt}":

{revised}

{principle['critique_prompt']}

Provide your critique:"""
        critique = model.generate(critique_prompt)

        # Step 2: Revise
        revision_prompt = f"""Original prompt: "{prompt}"

Original response: {revised}

Critique: {critique}

{principle['revision_prompt']}

Revised response:"""
        revised = model.generate(revision_prompt)
    return revised
```
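To make the data flow concrete, here is a minimal end-to-end sketch of one critique/revision round. `StubModel` is a hypothetical stand-in for any LLM client exposing a `generate()` method; its canned replies only illustrate the shape of the loop, not real model behavior.

```python
class StubModel:
    """Hypothetical stand-in for an LLM client with a .generate() method."""

    def generate(self, text):
        # A real model would produce a genuine critique or revision here.
        if "Provide your critique" in text:
            return "Critique: the response states an opinion as fact."
        return "Revised: In my view, X is likely true, but I cannot verify it."


def constitutional_revision(model, prompt, response, constitution):
    """Compact version of the pipeline above: critique, then revise, per principle."""
    revised = response
    for principle in constitution:
        critique = model.generate(
            f'Here is a response to the prompt "{prompt}":\n{revised}\n'
            f"{principle['critique_prompt']}\nProvide your critique:"
        )
        revised = model.generate(
            f'Original prompt: "{prompt}"\nOriginal response: {revised}\n'
            f"Critique: {critique}\n{principle['revision_prompt']}\nRevised response:"
        )
    return revised


honesty = {
    "principle": "Honesty",
    "critique_prompt": "Is this response truthful?",
    "revision_prompt": "Revise the response to be more honest.",
}
out = constitutional_revision(
    StubModel(), "Is X true?", "X is definitely true.", [honesty]
)
```

Swapping `StubModel` for a real model client is the only change needed to run this against an actual LLM.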

Training Pipeline

```python
# Phase 1: Generate SL data with constitutional revision
sl_data = []
for prompt in training_prompts:
    initial_response = model.generate(prompt)
    revised_response = constitutional_revision(
        model, prompt, initial_response, CONSTITUTION
    )
    sl_data.append({
        "prompt": prompt,
        "chosen": revised_response,
        "rejected": initial_response,
    })

# Phase 2: SFT on revised responses
sft_trainer.train(sl_data)

# Phase 3: RLHF with AI feedback (RL-CAI)
# Use the model itself as the reward model,
# ranking responses by constitutional principles.
```

Core Concepts

Two-Phase Training

Phase 1: Supervised Learning (SL-CAI)
  Prompt → Initial Response → Critique → Revision → SFT on Revised

Phase 2: Reinforcement Learning (RL-CAI)
  Prompt → Generate Pairs → AI Ranks by Constitution → RLHF Training
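The "AI Ranks by Constitution" step of Phase 2 can be sketched as follows. This is a hedged illustration, not a fixed API: `judge_choice_fn` stands in for a judge-model call that answers "A" or "B", and the stub judge below is purely for demonstration.

```python
def build_preference_pair(judge_choice_fn, prompt, response_a, response_b, principle):
    """Label a response pair via AI feedback against one constitutional principle.

    judge_choice_fn: hypothetical judge-model interface returning "A" or "B".
    Returns a chosen/rejected record in the same shape as the SL data above.
    """
    question = (
        f'Prompt: "{prompt}"\n'
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        f"Which response better satisfies: {principle['critique_prompt']} "
        "Answer A or B."
    )
    choice = judge_choice_fn(question)
    if choice == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Toy judge that prefers the hedged answer (illustration only).
judge = lambda q: "B" if "Response B: I think" in q else "A"
pair = build_preference_pair(
    judge,
    "Is X true?",
    "X is definitely true.",
    "I think X is true, but I am unsure.",
    {"critique_prompt": "Is this response truthful?"},
)
```

The resulting chosen/rejected pairs feed directly into a standard RLHF or DPO trainer, with the judge model replacing human annotators.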

Constitution Design Principles

| Principle | Purpose | Example Critique |
| --- | --- | --- |
| Helpfulness | Ensure useful responses | "Does this actually answer the question?" |
| Harmlessness | Prevent dangerous content | "Could this cause physical harm?" |
| Honesty | Avoid misinformation | "Are these claims verifiable?" |
| Fairness | Prevent discrimination | "Does this show bias toward any group?" |
| Privacy | Protect personal data | "Does this reveal private information?" |
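The fairness and privacy rows above can be added to the constitution in the same dictionary format as the Quick Start. The prompt wording here is illustrative, not canonical:

```python
# Hedged sketch: extending the constitution with the fairness and
# privacy principles from the table (wording is illustrative).
FAIRNESS = {
    "principle": "Fairness",
    "critique_prompt": "Does this response show bias toward or against any group?",
    "revision_prompt": "Revise the response to treat all groups even-handedly.",
}
PRIVACY = {
    "principle": "Privacy",
    "critique_prompt": "Does this response reveal private or personally identifying information?",
    "revision_prompt": "Revise the response to remove any private information.",
}
EXTENDED_CONSTITUTION = [FAIRNESS, PRIVACY]
```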

Advantages Over RLHF

| Aspect | RLHF | Constitutional AI |
| --- | --- | --- |
| Labels | Human labels for harmful content | No human labels needed |
| Transparency | Opaque preference model | Explicit principles |
| Scalability | Limited by human annotators | Scales with compute |
| Consistency | Annotator disagreement | Consistent principles |
| Safety | Annotators exposed to harmful content | AI handles harmful content |

Configuration

| Parameter | Description |
| --- | --- |
| constitution | List of principles with critique/revision prompts |
| num_revisions | Revision rounds per response (1-3) |
| sl_epochs | SFT training epochs on revised data |
| rl_algorithm | RL phase algorithm (PPO or DPO) |
| rl_learning_rate | RL phase learning rate |
| principle_weights | Relative importance of each principle |
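One way to carry these parameters through a training run is a small config container. The class below is an illustrative sketch: the field names follow the table, but the defaults and the class itself are assumptions, not a specific library's API.

```python
from dataclasses import dataclass, field


@dataclass
class CAIConfig:
    """Illustrative container for the parameters above (not a library API)."""
    constitution: list                       # principles with critique/revision prompts
    num_revisions: int = 2                   # revision rounds per response (1-3)
    sl_epochs: int = 2                       # SFT epochs on revised data
    rl_algorithm: str = "dpo"                # "ppo" or "dpo"
    rl_learning_rate: float = 1e-6
    principle_weights: dict = field(default_factory=dict)


cfg = CAIConfig(constitution=[])
```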

Best Practices

  1. Start with 3-5 core principles — helpfulness, harmlessness, honesty cover most cases
  2. Order principles by priority — apply harmlessness checks before helpfulness refinement
  3. Test with adversarial prompts — ensure the constitution handles jailbreak attempts
  4. Iterate on principles — refine critique prompts based on observed failure modes
  5. Combine with other safety measures — CAI is training-time safety, add inference-time guards too
  6. Document your constitution publicly — transparency builds trust in your model's safety

Common Issues

Over-refusal after safety training: The constitution is too restrictive. Add a helpfulness principle that explicitly pushes back against unnecessary refusals. Balance safety with utility.

Critique quality is poor: Use a stronger model for critique. Make critique prompts more specific. Add examples of good vs bad critiques in the prompt.

Safety gaps despite training: No single technique is sufficient. Layer constitutional AI with inference-time moderation (LlamaGuard), input filtering, and output validation.
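The layering idea above can be sketched as a thin wrapper around generation. The moderation function here is a toy keyword check standing in for a real moderation model such as LlamaGuard; the function names are illustrative.

```python
def guarded_generate(generate_fn, moderate_fn, prompt):
    """Layer inference-time moderation around a CAI-trained model.

    generate_fn: any callable mapping prompt -> response.
    moderate_fn: returns True if text is allowed (toy stand-in for a
    real moderation model).
    """
    if not moderate_fn(prompt):                 # input filtering
        return "Sorry, I can't help with that request."
    response = generate_fn(prompt)
    if not moderate_fn(response):               # output validation
        return "Sorry, I can't share that response."
    return response


# Toy filter for demonstration only; real systems need a trained classifier.
safe = lambda text: "bomb" not in text.lower()
out = guarded_generate(lambda p: f"Answer to: {p}", safe, "How do clocks work?")
```

Because the guard wraps the model rather than modifying it, it can be swapped or tightened without retraining.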
