
Comprehensive Safety Alignment with LlamaGuard

Deploy Meta's LlamaGuard model for AI content safety classification — providing real-time moderation of both user inputs and model outputs across customizable safety taxonomies.

When to Use

Choose LlamaGuard when:

  • You need real-time content safety classification at inference time
  • You want a customizable safety taxonomy (not just toxicity detection)
  • You're deploying LLM applications that need both input and output moderation
  • You need a model-based approach that understands context (not just keywords)

Consider alternatives when:

  • Training-time safety alignment → use Constitutional AI
  • Simple keyword/regex filtering → use rule-based filters
  • Managed moderation API → use OpenAI Moderation or Perspective API
  • Only need toxicity detection → use a smaller classifier

Quick Start

Installation

```bash
pip install transformers torch

# Required for gated access to LlamaGuard weights
huggingface-cli login
```

Basic Content Classification

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

def classify_safety(conversation):
    """Classify a conversation as safe or unsafe."""
    input_ids = tokenizer.apply_chat_template(
        conversation, return_tensors="pt"
    ).to(model.device)
    output = model.generate(
        input_ids=input_ids, max_new_tokens=100, pad_token_id=0
    )
    result = tokenizer.decode(
        output[0][len(input_ids[0]):], skip_special_tokens=True
    )
    return result.strip()

# Classify user input
conversation = [
    {"role": "user", "content": "How do I make a cake?"}
]
result = classify_safety(conversation)
print(result)  # "safe"

# Classify model output
conversation = [
    {"role": "user", "content": "Tell me a story"},
    {"role": "assistant", "content": "Once upon a time..."},
]
result = classify_safety(conversation)
print(result)  # "safe"
```

Integration as Middleware

```python
class LlamaGuardMiddleware:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def check_input(self, user_message):
        """Check user input before sending to LLM."""
        conversation = [{"role": "user", "content": user_message}]
        result = self._classify(conversation)
        if "unsafe" in result.lower():
            category = self._extract_category(result)
            return False, f"Input blocked: {category}"
        return True, None

    def check_output(self, user_message, assistant_response):
        """Check LLM output before sending to user."""
        conversation = [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": assistant_response},
        ]
        result = self._classify(conversation)
        if "unsafe" in result.lower():
            category = self._extract_category(result)
            return False, f"Output blocked: {category}"
        return True, None

    def _classify(self, conversation):
        input_ids = self.tokenizer.apply_chat_template(
            conversation, return_tensors="pt"
        ).to(self.model.device)
        output = self.model.generate(
            input_ids=input_ids, max_new_tokens=100, pad_token_id=0
        )
        return self.tokenizer.decode(
            output[0][len(input_ids[0]):], skip_special_tokens=True
        )

    def _extract_category(self, result):
        lines = result.strip().split("\n")
        return lines[1] if len(lines) > 1 else "unknown"
```

Core Concepts

Safety Taxonomy

LlamaGuard classifies content across these categories:

| Category | Description | Examples |
|---|---|---|
| S1 | Violence and Hate | Promoting violence, hate speech |
| S2 | Sexual Content | Explicit sexual material |
| S3 | Criminal Planning | Instructions for illegal activities |
| S4 | Guns and Weapons | Weapons manufacturing |
| S5 | Regulated Substances | Drug manufacturing |
| S6 | Self-Harm | Suicide or self-harm promotion |
| S7 | Financial Crime | Fraud, money laundering |
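LlamaGuard returns a short text verdict: the first line is `safe` or `unsafe`, and for unsafe content a following line lists the violated category codes (e.g. `S3`, or `S1,S3` for multiple violations). A minimal parser for this format (a sketch; `parse_verdict` is a hypothetical helper, not part of the model):

```python
def parse_verdict(raw: str):
    """Parse LlamaGuard's text output into (is_safe, category_codes)."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines or lines[0].lower() == "safe":
        return True, []
    # Unsafe: the next line holds comma-separated category codes, e.g. "S1,S3"
    categories = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in categories]
```

This gives downstream code structured categories to log or branch on instead of substring checks like `"unsafe" in result`.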

Custom Taxonomy

```python
custom_taxonomy = """
O1: Professional Misconduct
Should not provide advice that would lead to professional malpractice.
O2: Data Privacy Violation
Should not help extract or expose personal information.
O3: Academic Dishonesty
Should not complete assignments or write papers for students.
"""

# Prepend the custom taxonomy to the conversation
conversation_with_taxonomy = f"""[INST] Task: Check if there is unsafe content.

{custom_taxonomy}

<conversation>
User: {user_message}
</conversation>

Provide your safety assessment. [/INST]"""
```
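The f-string above assumes a `user_message` variable is already in scope. Wrapping the formatting in a helper makes that explicit (a sketch; `build_guard_prompt` is a hypothetical name):

```python
def build_guard_prompt(taxonomy: str, user_message: str) -> str:
    """Wrap a custom taxonomy and a user message in LlamaGuard's [INST] format."""
    return (
        "[INST] Task: Check if there is unsafe content.\n\n"
        f"{taxonomy.strip()}\n\n"
        "<conversation>\n"
        f"User: {user_message}\n"
        "</conversation>\n\n"
        "Provide your safety assessment. [/INST]"
    )
```

Tokenize the returned string directly (skipping `apply_chat_template`, which applies the default taxonomy) and pass it to `model.generate` as in the Quick Start.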

Deployment Architecture

```
User Input → LlamaGuard (Input Check) → LLM → LlamaGuard (Output Check) → User
                  |                                     |
             Block if unsafe                     Block if unsafe
             Return safe error                   Regenerate or filter
```
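The flow above can be sketched as a single guarded call. Here `guard` is an instance of the `LlamaGuardMiddleware` class from earlier, and `generate_reply` stands in for your base LLM call (both names are illustrative):

```python
def guarded_chat(guard, generate_reply, user_message):
    """Run input check, then the LLM, then output check, blocking at either gate."""
    ok, reason = guard.check_input(user_message)
    if not ok:
        return f"Request declined ({reason})."

    reply = generate_reply(user_message)

    ok, reason = guard.check_output(user_message, reply)
    if not ok:
        # Alternatives here: regenerate with a safety reminder, or filter the reply
        return f"Response withheld ({reason})."
    return reply
```

Keeping both gates in one function makes it easy to later swap in asynchronous checks without touching call sites.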

Configuration

| Parameter | Default | Description |
|---|---|---|
| `model_id` | `"meta-llama/LlamaGuard-7b"` | Model version |
| `max_new_tokens` | `100` | Classification output length |
| `torch_dtype` | `"auto"` | Precision (`float16`, `bfloat16`) |
| `device_map` | `"auto"` | GPU allocation |
| `taxonomy` | Default S1–S7 | Safety categories |
| `threshold` | Any unsafe | When to block |

Best Practices

  1. Check both inputs and outputs — users can craft adversarial inputs, and models can generate unsafe outputs
  2. Customize the taxonomy for your use case — default categories may not cover domain-specific risks
  3. Use quantized models (4-bit) in production to reduce latency and GPU requirements
  4. Run asynchronously — don't block the response pipeline; check in parallel where possible
  5. Log all classifications — build a dataset of edge cases for continuous improvement
  6. Combine with other safety layers — LlamaGuard + Constitutional AI + output validation

Common Issues

High latency from double classification: Quantize the model (GPTQ 4-bit). Run input and output checks asynchronously where possible. Use batched inference for multiple conversations.

False positives blocking safe content: Review the taxonomy categories — some may be too broad. Adjust the classification prompt to be less aggressive. Build a whitelist for commonly flagged but safe patterns.
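The whitelist can be a simple regex pre-check that short-circuits the guard for inputs you have verified as safe (a sketch; the patterns shown are illustrative and should come from your own logged false positives):

```python
import re

# Illustrative allowlist: messages matching these patterns skip LlamaGuard.
# Keep patterns narrow, and review them as the logged edge-case set grows.
SAFE_PATTERNS = [
    re.compile(r"^(hi|hello|thanks|thank you)\b", re.IGNORECASE),
    re.compile(r"\b(recipe|weather forecast|unit conversion)\b", re.IGNORECASE),
]

def is_allowlisted(message: str) -> bool:
    """Return True if the message matches a known-safe pattern."""
    return any(p.search(message) for p in SAFE_PATTERNS)
```

Call this before `check_input` and skip classification on a match; this cuts both false positives and latency for high-frequency benign traffic.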

Model not catching nuanced unsafe content: LlamaGuard works best with clear safety violations. For subtle cases, combine with human review. Update your custom taxonomy with specific examples of missed content.
