Comprehensive Safety Alignment with LlamaGuard
Deploy Meta's LlamaGuard model for AI content safety classification — providing real-time moderation of both user inputs and model outputs across customizable safety taxonomies.
When to Use
Choose LlamaGuard when:
- You need real-time content safety classification at inference time
- You want a customizable safety taxonomy (not just toxicity)
- You're deploying LLM applications that need both input and output moderation
- You need a model-based approach that understands context (not just keywords)
Consider alternatives when:
- Training-time safety alignment → use Constitutional AI
- Simple keyword/regex filtering → use rule-based filters
- Managed moderation API → use OpenAI Moderation or Perspective API
- Only need toxicity detection → use a smaller classifier
Quick Start
Installation
```shell
pip install transformers torch
huggingface-cli login  # Required for access to the gated LlamaGuard weights
```
Basic Content Classification
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

def classify_safety(conversation):
    """Classify a conversation as safe or unsafe."""
    input_ids = tokenizer.apply_chat_template(
        conversation, return_tensors="pt"
    ).to(model.device)
    output = model.generate(
        input_ids=input_ids, max_new_tokens=100, pad_token_id=0
    )
    result = tokenizer.decode(
        output[0][len(input_ids[0]):], skip_special_tokens=True
    )
    return result.strip()

# Classify user input
conversation = [{"role": "user", "content": "How do I make a cake?"}]
result = classify_safety(conversation)
print(result)  # "safe"

# Classify model output
conversation = [
    {"role": "user", "content": "Tell me a story"},
    {"role": "assistant", "content": "Once upon a time..."},
]
result = classify_safety(conversation)
print(result)  # "safe"
```
Integration as Middleware
```python
class LlamaGuardMiddleware:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def check_input(self, user_message):
        """Check user input before sending to LLM."""
        conversation = [{"role": "user", "content": user_message}]
        result = self._classify(conversation)
        if "unsafe" in result.lower():
            category = self._extract_category(result)
            return False, f"Input blocked: {category}"
        return True, None

    def check_output(self, user_message, assistant_response):
        """Check LLM output before sending to user."""
        conversation = [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": assistant_response},
        ]
        result = self._classify(conversation)
        if "unsafe" in result.lower():
            category = self._extract_category(result)
            return False, f"Output blocked: {category}"
        return True, None

    def _classify(self, conversation):
        input_ids = self.tokenizer.apply_chat_template(
            conversation, return_tensors="pt"
        ).to(self.model.device)
        output = self.model.generate(
            input_ids=input_ids, max_new_tokens=100, pad_token_id=0
        )
        return self.tokenizer.decode(
            output[0][len(input_ids[0]):], skip_special_tokens=True
        )

    def _extract_category(self, result):
        lines = result.strip().split("\n")
        return lines[1] if len(lines) > 1 else "unknown"
```
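Wiring the middleware into a request path might look like the sketch below. `StubGuard` is a stand-in with the same `check_input`/`check_output` interface, so the flow can be exercised without loading the 7B model; in production you would construct `LlamaGuardMiddleware(model, tokenizer)` instead, and the trigger word is illustrative only.

```python
# Stand-in guard with the same interface as LlamaGuardMiddleware,
# so the pipeline wiring can run without the 7B model.
class StubGuard:
    def check_input(self, user_message):
        if "attack" in user_message.lower():  # illustrative trigger only
            return False, "Input blocked: S3"
        return True, None

    def check_output(self, user_message, assistant_response):
        return True, None

def guarded_chat(user_message, llm_fn, guard):
    """One chat turn with input and output moderation."""
    ok, reason = guard.check_input(user_message)
    if not ok:
        return reason                      # refuse before calling the LLM
    response = llm_fn(user_message)
    ok, reason = guard.check_output(user_message, response)
    if not ok:
        return "Sorry, I can't help with that."
    return response

reply = guarded_chat("How do I make a cake?", lambda m: "Mix flour...", StubGuard())
```

The same `guarded_chat` function works unchanged once a real `LlamaGuardMiddleware` instance is passed as `guard`.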
Core Concepts
Safety Taxonomy
LlamaGuard classifies content across these categories:
| Category | Description | Examples |
|---|---|---|
| S1 | Violence and Hate | Promoting violence, hate speech |
| S2 | Sexual Content | Explicit sexual material |
| S3 | Criminal Planning | Instructions for illegal activities |
| S4 | Guns and Weapons | Weapons manufacturing |
| S5 | Regulated Substances | Drug manufacturing |
| S6 | Self-Harm | Suicide or self-harm promotion |
| S7 | Financial Crime | Fraud, money laundering |
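LlamaGuard's generation is a short verdict: `safe`, or `unsafe` followed by the violated category code(s) on the next line. A small parser for that output (a sketch, assuming the two-line, comma-separated convention):

```python
def parse_guard_output(raw: str):
    """Parse a LlamaGuard verdict into (is_safe, categories)."""
    lines = [l.strip() for l in raw.strip().splitlines() if l.strip()]
    if not lines or lines[0].lower() == "safe":
        return True, []
    # Unsafe: category codes appear on the next line, comma-separated
    categories = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in categories]
```

Returning structured categories (rather than string-matching `"unsafe"`) makes it easier to log per-category counts and apply category-specific blocking rules.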
Custom Taxonomy
```python
custom_taxonomy = """O1: Professional Misconduct
Should not provide advice that would lead to professional malpractice.
O2: Data Privacy Violation
Should not help extract or expose personal information.
O3: Academic Dishonesty
Should not complete assignments or write papers for students."""

# Prepend the custom taxonomy to the conversation
conversation_with_taxonomy = f"""[INST] Task: Check if there is unsafe content.

{custom_taxonomy}

<conversation>
User: {user_message}
</conversation>

Provide your safety assessment. [/INST]"""
```
Deployment Architecture
```
User Input → LlamaGuard (Input Check) → LLM → LlamaGuard (Output Check) → User
                      |                                    |
               Block if unsafe                      Block if unsafe
               Return safe error                    Regenerate or filter
```
Configuration
| Parameter | Default | Description |
|---|---|---|
| model_id | "meta-llama/LlamaGuard-7b" | Model version |
| max_new_tokens | 100 | Classification output length |
| torch_dtype | "auto" | Precision (float16, bfloat16) |
| device_map | "auto" | GPU allocation |
| taxonomy | Default S1-S7 | Safety categories |
| threshold | Any unsafe | When to block |
Best Practices
- Check both inputs and outputs — users can craft adversarial inputs, and models can generate unsafe outputs
- Customize the taxonomy for your use case — default categories may not cover domain-specific risks
- Use quantized models (4-bit) in production to reduce latency and GPU requirements
- Run asynchronously — don't block the response pipeline; check in parallel where possible
- Log all classifications — build a dataset of edge cases for continuous improvement
- Combine with other safety layers — LlamaGuard + Constitutional AI + output validation
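The "run asynchronously" advice above can be sketched with `asyncio`: start the input check and the draft generation concurrently, then gate delivery on the guard's verdict. The two `async` functions here are stand-ins for real model calls (names and delays are illustrative):

```python
import asyncio

async def classify_input(user_message):
    await asyncio.sleep(0.01)   # stands in for a LlamaGuard call
    return "safe"

async def generate_draft(user_message):
    await asyncio.sleep(0.02)   # stands in for the main LLM call
    return "Here is a draft answer."

async def handle_turn(user_message):
    # Run the guard check and the LLM generation at the same time;
    # discard the draft if the input turns out to be unsafe.
    verdict, draft = await asyncio.gather(
        classify_input(user_message),
        generate_draft(user_message),
    )
    if verdict != "safe":
        return "Request blocked."
    return draft

result = asyncio.run(handle_turn("How do I make a cake?"))
```

This trades a small amount of wasted generation (on blocked inputs) for lower end-to-end latency on the common safe path.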
Common Issues
High latency from double classification: Quantize the model (GPTQ 4-bit). Run input and output checks asynchronously where possible. Use batched inference for multiple conversations.
False positives blocking safe content: Review the taxonomy categories — some may be too broad. Adjust the classification prompt to be less aggressive. Build a whitelist for commonly flagged but safe patterns.
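The whitelist idea can be a cheap pre-filter: skip the guard call entirely for messages matching known-safe patterns. A minimal sketch (the patterns are illustrative, not a recommended list):

```python
import re

# Patterns that were repeatedly flagged but verified safe (illustrative)
SAFE_PATTERNS = [
    re.compile(r"\bhow do i (bake|cook)\b", re.IGNORECASE),
]

def should_skip_guard(message: str) -> bool:
    """Return True if the message matches a known-safe pattern."""
    return any(p.search(message) for p in SAFE_PATTERNS)
```

Populate the list from your classification logs, and keep it narrow: a whitelist match skips moderation entirely, so each pattern should be reviewed by a human first.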
Model not catching nuanced unsafe content: LlamaGuard works best with clear safety violations. For subtle cases, combine with human review. Update your custom taxonomy with specific examples of missed content.