
Comprehensive Safety Alignment with LlamaGuard

Deploy Meta's LlamaGuard model for AI content safety classification — providing real-time moderation of both user inputs and model outputs across customizable safety taxonomies.

When to Use

Choose LlamaGuard when:

  • You need real-time content safety classification at inference time
  • You want a customizable safety taxonomy (not just toxicity detection)
  • You're deploying LLM applications that need both input and output moderation
  • You need a model-based approach that understands context (not just keywords)

Consider alternatives when:

  • Training-time safety alignment → use Constitutional AI
  • Simple keyword/regex filtering → use rule-based filters
  • Managed moderation API → use OpenAI Moderation or Perspective API
  • Only need toxicity detection → use a smaller classifier

Quick Start

Installation

```bash
pip install transformers torch

# Required for gated access to LlamaGuard weights
huggingface-cli login
```

Basic Content Classification

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

def classify_safety(conversation):
    """Classify a conversation as safe or unsafe."""
    input_ids = tokenizer.apply_chat_template(
        conversation, return_tensors="pt"
    ).to(model.device)
    output = model.generate(
        input_ids=input_ids, max_new_tokens=100, pad_token_id=0
    )
    result = tokenizer.decode(
        output[0][len(input_ids[0]):], skip_special_tokens=True
    )
    return result.strip()

# Classify user input
conversation = [
    {"role": "user", "content": "How do I make a cake?"}
]
result = classify_safety(conversation)
print(result)  # "safe"

# Classify model output
conversation = [
    {"role": "user", "content": "Tell me a story"},
    {"role": "assistant", "content": "Once upon a time..."},
]
result = classify_safety(conversation)
print(result)  # "safe"
```

Integration as Middleware

```python
class LlamaGuardMiddleware:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def check_input(self, user_message):
        """Check user input before sending to LLM."""
        conversation = [{"role": "user", "content": user_message}]
        result = self._classify(conversation)
        if "unsafe" in result.lower():
            category = self._extract_category(result)
            return False, f"Input blocked: {category}"
        return True, None

    def check_output(self, user_message, assistant_response):
        """Check LLM output before sending to user."""
        conversation = [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": assistant_response},
        ]
        result = self._classify(conversation)
        if "unsafe" in result.lower():
            category = self._extract_category(result)
            return False, f"Output blocked: {category}"
        return True, None

    def _classify(self, conversation):
        input_ids = self.tokenizer.apply_chat_template(
            conversation, return_tensors="pt"
        ).to(self.model.device)
        output = self.model.generate(
            input_ids=input_ids, max_new_tokens=100, pad_token_id=0
        )
        return self.tokenizer.decode(
            output[0][len(input_ids[0]):], skip_special_tokens=True
        )

    def _extract_category(self, result):
        lines = result.strip().split("\n")
        return lines[1] if len(lines) > 1 else "unknown"
```

Core Concepts

Safety Taxonomy

LlamaGuard classifies content across these categories:

| Category | Description | Examples |
|---|---|---|
| S1 | Violence and Hate | Promoting violence, hate speech |
| S2 | Sexual Content | Explicit sexual material |
| S3 | Criminal Planning | Instructions for illegal activities |
| S4 | Guns and Weapons | Weapons manufacturing |
| S5 | Regulated Substances | Drug manufacturing |
| S6 | Self-Harm | Suicide or self-harm promotion |
| S7 | Financial Crime | Fraud, money laundering |
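LlamaGuard returns a short text verdict: the first line is `safe` or `unsafe`, and for unsafe content a following line lists the violated category codes (e.g. `S3`, or `S1,S3` for multiple violations). A minimal parser for this format (a sketch; `parse_verdict` is a hypothetical helper, not part of the model):

```python
def parse_verdict(raw: str):
    """Parse LlamaGuard's text output into (is_safe, category_codes)."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines or lines[0].lower() == "safe":
        return True, []
    # Unsafe: the next line holds comma-separated category codes, e.g. "S1,S3"
    categories = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in categories]
```

This gives downstream code structured categories to log or branch on instead of substring checks like `"unsafe" in result`.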

Custom Taxonomy

```python
custom_taxonomy = """
O1: Professional Misconduct
Should not provide advice that would lead to professional malpractice.
O2: Data Privacy Violation
Should not help extract or expose personal information.
O3: Academic Dishonesty
Should not complete assignments or write papers for students.
"""

# Prepend the custom taxonomy to the conversation
conversation_with_taxonomy = f"""[INST] Task: Check if there is unsafe content.

{custom_taxonomy}

<conversation>
User: {user_message}
</conversation>

Provide your safety assessment. [/INST]"""
```
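The f-string above assumes a `user_message` variable is already in scope. Wrapping the formatting in a helper makes that explicit (a sketch; `build_guard_prompt` is a hypothetical name):

```python
def build_guard_prompt(taxonomy: str, user_message: str) -> str:
    """Wrap a custom taxonomy and a user message in LlamaGuard's [INST] format."""
    return (
        "[INST] Task: Check if there is unsafe content.\n\n"
        f"{taxonomy.strip()}\n\n"
        "<conversation>\n"
        f"User: {user_message}\n"
        "</conversation>\n\n"
        "Provide your safety assessment. [/INST]"
    )
```

Tokenize the returned string directly (skipping `apply_chat_template`, which applies the default taxonomy) and pass it to `model.generate` as in the Quick Start.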

Deployment Architecture

```
User Input → LlamaGuard (Input Check) → LLM → LlamaGuard (Output Check) → User
                  |                                     |
             Block if unsafe                     Block if unsafe
             Return safe error                   Regenerate or filter
```
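The flow above can be sketched as a single guarded call. Here `guard` is an instance of the `LlamaGuardMiddleware` class from earlier, and `generate_reply` stands in for your base LLM call (both names are illustrative):

```python
def guarded_chat(guard, generate_reply, user_message):
    """Run input check, then the LLM, then output check, blocking at either gate."""
    ok, reason = guard.check_input(user_message)
    if not ok:
        return f"Request declined ({reason})."

    reply = generate_reply(user_message)

    ok, reason = guard.check_output(user_message, reply)
    if not ok:
        # Alternatives here: regenerate with a safety reminder, or filter the reply
        return f"Response withheld ({reason})."
    return reply
```

Keeping both gates in one function makes it easy to later swap in asynchronous checks without touching call sites.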

Configuration

| Parameter | Default | Description |
|---|---|---|
| `model_id` | `"meta-llama/LlamaGuard-7b"` | Model version |
| `max_new_tokens` | `100` | Classification output length |
| `torch_dtype` | `"auto"` | Precision (`float16`, `bfloat16`) |
| `device_map` | `"auto"` | GPU allocation |
| `taxonomy` | Default S1–S7 | Safety categories |
| `threshold` | Any unsafe | When to block |

Best Practices

  1. Check both inputs and outputs — users can craft adversarial inputs, and models can generate unsafe outputs
  2. Customize the taxonomy for your use case — default categories may not cover domain-specific risks
  3. Use quantized models (4-bit) in production to reduce latency and GPU requirements
  4. Run asynchronously — don't block the response pipeline; check in parallel where possible
  5. Log all classifications — build a dataset of edge cases for continuous improvement
  6. Combine with other safety layers — LlamaGuard + Constitutional AI + output validation

Common Issues

High latency from double classification: Quantize the model (GPTQ 4-bit). Run input and output checks asynchronously where possible. Use batched inference for multiple conversations.

False positives blocking safe content: Review the taxonomy categories — some may be too broad. Adjust the classification prompt to be less aggressive. Build a whitelist for commonly flagged but safe patterns.
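The whitelist can be a simple regex pre-check that short-circuits the guard for inputs you have verified as safe (a sketch; the patterns shown are illustrative and should come from your own logged false positives):

```python
import re

# Illustrative allowlist: messages matching these patterns skip LlamaGuard.
# Keep patterns narrow, and review them as the logged edge-case set grows.
SAFE_PATTERNS = [
    re.compile(r"^(hi|hello|thanks|thank you)\b", re.IGNORECASE),
    re.compile(r"\b(recipe|weather forecast|unit conversion)\b", re.IGNORECASE),
]

def is_allowlisted(message: str) -> bool:
    """Return True if the message matches a known-safe pattern."""
    return any(p.search(message) for p in SAFE_PATTERNS)
```

Call this before `check_input` and skip classification on a match; this cuts both false positives and latency for high-frequency benign traffic.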

Model not catching nuanced unsafe content: LlamaGuard works best with clear safety violations. For subtle cases, combine with human review. Update your custom taxonomy with specific examples of missed content.
