
Comprehensive Safety Module


Multi-layered AI safety system combining input validation, content moderation, output filtering, and monitoring: defense-in-depth for production LLM applications.

When to Use

Deploy this module when:

  • Building user-facing LLM applications that need comprehensive safety
  • Regulatory requirements demand content moderation (healthcare, finance, education)
  • You need defense-in-depth with multiple safety layers
  • You require audit trails for all safety-related decisions

Use simpler safety when:

  • Internal tools with trusted users → basic output validation
  • Low-risk applications → keyword filters
  • Prototype stage → defer safety to later

Quick Start

Multi-Layer Safety Pipeline

```python
from safety_module import (
    SafetyPipeline,
    InputFilter,
    ContentModerator,
    OutputValidator,
    AuditLogger,
)

pipeline = SafetyPipeline([
    # Layer 1: Input validation
    InputFilter(
        max_length=10000,
        block_patterns=["ignore previous instructions", "system prompt:"],
        sanitize_html=True,
    ),
    # Layer 2: Content classification
    ContentModerator(
        model="llamaguard",
        categories=["violence", "self_harm", "illegal", "sexual", "pii"],
        threshold="medium",
    ),
    # Layer 3: Output validation
    OutputValidator(
        check_pii=True,
        check_code_execution=True,
        check_urls=True,
        max_length=5000,
    ),
    # Layer 4: Audit logging
    AuditLogger(
        log_all=True,
        alert_on_block=True,
        retention_days=90,
    ),
])

# Use in your LLM application
def handle_message(user_input):
    # Check input
    input_result = pipeline.check_input(user_input)
    if not input_result.safe:
        return {"error": input_result.reason}

    # Generate response
    llm_response = llm.generate(user_input)

    # Check output
    output_result = pipeline.check_output(user_input, llm_response)
    if not output_result.safe:
        llm_response = output_result.filtered_response  # Sanitized version

    return llm_response

handle_message("How do I reset my password?")
```

PII Detection

```python
from safety_module import PIIDetector

detector = PIIDetector()

text = "Contact John Smith at [email protected] or call 555-0123"
result = detector.scan(text)
# result.pii_found = [
#     {"type": "name", "value": "John Smith", "position": [8, 18]},
#     {"type": "email", "value": "[email protected]", "position": [22, 38]},
#     {"type": "phone", "value": "555-0123", "position": [47, 55]},
# ]

# Redact PII
redacted = detector.redact(text)
# "Contact [NAME] at [EMAIL] or call [PHONE]"
```

Core Concepts

Defense-in-Depth Architecture

User Input
    ↓
┌───────────────────────┐
│ Layer 1: Input Filter │  Injection, length, format
└───────────┬───────────┘
            ↓
┌───────────────────────┐
│ Layer 2: Moderator    │  Content classification
└───────────┬───────────┘
            ↓
┌───────────────────────┐
│ Layer 3: LLM          │  Generate response
└───────────┬───────────┘
            ↓
┌───────────────────────┐
│ Layer 4: Output Check │  PII, code, URLs
└───────────┬───────────┘
            ↓
┌───────────────────────┐
│ Layer 5: Audit Log    │  Record all decisions
└───────────┬───────────┘
           ↓
     Safe Response

Safety Layers

| Layer | Purpose | Speed | Catches |
|---|---|---|---|
| Input Filter | Block malformed/injection inputs | < 1 ms | Prompt injection, oversized inputs |
| Content Moderator | Classify content safety | 50-200 ms | Violence, hate, self-harm |
| Output Validator | Check generated content | < 10 ms | PII leaks, code execution, bad URLs |
| Rate Limiter | Prevent abuse | < 1 ms | Automated attacks, scraping |
| Audit Logger | Record all decisions | Async | Post-hoc analysis, compliance |
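The rate-limiting layer can be sketched as a sliding window over recent request timestamps. This is an illustrative stand-alone implementation, not the module's own `RateLimiter` API (which is not shown in this document):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` per user."""

    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.requests = defaultdict(deque)  # user_id -> timestamps

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.requests[user_id]
        # Drop timestamps that have aged out of the window
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # over the limit: block this request
        q.append(now)
        return True
```

A sliding window avoids the burst-at-the-boundary problem of fixed-window counters, at the cost of storing one timestamp per recent request.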

Threat Model

| Threat | Layer | Mitigation |
|---|---|---|
| Prompt injection | Input Filter | Pattern matching, sanitization |
| Jailbreaking | Content Moderator | Safety classification |
| PII extraction | Output Validator | PII detection and redaction |
| Harmful content | Content Moderator + Output Validator | Multi-layer classification |
| Abuse/spam | Rate Limiter | Request throttling |
| Compliance violations | Audit Logger | Full decision trail |
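The pattern-matching mitigation for prompt injection can be sketched as a length check plus a list of blocked regexes. The patterns below are illustrative examples only; a production filter combines many more signals:

```python
import re

# Hypothetical injection patterns; real deployments maintain a much
# larger, regularly updated list.
BLOCK_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"system\s+prompt\s*:", re.IGNORECASE),
]

def check_injection(text, max_length=10000):
    """Return (safe, reason) for a candidate user input."""
    if len(text) > max_length:
        return False, "input exceeds maximum length"
    for pattern in BLOCK_PATTERNS:
        if pattern.search(text):
            return False, f"blocked pattern: {pattern.pattern}"
    return True, None
```

Pattern matching alone is easy to bypass with paraphrasing, which is why the moderator layer sits behind it.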

Configuration

| Parameter | Default | Description |
|---|---|---|
| input_max_length | 10000 | Maximum input characters |
| moderation_model | "llamaguard" | Classification model |
| moderation_threshold | "medium" | Sensitivity (low/medium/high) |
| pii_detection | True | Scan for personal information |
| pii_action | "redact" | What to do with PII (redact/block) |
| audit_retention | 90 | Days to retain audit logs |
| rate_limit | 100 | Requests per minute per user |
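One way to express the defaults above is a plain configuration dict with an override merge; the shape is illustrative, since the module's actual configuration API is not specified here:

```python
# Defaults taken from the table above
DEFAULT_CONFIG = {
    "input_max_length": 10000,
    "moderation_model": "llamaguard",
    "moderation_threshold": "medium",  # low / medium / high
    "pii_detection": True,
    "pii_action": "redact",            # or "block"
    "audit_retention": 90,             # days
    "rate_limit": 100,                 # requests per minute per user
}

def load_config(overrides=None):
    """Merge user overrides onto the defaults without mutating them."""
    config = dict(DEFAULT_CONFIG)
    config.update(overrides or {})
    return config
```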

Best Practices

  1. Layer your defenses: no single safety measure catches everything
  2. Check both input and output: attacks can bypass input filters and appear in generated content
  3. Use model-based moderation for nuanced content: keyword filters miss context-dependent safety issues
  4. Always detect PII: LLMs can inadvertently reveal personal information from training data
  5. Log everything: audit trails are essential for incident response and compliance
  6. Test with red-teaming: hire security researchers to find bypass vectors in your safety system

Common Issues

Safety checks add too much latency: Run input checks synchronously (they are fast); output checks can run in parallel with response streaming. Use quantized moderation models to cut classification time.
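One simple shape for this is a generator that streams chunks to the client immediately and runs the full-response check once streaming finishes, appending a retraction if the check fails. This is a sketch assuming `pipeline.check_output` and a streaming `llm.stream` interface as in the Quick Start; a real deployment would also check partial output incrementally:

```python
def stream_with_deferred_check(pipeline, llm, user_input):
    """Yield response chunks immediately; verify the full text at the end."""
    chunks = []
    for chunk in llm.stream(user_input):
        chunks.append(chunk)
        yield chunk  # deliver to the client without waiting for the check
    # Final safety check on the complete response
    result = pipeline.check_output(user_input, "".join(chunks))
    if not result.safe:
        yield "\n[response withheld by safety filter]"
```

The trade-off: the user may briefly see unsafe text before the retraction, so incremental per-chunk checks are preferable for high-risk content categories.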

Too many false positives: Lower the moderation threshold from "high" to "medium". Add domain-specific allowlists for your use case. Review blocked content to tune filters.
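A domain allowlist can be as simple as a set of known-safe phrases checked before the moderator runs; the phrases below are hypothetical examples for a support-desk domain:

```python
# Hypothetical known-safe phrases for this domain; inputs containing
# them skip model-based moderation entirely.
DOMAIN_ALLOWLIST = {
    "reset my password",
    "delete my account",
}

def should_moderate(text):
    """Return False when the input matches a known-safe phrase."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in DOMAIN_ALLOWLIST)
```

Allowlists trade a small bypass risk for fewer false positives, so keep them narrow and review them alongside the blocked-content log.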

PII detection misses custom formats: Add custom regex patterns for domain-specific PII (employee IDs, medical record numbers). Train a custom NER model for your data types.
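Custom regex patterns for domain-specific PII can be layered on top of the built-in detector. The formats below (employee IDs, medical record numbers) are made-up examples; substitute the actual formats used in your data:

```python
import re

# Hypothetical domain-specific PII formats
CUSTOM_PII_PATTERNS = {
    "employee_id": re.compile(r"\bEMP-\d{6}\b"),
    "medical_record": re.compile(r"\bMRN[ -]?\d{8}\b"),
}

def redact_custom_pii(text):
    """Replace each custom PII match with a bracketed type label."""
    for label, pattern in CUSTOM_PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```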
