Comprehensive Safety Module
Enterprise-grade skill for nvidia, runtime, safety, framework. Includes structured workflows, validation checks, and reusable patterns for ai research.
Comprehensive Safety Module
Multi-layered AI safety system combining input validation, content moderation, output filtering, and monitoring β providing defense-in-depth for production LLM applications.
When to Use
Deploy this module when:
- Building user-facing LLM applications that need comprehensive safety
- Regulatory requirements demand content moderation (healthcare, finance, education)
- Need defense-in-depth with multiple safety layers
- Require audit trails for all safety-related decisions
Use simpler safety when:
- Internal tools with trusted users β basic output validation
- Low-risk applications β keyword filters
- Prototype stage β defer safety to later
Quick Start
Multi-Layer Safety Pipeline
from safety_module import SafetyPipeline, InputFilter, ContentModerator, OutputValidator pipeline = SafetyPipeline([ # Layer 1: Input validation InputFilter( max_length=10000, block_patterns=["ignore previous instructions", "system prompt:"], sanitize_html=True, ), # Layer 2: Content classification ContentModerator( model="llamaguard", categories=["violence", "self_harm", "illegal", "sexual", "pii"], threshold="medium", ), # Layer 3: Output validation OutputValidator( check_pii=True, check_code_execution=True, check_urls=True, max_length=5000, ), # Layer 4: Audit logging AuditLogger( log_all=True, alert_on_block=True, retention_days=90, ), ]) # Use in your LLM application user_input = "How do I reset my password?" # Check input input_result = pipeline.check_input(user_input) if not input_result.safe: return {"error": input_result.reason} # Generate response llm_response = llm.generate(user_input) # Check output output_result = pipeline.check_output(user_input, llm_response) if not output_result.safe: llm_response = output_result.filtered_response # Sanitized version
PII Detection
from safety_module import PIIDetector detector = PIIDetector() text = "Contact John Smith at [email protected] or call 555-0123" result = detector.scan(text) # result.pii_found = [ # {"type": "name", "value": "John Smith", "position": [8, 18]}, # {"type": "email", "value": "[email protected]", "position": [22, 38]}, # {"type": "phone", "value": "555-0123", "position": [47, 55]} # ] # Redact PII redacted = detector.redact(text) # "Contact [NAME] at [EMAIL] or call [PHONE]"
Core Concepts
Defense-in-Depth Architecture
User Input
β
ββββββββββββββββββββββββ
β Layer 1: Input Filter β Injection, length, format
ββββββββββββ¬ββββββββββββ
β
ββββββββββββββββββββββββ
β Layer 2: Moderator β Content classification
ββββββββββββ¬ββββββββββββ
β
ββββββββββββββββββββββββ
β Layer 3: LLM β Generate response
ββββββββββββ¬ββββββββββββ
β
ββββββββββββββββββββββββ
β Layer 4: Output Checkβ PII, code, URLs
ββββββββββββ¬ββββββββββββ
β
ββββββββββββββββββββββββ
β Layer 5: Audit Log β Record all decisions
ββββββββββββ¬ββββββββββββ
β
Safe Response
Safety Layers
| Layer | Purpose | Speed | Catches |
|---|---|---|---|
| Input Filter | Block malformed/injection inputs | < 1ms | Prompt injection, oversized inputs |
| Content Moderator | Classify content safety | 50-200ms | Violence, hate, self-harm |
| Output Validator | Check generated content | < 10ms | PII leaks, code execution, bad URLs |
| Rate Limiter | Prevent abuse | < 1ms | Automated attacks, scraping |
| Audit Logger | Record all decisions | Async | Post-hoc analysis, compliance |
Threat Model
| Threat | Layer | Mitigation |
|---|---|---|
| Prompt injection | Input Filter | Pattern matching, sanitization |
| Jailbreaking | Content Moderator | Safety classification |
| PII extraction | Output Validator | PII detection and redaction |
| Harmful content | Content Moderator + Output | Multi-layer classification |
| Abuse/spam | Rate Limiter | Request throttling |
| Compliance violations | Audit Logger | Full decision trail |
Configuration
| Parameter | Default | Description |
|---|---|---|
input_max_length | 10000 | Maximum input characters |
moderation_model | "llamaguard" | Classification model |
moderation_threshold | "medium" | Sensitivity (low/medium/high) |
pii_detection | True | Scan for personal information |
pii_action | "redact" | What to do with PII (redact/block) |
audit_retention | 90 | Days to retain audit logs |
rate_limit | 100 | Requests per minute per user |
Best Practices
- Layer your defenses β no single safety measure catches everything
- Check both input and output β attacks can bypass input filters and appear in generated content
- Use model-based moderation for nuanced content β keyword filters miss context-dependent safety issues
- Always detect PII β LLMs can inadvertently reveal personal information from training data
- Log everything β audit trails are essential for incident response and compliance
- Test with red-teaming β hire security researchers to find bypass vectors in your safety system
Common Issues
Safety checks add too much latency: Run input checks synchronously (they're fast) but output checks can run in parallel with response streaming. Use quantized moderation models.
Too many false positives: Lower the moderation threshold from "high" to "medium". Add domain-specific allowlists for your use case. Review blocked content to tune filters.
PII detection misses custom formats: Add custom regex patterns for domain-specific PII (employee IDs, medical record numbers). Train a custom NER model for your data types.
Reviews
No reviews yet. Be the first to review this template!
Similar Templates
Full-Stack Code Reviewer
Comprehensive code review skill that checks for security vulnerabilities, performance issues, accessibility, and best practices across frontend and backend code.
Test Suite Generator
Generates comprehensive test suites with unit tests, integration tests, and edge cases. Supports Jest, Vitest, Pytest, and Go testing.
Pro Architecture Workspace
Battle-tested skill for architectural, decision, making, framework. Includes structured workflows, validation checks, and reusable patterns for development.