
Comprehensive Safety Module


Multi-layered AI safety system combining input validation, content moderation, output filtering, and monitoring: defense-in-depth for production LLM applications.

When to Use

Deploy this module when:

  • Building user-facing LLM applications that need comprehensive safety
  • Regulatory requirements demand content moderation (healthcare, finance, education)
  • You need defense-in-depth with multiple safety layers
  • You require audit trails for all safety-related decisions

Use simpler safety when:

  • Internal tools with trusted users → basic output validation
  • Low-risk applications → keyword filters
  • Prototype stage → defer safety to later

Quick Start

Multi-Layer Safety Pipeline

```python
from safety_module import (
    SafetyPipeline,
    InputFilter,
    ContentModerator,
    OutputValidator,
    AuditLogger,
)

pipeline = SafetyPipeline([
    # Layer 1: Input validation
    InputFilter(
        max_length=10000,
        block_patterns=["ignore previous instructions", "system prompt:"],
        sanitize_html=True,
    ),
    # Layer 2: Content classification
    ContentModerator(
        model="llamaguard",
        categories=["violence", "self_harm", "illegal", "sexual", "pii"],
        threshold="medium",
    ),
    # Layer 3: Output validation
    OutputValidator(
        check_pii=True,
        check_code_execution=True,
        check_urls=True,
        max_length=5000,
    ),
    # Layer 4: Audit logging
    AuditLogger(
        log_all=True,
        alert_on_block=True,
        retention_days=90,
    ),
])

# Use in your LLM application
def handle_message(user_input):
    # Check input
    input_result = pipeline.check_input(user_input)
    if not input_result.safe:
        return {"error": input_result.reason}

    # Generate response
    llm_response = llm.generate(user_input)

    # Check output
    output_result = pipeline.check_output(user_input, llm_response)
    if not output_result.safe:
        llm_response = output_result.filtered_response  # Sanitized version

    return llm_response

handle_message("How do I reset my password?")
```

PII Detection

```python
from safety_module import PIIDetector

detector = PIIDetector()

text = "Contact John Smith at [email protected] or call 555-0123"
result = detector.scan(text)
# result.pii_found = [
#     {"type": "name", "value": "John Smith", "position": [8, 18]},
#     {"type": "email", "value": "[email protected]", "position": [22, 38]},
#     {"type": "phone", "value": "555-0123", "position": [47, 55]},
# ]

# Redact PII
redacted = detector.redact(text)
# "Contact [NAME] at [EMAIL] or call [PHONE]"
```

Core Concepts

Defense-in-Depth Architecture

User Input
    ↓
┌───────────────────────┐
│ Layer 1: Input Filter │  Injection, length, format
└───────────┬───────────┘
            ↓
┌───────────────────────┐
│ Layer 2: Moderator    │  Content classification
└───────────┬───────────┘
            ↓
┌───────────────────────┐
│ Layer 3: LLM          │  Generate response
└───────────┬───────────┘
            ↓
┌───────────────────────┐
│ Layer 4: Output Check │  PII, code, URLs
└───────────┬───────────┘
            ↓
┌───────────────────────┐
│ Layer 5: Audit Log    │  Record all decisions
└───────────┬───────────┘
           ↓
     Safe Response

Safety Layers

| Layer | Purpose | Speed | Catches |
|---|---|---|---|
| Input Filter | Block malformed/injection inputs | < 1 ms | Prompt injection, oversized inputs |
| Content Moderator | Classify content safety | 50-200 ms | Violence, hate, self-harm |
| Output Validator | Check generated content | < 10 ms | PII leaks, code execution, bad URLs |
| Rate Limiter | Prevent abuse | < 1 ms | Automated attacks, scraping |
| Audit Logger | Record all decisions | Async | Post-hoc analysis, compliance |
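The rate-limiting layer can be sketched as a sliding window over recent request timestamps. This is an illustrative stand-alone implementation, not the module's own `RateLimiter` API (which is not shown in this document):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` per user."""

    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.requests = defaultdict(deque)  # user_id -> timestamps

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.requests[user_id]
        # Drop timestamps that have aged out of the window
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # over the limit: block this request
        q.append(now)
        return True
```

A sliding window avoids the burst-at-the-boundary problem of fixed-window counters, at the cost of storing one timestamp per recent request.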

Threat Model

| Threat | Layer | Mitigation |
|---|---|---|
| Prompt injection | Input Filter | Pattern matching, sanitization |
| Jailbreaking | Content Moderator | Safety classification |
| PII extraction | Output Validator | PII detection and redaction |
| Harmful content | Content Moderator + Output Validator | Multi-layer classification |
| Abuse/spam | Rate Limiter | Request throttling |
| Compliance violations | Audit Logger | Full decision trail |
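The pattern-matching mitigation for prompt injection can be sketched as a length check plus a list of blocked regexes. The patterns below are illustrative examples only; a production filter combines many more signals:

```python
import re

# Hypothetical injection patterns; real deployments maintain a much
# larger, regularly updated list.
BLOCK_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"system\s+prompt\s*:", re.IGNORECASE),
]

def check_injection(text, max_length=10000):
    """Return (safe, reason) for a candidate user input."""
    if len(text) > max_length:
        return False, "input exceeds maximum length"
    for pattern in BLOCK_PATTERNS:
        if pattern.search(text):
            return False, f"blocked pattern: {pattern.pattern}"
    return True, None
```

Pattern matching alone is easy to bypass with paraphrasing, which is why the moderator layer sits behind it.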

Configuration

| Parameter | Default | Description |
|---|---|---|
| input_max_length | 10000 | Maximum input characters |
| moderation_model | "llamaguard" | Classification model |
| moderation_threshold | "medium" | Sensitivity (low/medium/high) |
| pii_detection | True | Scan for personal information |
| pii_action | "redact" | What to do with PII (redact/block) |
| audit_retention | 90 | Days to retain audit logs |
| rate_limit | 100 | Requests per minute per user |
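One way to express the defaults above is a plain configuration dict with an override merge; the shape is illustrative, since the module's actual configuration API is not specified here:

```python
# Defaults taken from the table above
DEFAULT_CONFIG = {
    "input_max_length": 10000,
    "moderation_model": "llamaguard",
    "moderation_threshold": "medium",  # low / medium / high
    "pii_detection": True,
    "pii_action": "redact",            # or "block"
    "audit_retention": 90,             # days
    "rate_limit": 100,                 # requests per minute per user
}

def load_config(overrides=None):
    """Merge user overrides onto the defaults without mutating them."""
    config = dict(DEFAULT_CONFIG)
    config.update(overrides or {})
    return config
```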

Best Practices

  1. Layer your defenses: no single safety measure catches everything
  2. Check both input and output: attacks can bypass input filters and appear in generated content
  3. Use model-based moderation for nuanced content: keyword filters miss context-dependent safety issues
  4. Always detect PII: LLMs can inadvertently reveal personal information from training data
  5. Log everything: audit trails are essential for incident response and compliance
  6. Test with red-teaming: hire security researchers to find bypass vectors in your safety system

Common Issues

Safety checks add too much latency: Run input checks synchronously (they are fast); output checks can run in parallel with response streaming. Use quantized moderation models to cut classification time.
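One simple shape for this is a generator that streams chunks to the client immediately and runs the full-response check once streaming finishes, appending a retraction if the check fails. This is a sketch assuming `pipeline.check_output` and a streaming `llm.stream` interface as in the Quick Start; a real deployment would also check partial output incrementally:

```python
def stream_with_deferred_check(pipeline, llm, user_input):
    """Yield response chunks immediately; verify the full text at the end."""
    chunks = []
    for chunk in llm.stream(user_input):
        chunks.append(chunk)
        yield chunk  # deliver to the client without waiting for the check
    # Final safety check on the complete response
    result = pipeline.check_output(user_input, "".join(chunks))
    if not result.safe:
        yield "\n[response withheld by safety filter]"
```

The trade-off: the user may briefly see unsafe text before the retraction, so incremental per-chunk checks are preferable for high-risk content categories.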

Too many false positives: Lower the moderation threshold from "high" to "medium". Add domain-specific allowlists for your use case. Review blocked content to tune filters.
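A domain allowlist can be as simple as a set of known-safe phrases checked before the moderator runs; the phrases below are hypothetical examples for a support-desk domain:

```python
# Hypothetical known-safe phrases for this domain; inputs containing
# them skip model-based moderation entirely.
DOMAIN_ALLOWLIST = {
    "reset my password",
    "delete my account",
}

def should_moderate(text):
    """Return False when the input matches a known-safe phrase."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in DOMAIN_ALLOWLIST)
```

Allowlists trade a small bypass risk for fewer false positives, so keep them narrow and review them alongside the blocked-content log.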

PII detection misses custom formats: Add custom regex patterns for domain-specific PII (employee IDs, medical record numbers). Train a custom NER model for your data types.
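Custom regex patterns for domain-specific PII can be layered on top of the built-in detector. The formats below (employee IDs, medical record numbers) are made-up examples; substitute the actual formats used in your data:

```python
import re

# Hypothetical domain-specific PII formats
CUSTOM_PII_PATTERNS = {
    "employee_id": re.compile(r"\bEMP-\d{6}\b"),
    "medical_record": re.compile(r"\bMRN[ -]?\d{8}\b"),
}

def redact_custom_pii(text):
    """Replace each custom PII match with a bracketed type label."""
    for label, pattern in CUSTOM_PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```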
