# Setup Monitoring Observability Processor
Deploy a complete observability stack with metrics collection, centralized logging, distributed tracing, alerting, and dashboards using Prometheus, Grafana, ELK, or cloud-native solutions.
## When to Use This Command

Run this command when:
- You need to set up application metrics, infrastructure monitoring, and custom dashboards for your production services
- You want centralized logging with structured log formats, aggregation, and search capabilities
- You need distributed tracing across microservices to identify performance bottlenecks and request flow issues
- You want an alerting system with smart thresholds, escalation policies, and notification channels
- You are building an observability stack using Prometheus, Grafana, Jaeger, or cloud-native alternatives
## Quick Start

```yaml
# .claude/commands/setup-monitoring-observability-processor.yaml
name: Setup Monitoring Observability Processor
description: Deploy complete observability stack with metrics, logs, and traces
inputs:
  - name: focus
    description: "metrics, logging, tracing, alerting, or all"
    default: "all"
```

```bash
# Set up the complete observability stack
claude "setup-monitoring-observability --focus all"

# Set up metrics with Prometheus + Grafana
claude "setup-monitoring-observability --focus metrics --platform prometheus"

# Set up centralized logging
claude "setup-monitoring-observability --focus logging --platform elk"
```
Output:

```text
[detect] Application: Express.js with 12 API endpoints
[metrics] Prometheus exporter configured (/metrics)
[metrics] Grafana dashboard with 8 panels created
[logging] Winston structured logging configured
[logging] Log aggregation via Loki
[tracing] OpenTelemetry SDK integrated
[alerting] 5 alert rules configured (error rate, latency, CPU)
Done. Observability stack ready. Access Grafana at :3001
```
## Core Concepts
| Concept | Description |
|---|---|
| Metrics Collection | Prometheus-compatible metrics: request rate, error rate, latency percentiles, custom business KPIs |
| Centralized Logging | Structured JSON logs aggregated via Loki, ELK, or CloudWatch with search and filtering |
| Distributed Tracing | OpenTelemetry spans across service boundaries for request flow visualization and bottleneck detection |
| Smart Alerting | Threshold-based and anomaly-detection alerts with escalation policies and silence windows |
| Dashboards | Pre-built Grafana dashboards for RED metrics (Rate, Errors, Duration) and system health |
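The RED metrics mentioned in the table can be computed from a sliding window of request samples. A minimal sketch in plain JavaScript (the `redMetrics` helper and the sample shape are illustrative, not part of the generated stack; in practice prom-client histograms track durations and Prometheus computes percentiles server-side):

```javascript
// Compute Rate, Errors, and Duration (P95) from a window of request
// samples. Each sample: { status: httpStatusCode, durationMs: number }.
function redMetrics(samples, windowSeconds) {
  const errors = samples.filter((s) => s.status >= 500).length;
  const sorted = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  // Nearest-rank P95: element at index ceil(0.95 * n) - 1.
  const p95 = sorted.length
    ? sorted[Math.ceil(0.95 * sorted.length) - 1]
    : 0;
  return {
    rate: samples.length / windowSeconds, // requests per second
    errorRate: samples.length ? errors / samples.length : 0,
    p95DurationMs: p95,
  };
}
```

In a real deployment you would record each request into a prom-client `Histogram` and let Prometheus derive rates and quantiles with `rate()` and `histogram_quantile()`.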
Observability Stack:

```text
┌──────────────────────────────────────┐
│             Application              │
│  ┌─────────┬──────────┬───────────┐  │
│  │ Metrics │ Logging  │ Tracing   │  │
│  │ prom-   │ winston/ │ OpenTele- │  │
│  │ client  │ pino     │ metry SDK │  │
│  └────┬────┴────┬─────┴─────┬─────┘  │
└───────┼─────────┼───────────┼────────┘
        ▼         ▼           ▼
 ┌──────────┐ ┌─────────┐ ┌──────────┐
 │Prometheus│ │ Loki /  │ │ Jaeger / │
 │          │ │  ELK    │ │  Tempo   │
 └────┬─────┘ └────┬────┘ └────┬─────┘
      └────────────┼───────────┘
                   ▼
             ┌──────────┐
             │ Grafana  │
             │Dashboard │
             └──────────┘
```
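The stack in the diagram can be run locally with Docker Compose, along the lines of what the `docker: true` option is meant to generate. This is an illustrative sketch, not the command's actual output; image tags and port mappings are assumptions:

```yaml
# docker-compose.yml (illustrative)
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]
  tempo:
    image: grafana/tempo:latest
    ports: ["3200:3200"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3001:3000"] # matches "Access Grafana at :3001" above
    depends_on: [prometheus, loki, tempo]
```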
## Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `focus` | string | `"all"` | Component: metrics, logging, tracing, alerting, or all |
| `platform` | string | `"prometheus"` | Stack: prometheus (+ Grafana), elk, datadog, cloudwatch, or newrelic |
| `retention` | string | `"30d"` | Data retention period for metrics and logs |
| `alerts` | boolean | `true` | Configure alerting rules and notification channels |
| `docker` | boolean | `true` | Include Docker Compose files for the monitoring infrastructure |
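With `alerts: true`, rules like the five listed in the Quick Start output are generated. A hedged sketch of what an error-rate rule might look like in Prometheus alerting syntax (the metric name `http_requests_total` and the thresholds are conventional choices, not guaranteed by the command):

```yaml
# alert-rules.yml (illustrative)
groups:
  - name: service-slo
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for 5 minutes"
```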
## Best Practices
- Instrument RED metrics first -- Rate, Errors, and Duration cover 90% of monitoring needs. Start with these three metrics per service before adding custom business KPIs.
- Use structured logging from day one -- JSON-formatted logs are searchable and parseable. Switching from unstructured to structured logging in a running system is painful and error-prone.
- Set alert thresholds based on SLOs -- Define Service Level Objectives first (e.g., 99.9% uptime, P95 latency < 200ms), then create alerts that fire when SLO budgets are at risk.
- Add trace context to logs -- Include trace IDs in log entries so you can correlate logs with distributed traces for a complete picture of request handling.
- Review dashboards weekly -- Unused dashboards and stale alerts accumulate noise. Prune panels that nobody looks at and adjust alert thresholds that cause false positives.
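The structured-logging and trace-context practices above combine naturally. A minimal sketch in plain JavaScript (in a real setup winston or pino produces the JSON line, and the trace ID comes from the active OpenTelemetry span rather than a function parameter):

```javascript
// Emit one JSON log line per event, carrying the trace ID so log
// entries can be joined against distributed traces in Jaeger/Tempo.
function logEvent(level, message, traceId, fields = {}) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    traceId, // correlates this log line with a distributed trace
    ...fields,
  };
  return JSON.stringify(entry); // in production, write to stdout
}
```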
## Common Issues
- High cardinality metric labels -- Using user IDs or request paths as metric labels creates millions of time series and crashes Prometheus. Use bounded label values like status codes or endpoint groups.
- Log volume overwhelming storage -- Debug-level logging in production generates massive volumes. Set the log level to `info` in production and use `debug` only in development or when actively investigating issues.
- Trace sampling rate too high -- Tracing every request adds overhead and storage cost. Start with 10% sampling and increase it for specific services or error scenarios.
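The high-cardinality issue is usually addressed by normalizing label values before recording metrics. A sketch in plain JavaScript (the ID patterns are illustrative; frameworks like Express can supply the route template, e.g. `/users/:id`, directly):

```javascript
// Collapse unbounded path segments (numeric IDs, UUIDs) into a
// placeholder so the metric label set stays small and bounded.
function normalizeRoute(path) {
  return path
    .split("/")
    .map((seg) =>
      /^\d+$/.test(seg) ||
      /^[0-9a-f]{8}-[0-9a-f-]{27}$/i.test(seg)
        ? ":id"
        : seg
    )
    .join("/");
}
```

Labeling requests with `normalizeRoute(req.path)` keeps the time-series count proportional to the number of endpoints rather than the number of users.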