
Pro Metrics Workspace

A Railway-focused skill for monitoring application metrics, resource usage, and performance data across Railway services. Pro Metrics Workspace helps you track CPU, memory, network throughput, and request latency to identify bottlenecks and optimize resource allocation.

When to Use This Skill

Choose Pro Metrics Workspace when:

  • Monitoring resource consumption across Railway services
  • Investigating performance issues or unexpected resource spikes
  • Right-sizing service instances based on actual usage data
  • Setting up alerting thresholds for resource limits

Consider alternatives when:

  • You need application-level performance monitoring (use Datadog, New Relic)
  • You're tracking business metrics (use analytics tools)
  • You need distributed tracing (use Jaeger, Zipkin, or OpenTelemetry)

Quick Start

```
claude "Show me resource usage for my Railway services"
```

```bash
# Check current service status and resource usage
railway status

# View recent logs for performance indicators
railway logs --tail

# Check service metrics via Railway dashboard API
# Navigate to: railway.app/project/<id>/service/<id>/metrics
```
```javascript
// Custom metrics endpoint for application monitoring
app.get('/metrics', (req, res) => {
  // Guard against division by zero before the first request arrives
  const count = globalMetrics.requestCount || 1;
  const metrics = {
    uptime: process.uptime(),
    memory: process.memoryUsage(),
    cpu: process.cpuUsage(),
    requestCount: globalMetrics.requestCount,
    avgResponseTime: globalMetrics.totalResponseTime / count,
    activeConnections: globalMetrics.activeConnections,
    errorRate: globalMetrics.errorCount / count
  };
  res.json(metrics);
});
```

Core Concepts

Railway Resource Metrics

| Metric | Description | Healthy Range |
|--------|-------------|---------------|
| CPU Usage | Processing utilization | < 70% sustained |
| Memory (RSS) | Resident memory usage | < 80% of limit |
| Network In | Incoming traffic bytes | Varies by service |
| Network Out | Outgoing traffic bytes | Varies by service |
| Disk Usage | Persistent volume consumption | < 85% capacity |
| Request Latency | P50/P95/P99 response times | P95 < 500ms |
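As a starting point for tracking these from inside the application, here is a minimal sketch that approximates the memory metric using `process.memoryUsage()`. The `MEMORY_LIMIT_MB` environment variable and the 512 MB fallback are assumptions for illustration, not values Railway sets automatically:

```javascript
// Sketch: sample an app-side approximation of the RSS metric above.
// MEMORY_LIMIT_MB is an assumed env var, not a Railway-provided setting.
function sampleMetrics(limitMb = Number(process.env.MEMORY_LIMIT_MB) || 512) {
  const rssMb = process.memoryUsage().rss / (1024 * 1024);
  return {
    rssMb: Math.round(rssMb),
    memoryPercent: Math.round((rssMb / limitMb) * 100),
    uptimeSec: Math.round(process.uptime()),
  };
}
```

Compare `memoryPercent` against the 80% threshold from the table when deciding whether to alert.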

Application-Level Metrics

```javascript
// Middleware for tracking request metrics
const metricsMiddleware = (req, res, next) => {
  const start = Date.now();
  metrics.activeConnections++;

  res.on('finish', () => {
    const duration = Date.now() - start;
    metrics.requestCount++;
    metrics.totalResponseTime += duration;
    metrics.activeConnections--;

    if (res.statusCode >= 500) {
      metrics.errorCount++;
    }

    // Log slow requests
    if (duration > 1000) {
      console.warn(`Slow request: ${req.method} ${req.path} - ${duration}ms`);
    }
  });

  next();
};
```

Resource Scaling Guide

Scaling Decision Matrix

| Symptom | Metric | Action |
|---------|--------|--------|
| Slow responses | CPU > 80% | Increase CPU or add replicas |
| Out of memory crashes | Memory > 90% | Increase memory limit |
| Connection timeouts | Active conns > pool max | Increase pool or add replicas |
| Disk full errors | Disk > 90% | Increase volume or clean data |
| High error rate | 5xx > 1% | Check logs, scale, or fix code |
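The decision matrix can be encoded as a small advisor function. This is an illustrative sketch, and the field names (`cpuPercent`, `poolMax`, `errorRate5xx`, etc.) are assumptions about how you structure your own metrics sample:

```javascript
// Hypothetical advisor implementing the decision matrix above.
// Field names are illustrative; adapt them to your own metrics shape.
function scalingAdvice(m) {
  const advice = [];
  if (m.cpuPercent > 80) advice.push('Increase CPU or add replicas');
  if (m.memoryPercent > 90) advice.push('Increase memory limit');
  if (m.activeConnections > m.poolMax) advice.push('Increase pool or add replicas');
  if (m.diskPercent > 90) advice.push('Increase volume or clean data');
  if (m.errorRate5xx > 0.01) advice.push('Check logs, scale, or fix code');
  return advice;
}
```

Run it against each metrics sample and surface the returned recommendations in your alerting channel.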

Configuration

| Parameter | Description | Default |
|-----------|-------------|---------|
| metrics_endpoint | Custom metrics HTTP path | /metrics |
| collection_interval | How often to sample metrics | 60s |
| retention_period | How long metrics data is kept | 7 days |
| alert_cpu_threshold | CPU usage warning level | 80% |
| alert_memory_threshold | Memory usage warning level | 85% |
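In application code, these parameters might be wired up as a plain config object. The object below is a sketch mirroring the table's defaults; the key names and units are assumptions, not a Railway-provided setting:

```javascript
// Illustrative defaults mirroring the configuration table above.
const metricsConfig = {
  metricsEndpoint: '/metrics',      // metrics_endpoint
  collectionIntervalMs: 60 * 1000,  // collection_interval: 60s
  retentionPeriodDays: 7,           // retention_period: 7 days
  alertCpuThreshold: 0.80,          // alert_cpu_threshold: 80%
  alertMemoryThreshold: 0.85,       // alert_memory_threshold: 85%
};
```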

Best Practices

  1. Monitor memory trends, not just snapshots. A service using 60% memory looks fine, but if it's growing 5% per hour, you have a memory leak. Track memory over time and investigate upward trends before they cause OOM crashes.

  2. Set up a custom /metrics endpoint. Railway's built-in metrics cover infrastructure, but application metrics (request count, error rate, queue depth) require instrumentation in your code. Expose them via a dedicated endpoint.

  3. Use P95/P99 latency instead of averages. Average response time hides outliers. If your average is 100ms but P99 is 5 seconds, 1% of users are having a terrible experience. Track percentiles to understand the full latency distribution.

  4. Right-size resources after collecting baseline data. Don't guess CPU and memory limits — deploy with generous limits, monitor actual usage for a week, then set limits at 1.5x the observed peak. This prevents waste without risking OOM kills.

  5. Correlate metrics with deployment events. When metrics change suddenly, check if a deployment happened around the same time. Track deployment timestamps alongside metrics to quickly identify whether code changes caused performance regressions.
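The percentile tracking from practice 3 can be sketched as a small helper using the nearest-rank method (illustrative, not part of any library; production systems typically use a histogram or streaming estimator instead of sorting every sample):

```javascript
// Compute a latency percentile from recorded durations (nearest-rank method).
function percentile(durationsMs, p) {
  if (durationsMs.length === 0) return 0;
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(rank, sorted.length) - 1];
}
```

With nine 100ms requests and one 5000ms outlier, the average is 590ms, P50 is 100ms, and P99 is 5000ms, which is exactly the distinction practice 3 is about.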

Common Issues

Memory usage grows until the service crashes. This indicates a memory leak — common causes include unclosed database connections, growing caches without eviction, and event listener accumulation. Use the --inspect flag with Node.js to capture heap snapshots and identify leaking objects.

CPU spikes during specific time periods. Check for cron jobs, scheduled tasks, or traffic patterns that coincide with the spikes. If a background job causes CPU contention with request handling, move it to a separate Railway service with its own resources.

Metrics show healthy resources but users report slowness. The bottleneck may be outside your Railway service — DNS resolution, external API calls, or database queries. Add timing instrumentation to external calls and check for slow queries in your database logs.
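The timing instrumentation mentioned above can be a thin async wrapper around any external call. This is a sketch; the `timed` name and the 500ms threshold are arbitrary choices, not an established API:

```javascript
// Wrap any async external call, measure its duration, and flag slow dependencies.
async function timed(label, fn, slowMs = 500) {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    const ms = Date.now() - start;
    if (ms > slowMs) {
      console.warn(`Slow external call: ${label} took ${ms}ms`);
    }
  }
}

// Usage (assumes Node 18+ for global fetch):
// const res = await timed('user-api', () => fetch('https://api.example.com/users/1'));
```

Because the measurement lives in a `finally` block, failures are timed too, which helps spot external calls that are slow *and* erroring.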
