Comprehensive Datadog CLI

A skill for searching and querying Datadog logs, with structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · AI research · v1.0.0 · MIT

Overview

The Datadog CLI is a command-line tool designed specifically for AI agents and developers to debug, triage, and monitor production systems using Datadog's observability platform. It provides direct access to Datadog logs, metrics, traces, and dashboards from the terminal, making it ideal for integration with Claude Code and other AI-assisted development workflows. Instead of navigating the Datadog web UI, you can search logs, tail streams, correlate traces, compare error rates between time periods, and manage dashboards entirely from the command line.

The CLI is particularly powerful for incident triage workflows. When an alert fires at 3 AM, the Datadog CLI lets you quickly get an error summary, identify whether the issue is new by comparing to the previous period, find error patterns, drill into specific services, get context around suspicious timestamps, and follow distributed traces -- all from a single terminal session. This is the workflow that AI agents excel at automating, and the Datadog CLI provides the data access layer they need.

The tool is installed and run via npx, requiring no global installation. It needs a Datadog API key and Application key for authentication, and supports all Datadog regional sites (US, EU, US3, US5, AP1).
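The credential requirements above can be expressed as a small pre-flight check at the top of any automation script, so a missing key surfaces as a configuration error rather than a 401. This is a sketch: `dd_preflight` is an illustrative helper name, but `DD_API_KEY`, `DD_APP_KEY`, and `DD_SITE` are the CLI's documented environment variables.

```shell
# Pre-flight check: fail fast with a clear message if credentials are missing.
# dd_preflight is a hypothetical helper; the variables are the documented ones.
dd_preflight() {
  : "${DD_API_KEY:?Set DD_API_KEY (Datadog organization settings > API keys)}"
  : "${DD_APP_KEY:?Set DD_APP_KEY (an API key alone is not sufficient)}"
  # Default to the US1 site; override for EU/US3/US5/AP1 accounts.
  export DD_SITE="${DD_SITE:-datadoghq.com}"
}
```

Call `dd_preflight` once before the first CLI invocation; the `${VAR:?message}` expansion aborts the script with the message if the variable is unset or empty.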

When to Use

  • Triaging production incidents by searching error logs and correlating traces
  • Monitoring deployment health by comparing error rates before and after releases
  • Investigating performance regressions through metric queries and log analysis
  • Automating on-call workflows by scripting log searches and error summaries
  • Building AI-powered debugging assistants that use Datadog data for root cause analysis
  • Comparing system behavior between time periods to identify anomalies
  • Aggregating logs by facet to identify the most impacted services, endpoints, or users
  • Streaming real-time logs during deployments or incident response
  • Managing Datadog dashboards programmatically (create, update, delete)

Quick Start

```shell
# Set required environment variables
export DD_API_KEY="your-datadog-api-key"
export DD_APP_KEY="your-datadog-application-key"
# Get keys from: https://app.datadoghq.com/organization-settings/api-keys

# Quick error check -- what's broken right now?
npx @leoflores/datadog-cli errors --from 1h --pretty

# Search for specific error logs
npx @leoflores/datadog-cli logs search \
  --query "status:error service:api-gateway" \
  --from 1h \
  --pretty

# Tail logs in real-time (useful during deployments)
npx @leoflores/datadog-cli logs tail \
  --query "service:api-gateway" \
  --pretty

# For non-US Datadog sites, specify the site
npx @leoflores/datadog-cli logs search \
  --query "status:error" \
  --from 1h \
  --site datadoghq.eu \
  --pretty
```

Core Concepts

Log Search and Filtering

The log search command is the primary way to find relevant logs. It supports Datadog's full query syntax including facets, tags, and boolean operators.

```shell
# ── Basic Search ─────────────────────────────────────────────

# Search by status
npx @leoflores/datadog-cli logs search \
  --query "status:error" \
  --from 1h --pretty

# Search by service and status
npx @leoflores/datadog-cli logs search \
  --query "service:payment-service status:error" \
  --from 6h --pretty

# Search by message content
npx @leoflores/datadog-cli logs search \
  --query "service:api \"connection refused\"" \
  --from 24h --pretty

# Combine with boolean operators
npx @leoflores/datadog-cli logs search \
  --query "(service:api OR service:worker) status:error -env:staging" \
  --from 1h --pretty

# ── Facet-Based Filtering ────────────────────────────────────

# Filter by custom facets
npx @leoflores/datadog-cli logs search \
  --query "@http.status_code:>=500 service:api" \
  --from 1h --pretty

# Filter by environment
npx @leoflores/datadog-cli logs search \
  --query "status:error env:production @region:us-east-1" \
  --from 1h --pretty

# ── Export Results ───────────────────────────────────────────

# Save results to JSON for further analysis
npx @leoflores/datadog-cli logs search \
  --query "status:error service:api" \
  --from 24h \
  --output errors.json
```

Real-Time Log Streaming

```shell
# Tail all error logs across all services
npx @leoflores/datadog-cli logs tail \
  --query "status:error" \
  --pretty

# Tail a specific service during deployment
npx @leoflores/datadog-cli logs tail \
  --query "service:api-gateway env:production" \
  --pretty

# Tail with a specific filter for debugging
npx @leoflores/datadog-cli logs tail \
  --query "service:auth @user.id:12345" \
  --pretty
```

Distributed Trace Correlation

When debugging microservice issues, tracing a request across services is essential.

```shell
# Find all logs for a specific trace
npx @leoflores/datadog-cli logs trace \
  --id "abc123def456789" \
  --pretty

# Get logs around a specific timestamp (context window)
npx @leoflores/datadog-cli logs context \
  --timestamp "2026-03-13T10:30:00Z" \
  --service api-gateway \
  --pretty
```

Log Patterns and Aggregation

```shell
# Find common error patterns (groups similar messages)
npx @leoflores/datadog-cli logs patterns \
  --query "status:error service:api" \
  --from 6h \
  --pretty

# Aggregate logs by a facet
npx @leoflores/datadog-cli logs agg \
  --query "status:error" \
  --facet "service" \
  --from 1h \
  --pretty

# Compare error counts between time periods
# Shows: current period errors vs previous period errors
npx @leoflores/datadog-cli logs compare \
  --query "status:error" \
  --period 1h \
  --pretty
```

Metrics Queries

```shell
# Query CPU usage across all hosts
npx @leoflores/datadog-cli metrics query \
  --query "avg:system.cpu.user{*}" \
  --from 1h \
  --pretty

# Query by specific service/host
npx @leoflores/datadog-cli metrics query \
  --query "avg:trace.http.request.duration{service:api}" \
  --from 6h \
  --pretty

# Custom metrics with grouping
npx @leoflores/datadog-cli metrics query \
  --query "sum:custom.orders.count{*} by {region}" \
  --from 24h \
  --pretty
```

Implementation Patterns

Incident Triage Workflow

The most common use case is triaging a production incident. This workflow progressively narrows down the root cause.

```shell
#!/bin/bash
# incident-triage.sh - Automated incident investigation script

SERVICE="${1:-api-gateway}"
TIMEFRAME="${2:-1h}"

echo "=== Step 1: Error Overview ==="
npx @leoflores/datadog-cli errors --from "$TIMEFRAME" --pretty
echo ""

echo "=== Step 2: Is this new? Compare to previous period ==="
npx @leoflores/datadog-cli logs compare \
  --query "status:error service:$SERVICE" \
  --period "$TIMEFRAME" \
  --pretty
echo ""

echo "=== Step 3: Error patterns ==="
npx @leoflores/datadog-cli logs patterns \
  --query "status:error service:$SERVICE" \
  --from "$TIMEFRAME" \
  --pretty
echo ""

echo "=== Step 4: Error breakdown by endpoint ==="
npx @leoflores/datadog-cli logs agg \
  --query "status:error service:$SERVICE" \
  --facet "@http.url_details.path" \
  --from "$TIMEFRAME" \
  --pretty
echo ""

echo "=== Step 5: Recent error logs ==="
npx @leoflores/datadog-cli logs search \
  --query "status:error service:$SERVICE" \
  --from "$TIMEFRAME" \
  --pretty
```

Deployment Monitoring

```shell
# Compare error rates before and during deployment
npx @leoflores/datadog-cli logs compare \
  --query "status:error service:$SERVICE" \
  --period 15m \
  --pretty

# Check latency and error rate metrics
npx @leoflores/datadog-cli metrics query \
  --query "avg:trace.http.request.duration{service:$SERVICE}" \
  --from 30m \
  --pretty
```
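One way to keep this comparison running through a rollout is a small polling loop. This is a sketch under stated assumptions: `watch_deploy_errors` and its defaults are illustrative, and interpreting the compare output is still left to the operator.

```shell
# Hypothetical polling loop for use during a rollout: re-run the documented
# `logs compare` command on an interval and print each result.
watch_deploy_errors() {
  local service="${1:-api-gateway}"
  local checks="${2:-6}"      # number of comparison runs
  local interval="${3:-300}"  # seconds between runs
  local i
  for i in $(seq "$checks"); do
    echo "--- check $i/$checks ---"
    npx @leoflores/datadog-cli logs compare \
      --query "status:error service:$service" \
      --period 15m \
      --pretty
    sleep "$interval"
  done
}
```

For example, `watch_deploy_errors payments 4 120` checks the payments service four times at two-minute intervals while a deploy rolls out.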

Multi-Service and Dashboard Management

```shell
# Parallel queries across multiple services
npx @leoflores/datadog-cli logs multi \
  --queries "status:error service:api" "status:error service:auth" "status:error service:payments" \
  --from 1h --pretty

# List services and dashboards
npx @leoflores/datadog-cli services --from 1h --pretty
npx @leoflores/datadog-cli dashboards list --pretty
```

Configuration Reference

Environment Variables

| Variable | Required | Description |
|---|---|---|
| `DD_API_KEY` | Yes | Datadog API key for authentication |
| `DD_APP_KEY` | Yes | Datadog Application key for data access |
| `DD_SITE` | No | Datadog site (default: datadoghq.com) |

Global Flags

| Flag | Description |
|---|---|
| `--pretty` | Human-readable output with color formatting |
| `--output <file>` | Export results to a JSON file |
| `--site <site>` | Override Datadog site (e.g., datadoghq.eu, us3.datadoghq.com) |

Time Format Reference

| Format | Example | Description |
|---|---|---|
| Minutes | `30m` | Last 30 minutes |
| Hours | `6h` | Last 6 hours |
| Days | `7d` | Last 7 days |
| ISO 8601 | `2026-03-13T10:30:00Z` | Specific timestamp |

Commands Reference

| Command | Description |
|---|---|
| `logs search` | Search logs with query filters and time range |
| `logs tail` | Stream logs in real-time (like `tail -f`) |
| `logs trace` | Find all logs associated with a distributed trace ID |
| `logs context` | Get surrounding logs for a specific timestamp and service |
| `logs patterns` | Group similar log messages to identify trends |
| `logs compare` | Compare log counts between current and previous time period |
| `logs multi` | Execute multiple log queries in parallel |
| `logs agg` | Aggregate logs by a facet (service, endpoint, etc.) |
| `metrics query` | Query timeseries metrics with Datadog metric syntax |
| `errors` | Quick error summary grouped by service and error type |
| `services` | List services with recent log activity |
| `dashboards` | CRUD operations on Datadog dashboards |
| `dashboard-lists` | Manage dashboard organizational lists |

Best Practices

  1. Always start incident investigation with the errors command. It gives you a quick overview of what is broken without needing to know the specific service or query syntax. From there, narrow down to specific services and queries.

  2. Use logs compare before declaring an incident. A spike in errors might be normal if it matches the previous period's pattern. The compare command shows whether the current error rate is genuinely anomalous or within normal variance.

  3. Use --output flag to save investigation results. During incidents, export key findings to JSON files. These become part of your incident postmortem documentation and can be shared with team members who were not on-call.

  4. Combine logs patterns with logs agg for root cause analysis. Patterns show you what types of errors are occurring. Aggregation by facet (service, endpoint, region) shows you where they are occurring. Together, they quickly narrow down the root cause.

  5. Set up shell aliases for common queries. Create bash aliases or functions for your most frequent queries. For example, alias dd-errors='npx @leoflores/datadog-cli errors --from 1h --pretty' saves time during incident response.

  6. Use logs context when you have a specific timestamp of interest. If a user reports an issue at a specific time, logs context shows you all logs from that service around that timestamp, including the logs immediately before the error that often reveal the cause.

  7. Script multi-step investigations for recurring incident types. If you frequently triage the same category of incidents (e.g., high latency on the API gateway), create a bash script that runs the full investigation workflow with a single command.

  8. Use logs trace for microservice debugging. When a request touches multiple services, the trace ID is the thread that connects all the logs. Finding the trace ID in one service's error log and then running logs trace reveals the full request path and where it failed.

  9. Query metrics alongside logs for correlated analysis. High error rates often correlate with CPU spikes, memory pressure, or queue depth. Use metrics query to check system metrics alongside your log investigation.

  10. Keep your Datadog API key and App key in a secure secrets manager. Never hardcode keys in scripts. Use environment variables loaded from a secrets manager or a .env file that is in your .gitignore.
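The alias advice in item 5 can be taken a step further with small wrapper functions for `~/.bashrc`, which compose better than aliases in scripts. A minimal sketch; the names `ddcli`, `dd_errors`, and `dd_svc` are illustrative, not commands shipped by the CLI:

```shell
# Wrapper functions for ~/.bashrc; the function names are hypothetical,
# but the commands and flags they wrap are the documented ones.
ddcli() { npx @leoflores/datadog-cli "$@"; }

# Quick error summary: dd_errors [timeframe]
dd_errors() { ddcli errors --from "${1:-1h}" --pretty; }

# Error logs for one service: dd_svc <service> [timeframe]
dd_svc() { ddcli logs search --query "status:error service:$1" --from "${2:-1h}" --pretty; }
```

With these loaded, `dd_errors` answers "what's broken in the last hour?" and `dd_svc payments 6h` drills into one service, without retyping the full `npx` invocation.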

Troubleshooting

Problem: Authentication errors (401/403). Verify that both DD_API_KEY and DD_APP_KEY are set correctly. The API key alone is not sufficient -- you also need an Application key. Check that the keys have not been rotated or revoked in the Datadog organization settings. Also verify you are using the correct --site flag if your Datadog account is not on the US1 site.

Problem: No logs returned for a query that should have results. Check your time range. The default --from might be too narrow. Also verify the query syntax -- Datadog uses specific facet syntax like service:api (no spaces around the colon) and status:error. Check that the service name matches exactly what Datadog shows in the UI. Tag and facet names are case-sensitive.

Problem: npx hangs or is very slow to start. The first run downloads the package, which can be slow. On subsequent runs, npx uses the cached version. If it is consistently slow, install globally with npm install -g @leoflores/datadog-cli and run directly as datadog-cli instead of through npx.

Problem: Too many results making output hard to read. Use faceted queries to narrow results: add service:, env:, @http.status_code: filters. Use logs patterns instead of logs search to group similar messages. Use logs agg to get counts by facet instead of raw log entries. Add the --output flag to export to a file for offline analysis.

Problem: Metrics query returns empty results. Verify the metric name is correct. Datadog metric names use dots as separators (e.g., system.cpu.user, trace.http.request.duration). Use the Datadog web UI to browse available metrics and copy the exact metric name. Also check that the {*} scope matches tags that exist -- {service:api} will return nothing if the metric is not tagged with that service.

Problem: Non-US Datadog site not working. You must specify the --site flag for every command, or set the DD_SITE environment variable. Common site values are: datadoghq.com (US1, default), datadoghq.eu (EU1), us3.datadoghq.com (US3), us5.datadoghq.com (US5), ap1.datadoghq.com (AP1).
