Comprehensive Datadog CLI

A skill for searching and querying Datadog logs, with structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · AI research · v1.0.0 · MIT

Overview

The Datadog CLI is a command-line tool designed specifically for AI agents and developers to debug, triage, and monitor production systems using Datadog's observability platform. It provides direct access to Datadog logs, metrics, traces, and dashboards from the terminal, making it ideal for integration with Claude Code and other AI-assisted development workflows. Instead of navigating the Datadog web UI, you can search logs, tail streams, correlate traces, compare error rates between time periods, and manage dashboards entirely from the command line.

The CLI is particularly powerful for incident triage workflows. When an alert fires at 3 AM, the Datadog CLI lets you quickly get an error summary, identify whether the issue is new by comparing to the previous period, find error patterns, drill into specific services, get context around suspicious timestamps, and follow distributed traces -- all from a single terminal session. This is the workflow that AI agents excel at automating, and the Datadog CLI provides the data access layer they need.

The tool is installed and run via npx, requiring no global installation. It needs a Datadog API key and Application key for authentication, and supports all Datadog regional sites (US, EU, US3, US5, AP1).
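The credential requirements above can be expressed as a small pre-flight check at the top of any automation script, so a missing key surfaces as a configuration error rather than a 401. This is a sketch: `dd_preflight` is an illustrative helper name, but `DD_API_KEY`, `DD_APP_KEY`, and `DD_SITE` are the CLI's documented environment variables.

```shell
# Pre-flight check: fail fast with a clear message if credentials are missing.
# dd_preflight is a hypothetical helper; the variables are the documented ones.
dd_preflight() {
  : "${DD_API_KEY:?Set DD_API_KEY (Datadog organization settings > API keys)}"
  : "${DD_APP_KEY:?Set DD_APP_KEY (an API key alone is not sufficient)}"
  # Default to the US1 site; override for EU/US3/US5/AP1 accounts.
  export DD_SITE="${DD_SITE:-datadoghq.com}"
}
```

Call `dd_preflight` once before the first CLI invocation; the `${VAR:?message}` expansion aborts the script with the message if the variable is unset or empty.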

When to Use

  • Triaging production incidents by searching error logs and correlating traces
  • Monitoring deployment health by comparing error rates before and after releases
  • Investigating performance regressions through metric queries and log analysis
  • Automating on-call workflows by scripting log searches and error summaries
  • Building AI-powered debugging assistants that use Datadog data for root cause analysis
  • Comparing system behavior between time periods to identify anomalies
  • Aggregating logs by facet to identify the most impacted services, endpoints, or users
  • Streaming real-time logs during deployments or incident response
  • Managing Datadog dashboards programmatically (create, update, delete)

Quick Start

```shell
# Set required environment variables
export DD_API_KEY="your-datadog-api-key"
export DD_APP_KEY="your-datadog-application-key"
# Get keys from: https://app.datadoghq.com/organization-settings/api-keys

# Quick error check -- what's broken right now?
npx @leoflores/datadog-cli errors --from 1h --pretty

# Search for specific error logs
npx @leoflores/datadog-cli logs search \
  --query "status:error service:api-gateway" \
  --from 1h \
  --pretty

# Tail logs in real-time (useful during deployments)
npx @leoflores/datadog-cli logs tail \
  --query "service:api-gateway" \
  --pretty

# For non-US Datadog sites, specify the site
npx @leoflores/datadog-cli logs search \
  --query "status:error" \
  --from 1h \
  --site datadoghq.eu \
  --pretty
```

Core Concepts

Log Search and Filtering

The log search command is the primary way to find relevant logs. It supports Datadog's full query syntax including facets, tags, and boolean operators.

```shell
# ── Basic Search ─────────────────────────────────────────────

# Search by status
npx @leoflores/datadog-cli logs search \
  --query "status:error" \
  --from 1h --pretty

# Search by service and status
npx @leoflores/datadog-cli logs search \
  --query "service:payment-service status:error" \
  --from 6h --pretty

# Search by message content
npx @leoflores/datadog-cli logs search \
  --query "service:api \"connection refused\"" \
  --from 24h --pretty

# Combine with boolean operators
npx @leoflores/datadog-cli logs search \
  --query "(service:api OR service:worker) status:error -env:staging" \
  --from 1h --pretty

# ── Facet-Based Filtering ────────────────────────────────────

# Filter by custom facets
npx @leoflores/datadog-cli logs search \
  --query "@http.status_code:>=500 service:api" \
  --from 1h --pretty

# Filter by environment
npx @leoflores/datadog-cli logs search \
  --query "status:error env:production @region:us-east-1" \
  --from 1h --pretty

# ── Export Results ───────────────────────────────────────────

# Save results to JSON for further analysis
npx @leoflores/datadog-cli logs search \
  --query "status:error service:api" \
  --from 24h \
  --output errors.json
```

Real-Time Log Streaming

```shell
# Tail all error logs across all services
npx @leoflores/datadog-cli logs tail \
  --query "status:error" \
  --pretty

# Tail a specific service during deployment
npx @leoflores/datadog-cli logs tail \
  --query "service:api-gateway env:production" \
  --pretty

# Tail with a specific filter for debugging
npx @leoflores/datadog-cli logs tail \
  --query "service:auth @user.id:12345" \
  --pretty
```

Distributed Trace Correlation

When debugging microservice issues, tracing a request across services is essential.

```shell
# Find all logs for a specific trace
npx @leoflores/datadog-cli logs trace \
  --id "abc123def456789" \
  --pretty

# Get logs around a specific timestamp (context window)
npx @leoflores/datadog-cli logs context \
  --timestamp "2026-03-13T10:30:00Z" \
  --service api-gateway \
  --pretty
```

Log Patterns and Aggregation

```shell
# Find common error patterns (groups similar messages)
npx @leoflores/datadog-cli logs patterns \
  --query "status:error service:api" \
  --from 6h \
  --pretty

# Aggregate logs by a facet
npx @leoflores/datadog-cli logs agg \
  --query "status:error" \
  --facet "service" \
  --from 1h \
  --pretty

# Compare error counts between time periods
# Shows: current period errors vs previous period errors
npx @leoflores/datadog-cli logs compare \
  --query "status:error" \
  --period 1h \
  --pretty
```

Metrics Queries

```shell
# Query CPU usage across all hosts
npx @leoflores/datadog-cli metrics query \
  --query "avg:system.cpu.user{*}" \
  --from 1h \
  --pretty

# Query by specific service/host
npx @leoflores/datadog-cli metrics query \
  --query "avg:trace.http.request.duration{service:api}" \
  --from 6h \
  --pretty

# Custom metrics with grouping
npx @leoflores/datadog-cli metrics query \
  --query "sum:custom.orders.count{*} by {region}" \
  --from 24h \
  --pretty
```

Implementation Patterns

Incident Triage Workflow

The most common use case is triaging a production incident. This workflow progressively narrows down the root cause.

```shell
#!/bin/bash
# incident-triage.sh - Automated incident investigation script

SERVICE="${1:-api-gateway}"
TIMEFRAME="${2:-1h}"

echo "=== Step 1: Error Overview ==="
npx @leoflores/datadog-cli errors --from "$TIMEFRAME" --pretty
echo ""

echo "=== Step 2: Is this new? Compare to previous period ==="
npx @leoflores/datadog-cli logs compare \
  --query "status:error service:$SERVICE" \
  --period "$TIMEFRAME" \
  --pretty
echo ""

echo "=== Step 3: Error patterns ==="
npx @leoflores/datadog-cli logs patterns \
  --query "status:error service:$SERVICE" \
  --from "$TIMEFRAME" \
  --pretty
echo ""

echo "=== Step 4: Error breakdown by endpoint ==="
npx @leoflores/datadog-cli logs agg \
  --query "status:error service:$SERVICE" \
  --facet "@http.url_details.path" \
  --from "$TIMEFRAME" \
  --pretty
echo ""

echo "=== Step 5: Recent error logs ==="
npx @leoflores/datadog-cli logs search \
  --query "status:error service:$SERVICE" \
  --from "$TIMEFRAME" \
  --pretty
```

Deployment Monitoring

```shell
# Compare error rates before and during deployment
npx @leoflores/datadog-cli logs compare \
  --query "status:error service:$SERVICE" \
  --period 15m \
  --pretty

# Check latency and error rate metrics
npx @leoflores/datadog-cli metrics query \
  --query "avg:trace.http.request.duration{service:$SERVICE}" \
  --from 30m \
  --pretty
```
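One way to keep this comparison running through a rollout is a small polling loop. This is a sketch under stated assumptions: `watch_deploy_errors` and its defaults are illustrative, and interpreting the compare output is still left to the operator.

```shell
# Hypothetical polling loop for use during a rollout: re-run the documented
# `logs compare` command on an interval and print each result.
watch_deploy_errors() {
  local service="${1:-api-gateway}"
  local checks="${2:-6}"      # number of comparison runs
  local interval="${3:-300}"  # seconds between runs
  local i
  for i in $(seq "$checks"); do
    echo "--- check $i/$checks ---"
    npx @leoflores/datadog-cli logs compare \
      --query "status:error service:$service" \
      --period 15m \
      --pretty
    sleep "$interval"
  done
}
```

For example, `watch_deploy_errors payments 4 120` checks the payments service four times at two-minute intervals while a deploy rolls out.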

Multi-Service and Dashboard Management

```shell
# Parallel queries across multiple services
npx @leoflores/datadog-cli logs multi \
  --queries "status:error service:api" "status:error service:auth" "status:error service:payments" \
  --from 1h --pretty

# List services and dashboards
npx @leoflores/datadog-cli services --from 1h --pretty
npx @leoflores/datadog-cli dashboards list --pretty
```

Configuration Reference

Environment Variables

| Variable | Required | Description |
|---|---|---|
| `DD_API_KEY` | Yes | Datadog API key for authentication |
| `DD_APP_KEY` | Yes | Datadog Application key for data access |
| `DD_SITE` | No | Datadog site (default: datadoghq.com) |

Global Flags

| Flag | Description |
|---|---|
| `--pretty` | Human-readable output with color formatting |
| `--output <file>` | Export results to a JSON file |
| `--site <site>` | Override Datadog site (e.g., datadoghq.eu, us3.datadoghq.com) |

Time Format Reference

| Format | Example | Description |
|---|---|---|
| Minutes | `30m` | Last 30 minutes |
| Hours | `6h` | Last 6 hours |
| Days | `7d` | Last 7 days |
| ISO 8601 | `2026-03-13T10:30:00Z` | Specific timestamp |

Commands Reference

| Command | Description |
|---|---|
| `logs search` | Search logs with query filters and time range |
| `logs tail` | Stream logs in real-time (like `tail -f`) |
| `logs trace` | Find all logs associated with a distributed trace ID |
| `logs context` | Get surrounding logs for a specific timestamp and service |
| `logs patterns` | Group similar log messages to identify trends |
| `logs compare` | Compare log counts between current and previous time period |
| `logs multi` | Execute multiple log queries in parallel |
| `logs agg` | Aggregate logs by a facet (service, endpoint, etc.) |
| `metrics query` | Query timeseries metrics with Datadog metric syntax |
| `errors` | Quick error summary grouped by service and error type |
| `services` | List services with recent log activity |
| `dashboards` | CRUD operations on Datadog dashboards |
| `dashboard-lists` | Manage dashboard organizational lists |

Best Practices

  1. Always start incident investigation with the errors command. It gives you a quick overview of what is broken without needing to know the specific service or query syntax. From there, narrow down to specific services and queries.

  2. Use logs compare before declaring an incident. A spike in errors might be normal if it matches the previous period's pattern. The compare command shows whether the current error rate is genuinely anomalous or within normal variance.

  3. Use --output flag to save investigation results. During incidents, export key findings to JSON files. These become part of your incident postmortem documentation and can be shared with team members who were not on-call.

  4. Combine logs patterns with logs agg for root cause analysis. Patterns show you what types of errors are occurring. Aggregation by facet (service, endpoint, region) shows you where they are occurring. Together, they quickly narrow down the root cause.

  5. Set up shell aliases for common queries. Create bash aliases or functions for your most frequent queries. For example, alias dd-errors='npx @leoflores/datadog-cli errors --from 1h --pretty' saves time during incident response.

  6. Use logs context when you have a specific timestamp of interest. If a user reports an issue at a specific time, logs context shows you all logs from that service around that timestamp, including the logs immediately before the error that often reveal the cause.

  7. Script multi-step investigations for recurring incident types. If you frequently triage the same category of incidents (e.g., high latency on the API gateway), create a bash script that runs the full investigation workflow with a single command.

  8. Use logs trace for microservice debugging. When a request touches multiple services, the trace ID is the thread that connects all the logs. Finding the trace ID in one service's error log and then running logs trace reveals the full request path and where it failed.

  9. Query metrics alongside logs for correlated analysis. High error rates often correlate with CPU spikes, memory pressure, or queue depth. Use metrics query to check system metrics alongside your log investigation.

  10. Keep your Datadog API key and App key in a secure secrets manager. Never hardcode keys in scripts. Use environment variables loaded from a secrets manager or a .env file that is in your .gitignore.
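The alias advice in item 5 can be taken a step further with small wrapper functions for `~/.bashrc`, which compose better than aliases in scripts. A minimal sketch; the names `ddcli`, `dd_errors`, and `dd_svc` are illustrative, not commands shipped by the CLI:

```shell
# Wrapper functions for ~/.bashrc; the function names are hypothetical,
# but the commands and flags they wrap are the documented ones.
ddcli() { npx @leoflores/datadog-cli "$@"; }

# Quick error summary: dd_errors [timeframe]
dd_errors() { ddcli errors --from "${1:-1h}" --pretty; }

# Error logs for one service: dd_svc <service> [timeframe]
dd_svc() { ddcli logs search --query "status:error service:$1" --from "${2:-1h}" --pretty; }
```

With these loaded, `dd_errors` answers "what's broken in the last hour?" and `dd_svc payments 6h` drills into one service, without retyping the full `npx` invocation.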

Troubleshooting

Problem: Authentication errors (401/403). Verify that both DD_API_KEY and DD_APP_KEY are set correctly. The API key alone is not sufficient -- you also need an Application key. Check that the keys have not been rotated or revoked in the Datadog organization settings. Also verify you are using the correct --site flag if your Datadog account is not on the US1 site.

Problem: No logs returned for a query that should have results. Check your time range. The default --from might be too narrow. Also verify the query syntax -- Datadog uses specific facet syntax like service:api (no spaces around the colon) and status:error. Check that the service name matches exactly what Datadog shows in the UI. Tag and facet names are case-sensitive.

Problem: npx hangs or is very slow to start. The first run downloads the package, which can be slow. On subsequent runs, npx uses the cached version. If it is consistently slow, install globally with npm install -g @leoflores/datadog-cli and run directly as datadog-cli instead of through npx.

Problem: Too many results making output hard to read. Use faceted queries to narrow results: add service:, env:, @http.status_code: filters. Use logs patterns instead of logs search to group similar messages. Use logs agg to get counts by facet instead of raw log entries. Add the --output flag to export to a file for offline analysis.

Problem: Metrics query returns empty results. Verify the metric name is correct. Datadog metric names use dots as separators (e.g., system.cpu.user, trace.http.request.duration). Use the Datadog web UI to browse available metrics and copy the exact metric name. Also check that the {*} scope matches tags that exist -- {service:api} will return nothing if the metric is not tagged with that service.

Problem: Non-US Datadog site not working. You must specify the --site flag for every command, or set the DD_SITE environment variable. Common site values are: datadoghq.com (US1, default), datadoghq.eu (EU1), us3.datadoghq.com (US3), us5.datadoghq.com (US5), ap1.datadoghq.com (AP1).
