Dynatrace Expert Partner

Production-ready agent for Dynatrace observability, APM, and DQL analytics. Includes structured workflows, validation checks, and reusable patterns for security.


Master Dynatrace observability, APM, and DQL analytics for incident response, capacity planning, and security posture monitoring.

When to Use This Agent

Choose this agent when you need to:

  • Investigate production incidents using distributed traces, service-flow analysis, and Davis AI to pinpoint root cause
  • Write or optimize DQL queries for custom dashboards, SLO definitions, and alerting rules across full-stack environments
  • Assess application security through Dynatrace RASP, vulnerability detection, and attack-path analysis

Consider alternatives when:

  • Your monitoring stack is Prometheus, Grafana, or Datadog and you need vendor-specific guidance
  • You require code-level profiling beyond what OneAgent captures automatically

Quick Start

Configuration

name: dynatrace-expert-partner
type: agent
category: observability

Example Invocation

claude agent:invoke dynatrace-expert-partner "Investigate checkout service latency spike and build a DQL dashboard"

Example Output

Incident - Checkout Service Latency
Environment: prod-us-east | 2026-03-15 08:00-09:30 UTC

Root Cause: Davis AI anomaly at 08:12 UTC
  P95 latency: 180ms -> 2,400ms
  Cause: DB connection pool exhaustion (100/100 saturated)
  Trigger: checkout-api:v3.8.2 deployed at 08:10 (missing connection release)

DQL Query:
  timeseries avg_latency = avg(dt.service.request.response_time),
    filter: dt.entity.service.name == "checkout-api", interval: 1m
  | fieldsAdd threshold = 500
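
A DQL query like the one above can also be run programmatically against Grail. The sketch below builds the HTTP request pieces for the platform storage query endpoint; the endpoint path and payload fields are assumptions based on the Dynatrace Platform Storage Query API, so verify them against your environment's API docs before use.

```python
import json

def build_dql_request(env_url: str, token: str, dql: str,
                      max_result_records: int = 1000):
    # Assumed Grail query endpoint; confirm against your environment version.
    url = f"{env_url.rstrip('/')}/platform/storage/query/v1/query:execute"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    # maxResultRecords caps the result set so large queries stay cheap.
    payload = json.dumps({"query": dql, "maxResultRecords": max_result_records})
    return url, headers, payload

DQL = '''timeseries avg_latency = avg(dt.service.request.response_time),
  filter: dt.entity.service.name == "checkout-api", interval: 1m
| fieldsAdd threshold = 500'''

url, headers, body = build_dql_request(
    "https://abc12345.live.dynatrace.com", "dt0s08.EXAMPLE", DQL)
```

Separating request construction from execution keeps the query testable without network access; pass `url`, `headers`, and `body` to any HTTP client.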

Core Concepts

Dynatrace Observability Overview

| Aspect | Details |
| --- | --- |
| OneAgent | Automatic code-level injection for Java, .NET, Node.js, Go, and Python, providing traces, hotspots, and RUM |
| Davis AI | Causal AI correlating topology, metrics, events, and logs to identify root cause and impact scope |
| DQL | Pipe-based query language for logs, metrics, events, and entities, with fetch, filter, summarize, and timeseries |
| Smartscape | Real-time dependency map spanning hosts, processes, services, and applications with call relationships |
| Grail | Unified data lakehouse for all signals with schema-on-read and retention up to 10 years |
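
The DQL row above names the core pipeline verbs. A minimal sketch combining them, embedded as a Python string so it can be handed to the query API; the service name and `loglevel` filter are illustrative:

```python
# Illustrative DQL pipeline: fetch -> filter -> summarize.
DQL_ERROR_RATE = """
fetch logs
| filter dt.entity.service.name == "checkout-api" and loglevel == "ERROR"
| summarize error_count = count(), by: {bin(timestamp, 5m)}
"""

def pipeline_stages(dql: str):
    # Split a DQL pipeline on its pipe separators for quick inspection.
    return [stage.strip() for stage in dql.strip().split("|")]
```

Building the query stage by stage, as recommended in the Best Practices below, maps directly onto these pipe-separated segments.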

Dynatrace Investigation Architecture

+----------------+     +------------------+     +----------------+
| OneAgent       | --> | Grail Data       | --> | Davis AI       |
| Instrumentation|     | Lakehouse        |     | Correlation    |
+----------------+     +------------------+     +----------------+
        |                       |                       |
        v                       v                       v
+----------------+     +------------------+     +----------------+
| Smartscape     | --> | DQL Queries &    | --> | Dashboards &   |
| Topology       |     | Notebooks        |     | Alerts / SLOs  |
+----------------+     +------------------+     +----------------+

Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dt_environment_url | string | - | Environment URL (e.g., https://abc12345.live.dynatrace.com) |
| dt_api_token | string | - | API token with Read metrics, entities, logs, and problems scopes |
| default_timeframe | string | last 2 hours | Default query timeframe for investigations |
| management_zone | string | - | Zone to scope analysis to a specific application or team |
| slo_target | float | 99.9 | Default SLO availability target percentage |
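
A minimal sketch of how these parameters might be held in code; the `AgentConfig` class is hypothetical (not an official SDK type), with field names and defaults taken from the table above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentConfig:
    # Required: environment URL and an API token with read scopes.
    dt_environment_url: str
    dt_api_token: str
    # Optional: defaults mirror the configuration table.
    default_timeframe: str = "last 2 hours"
    management_zone: Optional[str] = None
    slo_target: float = 99.9

cfg = AgentConfig(
    dt_environment_url="https://abc12345.live.dynatrace.com",
    dt_api_token="dt0c01.EXAMPLE",
)
```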

Best Practices

  1. Start with Davis AI problems - Query the problems API first instead of raw metrics. Davis performs topology-aware root cause analysis that would take hours manually.

  2. Use management zones for scoping - Scope dashboards and alerts to specific zones so teams see only their services, reducing noise and improving query performance.

  3. Build DQL queries iteratively - Start with broad fetch, add filters, then aggregations. Test each stage in a Notebook before embedding in dashboard tiles.

  4. Define SLOs before incidents - Establish latency, error rate, and availability objectives during calm periods. Error budget burn rate provides objective severity measurement.

  5. Correlate deployments with anomalies - Ingest CI/CD deployment events so Davis considers them as root-cause candidates, frequently reducing MTTR.
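
Practice 1 can be sketched as a request builder for the Problems API v2. The path and `problemSelector` syntax follow the public API, but verify parameter names against your environment version; the token and zone values are placeholders:

```python
from urllib.parse import urlencode

def problems_request(env_url: str, token: str, time_from: str = "now-2h",
                     management_zone: str = None):
    # Ask Davis for correlated problems before digging into raw metrics.
    selector = ['status("open")']
    if management_zone:
        # Scope to one team's zone (practice 2) to cut noise.
        selector.append(f'managementZones("{management_zone}")')
    params = {"from": time_from, "problemSelector": ",".join(selector)}
    url = f"{env_url.rstrip('/')}/api/v2/problems?{urlencode(params)}"
    headers = {"Authorization": f"Api-Token {token}"}
    return url, headers
```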

Common Issues

  1. DQL query timeouts - Queries on large datasets exceed execution limits. Add entity filters, reduce timeframes, or use larger summarize intervals.

  2. OneAgent version mismatch - Different major versions produce inconsistent traces. Use the deployment API to automate rolling upgrades.

  3. Alert fatigue from defaults - Tune Davis sensitivity per service and define metric-based alert profiles tied to SLO budgets.
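
For issue 1, the usual fix is to scope the query before running it. This hypothetical helper assembles a tightened timeseries query: an entity filter, a shorter timeframe, and a coarser interval; the `from:` parameter usage is an assumption, so check it against the DQL timeseries reference:

```python
def scoped_timeseries(metric: str, service: str, timeframe: str = "2h",
                      interval: str = "5m") -> str:
    # Narrow the entity, timeframe, and resolution to avoid execution limits.
    return (
        f"timeseries avg_val = avg({metric}), "
        f'filter: dt.entity.service.name == "{service}", '
        f"interval: {interval}, from: now()-{timeframe}"
    )

query = scoped_timeseries("dt.service.request.response_time", "checkout-api")
```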
