
Amplitude Experiment Mentor

An agent that guides feature experiment implementation from planning through analysis, helping teams design proper A/B tests, instrument tracking, and interpret results using Amplitude's experimentation platform.

When to Use This Agent

Choose Amplitude Experiment Mentor when:

  • Setting up feature flags and A/B tests in Amplitude Experiment
  • Designing experiment instrumentation and tracking plans
  • Implementing feature flag checks in application code
  • Analyzing experiment results and making ship/no-ship decisions
  • Building a culture of data-driven feature development

Consider alternatives when:

  • Using a different experimentation platform like LaunchDarkly or Optimizely
  • Running infrastructure experiments without user-facing changes (use canary deployments)
  • Doing qualitative user research without quantitative testing (use a UX research agent)

Quick Start

# .claude/agents/amplitude-experiment-mentor.yml
name: Amplitude Experiment Mentor
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Grep
prompt: |
  You are an experimentation expert using Amplitude Experiment.
  Guide teams through experiment design, implementation,
  instrumentation, and analysis. Ensure statistical rigor and
  actionable results.

Example invocation:

claude --agent amplitude-experiment-mentor "Help me set up an A/B test for our new checkout flow. We want to measure conversion rate impact with 95% confidence."

Core Concepts

Experiment Lifecycle

Phase       Activities                           Key Outputs
Design      Hypothesis, metrics, sample size     Experiment plan document
Setup       Feature flags, variant config        Amplitude Experiment config
Instrument  Event tracking, properties           Analytics implementation
QA          Flag verification, event validation  Test report
Run         Traffic allocation, monitoring       Live dashboard
Analyze     Statistical analysis, segmentation   Results and recommendation
Decision    Ship, iterate, or kill               Documented decision

Feature Flag Implementation

import { Experiment } from '@amplitude/experiment-js-client';

// Initialize the client
const experiment = Experiment.initialize('YOUR_API_KEY', {
  automaticExposureTracking: true,
});

// Fetch variants for the user
await experiment.fetch({ user_id: userId });

// Check the variant
const variant = experiment.variant('new-checkout-flow');
if (variant.value === 'treatment') {
  renderNewCheckout();
} else {
  renderCurrentCheckout();
}
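If the fetch fails or the user is excluded from the experiment, variant.value may be undefined. A minimal sketch of defensive handling, assuming unassigned users should see the control experience (it uses only the APIs from the snippet above):

const value = experiment.variant('new-checkout-flow').value ?? 'control';

switch (value) {
  case 'treatment':
    renderNewCheckout();
    break;
  default:
    renderCurrentCheckout(); // control and any unassigned user land here
}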

Sample Size Calculator

Sample per variant = (2 × (Z_α + Z_β)² × σ²) / δ²

Where:
  Z_α = 1.96 (for 95% confidence, two-sided)
  Z_β = 0.84 (for 80% power)
  σ²  = metric variance; for a conversion rate, σ² ≈ p̄(1 − p̄)
        with p̄ the pooled rate across variants
  δ   = minimum detectable effect (absolute)

Example: Baseline conversion = 5%, MDE = 0.5% (absolute)
  Sample per variant ≈ 31,234 users
  Total sample ≈ 62,468 users
  At 1,000 users/day → ~63 days to reach the planned sample size
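The worked example can be reproduced in code. A minimal sketch, assuming a conversion-rate metric with σ² = p̄(1 − p̄) for the pooled rate and the more precise z_β = 0.8416 for 80% power:

function sampleSizePerVariant(
  baseline: number,  // baseline conversion rate, e.g. 0.05
  mde: number,       // minimum detectable effect (absolute), e.g. 0.005
  zAlpha = 1.96,     // 95% confidence, two-sided
  zBeta = 0.8416     // 80% power
): number {
  const pBar = baseline + mde / 2;       // pooled rate across variants
  const variance = pBar * (1 - pBar);    // Bernoulli variance
  return Math.ceil((2 * (zAlpha + zBeta) ** 2 * variance) / mde ** 2);
}

const perVariant = sampleSizePerVariant(0.05, 0.005);
console.log(perVariant, perVariant * 2); // ≈ 31,235 per variant (matches the
                                         // example above up to rounding), ≈ 62,470 total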

Configuration

Parameter           Description                          Default
confidence_level    Statistical confidence threshold     95%
power               Statistical power for sample sizing  80%
min_runtime_days    Minimum days before concluding       7
traffic_allocation  Percentage of users in experiment    100%
sticky_bucketing    Maintain consistent user assignment  true
exposure_tracking   Auto-track variant exposures         true
rollout_strategy    Gradual rollout after ship decision  10% → 50% → 100%
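For teams that keep these defaults under code review, a hypothetical TypeScript constant mirroring the table (the field names are illustrative, not Amplitude's configuration schema):

const defaultExperimentSettings = {
  confidenceLevel: 0.95,          // statistical confidence threshold
  power: 0.8,                     // power used for sample sizing
  minRuntimeDays: 7,              // minimum days before concluding
  trafficAllocation: 1.0,         // fraction of users in the experiment
  stickyBucketing: true,          // keep user assignments consistent
  exposureTracking: true,         // auto-track variant exposures
  rolloutStages: [0.1, 0.5, 1.0], // gradual rollout after a ship decision
} as const;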

Best Practices

  1. Define your primary metric before writing any code. Every experiment needs exactly one primary metric that determines the ship/no-ship decision. Secondary metrics provide context but should not override the primary. If you cannot articulate what success looks like in one metric, your experiment scope is too broad.

  2. Run experiments for full business cycles. A checkout experiment that runs Monday through Thursday misses weekend shopping patterns. Always run for at least one full week, ideally two. Day-of-week effects, payroll cycles, and seasonal patterns can all bias results if your experiment window doesn't cover them.

  3. Instrument both the exposure and the outcome. Track when users see the variant (exposure event) and when they complete the target action (outcome event). Without exposure tracking, your analysis includes users who were assigned to a variant but never encountered the changed experience, diluting the measured effect. A sketch of outcome instrumentation follows this list.

  4. Guard against peeking at results prematurely. Checking results daily and stopping when you see significance inflates your false positive rate dramatically. Use sequential testing methods if you need to monitor continuously, or pre-commit to a fixed sample size and analysis date. Amplitude's statistics engine accounts for this, but only if configured correctly.

  5. Document the decision, not just the results. After analysis, record what you decided and why in a shared location. Include the metrics observed, the confidence level achieved, qualitative factors that influenced the decision, and any follow-up experiments planned. This institutional memory prevents re-running the same experiments and helps new team members understand product evolution.
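A hedged sketch of practice 3: the Experiment client above already auto-tracks exposures, so the remaining work is the outcome event, sent here via @amplitude/analytics-browser. The event name and property are illustrative:

import { init, track } from '@amplitude/analytics-browser';

init('YOUR_API_KEY'); // same Amplitude project as the Experiment deployment

function onCheckoutComplete(orderValue: number) {
  // Outcome event; Amplitude joins it with the auto-tracked exposure per user
  track('Checkout Completed', { order_value: orderValue });
}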

Common Issues

Experiment shows no statistical significance after the planned runtime. This usually means your minimum detectable effect was too small for your traffic volume. Before extending the experiment, check for implementation bugs (are users actually seeing different experiences?), verify event tracking fires correctly, and confirm the primary metric is sensitive to the change. If everything checks out, you likely need either more traffic or a larger design change to move the needle.

Users see different variants across sessions or devices. Sticky bucketing prevents this by storing variant assignments persistently. Configure Amplitude to use a stable user identifier rather than anonymous session IDs. For logged-out experiences, use a device ID cookie. Cross-device consistency requires account-level bucketing, which means the experiment can only include logged-in users.
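A minimal sketch of fetching with a stable identifier, assuming logged-out users fall back to a device ID cookie; getOrCreateDeviceId is a hypothetical helper, not part of the SDK:

function getOrCreateDeviceId(): string {
  // Hypothetical helper: persist a random ID in a one-year cookie
  const match = document.cookie.match(/(?:^|; )device_id=([^;]+)/);
  if (match) return match[1];
  const id = crypto.randomUUID();
  document.cookie = `device_id=${id}; max-age=31536000; path=/`;
  return id;
}

// Logged-in users bucket by account; logged-out users by device ID
const user = userId ? { user_id: userId } : { device_id: getOrCreateDeviceId() };
await experiment.fetch(user);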

Results are significant but the effect seems implausibly large. Check for novelty effects by segmenting results by week. If the treatment effect decreases over time, users may be reacting to newness rather than genuine improvement. Also verify that variant assignment is balanced—a skewed split can inflate effect sizes. Run a pre-experiment A/A test on your bucketing logic to confirm uniform distribution.
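A quick balance check for that A/A test, assuming a two-variant 50/50 split; this is a plain chi-square goodness-of-fit statistic with one degree of freedom:

function chiSquareBalance(controlCount: number, treatmentCount: number): number {
  const expected = (controlCount + treatmentCount) / 2; // expected under 50/50
  return (
    (controlCount - expected) ** 2 / expected +
    (treatmentCount - expected) ** 2 / expected
  );
}

// Values above ~3.84 (p < 0.05, 1 df) suggest the split is skewed
console.log(chiSquareBalance(50400, 49600)); // 6.4 → investigate bucketing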
