
Amplitude Experiment Mentor

An agent that guides feature experiment implementation from planning through analysis, helping teams design proper A/B tests, instrument tracking, and interpret results using Amplitude's experimentation platform.

When to Use This Agent

Choose Amplitude Experiment Mentor when:

  • Setting up feature flags and A/B tests in Amplitude Experiment
  • Designing experiment instrumentation and tracking plans
  • Implementing feature flag checks in application code
  • Analyzing experiment results and making ship/no-ship decisions
  • Building a culture of data-driven feature development

Consider alternatives when:

  • Using a different experimentation platform like LaunchDarkly or Optimizely
  • Running infrastructure experiments without user-facing changes (use canary deployments)
  • Doing qualitative user research without quantitative testing (use a UX research agent)

Quick Start

# .claude/agents/amplitude-experiment-mentor.yml
name: Amplitude Experiment Mentor
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Grep
prompt: |
  You are an experimentation expert using Amplitude Experiment.
  Guide teams through experiment design, implementation,
  instrumentation, and analysis. Ensure statistical rigor and
  actionable results.

Example invocation:

claude --agent amplitude-experiment-mentor "Help me set up an A/B test for our new checkout flow. We want to measure conversion rate impact with 95% confidence."

Core Concepts

Experiment Lifecycle

Phase       Activities                           Key Outputs
Design      Hypothesis, metrics, sample size     Experiment plan document
Setup       Feature flags, variant config        Amplitude Experiment config
Instrument  Event tracking, properties           Analytics implementation
QA          Flag verification, event validation  Test report
Run         Traffic allocation, monitoring       Live dashboard
Analyze     Statistical analysis, segmentation   Results and recommendation
Decision    Ship, iterate, or kill               Documented decision

Feature Flag Implementation

import { Experiment } from '@amplitude/experiment-js-client';

// Initialize the client
const experiment = Experiment.initialize('YOUR_API_KEY', {
  automaticExposureTracking: true,
});

// Fetch variants for the user
await experiment.fetch({ user_id: userId });

// Check the variant
const variant = experiment.variant('new-checkout-flow');
if (variant.value === 'treatment') {
  renderNewCheckout();
} else {
  renderCurrentCheckout();
}
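If the fetch fails or the user is excluded from the experiment, variant.value may be undefined. A minimal sketch of defensive handling, assuming unassigned users should see the control experience (it uses only the APIs from the snippet above):

const value = experiment.variant('new-checkout-flow').value ?? 'control';

switch (value) {
  case 'treatment':
    renderNewCheckout();
    break;
  default:
    renderCurrentCheckout(); // control and any unassigned user land here
}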

Sample Size Calculator

Sample per variant = (2 × (Z_α + Z_β)² × σ²) / δ²

Where:
  Z_α = 1.96 (for 95% confidence, two-sided)
  Z_β = 0.84 (for 80% power)
  σ²  = metric variance; for a conversion rate, σ² ≈ p̄(1 − p̄)
        with p̄ the pooled rate across variants
  δ   = minimum detectable effect (absolute)

Example: Baseline conversion = 5%, MDE = 0.5% (absolute)
  Sample per variant ≈ 31,234 users
  Total sample ≈ 62,468 users
  At 1,000 users/day → ~63 days to reach the planned sample size
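The worked example can be reproduced in code. A minimal sketch, assuming a conversion-rate metric with σ² = p̄(1 − p̄) for the pooled rate and the more precise z_β = 0.8416 for 80% power:

function sampleSizePerVariant(
  baseline: number,  // baseline conversion rate, e.g. 0.05
  mde: number,       // minimum detectable effect (absolute), e.g. 0.005
  zAlpha = 1.96,     // 95% confidence, two-sided
  zBeta = 0.8416     // 80% power
): number {
  const pBar = baseline + mde / 2;       // pooled rate across variants
  const variance = pBar * (1 - pBar);    // Bernoulli variance
  return Math.ceil((2 * (zAlpha + zBeta) ** 2 * variance) / mde ** 2);
}

const perVariant = sampleSizePerVariant(0.05, 0.005);
console.log(perVariant, perVariant * 2); // ≈ 31,235 per variant (matches the
                                         // example above up to rounding), ≈ 62,470 total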

Configuration

Parameter           Description                          Default
confidence_level    Statistical confidence threshold     95%
power               Statistical power for sample sizing  80%
min_runtime_days    Minimum days before concluding       7
traffic_allocation  Percentage of users in experiment    100%
sticky_bucketing    Maintain consistent user assignment  true
exposure_tracking   Auto-track variant exposures         true
rollout_strategy    Gradual rollout after ship decision  10% → 50% → 100%
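For teams that keep these defaults under code review, a hypothetical TypeScript constant mirroring the table (the field names are illustrative, not Amplitude's configuration schema):

const defaultExperimentSettings = {
  confidenceLevel: 0.95,          // statistical confidence threshold
  power: 0.8,                     // power used for sample sizing
  minRuntimeDays: 7,              // minimum days before concluding
  trafficAllocation: 1.0,         // fraction of users in the experiment
  stickyBucketing: true,          // keep user assignments consistent
  exposureTracking: true,         // auto-track variant exposures
  rolloutStages: [0.1, 0.5, 1.0], // gradual rollout after a ship decision
} as const;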

Best Practices

  1. Define your primary metric before writing any code. Every experiment needs exactly one primary metric that determines the ship/no-ship decision. Secondary metrics provide context but should not override the primary. If you cannot articulate what success looks like in one metric, your experiment scope is too broad.

  2. Run experiments for full business cycles. A checkout experiment that runs Monday through Thursday misses weekend shopping patterns. Always run for at least one full week, ideally two. Day-of-week effects, payroll cycles, and seasonal patterns can all bias results if your experiment window doesn't cover them.

  3. Instrument both the exposure and the outcome. Track when users see the variant (exposure event) and when they complete the target action (outcome event). Without exposure tracking, your analysis includes users who were assigned to a variant but never encountered the changed experience, diluting the measured effect. A sketch of outcome instrumentation follows this list.

  4. Guard against peeking at results prematurely. Checking results daily and stopping when you see significance inflates your false positive rate dramatically. Use sequential testing methods if you need to monitor continuously, or pre-commit to a fixed sample size and analysis date. Amplitude's statistics engine accounts for this, but only if configured correctly.

  5. Document the decision, not just the results. After analysis, record what you decided and why in a shared location. Include the metrics observed, the confidence level achieved, qualitative factors that influenced the decision, and any follow-up experiments planned. This institutional memory prevents re-running the same experiments and helps new team members understand product evolution.
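A hedged sketch of practice 3: the Experiment client above already auto-tracks exposures, so the remaining work is the outcome event, sent here via @amplitude/analytics-browser. The event name and property are illustrative:

import { init, track } from '@amplitude/analytics-browser';

init('YOUR_API_KEY'); // same Amplitude project as the Experiment deployment

function onCheckoutComplete(orderValue: number) {
  // Outcome event; Amplitude joins it with the auto-tracked exposure per user
  track('Checkout Completed', { order_value: orderValue });
}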

Common Issues

Experiment shows no statistical significance after the planned runtime. This usually means your minimum detectable effect was too small for your traffic volume. Before extending the experiment, check for implementation bugs (are users actually seeing different experiences?), verify event tracking fires correctly, and confirm the primary metric is sensitive to the change. If everything checks out, you likely need either more traffic or a larger design change to move the needle.

Users see different variants across sessions or devices. Sticky bucketing prevents this by storing variant assignments persistently. Configure Amplitude to use a stable user identifier rather than anonymous session IDs. For logged-out experiences, use a device ID cookie. Cross-device consistency requires account-level bucketing, which means the experiment can only include logged-in users.
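A minimal sketch of fetching with a stable identifier, assuming logged-out users fall back to a device ID cookie; getOrCreateDeviceId is a hypothetical helper, not part of the SDK:

function getOrCreateDeviceId(): string {
  // Hypothetical helper: persist a random ID in a one-year cookie
  const match = document.cookie.match(/(?:^|; )device_id=([^;]+)/);
  if (match) return match[1];
  const id = crypto.randomUUID();
  document.cookie = `device_id=${id}; max-age=31536000; path=/`;
  return id;
}

// Logged-in users bucket by account; logged-out users by device ID
const user = userId ? { user_id: userId } : { device_id: getOrCreateDeviceId() };
await experiment.fetch(user);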

Results are significant but the effect seems implausibly large. Check for novelty effects by segmenting results by week. If the treatment effect decreases over time, users may be reacting to newness rather than genuine improvement. Also verify that variant assignment is balanced—a skewed split can inflate effect sizes. Run a pre-experiment A/A test on your bucketing logic to confirm uniform distribution.
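A quick balance check for that A/A test, assuming a two-variant 50/50 split; this is a plain chi-square goodness-of-fit statistic with one degree of freedom:

function chiSquareBalance(controlCount: number, treatmentCount: number): number {
  const expected = (controlCount + treatmentCount) / 2; // expected under 50/50
  return (
    (controlCount - expected) ** 2 / expected +
    (treatmentCount - expected) ** 2 / expected
  );
}

// Values above ~3.84 (p < 0.05, 1 df) suggest the split is skewed
console.log(chiSquareBalance(50400, 49600)); // 6.4 → investigate bucketing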
