
A/B Test Setup Studio

Design statistically valid A/B tests that produce actionable results — covering hypothesis formation, sample size calculation, test implementation, and result analysis.

When to Use

Run A/B tests when:

  • You need data-driven decisions on UI changes, copy, or features
  • You want to measure the impact of a specific change
  • You have sufficient traffic to reach statistical significance
  • The metric you're optimizing is clearly measurable

Don't A/B test when:

  • Insufficient traffic (< 1000 sessions/week per variant)
  • The change is objectively necessary (bug fix, legal requirement)
  • Too many simultaneous tests would create interaction effects
  • The metric takes months to measure (e.g., yearly retention)

Quick Start

1. Hypothesis Formation

```markdown
## Test Hypothesis

**Current state**: The signup button says "Get Started" and is blue
**Change**: Change the button text to "Start Free Trial" and the color to green
**Hypothesis**: Changing the CTA to "Start Free Trial" with a green button will increase signup conversions because it reduces uncertainty about cost and green creates a stronger visual action cue.
**Primary metric**: Signup conversion rate
**Secondary metrics**: Click-through rate, bounce rate
**Guardrail metrics**: Page load time, error rate (must not degrade)
```

2. Sample Size Calculation

```python
from scipy import stats
import numpy as np

def calculate_sample_size(
    baseline_rate,   # Current conversion rate (e.g., 0.03 for 3%)
    minimum_effect,  # Minimum detectable effect (e.g., 0.005 for 0.5pp)
    alpha=0.05,      # Significance level
    power=0.80,      # Statistical power
):
    """Calculate required sample size per variant (two-proportion test)."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_effect
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = stats.norm.ppf(power)
    # Standard two-sample formula: pooled variance under H0, unpooled under H1
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * np.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    n = numerator / (p2 - p1) ** 2
    return int(np.ceil(n))

# Example: 3% baseline, detect 0.5pp improvement
n = calculate_sample_size(0.03, 0.005)
print(f"Need {n} users per variant")  # ~19,743 per variant → ~39,486 total
```

3. Result Analysis

```python
from scipy.stats import chi2_contingency
import numpy as np

def analyze_ab_test(control_conversions, control_total,
                    treatment_conversions, treatment_total):
    """Analyze A/B test results with a chi-squared test."""
    # 2x2 contingency table: [converted, did not convert] per variant
    table = np.array([
        [control_conversions, control_total - control_conversions],
        [treatment_conversions, treatment_total - treatment_conversions],
    ])
    chi2, p_value, dof, expected = chi2_contingency(table)

    control_rate = control_conversions / control_total
    treatment_rate = treatment_conversions / treatment_total
    relative_uplift = (treatment_rate - control_rate) / control_rate * 100

    return {
        "control_rate": f"{control_rate:.4%}",
        "treatment_rate": f"{treatment_rate:.4%}",
        "relative_uplift": f"{relative_uplift:.1f}%",
        "p_value": f"{p_value:.4f}",
        "significant": p_value < 0.05,
        "recommendation": ("Deploy treatment"
                           if p_value < 0.05 and relative_uplift > 0
                           else "Keep control"),
    }
```
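As a quick check, the same chi-squared comparison can be run directly on a 2x2 table; the counts below are made up purely for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative counts: 10,000 users per arm
control_conversions, control_total = 300, 10_000      # 3.0% conversion
treatment_conversions, treatment_total = 360, 10_000  # 3.6% conversion

table = np.array([
    [control_conversions, control_total - control_conversions],
    [treatment_conversions, treatment_total - treatment_conversions],
])
chi2, p_value, dof, expected = chi2_contingency(table)

uplift = (0.036 - 0.030) / 0.030 * 100
print(f"uplift = {uplift:.0f}%, p = {p_value:.4f}")  # p < 0.05 here
```

Note that `chi2_contingency` applies Yates' continuity correction by default for 2x2 tables, which makes the test slightly conservative.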

Core Concepts

Test Design Checklist

| Step | Action | Why It Matters |
|---|---|---|
| 1. Define metric | Choose one primary metric | Multiple primaries inflate false positives |
| 2. Calculate sample | Determine required sample size | Underpowered tests waste time |
| 3. Set duration | Run for full weeks (7, 14, 21 days) | Avoid day-of-week bias |
| 4. Randomize | Random user assignment | Eliminate selection bias |
| 5. Monitor guardrails | Track metrics that must not degrade | Catch negative side effects |
| 6. Analyze | Statistical test at predetermined endpoint | Avoid peeking bias |
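Step 4 (randomize) is commonly implemented as deterministic hashing on the user ID, so a user always sees the same variant across sessions. A minimal sketch; the function name and experiment key format are illustrative:

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")):
    """Deterministically map a user to a variant by hashing user id + experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)  # roughly uniform over variants
    return variants[bucket]

# Same input → same variant, every time, on every server
print(assign_variant("user-123", "signup-cta"))
```

Hashing on the experiment name as well as the user ID decorrelates assignments across experiments, so being in "treatment" for one test doesn't bias assignment in another.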

Common Test Types

| Test Type | Variants | Use When |
|---|---|---|
| A/B | 2 | Testing one specific change |
| A/B/C | 3 | Comparing two alternatives |
| Multivariate | 2^n | Testing multiple elements simultaneously |
| Bandit | Dynamic | Optimizing during the test |
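The bandit row differs from the fixed-split designs above it: traffic allocation shifts toward the better-performing arm while the test runs. A minimal epsilon-greedy sketch, with illustrative counts (not the author's implementation):

```python
import random

def epsilon_greedy(successes, trials, epsilon=0.1):
    """Pick an arm index: explore uniformly with probability epsilon,
    otherwise exploit the arm with the best observed conversion rate."""
    if random.random() < epsilon:
        return random.randrange(len(trials))
    rates = [s / t if t else 0.0 for s, t in zip(successes, trials)]
    return max(range(len(rates)), key=rates.__getitem__)

# Illustrative state: arm 1 converts at 4.5% vs. arm 0 at 3.0%
successes, trials = [30, 45], [1000, 1000]
arm = epsilon_greedy(successes, trials)  # usually 1, occasionally explores
```

The trade-off: bandits maximize reward during the test but give weaker statistical guarantees than a fixed 50/50 split analyzed at a predetermined endpoint.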

Statistical Concepts

| Concept | Description | Rule of Thumb |
|---|---|---|
| Significance (alpha) | False positive rate | 0.05 (5%) |
| Power (1-beta) | Probability of detecting a real effect | 0.80 (80%) |
| MDE | Minimum Detectable Effect | Smallest meaningful change |
| Confidence interval | Range of plausible effect sizes | 95% CI |
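The confidence-interval row can be made concrete with a normal-approximation CI for the difference between two conversion rates; the counts below are illustrative:

```python
from scipy import stats
import numpy as np

def diff_ci(x1, n1, x2, n2, confidence=0.95):
    """Normal-approximation CI for the difference in proportions (p2 - p1)."""
    p1, p2 = x1 / n1, x2 / n2
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    diff = p2 - p1
    return diff - z * se, diff + z * se

# 3.0% vs. 3.6% on 10,000 users per arm
lo, hi = diff_ci(300, 10_000, 360, 10_000)
print(f"95% CI for absolute uplift: [{lo:.4f}, {hi:.4f}]")
```

A 95% CI that excludes zero corresponds to significance at alpha = 0.05, and unlike a bare p-value it also communicates how large the effect plausibly is.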

Configuration

| Parameter | Default | Description |
|---|---|---|
| alpha | 0.05 | Significance level |
| power | 0.80 | Statistical power |
| traffic_split | 50/50 | Traffic allocation |
| min_duration_days | 7 | Minimum test duration |
| max_duration_days | 28 | Maximum test duration |
| primary_metric | | Main success metric |
| guardrail_metrics | [] | Metrics that must not degrade |
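One way to carry these parameters through a test pipeline is a small config object. A sketch using the defaults from the table; the class and field names are assumptions, not part of the skill:

```python
from dataclasses import dataclass, field

@dataclass
class ABTestConfig:
    primary_metric: str                # required: main success metric
    alpha: float = 0.05                # significance level
    power: float = 0.80                # statistical power
    traffic_split: tuple = (0.5, 0.5)  # traffic allocation
    min_duration_days: int = 7
    max_duration_days: int = 28
    guardrail_metrics: list = field(default_factory=list)  # must not degrade

config = ABTestConfig(
    primary_metric="signup_conversion_rate",
    guardrail_metrics=["page_load_time", "error_rate"],
)
```

Making `primary_metric` the only required field enforces the "one primary metric" rule at construction time.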

Best Practices

  1. One primary metric — don't optimize for multiple metrics simultaneously
  2. Calculate sample size before starting — commit to a duration, don't peek at results
  3. Run for full weeks — at minimum 7 days to avoid day-of-week effects
  4. Randomize at the user level — not session level, to avoid inconsistent experiences
  5. Monitor guardrail metrics — ensure the change doesn't break other important metrics
  6. Document everything — hypothesis, sample size, duration, and results for future reference

Common Issues

Test never reaches significance: Your MDE may be too small for your traffic volume. Either increase traffic, accept a larger MDE, or run longer. Never peek and stop early.
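One quick way to sanity-check feasibility before launch is to convert the required sample size into whole weeks at your traffic level; the numbers below are illustrative:

```python
import math

def weeks_needed(sample_per_variant, variants, weekly_sessions):
    """Estimate test duration in whole weeks for a given traffic level."""
    total = sample_per_variant * variants
    return math.ceil(total / weekly_sessions)

# e.g. 20,000 users per variant, 2 variants, 10,000 sessions/week
print(weeks_needed(20_000, 2, 10_000))  # → 4 weeks
```

If the answer exceeds your maximum duration (e.g., 28 days), accept a larger MDE or find more traffic rather than stopping early.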

Results vary by segment: This is expected. Pre-register segment analyses to avoid p-hacking. Check if the effect holds across key segments (device, traffic source).

Conflicting metrics: Prioritize your primary metric. If treatment improves conversions but increases bounce rate, investigate why. The primary metric should be the deciding factor.
