A/B Test Setup Studio
Design statistically valid A/B tests that produce actionable results — covering hypothesis formation, sample size calculation, test implementation, and result analysis.
When to Use
Run A/B tests when:
- Need data-driven decisions on UI changes, copy, or features
- Want to measure the impact of a specific change
- Have sufficient traffic for statistical significance
- The metric you're optimizing is clearly measurable
Don't A/B test when:
- Insufficient traffic (< 1000 sessions/week per variant)
- The change is objectively necessary (bug fix, legal requirement)
- Too many simultaneous tests would create interaction effects
- The metric takes months to measure (e.g., yearly retention)
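The traffic threshold above can be turned into a quick feasibility check before committing to a test. A minimal sketch; the function name and inputs (`weekly_sessions_per_variant`, the required per-variant sample size from your power calculation) are my own:

```python
import math

def weeks_to_significance(weekly_sessions_per_variant, required_n_per_variant):
    """Rough feasibility check: full weeks needed to reach the required sample."""
    return math.ceil(required_n_per_variant / weekly_sessions_per_variant)

# At 1,000 sessions/week per variant, even a modest sample takes months:
print(weeks_to_significance(1000, 15000))  # 15 weeks — probably not worth running
print(weeks_to_significance(5000, 20000))  # 4 weeks — feasible
```

If the answer exceeds your maximum test duration, either accept a larger minimum detectable effect or skip the test.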
Quick Start
1. Hypothesis Formation
```markdown
## Test Hypothesis

**Current state**: The signup button says "Get Started" and is blue
**Change**: Change button text to "Start Free Trial" and color to green
**Hypothesis**: Changing the CTA to "Start Free Trial" with a green button will increase signup conversions because it reduces uncertainty about cost and green creates a stronger visual action cue.
**Primary metric**: Signup conversion rate
**Secondary metrics**: Click-through rate, bounce rate
**Guardrail metrics**: Page load time, error rate (must not degrade)
```
2. Sample Size Calculation
```python
from scipy import stats
import numpy as np

def calculate_sample_size(
    baseline_rate,    # current conversion rate (e.g., 0.03 for 3%)
    minimum_effect,   # minimum detectable effect (e.g., 0.005 for 0.5pp)
    alpha=0.05,       # significance level
    power=0.80,       # statistical power
):
    """Required sample size per variant for a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_effect
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    # Use the variance of both proportions, not just the baseline's,
    # or the result understates the required sample.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return int(np.ceil(n))

# Example: 3% baseline, detect a 0.5pp improvement
n = calculate_sample_size(0.03, 0.005)
print(f"Need {n} users per variant")  # ~19,740 per variant → ~39,480 total
```
3. Result Analysis
```python
from scipy.stats import chi2_contingency
import numpy as np

def analyze_ab_test(control_conversions, control_total,
                    treatment_conversions, treatment_total):
    """Analyze A/B test results with a chi-squared test."""
    table = np.array([
        [control_conversions, control_total - control_conversions],
        [treatment_conversions, treatment_total - treatment_conversions],
    ])
    chi2, p_value, dof, expected = chi2_contingency(table)

    control_rate = control_conversions / control_total
    treatment_rate = treatment_conversions / treatment_total
    relative_uplift = (treatment_rate - control_rate) / control_rate * 100

    return {
        "control_rate": f"{control_rate:.4%}",
        "treatment_rate": f"{treatment_rate:.4%}",
        "relative_uplift": f"{relative_uplift:.1f}%",
        "p_value": f"{p_value:.4f}",
        "significant": p_value < 0.05,
        "recommendation": ("Deploy treatment"
                           if p_value < 0.05 and relative_uplift > 0
                           else "Keep control"),
    }
```
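With hypothetical counts (10,000 users per variant, 3.0% vs 3.6% conversion — numbers chosen for illustration), the underlying chi-squared check looks like this:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: [conversions, non-conversions] for control and treatment
table = np.array([
    [300, 9700],   # control:   3.0% conversion
    [360, 9640],   # treatment: 3.6% conversion
])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p = {p_value:.4f}")  # significant at alpha = 0.05 for these made-up numbers
```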
Core Concepts
Test Design Checklist
| Step | Action | Why It Matters |
|---|---|---|
| 1. Define metric | Choose one primary metric | Multiple primaries inflate false positives |
| 2. Calculate sample | Determine required sample size | Underpowered tests waste time |
| 3. Set duration | Run for full weeks (7, 14, 21 days) | Avoid day-of-week bias |
| 4. Randomize | Random user assignment | Eliminate selection bias |
| 5. Monitor guardrails | Track metrics that must not degrade | Catch negative side effects |
| 6. Analyze | Statistical test at predetermined endpoint | Avoid peeking bias |
Common Test Types
| Test Type | Variants | Use When |
|---|---|---|
| A/B | 2 | Testing one specific change |
| A/B/C | 3 | Comparing two alternatives against a control |
| Multivariate | 2^n | Testing multiple elements simultaneously |
| Bandit | Dynamic | Optimizing during the test |
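For the bandit row above, the simplest adaptive allocation is epsilon-greedy: explore a random variant with small probability, otherwise exploit the best observed rate. A minimal sketch; `successes` and `trials` are per-variant counters you would maintain yourself:

```python
import random

def epsilon_greedy(successes, trials, epsilon=0.1, rng=random):
    """Pick a variant index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(trials))
    rates = [s / t if t else 0.0 for s, t in zip(successes, trials)]
    return max(range(len(rates)), key=rates.__getitem__)

# Variant 1 converts at 6% vs 3%; with epsilon=0 it is always chosen.
print(epsilon_greedy([3, 6], [100, 100], epsilon=0.0))  # 1
```

Note that bandit allocation trades statistical rigor for faster optimization — the resulting data no longer supports a clean fixed-horizon significance test.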
Statistical Concepts
| Concept | Description | Rule of Thumb |
|---|---|---|
| Significance (alpha) | False positive rate | 0.05 (5%) |
| Power (1-beta) | Probability of detecting real effect | 0.80 (80%) |
| MDE | Minimum Detectable Effect | Smallest meaningful change |
| Confidence interval | Range of plausible effect sizes | 95% CI |
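The confidence-interval row can be made concrete with a 95% Wald interval for the difference in conversion rates — a sketch, with the function name `diff_ci` my own:

```python
import math

def diff_ci(x1, n1, x2, n2, z=1.96):
    """95% Wald CI for the difference in rates (treatment - control)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff - z * se, diff + z * se

# 3.0% vs 3.6% on 10,000 users each: the interval excludes zero
lo, hi = diff_ci(300, 10000, 360, 10000)
print(f"[{lo:.4f}, {hi:.4f}]")
```

Report the interval alongside the p-value: "significant" with a CI of [0.1pp, 1.1pp] tells stakeholders how big the effect plausibly is, not just that it exists.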
Configuration
| Parameter | Default | Description |
|---|---|---|
| alpha | 0.05 | Significance level |
| power | 0.80 | Statistical power |
| traffic_split | 50/50 | Traffic allocation |
| min_duration_days | 7 | Minimum test duration |
| max_duration_days | 28 | Maximum test duration |
| primary_metric | — | Main success metric |
| guardrail_metrics | [] | Metrics that must not degrade |
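The defaults above could be captured as a plain config dict with a sanity check — a hypothetical shape, not the API of any particular tool:

```python
DEFAULT_CONFIG = {
    "alpha": 0.05,
    "power": 0.80,
    "traffic_split": (0.5, 0.5),   # control, treatment
    "min_duration_days": 7,
    "max_duration_days": 28,
    "primary_metric": None,        # must be set per test
    "guardrail_metrics": [],
}

def validate(config):
    """Reject configs with no primary metric or a split that doesn't sum to 1."""
    assert config["primary_metric"], "primary_metric is required"
    assert abs(sum(config["traffic_split"]) - 1.0) < 1e-9, \
        "traffic_split must sum to 1"

validate({**DEFAULT_CONFIG, "primary_metric": "signup_conversion_rate"})
```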
Best Practices
- One primary metric — don't optimize for multiple metrics simultaneously
- Calculate sample size before starting — commit to a duration, don't peek at results
- Run for full weeks — at minimum 7 days to avoid day-of-week effects
- Randomize at the user level — not session level, to avoid inconsistent experiences
- Monitor guardrail metrics — ensure the change doesn't break other important metrics
- Document everything — hypothesis, sample size, duration, and results for future reference
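User-level randomization is typically implemented with deterministic hashing, so the same user always lands in the same variant across sessions and devices. A minimal sketch; the function name and experiment key are assumptions:

```python
import hashlib

def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically bucket a user: same user + experiment → same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(assign_variant("user_42", "cta_green_button"))
```

Keying the hash on both the user ID and the experiment name prevents the same users from always landing in "treatment" across every experiment you run.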
Common Issues
Test never reaches significance: Your MDE may be too small for your traffic volume. Either increase traffic, accept a larger MDE, or run longer. Never peek and stop early.
Results vary by segment: This is expected. Pre-register segment analyses to avoid p-hacking. Check if the effect holds across key segments (device, traffic source).
Conflicting metrics: Prioritize your primary metric. If treatment improves conversions but increases bounce rate, investigate why. The primary metric should be the deciding factor.