A/B Test Setup Studio
Design statistically valid A/B tests that produce actionable results — covering hypothesis formation, sample size calculation, test implementation, and result analysis.
When to Use
Run A/B tests when:
- Need data-driven decisions on UI changes, copy, or features
- Want to measure the impact of a specific change
- Have sufficient traffic for statistical significance
- The metric you're optimizing is clearly measurable
Don't A/B test when:
- Insufficient traffic (< 1000 sessions/week per variant)
- The change is objectively necessary (bug fix, legal requirement)
- Too many simultaneous tests would create interaction effects
- The metric takes months to measure (e.g., yearly retention)
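The traffic threshold above can be turned into a quick feasibility check before committing to a test. A minimal sketch; the function name and inputs (`weekly_sessions_per_variant`, the required per-variant sample size from your power calculation) are my own:

```python
import math

def weeks_to_significance(weekly_sessions_per_variant, required_n_per_variant):
    """Rough feasibility check: full weeks needed to reach the required sample."""
    return math.ceil(required_n_per_variant / weekly_sessions_per_variant)

# At 1,000 sessions/week per variant, even a modest sample takes months:
print(weeks_to_significance(1000, 15000))  # 15 weeks — probably not worth running
print(weeks_to_significance(5000, 20000))  # 4 weeks — feasible
```

If the answer exceeds your maximum test duration, either accept a larger minimum detectable effect or skip the test.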
Quick Start
1. Hypothesis Formation
```markdown
## Test Hypothesis

**Current state**: The signup button says "Get Started" and is blue
**Change**: Change button text to "Start Free Trial" and color to green
**Hypothesis**: Changing the CTA to "Start Free Trial" with a green button will increase signup conversions because it reduces uncertainty about cost and green creates a stronger visual action cue.
**Primary metric**: Signup conversion rate
**Secondary metrics**: Click-through rate, bounce rate
**Guardrail metrics**: Page load time, error rate (must not degrade)
```
2. Sample Size Calculation
```python
from scipy import stats
import numpy as np

def calculate_sample_size(
    baseline_rate,    # current conversion rate (e.g., 0.03 for 3%)
    minimum_effect,   # minimum detectable effect (e.g., 0.005 for 0.5pp)
    alpha=0.05,       # significance level
    power=0.80,       # statistical power
):
    """Required sample size per variant for a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_effect
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    # Use the variance of both proportions, not just the baseline's,
    # or the result understates the required sample.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return int(np.ceil(n))

# Example: 3% baseline, detect a 0.5pp improvement
n = calculate_sample_size(0.03, 0.005)
print(f"Need {n} users per variant")  # ~19,740 per variant → ~39,480 total
```
3. Result Analysis
```python
from scipy.stats import chi2_contingency
import numpy as np

def analyze_ab_test(control_conversions, control_total,
                    treatment_conversions, treatment_total):
    """Analyze A/B test results with a chi-squared test."""
    table = np.array([
        [control_conversions, control_total - control_conversions],
        [treatment_conversions, treatment_total - treatment_conversions],
    ])
    chi2, p_value, dof, expected = chi2_contingency(table)

    control_rate = control_conversions / control_total
    treatment_rate = treatment_conversions / treatment_total
    relative_uplift = (treatment_rate - control_rate) / control_rate * 100

    return {
        "control_rate": f"{control_rate:.4%}",
        "treatment_rate": f"{treatment_rate:.4%}",
        "relative_uplift": f"{relative_uplift:.1f}%",
        "p_value": f"{p_value:.4f}",
        "significant": p_value < 0.05,
        "recommendation": ("Deploy treatment"
                           if p_value < 0.05 and relative_uplift > 0
                           else "Keep control"),
    }
```
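With hypothetical counts (10,000 users per variant, 3.0% vs 3.6% conversion — numbers chosen for illustration), the underlying chi-squared check looks like this:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: [conversions, non-conversions] for control and treatment
table = np.array([
    [300, 9700],   # control:   3.0% conversion
    [360, 9640],   # treatment: 3.6% conversion
])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p = {p_value:.4f}")  # significant at alpha = 0.05 for these made-up numbers
```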
Core Concepts
Test Design Checklist
| Step | Action | Why It Matters |
|---|---|---|
| 1. Define metric | Choose one primary metric | Multiple primaries inflate false positives |
| 2. Calculate sample | Determine required sample size | Underpowered tests waste time |
| 3. Set duration | Run for full weeks (7, 14, 21 days) | Avoid day-of-week bias |
| 4. Randomize | Random user assignment | Eliminate selection bias |
| 5. Monitor guardrails | Track metrics that must not degrade | Catch negative side effects |
| 6. Analyze | Statistical test at predetermined endpoint | Avoid peeking bias |
Common Test Types
| Test Type | Variants | Use When |
|---|---|---|
| A/B | 2 | Testing one specific change |
| A/B/C | 3 | Comparing two alternatives against a control |
| Multivariate | 2^n | Testing multiple elements simultaneously |
| Bandit | Dynamic | Optimizing during the test |
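For the bandit row above, the simplest adaptive allocation is epsilon-greedy: explore a random variant with small probability, otherwise exploit the best observed rate. A minimal sketch; `successes` and `trials` are per-variant counters you would maintain yourself:

```python
import random

def epsilon_greedy(successes, trials, epsilon=0.1, rng=random):
    """Pick a variant index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(trials))
    rates = [s / t if t else 0.0 for s, t in zip(successes, trials)]
    return max(range(len(rates)), key=rates.__getitem__)

# Variant 1 converts at 6% vs 3%; with epsilon=0 it is always chosen.
print(epsilon_greedy([3, 6], [100, 100], epsilon=0.0))  # 1
```

Note that bandit allocation trades statistical rigor for faster optimization — the resulting data no longer supports a clean fixed-horizon significance test.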
Statistical Concepts
| Concept | Description | Rule of Thumb |
|---|---|---|
| Significance (alpha) | False positive rate | 0.05 (5%) |
| Power (1-beta) | Probability of detecting real effect | 0.80 (80%) |
| MDE | Minimum Detectable Effect | Smallest meaningful change |
| Confidence interval | Range of plausible effect sizes | 95% CI |
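The confidence-interval row can be made concrete with a 95% Wald interval for the difference in conversion rates — a sketch, with the function name `diff_ci` my own:

```python
import math

def diff_ci(x1, n1, x2, n2, z=1.96):
    """95% Wald CI for the difference in rates (treatment - control)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff - z * se, diff + z * se

# 3.0% vs 3.6% on 10,000 users each: the interval excludes zero
lo, hi = diff_ci(300, 10000, 360, 10000)
print(f"[{lo:.4f}, {hi:.4f}]")
```

Report the interval alongside the p-value: "significant" with a CI of [0.1pp, 1.1pp] tells stakeholders how big the effect plausibly is, not just that it exists.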
Configuration
| Parameter | Default | Description |
|---|---|---|
| alpha | 0.05 | Significance level |
| power | 0.80 | Statistical power |
| traffic_split | 50/50 | Traffic allocation |
| min_duration_days | 7 | Minimum test duration |
| max_duration_days | 28 | Maximum test duration |
| primary_metric | — | Main success metric |
| guardrail_metrics | [] | Metrics that must not degrade |
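The defaults above could be captured as a plain config dict with a sanity check — a hypothetical shape, not the API of any particular tool:

```python
DEFAULT_CONFIG = {
    "alpha": 0.05,
    "power": 0.80,
    "traffic_split": (0.5, 0.5),   # control, treatment
    "min_duration_days": 7,
    "max_duration_days": 28,
    "primary_metric": None,        # must be set per test
    "guardrail_metrics": [],
}

def validate(config):
    """Reject configs with no primary metric or a split that doesn't sum to 1."""
    assert config["primary_metric"], "primary_metric is required"
    assert abs(sum(config["traffic_split"]) - 1.0) < 1e-9, \
        "traffic_split must sum to 1"

validate({**DEFAULT_CONFIG, "primary_metric": "signup_conversion_rate"})
```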
Best Practices
- One primary metric — don't optimize for multiple metrics simultaneously
- Calculate sample size before starting — commit to a duration, don't peek at results
- Run for full weeks — at minimum 7 days to avoid day-of-week effects
- Randomize at the user level — not session level, to avoid inconsistent experiences
- Monitor guardrail metrics — ensure the change doesn't break other important metrics
- Document everything — hypothesis, sample size, duration, and results for future reference
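User-level randomization is typically implemented with deterministic hashing, so the same user always lands in the same variant across sessions and devices. A minimal sketch; the function name and experiment key are assumptions:

```python
import hashlib

def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically bucket a user: same user + experiment → same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(assign_variant("user_42", "cta_green_button"))
```

Keying the hash on both the user ID and the experiment name prevents the same users from always landing in "treatment" across every experiment you run.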
Common Issues
Test never reaches significance: Your MDE may be too small for your traffic volume. Either increase traffic, accept a larger MDE, or run longer. Never peek and stop early.
Results vary by segment: This is expected. Pre-register segment analyses to avoid p-hacking. Check if the effect holds across key segments (device, traffic source).
Conflicting metrics: Prioritize your primary metric. If treatment improves conversions but increases bounce rate, investigate why. The primary metric should be the deciding factor.