Why Analysts Need to Understand A/B Testing
A/B testing is how product teams make decisions based on data rather than opinion. As a data analyst, you'll be asked to design experiments, calculate required sample sizes, analyse results, and explain whether an observed difference is statistically meaningful. Getting this wrong leads to either false positives (shipping features that don't actually work) or false negatives (killing features that do).
This guide covers the complete A/B testing workflow from a practising analyst's perspective: hypothesis formulation, sample size calculation, statistical testing in Python, and communicating results to non-technical stakeholders.
Step 1 — Formulate a Testable Hypothesis
A proper A/B test hypothesis has three components: what you're changing, what metric you expect to change, and in which direction.
❌ Bad: "We think the new button design will perform better."
✅ Good: "Changing the checkout button colour will increase checkout conversion rate by at least 1 percentage point."
Choose your primary metric before the test starts — never after. Post-hoc metric selection is p-hacking and produces unreliable results. Define your success criterion: what minimum effect size is meaningful for the business?
Step 2 — Calculate Required Sample Size
The most common mistake in A/B testing is ending the test too early because "the results look good." Without proper sample size calculation, you're sampling noise. Use Python's statsmodels to calculate the required sample size before running the test.
```python
import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Parameters
baseline_rate = 0.05   # current conversion rate: 5%
min_effect = 0.01      # minimum detectable effect: +1pp (to 6%)
alpha = 0.05           # significance level (Type I error rate)
power = 0.80           # statistical power (1 - Type II error rate)

# Effect size (Cohen's h for proportions)
effect_size = proportion_effectsize(baseline_rate + min_effect, baseline_rate)

# Required sample size per group
analysis = NormalIndPower()
n = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    alternative='two-sided',
)
print(f"Required sample size per group: {math.ceil(n):,}")
```
Step 3 — Analyse Results
Once your test has reached the required sample size, run a two-proportion z-test (for conversion rates) or t-test (for continuous metrics like revenue per user).
```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Experiment results
control_visitors = 10_420
control_conversions = 521    # 5.00%
variant_visitors = 10_380
variant_conversions = 572    # 5.51%

# Two-proportion z-test
count = np.array([variant_conversions, control_conversions])
nobs = np.array([variant_visitors, control_visitors])
z_stat, p_value = proportions_ztest(count, nobs, alternative='two-sided')

control_rate = control_conversions / control_visitors
variant_rate = variant_conversions / variant_visitors
relative_lift = (variant_rate - control_rate) / control_rate * 100

print(f"Control: {control_rate:.2%}")
print(f"Variant: {variant_rate:.2%}")
print(f"Relative lift: {relative_lift:+.1f}%")
print(f"p-value: {p_value:.4f}")
print(f"Result: {'Statistically significant ✓' if p_value < 0.05 else 'Not significant ✗'}")
```
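For continuous metrics such as revenue per user, the analogous test is an independent-samples t-test. A minimal sketch with synthetic data (the exponential revenue distributions and £4 scale are illustrative assumptions, not real figures), using Welch's variant because the two groups' variances rarely match in practice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic revenue-per-user samples -- illustrative only, not real data
control_revenue = rng.exponential(scale=4.00, size=5_000)
variant_revenue = rng.exponential(scale=4.20, size=5_000)

# Welch's t-test: does not assume equal variances between groups
t_stat, p_value = stats.ttest_ind(variant_revenue, control_revenue,
                                  equal_var=False)

print(f"Control mean revenue: {control_revenue.mean():.2f}")
print(f"Variant mean revenue: {variant_revenue.mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Revenue is typically heavy-tailed; for very skewed metrics, consider a Mann–Whitney U test or a bootstrap of the difference in means alongside the t-test.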
Step 4 — Interpret Results Correctly
A p-value below 0.05 does not mean "there is a 95% probability the variant is better." It means: if there were actually no difference, we would see results this extreme only 5% of the time by chance. Common misinterpretations cause bad decisions.
| Scenario | What it means | Action |
|---|---|---|
| p < 0.05, positive lift | Statistically significant improvement | Ship if effect is practically meaningful |
| p < 0.05, negative lift | Statistically significant regression | Do not ship, investigate cause |
| p > 0.05 | Insufficient evidence of a difference (not proof of no difference) | Extend the test if underpowered, or stop and keep the control |
| p < 0.05 but tiny lift (0.1%) | Statistically significant but practically meaningless | Business decision: is it worth the engineering cost? |
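Alongside the p-value, report a confidence interval for the absolute difference: it tells stakeholders the plausible range of the effect, not just whether it cleared a threshold. A normal-approximation sketch using the counts from Step 3:

```python
import numpy as np
from scipy import stats

control_rate = 521 / 10_420   # 5.00%
variant_rate = 572 / 10_380   # 5.51%
diff = variant_rate - control_rate

# Standard error of the difference in proportions (unpooled, normal approx.)
se = np.sqrt(control_rate * (1 - control_rate) / 10_420
             + variant_rate * (1 - variant_rate) / 10_380)
z = stats.norm.ppf(0.975)     # ≈ 1.96 for a 95% interval
ci_low, ci_high = diff - z * se, diff + z * se

print(f"Difference: {diff:+.2%} (95% CI: {ci_low:+.2%} to {ci_high:+.2%})")
```

For these particular counts the interval crosses zero, consistent with a z-test that does not reach significance at the 5% level; an interval entirely above zero would support shipping.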
Common A/B Testing Mistakes
Peeking: Checking results daily and stopping when p < 0.05 inflates false positives dramatically. Run to the pre-calculated sample size.
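The inflation from peeking is easy to demonstrate by simulation. The sketch below runs A/A tests (both groups share the same true rate; the 5% conversion rate, 20-day horizon, and 500 visitors per arm per day are assumed parameters) and applies a z-test at every daily peek:

```python
import numpy as np

rng = np.random.default_rng(0)

# A/A simulation: no true difference exists between the arms
n_sims, n_days, daily_n, p_true = 1_000, 20, 500, 0.05
false_positives = 0

for _ in range(n_sims):
    c_conv = v_conv = c_n = v_n = 0
    for _ in range(n_days):
        c_conv += rng.binomial(daily_n, p_true)
        c_n += daily_n
        v_conv += rng.binomial(daily_n, p_true)
        v_n += daily_n
        # Peek: two-proportion z-test on the data so far
        p_pool = (c_conv + v_conv) / (c_n + v_n)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / c_n + 1 / v_n))
        z = (v_conv / v_n - c_conv / c_n) / se
        if abs(z) > 1.96:   # "significant" at this peek -> stop early
            false_positives += 1
            break

print(f"False positive rate with daily peeking: {false_positives / n_sims:.1%}")
```

The rate lands well above the nominal 5%, despite there being no real effect. If continuous monitoring is genuinely needed, use a sequential testing method with adjusted boundaries rather than repeated fixed-horizon tests.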
Multiple testing: Testing 10 metrics simultaneously means one will appear significant by chance. Apply Bonferroni correction or define one primary metric.
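A minimal sketch of the Bonferroni correction, using hypothetical p-values from ten metrics measured in one experiment: the per-test threshold becomes alpha divided by the number of tests.

```python
# Hypothetical p-values from testing 10 metrics in the same experiment
p_values = [0.003, 0.021, 0.047, 0.062, 0.11, 0.18, 0.25, 0.41, 0.58, 0.73]
alpha = 0.05
bonferroni_alpha = alpha / len(p_values)   # 0.05 / 10 = 0.005

significant = [p for p in p_values if p < bonferroni_alpha]
print(f"At nominal {alpha}: {sum(p < alpha for p in p_values)} 'significant'")   # 3
print(f"At Bonferroni {bonferroni_alpha}: {len(significant)} significant")       # 1
```

Bonferroni is deliberately conservative; the cleaner design is still one pre-registered primary metric, with everything else treated as exploratory.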
Simpson's paradox: Aggregate results can reverse when broken down by segment. Always segment your results by device, user type, and acquisition channel.
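Simpson's paradox is counterintuitive enough to deserve a worked illustration. With the assumed counts below (constructed for illustration: the variant gets far more mobile traffic, where everyone converts less), the variant wins in every segment yet loses in aggregate:

```python
# Hypothetical counts: (conversions, visitors) per segment and arm
segments = {
    "mobile":  {"control": (20, 1_000),  "variant": (90, 3_000)},
    "desktop": {"control": (300, 3_000), "variant": (110, 1_000)},
}

totals = {"control": [0, 0], "variant": [0, 0]}
for seg, arms in segments.items():
    for arm, (conv, n) in arms.items():
        totals[arm][0] += conv
        totals[arm][1] += n
        print(f"{seg:8s} {arm:8s}: {conv / n:.1%}")

for arm, (conv, n) in totals.items():
    print(f"overall  {arm:8s}: {conv / n:.1%}")
# Variant wins both segments (3% > 2%, 11% > 10%) but loses overall (5% < 8%)
```

The reversal is driven by unequal traffic mix across segments, which is exactly why randomisation should be checked and results segmented before any ship decision.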
Novelty effect: Users interact differently with new things simply because they're new. For some tests, wait 2–3 weeks for the effect to stabilise before drawing conclusions.
