Python · A/B Testing · Statistics · 2026-06-12 · 9 min read

A/B Testing for Data Analysts: From Hypothesis to Decision

A/B testing is how product teams make decisions based on data rather than opinion. This guide covers the complete workflow — hypothesis formulation, sample size calculation, statistical testing in Python, and translating results into business decisions.

Isachenko Andrii
Data Analyst · Open to work

📋 Table of Contents

  1. Why Analysts Need to Understand A/B Testing
  2. Step 1 — Formulate a Testable Hypothesis
  3. Step 2 — Calculate Required Sample Size
  4. Step 3 — Analyse Results
  5. Step 4 — Interpret Results Correctly
  6. Common A/B Testing Mistakes

Why Analysts Need to Understand A/B Testing

A/B testing is how product teams make decisions based on data rather than opinion. As a data analyst, you'll be asked to design experiments, calculate required sample sizes, analyse results, and explain whether an observed difference is statistically meaningful. Getting this wrong leads to either false positives (shipping features that don't actually work) or false negatives (killing features that do).

This guide covers the complete A/B testing workflow from a practising analyst's perspective: hypothesis formulation, sample size calculation, statistical testing in Python, and communicating results to non-technical stakeholders.

Step 1 — Formulate a Testable Hypothesis

A proper A/B test hypothesis has three components: what you're changing, what metric you expect to change, and in which direction.

✅ Good: "Changing the CTA button colour from grey to blue on the checkout page will increase the click-through rate by at least 5%."
❌ Bad: "We think the new button design will perform better."

Choose your primary metric before the test starts — never after. Post-hoc metric selection is p-hacking and produces unreliable results. Define your success criterion: what minimum effect size is meaningful for the business?

Step 2 — Calculate Required Sample Size

The most common mistake in A/B testing is ending the test too early because "the results look good." Without proper sample size calculation, you're sampling noise. Use Python's statsmodels to calculate the required sample size before running the test.

import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Parameters
baseline_rate = 0.05    # current conversion rate: 5%
min_effect    = 0.01    # minimum detectable effect: +1pp (to 6%)
alpha         = 0.05    # significance level (Type I error rate)
power         = 0.80    # statistical power (1 - Type II error rate)

# Effect size (Cohen's h for proportions)
effect_size = proportion_effectsize(
    baseline_rate + min_effect,
    baseline_rate
)

# Required sample size per group (round up, never down)
analysis = NormalIndPower()
n = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    alternative='two-sided'
)
print(f"Required sample size per group: {math.ceil(n):,}")
⚠️ Always calculate sample size for a two-sided test unless you have a strong prior reason to test one direction only. One-sided tests inflate the false-positive rate in practice, because the direction is often chosen after a first look at the data.
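Sample size translates directly into test duration. As a minimal sketch (the traffic figure below is hypothetical — substitute your own), divide the total required sample across both groups by the daily number of users entering the experiment:

```python
import math

# Hypothetical figures -- substitute your own
daily_visitors_in_test = 4_000   # users entering the experiment per day
n_per_group = 12_000             # e.g. the per-group output of solve_power

# Two groups, split 50/50
total_needed = 2 * n_per_group
days_needed = math.ceil(total_needed / daily_visitors_in_test)
print(f"Estimated test duration: {days_needed} days")
```

If the estimated duration is impractically long, revisit the minimum detectable effect rather than quietly stopping early.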

Step 3 — Analyse Results

Once your test has reached the required sample size, run a two-proportion z-test (for conversion rates) or t-test (for continuous metrics like revenue per user).

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Experiment results
control_visitors    = 10_420
control_conversions = 521    # 5.00%

variant_visitors    = 10_380
variant_conversions = 572    # 5.51%

# Two-proportion z-test
count = np.array([variant_conversions, control_conversions])
nobs  = np.array([variant_visitors, control_visitors])

z_stat, p_value = proportions_ztest(count, nobs, alternative='two-sided')

control_rate = control_conversions / control_visitors
variant_rate = variant_conversions / variant_visitors
relative_lift = (variant_rate - control_rate) / control_rate * 100

print(f"Control:  {control_rate:.2%}")
print(f"Variant:  {variant_rate:.2%}")
print(f"Relative lift: {relative_lift:+.1f}%")
print(f"p-value:  {p_value:.4f}")
print(f"Result:   {'Statistically significant ✓' if p_value < 0.05 else 'Not significant ✗'}")
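A p-value alone hides how large the effect plausibly is. A confidence interval for the difference in conversion rates is more informative for stakeholders. This is a minimal sketch using the normal (Wald) approximation and only the standard library; statsmodels also provides `confint_proportions_2indep` for more robust methods:

```python
import math

def diff_ci(x1, n1, x2, n2, z=1.96):
    """95% Wald CI for the difference in proportions p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - z * se, d + z * se

# Same experiment results as above (variant vs control)
low, high = diff_ci(572, 10_380, 521, 10_420)
print(f"95% CI for the absolute lift: [{low:+.4f}, {high:+.4f}]")
```

With the figures above, the interval narrowly includes zero — consistent with the z-test not reaching significance — which is a far clearer message for a stakeholder than a bare p-value.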

Step 4 — Interpret Results Correctly

A p-value below 0.05 does not mean "there is a 95% probability the variant is better." It means: if there were truly no difference, we would see a result at least this extreme only 5% of the time by chance. Common misinterpretations cause bad decisions.

| Scenario | What it means | Action |
| --- | --- | --- |
| p < 0.05, positive lift | Statistically significant improvement | Ship if the effect is practically meaningful |
| p < 0.05, negative lift | Statistically significant regression | Do not ship; investigate the cause |
| p > 0.05 | Insufficient evidence of a difference | Continue the test or stop without rejecting the null |
| p < 0.05 but tiny lift (0.1%) | Statistically significant but practically meaningless | Business decision: is it worth the engineering cost? |
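The decision table above can be sketched as a small helper. This is purely illustrative — the function name, the messages, and the 1% practical-significance threshold are assumptions you would tailor to your own product:

```python
def launch_decision(p_value, relative_lift_pct,
                    min_meaningful_lift_pct=1.0, alpha=0.05):
    """Map test results to a launch recommendation (illustrative only)."""
    if p_value >= alpha:
        return "Insufficient evidence: continue the test or stop without shipping"
    if relative_lift_pct < 0:
        return "Significant regression: do not ship, investigate the cause"
    if relative_lift_pct < min_meaningful_lift_pct:
        return "Significant but tiny: weigh the lift against engineering cost"
    return "Significant improvement: ship"

print(launch_decision(p_value=0.01, relative_lift_pct=8.0))
```

Encoding the rules once keeps the team from re-litigating the decision criteria after every test.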

Common A/B Testing Mistakes

Peeking: Checking results daily and stopping when p < 0.05 inflates false positives dramatically. Run to the pre-calculated sample size.
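How badly does peeking inflate false positives? A small simulation makes it concrete. This sketch runs experiments under the null (both arms have the same true rate), checks the p-value at ten interim looks, and compares the resulting false-positive rate against a single final look; the simulation sizes are illustrative:

```python
import math
import random

def z_test_p(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test via the normal approximation."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    # Two-sided tail probability from the error function
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(42)
n_sims, n_peeks, batch, p_true = 500, 10, 200, 0.5

peeking_fp = final_fp = 0
for _ in range(n_sims):
    ca = cb = na = nb = 0
    peeked = False
    for _ in range(n_peeks):
        # Both arms draw from the SAME true rate: any "winner" is noise
        ca += sum(random.random() < p_true for _ in range(batch))
        cb += sum(random.random() < p_true for _ in range(batch))
        na += batch
        nb += batch
        if z_test_p(ca, na, cb, nb) < 0.05:
            peeked = True          # would have stopped here and "shipped"
    if peeked:
        peeking_fp += 1
    if z_test_p(ca, na, cb, nb) < 0.05:
        final_fp += 1              # single look at the planned sample size

print(f"False-positive rate with peeking: {peeking_fp / n_sims:.1%}")
print(f"False-positive rate, single look: {final_fp / n_sims:.1%}")
```

The single-look rate stays near the nominal 5%, while stopping at the first significant peek fires several times more often — on pure noise.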

Multiple testing: Testing 10 metrics simultaneously means one will appear significant by chance. Apply Bonferroni correction or define one primary metric.
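A minimal sketch of the Bonferroni correction (the p-values below are made up for illustration): divide alpha by the number of metrics tested, so ten tests at alpha = 0.05 each face a threshold of 0.005. statsmodels also offers `multipletests` with less conservative methods such as Holm.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag which p-values survive a Bonferroni correction."""
    adjusted_alpha = alpha / len(p_values)   # 0.05 / 10 = 0.005 here
    return [p < adjusted_alpha for p in p_values]

# Ten metrics tested at once: only the genuinely strong signal survives
p_values = [0.04, 0.20, 0.003, 0.51, 0.07, 0.30, 0.11, 0.85, 0.62, 0.048]
print(bonferroni(p_values))
```

Note that 0.04 and 0.048 — both "significant" in isolation — no longer clear the corrected threshold.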

Simpson's paradox: Aggregate results can reverse when broken down by segment. Always segment your results by device, user type, and acquisition channel.

Novelty effect: Users interact differently with new things simply because they're new. For some tests, wait 2–3 weeks for the effect to stabilise before drawing conclusions.

🎯 The business question isn't "is p < 0.05?" — it's "is the expected lift large enough to justify shipping this, given the development cost and risk?" Statistical significance is necessary but not sufficient for a launch decision.
Tags: Python · A/B Testing · Statistics · Product Analytics · scipy