Cohort Analysis & Retention in Python 2026: Full Guide

Retention is the single most important metric for any subscription or repeat-purchase business. Here's how to build a full cohort analysis from raw transaction data using Python — with code you can use today.

What is Cohort Analysis and Why it Matters

A cohort is a group of users who share a common characteristic within a defined time window — most commonly, the month they first made a purchase or registered. Cohort analysis tracks what percentage of each cohort returns in subsequent months.

Without cohort analysis, aggregate retention metrics lie to you. An overall "monthly active users" chart can show growth even as your product is getting worse — because new user acquisition masks increasing churn from earlier cohorts. Cohort analysis reveals the truth: are users who joined 6 months ago still active? Is retention improving or deteriorating across cohorts over time?
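To make the masking effect concrete, here is a minimal sketch with invented numbers (they are not taken from any real product). It assumes a simplified world in which only the previous month's cohort can return; the names `new_users`, `month1_retention` and `active` are illustrative.

```python
import pandas as pd

# Hypothetical numbers, invented purely for illustration.
# Each month's new cohort is larger, but a smaller share returns the next month.
months = ['M1', 'M2', 'M3', 'M4']
new_users = pd.Series([1000, 1500, 2200, 3200], index=months)
month1_retention = pd.Series([0.50, 0.40, 0.30, 0.20], index=months)

# Active users in a month = that month's new users + users returning
# from the previous month's cohort (older cohorts are ignored for simplicity).
returning = (new_users * month1_retention).shift(1).fillna(0)
active = new_users + returning

summary = pd.DataFrame({
    'new_users': new_users,
    'month1_retention': month1_retention,
    'returning': returning,
    'active_users': active,
})
print(summary)
```

In this toy setup the active-user count rises every month even though Month-1 retention falls from 50% to 20%, which is the pattern an aggregate chart would hide.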

💡 As a rough rule of thumb, if Month-1 retention is below 20% for a consumer app, growth will eventually plateau regardless of acquisition spend. Cohort analysis is how you discover this early.

Understanding the Retention Matrix

The output of cohort analysis is a retention matrix: a table where rows are cohorts (acquisition month), columns are time periods (months since acquisition), and each cell shows the percentage of the original cohort still active in that period. Because later cohorts have had less time to be observed, the matrix is triangular. Each bottom-left-to-top-right diagonal corresponds to a single calendar month observed from different cohorts (for example, Month 4 of the January cohort and Month 0 of the May cohort are both May).

Cohort	Size	Month 0	Month 1	Month 2	Month 3	Month 4	Month 5
Jan 2025	1,240	100%	41%	28%	22%	18%	16%
Feb 2025	980	100%	46%	31%	25%	20%	—
Mar 2025	1,560	100%	52%	35%	27%	—	—
Apr 2025	1,180	100%	58%	38%	—	—	—
May 2025	1,420	100%	61%	—	—	—	—

Notice how Month-1 retention improves from cohort to cohort (41% → 61%). This suggests that the changes made to the product or onboarding between January and May are working: more users return in the month after their first purchase. This is exactly the kind of insight that raw aggregate metrics would hide.
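Cells that belong to the same calendar month sit on a diagonal running from bottom-left to top-right. The sketch below rebuilds the example table and pulls out one such diagonal; `calendar_month_slice` is a hypothetical helper written for this illustration, not part of pandas.

```python
import numpy as np
import pandas as pd

# Retention values copied from the example table above (percent).
retention = pd.DataFrame(
    [[100, 41, 28, 22, 18, 16],
     [100, 46, 31, 25, 20, np.nan],
     [100, 52, 35, 27, np.nan, np.nan],
     [100, 58, 38, np.nan, np.nan, np.nan],
     [100, 61, np.nan, np.nan, np.nan, np.nan]],
    index=pd.period_range('2025-01', periods=5, freq='M'),
    columns=range(6),
)

def calendar_month_slice(matrix, month):
    """Return the retention of every cohort as observed in one calendar month."""
    target = pd.Period(month, freq='M')
    out = {}
    for cohort in matrix.index:
        offset = (target - cohort).n  # months between cohort start and target
        if offset in matrix.columns:  # skips cohorts that start after the target
            out[str(cohort)] = matrix.loc[cohort, offset]
    return pd.Series(out, name=str(target))

may_view = calendar_month_slice(retention, '2025-05')
print(may_view)
```

Reading a diagonal this way is useful when something happened in a specific calendar month (an outage, a price change) and you want to see how it affected every cohort at once.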

Data Preparation in Pandas

The starting point is a table of transactions or events with at least two columns: user_id and order_date. We need to derive two things for each row: the user's cohort month (the month of their first purchase) and the month in which that order was placed.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load transaction data
df = pd.read_csv('transactions.csv', parse_dates=['order_date'])

# Truncate each order date to its calendar month (stored as a pandas Period)
df['order_month'] = df['order_date'].dt.to_period('M')

# Each user's cohort = their FIRST purchase month
df['cohort_month'] = df.groupby('user_id')['order_month'].transform('min')

# Period index: months since first purchase (0, 1, 2, ...)
df['period_number'] = (
    df['order_month'] - df['cohort_month']
).apply(lambda x: x.n)

df.head(3)
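To check the logic before running it on real data, it can help to trace a tiny hand-made dataset. The sketch below uses invented orders for two users and computes the period number with plain integer arithmetic on year and month, an alternative to reading `.n` from the Period difference; the `toy` DataFrame is purely illustrative.

```python
import pandas as pd

# Tiny hypothetical dataset: user 1 buys in Jan, Feb and Apr; user 2 in Feb and Mar.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_date': pd.to_datetime(
        ['2025-01-15', '2025-02-03', '2025-04-20', '2025-02-10', '2025-03-01']
    ),
})

toy['order_month'] = toy['order_date'].dt.to_period('M')
toy['cohort_month'] = toy.groupby('user_id')['order_month'].transform('min')

# Months since first purchase, computed from year and month components.
toy['period_number'] = (
    (toy['order_month'].dt.year - toy['cohort_month'].dt.year) * 12
    + (toy['order_month'].dt.month - toy['cohort_month'].dt.month)
)

print(toy[['user_id', 'order_month', 'cohort_month', 'period_number']])
```

User 1 skips March, so their April order lands in period 3 rather than period 2: the period number measures calendar distance from the first purchase, not the count of orders.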

Building the Retention Matrix Step by Step

# Step 1: Count unique active users per cohort × period
cohort_data = (
    df.groupby(['cohort_month', 'period_number'])['user_id']
    .nunique()
    .reset_index()
)
cohort_data.columns = ['cohort_month', 'period_number', 'n_users']

# Step 2: Pivot into matrix format
cohort_matrix = cohort_data.pivot_table(
    index='cohort_month',
    columns='period_number',
    values='n_users'
)

# Step 3: Cohort sizes = column 0 (Month 0)
cohort_sizes = cohort_matrix.iloc[:, 0]

# Step 4: Divide each row by its cohort size → retention %
retention_matrix = cohort_matrix.divide(cohort_sizes, axis=0) * 100
retention_matrix = retention_matrix.round(1)

# Rename columns for clarity
retention_matrix.columns = [f'Month {i}' for i in retention_matrix.columns]

print(retention_matrix)
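The example table earlier in the article includes a Size column, which the steps above do not produce. One possible way to attach it is sketched here with invented counts; `counts` stands in for the pivoted user-count matrix and is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Hypothetical active-user counts per cohort and period (not real data).
counts = pd.DataFrame(
    {0: [200, 150], 1: [90, 75], 2: [60, np.nan]},
    index=pd.PeriodIndex(['2025-01', '2025-02'], freq='M', name='cohort_month'),
)

sizes = counts[0]  # Month-0 actives = cohort size
retention = counts.div(sizes, axis=0).mul(100).round(1)
retention.columns = [f'Month {i}' for i in retention.columns]

# Put the absolute cohort size in front of the percentages.
retention.insert(0, 'Size', sizes)
print(retention)
```

Showing the size next to the percentages helps readers judge how much weight to give a small cohort. If you pass the table to a heatmap, drop the Size column first so it does not distort the colour scale.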

Visualising with a Heatmap

The retention matrix becomes dramatically more readable as a colour-coded heatmap. High retention values appear dark blue, low values appear light — making trends immediately visible to any stakeholder.

fig, ax = plt.subplots(figsize=(14, 7))

sns.heatmap(
    retention_matrix,
    annot=True,          # show % values in each cell
    fmt='.1f',           # one decimal place
    cmap='Blues',        # blue gradient
    vmin=0, vmax=100,   # fix scale to 0-100%
    linewidths=0.5,
    linecolor='white',
    ax=ax,
    cbar_kws={'label': 'Retention %'}
)

ax.set_title('Monthly Cohort Retention', fontsize=16, fontweight='bold', pad=15)
ax.set_xlabel('Months Since First Purchase', fontsize=12)
ax.set_ylabel('Cohort (First Purchase Month)', fontsize=12)
ax.set_yticklabels(ax.get_yticklabels(), rotation=0)

plt.tight_layout()
plt.savefig('cohort_retention_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()

Calculating Churn Rate

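Churn is the complement of retention: the percentage of users who did not return. Cumulative churn for a cohort at period N is simply 100 - retention_at_N. The more useful metric is incremental churn: the share of users active in one period who are gone in the next. For the Jan 2025 cohort in the table above, retention falls from 41% in Month 1 to 28% in Month 2, so incremental churn in Month 2 is (41 - 28) / 41 ≈ 31.7%.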

# Month-over-month churn within each cohort
churn_matrix = retention_matrix.copy()

for col in range(1, len(retention_matrix.columns)):
    prev = retention_matrix.iloc[:, col - 1]
    curr = retention_matrix.iloc[:, col]
    # % of previous-period users who churned
    churn_matrix.iloc[:, col] = ((prev - curr) / prev * 100).round(1)

churn_matrix['Month 0'] = np.nan  # no churn at acquisition
print(churn_matrix)

# Unweighted average retention curve across all cohorts (empty cells are skipped)
avg_retention = retention_matrix.mean(axis=0).round(1)
print("\nAverage retention curve:")
print(avg_retention)
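A plain mean gives a 980-user cohort the same influence as a 1,560-user one. If that matters for your use case, a size-weighted average is one alternative. The sketch below uses the Month 1 and Month 2 values and cohort sizes from the example table; `sizes`, `weights` and `weighted_avg` are names chosen for this illustration.

```python
import numpy as np
import pandas as pd

cohorts = ['Jan 2025', 'Feb 2025', 'Mar 2025', 'Apr 2025', 'May 2025']

# Retention (%) and cohort sizes copied from the example table.
retention = pd.DataFrame(
    {'Month 1': [41, 46, 52, 58, 61], 'Month 2': [28, 31, 35, 38, np.nan]},
    index=cohorts,
)
sizes = pd.Series([1240, 980, 1560, 1180, 1420], index=cohorts)

# Weight of each cell = cohort size, or 0 where the cell is not observed yet.
weights = retention.notna().astype(int).mul(sizes, axis=0)
weighted_avg = (retention.fillna(0) * weights).sum() / weights.sum()

comparison = pd.DataFrame({
    'unweighted': retention.mean(),
    'size_weighted': weighted_avg,
}).round(1)
print(comparison)
```

The two curves differ only slightly here because the cohorts are of similar size; the gap grows when one cohort is much larger than the rest.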

How to Interpret and Present the Results

A retention matrix is only useful if you can translate the numbers into business decisions. Here's a framework for interpretation:

Pattern you see	What it means	Action
Month-1 retention improving across cohorts	Onboarding or product improvements are working	Double down on what changed
Month-1 retention flat but Month-3 declining	Users start but lose habit over time	Improve re-engagement / habit loops
One cohort much worse than neighbours	Bad acquisition channel, bad campaign, or product bug that month	Investigate that specific period
Retention stabilises after Month 3 (~15%+)	You have a healthy retained core user base	Focus on growing this segment
Retention drops to near 0 by Month 2	Product-market fit problem	Talk to churned users, redesign core loop
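The "one cohort much worse than neighbours" pattern can be screened for automatically. This is a rough sketch under stated assumptions: the April value is invented to create an outlier, and the 10-point threshold is arbitrary rather than a recommended standard.

```python
import pandas as pd

# Hypothetical Month-1 retention (%) per cohort; April is invented as an outlier.
month1 = pd.Series(
    [41.0, 46.0, 52.0, 33.0, 61.0],
    index=['Jan 2025', 'Feb 2025', 'Mar 2025', 'Apr 2025', 'May 2025'],
)

# Average of the previous and next cohort; edge cohorts use their single neighbour.
neighbours = pd.concat([month1.shift(1), month1.shift(-1)], axis=1).mean(axis=1)
gap = month1 - neighbours

THRESHOLD = 10.0  # percentage points; an arbitrary cut-off for this sketch
flagged = gap[gap < -THRESHOLD]
print(flagged)
```

A flagged cohort is only a prompt to investigate that period, not a diagnosis; the cause could be any of the ones listed in the table.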

Presenting to stakeholders

When presenting cohort retention to a non-technical audience, lead with the business implication, not the methodology. Instead of "Month-1 retention increased from 41% to 61% across January–May cohorts", say: "More than half of new customers now come back for another purchase in the month after their first, up from 4 in 10 at the start of the year. This directly shortens our customer acquisition cost payback period."

🎯 The analyst's job isn't to produce a heatmap — it's to answer the question: "Is our product getting better or worse at keeping users?" Cohort analysis is the most reliable way to answer it.

Complete script — copy and run

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def build_cohort_retention(df, user_col='user_id', date_col='order_date'):
    """
    Build a monthly cohort retention matrix from transaction data.
    df must contain the columns named by user_col and date_col (date_col as datetime).
    Returns: (retention, sizes), i.e. the retention matrix in % and the Month-0 size of each cohort.
    """
    df = df.copy()
    df['order_month']  = df[date_col].dt.to_period('M')
    df['cohort_month'] = df.groupby(user_col)['order_month'].transform('min')
    df['period']       = (df['order_month'] - df['cohort_month']).apply(lambda x: x.n)

    matrix = (
        df.groupby(['cohort_month', 'period'])[user_col]
        .nunique()
        .unstack()
    )
    sizes = matrix.iloc[:, 0]
    retention = (matrix.divide(sizes, axis=0) * 100).round(1)
    retention.columns = [f'M{c}' for c in retention.columns]
    return retention, sizes

# Usage:
# df = pd.read_csv('your_data.csv', parse_dates=['order_date'])
# retention, sizes = build_cohort_retention(df)
# sns.heatmap(retention, annot=True, fmt='.1f', cmap='Blues', vmin=0, vmax=100)
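Real transaction exports often contain rows with a missing user or an unparseable date, and those rows can break the month arithmetic. A small cleaning pass before building the matrix is one way to guard against that. `clean_transactions` is a hypothetical helper written for this example, and the `raw` DataFrame is invented.

```python
import pandas as pd

def clean_transactions(df, user_col='user_id', date_col='order_date'):
    """Hypothetical helper: coerce dates and drop rows the cohort logic cannot use."""
    out = df.copy()
    # Unparseable dates become NaT instead of raising an error.
    out[date_col] = pd.to_datetime(out[date_col], errors='coerce')
    # A row needs both a user and a valid date to belong to a cohort.
    out = out.dropna(subset=[user_col, date_col])
    return out.reset_index(drop=True)

# Invented input with one bad date and one missing user.
raw = pd.DataFrame({
    'user_id': [1, 2, None, 3],
    'order_date': ['2025-01-05', 'not a date', '2025-02-01', '2025-02-14'],
})

clean = clean_transactions(raw)
print(f'kept {len(clean)} of {len(raw)} rows')
print(clean)
```

Logging how many rows were dropped is worth keeping: a large share of discarded rows usually points to an upstream data problem rather than something to silently filter away.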
