Python Retention Product Analytics 2026-03-10

Cohort Analysis & User Retention in Python 2026: Complete Guide

Retention is the single most important metric for any subscription or repeat-purchase business. Here's how to build a full cohort analysis from raw transaction data using Python — with code you can use today.

Isachenko Andrii
Data Analyst · Open to work

📋 Table of Contents

  1. What is cohort analysis and why it matters
  2. Understanding the retention matrix
  3. Data preparation in pandas
  4. Building the retention matrix step by step
  5. Visualising with a heatmap
  6. Calculating churn rate
  7. How to interpret and present the results

What is Cohort Analysis and Why it Matters

A cohort is a group of users who share a common characteristic within a defined time window — most commonly, the month they first made a purchase or registered. Cohort analysis tracks what percentage of each cohort returns in subsequent months.

Without cohort analysis, aggregate retention metrics lie to you. An overall "monthly active users" chart can show growth even as your product is getting worse — because new user acquisition masks increasing churn from earlier cohorts. Cohort analysis reveals the truth: are users who joined 6 months ago still active? Is retention improving or deteriorating across cohorts over time?
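To make the masking effect concrete, here is a minimal sketch with made-up numbers (the sign-up counts and retention rates are hypothetical, and the model assumes a user is active only in their sign-up month and, if retained, the following month). Monthly active users rise every month even though each new cohort retains worse than the one before it.

```python
# Hypothetical figures, chosen only to illustrate the masking effect.
new_users = [1000, 1500, 2250, 3400, 5100]          # sign-ups per month
month1_retention = [0.50, 0.45, 0.40, 0.35, 0.30]   # share of each cohort active one month later

# Simplifying assumption: a user is active in the sign-up month and,
# if retained, in the following month only.
mau = []
for m, signups in enumerate(new_users):
    returning = new_users[m - 1] * month1_retention[m - 1] if m > 0 else 0
    mau.append(round(signups + returning))

print(mau)  # → [1000, 2000, 2925, 4300, 6290]
```

The aggregate series grows more than sixfold while Month-1 retention falls from 50% to 30%, which is the pattern a cohort view would expose.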

💡 If your Month-1 retention is below 20% for a consumer app, growth will eventually plateau regardless of acquisition spend. Cohort analysis is how you discover this early.

Understanding the Retention Matrix

The output of cohort analysis is a retention matrix: a table where rows are cohorts (acquisition month), columns are time periods (months since acquisition), and each cell shows the percentage of the original cohort still active in that period. Each anti-diagonal of the matrix (the cells where cohort month plus months since acquisition lands on the same date) represents a single calendar month observed from different cohorts.

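Cohort   | Size  | Month 0 | Month 1 | Month 2 | Month 3 | Month 4 | Month 5
Jan 2025 | 1,240 | 100%    | 41%     | 28%     | 22%     | 18%     | 16%
Feb 2025 |   980 | 100%    | 46%     | 31%     | 25%     | 20%     |
Mar 2025 | 1,560 | 100%    | 52%     | 35%     | 27%     |         |
Apr 2025 | 1,180 | 100%    | 58%     | 38%     |         |         |
May 2025 | 1,420 | 100%    | 61%     |         |         |         |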

Notice how Month-1 retention is improving across cohorts (41% → 61%). This tells us that whatever changes were made to the product or onboarding between January and May are working — more users return after the first month. This is exactly the insight that raw aggregate metrics would hide.
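The two ways of reading the matrix can be sketched in pandas. The block below rebuilds the example table by hand (the variable names `example`, `month1` and `may_view` are illustrative, not part of any library) and reads it once down a column and once along an anti-diagonal.

```python
import numpy as np
import pandas as pd

# Retention percentages copied from the example table above.
example = pd.DataFrame(
    [
        [100, 41, 28, 22, 18, 16],
        [100, 46, 31, 25, 20, np.nan],
        [100, 52, 35, 27, np.nan, np.nan],
        [100, 58, 38, np.nan, np.nan, np.nan],
        [100, 61, np.nan, np.nan, np.nan, np.nan],
    ],
    index=pd.period_range("2025-01", periods=5, freq="M"),
    columns=range(6),
)

# Reading down a column compares cohorts at the same age.
month1 = example[1]
print(month1.diff().dropna())  # change in Month-1 retention, cohort to cohort

# Reading along an anti-diagonal fixes the calendar month instead:
# cohort month + period number = calendar month.
calendar_month = pd.Period("2025-05", freq="M")
may_view = {
    str(cohort): row[(calendar_month - cohort).n]
    for cohort, row in example.iterrows()
}
print(may_view)
```

The column view answers "are newer cohorts retaining better?", while the anti-diagonal view shows how every cohort behaved during one calendar month, which is useful when checking whether a release or outage in that month affected everyone.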

Data Preparation in Pandas

The starting point is a table of transactions or events with at least two columns: user_id and order_date. We need to derive two things for each row: the user's cohort month (the month of their first purchase) and the month in which that particular order was placed.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load transaction data
    df = pd.read_csv('transactions.csv', parse_dates=['order_date'])

    # Truncate each order date to its calendar month (a monthly Period)
    df['order_month'] = df['order_date'].dt.to_period('M')

    # Each user's cohort = their FIRST purchase month
    df['cohort_month'] = df.groupby('user_id')['order_month'].transform('min')

    # Period index: months since first purchase (0, 1, 2, ...)
    df['period_number'] = (
        df['order_month'] - df['cohort_month']
    ).apply(lambda x: x.n)

    df.head(3)
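A quick way to check the preparation logic is to run it on a tiny hand-made table where the answer is obvious. The data below is hypothetical (three users, six orders); the column names match the ones used above.

```python
import pandas as pd

# Hypothetical toy data: three users, six orders.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 3],
        "order_date": pd.to_datetime(
            ["2025-01-05", "2025-01-20", "2025-03-02",
             "2025-02-11", "2025-03-30", "2025-03-15"]
        ),
    }
)

# Same three derivations as in the main snippet.
toy["order_month"] = toy["order_date"].dt.to_period("M")
toy["cohort_month"] = toy.groupby("user_id")["order_month"].transform("min")
toy["period_number"] = (
    toy["order_month"] - toy["cohort_month"]
).apply(lambda x: x.n)

print(toy[["user_id", "order_month", "cohort_month", "period_number"]])
```

User 1 belongs to the January cohort and their March order lands in period 2; user 2 belongs to February and returns in period 1; user 3 only appears in period 0. Two orders in the same month (user 1 in January) share a period number, which is why the next step counts unique users rather than rows.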

Building the Retention Matrix Step by Step

    # Step 1: Count unique active users per cohort × period
    cohort_data = (
        df.groupby(['cohort_month', 'period_number'])['user_id']
        .nunique()
        .reset_index()
    )
    cohort_data.columns = ['cohort_month', 'period_number', 'n_users']

    # Step 2: Pivot into matrix format
    cohort_matrix = cohort_data.pivot_table(
        index='cohort_month',
        columns='period_number',
        values='n_users'
    )

    # Step 3: Cohort sizes = column 0 (Month 0)
    cohort_sizes = cohort_matrix.iloc[:, 0]

    # Step 4: Divide each row by its cohort size → retention %
    retention_matrix = cohort_matrix.divide(cohort_sizes, axis=0) * 100
    retention_matrix = retention_matrix.round(1)

    # Rename columns for clarity
    retention_matrix.columns = [f'Month {i}' for i in retention_matrix.columns]

    print(retention_matrix)

Visualising with a Heatmap

The retention matrix becomes dramatically more readable as a colour-coded heatmap. High retention values appear dark blue, low values appear light — making trends immediately visible to any stakeholder.

    fig, ax = plt.subplots(figsize=(14, 7))

    sns.heatmap(
        retention_matrix,
        annot=True,            # show % values in each cell
        fmt='.1f',             # one decimal place
        cmap='Blues',          # blue gradient
        vmin=0, vmax=100,      # fix scale to 0-100%
        linewidths=0.5,
        linecolor='white',
        ax=ax,
        cbar_kws={'label': 'Retention %'}
    )

    ax.set_title('Monthly Cohort Retention', fontsize=16, fontweight='bold', pad=15)
    ax.set_xlabel('Months Since First Purchase', fontsize=12)
    ax.set_ylabel('Cohort (First Purchase Month)', fontsize=12)
    ax.set_yticklabels(ax.get_yticklabels(), rotation=0)

    plt.tight_layout()
    plt.savefig('cohort_retention_heatmap.png', dpi=150, bbox_inches='tight')
    plt.show()

Calculating Churn Rate

Churn is the complement of retention: the percentage of users who did not return. Cumulative churn for a cohort at period N is simply 100 - retention_at_N. The more useful metric is usually incremental churn: the net percentage of users lost between two consecutive periods, measured relative to the earlier period.

    # Month-over-month churn within each cohort
    churn_matrix = retention_matrix.copy()

    for col in range(1, len(retention_matrix.columns)):
        prev = retention_matrix.iloc[:, col - 1]
        curr = retention_matrix.iloc[:, col]
        # % of previous-period users who churned
        churn_matrix.iloc[:, col] = ((prev - curr) / prev * 100).round(1)

    churn_matrix['Month 0'] = np.nan  # no churn at acquisition

    print(churn_matrix)

    # Average retention curve across all cohorts
    avg_retention = retention_matrix.mean(axis=0).round(1)
    print("\nAverage retention curve:")
    print(avg_retention)
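Note that avg_retention above is a simple mean, so a cohort of 980 users counts as much as one of 1,560. One possible refinement, sketched here as an assumption rather than as part of the original method, is to weight each cohort by its size and to count only the cohorts that have actually reached a given period. The figures are copied from the example table earlier in the article; `sizes`, `simple_avg` and `weighted_avg` are illustrative names.

```python
import numpy as np
import pandas as pd

# Retention % and cohort sizes copied from the example table in this article.
retention = pd.DataFrame(
    [
        [100, 41, 28, 22, 18, 16],
        [100, 46, 31, 25, 20, np.nan],
        [100, 52, 35, 27, np.nan, np.nan],
        [100, 58, 38, np.nan, np.nan, np.nan],
        [100, 61, np.nan, np.nan, np.nan, np.nan],
    ],
    index=["2025-01", "2025-02", "2025-03", "2025-04", "2025-05"],
    columns=[f"Month {i}" for i in range(6)],
)
sizes = pd.Series([1240, 980, 1560, 1180, 1420], index=retention.index)

# Unweighted mean: every cohort counts equally (NaN cells are skipped).
simple_avg = retention.mean(axis=0).round(1)

# Size-weighted mean: only cohorts that have reached a period contribute.
observed = retention.notna()
weighted_avg = (
    retention.mul(sizes, axis=0).sum(axis=0)
    / observed.mul(sizes, axis=0).sum(axis=0)
).round(1)

print(pd.DataFrame({"simple": simple_avg, "weighted": weighted_avg}))
# Month 1: simple 51.6 vs weighted 52.1
```

Either way, the later columns rest on fewer cohorts (Month 5 here is the January cohort alone), so the tail of an average retention curve deserves less confidence than its head.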

How to Interpret and Present the Results

A retention matrix is only useful if you can translate the numbers into business decisions. Here's a framework for interpretation:

Pattern you see | What it means | Action
Month-1 retention improving across cohorts | Onboarding or product improvements are working | Double down on what changed
Month-1 retention flat but Month-3 declining | Users start but lose habit over time | Improve re-engagement / habit loops
One cohort much worse than neighbours | Bad acquisition channel, bad campaign, or product bug that month | Investigate that specific period
Retention stabilises after Month 3 (~15%+) | You have a healthy retained core user base | Focus on growing this segment
Retention drops to near 0 by Month 2 | Product-market fit problem | Talk to churned users, redesign core loop
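The "one cohort much worse than neighbours" pattern can be screened for automatically. The sketch below is one possible heuristic, not a standard method: the function name `flag_weak_cohorts` and the 10-point threshold are assumptions made for illustration, and the input values are hypothetical.

```python
import pandas as pd

def flag_weak_cohorts(month1: pd.Series, max_gap: float = 10.0) -> pd.Series:
    """Return cohorts whose Month-1 retention is more than `max_gap`
    percentage points below the mean of their immediate neighbours."""
    # For the first and last cohort only one neighbour exists; mean() skips the NaN.
    neighbours = pd.concat(
        [month1.shift(1), month1.shift(-1)], axis=1
    ).mean(axis=1)
    gap = neighbours - month1
    return month1[gap > max_gap]

# Hypothetical Month-1 retention values; the March cohort is deliberately poor.
month1 = pd.Series(
    [41.0, 46.0, 30.0, 58.0, 61.0],
    index=["2025-01", "2025-02", "2025-03", "2025-04", "2025-05"],
)

print(flag_weak_cohorts(month1))
```

A flag like this only tells you where to look. The follow-up is still the investigation named in the table: acquisition channel mix, campaigns, and releases for that specific month.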

Presenting to stakeholders

When presenting cohort retention to a non-technical audience, lead with the business implication, not the methodology. Instead of "Month-1 retention increased from 41% to 61% across January–May cohorts", say: "More than half of new customers now return for a second purchase in the month after their first order, up from 4 in 10 at the start of the year. This directly reduces our customer acquisition cost payback period."

🎯 The analyst's job isn't to produce a heatmap — it's to answer the question: "Is our product getting better or worse at keeping users?" Cohort analysis is the most reliable way to answer it.

Complete script — copy and run

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns


    def build_cohort_retention(df, user_col='user_id', date_col='order_date'):
        """
        Build a monthly cohort retention matrix from transaction data.

        df must contain the columns named by user_col and date_col
        (date_col must be a datetime column).

        Returns: (retention, sizes)
            retention: DataFrame of retention % per cohort and period
            sizes: Series of cohort sizes (unique users in period 0)
        """
        df = df.copy()
        df['order_month'] = df[date_col].dt.to_period('M')
        df['cohort_month'] = df.groupby(user_col)['order_month'].transform('min')
        df['period'] = (df['order_month'] - df['cohort_month']).apply(lambda x: x.n)

        # Unique active users per cohort and period, pivoted into a matrix
        matrix = (
            df.groupby(['cohort_month', 'period'])[user_col]
            .nunique()
            .unstack()
        )

        sizes = matrix.iloc[:, 0]
        retention = (matrix.divide(sizes, axis=0) * 100).round(1)
        retention.columns = [f'M{c}' for c in retention.columns]

        return retention, sizes


    # Usage:
    # df = pd.read_csv('your_data.csv', parse_dates=['order_date'])
    # retention, sizes = build_cohort_retention(df)
    # sns.heatmap(retention, annot=True, fmt='.1f', cmap='Blues', vmin=0, vmax=100)
Tags: Python Retention Cohort Analysis Pandas Product Analytics Churn