Cohort Analysis with SQL

You've launched a new app. Users are signing up every day. But here's the million-dollar question: Are they coming back?

Cohort analysis overview with cohorts progressing across months

A single metric like "total active users" won't tell you the full story. You need to know: Are users who signed up in January still active in March? Are they more engaged than users who signed up in February?

This is where cohort analysis comes in—one of the most powerful techniques in product analytics. And the best part? You can do it entirely with SQL.

What is Cohort Analysis?

A cohort is a group of users who share a common characteristic during a specific time period. For example:

All users who signed up in January 2024.
All users who made their first purchase during Black Friday.

Cohort analysis tracks these groups over time to see how their behavior changes. The most common use case is retention analysis: measuring what percentage of users return after their first interaction.

Why Cohort Analysis Matters

Imagine you're running a SaaS product. You notice:

January cohort: 60% of users are still active after 3 months.
February cohort: Only 30% are still active after 3 months.

This tells you something changed between January and February. Maybe you released a buggy feature, or your onboarding flow broke. Without cohort analysis, you'd never catch this.

The Data Model

Let's use a simple e-commerce example. We have a user_activity table that tracks when users visit the site:

user_id	activity_date
1	2024-01-05
1	2024-01-12
1	2024-02-03
2	2024-01-10
2	2024-01-15

We want to answer: What percentage of users who joined in January came back in February?

Step 1: Identify the Cohort

First, we need to determine each user's cohort—the month they first became active.

SELECT 
  user_id,
  DATE_TRUNC('month', MIN(activity_date)) AS cohort_month
FROM user_activity
GROUP BY user_id;

This gives us:

user_id	cohort_month
1	2024-01-01
2	2024-01-01

Both users belong to the January 2024 cohort.

Step 2: Calculate Activity Periods

Next, for each activity, we calculate how many months have passed since the user's cohort month. This is called the period number.

WITH cohorts AS (
  SELECT 
    user_id,
    DATE_TRUNC('month', MIN(activity_date)) AS cohort_month
  FROM user_activity
  GROUP BY user_id
)
SELECT 
  ua.user_id,
  c.cohort_month,
  DATE_TRUNC('month', ua.activity_date) AS activity_month,
  (EXTRACT(YEAR FROM ua.activity_date) - EXTRACT(YEAR FROM c.cohort_month)) * 12 
    + (EXTRACT(MONTH FROM ua.activity_date) - EXTRACT(MONTH FROM c.cohort_month)) AS period_number
FROM user_activity ua
JOIN cohorts c ON ua.user_id = c.user_id;

This gives us:

user_id	cohort_month	activity_month	period_number
1	2024-01-01	2024-01-01	0
1	2024-01-01	2024-01-01	0
1	2024-01-01	2024-02-01	1
2	2024-01-01	2024-01-01	0

Period 0 = cohort month, Period 1 = 1 month later, etc.

Step 3: Build the Retention Matrix

Now we count how many users from each cohort were active in each period.

Interactive SQL

Reading the Results:

Cohort Month: When users first became active.
Period Number: Months since cohort month (0 = same month, 1 = next month, etc.).
Active Users: How many users from that cohort were active in that period.

For example:

2024-01-01, Period 0: 3 users (all January signups were active in January).
2024-01-01, Period 1: 2 users (2 out of 3 January signups came back in February).
2024-01-01, Period 2: 2 users (2 out of 3 came back in March).

Retention Rate = (Active Users in Period N) / (Active Users in Period 0) × 100%

So for the January cohort:

Month 1 Retention: 2/3 = 67%
Month 2 Retention: 2/3 = 67%

Visualizing Cohort Data

Retention heatmap showing cohort retention across Month 0, Month 1, and Month 2

In practice, you'd export this data to a tool like Excel, Google Sheets, or a BI platform to create a retention heatmap:

	Month 0	Month 1	Month 2
Jan 2024	100%	67%	67%
Feb 2024	100%	100%	100%

Green = high retention, Red = low retention.

Advanced: Calculating Cohort Metrics

Once you have the cohort structure, you can calculate other metrics:

Average Revenue Per Cohort

SELECT 
  c.cohort_month,
  AVG(o.order_total) AS avg_revenue
FROM orders o
JOIN cohorts c ON o.user_id = c.user_id
GROUP BY c.cohort_month;

Churn Rate

-- Users who were active in Period 0 but NOT in Period 1
SELECT 
  cohort_month,
  COUNT(DISTINCT CASE WHEN period_number = 0 THEN user_id END) AS period_0_users,
  COUNT(DISTINCT CASE WHEN period_number = 1 THEN user_id END) AS period_1_users,
  1.0 - (COUNT(DISTINCT CASE WHEN period_number = 1 THEN user_id END) * 1.0 / 
         COUNT(DISTINCT CASE WHEN period_number = 0 THEN user_id END)) AS churn_rate
FROM user_periods
GROUP BY cohort_month;

Best Practices

Choose the Right Cohort Window: Monthly cohorts are common, but you can also use weekly or daily cohorts for fast-moving products.
Define "Active" Clearly: Is a user "active" if they log in? Make a purchase? Engage with content? Be consistent.
Use CTEs for Readability: Cohort queries can get complex. Break them into steps with Common Table Expressions.
Automate with Views: Create a view for your cohort logic so analysts can query it easily.

Conclusion

Cohort analysis is essential for understanding user behavior over time. By grouping users based on when they joined and tracking their activity, you can:

Measure retention and identify drop-off points.
Compare cohorts to see the impact of product changes.
Predict churn and take action before users leave.

And with SQL, you have full control over the analysis—no need to wait for a third-party tool to add the feature you need.

Now go build some cohorts! 📊

You've launched a new app. Users are signing up every day. But here's the million-dollar question: Are they coming back?

This is where cohort analysis comes in—one of the most powerful techniques in product analytics. And the best part? You can do it entirely with SQL.

What is Cohort Analysis?

A cohort is a group of users who share a common characteristic during a specific time period. For example:

All users who signed up in January 2024.
All users who made their first purchase during Black Friday.

Why Cohort Analysis Matters

Imagine you're running a SaaS product. You notice:

January cohort: 60% of users are still active after 3 months.
February cohort: Only 30% are still active after 3 months.

This tells you something changed between January and February. Maybe you released a buggy feature, or your onboarding flow broke. Without cohort analysis, you'd never catch this.

The Data Model

Let's use a simple e-commerce example. We have a user_activity table that tracks when users visit the site:

user_id	activity_date
1	2024-01-05
1	2024-01-12
1	2024-02-03
2	2024-01-10
2	2024-01-15

We want to answer: What percentage of users who joined in January came back in February?

Step 1: Identify the Cohort

First, we need to determine each user's cohort—the month they first became active.

SELECT 
  user_id,
  DATE_TRUNC('month', MIN(activity_date)) AS cohort_month
FROM user_activity
GROUP BY user_id;

This gives us:

user_id	cohort_month
1	2024-01-01
2	2024-01-01

Both users belong to the January 2024 cohort.

Step 2: Calculate Activity Periods

Next, for each activity, we calculate how many months have passed since the user's cohort month. This is called the period number.

WITH cohorts AS (
  SELECT 
    user_id,
    DATE_TRUNC('month', MIN(activity_date)) AS cohort_month
  FROM user_activity
  GROUP BY user_id
)
SELECT 
  ua.user_id,
  c.cohort_month,
  DATE_TRUNC('month', ua.activity_date) AS activity_month,
  (EXTRACT(YEAR FROM ua.activity_date) - EXTRACT(YEAR FROM c.cohort_month)) * 12 
    + (EXTRACT(MONTH FROM ua.activity_date) - EXTRACT(MONTH FROM c.cohort_month)) AS period_number
FROM user_activity ua
JOIN cohorts c ON ua.user_id = c.user_id;

This gives us:

user_id	cohort_month	activity_month	period_number
1	2024-01-01	2024-01-01	0
1	2024-01-01	2024-01-01	0
1	2024-01-01	2024-02-01	1
2	2024-01-01	2024-01-01	0

Period 0 = cohort month, Period 1 = 1 month later, etc.

Step 3: Build the Retention Matrix

Now we count how many users from each cohort were active in each period.

Interactive SQL

Reading the Results:

Cohort Month: When users first became active.
Period Number: Months since cohort month (0 = same month, 1 = next month, etc.).
Active Users: How many users from that cohort were active in that period.

For example:

2024-01-01, Period 0: 3 users (all January signups were active in January).
2024-01-01, Period 1: 2 users (2 out of 3 January signups came back in February).
2024-01-01, Period 2: 2 users (2 out of 3 came back in March).

Retention Rate = (Active Users in Period N) / (Active Users in Period 0) × 100%

So for the January cohort:

Month 1 Retention: 2/3 = 67%
Month 2 Retention: 2/3 = 67%

Visualizing Cohort Data

In practice, you'd export this data to a tool like Excel, Google Sheets, or a BI platform to create a retention heatmap:

	Month 0	Month 1	Month 2
Jan 2024	100%	67%	67%
Feb 2024	100%	100%	100%

Green = high retention, Red = low retention.

Advanced: Calculating Cohort Metrics

Once you have the cohort structure, you can calculate other metrics:

Average Revenue Per Cohort

SELECT 
  c.cohort_month,
  AVG(o.order_total) AS avg_revenue
FROM orders o
JOIN cohorts c ON o.user_id = c.user_id
GROUP BY c.cohort_month;

Churn Rate

-- Users who were active in Period 0 but NOT in Period 1
SELECT 
  cohort_month,
  COUNT(DISTINCT CASE WHEN period_number = 0 THEN user_id END) AS period_0_users,
  COUNT(DISTINCT CASE WHEN period_number = 1 THEN user_id END) AS period_1_users,
  1.0 - (COUNT(DISTINCT CASE WHEN period_number = 1 THEN user_id END) * 1.0 / 
         COUNT(DISTINCT CASE WHEN period_number = 0 THEN user_id END)) AS churn_rate
FROM user_periods
GROUP BY cohort_month;

Best Practices

Choose the Right Cohort Window: Monthly cohorts are common, but you can also use weekly or daily cohorts for fast-moving products.
Define "Active" Clearly: Is a user "active" if they log in? Make a purchase? Engage with content? Be consistent.
Use CTEs for Readability: Cohort queries can get complex. Break them into steps with Common Table Expressions.
Automate with Views: Create a view for your cohort logic so analysts can query it easily.

Conclusion

Cohort analysis is essential for understanding user behavior over time. By grouping users based on when they joined and tracking their activity, you can:

Measure retention and identify drop-off points.
Compare cohorts to see the impact of product changes.
Predict churn and take action before users leave.

And with SQL, you have full control over the analysis—no need to wait for a third-party tool to add the feature you need.

Now go build some cohorts! 📊

Cohort Analysis with SQL

What is Cohort Analysis?

Why Cohort Analysis Matters

The Data Model

Step 1: Identify the Cohort

Step 2: Calculate Activity Periods

Step 3: Build the Retention Matrix

Visualizing Cohort Data

Advanced: Calculating Cohort Metrics

Average Revenue Per Cohort

Churn Rate

Best Practices

Conclusion

Related Articles

SQL for Data Analysis: The Ultimate Guide

Calculating Percentiles and Median in SQL

Ranking Data with SQL: RANK, DENSE_RANK, and ROW_NUMBER Explained

Comments

Cohort Analysis with SQL

What is Cohort Analysis?

Why Cohort Analysis Matters

The Data Model

Step 1: Identify the Cohort

Step 2: Calculate Activity Periods

Step 3: Build the Retention Matrix

Visualizing Cohort Data

Advanced: Calculating Cohort Metrics

Average Revenue Per Cohort

Churn Rate

Best Practices

Conclusion

Related Articles

SQL for Data Analysis: The Ultimate Guide

Calculating Percentiles and Median in SQL

Ranking Data with SQL: RANK, DENSE_RANK, and ROW_NUMBER Explained

Comments