You've launched a new app. Users are signing up every day. But here's the million-dollar question: Are they coming back?
A single metric like "total active users" won't tell you the full story. You need to know: Are users who signed up in January still active in March? Are they more engaged than users who signed up in February?
This is where cohort analysis comes in—one of the most powerful techniques in product analytics. And the best part? You can do it entirely with SQL.
What is Cohort Analysis?
A cohort is a group of users who share a common characteristic during a specific time period. For example:
- All users who signed up in January 2024.
- All users who made their first purchase during Black Friday.
Cohort analysis tracks these groups over time to see how their behavior changes. The most common use case is retention analysis: measuring what percentage of users return after their first interaction.
Why Cohort Analysis Matters
Imagine you're running a SaaS product. You notice:
- January cohort: 60% of users are still active after 3 months.
- February cohort: Only 30% are still active after 3 months.
This tells you something changed between January and February. Maybe you released a buggy feature, or your onboarding flow broke. Without cohort analysis, you'd never catch this.
The Data Model
Let's use a simple e-commerce example. We have a user_activity table that tracks when users visit the site:
| user_id | activity_date |
|---|---|
| 1 | 2024-01-05 |
| 1 | 2024-01-12 |
| 1 | 2024-02-03 |
| 2 | 2024-01-10 |
| 2 | 2024-01-15 |
We want to answer: What percentage of users who joined in January came back in February?
Step 1: Identify the Cohort
First, we need to determine each user's cohort—the month they first became active.
SELECT
user_id,
DATE_TRUNC('month', MIN(activity_date)) AS cohort_month
FROM user_activity
GROUP BY user_id;
This gives us:
| user_id | cohort_month |
|---|---|
| 1 | 2024-01-01 |
| 2 | 2024-01-01 |
Both users belong to the January 2024 cohort.
Step 2: Calculate Activity Periods
Next, for each activity, we calculate how many months have passed since the user's cohort month. This is called the period number.
WITH cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(activity_date)) AS cohort_month
FROM user_activity
GROUP BY user_id
)
SELECT
ua.user_id,
c.cohort_month,
DATE_TRUNC('month', ua.activity_date) AS activity_month,
(EXTRACT(YEAR FROM ua.activity_date) - EXTRACT(YEAR FROM c.cohort_month)) * 12
+ (EXTRACT(MONTH FROM ua.activity_date) - EXTRACT(MONTH FROM c.cohort_month)) AS period_number
FROM user_activity ua
JOIN cohorts c ON ua.user_id = c.user_id;
This gives us:
| user_id | cohort_month | activity_month | period_number |
|---|---|---|---|
| 1 | 2024-01-01 | 2024-01-01 | 0 |
| 1 | 2024-01-01 | 2024-01-01 | 0 |
| 1 | 2024-01-01 | 2024-02-01 | 1 |
| 2 | 2024-01-01 | 2024-01-01 | 0 |
Period 0 = cohort month, Period 1 = 1 month later, etc.
Step 3: Build the Retention Matrix
Now we count how many users from each cohort were active in each period.
Reading the Results:
- Cohort Month: When users first became active.
- Period Number: Months since cohort month (0 = same month, 1 = next month, etc.).
- Active Users: How many users from that cohort were active in that period.
For example:
- 2024-01-01, Period 0: 3 users (all January signups were active in January).
- 2024-01-01, Period 1: 2 users (2 out of 3 January signups came back in February).
- 2024-01-01, Period 2: 2 users (2 out of 3 came back in March).
Retention Rate = (Active Users in Period N) / (Active Users in Period 0) × 100%
So for the January cohort:
- Month 1 Retention: 2/3 = 67%
- Month 2 Retention: 2/3 = 67%
Visualizing Cohort Data
In practice, you'd export this data to a tool like Excel, Google Sheets, or a BI platform to create a retention heatmap:
| Month 0 | Month 1 | Month 2 | |
|---|---|---|---|
| Jan 2024 | 100% | 67% | 67% |
| Feb 2024 | 100% | 100% | 100% |
Green = high retention, Red = low retention.
Advanced: Calculating Cohort Metrics
Once you have the cohort structure, you can calculate other metrics:
Average Revenue Per Cohort
SELECT
c.cohort_month,
AVG(o.order_total) AS avg_revenue
FROM orders o
JOIN cohorts c ON o.user_id = c.user_id
GROUP BY c.cohort_month;
Churn Rate
-- Users who were active in Period 0 but NOT in Period 1
SELECT
cohort_month,
COUNT(DISTINCT CASE WHEN period_number = 0 THEN user_id END) AS period_0_users,
COUNT(DISTINCT CASE WHEN period_number = 1 THEN user_id END) AS period_1_users,
1.0 - (COUNT(DISTINCT CASE WHEN period_number = 1 THEN user_id END) * 1.0 /
COUNT(DISTINCT CASE WHEN period_number = 0 THEN user_id END)) AS churn_rate
FROM user_periods
GROUP BY cohort_month;
Best Practices
-
Choose the Right Cohort Window: Monthly cohorts are common, but you can also use weekly or daily cohorts for fast-moving products.
-
Define "Active" Clearly: Is a user "active" if they log in? Make a purchase? Engage with content? Be consistent.
-
Use CTEs for Readability: Cohort queries can get complex. Break them into steps with Common Table Expressions.
-
Automate with Views: Create a view for your cohort logic so analysts can query it easily.
Conclusion
Cohort analysis is essential for understanding user behavior over time. By grouping users based on when they joined and tracking their activity, you can:
- Measure retention and identify drop-off points.
- Compare cohorts to see the impact of product changes.
- Predict churn and take action before users leave.
And with SQL, you have full control over the analysis—no need to wait for a third-party tool to add the feature you need.
Now go build some cohorts! 📊