The "Gaps and Islands" problem is a classic SQL puzzle that comes up frequently in real-world scenarios. It involves analyzing a sequence of data to find:
- Islands: Consecutive ranges of values (e.g., a user logging in 5 days in a row).
- Gaps: Missing values in a sequence (e.g., finding available appointment slots in a schedule).
Despite being a common problem, the solution isn't always intuitive. In this post, we'll explore the "Row Number Difference" technique, which is one of the most elegant ways to solve this.
The Scenario: Login Streaks
Imagine a table tracking user login dates. We want to find "Islands" of activity—consecutive days where a user logged in—to reward them for streaks.
| User | Login Date |
|---|---|
| Alice | 2024-01-01 |
| Alice | 2024-01-02 |
| Alice | 2024-01-03 |
| Alice | 2024-01-05 |
| Alice | 2024-01-06 |
Here, Alice has two streaks (islands):
- Jan 1 to Jan 3 (3 days)
- Jan 5 to Jan 6 (2 days) Jan 4 is a "gap".
The Solution: Row Number Difference
The core insight is this: If a sequence is consecutive, the difference between the value and its rank (row number) will be constant.
Let's verify this mathematically.
If we assign a ROW_NUMBER() to the dates:
| Date (D) | Row Num (RN) | D - RN |
|---|---|---|
| Jan 1 | 1 | Jan 0 |
| Jan 2 | 2 | Jan 0 |
| Jan 3 | 3 | Jan 0 |
| Jan 5 | 4 | Jan 1 |
| Jan 6 | 5 | Jan 1 |
Notice how D - RN is constant ("Jan 0") for the first group, and then changes to "Jan 1" for the second group. This "group identifier" allows us to group consecutive rows together!
Constructing the Query
- Generate Row Numbers: Order by date for each user.
- Calculate the Group ID: Subtract the row number (converted to days) from the login date.
- Group By the Difference: Aggregate to find the start and end of each streak.
WITH RankedLogs AS (
SELECT
username,
login_date,
ROW_NUMBER() OVER (PARTITION BY username ORDER BY login_date) as rn
FROM user_logins
),
GroupedLogs AS (
SELECT
username,
login_date,
-- Subtract RN days from login_date to get the 'Island ID'
date(login_date, '-' || rn || ' days') as island_id
FROM RankedLogs
)
SELECT
username,
MIN(login_date) as streak_start,
MAX(login_date) as streak_end,
COUNT(*) as streak_length
FROM GroupedLogs
GROUP BY username, island_id
ORDER BY username, streak_start;
Interactive Gaps & Islands Playground
Try it yourself! In this example, we look for "Islands" of consecutive manufacturing serial numbers to see batches.
Finding Gaps
Finding what's missing is just as important. To find gaps, you usually use LEAD() to look at the next value and check if it's more than 1 unit away.
SELECT
serial_number + 1 as gap_start,
next_serial - 1 as gap_end
FROM (
SELECT
serial_number,
LEAD(serial_number) OVER (ORDER BY serial_number) as next_serial
FROM manufacturing_log_gaps
) sub
WHERE next_serial - serial_number > 1;
Conclusion
The "Row Number Difference" grouping method is a powerful pattern to have in your SQL toolkit. While it requires a bit of mental gymnastics to understand initially, it is incredibly efficient and works on virtually any SQL database with window functions. Use it to analyze streaks, sessions, and availability ranges with ease.