Web analytics tools like Google Analytics automatically talk about "Sessions". But what if you are working with raw event logs (e.g., from Segment or Mixpanel) in your data warehouse? You have to build the logic yourself.
A Session is typically defined as a group of interactions by one user where no two interactions are more than X minutes apart (standard is 30 minutes).
The Logic
- Calculate Time Gaps: For each row, calculate the difference between the current timestamp and the previous timestamp for that user.
- Flag New Sessions: If the gap > 30 minutes, mark it as the start of a
new_session(flag = 1). Otherwise, it's part of the same session (flag = 0). - Accumulate Flags: A running sum of these flags gives you a unique Session ID for each user.
Step 1: Time Between Events
LAG(timestamp) OVER (PARTITION BY user_id ORDER BY timestamp)
We compare timestamp vs prev_timestamp.
Step 2: The New Session Flag
CASE
WHEN (timestamp - prev_timestamp) > (30 * 60) THEN 1 -- > 30 mins
ELSE 0
END as is_new_session
Interactive Playground
In this example, we'll see a user clicking around. We want to identify distinct "visits" based on a 30-minute timeout. Note: For text timestamps in SQLite, we use julianday math. 30 minutes is roughly 0.0208 days.
Why This Matters
Once you have session_id, you can calculate:
- Average Session Duration:
MAX(time) - MIN(time)per session. - Bounce Rate: Sessions with only 1 event.
- Conversion Rate: Sessions that end in a purchase.
Conclusion
Sessionization is a fundamental transformation in clickstream analytics. By mastering this pattern, you unlock the ability to analyze user behavior at the "visit" level, which is far more meaningful than independent "hits".