Find all queries that took longer than the average execution time. Return query_id, execution_time, and how much longer than average (in seconds).
This performance monitoring problem reflects Databricks' focus on query optimization and observability. Identifying slow queries is the first step in performance tuning, helping data engineers prioritize optimization efforts. This question tests your understanding of subqueries, aggregate functions, and calculated columns—basic but essential skills for database performance analysis.
Concepts tested: a scalar subquery to compute the average, comparison operators in the WHERE clause, a calculated column for the difference, and understanding when subqueries are evaluated (a scalar subquery is evaluated once, not once per row). This is a straightforward application of the 'compare to aggregate' pattern that appears frequently in analytics.
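The pattern can be sketched end to end with Python's built-in sqlite3 module. The table and column names (query_log, query_id, execution_time) and the sample timings are assumptions for illustration; adapt them to the actual schema in the problem.

```python
import sqlite3

# Build a tiny in-memory table; names and values are invented for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE query_log (query_id INTEGER, execution_time REAL)")
conn.executemany(
    "INSERT INTO query_log VALUES (?, ?)",
    [(1, 10.0), (2, 30.0), (3, 20.0)],  # average execution time is 20.0
)

# The scalar subquery is evaluated once, then every row is compared to it.
sql = """
SELECT query_id,
       execution_time,
       execution_time - (SELECT AVG(execution_time) FROM query_log)
           AS seconds_over_average
FROM query_log
WHERE execution_time > (SELECT AVG(execution_time) FROM query_log)
"""
rows = list(conn.execute(sql))
print(rows)  # only query 2 (30.0s) exceeds the 20.0s average
```

Repeating the subquery in both SELECT and WHERE is fine here because the optimizer evaluates a scalar subquery once; a CTE would also work if you prefer to name the average.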
Databricks customers use similar queries to: identify queries requiring optimization in their data pipelines, generate performance reports for data engineering teams, trigger alerts when query times exceed thresholds, calculate SLA compliance for data freshness, prioritize cluster scaling decisions, and feed automated query optimization recommendations.
When tackling this Databricks problem, the key is to understand the grain of the result: here it is one row per slow query, so no join or GROUP BY is needed and a single scalar subquery supplies the average. In more complex variants, start by identifying your unique join keys and consider whether filtered aggregations (CASE WHEN) are more efficient than multiple subqueries.
Be careful with NULL values in your JOIN conditions and aggregate functions: AVG skips NULL rows entirely, which can shift the average you are comparing against. In interview scenarios, datasets often include edge cases like NULL measurements, zero-count categories, or duplicate entries that can throw off a simple COUNT(*) unless handled with COUNT(column) or DISTINCT.
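These pitfalls are easy to demonstrate concretely. The data below is invented for the demo: query 2 has a NULL execution_time and query 3 appears twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE query_log (query_id INTEGER, execution_time REAL)")
conn.executemany(
    "INSERT INTO query_log VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 20.0), (3, 20.0)],
)

counts = conn.execute("""
SELECT COUNT(*),                   -- all rows, NULLs included
       COUNT(execution_time),      -- the NULL row is skipped
       COUNT(DISTINCT query_id),   -- duplicate query 3 collapses to one
       AVG(execution_time)         -- averages only the 3 non-NULL values
FROM query_log
""").fetchone()
print(counts)
```

Note that AVG divides by the non-NULL count, not the row count, so a table full of NULL timings quietly inflates the average of the remaining rows.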
Share your approach or your optimized query, or ask questions. Learning from others is the fastest way to master SQL.