SQL for Anomaly Detection: Finding Outliers

Your sales dashboard looks fine on average—revenue is up 5% month-over-month. But buried in your data are a few transactions that are wildly off: a $0.01 order, a sudden spike in refunds, a server metric that jumped 10x for one minute.

These are anomalies, and averages hide them. SQL can surface them.

Time-series line chart showing normal data points with two outlier spikes circled in red, illustrating SQL anomaly detection

In this post, we'll explore three practical methods to detect outliers directly in SQL: the Z-Score method, the Interquartile Range (IQR) method, and using moving averages to catch time-series spikes.

What is an Anomaly?

An anomaly (or outlier) is a data point that deviates significantly from expected behavior. Anomaly detection matters for:

Fraud detection: A transaction 100x larger than a user's typical order.
System monitoring: A server latency spike at 3am.
Data quality: Accidentally inserted negative prices or future timestamps.
Business intelligence: An inexplicable dip in conversion rates.

The challenge is defining "significantly different"—and SQL gives us several statistical tools to do exactly that.

The Sample Dataset

Let's create a table of daily sales transactions. Two rows are clearly suspicious: row #7 is a $1,850 spike, and row #10 is a $4.99 near-zero value.

Interactive SQL

Now let's catch these programmatically—without eyeballing the data.

Method 1: Z-Score

The Z-Score measures how many standard deviations a value is from the mean:

Z = (x − μ) / σ

Where μ is the mean and σ is the standard deviation. A Z-Score above 2 or 3 (or below -2 / -3) is typically flagged as an anomaly.

Interactive SQL

Notice that only the $1,850 spike is flagged here. Because that outlier is so extreme, it dramatically inflates the standard deviation, which actually protects the $4.99 value from being caught. This is the core weakness of the Z-Score method.

Limitation: A single extreme outlier can inflate σ, making other moderate outliers appear normal.

Method 2: Interquartile Range (IQR)

The IQR method is more robust to extreme values. It defines the normal band as:

Lower Fence = Q1 − 1.5 × IQR
Upper Fence = Q3 + 1.5 × IQR

Where IQR = Q3 − Q1. Values outside these fences are outliers.

SQLite doesn't have a built-in PERCENTILE_CONT function, but NTILE(4) gives us a solid approximation:

Interactive SQL

The IQR method catches both the $1,850 spike and the $4.99 dip because it isn't distorted by either extreme. This makes it the go-to method for skewed financial data.

Method 3: Moving Average Deviation

For time-series data, comparing each value to a rolling window is often the most intuitive approach. A value is anomalous if it deviates significantly from its recent neighborhood:

Interactive SQL

The window ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING compares each day's sales to the previous 4 days. The January 7th spike stands out sharply against the $118–$142 range of the preceding days.

Which Method Should You Use?

Method	Best For	Weakness
Z-Score	Normally distributed data	One extreme outlier distorts $\sigma$
IQR	Skewed data, robust detection	Misses gradual drift over time
Moving Average	Time-series, sequential events	Needs sufficient historical data

In practice, you can combine methods: flag only records that are anomalous by two or more methods out of three. This dramatically reduces false positives.

Practical Tips

Segment before analyzing: Compute statistics per product_category or region—not across all data—so you're comparing apples to apples.
Tune thresholds by domain: A 3-sigma rule for financial data; a lower threshold may work better for operational metrics.
Log, don't delete: When you find an anomaly, write it to a review table rather than silently discarding it.
Schedule it: Run your anomaly queries daily or hourly as part of your data pipeline. Stale anomaly detection is no anomaly detection.

Conclusion

Anomaly detection in SQL doesn't require machine learning. With Z-Scores, IQR fences, and rolling window comparisons, you can surface outliers directly in your data warehouse using tools you already know.

Start with the IQR method for a robust, distribution-agnostic approach. Layer in moving averages when working with time-series data. Your future self—staring at a fraud alert at 2am—will thank you.

These are anomalies, and averages hide them. SQL can surface them.

What is an Anomaly?

An anomaly (or outlier) is a data point that deviates significantly from expected behavior. Anomaly detection matters for:

Fraud detection: A transaction 100x larger than a user's typical order.
System monitoring: A server latency spike at 3am.
Data quality: Accidentally inserted negative prices or future timestamps.
Business intelligence: An inexplicable dip in conversion rates.

The challenge is defining "significantly different"—and SQL gives us several statistical tools to do exactly that.

The Sample Dataset

Let's create a table of daily sales transactions. Two rows are clearly suspicious: row #7 is a $1,850 spike, and row #10 is a $4.99 near-zero value.

Interactive SQL

Now let's catch these programmatically—without eyeballing the data.

Method 1: Z-Score

The Z-Score measures how many standard deviations a value is from the mean:

Z = (x − μ) / σ

Where μ is the mean and σ is the standard deviation. A Z-Score above 2 or 3 (or below -2 / -3) is typically flagged as an anomaly.

Interactive SQL

Limitation: A single extreme outlier can inflate σ, making other moderate outliers appear normal.

Method 2: Interquartile Range (IQR)

The IQR method is more robust to extreme values. It defines the normal band as:

Lower Fence = Q1 − 1.5 × IQR
Upper Fence = Q3 + 1.5 × IQR

Where IQR = Q3 − Q1. Values outside these fences are outliers.

SQLite doesn't have a built-in PERCENTILE_CONT function, but NTILE(4) gives us a solid approximation:

Interactive SQL

The IQR method catches both the $1,850 spike and the $4.99 dip because it isn't distorted by either extreme. This makes it the go-to method for skewed financial data.

Method 3: Moving Average Deviation

For time-series data, comparing each value to a rolling window is often the most intuitive approach. A value is anomalous if it deviates significantly from its recent neighborhood:

Interactive SQL

The window ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING compares each day's sales to the previous 4 days. The January 7th spike stands out sharply against the $118–$142 range of the preceding days.

Which Method Should You Use?

Method	Best For	Weakness
Z-Score	Normally distributed data	One extreme outlier distorts $\sigma$
IQR	Skewed data, robust detection	Misses gradual drift over time
Moving Average	Time-series, sequential events	Needs sufficient historical data

In practice, you can combine methods: flag only records that are anomalous by two or more methods out of three. This dramatically reduces false positives.

Practical Tips

Segment before analyzing: Compute statistics per product_category or region—not across all data—so you're comparing apples to apples.
Tune thresholds by domain: A 3-sigma rule for financial data; a lower threshold may work better for operational metrics.
Log, don't delete: When you find an anomaly, write it to a review table rather than silently discarding it.
Schedule it: Run your anomaly queries daily or hourly as part of your data pipeline. Stale anomaly detection is no anomaly detection.

Conclusion

Start with the IQR method for a robust, distribution-agnostic approach. Layer in moving averages when working with time-series data. Your future self—staring at a fraud alert at 2am—will thank you.

SQL for Anomaly Detection: Finding Outliers

What is an Anomaly?

The Sample Dataset

Method 1: Z-Score

Method 2: Interquartile Range (IQR)

Method 3: Moving Average Deviation

Which Method Should You Use?

Practical Tips

Conclusion

Related Articles

SQL for Data Analysis: The Ultimate Guide

Calculating Percentiles and Median in SQL

Ranking Data with SQL: RANK, DENSE_RANK, and ROW_NUMBER Explained

Comments

SQL for Anomaly Detection: Finding Outliers

What is an Anomaly?

The Sample Dataset

Method 1: Z-Score

Method 2: Interquartile Range (IQR)

Method 3: Moving Average Deviation

Which Method Should You Use?

Practical Tips

Conclusion

Related Articles

SQL for Data Analysis: The Ultimate Guide

Calculating Percentiles and Median in SQL

Ranking Data with SQL: RANK, DENSE_RANK, and ROW_NUMBER Explained

Comments