In the real world, data is rarely pristine. It comes with trailing spaces, inconsistent capitalization, strange characters, and missing values.
Before you can analyze data, you must clean it. Here are the 4 essential tools in your SQL cleaning kit.
1. COALESCE: The Null-Buster
NULL values can ruin math and look ugly in reports. COALESCE() returns the first non-null value in a list.
SELECT COALESCE(phone_number, 'No Phone') FROM users;
2. TRIM: The Whitespace Whacker
Spaces at the start or end of strings are invisible enemies. They make 'Alice' imply not equal to 'Alice '.
SELECT TRIM(name) FROM users;
3. LOWER / UPPER: The Case Normalizer
'iPad', 'IPad', and 'ipad' are three different strings to a computer. Normalize them to fix grouping.
SELECT LOWER(device_type) FROM devices;
4. REPLACE: The Substitution Surgeon
Great for fixing inconsistent formats, like removing dashes from phone numbers or fixing typos.
SELECT REPLACE(phone, '-', '') FROM users;
Putting It All Together: Interactive Cleaning
Let's clean a messy customer list.
The Horror Show (Raw Data):
- Inconsistent names (alice, Bob, CHARLIE)
- Messy emails with spaces (
alice@email.com) - NULL phone numbers
- NULL status (should be 'Active')
The Cleanup Query:
Tip: Depending on your SQL dialect (PostgreSQL, SQL Server), you might have INITCAP() to capitalize just the first letter (e.g., 'Alice'). Standard SQL doesn't always have INITCAP, so we often settle for UPPER or complex SUBSTRING logic for names.
Deduplication
Another huge part of cleaning is removing duplicates.
The best way to deduplicate? Use ROW_NUMBER().
WITH flagged_dupes AS (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY email ORDER BY id DESC) as rn
FROM users
)
SELECT * FROM flagged_dupes WHERE rn = 1; -- Keep only the latest entry
Conclusion
Dirty data leads to dirty insights. By mastering TRIM, COALESCE, and normalization functions, you ensure your analysis is built on a solid foundation.