Data Cleaning with SQL: From Messy to Masterpiece

In the real world, data is rarely pristine. It comes with trailing spaces, inconsistent capitalization, strange characters, and missing values.

Before you can analyze data, you must clean it. Here are the 4 essential tools in your SQL cleaning kit.

Messy data cleaned into standardized fields

1. COALESCE: The Null-Buster

NULL values can ruin math and look ugly in reports. COALESCE() returns the first non-null value in a list.

SELECT COALESCE(phone_number, 'No Phone') FROM users;

2. TRIM: The Whitespace Whacker

Spaces at the start or end of strings are invisible enemies. They make 'Alice' imply not equal to 'Alice '.

SELECT TRIM(name) FROM users;

3. LOWER / UPPER: The Case Normalizer

'iPad', 'IPad', and 'ipad' are three different strings to a computer. Normalize them to fix grouping.

SELECT LOWER(device_type) FROM devices;

4. REPLACE: The Substitution Surgeon

Great for fixing inconsistent formats, like removing dashes from phone numbers or fixing typos.

SELECT REPLACE(phone, '-', '') FROM users;

Putting It All Together: Interactive Cleaning

Let's clean a messy customer list.

The Horror Show (Raw Data):

Inconsistent names (alice, Bob, CHARLIE)
Messy emails with spaces ( alice@email.com)
NULL phone numbers
NULL status (should be 'Active')

Interactive SQL

The Cleanup Query:

Interactive SQL

Tip: Depending on your SQL dialect (PostgreSQL, SQL Server), you might have INITCAP() to capitalize just the first letter (e.g., 'Alice'). Standard SQL doesn't always have INITCAP, so we often settle for UPPER or complex SUBSTRING logic for names.

Deduplication

Another huge part of cleaning is removing duplicates.

The best way to deduplicate? Use ROW_NUMBER().

WITH flagged_dupes AS (
  SELECT 
    *,
    ROW_NUMBER() OVER(PARTITION BY email ORDER BY id DESC) as rn
  FROM users
)
SELECT * FROM flagged_dupes WHERE rn = 1; -- Keep only the latest entry

Conclusion

Dirty data leads to dirty insights. By mastering TRIM, COALESCE, and normalization functions, you ensure your analysis is built on a solid foundation.

In the real world, data is rarely pristine. It comes with trailing spaces, inconsistent capitalization, strange characters, and missing values.

Before you can analyze data, you must clean it. Here are the 4 essential tools in your SQL cleaning kit.

1. COALESCE: The Null-Buster

NULL values can ruin math and look ugly in reports. COALESCE() returns the first non-null value in a list.

SELECT COALESCE(phone_number, 'No Phone') FROM users;

2. TRIM: The Whitespace Whacker

Spaces at the start or end of strings are invisible enemies. They make 'Alice' imply not equal to 'Alice '.

SELECT TRIM(name) FROM users;

3. LOWER / UPPER: The Case Normalizer

'iPad', 'IPad', and 'ipad' are three different strings to a computer. Normalize them to fix grouping.

SELECT LOWER(device_type) FROM devices;

4. REPLACE: The Substitution Surgeon

Great for fixing inconsistent formats, like removing dashes from phone numbers or fixing typos.

SELECT REPLACE(phone, '-', '') FROM users;

Putting It All Together: Interactive Cleaning

Let's clean a messy customer list.

The Horror Show (Raw Data):

Inconsistent names (alice, Bob, CHARLIE)
Messy emails with spaces ( alice@email.com)
NULL phone numbers
NULL status (should be 'Active')

Interactive SQL

The Cleanup Query:

Interactive SQL

Deduplication

Another huge part of cleaning is removing duplicates.

The best way to deduplicate? Use ROW_NUMBER().

WITH flagged_dupes AS (
  SELECT 
    *,
    ROW_NUMBER() OVER(PARTITION BY email ORDER BY id DESC) as rn
  FROM users
)
SELECT * FROM flagged_dupes WHERE rn = 1; -- Keep only the latest entry

Conclusion

Dirty data leads to dirty insights. By mastering TRIM, COALESCE, and normalization functions, you ensure your analysis is built on a solid foundation.

Data Cleaning with SQL: From Messy to Masterpiece

1. COALESCE: The Null-Buster

2. TRIM: The Whitespace Whacker

3. LOWER / UPPER: The Case Normalizer

4. REPLACE: The Substitution Surgeon

Putting It All Together: Interactive Cleaning

Deduplication

Conclusion

Related Articles

Removing Duplicate Rows in SQL: A Complete Guide

Essential SQL String Functions for Data Cleaning and Analysis

SQL Window Frames: ROWS vs RANGE

Comments

Data Cleaning with SQL: From Messy to Masterpiece

1. COALESCE: The Null-Buster

2. TRIM: The Whitespace Whacker

3. LOWER / UPPER: The Case Normalizer

4. REPLACE: The Substitution Surgeon

Putting It All Together: Interactive Cleaning

Deduplication

Conclusion

Related Articles

Removing Duplicate Rows in SQL: A Complete Guide

Essential SQL String Functions for Data Cleaning and Analysis

SQL Window Frames: ROWS vs RANGE

Comments