Removing Duplicate Rows in SQL: A Complete Guide

Duplicate rows sneak into databases through data imports, buggy applications, or missing constraints. Cleaning them up is a critical skill for any data professional.

This guide covers multiple approaches, from identifying duplicates to safely deleting them.

Step 1: Finding Duplicates

Before deleting anything, let's identify what we're dealing with.

Count Duplicates with GROUP BY

Interactive SQL

Now we know alice@example.com appears twice and bob@example.com appears three times.

Step 2: See All Duplicate Rows

Use a subquery to fetch the full details:

Interactive SQL

Strategy 1: Keep the First (or Last) Record

The most common approach uses ROW_NUMBER() to mark which duplicate to keep.

Using ROW_NUMBER to Identify Duplicates

Interactive SQL

row_num = 1: The first record (keep this one)
row_num > 1: Duplicates (delete these)

Delete Using CTE

Now we can delete the duplicates:

-- SQLite syntax for deleting duplicates
DELETE FROM users
WHERE id IN (
  SELECT id FROM (
    SELECT 
      id,
      ROW_NUMBER() OVER (
        PARTITION BY email 
        ORDER BY created_at
      ) as row_num
    FROM users
  )
  WHERE row_num > 1
);

Tip: Change ORDER BY created_at to ORDER BY created_at DESC to keep the latest record instead.

Strategy 2: Keep Record with Most Data

Sometimes you want to keep the most "complete" record:

-- Keep the record with the longest name (most data)
SELECT id FROM (
  SELECT 
    id,
    ROW_NUMBER() OVER (
      PARTITION BY email 
      ORDER BY LENGTH(name) DESC
    ) as row_num
  FROM users
)
WHERE row_num > 1;

Strategy 3: SELECT DISTINCT INTO New Table

For heavily duplicated tables, it's often easier to create a clean copy:

-- Create new table with unique records only
CREATE TABLE users_clean AS
SELECT DISTINCT email, name
FROM users;

-- Or keep the first of each duplicate
CREATE TABLE users_clean AS
SELECT * FROM (
  SELECT 
    *,
    ROW_NUMBER() OVER (
      PARTITION BY email 
      ORDER BY created_at
    ) as rn
  FROM users
)
WHERE rn = 1;

Strategy 4: Aggregating Duplicates

Sometimes duplicates have different values you want to combine:

Interactive SQL

Preventing Future Duplicates

Add UNIQUE Constraint

-- Prevent duplicate emails
CREATE UNIQUE INDEX idx_unique_email ON users(email);

-- Or during table creation
CREATE TABLE users (
  id INTEGER PRIMARY KEY,
  email TEXT UNIQUE,
  name TEXT
);

Use INSERT OR IGNORE

-- Skip inserts that would create duplicates
INSERT OR IGNORE INTO users (email, name)
VALUES ('alice@example.com', 'Alice');

Quick Reference: Which Strategy?

Scenario	Best Approach
Keep oldest record	`ROW_NUMBER() ORDER BY created_at`
Keep newest record	`ROW_NUMBER() ORDER BY created_at DESC`
Keep most complete	`ROW_NUMBER() ORDER BY LENGTH(field) DESC`
Combine values	`GROUP BY` with `SUM`/`MAX`
Heavy duplication	Create new clean table
Prevent future dupes	`UNIQUE` constraint

Safety Tips

Always backup first: Run SELECT before DELETE to verify what will be removed.
Use transactions: Wrap deletions in BEGIN/COMMIT so you can ROLLBACK if needed.
Check foreign keys: Deleting a row might break references in other tables.
Log deleted rows: Insert them into an archive table before deleting.

Conclusion

Duplicate removal is a multi-step process:

Identify duplicates with GROUP BY HAVING COUNT(*) > 1
Examine them to decide which to keep
Mark duplicates using ROW_NUMBER() OVER (PARTITION BY...)
Delete rows where row_num > 1
Prevent future duplicates with UNIQUE constraints

Master these patterns and you'll handle any data deduplication challenge.

Duplicate rows sneak into databases through data imports, buggy applications, or missing constraints. Cleaning them up is a critical skill for any data professional.

This guide covers multiple approaches, from identifying duplicates to safely deleting them.

Step 1: Finding Duplicates

Before deleting anything, let's identify what we're dealing with.

Count Duplicates with GROUP BY

Interactive SQL

Now we know alice@example.com appears twice and bob@example.com appears three times.

Step 2: See All Duplicate Rows

Use a subquery to fetch the full details:

Interactive SQL

Strategy 1: Keep the First (or Last) Record

The most common approach uses ROW_NUMBER() to mark which duplicate to keep.

Using ROW_NUMBER to Identify Duplicates

Interactive SQL

row_num = 1: The first record (keep this one)
row_num > 1: Duplicates (delete these)

Delete Using CTE

Now we can delete the duplicates:

-- SQLite syntax for deleting duplicates
DELETE FROM users
WHERE id IN (
  SELECT id FROM (
    SELECT 
      id,
      ROW_NUMBER() OVER (
        PARTITION BY email 
        ORDER BY created_at
      ) as row_num
    FROM users
  )
  WHERE row_num > 1
);

Tip: Change ORDER BY created_at to ORDER BY created_at DESC to keep the latest record instead.

Strategy 2: Keep Record with Most Data

Sometimes you want to keep the most "complete" record:

-- Keep the record with the longest name (most data)
SELECT id FROM (
  SELECT 
    id,
    ROW_NUMBER() OVER (
      PARTITION BY email 
      ORDER BY LENGTH(name) DESC
    ) as row_num
  FROM users
)
WHERE row_num > 1;

Strategy 3: SELECT DISTINCT INTO New Table

For heavily duplicated tables, it's often easier to create a clean copy:

-- Create new table with unique records only
CREATE TABLE users_clean AS
SELECT DISTINCT email, name
FROM users;

-- Or keep the first of each duplicate
CREATE TABLE users_clean AS
SELECT * FROM (
  SELECT 
    *,
    ROW_NUMBER() OVER (
      PARTITION BY email 
      ORDER BY created_at
    ) as rn
  FROM users
)
WHERE rn = 1;

-- Prevent duplicate emails
CREATE UNIQUE INDEX idx_unique_email ON users(email);

-- Or during table creation
CREATE TABLE users (
  id INTEGER PRIMARY KEY,
  email TEXT UNIQUE,
  name TEXT
);

Use INSERT OR IGNORE

-- Skip inserts that would create duplicates
INSERT OR IGNORE INTO users (email, name)
VALUES ('alice@example.com', 'Alice');

Quick Reference: Which Strategy?

Scenario	Best Approach
Keep oldest record	`ROW_NUMBER() ORDER BY created_at`
Keep newest record	`ROW_NUMBER() ORDER BY created_at DESC`
Keep most complete	`ROW_NUMBER() ORDER BY LENGTH(field) DESC`
Combine values	`GROUP BY` with `SUM`/`MAX`
Heavy duplication	Create new clean table
Prevent future dupes	`UNIQUE` constraint

Safety Tips

Always backup first: Run SELECT before DELETE to verify what will be removed.
Use transactions: Wrap deletions in BEGIN/COMMIT so you can ROLLBACK if needed.
Check foreign keys: Deleting a row might break references in other tables.
Log deleted rows: Insert them into an archive table before deleting.

Conclusion

Duplicate removal is a multi-step process:

Identify duplicates with GROUP BY HAVING COUNT(*) > 1
Examine them to decide which to keep
Mark duplicates using ROW_NUMBER() OVER (PARTITION BY...)
Delete rows where row_num > 1
Prevent future duplicates with UNIQUE constraints

Master these patterns and you'll handle any data deduplication challenge.

Related Articles

Data Cleaning with SQL: From Messy to Masterpiece

SQL Window Frames: ROWS vs RANGE

Mastering Slowly Changing Dimensions (SCD Type 2) in SQL

Comments

Related Articles

Data Cleaning with SQL: From Messy to Masterpiece

SQL Window Frames: ROWS vs RANGE

Mastering Slowly Changing Dimensions (SCD Type 2) in SQL

Comments