Duplicate rows sneak into databases through data imports, buggy applications, or missing constraints. Cleaning them up is a critical skill for any data professional.
This guide covers multiple approaches, from identifying duplicates to safely deleting them.
Step 1: Finding Duplicates
Before deleting anything, let's identify what we're dealing with.
Count Duplicates with GROUP BY
Now we know alice@example.com appears twice and bob@example.com appears three times.
Step 2: See All Duplicate Rows
Use a subquery to fetch the full details:
Strategy 1: Keep the First (or Last) Record
The most common approach uses ROW_NUMBER() to mark which duplicate to keep.
Using ROW_NUMBER to Identify Duplicates
row_num = 1: The first record (keep this one)row_num > 1: Duplicates (delete these)
Delete Using CTE
Now we can delete the duplicates:
-- SQLite syntax for deleting duplicates
DELETE FROM users
WHERE id IN (
SELECT id FROM (
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY created_at
) as row_num
FROM users
)
WHERE row_num > 1
);
Tip: Change ORDER BY created_at to ORDER BY created_at DESC to keep the latest record instead.
Strategy 2: Keep Record with Most Data
Sometimes you want to keep the most "complete" record:
-- Keep the record with the longest name (most data)
SELECT id FROM (
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY LENGTH(name) DESC
) as row_num
FROM users
)
WHERE row_num > 1;
Strategy 3: SELECT DISTINCT INTO New Table
For heavily duplicated tables, it's often easier to create a clean copy:
-- Create new table with unique records only
CREATE TABLE users_clean AS
SELECT DISTINCT email, name
FROM users;
-- Or keep the first of each duplicate
CREATE TABLE users_clean AS
SELECT * FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY created_at
) as rn
FROM users
)
WHERE rn = 1;
Strategy 4: Aggregating Duplicates
Sometimes duplicates have different values you want to combine:
Preventing Future Duplicates
Add UNIQUE Constraint
-- Prevent duplicate emails
CREATE UNIQUE INDEX idx_unique_email ON users(email);
-- Or during table creation
CREATE TABLE users (
id INTEGER PRIMARY KEY,
email TEXT UNIQUE,
name TEXT
);
Use INSERT OR IGNORE
-- Skip inserts that would create duplicates
INSERT OR IGNORE INTO users (email, name)
VALUES ('alice@example.com', 'Alice');
Quick Reference: Which Strategy?
| Scenario | Best Approach |
|---|---|
| Keep oldest record | ROW_NUMBER() ORDER BY created_at |
| Keep newest record | ROW_NUMBER() ORDER BY created_at DESC |
| Keep most complete | ROW_NUMBER() ORDER BY LENGTH(field) DESC |
| Combine values | GROUP BY with SUM/MAX |
| Heavy duplication | Create new clean table |
| Prevent future dupes | UNIQUE constraint |
Safety Tips
-
Always backup first: Run
SELECTbeforeDELETEto verify what will be removed. -
Use transactions: Wrap deletions in
BEGIN/COMMITso you canROLLBACKif needed. -
Check foreign keys: Deleting a row might break references in other tables.
-
Log deleted rows: Insert them into an archive table before deleting.
Conclusion
Duplicate removal is a multi-step process:
- Identify duplicates with
GROUP BY HAVING COUNT(*) > 1 - Examine them to decide which to keep
- Mark duplicates using
ROW_NUMBER() OVER (PARTITION BY...) - Delete rows where
row_num > 1 - Prevent future duplicates with
UNIQUEconstraints
Master these patterns and you'll handle any data deduplication challenge.