As developers and analysts, we often need real data to test features or run reports. But using production data with real names, emails, and phone numbers (PII) is a massive security risk and a violation of privacy laws like GDPR and CCPA.
In this guide, we'll cover SQL techniques to "Mask" or "Sanitize" data so you can use it safely.
1. Partial Redaction (Masking)
Useful for customer support UI—showing enough to verify (e.g., "ends in 4432") without revealing the whole value.
-- Convert 'john.doe@email.com' -> 'j*******@email.com'
SELECT
SUBSTR(email, 1, 1) || '*******' || SUBSTR(email, INSTR(email, '@'))
FROM users;
2. Hashing (Pseudo-Anonymization)
If you need to join tables (e.g., orders to users) without exposing the user ID, you can use a cryptographic hash.
-- SHA2 returns a hex string. Same input always gives same output.
SELECT SHA2(email, 256) as user_hash FROM users;
Note: This is "pseudonymized", not fully anonymized, as rainbow tables can crack it. For better security, add a "salt" (random string) before hashing.
3. Data Generalization
Instead of showing exact values, show ranges.
- Age 24 -> "20-30"
- Zip 90210 -> "90xxx"
Interactive Playground
Here is a production_users table full of sensitive PII. Let's create a "Safe View" for the analytics team.
Conclusion
Data privacy is not optional. By building these masking transformations into your ETL pipelines or database views, you ensure that analysts and developers can do their jobs without becoming a liability.