In an era where data privacy is paramount, safeguarding sensitive information isn't just a best practice—it's a necessity. Regulatory frameworks like GDPR and HIPAA mandate strict controls over personally identifiable information (PII). This is where data masking and data redaction become essential tools in a database administrator's toolkit.
What is Data Masking?
According to Wikipedia, data masking is the process of modifying sensitive data in a way that it retains its usability for software and authorized users but holds little to no value for unauthorized individuals. The key here is creating fictitious yet realistic data. For example, replacing a real name like 'Amul Sul' with 'Jeevan Chalke' or a real credit card number with a fake but valid-looking one. The original data is permanently changed in the database.
This is a crucial distinction from data redaction. While redaction also limits sensitive data exposure, it doesn't change the data itself. Instead, it dynamically hides or replaces it as it's displayed to certain users. For instance, data redaction can show the last four digits of a Social Security Number while redacting the rest. The original, full SSN remains untouched in the database.
The Power of PostgreSQL Anonymizer (pg_anonymizer)
PostgreSQL Anonymizer is an extension for masking or replacing sensitive data in PostgreSQL, EPAS and PGE. It offers a unique "privacy by design" approach, allowing you to embed masking rules directly into your database schema using PostgreSQL's Data Definition Language (DDL).
This extension provides five different ways to apply anonymization rules:
Anonymous Dumps: Use pg_dump to export masked data into an SQL file, ideal for creating sanitized backups or test environments.
Static Masking: Permanently modify the original data in the database, directly replacing sensitive information with artificial values.
Dynamic Masking: Hide original data for specific "masked" roles, while other roles can still access the real data. This is a form of data redaction with realistic, masked values.
Masking Views: Create dedicated views that present masked data to users, adding another layer of security and data control.
Masking Data Wrappers: Apply masking rules to external data sources using Foreign Data Wrappers.
Main anonymization functions available in the extension:
Destruction: Permanently erases data.
Adding Noise: Adds random variations to data.
Randomization: Generates random values within a specified range.
Faking: Replaces real data with realistic but fake values (e.g., names, addresses).
Advanced Faking: Provides more complex faking capabilities.
Pseudonymization: Replaces identifying data with a reversible pseudonym.
Generic Hashing: Creates a one-way hash of the data.
Partial Scrambling: Scrambles only part of a data string.
Conditional Masking: Applies masks based on a specific condition.
Generalization: Broadens data to a less specific category (e.g., replacing an exact age with an age range).
Using pg_catalog Functions: Leverages built-in Postgres functions for masking.
Image Blurring: Applies blurring to images.
Write your own Masks!: You can create and use your own custom masking functions.
Practical Examples: Dynamic vs. Static Masking
Let's look at how to implement two of the most common masking methods.
Dynamic Masking
This method is great for development or training environments where you want different user roles to see different data.
First, create a masked role and set a security label:
\connect edb edb
CREATE ROLE masked_user LOGIN;
SECURITY LABEL FOR anon ON ROLE masked_user IS 'MASKED';
GRANT pg_read_all_data to masked_user;
Next, enable dynamic masking and apply the masking function to a specific column. Here, we'll use the anon.fake_first_name() function.
ALTER DATABASE edb SET anon.transparent_dynamic_masking TO true;
SECURITY LABEL FOR anon ON COLUMN test.name IS 'MASKED WITH FUNCTION anon.fake_first_name()';
When a regular user queries the table, they see the original data:
\connect edb edb
SELECT * FROM test;
name | phone
------+-------
ABCD | 12345
EFGH | 67890
However, a user with the masked_user role sees the data transformed:
\connect - masked_user
SELECT * FROM test;
name | phone
---------+-------
Hayley | 12345
Olivia | 67890
Static Masking
Static masking is for when you need to permanently anonymize data. For example, before handing over a production database dump to a development team.
With the edb role, the original data is visible:
\connect - edb
SELECT * FROM test;
name | phone
------+-------
ABCD | 12345
EFGH | 67890
To permanently anonymize the data, simply run the anonymize_table function:
SELECT anon.anonymize_table('test');
anonymize_table
-----------------
t
The data in the table is now permanently changed.
SELECT * FROM test;
name | phone
--------+-------
Phillip| 12345
Mariah | 67890
The Shuffle Concept: Mixing Data Vertically
A unique function in the pg_anonymizer extension is shuffle_column(). This is not a traditional masking function because it operates vertically, mixing values within the same column. It cannot be used as Dynamic Masking.
For example, if you have a table with names and IDs, shuffling the ID column will reassign the IDs to different names, making it difficult to link a specific name to its original ID.
SELECT * FROM test;
name | alpha | id
------+-------+----
A | a | 1
B | b | 2
C | c | 3
D | d | 4
After shuffling the id column with SELECT anon.shuffle_column('test', 'id', 'id');:
SELECT * FROM test;
name | alpha | id
------+-------+----
C | c | 3
A | a | 2
B | b | 4
D | d | 1
Use Cases for Data Masking
Data masking is more than a security measure; it's a strategic tool for various business and development needs:
Software Development: Developers can work with realistic data without risking a data leak from a production copy.
Software Testing: Testers can run rigorous tests on data that mimics a production environment without compromising privacy.
User Training & Sales Demos: Create realistic scenarios for training or sales presentations using anonymized data.
Compliance & Governance: Meet strict regulatory requirements by ensuring sensitive data is protected in non-production environments.
Conclusion
Data masking is an essential practice for protecting sensitive information in modern database management. By using the declarative approach of the pg_anonymizer extension, organizations can build privacy directly into their database design, ensuring compliance and security across their development, testing, and production environments. And for, full-fledged data redaction feature you can refer DBMS_REDACT in EPAS for more information.
References
- Detailed information about pg_anonymizer extension: PostgreSQL_Anonymizer
- EDB Blogs on Data Masking & Redaction: Here, here, and here
- pg_anonymizer installation steps in EPAS & PGE: Installing pg_anonymizer