GDPR and CCPA compliance with Delta Lake

This article describes how you can use Delta Lake on Databricks to manage General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA) compliance for your data lake. Compliance often requires point deletes, that is, deleting individual records from a large collection of data. Delta Lake speeds up point deletes in large data lakes with ACID transactions, allowing you to locate and remove personally identifiable information (PII) in response to consumer GDPR or CCPA requests.

Plan your data model for compliance

Modeling your data for compliance is an important step in dealing with PII. There are numerous viable approaches depending on the needs of your data consumers.

One frequently applied approach is pseudonymization: reversible tokenization that maps personal information elements (identifiers) to keys (pseudonyms) that cannot be externally identified. Compliance through pseudonymization requires careful planning, including the following:

Storage of information in a manner linked to pseudonyms rather than identifiers.
Maintenance of strict policies for the access and usage of data that combine the identifiers and pseudonyms.
Pipelines or storage policies to remove raw data.
Logic to locate and delete the linkage between the pseudonyms and identifiers.
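
The planning items above can be illustrated with a minimal sketch. It assumes an in-memory lookup standing in for a restricted linkage table; the class name `PseudonymVault`, its methods, and the sample email address are hypothetical and not part of Delta Lake or Databricks.

```python
import secrets


class PseudonymVault:
    """Hypothetical in-memory stand-in for a restricted identifier/pseudonym table."""

    def __init__(self):
        self._by_identifier = {}
        self._by_pseudonym = {}

    def pseudonymize(self, identifier: str) -> str:
        """Return the existing pseudonym for an identifier, or mint a random one."""
        if identifier not in self._by_identifier:
            pseudonym = secrets.token_hex(16)  # random, so it reveals nothing by itself
            self._by_identifier[identifier] = pseudonym
            self._by_pseudonym[pseudonym] = identifier
        return self._by_identifier[identifier]

    def reidentify(self, pseudonym: str):
        """Reverse lookup; returns None once the linkage has been deleted."""
        return self._by_pseudonym.get(pseudonym)

    def forget(self, identifier: str) -> bool:
        """Delete the linkage for an identifier. Returns True if one existed."""
        pseudonym = self._by_identifier.pop(identifier, None)
        if pseudonym is None:
            return False
        del self._by_pseudonym[pseudonym]
        return True


vault = PseudonymVault()
token = vault.pseudonymize("alice@example.com")
events = [{"user": token, "action": "login"}]  # stored data holds only the pseudonym
vault.forget("alice@example.com")              # deletion request: drop the linkage
```

In this sketch, the `events` records stay in place after `forget` runs, but the pseudonym in them can no longer be mapped back to the identifier through the vault.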

How Delta Lake simplifies point deletes

Delta Lake has many data skipping optimizations built in. To accelerate point deletes, Databricks recommends using Z-order on fields that you use during DELETE operations.
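
As a rough illustration, the helpers below build the SQL text for a Z-order optimization and a point delete. The table name `events`, the column `user_token`, and the helper names are hypothetical. The strict validation is an assumption made to keep the sketch safe without relying on any particular quoting rules.

```python
import re

# Hypothetical validation rules: plain (optionally dotted) names and simple token values.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")
_TOKEN = re.compile(r"[A-Za-z0-9_-]+")


def _check_name(name: str) -> str:
    """Reject anything that is not a plain table or column name."""
    if not _NAME.fullmatch(name):
        raise ValueError(f"unsupported name: {name!r}")
    return name


def zorder_statement(table: str, column: str) -> str:
    """Build an OPTIMIZE ... ZORDER BY statement for the column used in deletes."""
    return f"OPTIMIZE {_check_name(table)} ZORDER BY ({_check_name(column)})"


def point_delete_statement(table: str, column: str, token: str) -> str:
    """Build a DELETE statement that removes the rows for a single token value."""
    if not _TOKEN.fullmatch(token):
        raise ValueError(f"unsupported token: {token!r}")
    return f"DELETE FROM {_check_name(table)} WHERE {_check_name(column)} = '{token}'"


optimize_sql = zorder_statement("events", "user_token")
delete_sql = point_delete_statement("events", "user_token", "abc123")
# With an active Spark session (assumed, not created here), each statement
# could be submitted with spark.sql(optimize_sql) and spark.sql(delete_sql).
```

Z-ordering on the same column that appears in the `WHERE` clause is what lets data skipping narrow the delete to the files likely to contain matching rows.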

Delta Lake retains table history and makes it available for point-in-time queries and rollbacks. As a result, a DELETE does not immediately remove the underlying data: deleted records remain in older data files until those files are cleaned up. The VACUUM command removes data files that are no longer referenced by a Delta table and are older than a specified retention threshold, permanently deleting the data. To learn more about defaults and recommendations, see Work with Delta Lake table history.

Note

For tables with deletion vectors enabled, you must also run REORG TABLE ... APPLY (PURGE) to permanently delete underlying records. See Apply changes to Parquet data files.
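
The cleanup that follows a point delete can be sketched as a short, ordered list of statements. This sketch assumes the purge runs before VACUUM, so that the rewritten files are the ones left unreferenced. The helper name `purge_statements`, the table name `events`, and the 168-hour retention value are illustrative only.

```python
import re

# Hypothetical validation rule: plain (optionally dotted) table names only.
_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*")


def purge_statements(table, retain_hours=None, deletion_vectors=False):
    """Return the SQL statements to run, in order, after a point delete."""
    if not _TABLE_NAME.fullmatch(table):
        raise ValueError(f"unsupported table name: {table!r}")
    statements = []
    if deletion_vectors:
        # Rewrite data files so soft-deleted rows are physically dropped.
        statements.append(f"REORG TABLE {table} APPLY (PURGE)")
    vacuum = f"VACUUM {table}"
    if retain_hours is not None:
        # Optional explicit retention threshold; omit to use the table default.
        vacuum += f" RETAIN {int(retain_hours)} HOURS"
    statements.append(vacuum)
    return statements


plan = purge_statements("events", retain_hours=168, deletion_vectors=True)
# With an active Spark session (assumed, not created here):
# for statement in plan:
#     spark.sql(statement)
```

Because VACUUM removes only files older than the retention threshold, files that were replaced recently stay on storage until they age past that threshold.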