This article describes how you can use Delta Lake on Databricks to manage General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA) compliance for your data lake. Compliance often requires point deletes, that is, deleting individual records within a large collection of data. Delta Lake speeds up point deletes in large data lakes with ACID transactions, allowing you to locate and remove personally identifiable information (PII) in response to consumer GDPR or CCPA requests.
Modeling your data for compliance is an important step in dealing with PII. There are numerous viable approaches depending on the needs of your data consumers.
One frequently applied approach is pseudonymization, or reversible tokenization of personal information elements (identifiers) to keys (pseudonyms) that cannot be externally identified. Compliance through pseudonymization requires careful planning, including the following:
Storage of information in a manner linked to pseudonyms rather than identifiers.
Maintenance of strict policies for the access and usage of data that combine the identifiers and pseudonyms.
Pipelines or storage policies to remove raw data.
Logic to locate and delete the linkage between the pseudonyms and identifiers.
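The requirements above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory `PseudonymVault` class standing in for a tightly access-controlled lookup table; the class, its method names, and the sample email address are illustrative and are not part of Delta Lake or Databricks.

```python
import secrets


class PseudonymVault:
    """Hypothetical lookup that links identifiers to random pseudonyms.

    In practice this linkage would live in a separate, access-restricted
    table; a pair of dicts is used here only to keep the sketch runnable.
    """

    def __init__(self):
        self._by_identifier = {}
        self._by_pseudonym = {}

    def tokenize(self, identifier: str) -> str:
        """Return the pseudonym for an identifier, creating one if needed."""
        if identifier not in self._by_identifier:
            # A random token carries no information about the identifier.
            pseudonym = secrets.token_hex(16)
            self._by_identifier[identifier] = pseudonym
            self._by_pseudonym[pseudonym] = identifier
        return self._by_identifier[identifier]

    def resolve(self, pseudonym: str):
        """Reverse the tokenization, or return None if the link is gone."""
        return self._by_pseudonym.get(pseudonym)

    def forget(self, identifier: str) -> bool:
        """Delete the linkage for one identifier; True if a link existed."""
        pseudonym = self._by_identifier.pop(identifier, None)
        if pseudonym is None:
            return False
        del self._by_pseudonym[pseudonym]
        return True


vault = PseudonymVault()
# Downstream data stores only the pseudonym, never the raw identifier.
event = {"user": vault.tokenize("alice@example.com"), "action": "login"}
# Handling an erasure request removes the link, not the event itself.
vault.forget("alice@example.com")
```

Once `forget` has run, the stored event still holds its pseudonym, but nothing in the vault maps that pseudonym back to a person. Whether this alone is sufficient depends on your compliance requirements and on whether raw data containing the identifier has also been removed.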
Delta Lake has many data skipping optimizations built in. To accelerate point deletes, Databricks recommends using Z-order on fields that you use during
By default, Delta Lake retains table history for 30 days and makes it available for “time travel” and rollbacks. This means that records removed by a delete can still be read from earlier table versions until the underlying data files are removed. You can use the VACUUM command to remove files that are no longer referenced by a Delta table and are older than a specified retention threshold, permanently deleting the data.
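As a rough sketch of how a point delete and a later cleanup fit together, the helper below builds the Delta Lake SQL statements as strings. The function names, the table name `events`, the column name `user_pseudonym`, and the 168-hour retention value are assumptions made for illustration; check the retention threshold against your own policy before running `VACUUM`.

```python
import re

# Conservative patterns so that only simple names and hex tokens are
# ever interpolated into SQL text.
_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_.]*$")
_HEX = re.compile(r"^[0-9a-f]+$")


def zorder_statement(table: str, key_column: str) -> str:
    """Build an OPTIMIZE ... ZORDER BY statement for the lookup column."""
    if not (_NAME.match(table) and _NAME.match(key_column)):
        raise ValueError("unexpected table or column name")
    return f"OPTIMIZE {table} ZORDER BY ({key_column})"


def erasure_statements(table: str, key_column: str, pseudonym: str,
                       retain_hours: int = 168) -> list:
    """Build the DELETE and VACUUM statements for one erasure request."""
    if not (_NAME.match(table) and _NAME.match(key_column)):
        raise ValueError("unexpected table or column name")
    if not _HEX.match(pseudonym):
        raise ValueError("pseudonym must be a lowercase hex token")
    return [
        # Removes the rows from the current table version only.
        f"DELETE FROM {table} WHERE {key_column} = '{pseudonym}'",
        # Removes unreferenced files older than the retention threshold.
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]


for statement in erasure_statements("events", "user_pseudonym", "ab12cd34"):
    print(statement)
```

In a Databricks notebook, each string could be passed to `spark.sql(...)`. The `DELETE` takes effect immediately for readers of the current version, while `VACUUM` only removes files that have aged past the retention threshold, so the two are typically run at different times.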