Data governance overview

This article describes the need for data governance and shares best practices and strategies you can use to implement data governance across your organization.

Why is data governance important?

Data governance is the oversight that ensures data brings value and supports your business strategy. It encapsulates the policies and practices implemented to securely manage the data assets within an organization. As the amount and complexity of data grow, more and more organizations are looking to data governance to ensure core business outcomes such as:

  • Consistent and high data quality as a foundation for analytics and machine learning.

  • Reduced time to insight.

  • Data democratization, that is, enabling everyone in an organization to make data-driven decisions.

  • Support for risk management and compliance with industry regulations such as HIPAA, FedRAMP, GDPR, or CCPA.

  • Cost optimization, for example by preventing users from spinning up large clusters and by creating guardrails around the use of expensive GPU instances.

What does a good data governance solution look like?

Data-driven companies typically build their data architectures for analytics on the lakehouse. A data lakehouse is an architecture that enables efficient and secure data engineering, machine learning, data warehousing, and business intelligence directly on vast amounts of data stored in data lakes. Data governance for a data lakehouse provides the following key capabilities:

  • Unified catalog: A unified catalog stores all your data, ML models, and analytics artifacts, in addition to metadata for each data object. The unified catalog also blends in data from other catalogs such as an existing Hive metastore.

  • Unified data access controls: A single, unified permissions model across all data assets and all clouds. This includes attribute-based access control (ABAC) for personally identifiable information (PII).

  • Data auditing: Data access is centrally audited with alerts and monitoring capabilities to promote accountability.

  • Data quality management: Robust data quality management with built-in quality controls, testing, monitoring, and enforcement to ensure accurate and useful data is available for downstream BI, analytics, and machine learning workloads.

  • Data lineage: End-to-end visibility into how data flows in the lakehouse, from source to consumption.

  • Data discovery: Easy data discovery to enable data scientists, data analysts, and data engineers to quickly discover and reference relevant data and accelerate time to value.

  • Data sharing: Data can be shared across clouds and platforms.

Data governance and Databricks

Databricks provides a number of features to help you meet your data governance needs.

Manage access to data and objects:

  • Table access control lets you programmatically grant, deny, and revoke access to your data from the Spark SQL API.

Manage cluster configurations:

  • Cluster policies enable administrators to control access to compute resources.

Audit data access:

  • Audit logs provide visibility into actions and operations across your account and workspaces.

The following sections illustrate how to use these Databricks features to implement a governance solution.

Manage access to data and objects

To manage access to data and objects, you enable access control and implement fine-grained control of individual tables and objects.

You can enable table access control in a workspace to programmatically grant, deny, and revoke access to your data from the Spark SQL API. You can control access to securable objects such as databases, tables, views, and functions. Consider a scenario where your company has a database that stores financial data. You might want your analysts to create financial reports using that data. However, another table in the database might contain sensitive information that analysts should not access. You can grant a user or group the privileges required to read data from one table while denying all privileges on the second table.

In the following example, Alice is an admin who owns the shared_data and private_data tables in the finance database. Alice then provides Oscar, an analyst, with the privileges required to read from shared_data but denies all privileges on private_data.

Alice grants SELECT privileges to Oscar to read from shared_data:
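
Here is a minimal Spark SQL sketch of this grant; the two-level table name finance.shared_data and the user name oscar@example.com are illustrative, not objects that exist in your workspace:

-- Depending on your workspace configuration, USAGE on the database may also be required
GRANT USAGE ON DATABASE finance TO `oscar@example.com`;
-- Allow Oscar to read from the shared table
GRANT SELECT ON TABLE finance.shared_data TO `oscar@example.com`;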

Alice denies all privileges to Oscar to access private_data:
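
Again as a sketch with the same illustrative names:

-- Explicitly deny Oscar any privilege on the sensitive table
DENY ALL PRIVILEGES ON TABLE finance.private_data TO `oscar@example.com`;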

You can take this one step further by defining fine-grained access controls on a subset of a table or by setting privileges on derived views of a table.
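
For example, you might create a view that exposes only non-sensitive columns of private_data and grant SELECT on the view instead of the underlying table. The following is a sketch only; the view name and column names are hypothetical:

-- Expose a redacted subset of private_data through a view (column names are hypothetical)
CREATE VIEW finance.private_data_summary AS
SELECT transaction_id, region, amount
FROM finance.private_data;

-- Grant read access on the view rather than on the underlying table
GRANT SELECT ON VIEW finance.private_data_summary TO `oscar@example.com`;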

Manage cluster configurations

You can use cluster policies to provision clusters automatically, manage their permissions, and control costs.

Cluster policies allow Databricks administrators to define the cluster attributes that are allowed on a cluster, such as instance types, number of nodes, custom tags, and more. When an admin creates a policy and assigns it to a user or a group, those users can create clusters based only on the policies they have access to. This gives administrators a much higher degree of control over the types of clusters that can be created.

You define policies in a JSON policy definition and then create cluster policies using the cluster policies UI or the Cluster Policies API 2.0. A user can create a cluster only if they have the create_cluster permission or access to at least one cluster policy. For the new analytics project team described in the following section, an administrator can create a cluster policy and assign it to one or more users on the team, who can then create clusters for the team only within the rules specified in the cluster policy.
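
The following is a sketch of what such a policy definition might look like; the node type, worker limit, auto-termination value, and tag are assumptions chosen to match the example cluster later in this article, not required values:

{
  "node_type_id": {
    "type": "fixed",
    "value": "n1-highmem-4"
  },
  "autoscale.max_workers": {
    "type": "range",
    "maxValue": 50
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 60
  },
  "custom_tags.team": {
    "type": "fixed",
    "value": "new-project-team"
  }
}

A user who has access to this policy can then create clusters only within these bounds; attributes that the policy fixes are set automatically and cannot be changed by the user.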

Automatically provision clusters and grant permissions

With the addition of endpoints for both clusters and permissions, the Databricks REST API 2.0 makes it easy to provision cluster resources and grant permissions on them to users and groups at any scale. You can use the Clusters API 2.0 to create and configure clusters for your specific use case.

You can then use the Permissions API 2.0 to apply access controls to the cluster.

The following is an example of a configuration that might suit a new analytics project team.

The requirements are as follows:

  • Support the interactive workloads of this team, who are mostly SQL and Python users.

  • Provision a data source in object storage with credentials that give the team access to the data tied to the role.

  • Ensure that users get an equal share of the cluster’s resources.

  • Provision larger, memory optimized instance types.

  • Grant permissions to the cluster such that only this new project team has access to it.

  • Tag this cluster to make sure you can properly do chargebacks on any compute costs incurred.

Deployment script

You deploy this configuration by using the API endpoints in the Clusters and Permissions APIs.

Provision cluster

Endpoint - https://<databricks-instance>/api/2.0/clusters/create

Note

Cost control is enabled by using the preemptible executors option.

{
  "autoscale": {
     "min_workers": 2,
     "max_workers": 50
  },
  "cluster_name": "project team interactive cluster",
  "spark_version": "7.5.x-scala2.12",
  "spark_conf": {
     "spark.databricks.cluster.profile": "serverless",
     "spark.databricks.repl.allowedLanguages": "sql,python,r"
  },
  "gcp_attributes": {
      "use_preemptible_executors": true
  },
  "node_type_id": "n1-highmem-4",
  "ssh_public_keys": [],
  "custom_tags": {
     "ResourceClass": "Serverless",
     "team": "new-project-team"
  },
  "spark_env_vars": {
     "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 60,
  "enable_elastic_disk": "false",
  "init_scripts": []
}

Grant cluster permission

Endpoint - https://<databricks-instance>/api/2.0/permissions/clusters/<cluster_id>

{
  "access_control_list": [
    {
      "group_name": "project team",
      "permission_level": "CAN_MANAGE"
    }
  ]
}

You now have a cluster that has been provisioned with secure access to critical data in the lake, locked down to all but the corresponding team, tagged for chargebacks, and configured to meet the requirements of the project. There are additional configuration steps within your host cloud provider account required to implement this solution, though these, too, can be automated to meet the requirements of scale.

Audit access

Configuring access control in Databricks and controlling data access in storage is the first step towards an efficient data governance solution. However, a complete solution requires auditing access to data and providing alerting and monitoring capabilities. Databricks provides a comprehensive set of audit events to log activities performed by Databricks users, allowing enterprises to monitor detailed usage patterns on the platform.

Make sure you configure audit logs. This involves configuring the right access policy so that Databricks can deliver audit logs to a Google Cloud Storage bucket that you provide. Audit logs are typically delivered within one hour.
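
Once delivery is configured, you can analyze the logs directly from the lakehouse. The following is a minimal sketch; the bucket path is hypothetical, and it assumes the logs are delivered as JSON files using the documented audit log schema (fields such as timestamp, serviceName, actionName, and userIdentity):

-- External table over the delivered audit logs (bucket path is hypothetical)
CREATE TABLE IF NOT EXISTS audit_logs
USING JSON
LOCATION 'gs://my-audit-log-bucket/workspaceId=1234567890/';

-- Review recent cluster-related actions and who performed them
SELECT timestamp, userIdentity.email, actionName
FROM audit_logs
WHERE serviceName = 'clusters'
ORDER BY timestamp DESC;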

Learn more

Here are some resources to help you build a comprehensive data governance solution that meets your organization’s needs:

  • The Databricks Security and Trust Center, which provides information about the ways in which security is built into every layer of the Databricks Lakehouse Platform.

  • Table access control lets you apply data governance controls for your data.

  • Keep data secure with secrets, for information on how to use Databricks secrets to store your credentials and reference them in notebooks and jobs. You should never hard code secrets or store them in plain text.