Best practices: Data governance

Note

Credential passthrough is unavailable on Databricks on Google Cloud as of this release.

This document describes the need for data governance and shares best practices and strategies you can use to implement these techniques across your organization. It demonstrates a typical deployment workflow you can employ using Databricks and cloud-native solutions to secure and monitor each layer from the application down to storage.

Why is data governance important?

Data governance is an umbrella term that encapsulates the policies and practices implemented to securely manage the data assets within an organization. As one of the key tenets of any successful data governance practice, data security is likely to be top of mind at any large organization. Key to data security is the ability for data teams to have superior visibility and auditability of user data access patterns across their organization. Implementing an effective data governance solution helps companies protect their data from unauthorized access and ensures that they have rules in place to comply with regulatory requirements.

Governance challenges

Whether you’re managing the data of a startup or a large corporation, security teams and platform owners have the singular challenge of ensuring that this data is secure and is being managed according to the internal controls of the organization. Regulatory bodies the world over are changing the way we think about how data is captured and stored, and these compliance risks only add further complexity to an already tough problem. How, then, do you open your data to those who can drive the use cases of the future? Ultimately, you should adopt data policies and practices that help the business realize value through the meaningful application of what can often be vast, ever-growing stores of data. Solutions to the world’s toughest problems emerge when data teams have access to many disparate sources of data.

Typical challenges when considering the security and availability of your data in the cloud:

  • Do your current data and analytics tools support access controls on your data in the cloud? Do they provide robust logging of actions taken on the data as it moves through the given tool?
  • Will the security and monitoring solution you put in place now scale as demand on the data in your data lake grows? It can be easy enough to provision and monitor data access for a small number of users. What happens when you want to open up your data lake to hundreds of users? To thousands?
  • Is there anything you can do to be proactive in ensuring that your data access policies are being observed? It is not enough to simply monitor; that just produces more data. Because data availability and data security go hand in hand, you should have a solution in place that actively monitors and tracks access to this information across the organization.
  • What steps can you take to identify gaps in your existing data governance solution?

How Databricks addresses these challenges

  • Access control: A rich suite of access controls that extend all the way down to the storage layer. Databricks takes advantage of its cloud backbone by using state-of-the-art Google Cloud security services directly in the platform.
  • API first: Automate provisioning and permission management with the Databricks REST API.
  • Cluster policies: Enable administrators to control access to compute resources.

The following sections illustrate how to use these Databricks features to implement a governance solution.

Set up access control

To set up access control, you secure access to storage and implement fine-grained control of individual tables.

Implement table access control

You can enable table access control on Databricks to programmatically grant, deny, and revoke access to your data from the Spark SQL API. You can control access to securable objects such as databases, tables, views, and functions. Consider a scenario where your company has a database that stores financial data. You might want your analysts to create financial reports using that data. However, another table in the database might contain sensitive information that analysts should not access. You can grant a user or group the privileges required to read data from one table while denying all privileges on the second table.

In the following example, Alice is an admin who owns the shared_data and private_data tables in the finance database. Alice provides Oscar, an analyst, with the privileges required to read from shared_data but denies him all privileges on private_data.

Alice grants Oscar the SELECT privilege required to read from shared_data.
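
In Spark SQL, the grant might look like the following. The principal name oscar@example.com and the finance database qualifier are illustrative:

-- Depending on your runtime version, USAGE on the database may also be required
GRANT USAGE ON DATABASE finance TO `oscar@example.com`;

-- Allow Oscar to read the shared_data table
GRANT SELECT ON TABLE finance.shared_data TO `oscar@example.com`;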

Alice also denies Oscar all privileges on private_data.
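
A corresponding deny might look like this, again with an illustrative principal name:

-- Explicitly block Oscar from the private_data table
DENY ALL PRIVILEGES ON TABLE finance.private_data TO `oscar@example.com`;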

You can take this one step further by defining fine-grained access controls to a subset of a table or by setting privileges on derived views of a table.
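
For example, you might create a view that exposes only a subset of shared_data and grant access to the view instead. The view name and the region column below are hypothetical:

-- Expose only a filtered subset of shared_data through a derived view
CREATE VIEW finance.shared_data_emea AS
  SELECT * FROM finance.shared_data WHERE region = 'EMEA';

-- Grant read access on the view rather than the underlying table
GRANT SELECT ON VIEW finance.shared_data_emea TO `oscar@example.com`;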

Manage cluster configurations

Cluster policies allow Databricks administrators to define the cluster attributes that are allowed on a cluster, such as instance types, number of nodes, custom tags, and more. When an admin creates a policy and assigns it to a user or a group, those users can create only clusters that conform to a policy they have access to. This gives administrators a much higher degree of control over the types of clusters that can be created.

You define policies in a JSON policy definition and then create cluster policies using the cluster policies UI or the Cluster Policies API. A user can create a cluster only if they have the create_cluster permission or access to at least one cluster policy. For the new analytics project team described in the next section, an administrator can create a cluster policy and assign it to one or more users on the project team, who can then create clusters for the team only within the rules specified in the cluster policy. For example, a user with access to a Project Team Cluster Policy can create clusters only within the bounds of that policy definition.

Automatically provision clusters and grant permissions

With the addition of endpoints for both clusters and permissions, the Databricks REST API 2.0 makes it easy to provision cluster resources and grant permissions on them for users and groups at any scale. You can use the Clusters API to create and configure clusters for your specific use case.

You can then use the Permissions API to apply access controls to the cluster.

The following is an example of a configuration that might suit a new analytics project team.

The requirements are:

  • Support the interactive workloads of this team, who are mostly SQL and Python users.
  • Provision a data source in object storage with credentials that give the team access to the data tied to the role.
  • Ensure that users get an equal share of the cluster’s resources.
  • Provision larger, memory optimized instance types.
  • Grant permissions to the cluster such that only this new project team has access to it.
  • Tag this cluster to make sure you can properly do chargebacks on any compute costs incurred.

Deployment script

You deploy this configuration by using the API endpoints in the Clusters and Permissions APIs.

Provision cluster

Endpoint - https://<databricks-instance>/api/2.0/clusters/create

Note

Cost control is enabled by using the preemptible executors option.

{
  "autoscale": {
     "min_workers": 2,
     "max_workers": 50
  },
  "cluster_name": "project team interactive cluster",
  "spark_version": "7.5.x-scala2.12",
  "spark_conf": {
     "spark.databricks.cluster.profile": "serverless",
     "spark.databricks.repl.allowedLanguages": "sql,python,r"
  },
  "gcp_attributes": {
      "use_preemptible_executors": true
  },
  "node_type_id": "n1-highmem-4",
  "ssh_public_keys": [],
  "custom_tags": {
     "ResourceClass": "Serverless",
     "team": "new-project-team"
  },
  "spark_env_vars": {
     "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 60,
  "enable_elastic_disk": "false",
  "init_scripts": []
}

Grant cluster permission

Endpoint - https://<databricks-instance>/api/2.0/permissions/clusters/<cluster_id>

{
  "access_control_list": [
    {
      "group_name": "project team",
      "permission_level": "CAN_MANAGE"
    }
  ]
}

Instantly you have a cluster that has been provisioned with secure access to critical data in the lake, locked down to all but the corresponding team, tagged for chargebacks, and configured to meet the requirements of the project. There are additional configuration steps within your host cloud provider account required to implement this solution, though these, too, can be automated to meet the requirements of scale.

Learn more

Here are some resources to help you build a comprehensive data governance solution that meets your organization’s needs: