Create a storage credential for connecting to Google Cloud Storage
This article describes how to create a storage credential in Unity Catalog to connect to Google Cloud Storage.
To manage access to the underlying cloud storage that holds tables and volumes, Unity Catalog uses the following object types:
Storage credentials encapsulate a long-term cloud credential that provides access to cloud storage.
External locations contain a reference to a storage credential and a cloud storage path.
For more information, see Manage access to cloud storage using Unity Catalog.
Unity Catalog supports two cloud storage options for Databricks on Google Cloud: Google Cloud Storage (GCS) buckets and Cloudflare R2 buckets. Cloudflare R2 is intended primarily for Delta Sharing use cases in which you want to avoid data egress fees. GCS is appropriate for most other use cases. This article focuses on creating storage credentials for GCS. For Cloudflare R2, see Create a storage credential for connecting to Cloudflare R2.
To create a storage credential for access to a GCS bucket, you give Unity Catalog the ability to read and write to the bucket by assigning IAM roles on that bucket to a Databricks-generated Google Cloud service account.
Requirements
In Databricks:
Databricks workspace enabled for Unity Catalog.
CREATE STORAGE CREDENTIAL privilege on the Unity Catalog metastore attached to the workspace. Account admins and metastore admins have this privilege by default.
In your Google Cloud account:
A GCS bucket in the same region as the workspaces you want to access the data from.
Permission to modify the access policy for that bucket.
Generate a Google Cloud service account using Catalog Explorer
Log in to your Unity Catalog-enabled Databricks workspace as a user who has the CREATE STORAGE CREDENTIAL privilege on the metastore. The metastore admin and account admin roles both include this privilege.
In the sidebar, click Catalog.
On the Quick access page, click the External data > button, go to the Credentials tab, and select Create credential.
Select a Credential Type of GCP Service Account.
Enter a Storage credential name and an optional comment.
(Optional) If you want users to have read-only access to the external locations that use this storage credential, select Read only. For more information, see Mark a storage credential as read-only.
Click Create.
Databricks creates the storage credential and generates a Google Cloud service account.
On the Storage credential created dialog, make a note of the service account ID, which is in the form of an email address, and click Done.
(Optional) Bind the storage credential to specific workspaces.
By default, any privileged user can use the storage credential on any workspace attached to the metastore. If you want to allow access only from specific workspaces, go to the Workspaces tab and assign workspaces. See (Optional) Assign a storage credential to specific workspaces.
Configure permissions for the service account
Go to the Google Cloud console and open the GCS bucket that you want to access from your Databricks workspace.
The bucket should be in the same region as your Databricks workspace.
On the Permissions tab, click + Grant access and assign the service account the following roles:
Storage Legacy Bucket Reader
Storage Object Admin
Use the service account’s email address as the principal identifier.
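If you manage bucket access as code rather than through the console, the same grant can be pictured as an edit to the bucket's IAM policy document. The following is a minimal sketch, not a Databricks or Google Cloud tool: it assumes the policy is available as a Python dict in the standard IAM policy JSON shape (a `bindings` list of `role` and `members`), and it uses the role IDs `roles/storage.legacyBucketReader` and `roles/storage.objectAdmin` for the two role names above. The function name and the example email addresses are hypothetical, and writing the updated policy back to the bucket is not shown.

```python
import copy

# Role IDs for the two console role names listed above.
REQUIRED_ROLES = (
    "roles/storage.legacyBucketReader",  # Storage Legacy Bucket Reader
    "roles/storage.objectAdmin",         # Storage Object Admin
)


def grant_bucket_roles(policy, service_account_email, roles=REQUIRED_ROLES):
    """Return a copy of `policy` with the service account added to each role.

    `policy` is a dict in IAM policy JSON form. The input is not modified.
    """
    member = f"serviceAccount:{service_account_email}"
    updated = copy.deepcopy(policy)
    bindings = updated.setdefault("bindings", [])
    for role in roles:
        # Reuse an existing unconditional binding for this role if there is one.
        binding = next(
            (b for b in bindings if b.get("role") == role and "condition" not in b),
            None,
        )
        if binding is None:
            binding = {"role": role, "members": []}
            bindings.append(binding)
        members = binding.setdefault("members", [])
        if member not in members:
            members.append(member)
    return updated


# Hypothetical example values. Substitute the service account email shown in
# the "Storage credential created" dialog.
example_policy = {
    "bindings": [
        {"role": "roles/storage.objectAdmin", "members": ["user:admin@example.com"]}
    ]
}
new_policy = grant_bucket_roles(
    example_policy, "db-uc-credential@example-project.iam.gserviceaccount.com"
)
```

The helper is idempotent, so running it against a policy that already contains the grants returns an equivalent policy.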
Configure permissions for file events
Note
This step is optional but highly recommended. If you do not grant Databricks access to configure file events on your behalf, you must configure file events manually for each location. Without file events, you will have limited access to critical features that Databricks may release in the future.
The steps below allow Databricks to set up a complete notification pipeline to publish event notification messages from your GCS buckets to Google Cloud Pub/Sub. They assume that you have a GCP project with a GCS bucket and have enabled the Pub/Sub API.
Create a custom IAM role for file events.
In the Google Cloud console for the project containing your GCS bucket, navigate to IAM & Admin > Roles.
If you already have a custom IAM role, select it and click Edit Role. Otherwise, create a new role by clicking + Create Role from the Roles page.
On the Create Role or Edit Role screen, add the following permissions to your custom IAM role and save the changes. For detailed instructions, see the GCP documentation.
pubsub.subscriptions.consume
pubsub.subscriptions.create
pubsub.subscriptions.delete
pubsub.subscriptions.get
pubsub.subscriptions.list
pubsub.subscriptions.update
pubsub.topics.attachSubscription
pubsub.topics.create
pubsub.topics.delete
pubsub.topics.get
pubsub.topics.list
pubsub.topics.update
storage.buckets.update
Grant access to the role.
Navigate to IAM & Admin > IAM.
Click Grant Access.
Enter your service account as the principal.
Select your custom IAM role.
Click Save.
Grant permissions to the Cloud Storage Service Agent
Find the service agent account email by following these steps in the Google Cloud documentation.
In the Google Cloud console, navigate to IAM & Admin > IAM > Grant Access.
Enter the service agent account email and assign the Pub/Sub Publisher role.
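Because the custom role for file events needs thirteen separate permissions, it can help to check a role definition against that list before granting it. The sketch below is one way to do that and is not part of any Databricks tooling. It assumes the role is available as a dict with an `includedPermissions` list, which is the field name used in the JSON representation of an IAM role. The function name and the example role are hypothetical.

```python
# Permissions the file events steps ask you to add to the custom IAM role.
REQUIRED_PERMISSIONS = frozenset({
    "pubsub.subscriptions.consume",
    "pubsub.subscriptions.create",
    "pubsub.subscriptions.delete",
    "pubsub.subscriptions.get",
    "pubsub.subscriptions.list",
    "pubsub.subscriptions.update",
    "pubsub.topics.attachSubscription",
    "pubsub.topics.create",
    "pubsub.topics.delete",
    "pubsub.topics.get",
    "pubsub.topics.list",
    "pubsub.topics.update",
    "storage.buckets.update",
})


def missing_permissions(role):
    """Return the required permissions that `role` does not include, sorted."""
    included = set(role.get("includedPermissions", []))
    return sorted(REQUIRED_PERMISSIONS - included)


# Hypothetical role definition that is only partly configured.
example_role = {
    "title": "Databricks file events",
    "includedPermissions": [
        "pubsub.topics.create",
        "pubsub.topics.get",
        "storage.buckets.update",
    ],
}
missing = missing_permissions(example_role)
```

An empty result means the role already carries every permission listed in the steps above.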
You can now create an external location that references this storage credential.
(Optional) Assign a storage credential to specific workspaces
Preview
This feature is in Public Preview.
By default, a storage credential is accessible from all of the workspaces in the metastore. This means that if a user has been granted a privilege (such as CREATE EXTERNAL LOCATION) on that storage credential, they can exercise that privilege from any workspace attached to the metastore. If you use workspaces to isolate user data access, you may want to allow access to a storage credential only from specific workspaces. This feature is known as workspace binding or storage credential isolation.
A typical use case for binding a storage credential to specific workspaces is the scenario in which a cloud admin configures a storage credential using a production cloud account credential, and you want to ensure that Databricks users use this credential to create external locations only in the production workspace.
For more information about workspace binding, see (Optional) Assign an external location to specific workspaces and Limit catalog access to specific workspaces.
Note
Workspace bindings are referenced when privileges against storage credentials are exercised. For example, if a user creates an external location using a storage credential, the workspace binding on the storage credential is checked only when the external location is created. After the external location is created, it will function independently of the workspace bindings configured on the storage credential.
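The binding rule can be pictured as a small predicate that is evaluated at the moment a privilege is exercised. This is only a conceptual sketch under stated assumptions, not how Databricks implements the check: the mode names OPEN and ISOLATED are the ones used in the CLI instructions in this article, while the function name and workspace IDs are hypothetical. The user must still hold the relevant privilege on the storage credential, which this sketch does not model.

```python
def credential_usable_from(workspace_id, isolation_mode, bound_workspace_ids):
    """Conceptual model of workspace binding for a storage credential.

    OPEN: usable from any workspace attached to the metastore.
    ISOLATED: usable only from workspaces the credential is bound to.
    """
    if isolation_mode == "OPEN":
        return True
    if isolation_mode == "ISOLATED":
        return workspace_id in bound_workspace_ids
    raise ValueError(f"unknown isolation mode: {isolation_mode!r}")


# Hypothetical workspace IDs: 111 is the production workspace, 222 is not bound.
production_only = {111}
allowed_in_prod = credential_usable_from(111, "ISOLATED", production_only)
allowed_in_dev = credential_usable_from(222, "ISOLATED", production_only)
```

In this model, an external location created from workspace 111 keeps working afterward regardless of later binding changes, because the check applies only when the privilege on the storage credential is exercised.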
Bind a storage credential to one or more workspaces
To assign a storage credential to specific workspaces, you can use Catalog Explorer or the Databricks CLI.
Permissions required: Metastore admin or storage credential owner.
Note
Metastore admins can see all storage credentials in a metastore using Catalog Explorer—and storage credential owners can see all storage credentials that they own in a metastore—regardless of whether the storage credential is assigned to the current workspace. Storage credentials that are not assigned to the workspace appear grayed out.
Log in to a workspace that is linked to the metastore.
In the sidebar, click Catalog.
On the Quick access page, click the External data > button and go to the Credentials tab.
Select the storage credential and go to the Workspaces tab.
On the Workspaces tab, clear the All workspaces have access checkbox.
If your storage credential is already bound to one or more workspaces, this checkbox is already cleared.
Click Assign to workspaces and enter or find the workspaces you want to assign.
To revoke access, go to the Workspaces tab, select the workspace, and click Revoke. To allow access from all workspaces, select the All workspaces have access checkbox.
There are two Databricks CLI command groups and two steps required to assign a storage credential to a workspace.
In the following examples, replace <profile-name> with the name of your Databricks authentication configuration profile. It should include the value of a personal access token, in addition to the workspace instance name and workspace ID of the workspace where you generated the personal access token. See Databricks personal access token authentication.
Use the storage-credentials command group’s update command to set the storage credential’s isolation mode to ISOLATED:
databricks storage-credentials update <my-storage-credential> \
--isolation-mode ISOLATED \
--profile <profile-name>
The default isolation-mode is OPEN to all workspaces attached to the metastore.
Use the workspace-bindings command group’s update-bindings command to assign the workspaces to the storage credential:
databricks workspace-bindings update-bindings storage-credential <my-storage-credential> \
--json '{
  "add": [{"workspace_id": <workspace-id>}...],
  "remove": [{"workspace_id": <workspace-id>}...]
}' --profile <profile-name>
Use the "add" and "remove" properties to add or remove workspace bindings.
Note
Read-only binding (BINDING_TYPE_READ_ONLY) is not available for storage credentials. Therefore there is no reason to set binding_type for the storage credentials binding.
To list all workspace assignments for a storage credential, use the workspace-bindings command group’s get-bindings command:
databricks workspace-bindings get-bindings storage-credential <my-storage-credential> \
--profile <profile-name>
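When you script workspace assignments, writing the --json payload by hand is error-prone. The following is a minimal sketch that assembles the update-bindings command shown above as an argument list, with the payload generated by json.dumps. It only builds and prints the command and does not invoke the CLI. The helper names, credential name, profile name, and workspace ID are hypothetical placeholders.

```python
import json
import shlex


def bindings_payload(add=(), remove=()):
    """Build the JSON string for --json from workspace IDs to add or remove."""
    payload = {}
    if add:
        payload["add"] = [{"workspace_id": int(w)} for w in add]
    if remove:
        payload["remove"] = [{"workspace_id": int(w)} for w in remove]
    return json.dumps(payload)


def update_bindings_command(credential_name, profile, add=(), remove=()):
    """Return the update-bindings invocation as an argv list."""
    return [
        "databricks", "workspace-bindings", "update-bindings",
        "storage-credential", credential_name,
        "--json", bindings_payload(add, remove),
        "--profile", profile,
    ]


# Hypothetical values for illustration only.
cmd = update_bindings_command(
    "my-storage-credential", "my-profile", add=[1234567890123456]
)
# shlex.join quotes the JSON so the printed line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to subprocess.run instead of printing it avoids shell quoting issues entirely, at the cost of requiring the Databricks CLI on the machine that runs the script.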
Unbind a storage credential from a workspace
Instructions for revoking workspace access to a storage credential using Catalog Explorer or the workspace-bindings CLI command group are included in Bind a storage credential to one or more workspaces.
Next steps
You can view, update, delete, and grant other users permission to use storage credentials. See Manage storage credentials.
You can define external locations using storage credentials. See Create an external location to connect cloud storage to Databricks.