The customer-managed VPC feature is in Public Preview.
A workspace is a Databricks deployment in a cloud service account. It provides a unified environment for working with Databricks assets for a specified set of users. This article describes how to create and manage workspaces.
Databricks charges for Databricks usage in Databricks Units (DBUs). The number of DBUs a workload consumes varies based on a number of factors, including Databricks compute type (all-purpose or jobs) and Google Cloud machine type. For details, see the pricing page. If you have questions about pricing, contact your Databricks representative.
Additional costs are incurred in your Google Cloud account:
Google Cloud charges you an additional per-workspace cost for the GKE cluster that Databricks creates for Databricks infrastructure in your account. As of March 30, 2021, the cost for this GKE cluster is approximately $200/month, prorated to the days in the month that the GKE cluster runs. Prices can change, so check the latest prices.
The GKE cluster cost applies even if Databricks clusters are idle. To reduce this idle-time cost, Databricks deletes the GKE cluster in your account if no Databricks Runtime clusters are active for five days. Other resources, such as the VPC and GCS buckets, remain unchanged. The next time a Databricks Runtime cluster starts, Databricks recreates the GKE cluster, which adds to the initial Databricks Runtime cluster launch time. As an example of how GKE cluster deletion reduces monthly costs, suppose you use a Databricks Runtime cluster on the first of the month and not again for the rest of the month. Your GKE usage would be only the five days before the idle timeout takes effect, costing approximately $33 for the month.
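The proration above can be checked with a quick calculation. This is only a sketch using the approximate $200/month figure and five-day idle timeout quoted above; actual prices vary, so check current Google Cloud pricing:

```python
# Approximate prorated GKE cluster cost, assuming the ~$200/month
# figure quoted above and a 30-day month for simplicity.
MONTHLY_GKE_COST = 200.0   # USD, approximate; check current pricing
DAYS_IN_MONTH = 30
IDLE_TIMEOUT_DAYS = 5      # Databricks deletes the GKE cluster after 5 idle days

def prorated_cost(days_running: int,
                  monthly_cost: float = MONTHLY_GKE_COST,
                  days_in_month: int = DAYS_IN_MONTH) -> float:
    """Cost for the days in the month that the GKE cluster actually runs."""
    return monthly_cost * days_running / days_in_month

# Cluster used on day 1, then idle: the GKE cluster runs only for the
# five-day idle window before Databricks deletes it.
print(round(prorated_cost(IDLE_TIMEOUT_DAYS), 2))  # → 33.33, approximately $33
```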
Databricks does not support configuration changes to a running GKE cluster. If you customize a GKE cluster configuration after it is created, and that cluster is deleted due to idle timeout, the recreated cluster will not include your customizations.
You can use the account console to create a workspace. Be sure that you understand all configuration settings before you create a new workspace: you cannot modify a workspace's configuration after you attempt to create it, even if the attempt fails.
To create a workspace:
Choose a network type for your new workspace:
Databricks-managed VPC (default): Databricks creates and manages the lifecycle of the VPC. If you choose this network type, there are no additional steps to perform now.
Customer-managed VPC (Public Preview): Create and specify your own customer-managed VPC for your new Databricks workspace to use. If you choose this network type, perform the following steps now:
As a Databricks account owner or account admin, log in to the account console and click the Workspaces icon. This is the account console default view.
Click Create Workspace.
In the Workspace Name field, enter a human-readable name for this workspace. Only alphanumeric characters, underscores, and hyphens are allowed, and the name must be 3-30 characters long.
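The naming rule above can be expressed as a simple check. This is only an illustrative sketch; the account console performs its own validation:

```python
import re

# Workspace names: alphanumeric characters, underscores, and hyphens only,
# and 3-30 characters long, per the rule stated above.
_NAME_RE = re.compile(r"^[A-Za-z0-9_-]{3,30}$")

def is_valid_workspace_name(name: str) -> bool:
    return bool(_NAME_RE.match(name))

print(is_valid_workspace_name("prod-analytics_01"))  # → True
print(is_valid_workspace_name("ab"))                 # → False (too short)
print(is_valid_workspace_name("bad name!"))          # → False (space and '!')
```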
In the Region field, select a region for your workspace’s network and clusters. For the list of supported regions, see Supported Databricks regions.
In the Google cloud project ID field, enter your Google Cloud project ID. To learn how to get your project ID, see Requirements.
If you plan to use a customer-managed VPC for this workspace:
Network setup. This step varies based on the workspace’s network type.
For a customer-managed VPC, click the Customer-managed VPC tab.
Optionally specify custom subnet sizes. If you leave these fields blank, Databricks uses defaults.
Configure the GKE subnets used by your Databricks workspace carefully: you cannot change them after your workspace is deployed. If the address ranges for your Databricks subnets are too small, the workspace exhausts its IP space, which in turn causes your Databricks jobs to fail. To determine the address range sizes that you need, Databricks provides a subnet calculator as a Microsoft Excel spreadsheet.
Click Advanced configurations to specify custom IP ranges in CIDR format. The IP ranges for these fields must not overlap. All IP addresses must be entirely within the following ranges:
The sizes of these IP ranges affect the maximum number of nodes for the workspace.
In the Subnet CIDR field, type the IP range in CIDR format to use for the subnet. Nodes of the GKE cluster come from this IP range. This is also the IP range of the subnet where the GKE cluster lives. Range must be no bigger than /9 and no smaller than
In the Pod address range field, type the IP range in CIDR format to use as the secondary IP range for GKE pods. Range must be no bigger than /9 and no smaller than
In the Service address range field, type the IP range in CIDR format to use as the secondary IP range for GKE services. Range must be no bigger than /16 and no smaller than
Specify a network configuration that represents your VPC and its subnets:
Network Mode: Set this to Customer-managed network.
Network configuration: Select your network configuration’s name.
(Optional) Configure details about private GKE clusters.
By default, Databricks creates a private GKE cluster instead of a public GKE cluster. A private cluster's GKE nodes have no public IP addresses that are routable on the public internet. This option requires that Databricks create an additional Cloud NAT in your Google Cloud account. For a private cluster, you can optionally set a custom value for the IP range for GKE master resources: click Advanced configurations, then set the IP range for GKE master resources field. All IP addresses must be entirely within the following range:
240.0.0.0/4. The range must have the size /28.
To instead use a public GKE cluster, click Advanced configurations and deselect Enable private cluster.
If this is the first time that you have created a workspace, a Google popup window asks you to select your Google account. Complete the following instructions.
If you do not see the Google account popup:
If the page does not change, you may have a popup blocker in your web browser. Look for a notification about blocking a popup window. Configure your popup blocker to allow popup windows from domain
If you do not see the Google dialog but your browser now shows a list of workspaces, continue to the next step.
In the Google dialog, select the Google account with which you signed into the account console.
On the next screen, reply to the consent request that asks you for additional scopes. Click Allow.
The consent screen is shown the first time you attempt to create a workspace. For successive new workspaces, Google does not show the consent screen. If you use Google account tools to revoke the consent granted to Databricks, Google displays the consent screen again.
Confirm that your workspace was created successfully. Next to your workspace in the list of workspaces, click Open. To view workspace status and test the workspace, see View workspace status and test your new workspace.
Secure the workspace’s GCS buckets. See Secure the workspace’s GCS buckets in your project.
When you create a workspace, Databricks on Google Cloud creates two Google Cloud Storage (GCS) buckets in your Google Cloud project. Databricks strongly recommends that you secure these GCS buckets so that they are not accessible from outside Databricks on Google Cloud.
During workspace creation, Databricks enables some required Google APIs on the project, if they are not already enabled. See Enabling Google APIs on a workspace’s project.
During workspace creation, Databricks automatically enables the following required Google APIs on the Google Cloud project if they are not already enabled:
These APIs are not disabled automatically during workspace deletion.
After you create a workspace (or update a failed workspace configuration), you can view it on the Workspaces page. To check the workspace creation status:
View the Status column for your new workspace:
Provisioning: In progress. Wait a few minutes and refresh the page.
Running: Successful workspace deployment. Continue to the next step in this procedure.
Failed: Failed deployment.
Banned: Contact your Databricks representative.
Cancelling: In the process of cancellation.
When your new workspace is Running, test your workspace:
From the Actions menu in the Workspace row, select Visit Workspace.
Log in with your account owner or account admin email address and password.
If the status for your new workspace is Failed, click the workspace to view a detailed error message. If you do not understand the error, contact your Databricks representative.
You cannot update the configuration of a failed workspace. You must delete it and try to create a new workspace.
As the user who created the workspace, log in to the account console and click the Workspaces icon.
On the row that displays your workspace, click Actions, then Visit Workspace. Alternatively, click the workspace name, then click the link under the URL label.
Log in with your account owner or account admin email address and password. If you configured single-sign on (SSO), click the Single Sign On tab, and then click the large blue Single Sign On button.
When you create a workspace, Databricks on Google Cloud creates two Google Cloud Storage (GCS) buckets in your Google Cloud project:
One GCS bucket stores system data that is generated as you use various Databricks features such as creating notebooks. This bucket includes notebook revisions, job run details, command results, and Spark logs.
The other GCS bucket is your workspace's root storage for the Databricks File System (DBFS). Your DBFS root bucket is not intended for storage of production customer data. Create other data sources and storage for production customer data in additional GCS buckets. You can optionally mount the additional GCS buckets as DBFS mounts. See Google Cloud Storage.
Databricks strongly recommends that you secure these GCS buckets so that they are not accessible from outside Databricks on Google Cloud.
To secure these GCS buckets:
In a browser, go to the GCP Cloud Console.
Select the Google Cloud project that hosts your Databricks workspace.
Go to that project’s Storage Service page.
Look for the buckets for your new workspace. Their names are:
For each bucket:
Click on the bucket to view details.
Click the Permissions tab.
Review all the entries of the Members list and determine if access is expected for each member.
Check the IAM Condition column. Some permissions, such as those named “Databricks service account for workspace”, have IAM Conditions that restrict them to certain buckets. The Google Cloud console UI does not evaluate the condition, so it may show roles that would not actually be able to access the bucket.
Pay special attention to roles without any IAM Condition. Consider adding restrictions on these:
When adding Storage permissions at the project level or above, use IAM Conditions to exclude Databricks buckets or to only allow specific buckets.
Choose the minimal set of permissions needed. For example, if only read access is needed, specify Storage Viewer instead of Storage Admin.
Do not use Basic Roles because they are too broad.
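As an illustration of a bucket-scoped restriction, an IAM binding can carry a CEL condition expression that matches Cloud Storage bucket resource names, which have the form `projects/_/buckets/<bucket-name>`. The binding below is a hypothetical sketch: the member, role, and bucket prefix are placeholders for illustration, not values Databricks creates:

```python
# Hypothetical IAM policy binding restricted to buckets whose names start
# with a given prefix, using a CEL condition on the resource name.
def bucket_scoped_binding(member: str, role: str, bucket_prefix: str) -> dict:
    return {
        "role": role,
        "members": [member],
        "condition": {
            "title": "restrict-to-specific-buckets",
            # Cloud Storage bucket resource names look like
            # projects/_/buckets/<bucket-name>
            "expression": (
                'resource.name.startsWith('
                f'"projects/_/buckets/{bucket_prefix}")'
            ),
        },
    }

binding = bucket_scoped_binding(
    "serviceAccount:example@my-project.iam.gserviceaccount.com",  # placeholder
    "roles/storage.objectViewer",                                 # minimal role
    "databricks-",                                                # placeholder prefix
)
print(binding["condition"]["expression"])
```

Attaching a condition like this at the project level prevents the role from silently granting access to every bucket in the project.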
Enable Google Cloud Data Access audit logging. Databricks strongly recommends that you enable Data Access audit logging for the GCS buckets that Databricks creates. This enables faster investigation of any issues that may come up. Be aware that Data Access audit logging can increase GCP usage costs. For instructions, see Configuring Data Access audit logs.
If you have questions about securing these GCS buckets, contact your Databricks representative.
Go to the account console and click the Workspaces icon.
On the row with your workspace, click Actions, then Delete. Alternatively, click the workspace name, click the Configure button, and select Delete Workspace.
In the confirmation dialog, type the workspace name and click Confirm Delete.
Workspace deletion is not reversible.
Review the cleanup steps that might be necessary after you delete a workspace. See Clean up Google Cloud objects after deleting a workspace.
After you delete a workspace, the two GCS buckets that Databricks created may not be deleted automatically if they are not empty. For example, there might be files that you added directly or indirectly, such as libraries, in the bucket that contains your workspace's DBFS root.
After you delete the workspace, you can find and delete remaining objects manually in the Google Cloud Console for your project. Go to the following page and replace <project-id> with your project ID: