Databricks on Google Cloud features
Some features available in Databricks on other clouds are not available in this release of Databricks on Google Cloud. This article lists the features that are available and the features that are not supported as of the current release. For detailed date-based release notes, see Databricks platform release notes.
Features in this release
The following table lists the major features included in Databricks on Google Cloud.
| Feature | Description and links |
|---|---|
| Databricks Runtime | Databricks Runtime 7.3 LTS and above. Databricks Runtime 8.0 for Machine Learning and above. See Databricks Runtime release notes versions and compatibility. |
| Apache Spark | Spark 3 only |
| Supported regions | |
| Databricks SQL | Databricks SQL provides SQL analysts with an intuitive environment for running ad-hoc queries and creating dashboards on data stored in your data lake. |
| Unity Catalog | Unity Catalog provides centralized access control, auditing, and data discovery capabilities across Databricks workspaces. Data lineage is not included in this release. |
| Delta Sharing | Delta Sharing is a secure data sharing platform that lets you share data in Databricks with users outside your organization. |
| Optimized Delta Lake | Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Delta Lake on Databricks allows you to configure Delta Lake based on your workload patterns. See What is Delta Lake?. A short read/write sketch follows this table. |
| Cluster Autopilot | Cluster autoscaling options. See Create a cluster. |
| Cluster policies | Cluster policies are admin-defined, reusable cluster templates that enforce rules on cluster attributes and thus ensure that users create clusters that conform to those rules. As a Databricks admin, you can create cluster policies and grant users policy permissions. This gives you more control over the resources that are created, gives users the flexibility they need to do their work, and considerably simplifies the cluster creation experience. See Create and manage compute policies. |
| Delta Live Tables (Public Preview) | Delta Live Tables is a framework for building reliable, maintainable, and scalable data processing pipelines. See What is Delta Live Tables?. |
| High-performance clusters | Support for high-concurrency clusters, high-memory instance types (N2 family), and local SSDs on some instance types. See Create a cluster. |
| Notebooks and collaboration | A notebook is a web-based interface to a document that contains runnable code, visualizations, and narrative text. See Introduction to Databricks notebooks. |
| Jobs | A job is a way to run non-interactive code in a Databricks cluster. Your job can consist of a single task or can be a large, multi-task workflow with complex dependencies. See Create and run Databricks Jobs. |
| Optimized autoscaling | Automatically add and remove worker nodes in response to changing workloads to optimize resource usage. See Enable autoscaling. |
| Admin console | Workspace admin tasks. See Databricks administration introduction. |
| Single node clusters | A single node cluster is a cluster consisting of a Spark driver and no Spark workers. Single node clusters support Spark jobs and all Spark data sources, including Delta Lake. Single node clusters are helpful for single-node machine learning workloads that use Spark to load and save data, and for lightweight exploratory data analysis. |
| Single sign-on (SSO) | Databricks workspace users authenticate with their Google Cloud Identity account (or G Suite account) using Google's OAuth 2.0 implementation, which conforms to the OpenID Connect spec and is OpenID certified. Databricks provides the openid profile scope values in the authentication request to Google. Optionally, customers can configure their Google Cloud Identity account (or G Suite account) to federate with an external SAML 2.0 Identity Provider (IdP) to verify user credentials. Google Cloud Identity can federate with Microsoft Entra ID (formerly Azure Active Directory), Okta, Ping, and other IdPs. However, Databricks only directly interacts with the Google Identity Platform APIs. See Identity best practices. |
| Role-based access control | Use access control lists (ACLs) to configure permission to access workspace objects (folders, notebooks, experiments, and models), clusters, pools, tables, and jobs. See Access control overview. |
| Token management | Create a personal access token that you can use to authenticate a REST API request. Workspace admins can also monitor tokens, control which non-admin users can create tokens, and set maximum lifetimes for new tokens. See Manage personal access tokens. A sketch of calling a REST API with a token follows this table. |
| Google Kubernetes Engine (GKE) compute plane | A Google Cloud VPC and subnet in the customer account contain the worker network environment for the workspace. All Databricks Runtime clusters in a workspace are launched inside a private, regional Google Kubernetes Engine (GKE) cluster. GKE is a managed Kubernetes service. See the Google documentation for GKE. |
| Integration with Google Cloud Identity | Databricks workspace users authenticate with their Google Cloud Identity account (or G Suite account) using Google's OAuth 2.0 implementation, which conforms to the OpenID Connect spec and is OpenID certified. Databricks provides the openid profile scope values in the authentication request to Google. Optionally, customers can configure their Google Cloud Identity account (or G Suite account) to federate with an external SAML 2.0 Identity Provider (IdP) to verify user credentials. Google Cloud Identity can federate with Microsoft Entra ID, Okta, Ping, and other IdPs. However, Databricks only directly interacts with the Google Identity Platform APIs. See Identity best practices. |
| BigQuery connector | You can read from and write to Google BigQuery tables in Databricks. See Google BigQuery. A read/write sketch follows this table. |
| Google Cloud Storage connector (DBFS and direct) | Read from and write to Google Cloud Storage (GCS) buckets in Databricks using either the Databricks File System (DBFS) or direct connections with gs:// URIs. A short example follows this table. |
| MLflow | MLflow is an open source platform for managing the end-to-end machine learning lifecycle. Managed MLflow on Databricks offers an integrated experience for tracking and securing machine learning model training runs and running machine learning projects. Support for managed MLflow was added on March 22, 2021, and requires Databricks Runtime 8.1 and above. Support for Model Serving was added on January 10, 2022. A minimal tracking example follows this table. |
| Repos for Git integration | Sync your work in Databricks with a remote Git repository. This makes it easier to implement development best practices. Databricks supports integrations with GitHub, Bitbucket, and GitLab. See Git integration with Databricks Repos. |
| Databricks Connect | Connect your favorite IDE (IntelliJ, Eclipse, PyCharm, RStudio, Visual Studio), notebook server (such as Zeppelin), and other custom applications to Databricks clusters. See Databricks Connect. A short sketch follows this table. |
| Integrate with BI tools | Integration with Power BI, Tableau, TIBCO, Looker, and SQL Workbench. See Technology partners. |
| Support for GPU-enabled clusters | Use GPU-enabled clusters. See GPU-enabled clusters. |
| Customer-managed VPC | Deploy a Databricks workspace into a VPC that you create and manage. See Customer-managed VPC. |
| Databricks CLI (Experimental) | The Databricks command-line interface provides convenient access to many Databricks APIs from the command line. The CLI is experimental. Some commands and options have not been tested on Databricks on Google Cloud. |
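The following is a minimal sketch of writing and reading a Delta table, as referenced in the Optimized Delta Lake row above. The path is a placeholder; any DBFS or GCS location you can write to works, and `spark` is the SparkSession that Databricks notebooks predefine.

```python
# Minimal Delta Lake read/write sketch (placeholder path).
df = spark.range(100).withColumnRenamed("id", "value")

# Write the DataFrame as a Delta table; Delta provides ACID transactions on top of the files.
df.write.format("delta").mode("overwrite").save("/tmp/delta/example")

# Read the table back and filter it.
spark.read.format("delta").load("/tmp/delta/example").where("value > 90").show()
```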
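As noted in the Token management row, a personal access token can authenticate REST API requests. The sketch below lists clusters with the Clusters API; the workspace URL and token values are placeholders, and the example assumes the `requests` package is available.

```python
import requests

# Placeholder values: substitute your workspace URL and a personal access token
# generated from User Settings in the workspace.
WORKSPACE_URL = "https://<workspace-url>.gcp.databricks.com"
TOKEN = "<personal-access-token>"

# Call the Clusters API with the token as a bearer credential.
response = requests.get(
    f"{WORKSPACE_URL}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
response.raise_for_status()

for cluster in response.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["state"])
```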
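For the BigQuery connector row, the sketch below shows a read and an indirect write. The option names follow the open source spark-bigquery connector; the project, dataset, table, and staging bucket names are placeholders.

```python
# Read a BigQuery table into a Spark DataFrame (placeholder table name).
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")
    .load()
)

# Write back to BigQuery; the connector stages the data in a GCS bucket first
# (placeholder bucket name).
(
    df.write.format("bigquery")
    .option("table", "my-project.my_dataset.my_output")
    .option("temporaryGcsBucket", "my-staging-bucket")
    .mode("overwrite")
    .save()
)
```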
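For the Google Cloud Storage connector row, here is a short sketch of direct access with gs:// URIs. The bucket name and paths are placeholders, and the cluster's service account must have access to the bucket.

```python
# Read CSV files directly from a GCS bucket (placeholder bucket and path).
df = (
    spark.read.format("csv")
    .option("header", "true")
    .load("gs://my-bucket/raw/*.csv")
)

# Write the results back to the bucket as Parquet.
df.write.mode("overwrite").parquet("gs://my-bucket/curated/")
```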
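For the MLflow row, this is a minimal tracking example. The parameter and metric names are arbitrary; in a Databricks notebook the run is recorded against the notebook's experiment by default.

```python
import mlflow

# Log a parameter and a metric to a tracked run; managed MLflow records the run
# in the workspace so it can be compared with other runs.
with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.82)
```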
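For the Databricks Connect row, the sketch below assumes the databricks-connect client has already been installed and configured for a compatible Databricks Runtime (for example, with `pip install databricks-connect` followed by `databricks-connect configure`). A local script then drives the remote cluster through an ordinary SparkSession.

```python
from pyspark.sql import SparkSession

# With Databricks Connect configured, getOrCreate() returns a session whose work
# runs on the remote Databricks cluster rather than a local Spark instance.
spark = SparkSession.builder.getOrCreate()

print(spark.range(10).count())  # executed on the cluster
```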
Notable features that are not included in this release
Accounts:
Billable usage reports do not support delivery to a GCS bucket, but you can call a REST API to download them.
Notebooks:
Hosted Jupyter notebooks. You can, however, export a Databricks notebook to Jupyter format.
Clusters:
Cluster log delivery
Different pools for drivers and worker nodes
Integrations:
RStudio Server
Databricks REST API endpoints:
POST /api/2.0/instance-profiles/add
POST /api/2.0/instance-profiles/edit
GET /api/2.0/instance-profiles/list
POST /api/2.0/instance-profiles/remove
GET /api/2.0/policy-families
GET /api/2.0/policy-families/{policy_family_id}
POST /api/2.0/sql/statements
GET /api/2.0/sql/statements/{statement_id}
POST /api/2.0/sql/statements/{statement_id}/cancel
GET /api/2.0/sql/statements/{statement_id}/result/chunks/{chunk_index}
POST /api/2.0/token-management/on-behalf-of/tokens
GET /api/2.0/accounts/{account_id}/credentials
POST /api/2.0/accounts/{account_id}/credentials
GET /api/2.0/accounts/{account_id}/credentials/{credentials_id}
DELETE /api/2.0/accounts/{account_id}/credentials/{credentials_id}
GET /api/2.0/accounts/{account_id}/log-delivery
POST /api/2.0/accounts/{account_id}/log-delivery
GET /api/2.0/accounts/{account_id}/log-delivery/{log_delivery_configuration_id}
PATCH /api/2.0/accounts/{account_id}/log-delivery/{log_delivery_configuration_id}
GET /api/2.0/accounts/{account_id}/oauth2/custom-app-integrations
POST /api/2.0/accounts/{account_id}/oauth2/custom-app-integrations
GET /api/2.0/accounts/{account_id}/oauth2/custom-app-integrations/{integration_id}
PATCH /api/2.0/accounts/{account_id}/oauth2/custom-app-integrations/{integration_id}
DELETE /api/2.0/accounts/{account_id}/oauth2/custom-app-integrations/{integration_id}
GET /api/2.0/accounts/{account_id}/oauth2/enrollment
POST /api/2.0/accounts/{account_id}/oauth2/enrollment
GET /api/2.0/accounts/{account_id}/oauth2/published-app-integrations
POST /api/2.0/accounts/{account_id}/oauth2/published-app-integrations
GET /api/2.0/accounts/{account_id}/oauth2/published-app-integrations/{integration_id}
PATCH /api/2.0/accounts/{account_id}/oauth2/published-app-integrations/{integration_id}
DELETE /api/2.0/accounts/{account_id}/oauth2/published-app-integrations/{integration_id}
GET /api/2.0/accounts/{account_id}/servicePrincipals/{service_principal_id}/credentials/secrets
POST /api/2.0/accounts/{account_id}/servicePrincipals/{service_principal_id}/credentials/secrets
DELETE /api/2.0/accounts/{account_id}/servicePrincipals/{service_principal_id}/credentials/secrets/{secret_id}
Known issues
Clusters with instance types that you have not yet used may start slowly. This is more likely to happen on a workspace that has just been provisioned.
For Workload Identity, Databricks supports only Service Accounts from the same project that was used to deploy the Databricks workspace.
At the Google Cloud organization level, if you use Google Organizational Policies to restrict identities by domain, inform your Databricks account team before you provision your Databricks workspace.
Databricks supports a maximum of 256 running clusters per workspace.
Your cluster event log page may include the message “Attempting to resize cluster to its target of ‘<number>’ workers”. This is expected behavior. Your cluster is marked “running” after 50% of the requested number of workers is available. More workers continue to be added until the cluster reaches the requested number. Temporarily having fewer workers than the target number typically does not block notebook or Apache Spark command runs.
If you delete a workspace, the two GCS buckets that Databricks created may not be deleted automatically if they are not empty. After workspace deletion, you can delete those objects manually in the Google Cloud Console for your project. Go to the following page, replacing <project-id> with your Google Cloud Platform project ID: https://console.cloud.google.com/storage/browser?project=<project-id>.
Maven libraries are supported only on Databricks Runtime 7.3 LTS (no other 7.x versions) and Databricks Runtime releases 8.1 and above.
In isolated cases, a single node cluster may fail to start, returning Unexpected state for cluster errors. If you experience this issue, contact support.
You cannot create a new GPU cluster when you schedule a job from a notebook. You can run a job on an existing GPU cluster only if it was created from the Clusters page.