Databricks on Google Cloud features

Some features available in Databricks on other clouds are not available in this release of Databricks on Google Cloud. This article lists the features included in this release and the notable features that are not yet supported. For detailed date-based release notes, see Databricks platform release notes.

Features in this release

The following list describes the major features included in this release of Databricks on Google Cloud.

Databricks Runtime

Databricks Runtime 7.3 LTS and above. Databricks Runtime 8.0 for Machine Learning and above. See Databricks Runtime release notes versions and compatibility.

Apache Spark

Spark 3 only

Supported regions

See Databricks clouds and regions.

Databricks SQL

Databricks SQL provides SQL analysts with an intuitive environment for running ad-hoc queries and creating dashboards on data stored in your data lake.

Unity Catalog

Unity Catalog provides centralized access control, auditing, and data discovery capabilities across Databricks workspaces. Data lineage is not included in this release.

Delta Sharing

Delta Sharing is a secure data sharing platform that lets you share data in Databricks with users outside your organization.
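
As an illustration (the profile file path and the share, schema, and table names below are placeholders, not values from this release), a recipient can read a shared table with the open source delta-sharing Python connector:

  import delta_sharing

  # The profile file is issued by the data provider; the path is a placeholder.
  profile = "/path/to/config.share"

  # Table coordinates take the form <share>.<schema>.<table>; names are placeholders.
  table_url = profile + "#example_share.example_schema.example_table"

  # Load the shared table into a pandas DataFrame.
  df = delta_sharing.load_as_pandas(table_url)
  print(df.head())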

Optimized Delta Lake

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Delta Lake on Databricks allows you to configure Delta Lake based on your workload patterns. See What is Delta Lake?.
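
A minimal sketch of the Delta Lake API in a Databricks notebook, where spark is predefined and the storage path is a placeholder:

  # Write a DataFrame as a Delta table.
  df = spark.range(100)
  df.write.format("delta").mode("overwrite").save("/tmp/example-delta-table")

  # Read the Delta table back and run a batch query.
  delta_df = spark.read.format("delta").load("/tmp/example-delta-table")
  print(delta_df.count())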

Cluster Autopilot

Cluster autoscaling options. See Compute configuration reference.

Cluster policies

Cluster policies are admin-defined, reusable cluster templates that enforce rules on cluster attributes, ensuring that users create clusters that conform to those rules. As a Databricks admin, you can create cluster policies and give users policy permissions. This gives you more control over the resources that get created, gives users the flexibility they need to do their work, and considerably simplifies the cluster creation experience. See Create and manage compute policies.
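
The sketch below shows one way an admin could create a policy through the Cluster Policies API from Python; the workspace URL, token, policy name, and rule values are illustrative assumptions, and the definition format should be verified against the linked documentation:

  import json
  import requests

  HOST = "https://<workspace-url>"          # placeholder
  TOKEN = "<personal-access-token>"         # placeholder

  # Pin the Databricks Runtime version and cap autoscaling for user-created clusters.
  policy_definition = {
      "spark_version": {"type": "fixed", "value": "7.3.x-scala2.12"},
      "autoscale.max_workers": {"type": "range", "maxValue": 10},
  }

  resp = requests.post(
      f"{HOST}/api/2.0/policies/clusters/create",
      headers={"Authorization": f"Bearer {TOKEN}"},
      json={"name": "small-clusters-only", "definition": json.dumps(policy_definition)},
  )
  print(resp.json())  # Returns the new policy_id on success.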

Delta Live Tables (Public Preview)

Delta Live Tables is a framework for building reliable, maintainable, and scalable data processing pipelines. See What is Delta Live Tables?.
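
For orientation, a hedged sketch of how a pipeline declares tables with the Python dlt module; the source path and column name are placeholders, and this code runs as part of a Delta Live Tables pipeline rather than interactively:

  import dlt
  from pyspark.sql.functions import col

  @dlt.table(comment="Raw events ingested from cloud storage (path is a placeholder).")
  def raw_events():
      return spark.read.format("json").load("/path/to/raw/events")

  @dlt.table(comment="Events with a non-null event_type.")
  def cleaned_events():
      return dlt.read("raw_events").where(col("event_type").isNotNull())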

High-performance clusters

Support for high-concurrency clusters, high-memory instance types (N2 family), and local SSDs on some instance types. See Compute configuration reference.

Notebooks and collaboration

A notebook is a web-based interface to a document that contains runnable code, visualizations, and narrative text. See Introduction to Databricks notebooks.

Job

A job is a way to run non-interactive code in a Databricks cluster. Your job can consist of a single task or can be a large, multi-task workflow with complex dependencies. See Create and run Databricks Jobs.
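
As a rough sketch, a single-task notebook job could be created through the Jobs API from Python; the workspace URL, token, notebook path, node type, and API version are assumptions to verify against the linked documentation:

  import requests

  HOST = "https://<workspace-url>"          # placeholder
  TOKEN = "<personal-access-token>"         # placeholder

  job_spec = {
      "name": "example-nightly-job",
      "tasks": [
          {
              "task_key": "main",
              "notebook_task": {"notebook_path": "/Users/<user>/example-notebook"},
              "new_cluster": {
                  "spark_version": "7.3.x-scala2.12",
                  "node_type_id": "<gcp-node-type>",
                  "num_workers": 2,
              },
          }
      ],
  }

  resp = requests.post(
      f"{HOST}/api/2.1/jobs/create",
      headers={"Authorization": f"Bearer {TOKEN}"},
      json=job_spec,
  )
  print(resp.json())  # Returns the new job_id on success.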

Optimized autoscaling

Automatically add and remove worker nodes in response to changing workloads to optimize resource usage. See Enable autoscaling.
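
A sketch of an autoscaling cluster specification as it might be sent to the Clusters API; the workspace URL, token, node type, and worker bounds are placeholders:

  import requests

  HOST = "https://<workspace-url>"          # placeholder
  TOKEN = "<personal-access-token>"         # placeholder

  cluster_spec = {
      "cluster_name": "autoscaling-example",
      "spark_version": "7.3.x-scala2.12",
      "node_type_id": "<gcp-node-type>",
      # Databricks adds or removes workers between these bounds as load changes.
      "autoscale": {"min_workers": 2, "max_workers": 8},
  }

  resp = requests.post(
      f"{HOST}/api/2.0/clusters/create",
      headers={"Authorization": f"Bearer {TOKEN}"},
      json=cluster_spec,
  )
  print(resp.json())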

Admin console

Workspace admin tasks. See Databricks administration introduction.

Single node clusters

A single node cluster is a cluster that consists of a Spark driver and no Spark workers. Single node clusters support Spark jobs and all Spark data sources, including Delta Lake. Single node clusters are helpful for single-node machine learning workloads that use Spark to load and save data, and for lightweight exploratory data analysis.
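
The fragment below sketches how a single node cluster is commonly specified in a Clusters API request body; the Spark configuration keys and the tag follow the usual single node pattern and should be treated as assumptions to verify:

  # Fragment of a Clusters API request body for a single node cluster.
  single_node_spec = {
      "cluster_name": "single-node-example",
      "spark_version": "7.3.x-scala2.12",
      "node_type_id": "<gcp-node-type>",   # placeholder
      "num_workers": 0,                    # no Spark workers; the driver runs everything
      "spark_conf": {
          "spark.databricks.cluster.profile": "singleNode",
          "spark.master": "local[*]",
      },
      "custom_tags": {"ResourceClass": "SingleNode"},
  }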

Single sign-on (SSO)

Databricks workspace users authenticate with their Google Cloud Identity account (or GSuite account) using Google’s OAuth 2.0 implementation, which conforms to the OpenID Connect spec and is OpenID certified. Databricks provides the openid profile scope values in the authentication request to Google. Optionally, customers can configure their Google Cloud Identity account (or GSuite account) to federate with an external SAML 2.0 Identity Provider (IdP) to verify user credentials. Google Cloud Identity can federate with Microsoft Entra ID (formerly Azure Active Directory), Okta, Ping, and other IdPs. However, Databricks only directly interacts with the Google Identity Platform APIs. See Identity best practices.

Role-based access control

Use access control lists (ACLs) to configure permission to access workspace objects (folders, notebooks, experiments, and models), clusters, pools, tables, and jobs. See Access control lists.
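
As an example of the general pattern (the endpoint shape, permission level, and identifiers below are assumptions drawn from the Permissions API and should be verified), an admin could grant a user access to a cluster from Python:

  import requests

  HOST = "https://<workspace-url>"          # placeholder
  TOKEN = "<personal-access-token>"         # placeholder
  CLUSTER_ID = "<cluster-id>"               # placeholder

  # Let a specific user attach notebooks to the cluster.
  resp = requests.patch(
      f"{HOST}/api/2.0/permissions/clusters/{CLUSTER_ID}",
      headers={"Authorization": f"Bearer {TOKEN}"},
      json={
          "access_control_list": [
              {"user_name": "someone@example.com", "permission_level": "CAN_ATTACH_TO"}
          ]
      },
  )
  print(resp.json())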

Token management

Create a personal access token that you can use to authenticate a REST API request. Workspace admins can also monitor tokens, control which non-admin users can create tokens, and set maximum lifetimes for new tokens. See Monitor and manage personal access tokens.
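
A short sketch of both halves of the workflow: minting a token with the Token API and then using it to authenticate another REST call. The workspace URL, lifetime, and comment are placeholders:

  import requests

  HOST = "https://<workspace-url>"                    # placeholder
  EXISTING_TOKEN = "<existing-personal-access-token>"

  # Create a new token with a 90-day lifetime.
  resp = requests.post(
      f"{HOST}/api/2.0/token/create",
      headers={"Authorization": f"Bearer {EXISTING_TOKEN}"},
      json={"lifetime_seconds": 90 * 24 * 3600, "comment": "example token"},
  )
  new_token = resp.json()["token_value"]

  # Use the new token to authenticate any REST API request.
  clusters = requests.get(
      f"{HOST}/api/2.0/clusters/list",
      headers={"Authorization": f"Bearer {new_token}"},
  )
  print(clusters.json())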

Google Kubernetes Engine (GKE) compute plane

A Google Cloud VPC and subnet in the customer account contain the worker network environment for the workspace. All Databricks Runtime clusters in a workspace are launched inside a private, regional Google Kubernetes Engine (GKE) cluster. GKE is a managed Kubernetes service. See the Google documentation for GKE.

Integration with Google Cloud Identity

Databricks workspace users authenticate with their Google Cloud Identity account (or GSuite account) using Google’s OAuth 2.0 implementation. For details, including federation with an external SAML 2.0 Identity Provider, see the single sign-on (SSO) entry earlier in this list and Identity best practices.

BigQuery connector

You can read from and write to Google BigQuery tables in Databricks. See Google BigQuery.
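
For example, in a notebook (project, dataset, table, and bucket names are placeholders, and the option names follow the open source spark-bigquery connector, so verify them against the linked page):

  # Read a BigQuery table into a DataFrame.
  bq_df = (spark.read
      .format("bigquery")
      .option("table", "my-project.my_dataset.my_table")
      .load())

  # Write results back to BigQuery, staging through a GCS bucket.
  (bq_df.write
      .format("bigquery")
      .option("table", "my-project.my_dataset.my_output_table")
      .option("temporaryGcsBucket", "my-temp-bucket")
      .mode("overwrite")
      .save())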

Google Cloud Storage connector (DBFS and direct)

Read from and write to Google Cloud Storage (GCS) buckets in Databricks using either the Databricks File System (DBFS) or direct connections with gs:// URLs. See Connect to Google Cloud Storage. You can use GCS bucket mounts with local file system APIs and shell commands.
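
A brief sketch of both access styles from a notebook, assuming the cluster’s attached Google service account can read the bucket; the bucket, path, and mount names are placeholders:

  # Direct access with a gs:// URL.
  gcs_df = (spark.read
      .format("csv")
      .option("header", "true")
      .load("gs://my-bucket/path/data.csv"))

  # Mount the bucket so it is reachable with local file system APIs and shell commands.
  dbutils.fs.mount("gs://my-bucket", "/mnt/my-bucket")
  display(dbutils.fs.ls("/mnt/my-bucket"))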

MLflow

MLflow is an open source platform for managing the end-to-end machine learning lifecycle. Managed MLflow on Databricks offers an integrated experience for tracking and securing machine learning model training runs and running machine learning projects. Support for managed MLflow was added on March 22, 2021 and requires Databricks Runtime 8.1 and above. Support for Model Serving was added on January 10, 2022.
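
A minimal tracking example from a notebook; the parameter and metric names are illustrative:

  import mlflow

  # Record a run with one parameter and one metric in the workspace tracking server.
  with mlflow.start_run():
      mlflow.log_param("alpha", 0.5)
      mlflow.log_metric("rmse", 0.78)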

Repos for Git integration

Sync your work in Databricks with a remote Git repository. This makes it easier to implement development best practices. Databricks supports integrations with GitHub, Bitbucket, and GitLab. See Git integration with Databricks Git folders.

Databricks Connect

Connect your favorite IDE (IntelliJ, Eclipse, PyCharm, RStudio, Visual Studio), notebook server (such as Zeppelin), and other custom applications to Databricks clusters. See Databricks Connect.
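
After installing a databricks-connect version that matches the cluster’s Databricks Runtime and running databricks-connect configure locally (summarized here as an assumption; see the linked page for the exact procedure), a standard SparkSession executes its work on the remote cluster:

  from pyspark.sql import SparkSession

  # With Databricks Connect configured, this session is backed by the remote cluster.
  spark = SparkSession.builder.getOrCreate()
  print(spark.range(10).count())  # Runs on the Databricks cluster, prints locally.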

Integrate with BI tools

Integration with Power BI, Tableau, TIBCO, Looker, and SQL Workbench. See Technology partners.

Support for GPU-enabled clusters

Use GPU-enabled clusters. See GPU-enabled compute.

Customer-managed VPC

Deploy a Databricks workspace into a VPC that you create and manage. See Configure a customer-managed VPC.

Databricks CLI (Experimental)

The Databricks command-line interface provides convenient access to many Databricks APIs from the command line. The CLI is experimental. Some commands and options have not been tested on Databricks on Google Cloud.

Notable features that are not included in this release

Clusters:

  • Cluster log delivery

  • Different pools for drivers and worker nodes

Integrations:

  • RStudio Server

Databricks REST API endpoints:

  • POST /api/2.0/instance-profiles/add

  • POST /api/2.0/instance-profiles/edit

  • GET /api/2.0/instance-profiles/list

  • POST /api/2.0/instance-profiles/remove

  • GET /api/2.0/policy-families

  • GET /api/2.0/policy-families/{policy_family_id}

  • POST /api/2.0/sql/statements

  • GET /api/2.0/sql/statements/{statement_id}

  • POST /api/2.0/sql/statements/{statement_id}/cancel

  • GET /api/2.0/sql/statements/{statement_id}/result/chunks/{chunk_index}

  • POST /api/2.0/token-management/on-behalf-of/tokens

  • GET /api/2.0/accounts/{account_id}/credentials

  • POST /api/2.0/accounts/{account_id}/credentials

  • GET /api/2.0/accounts/{account_id}/credentials/{credentials_id}

  • DELETE /api/2.0/accounts/{account_id}/credentials/{credentials_id}

  • GET /api/2.0/accounts/{account_id}/log-delivery

  • POST /api/2.0/accounts/{account_id}/log-delivery

  • GET /api/2.0/accounts/{account_id}/log-delivery/{log_delivery_configuration_id}

  • PATCH /api/2.0/accounts/{account_id}/log-delivery/{log_delivery_configuration_id}

  • GET /api/2.0/accounts/{account_id}/oauth2/custom-app-integrations

  • POST /api/2.0/accounts/{account_id}/oauth2/custom-app-integrations

  • GET /api/2.0/accounts/{account_id}/oauth2/custom-app-integrations/{integration_id}

  • PATCH /api/2.0/accounts/{account_id}/oauth2/custom-app-integrations/{integration_id}

  • DELETE /api/2.0/accounts/{account_id}/oauth2/custom-app-integrations/{integration_id}

  • GET /api/2.0/accounts/{account_id}/oauth2/enrollment

  • POST /api/2.0/accounts/{account_id}/oauth2/enrollment

  • GET /api/2.0/accounts/{account_id}/oauth2/published-app-integrations

  • POST /api/2.0/accounts/{account_id}/oauth2/published-app-integrations

  • GET /api/2.0/accounts/{account_id}/oauth2/published-app-integrations/{integration_id}

  • PATCH /api/2.0/accounts/{account_id}/oauth2/published-app-integrations/{integration_id}

  • DELETE /api/2.0/accounts/{account_id}/oauth2/published-app-integrations/{integration_id}

  • GET /api/2.0/accounts/{account_id}/servicePrincipals/{service_principal_id}/credentials/secrets

  • POST /api/2.0/accounts/{account_id}/servicePrincipals/{service_principal_id}/credentials/secrets

  • DELETE /api/2.0/accounts/{account_id}/servicePrincipals/{service_principal_id}/credentials/secrets/{secret_id}

Known issues

  • Clusters with instance types that you have not yet used may start slowly. This is more likely to happen on a workspace that has just been provisioned.

  • For Workload Identity, Databricks supports only Service Accounts from the same project that was used to deploy the Databricks workspace.

  • At the Google Cloud organization level, if you use Google Organizational Policies to restrict identities by domain, inform your Databricks account team before you provision your Databricks workspace.

  • Databricks supports a maximum of 256 running clusters per workspace.

  • Your GCP Cluster Event log page may include the message “Attempting to resize cluster to its target of ‘<number>’ workers”. This is expected behavior. Your cluster is marked “running” after 50% of the requested number of workers is available. More workers continue to be added until the cluster reaches the requested number. Temporarily having fewer workers than the target number typically does not block notebook or Apache Spark command runs.

  • If you delete a workspace, the two GCS buckets that Databricks created may not be deleted automatically if they are not empty. After workspace deletion, you can delete those objects manually in the Google Cloud Console for your project. Go to the following page, replacing <project-id> with your Google Cloud Platform project ID: https://console.cloud.google.com/storage/browser?project=<project-id>.

  • Maven libraries are supported only on Databricks Runtime 7.3 LTS (no other 7.x versions) and Databricks Runtime releases 8.1 and above.

  • In isolated cases, a Single Node cluster may fail to start, returning “Unexpected state for cluster” errors. If you experience this issue, contact support.

  • You cannot create a new GPU cluster when you schedule a job from a notebook. You can run a job on an existing GPU cluster only if it was created from the Clusters page.