Databricks on Google Cloud features

Some features available in Databricks on other clouds are not available in this release of Databricks on Google Cloud. This article lists available features and features unsupported as of the current release. For detailed date-based release notes, see Databricks platform release notes.

Features in this release

The following table lists the major features included in Databricks Runtime on Google Cloud.

| Feature | Description and links |
| --- | --- |
| Databricks Runtime | Databricks Runtime 7.3 LTS and above. Databricks Runtime 8.0 for Machine Learning and above. See Databricks runtime releases. |
| Apache Spark | Spark 3 only. |
| Regions | See Supported Databricks regions. |
| Optimized Delta Lake | Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Delta Lake on Databricks allows you to configure Delta Lake based on your workload patterns. See Delta Lake and Delta Engine guide. |
| Cluster Autopilot | Cluster autoscaling options. See Configure clusters. |
| Cluster policies | Cluster policies are admin-defined, reusable cluster templates that enforce rules on cluster attributes, ensuring that users create clusters that conform to those rules. As a Databricks admin, you can create cluster policies and give users policy permissions. Doing so gives you more control over the resources that are created, gives users the flexibility they need to do their work, and considerably simplifies the cluster creation experience. See Manage cluster policies. |
| Jobs scheduling and workflow | A job is a way of running a notebook or JAR either immediately or on a scheduled basis. The other way to run a notebook is interactively in the notebook UI. See Jobs. |
| Jobs with multiple tasks (Public Preview) | You can use Databricks jobs to orchestrate multiple tasks in a data processing workflow. See Jobs with multiple tasks. |
| Delta Live Tables (Public Preview) | Delta Live Tables is a framework for building reliable, maintainable, and scalable data processing pipelines. See Delta Live Tables. |
| High-performance clusters | Support for high-concurrency clusters, high-memory instance types (N2 family), and local SSDs on some instance types. See Configure clusters. |
| Notebooks and collaboration | A notebook is a web-based interface to a document that contains runnable code, visualizations, and narrative text. See Notebooks. |
| Optimized autoscaling | Autoscaling automatically adds and removes worker nodes in response to changing workloads to optimize resource usage. See Cluster instance types with local SSDs. |
| Admin console | Workspace admin tasks. See Administration guide. |
| Single Node clusters | A Single Node cluster consists of a Spark driver and no Spark workers. Single Node clusters support Spark jobs and all Spark data sources, including Delta Lake. They are helpful for single-node machine learning workloads that use Spark to load and save data, and for lightweight exploratory data analysis. |
| Single sign-on (SSO) | Databricks workspace users authenticate with their Google Cloud Identity account (or G Suite account) using Google’s OAuth 2.0 implementation, which conforms to the OpenID Connect spec and is OpenID certified. Databricks provides the openid profile scope values in the authentication request to Google. Optionally, customers can configure their Google Cloud Identity account (or G Suite account) to federate with an external SAML 2.0 Identity Provider (IdP) to verify user credentials. Google Cloud Identity can federate with Azure Active Directory, Okta, Ping, and other IdPs. However, Databricks interacts directly only with the Google Identity Platform APIs. See Single sign-on. |
| Role-based access control | In Databricks, you can use access control lists (ACLs) to configure permission to access workspace objects (folders, notebooks, experiments, and models), clusters, pools, tables, and jobs. See Access control. |
| Token management | To authenticate to the Databricks REST API, a user can create a personal access token and use it in their REST API requests. Workspace administrators can also monitor tokens, control which non-admin users can create tokens, and set maximum lifetimes for new tokens. See Manage personal access tokens. |
| Google Kubernetes Engine (GKE) data plane | A Google Cloud VPC and subnet in the customer account contain the worker network environment for the workspace. All Databricks Runtime clusters in a workspace are launched inside a private, regional Google Kubernetes Engine (GKE) cluster. GKE is a managed Kubernetes service. See the Google documentation for GKE. |
| Integration with Google Cloud Identity | Databricks workspace users authenticate with their Google Cloud Identity account (or G Suite account) using Google’s OAuth 2.0 implementation, which conforms to the OpenID Connect spec and is OpenID certified. Databricks provides the openid profile scope values in the authentication request to Google. Optionally, customers can configure their Google Cloud Identity account (or G Suite account) to federate with an external SAML 2.0 Identity Provider (IdP) to verify user credentials. Google Cloud Identity can federate with Azure Active Directory, Okta, Ping, and other IdPs. However, Databricks interacts directly only with the Google Identity Platform APIs. See Single sign-on. |
| BigQuery connector | You can read from and write to Google BigQuery tables in Databricks. See Google BigQuery. |
| Google Cloud Storage connector (DBFS and direct) | You can read from and write to Google Cloud Storage (GCS) buckets in Databricks using either the Databricks File System (DBFS) or direct connections with gs: URLs. You can use GCS bucket mounts with local file system APIs and shell commands. See Google Cloud Storage. |
| MLflow | MLflow is an open source platform for managing the end-to-end machine learning lifecycle. Managed MLflow on Databricks offers an integrated experience for tracking and securing machine learning model training runs and running machine learning projects. Support for managed MLflow was added on March 22, 2021 and requires Databricks Runtime 8.1 and above. Model serving is not supported in this release. |
| Repos for Git integration (Public Preview) | You can sync your work in Databricks with a remote Git repository. This makes it easier to implement development best practices. Databricks supports integrations with GitHub, Bitbucket, and GitLab. See Repos for Git integration. |
| Databricks Connect | Connect your favorite IDE (IntelliJ, Eclipse, PyCharm, RStudio, Visual Studio), notebook server (such as Zeppelin), and other custom applications to Databricks clusters. See Databricks Connect. |
| Integrate with BI tools | Integration with Power BI, Tableau, TIBCO, Looker, and SQL Workbench. See Databricks integrations guide. |
| Support for GPU-enabled clusters | You can use GPU-enabled clusters. See GPU-enabled clusters. |
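
As an illustration of the token management entry in the table above, the following is a minimal sketch of how a personal access token might be attached to a REST API request. It assumes the token is sent as a bearer token in the Authorization header and uses the Clusters API list endpoint only as a representative call. The function names, the workspace URL, and the token value are placeholders introduced for this example and do not come from this article.

```python
import json
import urllib.request


def build_list_clusters_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET request to list clusters."""
    # Assumption: /api/2.0/clusters/list serves only as a representative endpoint.
    url = workspace_url.rstrip("/") + "/api/2.0/clusters/list"
    # The personal access token travels as a bearer token in the Authorization header.
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def list_clusters(workspace_url: str, token: str) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_list_clusters_request(workspace_url, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

To try it, call `list_clusters` with your own workspace URL and a token created in your user settings. Keeping request construction separate from sending makes it possible to inspect the URL and headers without a network call.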

Notable features that are not included in this release

General:

  • Certain Delta Lake features
  • Certain managed MLflow features, including model serving
  • Certain partner integrations
  • HIPAA compliance
  • Databricks SQL (preview)
  • Databricks Runtime for Genomics

Accounts:

  • Billable usage delivery to GCS bucket
  • API workspace management

Workspaces:

  • Customer-managed VPC
  • Customer-managed keys

Notebooks:

Clusters:

  • Storage autoscaling
  • Credential passthrough
  • Container Services (bring your own container)
  • Cluster log delivery
  • Using different pools for drivers and worker nodes

Integrations:

  • CLI support
  • RStudio Server

Notable APIs that are not in this release:

  • Account API
  • DBFS API

Third-party data sources that do not support Apache Spark 3.0:

  • Couchbase
  • Neo4j
  • Redis
  • Riak Time Series

Known issues

  • Clusters with instance types that you have not yet used may start slowly. This is more likely to happen on a workspace that has just been provisioned.

  • For Workload Identity, Databricks supports only Service Accounts from the same project that was used to deploy the Databricks workspace.

  • At the Google Cloud organization level, if you use Google Organizational Policies to restrict identities by domain, inform your Databricks account team before you provision your Databricks workspace.

  • Databricks supports a maximum of 256 running clusters per workspace.

  • Your GCP Cluster Event log page may include the message “Attempting to resize cluster to its target of ‘<number>’ workers”. This is expected behavior. Your cluster is marked “running” after 50% of the requested number of workers is available. More workers continue to be added until the cluster reaches the requested number. Temporarily having fewer workers than the target number typically does not block notebook or Apache Spark command runs.

  • If you delete a workspace, the two GCS buckets that Databricks created may not be deleted automatically if they are not empty. After workspace deletion, you can delete those objects manually in the Google Cloud Console for your project. Go to the following page, replacing <project-id> with your Google Cloud Platform project ID: https://console.cloud.google.com/storage/browser?project=<project-id>.

  • Maven libraries are supported only on Databricks Runtime 7.3 LTS (no other 7.x versions) and Databricks Runtime releases 8.1 and above.

  • In isolated cases, a Single Node cluster may fail to start, returning “Unexpected state for cluster” errors. If you experience this issue, contact support.

  • You cannot create a new GPU cluster when you schedule a job from a notebook. You can run a job on an existing GPU cluster that was created from the Clusters page.
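
Several of the known issues above state numeric limits or version rules. The sketch below restates them as small Python checks to make the boundaries explicit. All function names are hypothetical helpers written for this example, not part of any Databricks API. It assumes that "50%" means at least half of the target worker count and that runtime versions are passed as plain "major.minor" strings such as "7.3".

```python
MAX_RUNNING_CLUSTERS = 256  # per-workspace limit stated in the list above


def can_start_cluster(running_clusters: int) -> bool:
    """True if one more running cluster stays within the workspace limit."""
    return running_clusters < MAX_RUNNING_CLUSTERS


def is_marked_running(available_workers: int, target_workers: int) -> bool:
    """True once at least half of the requested workers are available.

    Assumption: "50%" is read as available >= target / 2.
    """
    return 2 * available_workers >= target_workers


def supports_maven_libraries(runtime_version: str) -> bool:
    """True for Databricks Runtime 7.3 (LTS) and for 8.1 and above.

    Assumption: runtime_version is a "major.minor" string such as "8.1".
    """
    major, minor = (int(part) for part in runtime_version.split(".")[:2])
    return (major, minor) == (7, 3) or (major, minor) >= (8, 1)


def storage_browser_url(project_id: str) -> str:
    """Build the Cloud Console storage browser URL for a project."""
    return f"https://console.cloud.google.com/storage/browser?project={project_id}"
```

For example, under these assumptions a cluster that requests 8 workers counts as running once 4 are available, and Databricks Runtime 7.6 and 8.0 are both reported as not supporting Maven libraries.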