May 2021
These features and Databricks platform improvements were released in May 2021.
Note
Releases are staged. Your Databricks account may not be updated until a week or more after the initial release date.
Databricks Mosaic AI: a data-native and collaborative solution for the full ML lifecycle
May 27, 2021
The new Machine Learning persona, selectable in the sidebar of the Databricks UI, gives you easy access to a new purpose-built environment for ML, including the model registry and four new features in Public Preview:
A new dashboard page with convenient resources, recents, and getting started links.
A new Experiments page that centralizes experiment discovery and management.
AutoML, a way to automatically generate ML models from data and accelerate the path to production.
Feature Store, a way to catalog ML features and make them available for training and serving, increasing reuse. Data-lineage–based feature search leverages automatically logged data sources, and simplified model deployment makes features available for serving without changes to the client application. A brief usage sketch follows below.
For details, see AI and machine learning on Databricks.
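As a minimal sketch, the Feature Store Python client can be used from a notebook to register a feature table; the table and column names below are hypothetical, and the exact method names vary by Feature Store client version:

    from databricks.feature_store import FeatureStoreClient

    fs = FeatureStoreClient()

    # Hypothetical features DataFrame keyed by customer_id
    features_df = spark.table("recommender.customer_features_staging")

    # Register the DataFrame as a feature table for training and serving
    fs.create_table(
        name="recommender.customer_features",
        primary_keys=["customer_id"],
        df=features_df,
        description="Aggregated customer behavior features",
    )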
Additional instance types and more local SSD support
May 27, 2021
There are more instance types available and most instance types now have local SSDs. For the latest list of instance types, the prices of each, and the size of the local SSDs, see the GCP pricing estimator. As of May 27, 2021, all instance types except the E2 family have local SSDs.
Faster local SSD performance with disk caching
May 27, 2021
Disk caching now works by default with all instance types that have local SSDs except highcpu instance types. Cache sizes on all supported instance types are set automatically, so you do not need to set the disk usage explicitly. For clusters that use a highcpu instance type with local SSD, you must enable disk caching by setting a Spark config on the cluster.
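As a minimal sketch, the disk cache can also be toggled from a notebook after the cluster starts; the same key can be set in the cluster's Spark config:

    # Enable the disk (IO) cache on a cluster where it is not on by default,
    # for example a highcpu instance type with local SSD
    spark.conf.set("spark.databricks.io.cache.enabled", "true")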
Audit log delivery (Public Preview)
May 20, 2021
You can now capture a detailed audit trail of the activities that your users perform in your Databricks account and its workspaces. Configure audit log delivery to a Google Cloud Storage bucket using the account console. Configure audit logs by June 3, 2021 to receive a backfill of audit logs since May 4, 2021.
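Once delivery is configured, the audit logs arrive as JSON files in the bucket you specify and can be inspected with Spark; the bucket path below is a placeholder and the selected fields are illustrative:

    # Read delivered audit logs from the configured GCS bucket (placeholder path)
    audit_df = spark.read.json("gs://my-audit-logs-bucket/")
    audit_df.select("timestamp", "serviceName", "actionName").show(10, truncate=False)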
Increased limit for the number of terminated all-purpose clusters
May 18, 2021: Version 3.46
You can now have up to 150 terminated all-purpose clusters in a Databricks workspace. Previously the limit was 120. For details, see Terminate a compute. The limit on the number of terminated all-purpose clusters returned by the Clusters API request is also now 150.
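As a rough sketch, a client that enumerates clusters through the REST API should now expect up to 150 terminated all-purpose clusters in the response; the workspace URL and token below are placeholders:

    import requests

    # Placeholders: substitute your workspace URL and a personal access token
    host = "https://<workspace-url>"
    token = "<personal-access-token>"

    resp = requests.get(
        f"{host}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
    )
    clusters = resp.json().get("clusters", [])

    # Terminated all-purpose clusters are retained up to the new limit of 150
    terminated = [c for c in clusters if c.get("state") == "TERMINATED"]
    print(len(terminated))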
Increased limit for the number of pinned clusters
May 18, 2021: Version 3.46
You can now have up to 70 pinned clusters in a Databricks workspace. Previously the limit was 50. For details, see Pin a compute.
Manage where notebook results are stored (Public Preview)
May 18, 2021: Version 3.46
You can now choose to store all notebook results in your GCS bucket for system data regardless of size or run type. By default, some results for interactive notebooks are stored in Databricks. A new configuration enables you to store these in the GCS bucket for system data in your own account. For details, see Configure notebook result storage location.
This feature has no impact on notebooks run as jobs, whose results are always stored in the GCS bucket for system data.
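If the setting is also exposed through the workspace configuration API in your deployment, it could be toggled with a call along these lines; the configuration key name here is an assumption, so verify it against the admin settings or the API reference before relying on it:

    import requests

    # Placeholders: substitute your workspace URL and a personal access token.
    # The key name below is assumed, not confirmed; check your workspace's
    # admin settings or API reference.
    host = "https://<workspace-url>"
    token = "<personal-access-token>"

    requests.patch(
        f"{host}/api/2.0/workspace-conf",
        headers={"Authorization": f"Bearer {token}"},
        json={"storeInteractiveNotebookResultsInCustomerAccount": "true"},
    )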
Spark UI displays information about terminated clusters
May 18, 2021
The Spark UI now displays detailed information about terminated clusters, not just active clusters.
Create and manage pools using the UI
May 18, 2021
You can now create and manage pools using the UI in addition to the Instance Pools API. For configuration options, see Pool configuration reference.
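For comparison, a minimal sketch of creating a pool through the Instance Pools API; the pool name is hypothetical and the node type and sizing values are placeholders:

    import requests

    # Placeholders: substitute your workspace URL and a personal access token
    host = "https://<workspace-url>"
    token = "<personal-access-token>"

    resp = requests.post(
        f"{host}/api/2.0/instance-pools/create",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "instance_pool_name": "analytics-pool",   # hypothetical name
            "node_type_id": "n1-standard-4",          # assumed GCP node type
            "min_idle_instances": 0,
            "idle_instance_autotermination_minutes": 30,
        },
    )
    print(resp.json())  # the response includes the new instance_pool_id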
Databricks on Google Cloud is GA
May 4, 2021
Databricks is pleased to announce the GA release of Databricks on Google Cloud, which brings deep integration with Google Cloud technologies. To get started, go to Google Cloud Platform Marketplace, select Databricks, and follow the instructions in Get started with Databricks.
Databricks on Google Cloud runs on Google Kubernetes Engine (GKE) and provides a built-in integration with Google Cloud technologies:
Google Cloud Identity: Databricks workspace users can authenticate with their Google Cloud Identity account (or GSuite account) using Google’s OAuth 2.0 implementation, which conforms to the OpenID Connect spec and is OpenID certified.
Google Cloud Storage: Databricks notebooks can use Google Cloud Storage (GCS) buckets as data sources:
Access GCS buckets as DBFS mounts: You can create Databricks File System (DBFS) mounts. See Access a GCS bucket through DBFS.
Access GCS buckets directly: You can use gs:// paths directly in notebooks, as shown in the sketch after this list. See Access a GCS bucket directly with a Google Cloud service account key.
BigQuery: Databricks on Google Cloud notebooks can read and write to BigQuery as a data source.
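Here is a minimal notebook sketch of these access patterns; the bucket, mount point, project, and table names are placeholders, and it assumes the cluster's Google service account already has access to the resources:

    # In a Databricks notebook, `spark` and `dbutils` are predefined.

    # Read a file directly from a GCS bucket with a gs:// path (placeholder path)
    events_df = (
        spark.read.format("csv")
        .option("header", "true")
        .load("gs://my-bucket/data/events.csv")
    )

    # Or mount the bucket under DBFS, assuming the cluster's service account can read it
    dbutils.fs.mount("gs://my-bucket", "/mnt/my-bucket")

    # Read a BigQuery table with the built-in connector (placeholder table name)
    bq_df = (
        spark.read.format("bigquery")
        .option("table", "my-project.my_dataset.my_table")
        .load()
    )
    bq_df.show(5)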
The GA release of Databricks on Google Cloud includes all of the features provided in the March 22 public preview, with these additions:
General features:
Clusters and notebooks:
Integrations:
Git integration for Databricks Git folders (Public Preview)
Some features available in Databricks on other clouds are not available in Databricks on Google Cloud as of this release.
Subnet address range size calculator for new workspaces
May 4, 2021
Sizing the GKE subnet ranges used by a new Databricks workspace is key, because you cannot change the subnets after your workspace is deployed. If the address ranges for your Databricks subnets are too small, then the workspace exhausts its IP space, which in turn causes your Databricks jobs to fail. To determine the address range sizes that you need, Databricks now provides a calculator. See Calculate subnet sizes for a new workspace.
Databricks Runtime 7.4 series support ends
May 3, 2021
Support for Databricks Runtime 7.4, Databricks Runtime 7.4 for Machine Learning, and Databricks Runtime 7.4 for Genomics ended on May 3. See Databricks support lifecycles.
Jobs service stability and scalability improvements (Public Preview)
May 3-10, 2021: Version 3.45
To increase the stability and scalability of the Jobs service, each new job and run will now be assigned a unique, non-sequential identifier that may not monotonically increase. Clients that use the Jobs API and depend on sequential or monotonically increasing identifiers must be modified to accept non-sequential and unordered IDs.
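As a sketch of the required client-side change, order runs by their timestamps rather than by run ID; the workspace URL and token below are placeholders:

    import requests

    # Placeholders: substitute your workspace URL and a personal access token
    host = "https://<workspace-url>"
    token = "<personal-access-token>"

    resp = requests.get(
        f"{host}/api/2.0/jobs/runs/list",
        headers={"Authorization": f"Bearer {token}"},
        params={"limit": 25},
    )
    runs = resp.json().get("runs", [])

    # Run IDs are now opaque: do not assume they increase over time.
    # Sort by start_time (milliseconds since epoch) when ordering matters.
    latest_first = sorted(runs, key=lambda r: r["start_time"], reverse=True)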