What is Unity Catalog?
This article introduces Unity Catalog, a unified governance solution for data and AI assets on the Lakehouse.
Overview of Unity Catalog
Unity Catalog provides centralized access control, auditing, and data discovery capabilities across Databricks workspaces.
Key features of Unity Catalog include:
Define once, secure everywhere: Unity Catalog offers a single place to administer data access policies that apply across all workspaces and personas.
Standards-compliant security model: Unity Catalog’s security model is based on standard ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, databases (also called schemas), tables, and views.
Built-in auditing: Unity Catalog automatically captures user-level audit logs that record access to your data.
Data discovery: Unity Catalog lets you tag and document data assets, and provides a search interface to help data consumers find data.
The Unity Catalog object model
In Unity Catalog, the hierarchy of primary data objects flows from metastore to table:
Metastore: The top-level container for metadata. Each metastore exposes a three-level namespace (catalog.schema.table) that organizes your data.
Catalog: The first layer of the object hierarchy, used to organize your data assets.
Schema: Also known as databases, schemas are the second layer of the object hierarchy and contain tables and views.
Table: At the lowest level in the object hierarchy are tables and views.
This is a simplified view of securable Unity Catalog objects. For more details, see Securable objects in Unity Catalog.
You reference all data in Unity Catalog using a three-level namespace.
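As a minimal sketch of how the three-level namespace composes, the following Python helper joins a catalog, schema, and table name into one fully qualified name. The function names and the example names main, sales, and orders are hypothetical. The backtick quoting assumes the usual Spark SQL identifier convention.

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build a three-level name of the form catalog.schema.table."""
    parts = (catalog, schema, table)
    if not all(parts):
        raise ValueError("catalog, schema, and table must all be non-empty")
    return ".".join(quote_identifier(part) for part in parts)


# Hypothetical names, used only for illustration.
print(qualified_name("main", "sales", "orders"))
```

Quoting each level separately keeps names that contain dots or hyphens from being misread as extra namespace levels.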
A metastore is the top-level container of objects in Unity Catalog. It stores metadata about data assets (tables and views) and the permissions that govern access to them. Databricks account admins can create a metastore for each region in which they operate and assign it to Databricks workspaces in the same region. For a workspace to use Unity Catalog, it must have a Unity Catalog metastore attached.
Each metastore is configured with a root storage location in a GCS bucket in your Google Cloud account. This storage location is used by default for storing data for managed tables.
This metastore is distinct from the Hive metastore included in Databricks workspaces that have not been enabled for Unity Catalog. If your workspace includes a legacy Hive metastore, the data in that metastore will still be available alongside data defined in Unity Catalog, in a catalog named hive_metastore. Note that the hive_metastore catalog is not managed by Unity Catalog and does not benefit from the same feature set as catalogs defined in Unity Catalog.
A catalog is the first layer of Unity Catalog’s three-level namespace. It’s used to organize your data assets. Users can see all catalogs on which they have been assigned the USE CATALOG data permission.
A schema (also called a database) is the second layer of Unity Catalog’s three-level namespace. A schema organizes tables and views. To access (or list) a table or view in a schema, users must have the USE SCHEMA data permission on the schema, the USE CATALOG data permission on its parent catalog, and the SELECT permission on the table or view.
A table resides in the third layer of Unity Catalog’s three-level namespace. It contains rows of data. To create a table, users must have the CREATE and USE SCHEMA permissions on the schema, and they must have the USE CATALOG permission on its parent catalog. To query a table, users must have the SELECT permission on the table, the USE SCHEMA permission on its parent schema, and the USE CATALOG permission on its parent catalog.
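The query requirement described above can be sketched as a simple set check. This is a simplified model and not how Unity Catalog evaluates access: it ignores ownership, admin roles, and privilege inheritance. The Grant tuple shape, the can_query function, and all principal and object names are hypothetical.

```python
from typing import Set, Tuple

# A grant is modeled as (principal, privilege, securable name).
Grant = Tuple[str, str, str]


def can_query(principal: str, table: str, grants: Set[Grant]) -> bool:
    """Return True if the principal holds all three privileges needed to query the table."""
    # Expects an unquoted three-level name such as "main.sales.orders".
    catalog, schema, _ = table.split(".")
    required = {
        (principal, "USE CATALOG", catalog),
        (principal, "USE SCHEMA", f"{catalog}.{schema}"),
        (principal, "SELECT", table),
    }
    # Every required grant must be present.
    return required <= grants


grants = {
    ("analysts", "USE CATALOG", "main"),
    ("analysts", "USE SCHEMA", "main.sales"),
    ("analysts", "SELECT", "main.sales.orders"),
}
print(can_query("analysts", "main.sales.orders", grants))
```

Removing any one of the three grants makes the check fail, which mirrors the rule that SELECT alone is not enough to query a table.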
A table can be managed or external.
Managed tables are the default way to create tables in Unity Catalog. Unity Catalog manages the lifecycle and file layout for these tables. You should not use tools outside of Databricks to manipulate files in these tables directly.
By default, managed tables are stored in the root storage location that you configure when you create a metastore. You can optionally specify managed table storage locations at the catalog or schema levels, overriding the root storage location. Managed tables always use the Delta table format.
When a managed table is dropped, its underlying data is deleted from your cloud tenant within 30 days.
See Managed tables.
External tables are tables whose data lifecycle and file layout are not managed by Unity Catalog. Use external tables to register large amounts of existing data in Unity Catalog, or if you require direct access to the data using tools outside of Databricks clusters or Databricks SQL warehouses.
When you drop an external table, Unity Catalog does not delete the underlying data. You can manage privileges on external tables and use them in queries in the same way as managed tables.
External tables can use the following file formats:
See External tables.
Storage credentials and external locations
To manage access to the underlying cloud storage for an external table, Unity Catalog introduces the following object types:
Storage credentials encapsulate a long-term cloud credential that provides access to cloud storage, for example, a service account that can access GCS buckets.
External locations contain a reference to a storage credential and a cloud storage path.
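To illustrate how the two object types relate, the following sketch models an external location as a named pair of a cloud storage path and a storage credential, and finds the location that covers a given path. This is an illustrative model only, not Unity Catalog's implementation. The class, function, bucket, and credential names are hypothetical, and the longest-prefix rule is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ExternalLocation:
    name: str
    url: str                 # cloud storage path prefix, for example a gs:// path
    storage_credential: str  # name of the storage credential it references


def covering_location(
    path: str, locations: Sequence[ExternalLocation]
) -> Optional[ExternalLocation]:
    """Return the location with the longest URL prefix that contains the path, if any."""
    best: Optional[ExternalLocation] = None
    for location in locations:
        prefix = location.url.rstrip("/")
        # Match whole path segments so that ".../raw" does not cover ".../raw2".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best.url.rstrip("/")):
                best = location
    return best


locations = [
    ExternalLocation("raw_zone", "gs://example-bucket/raw", "example_service_account"),
]
match = covering_location("gs://example-bucket/raw/events/2023", locations)
print(match.name if match else "no external location covers this path")
```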
A view is a read-only object created from one or more tables and views in a metastore. It resides in the third layer of Unity Catalog’s three-level namespace. A view can be created from tables and other views in multiple schemas and catalogs. You can create dynamic views to enable row- and column-level permissions.
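As a sketch of column-level permissions with a dynamic view, the following Python function builds a CREATE VIEW statement that returns one column only to members of an account-level group and a placeholder to everyone else. It assumes the Databricks SQL function is_account_group_member is available and that the masked column is a string. The view, table, column, and group names are hypothetical.

```python
from typing import Sequence


def dynamic_view_sql(
    view: str, source: str, columns: Sequence[str], masked_column: str, group: str
) -> str:
    """Build a CREATE VIEW statement that masks one column for users outside a group."""
    select_items = []
    for column in columns:
        if column == masked_column:
            # Only members of the group see the real value.
            select_items.append(
                f"CASE WHEN is_account_group_member('{group}') THEN {column} "
                f"ELSE 'REDACTED' END AS {column}"
            )
        else:
            select_items.append(column)
    select_list = ",\n  ".join(select_items)
    return f"CREATE VIEW {view} AS\nSELECT\n  {select_list}\nFROM {source}"


print(
    dynamic_view_sql(
        view="main.sales.orders_masked",
        source="main.sales.orders",
        columns=["order_id", "email"],
        masked_column="email",
        group="auditors",
    )
)
```

The same function can also be used in a WHERE clause to filter rows by group membership, which gives row-level rather than column-level control.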
Identity management for Unity Catalog
Unity Catalog uses the identities in the Databricks account to resolve users, service principals, and groups, and to enforce permissions.
To configure identities in the account, follow the instructions in Manage users, service principals, and groups. Refer to those users, service principals, and groups when you create access-control policies in Unity Catalog.
Unity Catalog users, service principals, and groups must also be added to workspaces to access Unity Catalog data in a notebook, a Databricks SQL query, Data Explorer, or a REST API command. The assignment of users, service principals, and groups to workspaces is called identity federation.
All workspaces that have a Unity Catalog metastore attached to them are enabled for identity federation.
Special considerations for groups
Any groups that already exist in the workspace are labeled Workspace local in the account console. These workspace-local groups cannot be used in Unity Catalog to define access policies. You must use account-level groups. If a workspace-local group is referenced in a command, that command will return an error that the group was not found. If you previously used workspace-local groups to manage access to notebooks and other artifacts, these permissions remain in effect.
See Manage groups.
Admin roles for Unity Catalog
The following admin roles are required for managing Unity Catalog:
Account admins can manage identities, cloud resources and the creation of workspaces and Unity Catalog metastores.
Account admins can enable workspaces for Unity Catalog. They can grant both workspace and metastore admin permissions.
Metastore admins can manage privileges and ownership for all securable objects within a metastore, such as who can create catalogs or query a table.
The account admin who creates the Unity Catalog metastore becomes the initial metastore admin. The metastore admin can also choose to delegate this role to another user or group. We recommend assigning the metastore admin to a group, in which case any member of the group receives the privileges of the metastore admin. See (Recommended) Transfer ownership of your metastore to a group.
Workspace admins can add users to a Databricks workspace, assign them the workspace admin role, and manage access to objects and functionality in the workspace, such as the ability to create clusters and change job ownership.
Data permissions in Unity Catalog
In Unity Catalog, data is secure by default. Initially, users have no access to data in a metastore. Access can be granted by either a metastore admin, the owner of an object, or the owner of the catalog or schema that contains the object. Securable objects in Unity Catalog are hierarchical and privileges are inherited downward.
You can assign and revoke permissions using Data Explorer, SQL commands, or REST APIs.
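For the SQL route, the following sketch generates the GRANT statements that give a principal read access to one table, plus the matching REVOKE for the table privilege. The function names, the group name data-consumers, and the table name are hypothetical. You would run the resulting statements in a notebook or the Databricks SQL editor as a user who is allowed to grant on those objects.

```python
from typing import List


def read_access_grants(principal: str, table: str) -> List[str]:
    """Return GRANT statements for the three privileges needed to query a table."""
    # Expects an unquoted three-level name such as "main.sales.orders".
    catalog, schema, _ = table.split(".")
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {table} TO `{principal}`",
    ]


def revoke_select(principal: str, table: str) -> str:
    """Return the REVOKE statement that removes SELECT on the table."""
    return f"REVOKE SELECT ON TABLE {table} FROM `{principal}`"


for statement in read_access_grants("data-consumers", "main.sales.orders"):
    print(statement)
print(revoke_select("data-consumers", "main.sales.orders"))
```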
Cluster access modes for Unity Catalog
To access data in Unity Catalog, clusters must be configured with the correct access mode. Unity Catalog is secure by default. If a cluster is not configured with one of the Unity-Catalog-capable access modes (that is, shared or single user), the cluster can’t access data in Unity Catalog.
See Create clusters & SQL warehouses with Unity Catalog access.
How do I set up Unity Catalog for my organization?
To set up Unity Catalog for your organization, you do the following:
Configure a GCS bucket that Unity Catalog can use to store and access data in your GCP account.
As part of metastore creation (in the next step), Databricks generates a service account that you will use to grant access to this GCS bucket.
Create a metastore for each region in which your organization operates.
Attach workspaces to the metastore. Each workspace will have the same view of the data you manage in Unity Catalog.
If you have a new account, add users, groups, and service principals to your Databricks account.
Next, you create and grant access to catalogs, schemas, and tables.
For complete setup instructions, see Get started using Unity Catalog.
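Once a workspace is attached to a metastore, creating the first objects might look like the following sketch, which generates the SQL statements for a catalog, a schema, and a table. The function name, object names, and column definition are hypothetical. Running the statements requires a user with the privileges to create each object.

```python
from typing import List


def bootstrap_statements(catalog: str, schema: str, table: str, columns: str) -> List[str]:
    """Return SQL statements that create a catalog, a schema, and a managed table, in order."""
    return [
        f"CREATE CATALOG IF NOT EXISTS {catalog}",
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}",
        f"CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table} ({columns})",
    ]


# Hypothetical object names and columns, used only for illustration.
for statement in bootstrap_statements("main", "sales", "orders", "order_id INT, email STRING"):
    print(statement)
```

The order matters because each object is created inside its parent: the catalog must exist before the schema, and the schema before the table.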
Unity Catalog is supported on clusters that run Databricks Runtime 11.3 LTS or above. Unity Catalog is supported by default on all SQL warehouse compute versions.
Clusters running on earlier versions of Databricks Runtime do not provide support for all Unity Catalog GA features and functionality.
For information about updated Unity Catalog functionality in later Databricks Runtime versions, see the release notes for those versions.
For the list of regions that support Unity Catalog, see Databricks clouds and regions.
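The runtime requirement above can be checked with a small helper such as the following sketch. It assumes version strings that begin with a major.minor number, such as "11.3 LTS" or "12.2.x-scala2.12". The function names are hypothetical.

```python
import re
from typing import Tuple

# Minimum Databricks Runtime version for Unity Catalog, as stated above.
MIN_RUNTIME = (11, 3)


def parse_runtime(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from the start of a runtime version string."""
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized runtime version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_unity_catalog_minimum(version: str) -> bool:
    """Return True if the runtime version is 11.3 or above."""
    return parse_runtime(version) >= MIN_RUNTIME


for candidate in ("10.4 LTS", "11.3 LTS", "12.2.x-scala2.12"):
    print(candidate, meets_unity_catalog_minimum(candidate))
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "9.1" sorts after "11.3".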
Supported data file formats
Unity Catalog supports the following table formats:
Managed tables must use the Delta table format.
External tables can use
Unity Catalog has the following limitations.
If your cluster is running on a Databricks Runtime version below 11.3 LTS, there may be additional limitations, not listed here. Unity Catalog is supported on Databricks Runtime 11.3 LTS or above.
Scala, R, and workloads using Databricks Runtime for Machine Learning are supported only on clusters using the single user access mode. Workloads in these languages do not support the use of dynamic views for row-level or column-level security.
Shallow clones are not supported when you use Unity Catalog as the source or target of the clone.
Bucketing is not supported for Unity Catalog tables. Commands that try to create a bucketed table in Unity Catalog throw an exception.
Writing to the same path or Delta Lake table from workspaces in multiple regions can lead to unreliable performance if some clusters access Unity Catalog and others do not.
Custom partition schemes created using commands like ALTER TABLE ADD PARTITION are not supported for tables in Unity Catalog. Unity Catalog can access tables that use directory-style partitioning.
Overwrite mode for DataFrame write operations into Unity Catalog is supported only for Delta tables, not for other file formats. The user must have the CREATE privilege on the parent schema and must be the owner of the existing object or have the MODIFY privilege on the object.
Referencing Unity Catalog tables from Delta Live Tables pipelines is supported in Private Preview. Contact your account team for access.
Spark-submit jobs are supported on single user clusters but not shared clusters. See What is cluster access mode?
Python UDFs on shared clusters are supported in Private Preview. Contact your account team for access.
Groups that were previously created in a workspace (that is, workspace-level groups) cannot be used in Unity Catalog GRANT statements. This is to ensure a consistent view of groups that can span across workspaces. To use groups in GRANT statements, create your groups at the account level and update any automation for principal or group management (such as SCIM, Okta and AAD connectors, and Terraform) to reference account endpoints instead of workspace endpoints. See Difference between account groups and workspace-local groups.
Standard Scala thread pools are not supported. Instead, use the special thread pools in org.apache.spark.util.ThreadUtils, for example, org.apache.spark.util.ThreadUtils.newDaemonFixedThreadPool. However, the following thread pools in ThreadUtils are not supported:
Structured Streaming support
Support for Structured Streaming on Unity Catalog tables (managed or external) depends on the Databricks Runtime version that you are running and on whether you are using shared or single user clusters.
Support for shared clusters requires Databricks Runtime 12.2 LTS and above, with the following limitations:
Continuous streaming is not supported.
applyInPandasWithState is not supported.
Working with socket sources is not supported.
StreamingQueryListener cannot use credentials or interact with objects managed by Unity Catalog.
sourceArchiveDir must be in the same external location as the source when you use option("cleanSource", "archive") with a data source managed by Unity Catalog.
For Kafka sources and sinks, the following options are not supported:
Support for single user clusters is available on Databricks Runtime 11.3 LTS and above, with the following limitations:
Continuous streaming is not supported.
StreamingQueryListener cannot use credentials or interact with objects managed by Unity Catalog.
Asynchronous checkpointing is not supported in Databricks Runtime 11.3 LTS and below. It is supported in Databricks Runtime 12.0 and above.
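The version rules above can be summarized in a small lookup, sketched below. It encodes only the minimum runtime versions stated in this section and does not capture the per-mode limitations. The function names and the (major, minor) tuple representation are hypothetical.

```python
from typing import Tuple

Runtime = Tuple[int, int]  # (major, minor), for example (12, 2)

# Minimum Databricks Runtime for Structured Streaming on Unity Catalog tables.
MIN_STREAMING_RUNTIME = {
    "shared": (12, 2),
    "single user": (11, 3),
}


def streaming_supported(access_mode: str, runtime: Runtime) -> bool:
    """Return True if the access mode and runtime meet the stated minimum."""
    if access_mode not in MIN_STREAMING_RUNTIME:
        raise ValueError(f"unknown access mode: {access_mode!r}")
    return runtime >= MIN_STREAMING_RUNTIME[access_mode]


def async_checkpointing_supported(runtime: Runtime) -> bool:
    """Asynchronous checkpointing requires Databricks Runtime 12.0 or above."""
    return runtime >= (12, 0)


print(streaming_supported("shared", (11, 3)))
print(streaming_supported("single user", (11, 3)))
print(async_checkpointing_supported((11, 3)))
```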
Unity Catalog enforces resource quotas on all securable objects. Limits respect the same hierarchical organization throughout Unity Catalog. If you expect to exceed these resource limits, contact your Databricks account representative.
Quota values below are expressed relative to the parent object in Unity Catalog.
For Delta Sharing limits, see Resource quotas.