This article explains how to use the per-workspace Hive metastore when your Databricks workspace is enabled for Unity Catalog.
If your workspace was in service before it was enabled for Unity Catalog, it likely has a Hive metastore that contains data that you want to continue to use. Databricks recommends that you migrate the tables managed by the Hive metastore to the Unity Catalog metastore, but if you choose not to, this article explains how to work with data managed by both metastores.
The Unity Catalog metastore is additive, meaning it can be used with the per-workspace Hive metastore in Databricks. The Hive metastore appears as a top-level catalog called hive_metastore in the three-level namespace.
For example, you can refer to a table called sales_raw in the sales schema in the legacy Hive metastore by using the following notation:
SELECT * FROM hive_metastore.sales.sales_raw;
You can also specify the catalog and schema with a USE statement:
USE hive_metastore.sales;
SELECT * FROM sales_raw;
If you configured table access control on the Hive metastore, Databricks continues to enforce those access controls for data in the hive_metastore catalog for clusters running in the shared access mode. The Unity Catalog access model differs slightly from legacy access controls; for example, it has no DENY statements. The Hive metastore is a workspace-level object. Permissions defined within the hive_metastore catalog always refer to the local users and groups in the workspace. See Differences from table access control.
The access control model in Unity Catalog differs from table access control in the legacy per-workspace Hive metastore in the following key ways:
Account groups: Access control policies in Unity Catalog are applied to account groups, while access control policies for the Hive metastore are applied to workspace-local groups. See Difference between account groups and workspace-local groups.
USE CATALOG and USE SCHEMA permissions are required on the catalog and schema for all operations on objects inside the catalog or schema: Regardless of a principal’s privileges on a table, the principal must also have the USE CATALOG privilege on its parent catalog to access the schema and the USE SCHEMA privilege to access objects within the schema. With workspace-level table access controls, on the other hand, granting USAGE on the root catalog automatically grants USAGE on all databases, but USAGE on the root catalog is not required.
Views: In Unity Catalog, the owner of a view does not need to be an owner of the view’s referenced tables and views. Having the SELECT privilege is sufficient, along with USE SCHEMA on the view’s parent schema and USE CATALOG on the parent catalog. With workspace-level table access controls, a view’s owner needs to be an owner of all referenced tables and views.
No support for ANONYMOUS FUNCTIONs: In Unity Catalog, there is no concept of an ANONYMOUS FUNCTION permission. Such a permission could be used to circumvent access control restrictions by allowing an unprivileged user to run privileged code.
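The USE CATALOG and USE SCHEMA requirement described above can be pictured with a small model. The following is a minimal sketch in plain Python, not Databricks code: the Grants class and can_select_unity_catalog function are hypothetical names invented for illustration, and the sketch deliberately ignores ownership and privilege inheritance. It only shows that holding SELECT on a table is not sufficient by itself.

```python
from dataclasses import dataclass, field


@dataclass
class Grants:
    """Hypothetical in-memory set of (privilege, securable) pairs for one principal."""
    held: set = field(default_factory=set)

    def grant(self, privilege: str, securable: str) -> None:
        self.held.add((privilege, securable))

    def has(self, privilege: str, securable: str) -> bool:
        return (privilege, securable) in self.held


def can_select_unity_catalog(grants: Grants, table: str) -> bool:
    """Illustrative check: all three privileges must be present.

    Assumes `table` is a plain catalog.schema.table name with no quoted dots.
    """
    catalog, schema, _ = table.split(".")
    return (
        grants.has("USE CATALOG", catalog)
        and grants.has("USE SCHEMA", f"{catalog}.{schema}")
        and grants.has("SELECT", table)
    )


analyst = Grants()
analyst.grant("SELECT", "main.shared_sales.sales_historical")
# SELECT alone is not enough in this model.
print(can_select_unity_catalog(analyst, "main.shared_sales.sales_historical"))  # False

analyst.grant("USE CATALOG", "main")
analyst.grant("USE SCHEMA", "main.shared_sales")
print(can_select_unity_catalog(analyst, "main.shared_sales.sales_historical"))  # True
```

In a real workspace the equivalent privileges are granted with SQL GRANT statements, and the platform performs the check.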
By using three-level namespace notation, you can join data in a Unity Catalog metastore with data in the legacy Hive metastore.
A join with data in the legacy Hive metastore will only work on the workspace where that data resides. Trying to run such a join in another workspace results in an error. Databricks recommends that you upgrade legacy tables and views to Unity Catalog.
The following example joins results from the sales_current table in the legacy Hive metastore with the sales_historical table in the Unity Catalog metastore when the order_id fields are equal.
-- SQL
SELECT * FROM hive_metastore.sales.sales_current
JOIN main.shared_sales.sales_historical
ON hive_metastore.sales.sales_current.order_id = main.shared_sales.sales_historical.order_id;
# Python
dfCurrent = spark.table("hive_metastore.sales.sales_current")
dfHistorical = spark.table("main.shared_sales.sales_historical")
display(dfCurrent.join(
  other = dfHistorical,
  on = dfCurrent.order_id == dfHistorical.order_id
))
# R
library(SparkR)
dfCurrent = tableToDF("hive_metastore.sales.sales_current")
dfHistorical = tableToDF("main.shared_sales.sales_historical")
display(join(
  x = dfCurrent,
  y = dfHistorical,
  joinExpr = dfCurrent$order_id == dfHistorical$order_id))
// Scala
val dfCurrent = spark.table("hive_metastore.sales.sales_current")
val dfHistorical = spark.table("main.shared_sales.sales_historical")
display(dfCurrent.join(
  right = dfHistorical,
  joinExprs = dfCurrent("order_id") === dfHistorical("order_id")
))
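Because a join that touches the legacy Hive metastore only works in the workspace where that data resides, it can help to flag which inputs of a query are workspace-local before reusing the query elsewhere. The following is a minimal sketch in plain Python; the function name workspace_local_tables is hypothetical, and the sketch assumes every input is already written as an unquoted catalog.schema.table name.

```python
WORKSPACE_LOCAL_CATALOG = "hive_metastore"


def workspace_local_tables(table_names):
    """Return the three-level names that live in the legacy Hive metastore."""
    local = []
    for name in table_names:
        parts = name.split(".")
        if len(parts) != 3:
            raise ValueError(f"expected catalog.schema.table, got {name!r}")
        if parts[0] == WORKSPACE_LOCAL_CATALOG:
            local.append(name)
    return local


join_inputs = [
    "hive_metastore.sales.sales_current",
    "main.shared_sales.sales_historical",
]
print(workspace_local_tables(join_inputs))
# ['hive_metastore.sales.sales_current']
```

A non-empty result means the query depends on data that is visible only in the workspace that owns that Hive metastore.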
A default catalog is configured for each workspace that is enabled for Unity Catalog.
If you omit the top-level catalog name when you perform data operations, the default catalog is assumed.
When a workspace is enabled for Unity Catalog, the hive_metastore catalog is set as the default catalog.
If you are transitioning from the Hive metastore to Unity Catalog within an existing workspace, it typically makes sense to keep hive_metastore as the default catalog to avoid impacting existing code that references the Hive metastore.
To learn how to get and switch the default catalog, see Manage the default catalog.
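The effect of the default catalog on name resolution can be sketched in plain Python. This is an illustration only, not how Databricks implements resolution: the function name qualify is hypothetical, the current schema is assumed to be default unless stated otherwise, and identifiers quoted with backticks that contain dots are not handled.

```python
def qualify(name, default_catalog="hive_metastore", current_schema="default"):
    """Expand a one-, two-, or three-part table name to catalog.schema.table."""
    parts = name.split(".")
    if len(parts) == 3:
        # Already fully qualified; the default catalog is not consulted.
        return name
    if len(parts) == 2:
        # schema.table: only the catalog is filled in.
        return f"{default_catalog}.{name}"
    if len(parts) == 1:
        # Bare table name: both catalog and schema are filled in.
        return f"{default_catalog}.{current_schema}.{name}"
    raise ValueError(f"too many name parts in {name!r}")


# Existing code that omits the catalog keeps pointing at the Hive metastore.
print(qualify("sales.sales_raw"))
# hive_metastore.sales.sales_raw

# The same reference resolves elsewhere if the default catalog is changed.
print(qualify("sales.sales_raw", default_catalog="main"))
# main.sales.sales_raw
```

The second call shows why changing the default catalog can affect existing code that relies on two-level names.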
When you use the Hive metastore alongside Unity Catalog, data access credentials associated with the cluster are used to access Hive metastore data but not data registered in Unity Catalog.
If users access paths that are outside Unity Catalog (such as a path not registered as a table or external location), the access credentials assigned to the cluster are used.
Tables in the Hive metastore do not benefit from the full set of security and governance features that Unity Catalog introduces, such as built-in auditing and access control. Databricks recommends that you upgrade your legacy tables by adding them to Unity Catalog.