Clone a table on Databricks

You can create a copy of an existing Delta Lake table on Databricks at a specific version using the clone command. Clones can be either deep or shallow.

Clone is a Databricks-exclusive feature enabled in the Databricks Runtime by default.

Clone types

  • A deep clone is a clone that copies the source table data to the clone target in addition to the metadata of the existing table. Additionally, stream metadata is also cloned such that a stream that writes to the Delta table can be stopped on a source table and continued on the target of a clone from where it left off.

  • A shallow clone is a clone that does not copy the data files to the clone target. The table metadata is equivalent to the source. These clones are cheaper to create.

Any changes made to either deep or shallow clones affect only the clones themselves and not the source table.

The metadata that is cloned includes: schema, partitioning information, invariants, nullability. For deep clones only, stream and COPY INTO metadata are also cloned. Metadata not cloned are the table description and user-defined commit metadata.

Important

  • Shallow clones reference data files in the source directory. If you run vacuum on the source table clients will no longer be able to read the referenced data files and a FileNotFoundException will be thrown. In this case, running clone with replace over the shallow clone will repair the clone. If this occurs often, consider using a deep clone instead which does not depend on the source table.

  • Deep clones do not depend on the source from which they were cloned, but are expensive to create because a deep clone copies the data as well as the metadata.

  • Cloning with replace to a target that already has a table at that path creates a Delta log if one does not exist at that path. You can clean up any existing data by running vacuum.

  • If an existing Delta table exists, a new commit is created that includes the new metadata and new data from the source table. This new commit is incremental, meaning that only new changes since the last clone are committed to the table.

  • Cloning a table is not the same as Create Table As Select or CTAS. A clone copies the metadata of the source table in addition to the data. Cloning also has simpler syntax: you don’t need to specify partitioning, format, invariants, nullability and so on as they are taken from the source table.

  • A cloned table has an independent history from its source table. Time travel queries on a cloned table will not work with the same inputs as they work on its source table.

CREATE TABLE delta.`/data/target/` CLONE delta.`/data/source/` -- Create a deep clone of /data/source at /data/target

CREATE OR REPLACE TABLE db.target_table CLONE db.source_table -- Replace the target

CREATE TABLE IF NOT EXISTS delta.`/data/target/` CLONE db.source_table -- No-op if the target table exists

CREATE TABLE db.target_table SHALLOW CLONE delta.`/data/source`

CREATE TABLE db.target_table SHALLOW CLONE delta.`/data/source` VERSION AS OF version

CREATE TABLE db.target_table SHALLOW CLONE delta.`/data/source` TIMESTAMP AS OF timestamp_expression -- timestamp can be like “2019-01-01” or like date_sub(current_date(), 1)
from delta.tables import *

deltaTable = DeltaTable.forPath(spark, pathToTable)  # path-based tables, or
deltaTable = DeltaTable.forName(spark, tableName)    # Hive metastore-based tables

deltaTable.clone(target, isShallow, replace) # clone the source at latest version

deltaTable.cloneAtVersion(version, target, isShallow, replace) # clone the source at a specific version

# clone the source at a specific timestamp such as timestamp=“2019-01-01”
deltaTable.cloneAtTimestamp(timestamp, target, isShallow, replace)
import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, pathToTable)
val deltaTable = DeltaTable.forName(spark, tableName)

deltaTable.clone(target, isShallow, replace) // clone the source at latest version

deltaTable.cloneAtVersion(version, target, isShallow, replace) // clone the source at a specific version

deltaTable.cloneAtTimestamp(timestamp, target, isShallow, replace) // clone the source at a specific timestamp
import io.delta.tables.*;

DeltaTable deltaTable = DeltaTable.forPath(spark, pathToTable);
DeltaTable deltaTable = DeltaTable.forName(spark, tableName);

deltaTable.clone(target, isShallow, replace) // clone the source at latest version

deltaTable.cloneAtVersion(version, target, isShallow, replace) // clone the source at a specific version

deltaTable.cloneAtTimestamp(timestamp, target, isShallow, replace) // clone the source at a specific timestamp

For syntax details, see CREATE TABLE CLONE.

Clone metrics

Note

Available in Databricks Runtime 8.2 and above.

CLONE reports the following metrics as a single row DataFrame once the operation is complete:

  • source_table_size: Size of the source table that’s being cloned in bytes.

  • source_num_of_files: The number of files in the source table.

  • num_removed_files: If the table is being replaced, how many files are removed from the current table.

  • num_copied_files: Number of files that were copied from the source (0 for shallow clones).

  • removed_files_size: Size in bytes of the files that are being removed from the current table.

  • copied_files_size: Size in bytes of the files copied to the table.

Clone metrics example

Permissions

You must configure permissions for Databricks table access control and your cloud provider.

Table access control

The following permissions are required for both deep and shallow clones:

  • SELECT permission on the source table.

  • If you are using CLONE to create a new table, CREATE permission on the database in which you are creating the table.

  • If you are using CLONE to replace a table, you must have MODIFY permission on the table.

Cloud provider permissions

If you have created a deep clone, any user that reads the deep clone must have read access to the clone’s directory. To make changes to the clone, users must have write access to the clone’s directory.

If you have created a shallow clone, any user that reads the shallow clone needs permission to read the files in the original table, since the data files remain in the source table with shallow clones, as well as the clone’s directory. To make changes to the clone, users will need write access to the clone’s directory.

Use clone for data archiving

Data may need to be kept for longer than is feasible with time travel or for disaster recovery. In these cases, you can create a deep clone to preserve the state of a table at a certain point in time for archival. Incremental archiving is also possible to keep a continually updating state of a source table for disaster recovery.

-- Every month run
CREATE OR REPLACE TABLE delta.`/some/archive/path` CLONE my_prod_table

Use clone for ML model reproduction

When doing machine learning, you may want to archive a certain version of a table on which you trained an ML model. Future models can be tested using this archived data set.

-- Trained model on version 15 of Delta table
CREATE TABLE delta.`/model/dataset` CLONE entire_dataset VERSION AS OF 15

Use clone for short-term experiments on a production table

To test a workflow on a production table without corrupting the table, you can easily create a shallow clone. This allows you to run arbitrary workflows on the cloned table that contains all the production data but does not affect any production workloads.

-- Perform shallow clone
CREATE OR REPLACE TABLE my_test SHALLOW CLONE my_prod_table;

UPDATE my_test WHERE user_id is null SET invalid=true;
-- Run a bunch of validations. Once happy:

-- This should leverage the update information in the clone to prune to only
-- changed files in the clone if possible
MERGE INTO my_prod_table
USING my_test
ON my_test.user_id <=> my_prod_table.user_id
WHEN MATCHED AND my_test.user_id is null THEN UPDATE *;

DROP TABLE my_test;

Use clone for data sharing

Other business units within a single organization may want to access the same data but may not require the latest updates. Instead of giving access to the source table directly, you can provide clones with different permissions for different business units. The performance of the clone can exceed that of a simple view.

-- Perform deep clone
CREATE OR REPLACE TABLE shared_table CLONE my_prod_table;

-- Grant other users access to the shared table
GRANT SELECT ON shared_table TO `<user-name>@<user-domain>.com`;

Use clone to override table properties

Note

Available in Databricks Runtime 7.5 and above.

Table property overrides are particularly useful for:

  • Annotating tables with owner or user information when sharing data with different business units.

  • Archiving Delta tables and time travel is required. You can specify the log retention period independently for the archive table. For example:

CREATE OR REPLACE TABLE archive.my_table CLONE prod.my_table
TBLPROPERTIES (
delta.logRetentionDuration = '3650 days',
delta.deletedFileRetentionDuration = '3650 days'
)
LOCATION 'xx://archive/my_table'
dt = DeltaTable.forName(spark, "prod.my_table")
tblProps = {
"delta.logRetentionDuration": "3650 days",
"delta.deletedFileRetentionDuration": "3650 days"
}
dt.clone('xx://archive/my_table', isShallow=False, replace=True, tblProps)
val dt = DeltaTable.forName(spark, "prod.my_table")
val tblProps = Map(
"delta.logRetentionDuration" -> "3650 days",
"delta.deletedFileRetentionDuration" -> "3650 days"
)
dt.clone("xx://archive/my_table", isShallow = false, replace = true, properties = tblProps)