Frequently asked questions (FAQ)

What is Delta Lake?

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

Delta Lake on Databricks allows you to configure Delta Lake based on your workload patterns and provides optimized layouts and indexes for fast interactive queries.

What format does Delta Lake use to store data?

Delta Lake uses versioned Parquet files to store your data in your cloud storage. In addition to the data files, Delta Lake stores a transaction log that records every commit made to the table or blob store directory; this log is what provides ACID transactions.
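
As a rough illustration, writing a small table and listing its directory shows this layout. This is a minimal sketch: the path is a placeholder, and it assumes a Spark session configured with the Delta Lake libraries (for example, a Databricks cluster) and a driver-local path so the files can be listed with os.

    import os
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    path = "/tmp/delta/events"  # placeholder location
    spark.range(0, 5).write.format("delta").mode("overwrite").save(path)

    # Data files are plain Parquet; commits are JSON files under _delta_log/.
    print(os.listdir(path))                               # part-*.parquet, _delta_log
    print(os.listdir(os.path.join(path, "_delta_log")))   # 00000000000000000000.json, ...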

How can I read and write data with Delta Lake?

You can use your favorite Apache Spark APIs to read and write data with Delta Lake. See Read a table and Write to a table.
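
For example, a minimal read and write with the DataFrame API might look like the following sketch; the path and the table name numbers are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Path-based write and read; the path is a placeholder.
    df = spark.range(0, 100)
    df.write.format("delta").mode("overwrite").save("/tmp/delta/numbers")
    spark.read.format("delta").load("/tmp/delta/numbers").show()

    # Metastore-table write and read; the table name is a placeholder.
    df.write.format("delta").mode("overwrite").saveAsTable("numbers")
    spark.read.table("numbers").show()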

Where does Delta Lake store the data?

When writing data, you can specify the location in your cloud storage. Delta Lake stores the data in that location in Parquet format.
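
For instance, you can pass a cloud storage URI directly to the writer. The bucket and prefix below are placeholders, and the equivalent abfss:// or gs:// URIs work the same way on other clouds.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The bucket and prefix are placeholders for your own storage location.
    events = spark.range(0, 10)
    events.write.format("delta").mode("overwrite").save("s3://my-bucket/delta/events")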

Can I copy my Delta Lake table to another location?

Yes, you can copy your Delta Lake table to another location. Remember to copy the files without changing their timestamps, so that timestamp-based time travel remains consistent.
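
As an illustrative sketch for a table on a local filesystem, with placeholder source and destination paths: shutil.copytree's default copy function (shutil.copy2) preserves modification times. Many cloud copy tools do not, so check your storage tooling before relying on time travel by timestamp.

    import shutil

    # Placeholder source and destination paths on a local filesystem.
    src = "/data/delta/events"
    dst = "/backup/delta/events"

    # The default copy function (shutil.copy2) preserves modification times,
    # so timestamp-based time travel stays consistent in the copy.
    shutil.copytree(src, dst)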

Can I stream data directly into and from Delta tables?

Yes, you can use Structured Streaming to directly write data into Delta tables and read from Delta tables. See Stream data into Delta tables and Stream data from Delta tables.
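
A minimal sketch of both directions, with placeholder paths for the source table, the sink table, and the required checkpoint location:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read a stream from an existing Delta table (placeholder path).
    source = spark.readStream.format("delta").load("/delta/events")

    # Write the stream into another Delta table; a checkpoint location is required.
    query = (
        source.writeStream
        .format("delta")
        .option("checkpointLocation", "/delta/events_copy/_checkpoints")
        .start("/delta/events_copy")
    )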

Does Delta Lake support writes or reads using the Spark Streaming DStream API?

Delta Lake does not support the DStream API. For streaming, we recommend Structured Streaming; see Table streaming reads and writes.

When I use Delta Lake, will I be able to port my code to other Spark platforms easily?

Yes. When you use Delta Lake, you are using open Apache Spark APIs, so you can easily port your code to other Spark platforms. To port your code, replace the delta format with the parquet format, as in the sketch below.
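
In this sketch only the format string changes; the paths are placeholders, and separate paths are used so both versions can run side by side.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0, 10)

    # With Delta Lake:
    df.write.format("delta").mode("overwrite").save("/data/events_delta")
    spark.read.format("delta").load("/data/events_delta").show()

    # The same code on a platform without Delta Lake, using the parquet format:
    df.write.format("parquet").mode("overwrite").save("/data/events_parquet")
    spark.read.format("parquet").load("/data/events_parquet").show()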

How do Delta tables compare to Hive SerDe tables?

Delta tables are managed to a greater degree. In particular, there are several Hive SerDe parameters that Delta Lake manages on your behalf and that you should never specify manually (see the sketch after this list):

  • ROW FORMAT

  • SERDE

  • OUTPUTFORMAT and INPUTFORMAT

  • COMPRESSION

  • STORED AS
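
As a sketch with hypothetical table names: a Hive SerDe table spells out its storage clauses by hand, while a Delta table declares only the format and Delta Lake manages the rest. The Hive SerDe example assumes a Spark session with Hive support enabled.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Hive SerDe table: the storage layout is spelled out by hand.
    spark.sql("""
        CREATE TABLE events_hive (id BIGINT, name STRING)
        STORED AS PARQUET
    """)

    # Delta table: none of those clauses appear; Delta Lake manages them for you.
    spark.sql("""
        CREATE TABLE events_delta (id BIGINT, name STRING)
        USING DELTA
    """)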

What DDL and DML features does Delta Lake not support?

  • Unsupported DDL features:

    • ANALYZE TABLE PARTITION

    • ALTER TABLE [ADD|DROP] PARTITION

    • ALTER TABLE RECOVER PARTITIONS

    • ALTER TABLE SET SERDEPROPERTIES

    • CREATE TABLE LIKE

    • INSERT OVERWRITE DIRECTORY

    • LOAD DATA

  • Unsupported DML features:

    • INSERT INTO [OVERWRITE] table with static partitions

    • INSERT OVERWRITE TABLE for table with dynamic partitions

    • Bucketing

    • Specifying a schema when reading from a table

    • Specifying target partitions using PARTITION (part_spec) in TRUNCATE TABLE

Does Delta Lake support multi-table transactions?

Delta Lake does not support multi-table transactions or foreign keys. Delta Lake supports transactions at the table level.

How can I change the type of a column?

Changing a column’s type or dropping a column requires rewriting the table. For an example, see Change column type.
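
As a sketch with a placeholder table name and column, one way to rewrite the table is to read it, cast the column, and overwrite the table while allowing the schema to change:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    # Read the table, cast the column, and overwrite the table in place.
    # overwriteSchema allows the overwrite to replace the previous schema.
    (spark.read.table("events")
        .withColumn("id", col("id").cast("string"))
        .write.format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .saveAsTable("events"))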

What does it mean that Delta Lake supports multi-cluster writes?

It means that Delta Lake uses locking to ensure that queries writing to a table from multiple clusters at the same time do not corrupt the table. However, it does not mean that conflicting writes (for example, an update and a delete that touch the same rows) will both succeed. Instead, one of the writes will fail atomically, and the error will tell you to retry the operation.
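
A minimal retry sketch for that situation; write_batch is a hypothetical stand-in for your own function that performs a single Delta write, and you should narrow the except clause to the concurrent-modification exception your runtime actually raises.

    import time

    # write_batch is a placeholder for a function that performs one Delta write.
    # Narrow the except clause to the concurrent-modification exception raised
    # in your environment instead of catching every exception.
    def write_with_retry(write_batch, attempts=3, wait_seconds=5):
        for attempt in range(attempts):
            try:
                write_batch()
                return
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(wait_seconds)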

Can I access Delta tables outside of Databricks Runtime?

There are two cases to consider: external reads and external writes.