Low shuffle merge on Databricks

Note

Low shuffle merge is generally available (GA) in Databricks Runtime 10.3 and above and in Public Preview in Databricks Runtime 9.1 LTS. We recommend that Preview customers migrate to Databricks Runtime 10.3 or above.

The MERGE command is used to perform simultaneous updates, insertions, and deletions from a Delta Lake table. Databricks has an optimized implementation of MERGE that improves performance substantially for common workloads by reducing the number of shuffle operations.

Databricks low shuffle merge provides better performance by processing unmodified rows in a separate, more streamlined processing mode, instead of processing them together with the modified rows. As a result, the amount of shuffled data is reduced significantly, leading to improved performance. Low shuffle merge also reduces the need for users to re-run the OPTIMIZE ZORDER BY command after performing a MERGE operation.

Optimized performance

Many MERGE workloads only update a relatively small number of rows in a table. However, Delta tables can only be updated on a per-file basis. When the MERGE command needs to update or delete a small number of rows that are stored in a particular file, then it must also process and rewrite all remaining rows that are stored in the same file, even though these rows are unmodified. Low shuffle merge optimizes the processing of unmodified rows. Previously, they were processed in the same way as modified rows, passing them through multiple shuffle stages and expensive calculations. In low shuffle merge, the unmodified rows are instead processed without any shuffles, expensive processing, or other added overhead.

Optimized data layout

In addition to being faster to run, low shuffle merge benefits subsequent operations as well. The earlier MERGE implementation caused the data layout of unmodified data to be changed entirely, resulting in lower performance on subsequent operations. Low shuffle merge tries to preserve the existing data layout of the unmodified records, including Z-order optimization on a best-effort basis. Hence, with low shuffle merge, the performance of operations on a Delta table will degrade more slowly after running one or more MERGE commands.

Note

Low shuffle merge tries to preserve the data layout on existing data that is not modified. The data layout of updated or newly inserted data may not be optimal, so it may still be necessary to run the OPTIMIZE or OPTIMIZE ZORDER BY commands.

Availability

Low shuffle merge is enabled by default in Databricks Runtime 10.4 and above. In earlier supported Databricks Runtime versions it can be enabled by setting the configuration spark.databricks.delta.merge.enableLowShuffle to true. This flag has no effect in Databricks Runtime 10.4 and above.