What is predictive I/O?

Predictive I/O is a collection of Databricks optimizations that improve performance for data interactions. Predictive I/O capabilities are grouped into the following categories:

  • Accelerated reads reduce the time it takes to scan and read data.

  • Accelerated updates reduce the amount of data that needs to be rewritten during updates, deletes, and merges.

Predictive I/O is exclusive to the Photon engine on Databricks.

Use predictive I/O to accelerate reads

Predictive I/O is used to accelerate data scanning and filtering performance for all operations on supported compute types.

Important

Predictive I/O reads are supported by the serverless and pro types of SQL warehouses, and Photon-accelerated clusters running Databricks Runtime 11.3 LTS and above.

Predictive I/O improves scanning performance by applying deep learning techniques to do the following:

  • Determine the most efficient access pattern to read the data and scan only the data that is actually needed.

  • Eliminate the decoding of columns and rows that are not required to generate query results.

  • Calculate the probability that the search criteria in a selective query match a row. As queries run, predictive I/O uses these probabilities to anticipate where the next matching row will occur and reads only that data from cloud storage.
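As an illustration, a selective query of the kind described in the last point might look like the following sketch. The table `main.default.events` and its columns are hypothetical. No hint or setting appears in the query, because predictive I/O is applied automatically on supported compute.

```sql
-- Hypothetical table and columns, for illustration only.
-- A highly selective filter: only a small fraction of rows is expected to match.
SELECT event_id, user_id, event_time
FROM main.default.events
WHERE user_id = 'u-1042'
  AND event_time >= '2024-01-01';
```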

Use predictive I/O to accelerate updates

Predictive I/O for updates is used automatically for all tables that have deletion vectors enabled on the following Photon-enabled compute types:

  • Serverless SQL warehouses.

  • Pro SQL warehouses.

  • Clusters running Databricks Runtime 14.0 and above.

Note

Support for predictive I/O for updates is present in Databricks Runtime 12.2 LTS and above, but Databricks recommends using 14.0 and above for best performance.

See What are deletion vectors?.

Important

A workspace admin setting controls whether deletion vectors are auto-enabled for new Delta tables. See Auto-enable deletion vectors.

You enable support for deletion vectors on a Delta Lake table by setting a Delta Lake table property. You can set the property when you create a table or by altering an existing table, as in the following examples:

CREATE TABLE <table-name> [options] TBLPROPERTIES ('delta.enableDeletionVectors' = true);

ALTER TABLE <table-name> SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);
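To confirm that the property took effect, one option is to query the table properties. This is a minimal sketch; the table name `main.default.events` is hypothetical.

```sql
-- Hypothetical table name, for illustration only.
-- Returns the current value of the deletion vectors property for the table.
SHOW TBLPROPERTIES main.default.events ('delta.enableDeletionVectors');
```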

Warning

When you enable deletion vectors, the table protocol version is upgraded. After upgrading, the table will not be readable by Delta Lake clients that do not support deletion vectors. See How does Databricks manage Delta Lake feature compatibility?.

For a list of clients that support deletion vectors, see Compatibility with Delta clients.

In Databricks Runtime 14.1 and above, you can drop the deletion vectors table feature to enable compatibility with other Delta clients. See Drop Delta table features.
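A minimal sketch of dropping the feature, assuming a hypothetical table named `main.default.events`. Dropping a table feature can involve additional steps, so treat this as a starting point and follow Drop Delta table features for the full procedure.

```sql
-- Hypothetical table name, for illustration only.
-- Requires Databricks Runtime 14.1 or above.
ALTER TABLE main.default.events DROP FEATURE deletionVectors;
```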

Predictive I/O leverages deletion vectors to accelerate updates by reducing the frequency of full file rewrites during data modification on Delta tables. Predictive I/O optimizes DELETE, MERGE, and UPDATE operations.

Rather than rewriting every record in a data file when any record is updated or deleted, predictive I/O uses deletion vectors to mark records as removed from the target data files. Updated values are written to supplemental data files.

Subsequent reads on the table resolve current table state by applying the noted changes to the most recent table version.
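The statements below sketch the three operations that predictive I/O optimizes. The tables `main.default.events` and `main.default.events_updates` and all column names are hypothetical, and the statements assume deletion vectors are already enabled on the target table. The syntax is the same as for any Delta table; the optimization is applied automatically.

```sql
-- Hypothetical tables and columns, for illustration only.

-- DELETE: matching rows are marked as removed in a deletion vector.
DELETE FROM main.default.events
WHERE user_id = 'u-1042';

-- UPDATE: old row versions are marked as removed and new values are written separately.
UPDATE main.default.events
SET status = 'archived'
WHERE event_time < '2023-01-01';

-- MERGE: upsert rows from a staging table into the target table.
MERGE INTO main.default.events AS t
USING main.default.events_updates AS s
  ON t.event_id = s.event_id
WHEN MATCHED THEN
  UPDATE SET t.status = s.status
WHEN NOT MATCHED THEN
  INSERT (event_id, user_id, event_time, status)
  VALUES (s.event_id, s.user_id, s.event_time, s.status);
```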

Important

Predictive I/O updates share all limitations with deletion vectors. In Databricks Runtime 12.2 LTS and above, the following limitations exist:

  • Delta Sharing is not supported on tables with deletion vectors enabled.

  • You cannot generate a manifest file for a table with deletion vectors present. Run REORG TABLE ... APPLY (PURGE) and ensure no concurrent write operations are running in order to generate a manifest.

  • You cannot incrementally generate manifest files for a table with deletion vectors enabled.
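The workaround for generating a manifest, described in the second limitation above, can be sketched as the following two steps. The table name `main.default.events` is hypothetical, and the manifest mode shown (`symlink_format_manifest`) is the one commonly used for external readers; adjust it if your integration differs.

```sql
-- Hypothetical table name, for illustration only.

-- Step 1: rewrite data files to purge soft-deleted rows, so no deletion vectors remain.
-- Make sure no concurrent write operations are running on the table.
REORG TABLE main.default.events APPLY (PURGE);

-- Step 2: generate the manifest for the purged table.
GENERATE symlink_format_manifest FOR TABLE main.default.events;
```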