Photon runtime

Photon is the native vectorized query engine on Databricks, written to be directly compatible with Apache Spark APIs so it works with your existing code. It is developed in C++ to take advantage of modern hardware, and it uses vectorized query processing techniques to exploit data-level and instruction-level parallelism in CPUs, improving performance on real-world data and applications, all natively on your data lake. Photon is part of a high-performance runtime that runs your existing SQL and DataFrame API calls faster and reduces your total cost per workload. Photon is used by default in Databricks SQL warehouses.

Databricks clusters

To access Photon on Databricks clusters, you must explicitly select a runtime that contains Photon when you create the cluster, using either the UI or the APIs (Clusters API 2.0 and Jobs API 2.1). With the APIs, specify spark_version using the syntax <databricks-runtime-version>-photon-scala2.12. Photon is available for clusters running Databricks Runtime 9.1 LTS and above.
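As a rough illustration of the spark_version syntax, the following Python sketch builds the version string and a minimal cluster specification of the kind that could be sent to the Clusters API. The helper names, the example values ("photon-demo", "9.1.x", "n2-highmem-8"), and the choice of fields (cluster_name, node_type_id, num_workers) are assumptions made for illustration; check the field names and the exact runtime version string against the API reference for your workspace.

```python
import json

# Suffix taken from the syntax <databricks-runtime-version>-photon-scala2.12.
PHOTON_SUFFIX = "-photon-scala2.12"


def photon_spark_version(runtime_version: str) -> str:
    """Build a Photon spark_version string from a runtime version (hypothetical helper)."""
    runtime_version = runtime_version.strip()
    if not runtime_version:
        raise ValueError("runtime_version must not be empty")
    return f"{runtime_version}{PHOTON_SUFFIX}"


def photon_cluster_spec(name: str, runtime_version: str,
                        node_type_id: str, num_workers: int) -> dict:
    """Assemble a minimal cluster specification (field set is an assumption)."""
    return {
        "cluster_name": name,
        "spark_version": photon_spark_version(runtime_version),
        "node_type_id": node_type_id,
        "num_workers": num_workers,
    }


if __name__ == "__main__":
    # "9.1.x" is an assumed spelling of the runtime version; verify it before use.
    spec = photon_cluster_spec("photon-demo", "9.1.x", "n2-highmem-8", 2)
    print(json.dumps(spec, indent=2))
```

The sketch only constructs the request body. Sending it, including authentication and the workspace URL, is left out on purpose.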

Photon supports a number of instance types on the driver and worker nodes. Photon instance types consume DBUs at a different rate than the same instance type running the non-Photon runtime.

The supported Google Cloud instance types are n2-highmem-4, n2-highmem-8, and n2-highmem-16.
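A small pre-flight check can catch an unsupported instance type before a cluster request is submitted. The sketch below hard-codes the three instance types listed above; the function name and the idea of validating on the client side are assumptions for illustration, and the set would need updating whenever the supported list changes.

```python
# Instance types copied from the list above; update if the supported list changes.
SUPPORTED_PHOTON_INSTANCE_TYPES = frozenset(
    {"n2-highmem-4", "n2-highmem-8", "n2-highmem-16"}
)


def unsupported_node_types(driver_type: str, worker_type: str) -> list:
    """Return the requested node types that are not in the supported set, sorted."""
    requested = {driver_type, worker_type}
    return sorted(requested - SUPPORTED_PHOTON_INSTANCE_TYPES)


if __name__ == "__main__":
    problems = unsupported_node_types("n2-highmem-4", "n2-standard-4")
    if problems:
        print("Not in the Photon instance type list:", ", ".join(problems))
```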

For more information about Photon instances and DBU consumption, see the Databricks pricing page.

Photon coverage

Operators

  • Scan, Filter, Project

  • Hash Aggregate/Join/Shuffle

  • Nested-Loop Join

  • Null-Aware Anti Join

  • Union, Expand, ScalarSubquery

  • Delta/Parquet Write Sink

  • Sort

  • Window Function

Expressions

  • Comparison / Logic

  • Arithmetic / Math (most)

  • Conditional (IF, CASE, etc.)

  • String (common ones)

  • Casts

  • Aggregates (most common ones)

  • Date/Timestamp

Data types

  • Byte/Short/Int/Long

  • Boolean

  • String/Binary

  • Decimal

  • Float/Double

  • Date/Timestamp

  • Struct

  • Array

  • Map
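To see how the data type list might be used, the sketch below flags columns whose type is not in that list. It is a hypothetical helper, not a Databricks API: it assumes type names are spelled as in the list above (lowercased), it only inspects the outermost type of a parameterized or nested type, and a type that is absent from the list is not necessarily unsupported.

```python
import re

# Type names transcribed from the "Data types" list above, lowercased.
PHOTON_LISTED_TYPES = frozenset({
    "byte", "short", "int", "long",
    "boolean",
    "string", "binary",
    "decimal",
    "float", "double",
    "date", "timestamp",
    "struct", "array", "map",
})


def base_type(type_name: str) -> str:
    """Strip parameters: "decimal(10,2)" -> "decimal", "array<int>" -> "array"."""
    return re.split(r"[(<]", type_name.strip().lower(), maxsplit=1)[0].strip()


def types_not_listed(column_types: dict) -> dict:
    """Return the columns whose outermost type is not in the listed set."""
    return {
        column: type_name
        for column, type_name in column_types.items()
        if base_type(type_name) not in PHOTON_LISTED_TYPES
    }


if __name__ == "__main__":
    schema = {"id": "long", "price": "decimal(10,2)", "tags": "array<string>"}
    print(types_not_listed(schema))
```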

Photon advantages

  • Supports SQL and equivalent DataFrame operations against Delta and Parquet tables.

  • Expected to accelerate queries that process a significant amount of data (100 GB or more) and include aggregations and joins.

  • Faster performance when data is accessed repeatedly from the Delta cache.

  • More robust scan performance on tables with many columns and many small files.

  • Faster Delta and Parquet writing using UPDATE, DELETE, MERGE INTO, INSERT, and CREATE TABLE AS SELECT, especially for wide tables (hundreds to thousands of columns).

  • Replaces sort-merge joins with hash joins.

Limitations

  • Does not support Spark Structured Streaming.

  • Does not support UDFs.

  • Does not support RDD APIs.

  • Not expected to improve short-running queries (under 2 seconds), such as queries against small amounts of data.

Features not supported by Photon run the same way they would with Databricks Runtime; there is no performance advantage for those features.
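The advantages and limitations above can be condensed into a rough checklist. The following sketch is a hypothetical helper, not part of any Databricks API: the Workload fields and the function name are invented for illustration, and the 100 GB and 2 second thresholds are taken directly from the lists above rather than from any measurement.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    """Hypothetical description of a workload, for illustration only."""
    data_size_gb: float
    expected_runtime_seconds: float
    has_aggregations_or_joins: bool = False
    uses_structured_streaming: bool = False
    uses_udfs: bool = False
    uses_rdd_api: bool = False


def photon_notes(workload: Workload) -> List[str]:
    """Return notes on how the listed advantages and limitations apply."""
    notes = []
    if workload.uses_structured_streaming:
        notes.append("Structured Streaming is not supported by Photon.")
    if workload.uses_udfs:
        notes.append("UDFs are not supported by Photon.")
    if workload.uses_rdd_api:
        notes.append("RDD APIs are not supported by Photon.")
    if workload.expected_runtime_seconds < 2:
        notes.append("Short-running queries are not expected to improve.")
    if workload.data_size_gb >= 100 and workload.has_aggregations_or_joins:
        notes.append("Large aggregation or join workload: expected to be accelerated.")
    return notes


if __name__ == "__main__":
    batch = Workload(data_size_gb=500, expected_runtime_seconds=600,
                     has_aggregations_or_joins=True)
    for note in photon_notes(batch):
        print(note)
```

Unsupported features still run, as they would on Databricks Runtime, so the notes indicate where no speedup is expected rather than where a workload would fail.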