Pandas API on Spark
Note
This feature is available on clusters that run Databricks Runtime 10.0 (EoS) and above. For clusters that run Databricks Runtime 9.1 LTS and below, use Koalas instead.
Commonly used by data scientists, pandas is a Python package that provides easy-to-use data structures and data analysis tools for the Python programming language. However, pandas does not scale out to big data. Pandas API on Spark fills this gap by providing pandas-equivalent APIs that work on Apache Spark. Pandas API on Spark is useful not only to pandas users but also to PySpark users, because it supports many tasks that are difficult to do with PySpark, such as plotting data directly from a PySpark DataFrame.
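Because the API mirrors pandas, much existing pandas code can run unchanged at scale. As a minimal sketch (shown here with plain pandas for illustration, and with hypothetical example data), the same code would work on Spark after swapping the import for `import pyspark.pandas as ps` and using `ps.DataFrame`:

```python
import pandas as pd

# Hypothetical example data; with pandas API on Spark, the identical
# code runs distributed by importing pyspark.pandas as ps instead.
df = pd.DataFrame({"city": ["NYC", "SF", "NYC"], "sales": [10, 20, 30]})

# Familiar pandas operations: group by a column and sum another
totals = df.groupby("city")["sales"].sum()
print(totals.to_dict())  # {'NYC': 40, 'SF': 20}
```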
Requirements
Pandas API on Spark is available beginning in Apache Spark 3.2 (which is included beginning in Databricks Runtime 10.0 (EoS)) by using the following import statement:
import pyspark.pandas as ps