Apache Spark API reference
Databricks is built on top of Apache Spark, a unified analytics engine for big data and machine learning. For more information, see Apache Spark - What is Spark on the Databricks website.
Apache Spark has easy-to-use APIs for operating on large datasets. These include a collection of over 100 operators for transforming data and familiar data frame APIs for manipulating semi-structured data. These APIs include:
PySpark APIs for Python developers. See Getting Started with PySpark. Key classes include:
SparkSession - The entry point to programming Spark with the Dataset and DataFrame API. See Spark Session APIs and Starting Point: SparkSession.
DataFrame - A distributed collection of data grouped into named columns. See Datasets and DataFrames, Creating DataFrames, DataFrame APIs, and DataFrame Functions.
SparkR APIs for R developers. See the SparkR (R on Spark) Developer Guide. Key classes include:
SparkSession - The entry point into SparkR. See Starting Point: SparkSession.
SparkDataFrame - A distributed collection of data grouped into named columns. See Datasets and DataFrames, Creating DataFrames, and Creating SparkDataFrames.
Scala APIs. Key classes include:
SparkSession - The entry point to programming Spark with the Dataset and DataFrame API. See Starting Point: SparkSession.
Dataset - A strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. See Datasets and DataFrames, Creating Datasets, Creating DataFrames, and DataFrame functions.
Java APIs. Key classes include:
SparkSession - The entry point to programming Spark with the Dataset and DataFrame API. See Starting Point: SparkSession.
Dataset - A strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. See Datasets and DataFrames, Creating Datasets, Creating DataFrames, and DataFrame functions.
To learn how to use the Apache Spark APIs on Databricks, see the guides linked above. For Java, you can run Java code as a JAR job.