The Apache Spark DataFrame API provides a rich set of functions (select columns, filter, join, aggregate, and so on) that allow you to solve common data analysis problems efficiently. DataFrames also allow you to intermix operations seamlessly with custom Python, SQL, R, and Scala code. This tutorial module shows how to load sample data, view a DataFrame, run SQL queries, and visualize the DataFrame.
We also provide a sample notebook that you can import to access and run all of the code examples included in the module.
The easiest way to start working with DataFrames is to use an example Databricks dataset available in the /databricks-datasets folder accessible within the Databricks workspace. To access the file that compares city population versus median sale prices of homes, load the file /databricks-datasets/samples/population-vs-price/data_geo.csv.
Because the sample notebook is a SQL notebook, the next few commands use the %python magic command.
    %python
    # Use the Spark CSV datasource with options specifying:
    # - First line of file is a header
    # - Automatically infer the schema of the data
    data = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv", header="true", inferSchema="true")
    data.cache()  # Cache data for faster reuse
    data = data.dropna()  # Drop rows with missing values
Now that you have created the data DataFrame, you can quickly access the data using standard Spark commands such as take(). For example, you can use the command data.take(10) to view the first ten rows of the data DataFrame.
To view this data in a tabular format, you can use the Databricks display() command instead of exporting the data to a third-party tool.
Before you can issue SQL queries, you must save your data DataFrame as a table or temporary view:
    %python
    # Register the DataFrame as a temporary view so it is accessible via SQL
    data.createOrReplaceTempView("data_geo")
Then, in a new cell, specify a SQL query to list the 2015 median sales price by state:
select `State Code`, `2015 median sales price` from data_geo
Or, query for the population estimates of cities in the state of Washington:
select City, `2014 Population estimate` from data_geo where `State Code` = 'WA';
An additional benefit of using the Databricks display() command is that you can quickly view this data with a number of embedded visualizations. Click the down arrow next to the chart button to display a list of visualization types.
Then, select the Map icon to create a map visualization of the sales price SQL query from the previous section.
To run these code examples, visualizations, and more, import the following notebook. For more DataFrame examples, see DataFrames and Datasets.