Get started with Apache Spark

This tutorial module helps you get started quickly with Apache Spark. We discuss key concepts briefly, so you can get right down to writing your first Apache Spark application. In the other tutorial modules in this guide, you will have the opportunity to go deeper into the topics of your choice.

In this tutorial module, you will learn about the key Spark interfaces and how to write your first Apache Spark application.

We also provide sample notebooks that you can import to access and run all of the code examples included in the module.

Spark interfaces

There are three key Apache Spark interfaces that you should know about: Resilient Distributed Dataset, DataFrame, and Dataset.

  • Resilient Distributed Dataset: The first Apache Spark abstraction was the Resilient Distributed Dataset (RDD). It is an interface to a sequence of data objects that consist of one or more types that are located across a collection of machines (a cluster). RDDs can be created in a variety of ways and are the “lowest level” API available. While this is the original data structure for Apache Spark, you should focus on the DataFrame API, which is a superset of the RDD functionality. The RDD API is available in the Java, Python, and Scala languages.
  • DataFrame: These are similar in concept to the DataFrame you may be familiar with in the pandas Python library and the R language. The DataFrame API is available in the Java, Python, R, and Scala languages.
  • Dataset: A combination of DataFrame and RDD. It provides the typed interface that is available in RDDs while providing the convenience of the DataFrame. The Dataset API is available in the Java and Scala languages.

In many scenarios, especially with the performance optimizations embedded in DataFrames and Datasets, it will not be necessary to work with RDDs. But it is important to understand the RDD abstraction because:

  • The RDD is the underlying infrastructure that allows Spark to run so fast and provide data lineage.
  • If you are diving into more advanced components of Spark, it may be necessary to use RDDs.
  • The visualizations within the Spark UI reference RDDs.

When you develop Spark applications, you typically use DataFrames and Datasets.
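As a quick illustration of how the DataFrame and RDD interfaces relate, here is a minimal sketch in Python. It assumes an environment where spark is an existing SparkSession (as in a Databricks notebook); the column names and sample rows are made up for illustration:

# Create a small DataFrame from in-memory sample data
people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# Work with the higher-level DataFrame API
people.filter(people.age > 40).show()

# Drop down to the underlying RDD when you need the lower-level API
people.rdd.map(lambda row: row.name).collect()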

Write your first Apache Spark application

To write your first Apache Spark application, you add code to the cells of a Databricks notebook. This example uses Python. For more information, you can also reference the Apache Spark Quick Start Guide.

This first command lists the contents of a folder in the Databricks File System:

# Take a look at the file system
display(dbutils.fs.ls("/databricks-datasets/samples/docs/"))
The output shows the folder contents.

The next command uses spark, the SparkSession available in every notebook, to read the README.md text file and create a DataFrame named textFile:

textFile = spark.read.text("/databricks-datasets/samples/docs/README.md")

To count the lines of the text file, apply the count action to the DataFrame:

textFile.count()
The output shows the line count of the text file.

One thing you may notice is that the second command, reading the text file, does not generate any output, while the third command, performing the count, does. The reason is that reading the file is a transformation, while counting is an action. Transformations are lazy and run only when an action requires their results. This lets Spark optimize the execution plan (for example, push a filter ahead of a join) instead of running commands serially. For a complete list of transformations and actions, refer to the Apache Spark Programming Guide: Transformations and Actions.
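To see this laziness for yourself, you can chain a transformation and an action on the textFile DataFrame created above. The following sketch is illustrative; filtering on the word "Spark" is just an example:

from pyspark.sql.functions import col

# Transformation: builds up the query plan but does not read any data yet
linesWithSpark = textFile.filter(col("value").contains("Spark"))

# Action: triggers execution of the whole plan (read, filter, count)
linesWithSpark.count()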

Databricks datasets

Databricks includes a variety of datasets within the workspace that you can use to learn Spark or test out algorithms. You’ll see these throughout the getting started guide. The datasets are available in the /databricks-datasets folder.
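For example, you can browse the top-level datasets folder from a notebook using the same display and dbutils.fs.ls calls shown earlier:

# List the datasets that ship with the workspace
display(dbutils.fs.ls("/databricks-datasets"))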

Notebooks

To access these code examples and more, import one of the following notebooks.

Apache Spark Quick Start Python notebook

Apache Spark Quick Start Scala notebook