Tutorial: Query data with notebooks

This tutorial walks you through using the Databricks Data Science & Engineering workspace to create a cluster and a notebook, create a table from a dataset, query the table, and display the query results.

Tip

As a supplement to this article, try the Quickstart Tutorial, available on your Databricks Data Science & Engineering landing page. It is a 5-minute hands-on introduction to Databricks. When you log in to Databricks, look for Guide: Quickstart tutorial on the home page and click Start Tutorial.

If you don’t see the tutorial, select Data Science & Engineering from the persona switcher in the sidebar.

You can also use the Databricks Terraform provider to create this article’s resources. See Create clusters, notebooks, and jobs with Terraform.

Requirements

You are logged in to Databricks, and you’re in the Data Science & Engineering workspace.

Data Science & Engineering UI

Landing page

From the left sidebar and the Common Tasks list on the landing page, you can access the fundamental Databricks Data Science & Engineering entities: the workspace, clusters, tables, notebooks, jobs, and libraries. The workspace is the special root folder that stores your Databricks assets, such as notebooks and libraries, and the data that you import.

Use the sidebar

You can access all of your Databricks assets using the sidebar. The sidebar’s contents depend on the selected persona: Data Science & Engineering, Machine Learning, or SQL.

  • By default, the sidebar appears in a collapsed state and only the icons are visible. Move your cursor over the sidebar to expand to the full view.

  • To change the persona, click the icon below the Databricks logo, and select a persona.

  • To pin a persona so that it appears the next time you log in, click the pin icon next to the persona. Click it again to remove the pin.

  • Use Menu options at the bottom of the sidebar to set the sidebar mode to Auto (default behavior), Expand, or Collapse.

  • When you open a machine learning-related page, the persona automatically switches to Machine Learning.

Get help

To get help, click Help in the lower-left corner.

Step 1: Create a cluster

A cluster is a collection of Databricks computation resources. To create a cluster:

  1. In the sidebar, click Compute.

  2. On the Compute page, click Create Compute.

  3. On the New Compute page, select 11.3 LTS ML (Scala 2.12, Spark 3.3.0) from the Databricks Runtime version drop-down.

  4. Click Create Cluster.

Step 2: Create a notebook

A notebook is a collection of cells that run computations on an Apache Spark cluster. To create a notebook in the workspace:

  1. In the sidebar, click Workspace.

  2. In the Workspace folder, click the down caret and select Create > Notebook.

  3. In the Create Notebook dialog, enter a name and select SQL in the Language drop-down. This selection determines the default language of the notebook.

  4. Click Create. The notebook opens with an empty cell at the top.

  5. Attach the notebook to the cluster you created. Click the cluster selector in the notebook toolbar and select your cluster from the drop-down menu. If you don’t see your cluster, click More… and select the cluster from the drop-down menu in the dialog.

Step 3: Create a table

Create a table using data from a sample CSV data file available in Sample datasets, a collection of datasets mounted to the Databricks File System (DBFS), a distributed file system installed on Databricks clusters. You have two options for creating the table.

Option 1: Create a Spark table from the CSV data

Use this option if you want to get going quickly, and you only need standard levels of performance. Copy and paste this code snippet into a notebook cell:

DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds USING CSV OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
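
Because this statement sets only the header option, Spark takes the column names from the first line of the file and, with schema inference off by default, reads every column as a string. The following plain-Python sketch mimics that behavior with the standard csv module rather than Spark. The three sample rows and the read_with_header helper are made up for illustration and are not taken from the real diamonds file.

```python
import csv
import io

# Hypothetical sample in the shape of a CSV file with a header line.
SAMPLE_CSV = """carat,color,price
0.23,E,326
0.21,E,326
0.29,I,334
"""

def read_with_header(text):
    """Use the first line as column names; keep every value as a string."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_with_header(SAMPLE_CSV)
columns = list(rows[0].keys())
print(columns)   # column names come from the header line
print(rows[0])   # values stay strings because no schema is inferred
```

Option 2 sets inferSchema in addition to header, which is why numeric columns in that table get numeric types.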

Option 2: Write the CSV data to Delta Lake format and create a Delta table

Delta Lake offers a powerful transactional storage layer that enables fast reads and other benefits. Delta Lake format consists of Parquet files plus a transaction log. Use this option to get the best performance on future operations on the table.

  1. Read the CSV data into a DataFrame and write it out in Delta Lake format. This command uses a Python language magic command (%python), which lets you interleave commands in languages other than the notebook’s default language (SQL). Copy and paste this code snippet into a notebook cell:

    %python
    
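    # Read the sample CSV into a DataFrame, using the first line as column
    # names and letting Spark infer each column's data type.
    diamonds = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")
    )
    
    # Write the DataFrame to the target path in Delta Lake format.
    diamonds.write.format("delta").save("/mnt/delta/diamonds")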
    
  2. Create a Delta table at the stored location. Copy and paste this code snippet into a notebook cell:

    DROP TABLE IF EXISTS diamonds;
    
    CREATE TABLE diamonds USING DELTA LOCATION '/mnt/delta/diamonds/'
    

Run cells by pressing SHIFT + ENTER. The notebook automatically attaches to the cluster you created in Step 1 and runs the command in the cell.

Step 4: Query the table

Run a SQL statement to query the table for the average diamond price by color.

  1. To add a cell to the notebook, hover over the bottom of the cell and click the Add Cell icon.

  2. Copy this snippet and paste it in the cell.

    SELECT color, avg(price) AS price FROM diamonds GROUP BY color ORDER BY color
    
  3. Press SHIFT + ENTER. The notebook displays a table of diamond color and average price.
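
To see what the query computes without a cluster, here is a minimal sketch that runs the same statement against an in-memory SQLite table. SQLite stands in for Spark SQL here, and the rows and the average_price_by_color helper are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical (color, price) rows; not taken from the real dataset.
SAMPLE_ROWS = [
    ("E", 300.0),
    ("E", 500.0),
    ("D", 350.0),
    ("J", 200.0),
    ("J", 700.0),
]

def average_price_by_color(rows):
    """Load rows into an in-memory SQLite table and run the tutorial's query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE diamonds (color TEXT, price REAL)")
        conn.executemany("INSERT INTO diamonds VALUES (?, ?)", rows)
        query = (
            "SELECT color, avg(price) AS price "
            "FROM diamonds GROUP BY color ORDER BY color"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()

result = average_price_by_color(SAMPLE_ROWS)
for color, price in result:
    print(color, price)
```

GROUP BY collapses the rows to one per color, avg(price) averages the prices within each group, and ORDER BY sorts the output alphabetically by color.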

Step 5: Display the data

Display a chart of the average diamond price by color.

  1. Click the bar chart icon.

  2. Click Plot Options.

    • Drag color into the Keys box.

    • Drag price into the Values box.

    • In the Aggregation drop-down, select AVG.

  3. Click Apply to display the bar chart.
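
The Plot Options map onto a simple computation: Keys chooses the categories, Values chooses the column being measured, and Aggregation decides how values that share a key are combined. The sketch below illustrates that mapping in plain Python with a text bar chart. It is not how the notebook renders charts, and the sample rows and both helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical (color, price) rows for illustration only.
SAMPLE_ROWS = [("E", 300.0), ("E", 500.0), ("D", 350.0), ("J", 200.0), ("J", 700.0)]

def average_by_key(rows):
    """Group values by key and average them (the AVG aggregation)."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

def text_bar_chart(averages, width=30):
    """Draw one text bar per key, scaled so the largest average is `width` wide."""
    largest = max(averages.values())
    return [
        f"{key} | {'#' * round(averages[key] / largest * width)} {averages[key]:.1f}"
        for key in sorted(averages)
    ]

averages = average_by_key(SAMPLE_ROWS)
chart = text_bar_chart(averages)
print("\n".join(chart))
```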
