Create your first workflow with a Databricks job

This article demonstrates a Databricks job that orchestrates tasks to read and process a sample dataset. In this quickstart, you:

  1. Create a new notebook and add code to retrieve a sample dataset containing popular baby names by year.

  2. Save the sample dataset to Unity Catalog.

  3. Create a new notebook and add code to read the dataset from Unity Catalog, filter it by year, and display the results.

  4. Create a new job and configure two tasks using the notebooks.

  5. Run the job and view the results.

Requirements

You must have a volume in Unity Catalog. This article uses a volume named my-volume in a schema named default within a catalog named main. Also, you must have the following permissions in Unity Catalog:

  • READ VOLUME and WRITE VOLUME, or ALL PRIVILEGES, for the my-volume volume.

  • USE SCHEMA or ALL PRIVILEGES for the default schema.

  • USE CATALOG or ALL PRIVILEGES for the main catalog.

To set these permissions, contact your Databricks administrator or see Unity Catalog privileges and securable objects.
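
If an administrator prefers to grant these privileges with SQL, the statements might look like the following sketch. The principal `someone@example.com` and the helper `build_grant_statements` are hypothetical, and the helper only builds the statement text; confirm the privilege names against your workspace before running anything.

```python
from typing import List


def build_grant_statements(principal: str) -> List[str]:
    """Return GRANT statements for the privileges listed in the requirements."""
    # Backticks quote identifiers that contain characters such as hyphens or @.
    return [
        f"GRANT USE CATALOG ON CATALOG main TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA main.default TO `{principal}`",
        f"GRANT READ VOLUME, WRITE VOLUME ON VOLUME main.default.`my-volume` TO `{principal}`",
    ]


# Hypothetical principal, used for illustration only.
statements = build_grant_statements("someone@example.com")
for statement in statements:
    print(statement)
    # In a Databricks notebook, each statement could be run with spark.sql(statement).
```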

Create the notebooks

Retrieve and save data

To create a notebook to retrieve the sample dataset and save it to Unity Catalog:

  1. Go to your Databricks landing page, click New in the sidebar, and select Notebook. Databricks creates and opens a new, blank notebook in your default folder. The default language is the language you most recently used, and the notebook is automatically attached to the compute resource that you most recently used.

  2. If necessary, change the default language to Python.

  3. Copy the following Python code and paste it into the first cell of the notebook.

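    import requests

    # Download the sample CSV of baby names.
    response = requests.get('https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv')
    response.raise_for_status()  # Stop here if the download did not succeed.
    csvfile = response.content.decode('utf-8')

    # Write the CSV text to the Unity Catalog volume; True overwrites an existing file.
    dbutils.fs.put("/Volumes/main/default/my-volume/babynames.csv", csvfile, True)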
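
Before writing the file, it can help to check that the downloaded text looks like the expected CSV. The following is a minimal sketch that runs outside Databricks. The helper `summarize_csv` and the sample rows are made up for illustration, and only the `Year` column name comes from this article.

```python
import csv
import io


def summarize_csv(text: str) -> tuple:
    """Return (header, data_row_count) for a block of CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])  # First row holds the column names.
    row_count = sum(1 for _ in reader)  # Remaining rows are data.
    return header, row_count


# Tiny made-up sample standing in for the downloaded file.
sample = "Year,First Name,Count\n2014,AVA,10\n2015,LIAM,12\n"
header, row_count = summarize_csv(sample)
if "Year" not in header:
    raise ValueError("expected a Year column in the CSV header")
print(header, row_count)
```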
    

Read and display filtered data

To create a notebook that reads the dataset, filters it by year, and displays the results:

  1. Go to your Databricks landing page, click New in the sidebar, and select Notebook. Databricks creates and opens a new, blank notebook in your default folder. The default language is the language you most recently used, and the notebook is automatically attached to the compute resource that you most recently used.

  2. If necessary, change the default language to Python.

  3. Copy the following Python code and paste it into the first cell of the notebook.

    babynames = (
        spark.read.format("csv")
        .option("header", "true").option("inferSchema", "true")
        .load("/Volumes/main/default/my-volume/babynames.csv")
    )
    babynames.createOrReplaceTempView("babynames_table")
    # Build a sorted list of the distinct years to use as widget choices.
    years = spark.sql("select distinct(Year) from babynames_table").toPandas()['Year'].tolist()
    years.sort()
    # The year widget defaults to 2014; a job parameter named year overrides it.
    dbutils.widgets.dropdown("year", "2014", [str(x) for x in years])
    display(babynames.filter(babynames.Year == dbutils.widgets.get("year")))
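
The widget logic in this notebook can be pictured with plain Python: collect the distinct years as sorted strings for the dropdown, then keep only the rows that match the selected year. This is a local sketch rather than the notebook code. The helpers `year_choices` and `filter_by_year` and the sample rows are hypothetical, and the explicit `int` conversion is there because widget values are strings.

```python
import csv
import io


def year_choices(rows: list) -> list:
    """Return the distinct years as sorted strings, like the dropdown choices."""
    return [str(year) for year in sorted({int(row["Year"]) for row in rows})]


def filter_by_year(rows: list, year: str) -> list:
    """Return the rows whose Year matches the selected widget value."""
    wanted = int(year)  # Widget values arrive as strings.
    return [row for row in rows if int(row["Year"]) == wanted]


# Made-up rows standing in for babynames.csv.
sample = "Year,First Name,Count\n2015,LIAM,12\n2014,AVA,10\n2014,MIA,7\n"
rows = list(csv.DictReader(io.StringIO(sample)))

choices = year_choices(rows)
matches = filter_by_year(rows, "2014")
print(choices)
print([row["First Name"] for row in matches])
```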
    

Create a job

  1. Click Workflows in the sidebar.

  2. Click Create Job.

    The Tasks tab displays with the create task dialog.

  3. Replace Add a name for your job… with your job name.

  4. In the Task name field, enter a name for the task; for example, retrieve-baby-names.

  5. In the Type drop-down menu, select Notebook.

  6. Use the file browser to find the first notebook you created, click the notebook name, and click Confirm.

  7. Click Create task.

  8. Click Add Task below the task you just created to add another task.

  9. In the Task name field, enter a name for the task; for example, filter-baby-names.

  10. In the Type drop-down menu, select Notebook.

  11. Use the file browser to find the second notebook you created, click the notebook name, and click Confirm.

  12. Click Add under Parameters. In the Key field, enter year. In the Value field, enter 2014.

  13. Click Create task.
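
The job configured in the steps above could also be described as a JSON document, which is roughly what a request to create a job through the Jobs REST API carries. The sketch below only builds and prints that document. The job name and notebook paths are hypothetical, compute settings are omitted, and the `depends_on` entry assumes the second task should wait for the first, so check the field names against the Jobs API reference before relying on them.

```python
import json


def build_job_settings(job_name: str, first_notebook: str,
                       second_notebook: str, year: str = "2014") -> dict:
    """Return job settings with the two notebook tasks from this article."""
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "retrieve-baby-names",
                "notebook_task": {"notebook_path": first_notebook},
            },
            {
                "task_key": "filter-baby-names",
                # Assumption: run the filter task only after the retrieve task.
                "depends_on": [{"task_key": "retrieve-baby-names"}],
                "notebook_task": {
                    "notebook_path": second_notebook,
                    # Matches the year parameter entered in the Parameters step.
                    "base_parameters": {"year": year},
                },
            },
        ],
    }


# Hypothetical job name and notebook paths, for illustration only.
settings = build_job_settings(
    "baby-names-quickstart",
    "/Users/someone@example.com/retrieve-baby-names",
    "/Users/someone@example.com/filter-baby-names",
)
print(json.dumps(settings, indent=2))
```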

Run the job

To run the job immediately, click Run Now in the upper-right corner. You can also run the job by clicking the Runs tab and clicking Run Now in the Active Runs table.

View run details

  1. Click the Runs tab and click the link for the run in the Active Runs table or in the Completed Runs (past 60 days) table.

  2. Click either task to see the output and details. For example, click the filter-baby-names task to view the output and run details for the filter task.

Run with different parameters

To re-run the job and filter baby names for a different year:

  1. Click the down caret next to Run Now and select Run Now with Different Parameters, or click Run Now with Different Parameters in the Active Runs table.

  2. In the Value field, enter 2015.

  3. Click Run.
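
The same re-run can, in principle, be triggered programmatically. The following sketch builds a run-now request for the Jobs REST API without sending it. The workspace URL, token placeholder, and job ID are hypothetical, and the endpoint path and `notebook_params` field reflect one reading of the Jobs API 2.1, so verify them against the API reference for your workspace.

```python
import json
import urllib.request


def build_run_now_request(host: str, token: str, job_id: int,
                          year: str) -> urllib.request.Request:
    """Build (but do not send) a request that re-runs a job with a new year."""
    body = json.dumps({
        "job_id": job_id,
        # Overrides the year parameter for this run only.
        "notebook_params": {"year": year},
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/run-now",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical workspace URL, token placeholder, and job ID.
request = build_run_now_request(
    "https://example.cloud.databricks.com", "<personal-access-token>", 123, "2015"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send the request: urllib.request.urlopen(request)
```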