Overview of orchestration on Databricks

Databricks provides a built-in experience for orchestrating data processing workloads, so that you can coordinate and run multiple tasks as part of a larger workflow. You can streamline, optimize, and schedule the execution of frequent, repeatable tasks, which makes complex workflows easier to manage.

This article introduces concepts and choices related to managing production workloads using Databricks jobs.

What are jobs?

In Databricks, a job is used to schedule and orchestrate tasks in a workflow. Common data processing workflows include ETL workflows, running notebooks, and machine learning (ML) workflows, as well as integrating with external systems like dbt.

Jobs consist of one or more tasks and support custom control flow logic, such as branching (if/else statements) and looping (for each statements), using a visual authoring UI. Tasks can load or transform data in an ETL workflow, or build, train, and deploy ML models in a controlled and repeatable way as part of your machine learning pipelines.

Example: Daily data processing and validation job

The example below shows a job in the Databricks UI with four tasks and a trigger that runs it daily.

This example job has the following characteristics:

  1. The first task ingests revenue data.

  2. The second task is an if/else check for nulls in the ingested data.

  3. If no nulls are found, a transformation task runs.

  4. Otherwise, a notebook task runs a data quality validation.

  5. The job is scheduled to run every day at 11:29 AM.
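
A job like this can also be created programmatically. The following is a minimal sketch that uses the Databricks SDK for Python (the databricks-sdk package) to create a two-task job with a daily schedule; the job name, task keys, and notebook paths are illustrative, the if/else branch from the example is omitted for brevity, and compute settings are left out on the assumption that the workspace provides default (for example, serverless) job compute.

    # Sketch: create a simple scheduled job with the Databricks SDK for Python.
    # Names, paths, and the schedule are illustrative assumptions.
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import jobs

    w = WorkspaceClient()  # reads credentials from the environment or a config profile

    created = w.jobs.create(
        name="daily-revenue-processing",
        tasks=[
            jobs.Task(
                task_key="ingest_revenue",
                notebook_task=jobs.NotebookTask(notebook_path="/Workspace/pipelines/ingest"),
            ),
            jobs.Task(
                task_key="transform_revenue",
                depends_on=[jobs.TaskDependency(task_key="ingest_revenue")],
                notebook_task=jobs.NotebookTask(notebook_path="/Workspace/pipelines/transform"),
            ),
        ],
        # Quartz cron expression for 11:29 AM every day, matching the example above.
        schedule=jobs.CronSchedule(quartz_cron_expression="0 29 11 * * ?", timezone_id="UTC"),
    )
    print(f"Created job {created.job_id}")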

To get a quick introduction to creating your own job, see Create your first workflow with a Databricks job.

Common use cases

From foundational data engineering principles to advanced machine learning and seamless tool integration, these common use cases showcase the breadth of capabilities that drive modern analytics, workflow automation, and infrastructure scalability.

Data engineering

ETL (Extract, Transform, Load) pipelines: Automate the extraction of data from various sources, transform the data into a suitable format, and load it into a data warehouse or data lake. See Run your first ETL workload on Databricks.

Data migration: Move data from one system to another.

Continuous data processing: Use Jobs for continuous data processing tasks, such as streaming data from sources like Kafka and writing it to Delta tables.
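
A continuous job of this kind typically runs a Structured Streaming query. The sketch below reads from a Kafka topic and appends to a Delta table; the broker address, topic, checkpoint path, and table name are illustrative assumptions.

    # Sketch: stream events from Kafka into a Delta table with Structured Streaming.
    # Broker, topic, checkpoint location, and table name are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker-1:9092")
        .option("subscribe", "revenue_events")
        .load()
        .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))
    )

    query = (
        events.writeStream
        .option("checkpointLocation", "/Volumes/main/default/checkpoints/revenue_events")
        .outputMode("append")
        .toTable("main.default.revenue_events_raw")
    )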

Data science and machine learning

Model training: Schedule and run machine learning model training jobs to ensure that models are trained on the latest data.

Batch inference: Automate the process of running batch inference jobs to generate predictions from trained models.
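
For example, a batch inference task might load a registered model as a Spark UDF and score a feature table. The sketch below shows one way to do this with MLflow; the model URI and table names are illustrative assumptions.

    # Sketch: score a feature table with a registered MLflow model in batch.
    # Model URI and table names are illustrative.
    import mlflow
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import struct

    spark = SparkSession.builder.getOrCreate()

    predict = mlflow.pyfunc.spark_udf(spark, model_uri="models:/revenue_forecaster/1")

    features = spark.table("main.default.revenue_features")
    scored = features.withColumn("prediction", predict(struct(*features.columns)))
    scored.write.mode("overwrite").saveAsTable("main.default.revenue_predictions")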

Hyperparameter tuning: Orchestrate hyperparameter tuning jobs to optimize machine learning models.

Analytics and reporting

Scheduled queries: Run SQL queries in a job on a schedule to generate reports or update dashboards.

Data aggregation: Perform regular data aggregation tasks to prepare data for analysis.

Automating tasks

Multi-task workflows: Create complex workflows that involve multiple tasks, such as running a series of notebooks, JAR files, SQL queries, or Delta Live Tables pipelines.

Conditional logic: Use conditional logic to control the flow of tasks based on the success or failure of previous tasks.
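
As a rough illustration of conditional logic, the following Jobs API 2.1-style task list (expressed as Python dictionaries) uses a condition task whose true and false outcomes determine which downstream task runs. The task keys, notebook paths, and the task value reference are illustrative assumptions.

    # Illustrative task list: a condition task gates two downstream tasks.
    # Task keys, paths, and the "{{tasks...}}" value reference are made up for this sketch.
    tasks = [
        {
            "task_key": "check_nulls",
            # "ingest_revenue" is assumed to be an earlier task in the same job.
            "depends_on": [{"task_key": "ingest_revenue"}],
            "condition_task": {
                "op": "EQUAL_TO",
                "left": "{{tasks.ingest_revenue.values.null_count}}",
                "right": "0",
            },
        },
        {
            "task_key": "transform_revenue",
            "depends_on": [{"task_key": "check_nulls", "outcome": "true"}],
            "notebook_task": {"notebook_path": "/Workspace/pipelines/transform"},
        },
        {
            "task_key": "validate_data_quality",
            "depends_on": [{"task_key": "check_nulls", "outcome": "false"}],
            "notebook_task": {"notebook_path": "/Workspace/pipelines/validate"},
        },
    ]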

Notifications and monitoring

Set up notifications and monitor job run results using the Databricks UI, CLI, or API, or using integrations with tools like Slack and webhooks.
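
For example, failure notifications can be added to an existing job programmatically. The sketch below uses the Databricks SDK for Python; the job ID and email address are illustrative assumptions.

    # Sketch: add an on-failure email notification to an existing job.
    # The job ID and address are illustrative.
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import jobs

    w = WorkspaceClient()

    w.jobs.update(
        job_id=123456789,
        new_settings=jobs.JobSettings(
            email_notifications=jobs.JobEmailNotifications(
                on_failure=["data-platform-oncall@example.com"],
            ),
        ),
    )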

Integration with other tools

Apache Airflow: Trigger Databricks jobs using external orchestration tools like Apache Airflow, allowing for more complex and integrated workflows.
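
As a sketch, an Airflow DAG can trigger an existing Databricks job through the Databricks provider package (apache-airflow-providers-databricks). The DAG ID, connection ID, and job ID below are illustrative assumptions, and the schedule argument assumes Airflow 2.4 or later.

    # Sketch: an Airflow DAG that triggers an existing Databricks job once a day.
    # Connection ID and job ID are illustrative.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

    with DAG(
        dag_id="trigger_databricks_job",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        run_job = DatabricksRunNowOperator(
            task_id="run_daily_revenue_job",
            databricks_conn_id="databricks_default",  # assumed pre-configured connection
            job_id=123456789,
        )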

Infrastructure as code (IaC)

Databricks Asset Bundles: Manage jobs and other resources as code to facilitate version control, code review, and CI/CD (Continuous Integration/Continuous Deployment) practices.

Orchestration concepts

There are three main concepts when using orchestration in Databricks: jobs, tasks, and triggers.

Job - A job is the primary resource for coordinating, scheduling, and running your operations. Jobs can vary in complexity from a single task running a Databricks notebook to hundreds of tasks with conditional logic and dependencies. The tasks in a job are visually represented by a Directed Acyclic Graph (DAG). You can specify properties for the job, including:

  • Trigger - this defines when to run the job.

  • Parameters - run-time parameters that are automatically pushed to tasks within the job.

  • Notifications - emails or webhooks to be sent when a job fails or takes too long.

  • Git - source control settings for the job tasks.

Task - A task is a specific unit of work within a job. Each task can perform a variety of operations, including:

  • A notebook task runs a Databricks notebook. You specify the path to the notebook and any parameters that it requires.

  • A pipeline task runs a pipeline. You can specify an existing Delta Live Tables pipeline, such as a materialized view or streaming table.

  • A Python script task runs a Python file. You provide the path to the file and any necessary parameters.

There are many types of tasks. For a complete list, see Types of tasks. Tasks can depend on other tasks and conditionally run other tasks, allowing you to build complex workflows with branching logic and dependencies.

Trigger - A trigger is a mechanism that initiates a job run based on specific conditions or events. A trigger can be time-based, such as running a job at a scheduled time (for example, every day at 2 AM), or event-based, such as running a job when new data arrives in cloud storage.
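
For illustration, the two trigger styles can be expressed as fragments of a Jobs API job settings payload, shown here as Python dictionaries; the cron expression and the monitored storage location are illustrative assumptions.

    # Time-based trigger: a Quartz cron schedule (every day at 2 AM, UTC).
    time_based_settings = {
        "schedule": {
            "quartz_cron_expression": "0 0 2 * * ?",
            "timezone_id": "UTC",
        }
    }

    # Event-based trigger: run the job when new files arrive at a storage location.
    # The monitored path is illustrative.
    event_based_settings = {
        "trigger": {
            "file_arrival": {"url": "abfss://landing@example.dfs.core.windows.net/incoming/"}
        }
    }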

Monitoring and observability

Jobs provide built-in support for monitoring and observability. The following topics give an overview of this support. For more details about monitoring jobs and orchestration, see Monitoring and observability for Databricks Jobs.

Job monitoring and observability in the UI - In the Databricks UI you can view jobs, including details such as the job owner and the result of the last run, and filter by job properties. You can view a history of job runs, and get detailed information about each task in the job.

Job run status and metrics - Databricks reports the success or failure of each job run, along with logs and metrics for each task within the run, so you can diagnose issues and understand performance.

Notifications and alerts - You can set up notifications for job events through email, Slack, custom webhooks, and other destinations.

Custom queries through system tables - Databricks provides system tables that record job runs and tasks across the account. You can use these tables to query and analyze job performance and costs. You can create dashboards to visualize job metrics and trends, to help monitor the health and performance of your workflows.
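
As a sketch, a query like the following could summarize recent run outcomes. It assumes the jobs system tables are available under the system.lakeflow schema; the table and column names should be verified against the system tables reference for your workspace.

    # Summarize job run outcomes over the last 7 days from the jobs system tables.
    # Table and column names are assumptions; verify them against the system tables reference.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.sql("""
        SELECT job_id, result_state, COUNT(*) AS runs
        FROM system.lakeflow.job_run_timeline
        WHERE period_start_time >= current_timestamp() - INTERVAL 7 DAYS
        GROUP BY job_id, result_state
        ORDER BY runs DESC
    """).show()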

Limitations

The following limitations exist:

  • A workspace is limited to 2000 concurrent task runs. A 429 Too Many Requests response is returned when you request a run that cannot start immediately.

  • The number of jobs a workspace can create in an hour is limited to 10000 (includes “runs submit”). This limit also affects jobs created by the REST API and notebook workflows.

  • A workspace can contain up to 12000 saved jobs.

  • A job can contain up to 100 tasks.

Can I manage workflows programmatically?

Databricks has tools and APIs that allow you to schedule and orchestrate your workflows programmatically, including the Databricks CLI, Databricks Asset Bundles, the Databricks SDKs, and the Jobs REST API.

For examples of using tools and APIs to create and manage jobs, see Automate job creation and management. For documentation on all available developer tools, see Local development tools.

External orchestration tools can use these Databricks tools and APIs to schedule workflows programmatically. For example, you can schedule your jobs using Apache Airflow. See Orchestrate Databricks jobs with Apache Airflow.