Schedule and orchestrate Workflows

Databricks Workflows provides a collection of tools that allow you to schedule and orchestrate data processing tasks on Databricks. You use Databricks Workflows to configure Databricks Jobs.

This article introduces concepts related to managing production workloads using Databricks Jobs.

Note

Delta Live Tables provides a declarative syntax for creating data processing pipelines. See What is Delta Live Tables?.

What are Databricks jobs?

A Databricks job allows you to configure tasks to run in a specified compute environment on a specified schedule. Along with Delta Live Tables pipelines, jobs are the primary tool used on Databricks to deploy data processing and ML logic into production.

Jobs can vary in complexity from a single task running a Databricks notebook to many tasks running with conditional logic and dependencies, up to the per-job task limit listed under Limitations.

How can I configure and run jobs?

You can create and run a job using the Jobs UI, the Databricks CLI, or the Jobs API. You can repair and re-run a failed or canceled job using the UI or API. You can monitor job run results using the UI, CLI, API, and notifications (for example, email, webhook destination, or Slack notifications).

To learn about using the Databricks CLI, see What is the Databricks CLI?. To learn about using the Jobs API, see the Jobs API.

What is the minimum configuration needed for a job?

All jobs on Databricks require the following:

  • Source code that contains logic to be run.

  • A compute resource to run the logic. The compute resource can be serverless compute, classic jobs compute, or all-purpose compute. See Use Databricks compute with your jobs.

  • A specified schedule for when the job should be run or a manual trigger.

  • A unique name.
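
As a minimal sketch, the four requirements above might map onto a job definition like the following. The field names are assumed to follow the Jobs API 2.1 create request and should be verified against the Jobs API reference. The job name, notebook path, cluster ID, and cron expression are hypothetical placeholders. The sketch only builds the definition locally and does not call any API.

```python
import json


def minimal_job_settings(name, notebook_path, cluster_id, cron=None):
    """Build a job definition containing the four required pieces.

    Field names are assumed from the Jobs API 2.1 create request;
    all values passed in below are hypothetical placeholders.
    """
    settings = {
        "name": name,  # the job name
        "tasks": [
            {
                "task_key": "main",
                # Source code: a notebook containing the logic to run.
                "notebook_task": {"notebook_path": notebook_path},
                # Compute: an existing all-purpose cluster in this sketch.
                "existing_cluster_id": cluster_id,
            }
        ],
    }
    if cron is not None:
        # Schedule: a Quartz cron expression. Omit it to trigger manually.
        settings["schedule"] = {
            "quartz_cron_expression": cron,
            "timezone_id": "UTC",
        }
    return settings


job = minimal_job_settings(
    name="nightly-ingest",
    notebook_path="/Workspace/Users/someone@example.com/ingest",
    cluster_id="0000-000000-example",
    cron="0 0 2 * * ?",
)
print(json.dumps(job, indent=2))
```

Leaving out the `cron` argument produces a definition with no schedule, which corresponds to a manually triggered job.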

Note

If you develop your code in Databricks notebooks, you can use the Schedule button to configure that notebook as a job. See Create and manage scheduled notebook jobs.

What is a task?

A task represents a unit of logic in a job. Tasks can range in complexity and include the following:

  • A notebook

  • A JAR

  • A SQL query

  • A Delta Live Tables (DLT) pipeline

  • Another job

  • Control flow tasks

You can control the execution order of tasks by specifying dependencies between them. You can configure tasks to run in sequence or parallel.

Jobs interact with state information and metadata of tasks, but task scope is isolated. You can use task values to share context between scheduled tasks. See Share information between tasks in a Databricks job.
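
To illustrate how dependencies determine which tasks run in sequence and which can run in parallel, the following sketch groups a set of hypothetical tasks into execution "waves" using Python's standard `graphlib` module. The task names and the `execution_waves` helper are invented for illustration. This is not how Databricks schedules tasks internally, only a way to reason about a dependency graph.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "features": {"clean"},
    "report": {"clean"},
    "publish": {"features", "report"},
}


def execution_waves(deps):
    """Group tasks into waves; tasks in the same wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Every task whose dependencies have all completed is ready now.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

In this example, `features` and `report` both depend only on `clean`, so they land in the same wave and could run in parallel, while `publish` waits for both.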

What control flow options are available for jobs?

When you configure jobs and tasks within jobs, you can customize settings that control how the entire job and individual tasks run.

Trigger types

You must specify a trigger type when you configure a job. The trigger types referenced in this article are scheduled, file arrival, and continuous.

You can also choose to manually trigger your job, but this is mostly reserved for specific use cases such as:

  • You use an external orchestration tool for triggering jobs using REST API calls.

  • You have a rarely run job that requires a human in the loop for validation or to resolve data quality issues.

  • You are running a workload that only needs to be run once or a few times, such as a migration.

See Trigger jobs when new files arrive.
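
The following sketch shows how the trigger choices this article refers to (a schedule, file arrival, continuous mode, and manual runs) might appear as fragments of a job definition. The field names are assumptions based on the Jobs API 2.1 and should be checked against the Jobs API reference. The `trigger_settings` helper, the cron expression, and the storage URL are hypothetical.

```python
def trigger_settings(kind, **options):
    """Return the job-definition fragment for one trigger type.

    Field names are assumed from the Jobs API 2.1; verify before use.
    """
    if kind == "scheduled":
        return {
            "schedule": {
                "quartz_cron_expression": options["cron"],
                "timezone_id": options.get("timezone", "UTC"),
            }
        }
    if kind == "file_arrival":
        # The URL is the storage location to watch for new files.
        return {"trigger": {"file_arrival": {"url": options["url"]}}}
    if kind == "continuous":
        return {"continuous": {"pause_status": "UNPAUSED"}}
    if kind == "manual":
        # No trigger fields: runs start only on request (UI, CLI, or API).
        return {}
    raise ValueError(f"unknown trigger type: {kind}")


scheduled = trigger_settings("scheduled", cron="0 0 2 * * ?")
on_files = trigger_settings("file_arrival", url="s3://example-bucket/landing/")
manual = trigger_settings("manual")
print(scheduled, on_files, manual, sep="\n")
```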

Retries

The retries setting specifies how many times a job or task is re-run if it fails with an error. Errors are often transient and resolved by a restart. Some features on Databricks, such as schema evolution with Structured Streaming, assume that you run jobs with retries so that the environment resets and the workflow can proceed.

An option for configuring retries appears in the UI for supported contexts. These include the following:

  • You can specify retries for an entire job, meaning the whole job restarts if any task fails.

  • You can specify retries for a task, in which case the task restarts up to the specified number of times if it encounters an error.

When running in continuous trigger mode, Databricks automatically retries with exponential backoff. See How are failures handled for continuous jobs?.
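
The task-level retry behavior described above can be sketched locally as follows. This is an illustration of the semantics ("restart up to the specified number of times"), not Databricks code. The `run_with_retries` helper and the flaky task are hypothetical.

```python
def run_with_retries(task, max_retries):
    """Run task(); on an error, re-run it up to max_retries more times.

    Returns (result, attempts). Re-raises the last error once the
    retry budget is exhausted.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > max_retries:
                raise


calls = {"count": 0}


def flaky_task():
    """Hypothetical task that fails twice with a transient error."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient error")
    return "ok"


result, attempts = run_with_retries(flaky_task, max_retries=3)
print(result, attempts)
```

Here the task succeeds on its third attempt, which is within the budget of one initial run plus three retries.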

Run if conditional tasks

You can use the Run if task type to specify conditionals for later tasks based on the outcome of other tasks. You add tasks to your job and specify upstream-dependent tasks. Based on the status of those tasks, you can configure one or more downstream tasks to run. Jobs support the following dependencies:

  • All succeeded

  • At least one succeeded

  • None failed

  • All done

  • At least one failed

  • All failed

See Run tasks conditionally in a Databricks job.
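
One way to reason about these options is to evaluate each one against the statuses of the upstream tasks. The sketch below is a simplified reading of the option names, assuming each upstream task ends as `"success"`, `"failed"`, or `"skipped"`. The status strings and the `should_run` helper are hypothetical, and the exact handling of skipped or excluded upstream tasks in Databricks may differ.

```python
def should_run(condition, upstream_statuses):
    """Decide whether a downstream task runs, given finished upstream tasks."""
    succeeded = [status == "success" for status in upstream_statuses]
    failed = [status == "failed" for status in upstream_statuses]
    rules = {
        "All succeeded": all(succeeded),
        "At least one succeeded": any(succeeded),
        "None failed": not any(failed),
        "All done": True,  # runs regardless of how upstream tasks ended
        "At least one failed": any(failed),
        "All failed": all(failed),
    }
    return rules[condition]


statuses = ["success", "failed", "success"]
for condition in (
    "All succeeded",
    "At least one succeeded",
    "None failed",
    "All done",
    "At least one failed",
    "All failed",
):
    print(condition, should_run(condition, statuses))
```

A cleanup or alerting task is a typical candidate for "All done" or "At least one failed", because it should run even when the main path does not succeed.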

If/else conditional tasks

You can use the If/else task type to specify conditionals based on some value. See Add branching logic to your job with the If/else condition task.

Jobs support taskValues that you define within your logic and allow you to return the results of some computation or state from a task to the jobs environment. You can define If/else conditions against taskValues, job parameters, or dynamic values.

Databricks supports the following operators for conditionals:

  • ==

  • !=

  • >

  • >=

  • <

  • <=
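
As a minimal sketch of how an If/else condition selects a branch, the following code evaluates a left value against a right value with one of the six operators listed above. The `evaluate_condition` and `choose_branch` helpers and the `row_count` value are hypothetical. In a real job the left value would typically come from a task value, job parameter, or dynamic value.

```python
import operator

# The six comparison operators, keyed by their symbols.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def evaluate_condition(left, op, right):
    """Return True or False for a single comparison."""
    return OPERATORS[op](left, right)


def choose_branch(left, op, right):
    """Name the branch that would run for this condition."""
    return "true branch" if evaluate_condition(left, op, right) else "false branch"


row_count = 0  # hypothetical value set by an upstream task
print(choose_branch(row_count, ">", 0))
print(choose_branch(row_count, "==", 0))
```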

For each tasks

Use the For each task to run another task in a loop, passing a different set of parameters to each iteration of the task.

Adding the For each task to a job requires defining two tasks: the For each task and a nested task. The nested task is the task to run for each iteration of the For each task and is one of the standard Databricks Jobs task types. Multiple methods are supported for passing parameters to the nested task.

See Run a parameterized Databricks job task in a loop.
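
Conceptually, a For each task expands one nested task into one run per input. The sketch below simulates that expansion locally. The `expand_iterations` helper, the `{{input}}` placeholder, and the parameter names are assumptions made for illustration. Check the linked article for the reference syntax Databricks actually supports.

```python
def expand_iterations(inputs, nested_parameters):
    """Return one parameter dict per input value.

    Every occurrence of the placeholder '{{input}}' in a parameter
    value is replaced with the current input.
    """
    return [
        {
            key: value.replace("{{input}}", str(item))
            for key, value in nested_parameters.items()
        }
        for item in inputs
    ]


iterations = expand_iterations(
    inputs=["us", "eu", "apac"],
    nested_parameters={"region": "{{input}}", "mode": "full"},
)
for parameters in iterations:
    print(parameters)
```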

Duration threshold

You can specify a duration threshold to either send a warning or stop a task or job if a specified duration is exceeded. Examples of when you might want to configure this setting include the following:

  • You have tasks that are prone to getting stuck in a hung state.

  • You need to warn an engineer if an SLA for a workflow is exceeded.

  • You want to fail a job configured with a large cluster to avoid unexpected costs.
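
The two outcomes of a duration threshold (warn or stop) can be sketched as a simple check against elapsed time. The `check_duration` helper and its parameter names are hypothetical and only illustrate the decision, not how Databricks implements it.

```python
def check_duration(elapsed_seconds, warn_after=None, stop_after=None):
    """Classify a run by its elapsed time.

    Returns "stop" once the stop threshold is exceeded, "warn" once the
    warning threshold is exceeded, and "ok" otherwise.
    """
    if stop_after is not None and elapsed_seconds > stop_after:
        return "stop"
    if warn_after is not None and elapsed_seconds > warn_after:
        return "warn"
    return "ok"


# Hypothetical SLA: warn after 30 minutes, stop after 2 hours.
for elapsed in (600, 2400, 9000):
    print(elapsed, check_duration(elapsed, warn_after=1800, stop_after=7200))
```

Setting the warning threshold below the stop threshold gives an engineer time to investigate before the run is stopped.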

Concurrency

Most jobs are configured with the default concurrency of 1 concurrent run. This means that if a previous job run has not completed by the time a new run should be triggered, the new run is skipped.

There are some use cases for increased concurrency, but most workloads do not require altering this setting.
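
The skip behavior can be illustrated with a small simulation. The sketch assumes every run takes the same fixed duration and that a trigger is skipped whenever the number of active runs has reached the concurrency limit. The `simulate_triggers` helper and the timings are hypothetical.

```python
def simulate_triggers(trigger_times, run_duration, max_concurrent_runs=1):
    """Return (trigger_time, "started" | "skipped") for each trigger."""
    active_end_times = []
    outcome = []
    for trigger in sorted(trigger_times):
        # Drop runs that have already finished by this trigger time.
        active_end_times = [end for end in active_end_times if end > trigger]
        if len(active_end_times) < max_concurrent_runs:
            active_end_times.append(trigger + run_duration)
            outcome.append((trigger, "started"))
        else:
            outcome.append((trigger, "skipped"))
    return outcome


# Triggers every 60 minutes, but each run takes 90 minutes.
default = simulate_triggers([0, 60, 120], run_duration=90)
raised = simulate_triggers([0, 60, 120], run_duration=90, max_concurrent_runs=2)
print(default)
print(raised)
```

With the default limit, the trigger at minute 60 is skipped because the first run is still active. With a limit of 2, all three runs start.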

How can I monitor jobs?

You can receive notifications when a job or task starts, completes, or fails. You can send notifications to one or more email addresses or system destinations. See Add email and system notifications for job events.

System tables include a lakeflow schema where you can view records related to job activity in your account. See Jobs system table reference.

You can also join the jobs system tables with billing tables to monitor the cost of jobs across your account. See Monitor job costs with system tables.

Limitations

The following limitations exist:

  • A workspace is limited to 1000 concurrent task runs. A 429 Too Many Requests response is returned when you request a run that cannot start immediately.

  • The number of jobs a workspace can create in an hour is limited to 10000 (includes “runs submit”). This limit also affects jobs created by the REST API and notebook workflows.

  • A workspace can contain up to 12000 saved jobs.

  • A job can contain up to 100 tasks.
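
Because a run request can be rejected with 429 Too Many Requests, a client that submits runs through the API may want to retry with a delay. The following is a minimal sketch of exponential backoff under that assumption. The `submit_with_backoff` helper is hypothetical, and `submit` stands in for whatever function sends the request and returns the HTTP status code. The example uses a fake in place of a real API call.

```python
import time


def submit_with_backoff(submit, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call submit() until it returns a status other than 429.

    Waits base_delay, then 2x, 4x, ... between attempts.
    Returns (status, attempts_made).
    """
    for attempt in range(1, max_attempts + 1):
        status = submit()
        if status != 429:
            return status, attempt
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    return 429, max_attempts


# Fake responses in place of a real API call: two rejections, then success.
responses = iter([429, 429, 200])
delays = []  # record delays instead of actually sleeping
status, attempts = submit_with_backoff(lambda: next(responses), sleep=delays.append)
print(status, attempts, delays)
```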

Can I manage workflows programmatically?

Databricks provides tools and APIs, such as the Databricks CLI and the Jobs API, that allow you to schedule and orchestrate your workflows programmatically.

For more information about developer tools, see Developer tools and guidance.

Workflow orchestration with Apache Airflow

You can use Apache Airflow to manage and schedule your data workflows. With Airflow, you define your workflow in a Python file, and Airflow manages scheduling and running the workflow. See Orchestrate Databricks jobs with Apache Airflow.