Best practices for reliability

This article covers best practices for reliability, organized by the architectural principles listed in the following sections.

1. Design for failure

Use a data format that supports ACID transactions

ACID transactions are a critical feature for maintaining data integrity and consistency. Choosing a data format that supports ACID transactions helps build pipelines that are simpler and much more reliable.

Delta Lake is an open source storage framework that provides ACID transactions, schema enforcement, and scalable metadata handling, and that unifies streaming and batch data processing. Delta Lake is fully compatible with the Apache Spark APIs and is designed for tight integration with Structured Streaming, allowing you to use a single copy of data for both batch and streaming operations and providing incremental processing at scale.
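As a minimal sketch of how a single Delta table can serve both batch and streaming consumers (the table names and volume paths below are placeholders):

```python
# Batch write: ingest raw JSON files into a Delta table (names are illustrative).
raw_df = spark.read.json("/Volumes/main/landing/events/")
raw_df.write.format("delta").mode("append").saveAsTable("main.bronze.events")

# Streaming read of the same table: downstream consumers process new rows
# incrementally, without maintaining a second copy of the data.
(spark.readStream
    .table("main.bronze.events")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/checkpoints/events_silver")
    .toTable("main.silver.events"))
```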

Use a resilient distributed data engine for all workloads

Apache Spark, as the compute engine of the Databricks lakehouse, is based on resilient distributed data processing. If an internal Spark task does not return a result as expected, Apache Spark automatically reschedules the missing tasks and continues to execute the entire job. This is helpful for failures that originate outside the code, such as a brief network problem or a revoked spot VM. This resiliency is built into the engine and applies to both the SQL API and the Spark DataFrame API.
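For illustration, task retries are handled automatically by the engine and are governed by a standard Spark setting; the snippet below only reads the effective value (the open source Spark default is 4 attempts per task):

```python
# spark.task.maxFailures controls how many times a single task may fail
# before Spark gives up on the whole job; retries themselves are automatic.
print(spark.sparkContext.getConf().get("spark.task.maxFailures", "4"))
```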

In the Databricks lakehouse, Photon, a native vectorized engine written entirely in C++, is a high-performance compute engine compatible with the Apache Spark APIs.

Automatically rescue invalid or nonconforming data

Invalid or nonconforming data can cause workloads that rely on an established data format to crash. To increase the end-to-end resiliency of the entire process, it is best practice to filter out invalid and nonconforming data at ingest. Support for rescued data ensures that you never lose or miss data during ingest or ETL. The rescued data column contains any data that was not parsed, either because it was missing from the given schema, because there was a type mismatch, or because the casing of the column in the record or file did not match that in the schema.

  • Databricks Auto Loader: Auto Loader is the ideal tool for streaming file ingestion. It supports rescued data for JSON and CSV. For example, for JSON, the rescued data column contains any data that was not parsed, possibly because it was missing from the given schema, because there was a type mismatch, or because the casing of the column did not match. By default, when the schema is inferred, the rescued data column is included in the schema returned by Auto Loader as _rescued_data (see the sketch after this list).

  • Delta Live Tables: Another option for building resilient workflows is to use Delta Live Tables with quality constraints. See Manage data quality with Delta Live Tables. Out of the box, Delta Live Tables supports three modes for handling invalid records: retain, drop, and fail. To quarantine identified invalid records, expectation rules can be defined so that invalid records are stored (“quarantined”) in a separate table. See Quarantine invalid data.
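A minimal sketch of Auto Loader ingestion with rescued data, assuming hypothetical volume paths and table names:

```python
# Auto Loader ingestion with rescued data (paths and table names are placeholders).
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/checkpoints/orders_schema")
    # Records that do not conform to the inferred schema land in this column
    # instead of failing the stream; _rescued_data is the default name.
    .option("rescuedDataColumn", "_rescued_data")
    .load("/Volumes/main/landing/orders/")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/checkpoints/orders_bronze")
    .toTable("main.bronze.orders"))
```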

Configure jobs for automatic retries and termination

Distributed systems are complex, and a failure at one point can potentially cascade throughout the system.

  • Databricks jobs support an automatic retry policy that determines when and how many times failed runs are retried.

  • You can configure optional duration thresholds for a task, including an expected completion time and a maximum completion time.

  • Delta Live Tables also automates failure recovery by using escalating retries to balance speed and reliability. See Development and production modes.

On the other hand, a hanging task can prevent the entire job from completing, resulting in high costs. Databricks jobs support timeout configuration to kill jobs that take longer than expected.
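The following is a rough sketch of the relevant task settings in a Jobs API payload; the task key, notebook path, and values are placeholders and should be tuned per workload:

```python
# Excerpt of a job task definition with retries and a timeout (illustrative values).
task_settings = {
    "task_key": "nightly_etl",
    "notebook_task": {"notebook_path": "/Repos/etl/nightly"},
    "max_retries": 2,                    # retry a failed run up to two times
    "min_retry_interval_millis": 60000,  # wait one minute between attempts
    "retry_on_timeout": False,
    "timeout_seconds": 3600,             # terminate the task if it runs past one hour
}
```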

Use scalable and production-grade model serving infrastructure

For batch and streaming inference, use Databricks jobs and MLflow to deploy models as Apache Spark UDFs to leverage job scheduling, retries, autoscaling, and so on. See Use MLflow for model inference.
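A minimal sketch of batch scoring with a model deployed as a Spark UDF; the model URI and table names are assumptions:

```python
import mlflow
from pyspark.sql import functions as F

# Load a registered model as a Spark UDF for distributed batch inference
# (model name and stage are placeholders).
predict = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/Production")

features = spark.read.table("main.gold.customer_features")
scored = features.withColumn(
    "prediction", predict(F.struct(*[F.col(c) for c in features.columns]))
)
scored.write.mode("overwrite").saveAsTable("main.gold.customer_predictions")
```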

Leverage managed services such as serverless SQL warehouses where possible. These services are operated by Databricks in a reliable and scalable manner at no additional cost to the customer, making workloads more reliable.

2. Manage data quality

Use a layered storage architecture

Curate data by creating a layered architecture and ensuring that data quality increases as data moves through the layers. A common layering approach is:

  • Raw layer (bronze): Source data is ingested into the first layer of the lakehouse and should be persisted there. When all downstream data is created from the raw layer, subsequent layers can be rebuilt from this layer as needed.

  • Curated layer (silver): The purpose of the second layer is to hold cleansed, refined, filtered and aggregated data. The goal of this layer is to provide a solid, reliable foundation for analysis and reporting across all roles and functions.

  • Final layer (gold): The third layer is built around business or project needs. It provides different views as data products to other business units or projects, prepares data around security needs (such as anonymized data), or optimizes for performance (such as with pre-aggregated views). The data products in this layer are considered the source of truth for the business.

The final layer should contain only high quality data and be fully trusted from a business perspective.
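A compressed sketch of this layering with Delta tables; all table names, paths, and transformations below are placeholders:

```python
from pyspark.sql import functions as F

# Bronze: persist the raw source data as ingested.
(spark.read.json("/Volumes/main/landing/orders/")
    .write.format("delta").mode("append").saveAsTable("main.bronze.orders"))

# Silver: cleanse, deduplicate, and filter the raw data.
(spark.read.table("main.bronze.orders")
    .filter(F.col("order_id").isNotNull())
    .dropDuplicates(["order_id"])
    .write.format("delta").mode("overwrite").saveAsTable("main.silver.orders"))

# Gold: aggregate into a business-facing data product.
(spark.read.table("main.silver.orders")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_revenue"))
    .write.format("delta").mode("overwrite").saveAsTable("main.gold.revenue_by_region"))
```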

Improve data integrity by reducing data redundancy

Copying or duplicating data creates data redundancy and leads to loss of integrity, loss of data lineage, and often different access permissions. This reduces the quality of the data in the lakehouse.

A temporary or disposable copy of data is not harmful in itself - it is sometimes necessary to increase agility, experimentation, and innovation. However, when these copies become operational and are regularly used to make business decisions, they become data silos. When these data silos get out of sync, it has a significant negative impact on data integrity and quality, raising questions such as “Which data set is the master?” or “Is the data set current?”

Actively manage schemas

Uncontrolled schema changes can lead to invalid data and failing jobs that use these data sets. Databricks has several methods to validate and enforce the schema (a short sketch follows the list below):

  • Delta Lake supports schema validation and schema enforcement by automatically handling schema variations to prevent the insertion of bad records during ingestion. See Schema enforcement.

  • Auto Loader detects the addition of new columns as it processes your data. By default, the addition of a new column causes your streams to stop with an UnknownFieldException. Auto Loader supports several modes for schema evolution.
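A short sketch of both behaviors, with placeholder table names, paths, and data:

```python
# A DataFrame whose schema does not match the target table (illustrative).
mismatched_df = spark.createDataFrame([(1, "oops")], ["order_id", "unexpected_col"])

# Delta Lake schema enforcement: appending a DataFrame whose schema does not
# match the target table raises an error instead of writing bad records.
try:
    mismatched_df.write.format("delta").mode("append").saveAsTable("main.silver.orders")
except Exception as e:
    print(f"Write rejected by schema enforcement: {e}")

# Auto Loader schema evolution: allow new columns instead of failing the stream.
stream_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/checkpoints/orders_schema")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/Volumes/main/landing/orders/"))
```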

Use constraints and data expectations

Delta tables support standard SQL constraint management clauses that ensure that the quality and integrity of data added to a table is automatically checked. When a constraint is violated, Delta Lake throws an InvariantViolationException error to signal that the new data can’t be added. See Constraints on Databricks.

To further improve this handling, Delta Live Tables supports expectations: expectations define data quality constraints on the contents of a data set. An expectation consists of a description, an invariant, and an action to take if a record violates the invariant. You apply expectations to queries by using Python decorators or SQL constraint clauses. See Manage data quality with Delta Live Tables.
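A minimal sketch of an expectation, assuming a pipeline with a source table named orders_raw; this code runs as part of a Delta Live Tables pipeline:

```python
import dlt

# Drop any record that violates the invariant; the expectation name and
# condition are placeholders.
@dlt.table
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def orders_clean():
    return dlt.read("orders_raw")
```

For a standard Delta table outside Delta Live Tables, a CHECK constraint gives similar protection; the table and constraint names below are placeholders:

```python
# New rows that violate the CHECK clause are rejected with
# InvariantViolationException.
spark.sql("""
    ALTER TABLE main.silver.orders
    ADD CONSTRAINT positive_amount CHECK (amount > 0)
""")
```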

Take a data-centric approach to machine learning

A guiding principle that remains at the core of the AI vision for the Databricks Data Intelligence Platform is a data-centric approach to machine learning. As generative AI becomes more prevalent, this perspective remains just as important.

The core components of any ML project can simply be thought of as data pipelines: feature engineering, training, model deployment, inference, and monitoring pipelines are all data pipelines. As such, operationalizing an ML solution requires merging data from prediction, monitoring, and feature tables with other relevant data. Fundamentally, the easiest way to achieve this is to develop AI-powered solutions on the same platform used to manage production data. See Data-centric MLOps and LLMOps.

3. Design for autoscaling

Enable autoscaling for ETL workloads

Autoscaling allows clusters to automatically resize based on workloads. Autoscaling can benefit many use cases and scenarios from both a cost and performance perspective. The documentation provides considerations for determining whether to use autoscaling and how to get the most benefit.

For streaming workloads, Databricks recommends using Delta Live Tables with autoscaling. Databricks enhanced autoscaling optimizes cluster utilization by automatically allocating cluster resources based on workload volume, with minimal impact on the data processing latency of your pipelines.
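A sketch of the relevant cluster settings in a Delta Live Tables pipeline definition; the values are illustrative, and field names should be checked against the current pipeline settings reference:

```python
# Excerpt of a DLT pipeline configuration with enhanced autoscaling.
pipeline_clusters = [
    {
        "label": "default",
        "autoscale": {
            "min_workers": 1,
            "max_workers": 8,
            "mode": "ENHANCED",  # Databricks enhanced autoscaling
        },
    }
]
```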

Enable autoscaling for SQL warehouse

The scaling parameter of a SQL warehouse sets the minimum and maximum number of clusters over which queries sent to the warehouse are distributed. The default is one cluster with no autoscaling.

To handle more concurrent users for a given warehouse, increase the number of clusters. To learn how Databricks adds clusters to and removes clusters from a warehouse, see SQL warehouse sizing, scaling, and queuing behavior.
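As a rough sketch, the scaling range appears as minimum and maximum cluster counts in the warehouse configuration; the values below are illustrative, and field names follow the SQL Warehouses REST API at the time of writing and should be verified against the current reference:

```python
# Excerpt of a SQL warehouse definition that scales between 1 and 4 clusters.
warehouse_config = {
    "name": "bi_warehouse",
    "cluster_size": "Medium",
    "min_num_clusters": 1,
    "max_num_clusters": 4,
    "enable_serverless_compute": True,
}
```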

4. Test recovery procedures

Recover from Structured Streaming query failures

Structured Streaming provides fault tolerance and data consistency for streaming queries; using Databricks Workflows, you can easily configure your Structured Streaming queries to automatically restart on failure. By enabling checkpointing for a streaming query, you can restart the query after a failure. The restarted query continues from where the failed query left off.
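A minimal sketch of a checkpointed streaming query, with placeholder table names and checkpoint path; if the query fails and is restarted, it resumes from the offsets recorded in the checkpoint:

```python
# Structured Streaming query that records its progress in a checkpoint.
(spark.readStream
    .table("main.bronze.clicks")
    .where("event_type = 'page_view'")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/checkpoints/page_views")
    .toTable("main.silver.page_views"))
```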

Recover ETL jobs using Delta time travel capabilities

Despite thorough testing, a job may fail in production or produce unexpected, even invalid, data. Sometimes this can be fixed with an additional job after understanding the source of the problem and fixing the pipeline that caused the problem in the first place. However, often this is not straightforward, and the job in question should be rolled back. Using Delta time travel, users can easily roll back changes to an older version or timestamp, repair the pipeline, and restart the fixed pipeline.

A convenient way to do so is the `RESTORE` command.
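For example (the table name, version, and timestamp are placeholders):

```python
# Inspect the table history to find a known-good version.
spark.sql("DESCRIBE HISTORY main.silver.orders").show(truncate=False)

# Roll the table back to that version, or to a timestamp.
spark.sql("RESTORE TABLE main.silver.orders TO VERSION AS OF 42")
# spark.sql("RESTORE TABLE main.silver.orders TO TIMESTAMP AS OF '2024-06-01'")
```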

Leverage a job automation framework with built-in recovery

Databricks Workflows is designed for recovery. When a task in a multi-task job fails (and, as a result, all dependent tasks fail), Databricks Workflows provides a matrix view of the runs that lets you investigate the problem that caused the failure. See View runs for a job. Whether it was a short network issue or a real issue in the data, you can fix it and start a repair run in Databricks Workflows. The repair run executes only the failed and dependent tasks and keeps the successful results from the earlier run, saving time and money. See Troubleshoot and repair job failures.
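A repair run can also be triggered programmatically. The following is a sketch using the Databricks SDK for Python; the run ID is a placeholder, and the exact method parameters should be checked against the current SDK reference:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Re-run only the failed tasks (and their dependents) of an existing job run;
# successful task results from the original run are kept.
w.jobs.repair_run(run_id=123456789, rerun_all_failed_tasks=True)
```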

Configure a disaster recovery pattern

For a cloud-native data analytics platform like Databricks, a clear disaster recovery pattern is critical. Your data teams must be able to use the Databricks platform even in the rare event of a regional, service-wide outage of a cloud service provider, whether caused by a regional disaster such as a hurricane or earthquake, or by some other source.

Databricks is often a core part of an overall data ecosystem that includes many services, including upstream data ingestion services (batch/streaming), cloud-native storage such as Google Cloud Storage, downstream tools and services such as business intelligence apps, and orchestration tooling. Some of your use cases may be particularly sensitive to a regional service-wide outage.

Disaster recovery involves a set of policies, tools, and procedures that enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. A large cloud service such as Google Cloud serves many customers and has built-in protections against a single failure. For example, a region is a group of buildings connected to different power sources to ensure that a single power outage will not bring down a region. However, cloud region failures can occur, and the severity of the failure and its impact on your business can vary.

5. Automate deployments and workloads

See Operational Excellence - Automate deployments and workloads.