Best practices for interoperability and usability

This article covers best practices for interoperability and usability, organized by architectural principles listed in the following sections.

1. Define standards for integration

Use standard and reusable integration patterns for external integration

Integration standards are important because they provide guidelines for how data should be represented, exchanged, and processed across different systems and applications. These standards help ensure that data is compatible, high quality, and interoperable across various sources and destinations.

The Databricks Lakehouse comes with a comprehensive REST API that allows you to programmatically manage nearly all aspects of the platform. The REST API server runs in the control plane and provides a unified endpoint for managing the Databricks platform.

The REST API provides the lowest level of integration that can always be used. However, the preferred way to integrate with Databricks is to use higher level abstractions such as the Databricks SDKs or CLI tools. CLI tools are shell-based and allow easy integration of the Databricks platform into CI/CD and MLOps workflows.
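As a rough illustration of the lowest-level integration path described above, the following sketch builds (but does not send) an authenticated request against the Jobs API list endpoint, using only the Python standard library. The workspace URL and token are placeholders, and the helper name `build_list_jobs_request` is hypothetical.

```python
import urllib.request


def build_list_jobs_request(host: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the Jobs API list endpoint."""
    url = f"{host.rstrip('/')}/api/2.1/jobs/list"
    return urllib.request.Request(
        url,
        # Personal access tokens are passed as a bearer token.
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder values; replace with a real workspace URL and token.
request = build_list_jobs_request("https://example.cloud.databricks.com", "<token>")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. In practice, the Databricks SDK for Python wraps the same endpoint (for example, `WorkspaceClient().jobs.list()`) and handles authentication and pagination, which is why the higher-level abstractions are usually preferable.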

Use optimized connectors to ingest data sources into the lakehouse

Databricks offers a variety of ways to help you ingest data into Delta Lake.

Databricks provides optimized connectors for stream messaging services such as Apache Kafka for near-real-time data ingestion.
Databricks provides built-in integrations to many cloud-native data systems and extensible JDBC support to connect to other data systems.
One option for integrating data sources without ETL is Lakehouse Federation. Lakehouse Federation is the query federation platform for Databricks. The term query federation describes a collection of features that allow users and systems to run queries against multiple data sources without having to migrate all the data into a unified system. Databricks uses Unity Catalog to manage query federation. Unity Catalog’s data governance and data lineage tools ensure that data access is managed and audited for all federated queries run by users in your Databricks workspaces.

Note

Any query in the Databricks platform that uses a Lakehouse Federation source is sent to that source. Make sure the source system can handle the load. Also, be aware that if the source system is deployed in a different cloud region or cloud, there is an egress cost for each query.
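The Kafka connector mentioned above can be sketched with the Structured Streaming reader. This is a minimal example, not a complete pipeline: the function name `read_kafka_stream` is hypothetical, `spark` is assumed to be an existing `SparkSession` (as provided in a Databricks notebook), and the broker address and topic are supplied by the caller.

```python
def read_kafka_stream(spark, bootstrap_servers: str, topic: str):
    """Return a streaming DataFrame that subscribes to one Kafka topic.

    `spark` is assumed to be an active SparkSession.
    """
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        # Start from new messages only; use "earliest" to replay the topic.
        .option("startingOffsets", "latest")
        .load()
    )
```

The returned DataFrame exposes the Kafka `key` and `value` columns as binary, so a typical next step is to cast and parse `value` before writing to a Delta table.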

Use certified partner tools

Organizations have different needs, and no single tool can meet them all. Partner Connect allows you to explore and easily integrate with our partners, who cover all aspects of the lakehouse: data ingestion, preparation and transformation, BI and visualization, machine learning, data quality, and more. Partner Connect allows you to create trial accounts with selected Databricks technology partners and connect your Databricks workspace to partner solutions from the Databricks UI. Try partner solutions using your data in the Databricks lakehouse, and then adopt the solutions that best meet your business needs.

Reduce complexity of data engineering pipelines

Investing in reducing the complexity of data engineering pipelines improves scalability, agility, and flexibility, so that you can expand and innovate faster. Simplified pipelines make it easier to manage and adapt all of the operational needs of a data engineering pipeline: task orchestration, cluster management, monitoring, data quality, and error handling.

Delta Live Tables is a framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations you want to perform on your data, and Delta Live Tables handles task orchestration, cluster management, monitoring, data quality, and error handling. See What is Delta Live Tables?.

Auto Loader reliably, incrementally, and efficiently processes new data files as they arrive in cloud storage. An important aspect of both Delta Live Tables and Auto Loader is their declarative nature: without them, you must build complex pipelines that integrate different cloud services, such as a notification service and a queuing service, to reliably read cloud files based on events and to reliably combine batch and streaming sources.

Auto Loader and Delta Live Tables reduce system dependencies and complexity, and greatly improve interoperability with cloud storage and between different paradigms such as batch and streaming. As a side effect, the simplicity of the pipelines increases the usability of the platform.
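To make the declarative style of Auto Loader concrete, here is a minimal sketch of an incremental file reader. The function name `read_new_files` and its parameters are hypothetical, `spark` is assumed to be an existing `SparkSession` on Databricks (the `cloudFiles` source is not available in open source Spark), and the paths are supplied by the caller.

```python
def read_new_files(spark, source_path: str, schema_location: str,
                   file_format: str = "json"):
    """Return a streaming DataFrame over files arriving in `source_path`.

    `spark` is assumed to be an active SparkSession on Databricks.
    """
    return (
        spark.readStream.format("cloudFiles")
        # Format of the incoming files, for example "json", "csv", or "parquet".
        .option("cloudFiles.format", file_format)
        # Where Auto Loader stores the inferred schema between runs.
        .option("cloudFiles.schemaLocation", schema_location)
        .load(source_path)
    )
```

No notification or queuing service is wired up explicitly here; file discovery and tracking of already-processed files are handled by Auto Loader and the stream checkpoint.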

Use infrastructure as code (IaC) for deployments and maintenance

HashiCorp Terraform is a popular open source tool for creating safe and predictable cloud infrastructure across several cloud providers. See Operational Excellence: Use Infrastructure as code for deployments and maintenance.

2. Utilize open interfaces and open data formats

Use open data formats

Using an open data format means there are no restrictions on its use. This is important because it removes barriers to accessing and using the data for analysis and driving business insights. Open formats, such as those built on Apache Spark, also add features that boost performance with support for ACID transactions, unified streaming, and batch data processing. Furthermore, open source is community-driven, meaning the community is constantly working on improving existing features and adding new ones, making it easier for users to get the most out of their projects.

The primary data format used in the Data Intelligence Platform is Delta Lake, a fully open data format that offers many benefits, from reliability features to performance enhancements. See Use a data format that supports ACID transactions and Best practices for performance efficiency.

Because of its open nature, Delta Lake comes with a large ecosystem. Dozens of third-party tools and applications support Delta Lake.

To further enhance interoperability, the Delta Universal Format (UniForm) allows you to read Delta tables with Iceberg reader clients. UniForm automatically generates Iceberg metadata asynchronously, without rewriting the data, so that Iceberg clients can read Delta tables as if they were Iceberg tables. A single copy of the data files serves both formats.
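As a hedged illustration, UniForm is enabled through Delta table properties. The sketch below only renders the SQL statement as a string. The property names are an assumption based on the Databricks UniForm documentation at the time of writing and should be checked against the current docs; the helper name `uniform_iceberg_statement` and the table name are hypothetical.

```python
def uniform_iceberg_statement(table_name: str) -> str:
    """Render an ALTER TABLE statement that enables Iceberg reads via UniForm."""
    # Assumed property names; verify against the current UniForm documentation.
    properties = {
        "delta.columnMapping.mode": "name",
        "delta.enableIcebergCompatV2": "true",
        "delta.universalFormat.enabledFormats": "iceberg",
    }
    rendered = ", ".join(f"'{key}' = '{value}'" for key, value in properties.items())
    return f"ALTER TABLE {table_name} SET TBLPROPERTIES ({rendered})"


# Hypothetical three-level Unity Catalog table name.
print(uniform_iceberg_statement("main.sales.orders"))
```

The resulting statement can be run with `spark.sql(...)` or in a SQL editor. The data files are left in place; only additional metadata is generated.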

Use open standards for your ML lifecycle management

As with open source data formats, using open standards for your AI workflows brings benefits in flexibility, agility, cost, and security.

MLflow is an open source platform for managing the ML and AI lifecycle. Databricks offers a fully managed and hosted version of MLflow, integrated with enterprise security features, high availability, and other Databricks workspace features such as experiment and run management and notebook revision tracking.

The primary components are experiment tracking, which automatically logs and tracks ML and deep learning models; models, a standard format for packaging machine learning models; a model registry integrated with Unity Catalog; and scalable, enterprise-grade model serving.
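The tracking component described above can be sketched as follows. To keep the example free of a hard dependency, the MLflow module is passed in as the `tracking` argument; this indirection and the helper name `log_training_run` are illustrative choices, not part of MLflow. The calls used (`start_run`, `log_params`, `log_metrics`) are part of MLflow's fluent tracking API.

```python
def log_training_run(tracking, run_name: str, params: dict, metrics: dict) -> None:
    """Record one training run.

    `tracking` is expected to be the imported `mlflow` module.
    """
    # start_run returns a context manager; the run is ended on exit.
    with tracking.start_run(run_name=run_name):
        tracking.log_params(params)    # hyperparameters, logged once per run
        tracking.log_metrics(metrics)  # numeric results such as accuracy
```

With MLflow installed, a call might look like `import mlflow; log_training_run(mlflow, "baseline", {"max_depth": 5}, {"rmse": 0.42})`, where the run name and values are made up for illustration.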

3. Simplify new use case implementation

Provide a self-service experience across the platform

There are several benefits of a platform where users have autonomy to use the tools and capabilities depending on their needs. Investing in creating a self-service platform makes it easy to scale to serve more users and drives greater efficiency by minimizing the need for human involvement to provision users, resolve issues, and process access requests.

The Databricks Data Intelligence Platform has all the capabilities needed to provide a self-service experience. While there may be a mandatory approval step, the best practice is to fully automate the setup when a business unit requests access to the lakehouse. Automatically provision their new environment, synchronize users and use SSO for authentication, provide access control to shared data and separate object stores for their own data, and so on. Together with a central data catalog of semantically consistent and business-ready data sets, new business units can quickly and securely access lakehouse capabilities and the data they need.

Use serverless compute

For serverless compute on the Databricks platform, the compute layer runs in the customer’s Databricks account. Cloud administrators no longer need to manage complex cloud environments that require adjusting quotas, creating and maintaining network resources, and connecting to billing sources. Users benefit from near-zero cluster startup latency and improved query concurrency.

Use predefined compute templates

Predefined templates help control how compute resources can be used or created by users: Limit user cluster creation to prescribed settings or a certain number, simplify the user interface, or control costs by limiting the maximum cost per cluster.

The Data Intelligence Platform accomplishes this in two ways:

Provide shared clusters as immediate environments for users. On these clusters, use autoscaling down to a very minimal number of nodes to avoid high idle costs.
For a standardized environment, use compute policies to restrict cluster size or features or to define t-shirt-sized clusters (S, M, L).
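The t-shirt sizing idea above could be expressed as compute policy definitions along these lines. This is a sketch under stated assumptions: the size-to-worker mapping in `T_SHIRT_SIZES`, the idle timeout, and the helper name `policy_definition` are hypothetical, while the attribute paths and the `fixed` and `range` policy types follow the compute policy definition format.

```python
import json

# Hypothetical upper bounds on worker count per t-shirt size.
T_SHIRT_SIZES = {"S": 2, "M": 8, "L": 16}


def policy_definition(size: str, idle_minutes: int = 30) -> str:
    """Return a compute policy definition (JSON) for one t-shirt size."""
    max_workers = T_SHIRT_SIZES[size]
    definition = {
        # Always allow scaling down to a single worker to limit idle cost.
        "autoscale.min_workers": {"type": "fixed", "value": 1, "hidden": True},
        # Users may choose any upper bound up to the size limit.
        "autoscale.max_workers": {
            "type": "range",
            "maxValue": max_workers,
            "defaultValue": max_workers,
        },
        # Terminate idle clusters automatically.
        "autotermination_minutes": {"type": "fixed", "value": idle_minutes},
    }
    return json.dumps(definition, indent=2)


print(policy_definition("M"))
```

The JSON string can be pasted into the policy editor in the UI or submitted through the policies API, the SDK, or Terraform.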

Use AI capabilities to increase productivity

In addition to increasing productivity, AI tools can also help identify patterns in errors and provide additional insights based on the input. Overall, incorporating these tools into the development process can greatly reduce errors and facilitate decision-making - leading to faster time to release.

Databricks IQ, the AI-powered knowledge engine, is at the heart of the Data Intelligence Platform. It leverages Unity Catalog metadata to understand your tables, columns, descriptions, and popular data assets across your organization to deliver personalized answers. It enables several features that improve productivity when working with the platform, such as:

Databricks Assistant lets you query data through a conversational interface, making you more productive in Databricks. Describe your task in English and let the Assistant generate SQL queries, explain complex code, and automatically fix errors.
AI-generated comments for any table or table column managed by Unity Catalog speed up the metadata management process. However, AI models are not always accurate, so Databricks strongly recommends human review of AI-generated comments to check for inaccuracies before saving them.

4. Ensure data consistency and usability

Offer reusable data-as-products that the business can trust

Organizations seeking to become AI- and data-driven often need to provide their internal teams with high-quality, trustworthy data. One approach to prioritizing quality and usability is to apply product thinking to your published data assets by creating well-defined “data products”. Building such data products ensures that organizations establish standards and a trusted foundation of business truth for their data and AI goals. Data products ultimately deliver value when users and applications have the right data, at the right time, with the right quality, in the right format. While this value has traditionally been realized in the form of more efficient operations through lower costs, faster processes, and reduced risk, modern data products can also pave the way for new value-added offerings and data sharing opportunities within an organization’s industry or partner ecosystem.

See the blog post Building High-Quality and Trusted Data Products with Databricks.

Publish data products semantically consistent across the enterprise

A data lake typically contains data from multiple source systems. These systems may have different names for the same concept (e.g., customer vs. account) or use the same identifier to refer to different concepts. So that business users can easily combine these data sets in a meaningful way, the data must be made homogeneous across all sources to be semantically consistent. In addition, for some data to be valuable for analysis, internal business rules, such as revenue recognition, must be applied correctly. To ensure that all users are using the correctly interpreted data, datasets with these rules must be made available and published to Unity Catalog. Access to the source data must be restricted to teams that understand the correct usage.

Provide a central catalog for discovery and lineage

A central catalog for discovery and lineage helps data consumers access data from multiple sources across the enterprise, thus reducing operational overhead for the central governance team.

In Unity Catalog, administrators and data stewards manage users and their access to data centrally across all workspaces in a Databricks account. Users in different workspaces can share the same data and, depending on the user privileges centrally granted in Unity Catalog, can access data together.

For data discovery, the Unity Catalog supports users with capabilities such as:

Catalog Explorer is the primary user interface for many Unity Catalog features. You can use Catalog Explorer to view schema details, preview sample data, and view table details and properties. Administrators can view and change owners, and administrators and data object owners can grant and revoke permissions. You can also use Databricks Search, which enables users to easily and seamlessly find data assets (such as tables, columns, views, dashboards, models, and so on). Users are shown results that are relevant to their search requests and that they have access to.
Data lineage is captured across all queries run on a Databricks cluster or SQL warehouse. Lineage is supported for all languages and is captured down to the column level. Lineage data includes notebooks, jobs, and dashboards related to the query. Lineage can be visualized in Catalog Explorer in near real-time and retrieved with the Databricks REST API.

To allow enterprises to provide their users a holistic view of all data across all data platforms, Unity Catalog provides integration with enterprise data catalogs (sometimes referred to as the “catalog of catalogs”).