Introduction to data preparation in Databricks

This article describes how Databricks can help you prepare data for analytics and machine learning. Data preparation is typically the most time-consuming part of an analytics or machine learning project, and the quality of the prepared data directly affects how accurate and useful the results are.

Data preparation tasks

Data preparation includes the following tasks:

  • Cleaning and formatting data. This includes tasks such as handling missing values or outliers, ensuring data is in the correct format, and removing unneeded columns.

  • Preprocessing data. This includes tasks like numerical transformations, aggregating data, encoding text or image data, and creating new features.

  • Combining data. This includes tasks like joining tables or merging datasets.
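
As a rough illustration of the three task types above, the following minimal sketch cleans, combines, and aggregates a small in-memory dataset. It uses plain Python so that it runs anywhere. On Databricks, the same steps would more typically be written with PySpark or pandas. The sample records, field names (`order_id`, `customer_id`, `amount`, `note`, `region`), and helper functions are all hypothetical and exist only for this example.

```python
from statistics import median

# Hypothetical sample data; the field names are assumptions for illustration.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": "19.99", "note": "gift"},
    {"order_id": 2, "customer_id": "c2", "amount": None, "note": ""},
    {"order_id": 3, "customer_id": "c1", "amount": "5.01", "note": ""},
]
customers = {"c1": {"region": "EMEA"}, "c2": {"region": "APAC"}}


def clean(rows):
    """Cleaning: cast amounts to float, fill missing amounts with the
    median of the known values, and drop the unneeded 'note' field."""
    known = [float(r["amount"]) for r in rows if r["amount"] is not None]
    fill = median(known)
    return [
        {
            "order_id": r["order_id"],
            "customer_id": r["customer_id"],
            "amount": float(r["amount"]) if r["amount"] is not None else fill,
        }
        for r in rows
    ]


def combine(rows, lookup):
    """Combining: inner-join each order to its customer's region."""
    return [
        {**r, "region": lookup[r["customer_id"]]["region"]}
        for r in rows
        if r["customer_id"] in lookup
    ]


def aggregate(rows):
    """Preprocessing: aggregate the total order amount per region."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals


prepared = combine(clean(orders), customers)
totals_by_region = aggregate(prepared)
print(totals_by_region)
```

Filling missing values with the median is only one possible choice. Depending on the data, dropping the affected rows or using a domain-specific default may be more appropriate.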

Data preparation resources and information

Databricks provides a unified platform for data ingestion, preparation, analytics, machine learning, and monitoring. The following resources support data preparation:

  • The medallion lakehouse architecture guides you in data preparation by specifying a set of data layers of increasing quality. The architecture maintains ACID guarantees as data passes through multiple layers of validations and transformations before being stored in a layout optimized for efficient analytics.

  • Delta Live Tables is a framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling.

  • Databricks Partner Connect lets you connect your Databricks workspace directly to third-party data preparation and transformation partners. Partner Connect provisions the required Databricks resources on your behalf, then passes resource details to the partner.

  • Databricks Runtime and Databricks Runtime ML provide pre-built environments that come with many of the most widely used data preparation libraries already installed. A list of all built-in libraries is available in the release notes.

  • Feature engineering for machine learning is the process of converting raw data into features that can be used to develop machine learning models. For ML applications, Databricks Feature Store helps your team discover and reuse features, track feature lineage, and publish features to online stores for real-time serving and automatic lookup.
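
To make the idea of data layers of increasing quality more concrete, the following sketch passes a few records through three stages, which in the medallion architecture are commonly called bronze (raw), silver (validated), and gold (aggregated for analytics). This is a plain-Python illustration only. It does not provide ACID guarantees, and on Databricks each layer would normally be stored as a table rather than an in-memory list. The sample records, field names, and the `to_silver` and `to_gold` helpers are hypothetical.

```python
# Hypothetical raw sensor readings; field names are assumptions for illustration.
bronze = [
    {"device": "a", "temp_c": "21.5"},
    {"device": "a", "temp_c": "bad"},  # malformed reading
    {"device": "b", "temp_c": "19.0"},
    {"device": "a", "temp_c": "22.5"},
]


def to_silver(rows):
    """Validate and type-cast bronze rows; skip rows that fail parsing."""
    silver_rows = []
    for r in rows:
        try:
            silver_rows.append({"device": r["device"], "temp_c": float(r["temp_c"])})
        except ValueError:
            # A real pipeline would usually quarantine or count rejected rows.
            continue
    return silver_rows


def to_gold(rows):
    """Aggregate silver rows into the average temperature per device."""
    grouped = {}
    for r in rows:
        grouped.setdefault(r["device"], []).append(r["temp_c"])
    return {device: sum(vals) / len(vals) for device, vals in grouped.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Keeping the raw layer unchanged and deriving the later layers from it means a validation rule can be revised and the downstream layers rebuilt without re-ingesting the source data.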