Exploratory data analysis on Databricks: Tools and techniques

This article describes tools and techniques for exploratory data analysis (EDA) on Databricks.

What is EDA and why is it useful?

Exploratory data analysis (EDA) includes methods for exploring data sets to summarize their main characteristics and identify any problems with the data. Using statistical methods and visualizations, you can learn enough about a data set to judge whether it is ready for analysis and to decide which data preparation techniques to apply. EDA can also influence which algorithms you choose for training ML models.
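As a rough illustration of "summarizing main characteristics and identifying problems", the following minimal sketch computes per-column missing-value counts and basic statistics using only the Python standard library. The sample records, the column names, and the summarize helper are hypothetical and exist only for this example; in a notebook you would more likely call the summary methods of a DataFrame library.

```python
import statistics
from collections import Counter

# Hypothetical sample records; None marks a missing value.
rows = [
    {"age": 34, "city": "Lyon"},
    {"age": 29, "city": "Oslo"},
    {"age": None, "city": "Lyon"},
    {"age": 41, "city": None},
    {"age": 29, "city": "Lyon"},
]


def is_number(value):
    """True for ints and floats, but not booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def summarize(records):
    """Return, per column, the missing count plus basic statistics."""
    columns = sorted({key for record in records for key in record})
    summary = {}
    for col in columns:
        values = [record.get(col) for record in records]
        present = [v for v in values if v is not None]
        info = {"count": len(present), "missing": len(values) - len(present)}
        if present and all(is_number(v) for v in present):
            # Numeric column: report range and central tendency.
            info["min"] = min(present)
            info["max"] = max(present)
            info["mean"] = statistics.mean(present)
            info["median"] = statistics.median(present)
        else:
            # Non-numeric column: report the most frequent values.
            info["top_values"] = Counter(present).most_common(3)
        summary[col] = info
    return summary


summary = summarize(rows)
for col, info in summary.items():
    print(col, info)
```

A summary like this surfaces the kinds of issues EDA is meant to catch, such as columns with missing values or a category that dominates the data, before you commit to a preparation strategy.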

What are the EDA tools in Databricks?

Databricks has built-in analysis and visualization tools in both Databricks SQL and Databricks Runtime. For an illustrated list of the types of visualizations available in Databricks, see Visualization types.

EDA in Databricks SQL

Here are some helpful articles about data visualization and exploration tools in Databricks SQL:

EDA in Databricks Runtime

Databricks Runtime provides a pre-built environment with popular data exploration libraries already installed. For the list of built-in libraries, see the release notes.

In addition, the following articles show examples of visualization tools in Databricks Runtime:

Create data visualizations in Databricks notebooks

In a Databricks Python notebook, you can combine SQL and Python to explore data. When you run code in a SQL language cell in a Python notebook, the table results are automatically made available as a Python DataFrame. For details, see Explore SQL cell results in Python notebooks.
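The general pattern of aggregating in SQL and then inspecting the result in Python can be sketched outside Databricks with the standard-library sqlite3 module. This is only a stand-in: the trips table, its columns, and the sample values are hypothetical, and in a Databricks notebook the SQL cell result is exposed to Python automatically (to my knowledge as a DataFrame named _sqldf) rather than fetched from a cursor as shown here.

```python
import sqlite3

# Stand-in for a notebook's SQL engine: an in-memory SQLite database
# holding a hypothetical "trips" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, distance_km REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Lyon", 12.5), ("Oslo", 3.0), ("Lyon", 7.5), ("Oslo", None)],
)

# SQL step: aggregate per city. COUNT(column) skips NULLs, so comparing it
# with COUNT(*) reveals rows that are missing a distance.
cursor = conn.execute(
    """
    SELECT city,
           COUNT(*)           AS trips,
           COUNT(distance_km) AS with_distance,
           AVG(distance_km)   AS avg_km
    FROM trips
    GROUP BY city
    ORDER BY city
    """
)
columns = [description[0] for description in cursor.description]
result = [dict(zip(columns, row)) for row in cursor.fetchall()]
conn.close()

# Python step: explore the SQL result with ordinary Python code.
incomplete = [r["city"] for r in result if r["with_distance"] < r["trips"]]

for r in result:
    print(r)
print("Cities with missing distances:", incomplete)
```

Doing the heavy aggregation in SQL and the follow-up checks in Python keeps each step in the language best suited to it, which is the same division of labor the notebook workflow encourages.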