This article describes tools and techniques for exploratory data analysis (EDA) on Databricks.
EDA includes methods for exploring data sets to summarize their main characteristics and identify any problems with the data. Using statistical methods and visualizations, you can determine whether a data set is ready for analysis and which data preparation techniques to apply. EDA can also influence which algorithms you choose for training ML models.
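As a rough illustration of summarizing a column and spotting problems, the following minimal sketch computes a few basic statistics using only the Python standard library. The function name `summarize_column` and the `trip_distance` sample values are hypothetical and chosen only for this example. In a notebook you would more likely use the DataFrame libraries that are already installed.

```python
import math
import statistics


def summarize_column(values):
    """Return basic summary statistics for one numeric column.

    None and NaN entries are treated as missing values.
    """
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
    }
    if present:
        summary.update(
            mean=statistics.fmean(present),
            median=statistics.median(present),
            min=min(present),
            max=max(present),
        )
    # The sample standard deviation needs at least two data points.
    if len(present) >= 2:
        summary["stdev"] = statistics.stdev(present)
    return summary


# Hypothetical sample data: trip distances with two missing readings.
trip_distance = [1.2, 3.4, None, 2.2, float("nan"), 40.0, 2.8]
summary = summarize_column(trip_distance)
print(summary)
```

In this sample, the mean is far above the median, which suggests that the value 40.0 may be an outlier. The two missing readings are also something to handle during data preparation.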
Databricks has built-in analysis and visualization tools in both Databricks SQL and Databricks Runtime. For an illustrated list of the types of visualizations available in Databricks, see Visualization types.
Here are some helpful articles about data visualization and exploration tools in Databricks SQL:
Databricks Runtime provides a pre-built environment that has popular data exploration libraries already installed. You can see the list of the built-in libraries in the release notes.
In addition, the following articles show examples of visualization tools in Databricks Runtime:
In a Databricks Python notebook, you can combine SQL and Python to explore data. When you run code in a SQL language cell in a Python notebook, the table results are automatically made available as a Python DataFrame. For details, see Explore SQL cell results in Python notebooks.
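The following sketch illustrates the general pattern of aggregating with SQL and then continuing the exploration in Python. It is not Databricks-specific code: so that it can run anywhere, the standard-library `sqlite3` module stands in for the SQL cell, and the `trips` table, its columns, and its rows are hypothetical. In a Databricks Python notebook, the SQL cell result is instead exposed as a DataFrame. Databricks documents this as an implicit variable named `_sqldf`, but treat that name as an assumption here and confirm it in the linked article.

```python
import sqlite3

# Local stand-in for a SQL cell: an in-memory SQLite database
# holding a small, hypothetical table of trips.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [
        ("Oslo", 12.5),
        ("Oslo", 14.0),
        ("Lima", 7.0),
        ("Lima", None),  # missing fare
        ("Lima", 9.0),
    ],
)

# SQL step: aggregate per city and count rows with a missing fare.
query = """
    SELECT city,
           COUNT(*)          AS trips,
           SUM(fare IS NULL) AS missing_fares,
           AVG(fare)         AS avg_fare
    FROM trips
    GROUP BY city
    ORDER BY city
"""
cursor = conn.execute(query)
columns = [description[0] for description in cursor.description]
result = [dict(zip(columns, row)) for row in cursor.fetchall()]
conn.close()

# Python step: continue exploring the SQL result,
# here by flagging cities that have incomplete fare data.
cities_with_gaps = [row["city"] for row in result if row["missing_fares"] > 0]

for row in result:
    print(row)
print("Cities with missing fares:", cities_with_gaps)
```

The split shown here is one reasonable division of labor: SQL handles the grouping and aggregation, and Python handles follow-up checks that are awkward to express in a query.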