How Databricks AutoML works

This article details how Databricks AutoML works and its implementation of concepts like missing value imputation and large data sampling.

Databricks AutoML performs the following:

  1. Prepares the dataset for model training. For example, AutoML carries out imbalanced data detection for classification problems prior to model training.

  2. Iterates to train and tune multiple models, where each model is constructed from open source components and can easily be edited and integrated into your machine learning pipelines.

    • AutoML automatically distributes hyperparameter tuning trials across the worker nodes of a cluster.

    • With Databricks Runtime 9.1 LTS ML or above, AutoML automatically samples your dataset if it is too large to fit into the memory of a single worker node. See Sampling large datasets.

  3. Evaluates models based on algorithms from the scikit-learn, xgboost, LightGBM, Prophet, and ARIMA packages.

  4. Displays the results and provides a Python notebook with the source code for each trial run so you can review, reproduce, and modify the code. AutoML also calculates summary statistics on your dataset and saves this information in a notebook that you can review later.

AutoML algorithms

Databricks AutoML trains and evaluates models based on the algorithms in the following table.

Note

For classification and regression models, the decision tree, random forests, logistic regression and linear regression with stochastic gradient descent algorithms are based on scikit-learn.

Classification models

Regression models

Forecasting models

Decision trees

Decision trees

Prophet

Random forests

Random forests

Auto-ARIMA (Available in Databricks Runtime 10.3 ML and above.)

Logistic regression

Linear regression with stochastic gradient descent

XGBoost

XGBoost

LightGBM

LightGBM

Supported data feature types

Feature types not listed below are not supported. For example, images are not supported.

The following feature types are supported:

  • Numeric (ByteType, ShortType, IntegerType, LongType, FloatType, and DoubleType)

  • Boolean

  • String (categorical or English text)

  • Timestamps (TimestampType, DateType)

  • ArrayType[Numeric] (Databricks Runtime 10.4 LTS ML and above)

  • DecimalType (Databricks Runtime 11.3 LTS ML and above)

Split data into train/validation/test sets

With Databricks Runtime 10.1 ML and above, you can specify a time column to use for the training/validation/testing data split for classification and regression problems. If you specify this column, the dataset is split into training, validation, and test sets by time. The earliest points are used for training, the next earliest for validation, and the latest points are used as a test set.

In Databricks Runtime 10.1 ML, the time column must be a timestamp or integer column. In Databricks Runtime 10.2 ML and above, you can also select a string column.

Sampling large datasets

Note

Sampling is not applied to forecasting problems.

Although AutoML distributes hyperparameter tuning trials across the worker nodes of a cluster, each model is trained on a single worker node.

AutoML automatically estimates the memory required to load and train your dataset and samples the dataset if necessary.

In Databricks Runtime 9.1 LTS ML through Databricks Runtime 10.5 ML, the sampling fraction does not depend on the cluster’s node type or the amount of memory on each node.

In Databricks Runtime 11.0 ML and above:

  • The sampling fraction increases for worker nodes that have more memory per core. You can increase the sample size by choosing a memory optimized instance type.

  • You can further increase the sample size by choosing a larger value for spark.task.cpus in the Spark configuration for the cluster. The default setting is 1; the maximum value is the number of CPUs on the worker node. When you increase this value, the sample size is larger, but fewer trials run in parallel. For example, in a machine with 4 cores and 64GB total RAM, the default spark.task.cpus=1 runs 4 trials per worker with each trial limited to 16GB RAM. If you set spark.task.cpus=4, each worker runs only one trial but that trial can use 64GB RAM.

  • If AutoML sampled the dataset, the sampling fraction is shown in the Overview tab in the UI.

For classification problems, AutoML uses the PySpark sampleBy method for stratified sampling to preserve the target label distribution.

For regression problems, AutoML uses the PySpark sample method.

Imbalanced dataset support for classification problems

In Databricks Runtime 11.2 ML and above, if AutoML detects that a dataset is imbalanced, it tries to balance the training set by downsampling the major class(es) and adding class weights.

Semantic type detection

Note

  • Semantic type detection is not applied to forecasting problems.

  • AutoML does not perform semantic type detection for columns that have custom imputation methods specified.

With Databricks Runtime 9.1 LTS ML and above, AutoML tries to detect whether columns have a semantic type that is different from the Spark or pandas data type in the table schema. AutoML treats these columns as the detected semantic type. These detections are best effort and might miss the existence of semantic types in some cases. You can also manually set the semantic type of a column or tell AutoML not to apply semantic type detection to a column using annotations.

Specifically, AutoML makes these adjustments:

  • String and integer columns that represent date or timestamp data are treated as a timestamp type.

  • String columns that represent numeric data are treated as a numeric type.

With Databricks Runtime 10.1 ML and above, AutoML also makes these adjustments:

  • Numeric columns that contain categorical IDs are treated as a categorical feature.

  • String columns that contain English text are treated as a text feature.

Semantic type annotations

With Databricks Runtime 10.1 ML and above, you can manually control the assigned semantic type by placing a semantic type annotation on a column. To manually annotate the semantic type of column <column_name> as <semantic_type>, use the following syntax:

metadata_dict = df.schema["<column_name>"].metadata
metadata_dict["spark.contentAnnotation.semanticType"] = "<semantic_type>"
df = df.withMetadata("<column_name>", metadata_dict)

<semantic_type> can be one of the following:

  • categorical: The column contains categorical values (for example, numerical values that should be treated as IDs).

  • numeric: The column contains numeric values (for example, string values that can be parsed into numbers).

  • datetime: The column contains timestamp values (string, numerical, or date values that can be converted into timestamps).

  • text: The string column contains English text.

To disable semantic type detection on a column, use the special keyword annotation native.

Shapley values (SHAP) for model explainability

Note

For MLR 11.1 and below, SHAP plots are not generated, if the dataset contains a dataset column.

The notebooks produced by AutoML regression and classification runs include code to calculate Shapley values. Shapley values are based in game theory and estimate the importance of each feature to a model’s predictions.

AutoML notebooks use the SHAP package to calculate Shapley values. Because these calculations are very memory-intensive, the calculations are not performed by default.

To calculate and display Shapley values:

  1. Go to the Feature importance section in an AutoML generated trial notebook.

  2. Set shap_enabled = True.

  3. Re-run the notebook.

Time series aggregation

For forecasting problems, when there are multiple values for a timestamp in a time series, AutoML uses the average of the values.

To use the sum instead, edit the source code notebook. In the Aggregate data by … cell, change .agg(y=(target_col, "avg")) to .agg(y=(target_col, "sum")), as shown:

group_cols = [time_col] + id_cols
df_aggregation = df_loaded \
  .groupby(group_cols) \
  .agg(y=(target_col, "sum")) \
  .reset_index() \
  .rename(columns={ time_col : "ds" })

Feature Store integration

In Databricks Runtime 11.3 ML and above, you can use existing feature tables in Feature Store to augment the original input dataset for your classification and regression problems. To create a feature table, see Databricks Feature Store.

To use existing feature tables, you can select feature tables with the AutoML UI or set the feature_store_lookups parameter in your AutoML run specification.

feature_store_lookups = [
  {
     "table_name": "example.trip_pickup_features",
     "lookup_key": ["pickup_zip", "rounded_pickup_datetime"],
  },
  {
      "table_name": "example.trip_dropoff_features",
     "lookup_key": ["dropoff_zip", "rounded_dropoff_datetime"],
  }
]

AutoML experiment with Feature Store example notebook

Open notebook in new tab