Databricks AutoML

Databricks AutoML helps you automatically apply machine learning to a dataset. It prepares the dataset for model training and then performs and records a set of trials, creating, tuning, and evaluating multiple models. It displays the results and provides a Python notebook with the source code for each trial run so you can review, reproduce, and modify the code. AutoML also calculates summary statistics on your dataset and saves this information in a notebook that you can review later.

AutoML automatically distributes hyperparameter tuning trials across the worker nodes of a cluster.

Each model is constructed from open source components and can easily be edited and integrated into your machine learning pipelines. You can use Databricks AutoML for regression, classification, and forecasting problems. It evaluates models based on algorithms from the scikit-learn, xgboost, and LightGBM packages.

You can run AutoML using either the UI or the Python API.

Note

AutoML is not supported on table ACL enabled clusters.

Requirements

  • Databricks Runtime 8.3 ML or above. For the general availability (GA) version, Databricks Runtime 10.4 LTS ML or above.

  • For time series forecasting, Databricks Runtime 10.0 ML or above.

  • No libraries other than those preinstalled in Databricks Runtime ML can be installed on the cluster.

  • On a high concurrency cluster, AutoML is not compatible with table access control or credential passthrough.

  • To use Unity Catalog with AutoML, the cluster security mode must be Single User, and you must be the designated single user of the cluster.

AutoML algorithms

Databricks AutoML creates and evaluates models based on these algorithms:

  • For classification and regression: models from the scikit-learn, XGBoost, and LightGBM packages.

  • For forecasting: Prophet and Auto-ARIMA models.

Sampling large datasets

Note

Sampling is not applied to forecasting problems.

While AutoML distributes hyperparameter tuning trials across the worker nodes of a cluster, each model is trained on a single worker node. With Databricks Runtime 9.1 LTS ML and above, AutoML automatically estimates the memory required to load and train your dataset and samples the dataset if necessary. The sampling fraction does not depend on the cluster’s node type or the amount of memory on each node. The sampled dataset is used for model training.

For classification problems, AutoML uses the PySpark sampleBy method for stratified sampling to preserve the target label distribution.

For regression problems, AutoML uses the PySpark sample method.
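PySpark's sampleBy draws a fraction from each label group so the sampled data keeps the original label distribution. The same idea can be sketched with a small pandas analog (toy data, not AutoML's internal code):

```python
import pandas as pd

df = pd.DataFrame({
    "label":   ["a"] * 80 + ["b"] * 20,
    "feature": range(100),
})

# Stratified sampling: draw the same fraction from every label group,
# which preserves the original 80/20 label distribution.
sampled = df.groupby("label").sample(frac=0.5, random_state=42)

print(sampled["label"].value_counts().to_dict())  # {'a': 40, 'b': 10}
```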

Semantic type detection

Note

  • Semantic type detection is not applied to forecasting problems.

  • AutoML does not perform semantic type detection for columns that have custom imputation methods specified.

With Databricks Runtime 9.1 LTS ML and above, AutoML tries to detect whether columns have a semantic type that is different from the Spark or pandas data type in the table schema. AutoML treats these columns as the detected semantic type. These detections are best effort and might miss the existence of semantic types in some cases. You can also manually set the semantic type of a column or tell AutoML not to apply semantic type detection to a column using annotations.

Specifically, AutoML makes these adjustments:

  • String and integer columns that represent date or timestamp data are treated as a timestamp type.

  • String columns that represent numeric data are treated as a numeric type.

With Databricks Runtime 10.1 ML and above, AutoML also makes these adjustments:

  • Numeric columns that contain categorical IDs are treated as a categorical feature.

  • String columns that contain English text are treated as a text feature.
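As a rough pandas analogy (toy data; AutoML's detection logic is internal and more involved), the date and numeric adjustments amount to type coercions like these:

```python
import pandas as pd

df = pd.DataFrame({
    "event_date": ["2021-01-01", "2021-01-02"],  # strings holding dates
    "amount": ["10.5", "20.0"],                  # strings holding numbers
})

# AutoML treats such columns as their detected semantic type, roughly:
df["event_date"] = pd.to_datetime(df["event_date"])
df["amount"] = pd.to_numeric(df["amount"])

print(df.dtypes.astype(str).to_dict())
# {'event_date': 'datetime64[ns]', 'amount': 'float64'}
```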

Semantic type annotations

With Databricks Runtime 10.1 ML and above, you can manually control the assigned semantic type by placing a semantic type annotation on a column. To manually annotate the semantic type of column <column_name> as <semantic_type>, use the following syntax:

metadata_dict = df.schema["<column_name>"].metadata
metadata_dict["spark.contentAnnotation.semanticType"] = "<semantic_type>"
df = df.withMetadata("<column_name>", metadata_dict)

<semantic_type> can be one of the following:

  • categorical: The column contains categorical values (for example, numerical values that should be treated as IDs).

  • numeric: The column contains numeric values (for example, string values that can be parsed into numbers).

  • datetime: The column contains timestamp values (string, numerical, or date values that can be converted into timestamps).

  • text: The string column contains English text.

To disable semantic type detection on a column, use the special keyword annotation native.

Shapley values (SHAP) for model explainability

The notebooks produced by AutoML regression and classification runs include code to calculate Shapley values. Shapley values are based in game theory and estimate the importance of each feature to a model’s predictions.

AutoML notebooks use the SHAP package to calculate Shapley values. Because these calculations are very memory-intensive, the calculations are not performed by default.

To calculate and display Shapley values:

  1. Go to the Feature importance section in an AutoML generated trial notebook.

  2. Set shap_enabled = True.

  3. Re-run the notebook.

Control the train/validation/test split

With Databricks Runtime 10.1 ML and above, you can specify a time column to use for the training/validation/testing split for classification and regression problems. If you specify this column, the dataset is split into training, validation, and test sets by time. The earliest points are used for training, the next earliest for validation, and the latest points are used as a test set.

In Databricks Runtime 10.1 ML, the time column must be a timestamp or integer column. In Databricks Runtime 10.2 ML and above, you can also select a string column.
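Conceptually, the chronological split works like this pandas sketch (the 60/20/20 fractions are illustrative; AutoML's actual split ratios are internal):

```python
import pandas as pd

df = pd.DataFrame({
    "ts":    pd.date_range("2021-01-01", periods=10, freq="D"),
    "value": range(10),
}).sample(frac=1, random_state=0)  # rows arrive in arbitrary order

# Sort by the time column, then cut chronologically: earliest rows train,
# middle rows validate, latest rows test.
df = df.sort_values("ts").reset_index(drop=True)
train, valid, test = df.iloc[:6], df.iloc[6:8], df.iloc[8:]

print(train["ts"].max() < valid["ts"].min() < test["ts"].min())  # True
```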

Time series aggregation

For forecasting problems, when there are multiple values for a timestamp in a time series, AutoML uses the average of the values.

To use the sum instead, edit the source code notebook. In the Aggregate data by … cell, change .agg(y=(target_col, "avg")) to .agg(y=(target_col, "sum")), as shown:

group_cols = [time_col] + id_cols
df_aggregation = df_loaded \
  .groupby(group_cols) \
  .agg(y=(target_col, "sum")) \
  .reset_index() \
  .rename(columns={ time_col : "ds" })
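The effect of switching the aggregation can be seen with the same pandas pattern on toy data with duplicate timestamps (column names hypothetical):

```python
import pandas as pd

df_loaded = pd.DataFrame({
    "ds_raw": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "sales":  [10.0, 30.0, 5.0],
})

# Duplicate timestamps collapse to one row per period; "sum" adds the
# values, whereas the default "avg"/"mean" would average them.
df_aggregated = df_loaded \
    .groupby(["ds_raw"]) \
    .agg(y=("sales", "sum")) \
    .reset_index() \
    .rename(columns={"ds_raw": "ds"})

print(df_aggregated["y"].tolist())  # [40.0, 5.0]
```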

AutoML UI

The AutoML UI steps you through the process of training a model on a dataset. To access the UI:

  1. Select Machine Learning from the persona switcher at the top of the left sidebar.

  2. In the sidebar, click Create > AutoML Experiment.

    You can also create a new AutoML experiment from the Experiments page.

    The Configure AutoML experiment page displays. On this page, you configure the AutoML process, specifying the dataset, problem type, target or label column to predict, metric to use to evaluate and score the experiment runs, and stopping conditions.

Set up classification or regression problems in the UI

  1. In the Compute field, select a cluster running Databricks Runtime 8.3 ML or above.

  2. From the ML problem type drop-down menu, select Regression or Classification. If you are trying to predict a continuous numeric value for each observation, such as annual income, select regression. If you are trying to assign each observation to one of a discrete set of classes, such as good credit risk or bad credit risk, select classification.

  3. Under Dataset, click Browse. A dialog appears listing the available databases and tables. Navigate to the table you want to use and click Select. The table schema appears.

    You can specify which columns to include in training and select custom imputation methods. See Modify the dataset.

  4. Click in the Prediction target field. A drop-down appears listing the columns shown in the schema. Select the column you want the model to predict.

  5. The Experiment name field shows the default name. To change it, type the new name in the field.

You can also specify additional configuration options.

Set up forecasting problems in the UI

  1. In the Compute field, select a cluster running Databricks Runtime 10.0 ML or above.

  2. From the ML problem type drop-down menu, select Forecasting.

  3. Under Dataset, click Browse. A dialog appears listing the available databases and tables. Navigate to the table you want to use and click Select. The table schema appears.

  4. Click in the Prediction target field. A drop-down appears listing the columns shown in the schema. Select the column you want the model to predict.

  5. Click in the Time column field. A drop-down appears showing the dataset columns that are of type timestamp or date. Select the column containing the time periods for the time series.

  6. For multi-series forecasting, select the column(s) that identify the individual time series from the Time series identifiers drop-down. AutoML groups the data by these columns as different time series and trains a model for each series independently. If you leave this field blank, AutoML assumes that the dataset contains a single time series.

  7. In the Forecast horizon and frequency fields, specify the number of time periods into the future for which AutoML should calculate forecasted values. In the left box, enter the integer number of periods to forecast. In the right box, select the units.

  8. In Databricks Runtime 10.5 ML and above, you can save prediction results. To do so, specify a database in the Output Database field. Click Browse and select a database from the dialog. AutoML writes the prediction results to a table in this database.

  9. The Experiment name field shows the default name. To change it, type the new name in the field.

You can also specify additional configuration options.

Advanced configurations

Open the Advanced Configuration (optional) section to access these parameters.

  • The evaluation metric is the primary metric used to score the runs.

  • In Databricks Runtime 10.3 ML and above, you can exclude training frameworks from consideration. By default, AutoML trains models using frameworks listed under AutoML algorithms.

  • You can edit the stopping conditions. Default stopping conditions are:

    • For forecasting experiments, stop after 120 minutes.

    • For classification and regression experiments, stop after 60 minutes or after completing 200 trials, whichever happens sooner.

    • In Databricks Runtime 10.1 ML and above, for classification and regression experiments, AutoML incorporates early stopping; it stops training and tuning models if the validation metric is no longer improving.

  • In Databricks Runtime 10.1 ML and above, you can select a time column to split the data for training, validation, and testing in chronological order (applies only to classification and regression).

  • In the Data directory field, you can enter a DBFS location where the training dataset is saved. If you leave the field blank, the training dataset is saved as an MLflow artifact.

Modify the dataset

After you select a dataset, the table schema appears. For classification and regression problems only, you can specify which columns to include in training and select custom imputation methods.

Column selection

In Databricks Runtime 10.3 ML and above, you can specify which columns AutoML should use for training. To exclude a column, uncheck it in the Include column. Unchecking a column is equivalent to setting the exclude_columns parameter in the AutoML Python API.

You cannot drop the column selected as the prediction target or as the time column to split the data.

By default, all columns are included.

Imputation of missing values

In Databricks Runtime 10.4 LTS ML and above, you can specify how null values are imputed. In the UI, select a method from the drop-down in the Impute with column in the table schema. Or, use the imputers parameter in the AutoML Python API.

By default, AutoML selects an imputation method based on the column type and content.

Note

If you specify a non-default imputation method, AutoML does not perform semantic type detection.
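The strategy names mirror scikit-learn's SimpleImputer strategies, so their behavior can be illustrated with it directly (toy data; this is an analogy, not the AutoML-generated code):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25.0, np.nan, 47.0, np.nan]})

# "mean" replaces nulls with the column mean: (25 + 47) / 2 = 36.
imputed_mean = SimpleImputer(strategy="mean").fit_transform(df[["age"]])
print(imputed_mean.ravel().tolist())  # [25.0, 36.0, 47.0, 36.0]

# The {"strategy": "constant", "value": ...} form maps to a constant fill.
imputed_const = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(df[["age"]])
print(imputed_const.ravel().tolist())  # [25.0, 0.0, 47.0, 0.0]
```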

Run the experiment and monitor the results

To start the AutoML experiment, click Start AutoML. The experiment starts to run, and the AutoML training page appears. To refresh the runs table, click the Refresh button.

From this page, you can:

  • Stop the experiment at any time.

  • Open the data exploration notebook.

  • Monitor runs.

  • Navigate to the run page for any run.

With Databricks Runtime 10.1 ML and above, AutoML displays warnings for potential issues with the dataset, such as unsupported column types or high cardinality columns.

Note

Databricks makes a best effort to indicate potential errors or issues. However, these warnings may not be comprehensive and may not capture every issue you care about. Be sure to conduct your own reviews as well.

To see any warnings for the dataset, click the Warnings tab on the training page, or on the experiment page after the experiment has completed.


When the experiment completes, you can:

  • Register and deploy one of the models with MLflow.

  • Click View notebook for best model to review and edit the notebook that created the best model.

  • Click View data exploration notebook to open the data exploration notebook.

  • Search, filter, and sort the runs in the runs table.

  • See details for any run:

    • To open the notebook containing source code for a trial run, click in the Source column.

    • To view results of the run, click in the Models column or the Start Time column. The run page appears showing information about the trial run (such as parameters, metrics, and tags) and artifacts created by the run, including the model. This page also includes code snippets that you can use to make predictions with the model.

To return to this AutoML experiment later, find it in the table on the Experiments page. The results of each AutoML experiment, including the data exploration and training notebooks, are stored in a databricks_automl folder in the home folder of the user who ran the experiment.

Register and deploy a model from the AutoML UI

  1. Click the link in the Models column for the model to register. When a run completes, the best model (based on the primary metric) is the top row.

    The Artifacts section of the run page for the run that created the model appears.

  2. Click the Register Model button to register the model in Model Registry.

  3. Click Models in the sidebar to navigate to the Model Registry.

  4. Click the name of your model in the model table. The registered model page displays. From this page, you can serve the model.

AutoML Python API

  1. Create a notebook and attach it to a cluster running Databricks Runtime 8.3 ML or above.

  2. Load a Spark or pandas DataFrame from an existing data source or upload a data file to DBFS and load the data into the notebook.

    df = spark.read.format("parquet").load("<folder-path>")
    
  3. To start an AutoML run, pass the DataFrame to AutoML. See the API specification for details.

  4. When the AutoML run begins, an MLflow experiment URL appears in the console. Use this URL to monitor the progress of the run. Refresh the MLflow experiment to see the trials as they are completed.

  5. After the AutoML run completes:

    • Use the links in the output summary to navigate to the MLflow experiment or to the notebook that generated the best results.

    • Use the link to the data exploration notebook to get some insights into the data passed to AutoML. You can also attach this notebook to the same cluster and re-run the notebook to reproduce the results or do additional data analysis.

    • Use the summary object returned from the AutoML call to explore more details about the trials or to load a model trained by a given trial. See the API docs for details.

    • Clone any generated notebook from the trials and re-run the notebook by attaching it to the same cluster to reproduce the results. You can also make necessary edits and re-run them to train additional models and log them to the same experiment.
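Put together, steps 2 through 5 might look like the following sketch for a classification problem. The target column name is hypothetical, and this only runs on a cluster with Databricks Runtime ML attached:

```python
from databricks import automl

df = spark.read.format("parquet").load("<folder-path>")

# Start the run; primary_metric and timeout_minutes are optional.
summary = automl.classify(df, target_col="<target-column>", timeout_minutes=30)

# Inspect the best trial and load its model.
print(summary.best_trial.notebook_url)
model = summary.best_trial.load_model()
```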

Python API specification

The Python API provides functions to start classification and regression AutoML runs. Each function call trains a set of models and generates a trial notebook for each model.

Classification

Note

The max_trials parameter is deprecated in Databricks Runtime 10.3 ML and will be removed in the next major Databricks Runtime ML release. Use timeout_minutes to control the duration of an AutoML run.

databricks.automl.classify(
  dataset: Union[pyspark.DataFrame, pandas.DataFrame],
  *,
  target_col: str,
  data_dir: Optional[str] = None,
  exclude_columns: Optional[List[str]] = None,                      # <DBR> 10.3 ML and above
  exclude_frameworks: Optional[List[str]] = None,                   # <DBR> 10.3 ML and above
  experiment_dir: Optional[str] = None,                             # <DBR> 10.4 LTS ML and above
  imputers: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None, # <DBR> 10.4 LTS ML and above
  max_trials: Optional[int] = None,                                 # deprecated in <DBR> 10.3 ML
  primary_metric: str = "f1",
  time_col: Optional[str] = None,
  timeout_minutes: Optional[int] = None,
) -> AutoMLSummary

Regression

Note

The max_trials parameter is deprecated in Databricks Runtime 10.3 ML and will be removed in the next major Databricks Runtime ML release. Use timeout_minutes to control the duration of an AutoML run.

databricks.automl.regress(
  dataset: Union[pyspark.DataFrame, pandas.DataFrame],
  *,
  target_col: str,
  data_dir: Optional[str] = None,
  exclude_columns: Optional[List[str]] = None,                      # <DBR> 10.3 ML and above
  exclude_frameworks: Optional[List[str]] = None,                   # <DBR> 10.3 ML and above
  experiment_dir: Optional[str] = None,                             # <DBR> 10.4 LTS ML and above
  imputers: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None, # <DBR> 10.4 LTS ML and above
  max_trials: Optional[int] = None,                                 # deprecated in <DBR> 10.3 ML
  primary_metric: str = "r2",
  time_col: Optional[str] = None,
  timeout_minutes: Optional[int] = None,
) -> AutoMLSummary

Forecasting

databricks.automl.forecast(
  dataset: Union[pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame, pyspark.pandas.DataFrame],
  *,
  target_col: str,
  time_col: str,
  data_dir: Optional[str] = None,
  exclude_frameworks: Optional[List[str]] = None,
  experiment_dir: Optional[str] = None,
  frequency: str = "D",
  horizon: int = 1,
  identity_col: Optional[Union[str, List[str]]] = None,
  output_database: Optional[str] = None,                            # <DBR> 10.5 ML and above
  primary_metric: str = "smape",
  timeout_minutes: Optional[int] = None,
) -> AutoMLSummary

Parameters

Classification and regression

  • dataset (pyspark.DataFrame or pandas.DataFrame): Input DataFrame that contains training features and target.

  • target_col (str): Column name for the target label.

  • data_dir (str of format dbfs:/<folder-name>): (Optional) DBFS path used to store the training dataset. This path is visible to both driver and worker nodes. If empty, AutoML saves the training dataset as an MLflow artifact.

  • exclude_columns (List[str]): (Optional) List of columns to ignore during AutoML calculations. Default: [].

  • exclude_frameworks (List[str]): (Optional) List of algorithm frameworks that AutoML should not consider as it develops models. Possible values: empty list, or one or more of "sklearn", "lightgbm", "xgboost". Default: [] (all frameworks are considered).

  • experiment_dir (str): (Optional) Path to the directory in the workspace to save the generated notebooks and experiments. Default: /Users/<username>/databricks_automl/.

  • imputers (Dict[str, Union[str, Dict[str, Any]]]): (Optional) Dictionary where each key is a column name, and each value is a string or dictionary describing the imputation strategy. If specified as a string, the value must be one of "mean", "median", or "most_frequent". To impute with a known value, specify the value as a dictionary {"strategy": "constant", "value": <desired value>}. You can also specify string options as dictionaries, for example {"strategy": "mean"}. If no imputation strategy is provided for a column, AutoML selects a default strategy. Default: {}.

  • max_trials (int): (Optional) Maximum number of trials to run. This parameter is deprecated in Databricks Runtime 10.3 ML and will be removed in the next major Databricks Runtime ML release. If timeout_minutes=None, AutoML runs the maximum number of trials. Default: 20.

  • primary_metric (str): Metric used to evaluate and rank model performance. Supported metrics for regression: "r2" (default), "mae", "rmse", "mse". Supported metrics for classification: "f1" (default), "log_loss", "precision", "accuracy", "roc_auc".

  • time_col (str): Available in Databricks Runtime 10.1 ML and above. (Optional) Column name for a time column. If provided, AutoML tries to split the dataset into training, validation, and test sets chronologically, using the earliest points as training data and the latest points as a test set. Accepted column types are timestamp and integer. With Databricks Runtime 10.2 ML and above, string columns are also supported. If the column type is string, AutoML tries to convert it to timestamp using semantic detection. If the conversion fails, the AutoML run fails.

  • timeout_minutes (int): (Optional) Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy. Default: None (no time limit). Minimum value: 5 minutes. An error is reported if the timeout is too short to allow at least one trial to complete.

Forecasting

  • dataset (pyspark.DataFrame or pandas.DataFrame): Input DataFrame that contains training features and target.

  • target_col (str): Column name for the target label.

  • time_col (str): Name of the time column for forecasting.

  • frequency (str): Frequency of the time series for forecasting. This is the period with which events are expected to occur. The default setting is "D", or daily data. Be sure to change the setting if your data has a different frequency. Possible values: "W" (weeks); "D" / "days" / "day"; "hours" / "hour" / "hr" / "h"; "m" / "minute" / "min" / "minutes" / "T"; "S" / "seconds" / "sec" / "second". Default: "D".

  • horizon (int): Number of periods into the future for which forecasts should be returned. The units are the time series frequency. Default: 1.

  • data_dir (str of format dbfs:/<folder-name>): (Optional) DBFS path used to store the training dataset. This path is visible to both driver and worker nodes. If empty, AutoML saves the training dataset as an MLflow artifact.

  • exclude_frameworks (List[str]): (Optional) List of algorithm frameworks that AutoML should not consider as it develops models. Possible values: empty list, or one or more of "prophet", "arima". Default: [] (all frameworks are considered).

  • experiment_dir (str): (Optional) Path to the directory in the workspace to save the generated notebooks and experiments. Default: /Users/<username>/databricks_automl/.

  • identity_col (Union[str, list]): (Optional) Column(s) that identify the time series for multi-series forecasting. AutoML groups by these column(s) and the time column for forecasting.

  • output_database (str): (Optional) If provided, AutoML saves predictions of the best model to a new table in the specified database. Default: predictions are not saved.

  • primary_metric (str): Metric used to evaluate and rank model performance. Supported metrics: "smape" (default), "mse", "rmse", "mae", or "mdape".

  • timeout_minutes (int): (Optional) Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy. Default: None (no time limit). Minimum value: 5 minutes. An error is reported if the timeout is too short to allow at least one trial to complete.

Returns

AutoMLSummary

Summary object for an AutoML run that describes the metrics, parameters, and other details for each of the trials. You also use this object to load the model trained by a specific trial.

  • experiment (mlflow.entities.Experiment): The MLflow experiment used to log the trials.

  • trials (List[TrialInfo]): A list containing information about all the trials that were run.

  • best_trial (TrialInfo): Info about the trial that resulted in the best weighted score for the primary metric.

  • metric_distribution (str): The distribution of weighted scores for the primary metric across all trials.

  • output_table_name (str): Used with forecasting only, and only if output_database is provided. Name of the table in output_database containing the model's predictions.

TrialInfo

Summary object for each individual trial.

  • notebook_path (str): The path to the generated notebook for this trial in the workspace.

  • notebook_url (str): The URL of the generated notebook for this trial.

  • mlflow_run_id (str): The MLflow run ID associated with this trial run.

  • metrics (Dict[str, float]): The metrics logged in MLflow for this trial.

  • params (Dict[str, str]): The parameters logged in MLflow that were used for this trial.

  • model_path (str): The MLflow artifact URL of the model trained in this trial.

  • model_description (str): Short description of the model and the hyperparameters used for training this model.

  • duration (str): Training duration in minutes.

  • preprocessors (str): Description of the preprocessors run before training the model.

  • evaluation_metric_score (float): Score of the primary metric, evaluated for the validation dataset.

TrialInfo also provides this method:

  • load_model(): Load the model generated in this trial, logged as an MLflow artifact.

API examples

Review these notebooks to get started with AutoML.

AutoML classification example notebook


AutoML regression example notebook


AutoML forecasting example notebook


databricks-automl-runtime package

With Databricks Runtime 9.1 LTS ML and above, AutoML depends on the databricks-automl-runtime package, which contains components that are useful outside of AutoML, and also helps simplify the notebooks generated by AutoML training. databricks-automl-runtime is available on PyPI.

Limitations

  • Only the following feature types are supported:

    • Numeric (ByteType, ShortType, IntegerType, LongType, FloatType, and DoubleType)

    • Boolean

    • String (categorical or English text)

    • Timestamps (TimestampType, DateType)

    • ArrayType[Numeric] (Databricks Runtime 10.4 LTS ML and above)

  • Feature types not listed above are not supported. For example, images are not supported.

  • Datasets that have multiple columns with the same name are not supported.

  • With Databricks Runtime 9.0 ML and below, AutoML training uses the full training dataset on a single node. The training dataset must fit into the memory of a single worker node. If you run into out-of-memory issues, try using a worker node with more memory. See Create a cluster.

    Alternately, if possible, use Databricks Runtime 9.1 LTS ML or above, where AutoML automatically samples your dataset if it is too large to fit into the memory of a single worker node.

  • To use Auto-ARIMA, the time series must have a regular frequency (that is, the interval between any two points must be the same throughout the time series). The frequency must match the frequency unit specified in the API call or in the AutoML UI. AutoML handles missing time steps by filling in those values with the previous value.
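The gap-filling behavior corresponds to a forward fill over a regular time index, as in this pandas sketch (toy data):

```python
import pandas as pd

s = pd.Series(
    [1.0, 3.0],
    index=pd.to_datetime(["2021-01-01", "2021-01-03"]),  # Jan 2 is missing
)

# Reindex to the declared daily frequency, then fill each missing step
# with the previous value.
filled = s.asfreq("D").ffill()
print(filled.tolist())  # [1.0, 1.0, 3.0]
```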