Databricks AutoML

Preview

This feature is in Public Preview.

Databricks AutoML helps you automatically apply machine learning to a dataset. It prepares the dataset for model training and then performs and records a set of trials, creating, tuning, and evaluating multiple models. It displays the results and provides a Python notebook with the source code for each trial run so you can review, reproduce, and modify the code. AutoML also calculates summary statistics on your dataset and saves this information in a notebook that you can review later.

AutoML automatically distributes hyperparameter tuning trials across the worker nodes of a cluster.

Each model is constructed from open source components and can easily be edited and integrated into your machine learning pipelines. You can use Databricks AutoML for regression, classification, and forecasting problems. It evaluates models based on algorithms from the scikit-learn, xgboost, and LightGBM packages.

You can run AutoML using either the UI or the Python API.

Requirements

  • Databricks Runtime 8.3 ML or above.
  • For time series forecasting, Databricks Runtime 10.0 ML or above.
  • No libraries other than those provided with Databricks Runtime ML can be installed on the cluster.

AutoML algorithms

Databricks AutoML creates and evaluates models based on algorithms from the scikit-learn, xgboost, and LightGBM packages.

Sampling large datasets

Note

Sampling is not applied to forecasting problems.

While AutoML distributes hyperparameter tuning trials across the worker nodes of a cluster, each model is trained on a single worker node. With Databricks Runtime 9.1 LTS ML and above, AutoML automatically estimates the memory required to load and train your dataset and samples the dataset if necessary. The sampling fraction does not depend on the cluster’s node type or the amount of memory on each node. The sampled dataset is used for model training.

For classification problems, AutoML uses the PySpark sampleBy method for stratified sampling to preserve the target label distribution.

For regression problems, AutoML uses the PySpark sample method.
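
For illustration, here is a minimal sketch of those two PySpark primitives (the label column name, fractions, and seed are hypothetical; AutoML chooses its own sampling fraction):

# Stratified sampling (classification): sample each label value at a fixed
# fraction so the target label distribution is preserved.
fractions = {row["label"]: 0.5 for row in df.select("label").distinct().collect()}
stratified_df = df.sampleBy("label", fractions=fractions, seed=42)

# Uniform random sampling (regression).
sampled_df = df.sample(fraction=0.5, seed=42)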

Semantic type detection

Note

Semantic type detection is not applied to forecasting problems.

With Databricks Runtime 9.1 LTS ML and above, AutoML tries to detect whether columns have a semantic type that is different from the Spark or pandas data type in the table schema, and treats these columns as the detected semantic type. These detections are best effort and might not detect semantic types in some cases. You can also manually set the semantic type of a column, or tell AutoML not to apply semantic type detection to a column, using annotations.

Specifically, AutoML makes these adjustments:

  • String and integer columns that represent date or timestamp data are treated as a timestamp type.
  • String columns that represent numeric data are treated as a numeric type.

With Databricks Runtime 10.1 ML and above, AutoML also makes these adjustments:

  • Numeric columns that contain categorical IDs are treated as a categorical feature.
  • String columns that contain English text are treated as a text feature.

Semantic type annotations

With Databricks Runtime 10.1 ML and above, you can manually control the assigned semantic type by placing a semantic type annotation on a column. To manually annotate the semantic type of column <column_name> as <semantic_type>, use the following syntax:

# Read the column's existing metadata, add the semantic type annotation,
# and attach the updated metadata back to the column.
metadata_dict = df.schema["<column_name>"].metadata
metadata_dict["spark.contentAnnotation.semanticType"] = "<semantic_type>"
df = df.withMetadata("<column_name>", metadata_dict)

<semantic_type> can be one of the following:

  • categorical: The column contains categorical values (for example, numerical values that should be treated as IDs).
  • numeric: The column contains numeric values (for example, string values that can be parsed into numbers).
  • datetime: The column contains timestamp values (string, numerical, or date values that can be converted into timestamps).
  • text: The string column contains English text.

To disable semantic type detection on a column, use the special keyword annotation native.
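
For example, the same annotation syntax turns detection off for a column:

metadata_dict = df.schema["<column_name>"].metadata
metadata_dict["spark.contentAnnotation.semanticType"] = "native"
df = df.withMetadata("<column_name>", metadata_dict)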

Control the train/validation/test split

With Databricks Runtime 10.1 ML and above, you can specify a time column to use for the training/validation/testing split for classification and regression problems. If you specify this column, the dataset is split into training, validation, and test sets by time. The earliest points are used for training, the next earliest for validation, and the latest points are used as a test set.

In Databricks Runtime 10.1 ML, the time column must be a timestamp or integer column. In Databricks Runtime 10.2 ML and above, the API also accepts a string column.
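
For example, a minimal sketch of a classification call that splits chronologically on a time column (the column names are placeholders; see the API specification below):

import databricks.automl

summary = databricks.automl.classify(
  df,
  target_col="<target-column>",
  time_col="<time-column>",
  timeout_minutes=30,
)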

AutoML UI

The AutoML UI steps you through the process of training a model on a dataset. To access the UI:

  1. Select Machine Learning from the persona switcher at the top of the left sidebar.

  2. In the sidebar, click Create > AutoML.

    You can also create a new AutoML experiment from the Experiments page.

    The Configure AutoML experiment page displays. On this page, you configure the AutoML process, specifying the dataset, problem type, target or label column to predict, metric to use to evaluate and score the experiment runs, and stopping conditions.

Set up classification or regression problems in the UI

  1. In the Compute field, select a cluster running Databricks Runtime 8.3 ML or above.
  2. From the ML problem type drop-down menu, select Regression or Classification. If you are trying to predict a continuous numeric value for each observation, such as annual income, select Regression. If you are trying to assign each observation to one of a discrete set of classes, such as good credit risk or bad credit risk, select Classification.
  3. Under Dataset, click Browse. A dialog appears listing the available databases and tables. Navigate to the table you want to use and click Select. The table schema appears.
  4. Click in the Prediction target field. A drop-down appears listing the columns shown in the schema. Select the column you want the model to predict.
  5. The Experiment name field shows the default name. To change it, type the new name in the field.
  6. You can specify additional configuration options under Advanced configuration (optional).
    • The evaluation metric is the primary metric used to score the runs.
    • You can edit the default stopping conditions. By default, the experiment stops after 60 minutes or when it has completed 200 runs, whichever comes first.
    • In Databricks Runtime 10.1 ML and above, you can enter a time column to split the data for training, validation, and testing in chronological order.
    • In the Data directory field, you can enter a DBFS location where notebooks generated during the AutoML process are saved. If you leave the field blank, notebooks are saved as MLflow artifacts.

Set up forecasting problems in the UI

  1. In the Compute field, select a cluster running Databricks Runtime 10.0 ML or above.
  2. From the ML problem type drop-down menu, select Forecasting.
  3. Under Dataset, click Browse. A dialog appears listing the available databases and tables. Navigate to the table you want to use and click Select. The table schema appears.
  4. Click in the Prediction target field. A drop-down appears listing the columns shown in the schema. Select the column you want the model to predict.
  5. Click in the Time column field. A drop-down appears showing the dataset columns that are of type timestamp or date. Select the column containing the time periods for the time series.
  6. For multi-series forecasting, select the column(s) that identify the individual time series from the Time series identifiers drop-down. AutoML groups the data by these columns as different time series and trains a model for each series independently. If you leave this field blank, AutoML assumes that the dataset contains a single time series.
  7. In the Forecast horizon and frequency fields, specify the number of time periods into the future for which AutoML should calculate forecasted values. In the left box, enter the integer number of periods to forecast. In the right box, select the units.
  8. The Experiment name field shows the default name. To change it, type the new name in the field.
  9. You can specify additional configuration options under Advanced configuration (optional).
    • The evaluation metric is the primary metric used to score the runs.
    • You can edit the default stopping condition. By default, the experiment stops after 120 minutes.
    • In the Data directory field, you can enter a DBFS location where notebooks generated during the AutoML process are saved. If you leave the field blank, notebooks are saved as MLflow artifacts.

Run the experiment and monitor the results

To start the AutoML experiment, click Start AutoML. The experiment starts to run, and the AutoML training page appears. To refresh the runs table, click the Refresh button.

From this page, you can:

  • Stop the experiment at any time.
  • Open the data exploration notebook.
  • Monitor runs.
  • Navigate to the run page for any run.

With Databricks Runtime 10.1 ML and above, AutoML displays alerts for potential issues with the dataset, such as unsupported column types or high cardinality columns.

Note

Databricks makes a best effort to indicate potential errors or issues. However, these alerts may not be comprehensive and may not capture every issue or error you are looking for. Make sure to conduct your own reviews as well.

To see any alerts for the dataset, click the Alerts tab on the training page, or on the experiment page after the experiment has completed.


When the experiment completes, you can:

  • Register and deploy one of the models with MLflow.
  • Click Edit best model to review and edit the notebook that created the best model.
  • Open the data exploration notebook.
  • Search, filter, and sort the runs in the runs table.
  • See details for any run:
    • To open the notebook containing source code for a trial run, click in the Source column.
    • To view the run page with details about a trial run, click in the Start Time column.
    • To see information about the model that was created, including code snippets to make predictions, click in the Models column.

To return to this AutoML experiment later, find it in the table on the Experiments page.

Register and deploy a model from the AutoML UI

  1. Click the link in the Models column for the model you want to register. When a run completes, the best model (based on the primary metric) appears in the top row.

    The artifacts section of the corresponding run page displays.

  2. Click the Register Model button to register the model in Model Registry.

  3. Click Models in the sidebar to navigate to the Model Registry.

  4. Click the name of your model in the model table. The registered model page displays. From this page, you can serve the model.

AutoML Python API

  1. Create a notebook and attach it to a cluster running Databricks Runtime 8.3 ML or above.

  2. Load a Spark or pandas DataFrame from an existing data source or upload a data file to DBFS and load the data into the notebook.

    df = spark.read.parquet("<folder-path>")
    
  3. To start an AutoML run, pass the DataFrame to AutoML. See the API specification for details.
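
    For example, a minimal sketch (the target column name is a placeholder):

    import databricks.automl

    summary = databricks.automl.regress(df, target_col="<target-column>")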

  4. When the AutoML run begins, an MLflow experiment URL appears in the console. Use this URL to monitor the progress of the run. Refresh the MLflow experiment to see the trials as they are completed.

  5. After the AutoML run completes:

    • Use the links in the output summary to navigate to the MLflow experiment or to the notebook that generated the best results.
    • Use the link to the data exploration notebook to get some insights into the data passed to AutoML. You can also attach this notebook to the same cluster and re-run the notebook to reproduce the results or do additional data analysis.
    • Use the summary object returned from the AutoML call to explore more details about the trials or to load a model trained by a given trial. See the API docs for details.
    • Clone any generated notebook from the trials and re-run the notebook by attaching it to the same cluster to reproduce the results. You can also make necessary edits and re-run them to train additional models and log them to the same experiment.

Python API specification

The Python API provides functions to start classification, regression, and forecasting AutoML runs. Each function call trains a set of models and generates a trial notebook for each model.

Classification

databricks.automl.classify(
  dataset: Union[pyspark.DataFrame, pandas.DataFrame],
  *,
  target_col: str,
  primary_metric: Optional[str],
  data_dir: Optional[str],
  timeout_minutes: Optional[int],
  max_trials: Optional[int],
  time_col: Optional[str] = None
) -> AutoMLSummary

Regression

databricks.automl.regress(
  dataset: Union[pyspark.DataFrame, pandas.DataFrame],
  *,
  target_col: str,
  primary_metric: Optional[str],
  data_dir: Optional[str],
  timeout_minutes: Optional[int],
  max_trials: Optional[int],
  time_col: Optional[str] = None
) -> AutoMLSummary

Forecasting

databricks.automl.forecast(
  dataset: Union[pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame, pyspark.pandas.DataFrame],
  *,
  target_col: str,
  time_col: str,
  identity_col: Union[str, List[str], NoneType],
  horizon: int,
  frequency: str,
  data_dir: Union[str, NoneType],
  primary_metric: str,
  timeout_minutes: int
) -> AutoMLSummary

Parameters

Classification and regression

  • dataset (pyspark.DataFrame or pandas.DataFrame): Input DataFrame that contains training features and target.
  • primary_metric (str): Metric used to evaluate and rank model performance. Supported metrics for regression: “r2” (default), “mae”, “rmse”, “mse”. Supported metrics for classification: “f1” (default), “log_loss”, “precision”, “accuracy”, “roc_auc”.
  • target_col (str): Column name for the target label.
  • data_dir (str of format dbfs:/<folder-name>): DBFS path used to store intermediate data. This path is visible to both driver and worker nodes. If empty, AutoML saves intermediate data as MLflow artifacts.
  • timeout_minutes (int, optional): Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy. Default: None (no time limit). Minimum value: 5 minutes. An error is reported if the timeout is too short to allow at least one trial to complete.
  • max_trials (int, optional): Maximum number of trials to run. Default: 20. If timeout_minutes=None, AutoML runs the maximum number of trials.
  • time_col (str, optional): Available in Databricks Runtime 10.1 ML and above. Column name for a time column. If provided, AutoML tries to split the dataset into training, validation, and test sets chronologically, using the earliest points as training data and the latest points as a test set. Accepted column types are timestamp and integer. With Databricks Runtime 10.2 ML and above, string columns are also supported; if the column type is string, AutoML tries to convert it to timestamp using semantic detection, and the AutoML run fails if the conversion fails.

Forecasting

  • dataset (pyspark.DataFrame or pandas.DataFrame): Input DataFrame that contains training features and target.
  • primary_metric (str): Metric used to evaluate and rank model performance. Supported metrics: “smape” (default), “mse”, “rmse”, “mae”, or “mdape”.
  • target_col (str): Column name for the target label.
  • time_col (str): Name of the time column for forecasting.
  • identity_col (str or list, optional): Column(s) that identify the time series for multi-series forecasting. AutoML groups by these column(s) and the time column for forecasting.
  • frequency (str): Frequency of the time series for forecasting. This is the period with which events are expected to occur. Possible values: “W” (weeks); “D” / “days” / “day”; “hours” / “hour” / “hr” / “h”; “m” / “minute” / “min” / “minutes” / “T”; “S” / “seconds” / “sec” / “second”.
  • horizon (int): Number of periods into the future for which forecasts should be returned. The units are the time series frequency.
  • data_dir (str of format dbfs:/<folder-name>): DBFS path used to store intermediate data. This path is visible to both driver and worker nodes. If empty, AutoML saves intermediate data as MLflow artifacts.
  • timeout_minutes (int, optional): Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy. Default: None (no time limit). Minimum value: 5 minutes. An error is reported if the timeout is too short to allow at least one trial to complete.
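
For example, a minimal sketch of a forecasting call based on the signature above (the column names and values are placeholders):

import databricks.automl

summary = databricks.automl.forecast(
  df,
  target_col="<target-column>",
  time_col="<time-column>",
  horizon=30,       # forecast 30 periods ahead
  frequency="D",    # daily data
  primary_metric="smape",
)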

Returns

AutoMLSummary

Summary object for an AutoML run that describes the metrics, parameters, and other details for each of the trials. You also use this object to load the model trained by a specific trial.

  • experiment (mlflow.entities.Experiment): The MLflow experiment used to log the trials.
  • trials (List[TrialInfo]): A list containing information about all the trials that were run.
  • best_trial (TrialInfo): Info about the trial that resulted in the best weighted score for the primary metric.
  • metric_distribution (str): The distribution of weighted scores for the primary metric across all trials.

TrialInfo

Summary object for each individual trial.

Properties:

  • notebook_path (str): The path to the generated notebook for this trial in the workspace.
  • notebook_url (str): The URL of the generated notebook for this trial.
  • mlflow_run_id (str): The MLflow run ID associated with this trial run.
  • metrics (Dict[str, float]): The metrics logged in MLflow for this trial.
  • params (Dict[str, str]): The parameters logged in MLflow that were used for this trial.
  • model_path (str): The MLflow artifact URL of the model trained in this trial.
  • model_description (str): Short description of the model and the hyperparameters used for training this model.
  • duration (str): Training duration in minutes.
  • preprocessors (str): Description of the preprocessors run before training the model.
  • evaluation_metric_score (float): Score of the primary metric, evaluated for the validation dataset.

Method:

  • load_model(): Load the model generated in this trial, logged as an MLflow artifact.
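
For example, a minimal sketch of working with these objects, assuming summary is the value returned by an AutoML call and the trained model exposes a scikit-learn-style predict method (as models from the packages AutoML uses typically do):

# Inspect the best trial.
print(summary.best_trial.model_description)
print(summary.best_trial.metrics)

# Load the best model from its MLflow artifact and generate predictions.
# test_features is a hypothetical pandas DataFrame of feature columns.
model = summary.best_trial.load_model()
predictions = model.predict(test_features)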

API examples

Review these notebooks to get started with AutoML.

  • AutoML classification example notebook
  • AutoML regression example notebook
  • AutoML forecasting example notebook

databricks-automl-runtime package

With Databricks Runtime 9.1 LTS ML and above, AutoML depends on the databricks-automl-runtime package, which contains components that are useful outside of AutoML, and also helps simplify the notebooks generated by AutoML training. databricks-automl-runtime is available on PyPI.
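
For example, to use these components outside of AutoML, you can install the package from PyPI in a notebook:

%pip install databricks-automl-runtime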

Limitations

  • Only the following feature types are supported:

    • Numeric (ByteType, ShortType, IntegerType, LongType, FloatType, and DoubleType)
    • Boolean
    • String (categorical or English text)
    • Timestamps (TimestampType, DateType)
  • Feature types not listed above are not supported. For example, images are not supported.

  • With Databricks Runtime 9.0 ML and below, AutoML training uses the full training dataset on a single node. The training dataset must fit into the memory of a single worker node. If you run into out-of-memory issues, try using a worker node with more memory. See Create a cluster.

    Alternatively, if possible, use Databricks Runtime 9.1 LTS ML or above, where AutoML automatically samples your dataset if it is too large to fit into the memory of a single worker node.