Databricks AutoML

Preview

This feature is in Public Preview.

Databricks AutoML helps you automatically apply machine learning to a dataset. It prepares the dataset for model training and then performs and records a set of trials, creating, tuning, and evaluating multiple models. It displays the results and provides a Python notebook with the source code for each trial run so you can review, reproduce, and modify the code. AutoML also calculates summary statistics on your dataset and saves this information in a notebook that you can review later.

Each model is constructed from open source components, such as scikit-learn and XGBoost, and can easily be edited and integrated into your machine learning pipelines.

You can run AutoML using either the UI or the Python API.

Requirements

  • Databricks Runtime 8.3 ML or above.
  • No additional libraries other than those provided with Databricks Runtime ML can be installed on the cluster.

AutoML UI

The AutoML UI steps you through the process of training a model on a dataset. To access the UI:

  1. Select Machine Learning from the persona switcher at the top of the left sidebar.

  2. In the sidebar, click Create > AutoML.

    You can also create a new AutoML experiment from the Experiments page.

    The Configure AutoML experiment page displays. On this page, you configure the AutoML process, specifying the dataset, problem type, target or label column to predict, metric to use to evaluate and score the experiment runs, and stopping conditions.

  3. In the Cluster field, select a cluster running Databricks Runtime 8.3 ML or above.

  4. From the ML problem type drop-down menu, select Regression or Classification. If you are trying to predict a continuous numeric value for each observation, such as annual income, select regression. If you are trying to assign each observation to one of a discrete set of classes, such as good credit risk or bad credit risk, select classification.

  5. Under Dataset, click Browse Tables. A dialog appears listing the available databases and tables. Navigate to the table you want to use and click Select. The table schema appears.

  6. Click in the Prediction target field. A drop-down appears listing the columns shown in the schema. Select the column you want the model to predict.

  7. The Experiment name field shows the default name. To change it, type the new name in the field.

  8. You can specify additional configuration options under Advanced configuration (optional).

    • The evaluation metric is the primary metric used to score the runs.
    • You can edit the default stopping conditions. By default, the experiment stops after 60 minutes or when it has completed 200 runs, whichever comes first.
    • In the Data directory field, you can enter a DBFS location where notebooks generated during the AutoML process are saved. If you leave the field blank, notebooks are saved as MLflow artifacts.

  9. Click Start AutoML. The experiment starts to run, and the AutoML training page appears. To refresh the runs table, click the Refresh button.

    From this page, you can:

    • Stop the experiment at any time
    • Open the data exploration notebook
    • Monitor runs
    • Navigate to the run page for any run

When the experiment completes, you can:

  • Register and deploy one of the models with MLflow.
  • Click Edit best model to review and edit the notebook that created the best model.
  • Open the data exploration notebook.
  • Search, filter, and sort the runs in the runs table.
  • See details for any run:
    • To open the notebook containing source code for a trial run, click in the Source column
    • To view the run page with details about a trial run, click in the Start Time column
    • To see information about the model that was created, including code snippets to make predictions, click in the Models column

To return to this AutoML experiment later, find it in the table on the Experiments page.

Register and deploy a model from the AutoML UI

  1. Click the link in the Models column for the model to register. When a run completes, the best model (based on the primary metric) is in the top row.

    The Artifacts section of the run page for the run that created the model is displayed.

  2. Click the Register Model button to register the model in the Model Registry.

  3. Click Models in the sidebar to navigate to the Model Registry.

  4. Click the name of your model in the model table. The registered model page displays. From this page, you can serve the model.

AutoML Python API

  1. Create a notebook and attach it to a cluster running Databricks Runtime 8.3 ML or above.

  2. Load a Spark or pandas DataFrame from an existing data source or upload a data file to DBFS and load the data into the notebook.

    df = spark.read.parquet("<folder-path>")
    
  3. To start an AutoML run, pass the DataFrame to AutoML. See the API docs for details.
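
    For example, a classification run can be started with a single call. The target column placeholder below follows the same convention as the Parquet path above; substitute a column from your own data.

    import databricks.automl

    # Start an AutoML classification run on the DataFrame loaded in the previous step.
    summary = databricks.automl.classify(df, target_col="<target-column>")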

  4. When the AutoML run begins, an MLflow experiment URL appears in the console. Use this URL to monitor the progress of the run. Refresh the MLflow experiment to see the trials as they are completed.

  5. After the AutoML run completes:

    • Use the links in the output summary to navigate to the MLflow experiment or to the notebook that generated the best results.
    • Use the link to the data exploration notebook to get some insights into the data passed to AutoML. You can also attach this notebook to the same cluster and re-run the notebook to reproduce the results or do additional data analysis.
    • Use the summary object returned from the AutoML call to explore more details about the trials or to load a model trained by a given trial; a brief sketch follows this list. See the API docs for details.
    • Clone any generated notebook from the trials and re-run the notebook by attaching it to the same cluster to reproduce the results. You can also make edits and re-run the notebook to train additional models and log them to the same experiment.
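
As a brief sketch, the summary object can be inspected and used to load the best model; the property and method names used here are those documented in the Python API specification below.

print(summary.best_trial.metrics)        # metrics logged in MLflow for the best trial
print(summary.best_trial.notebook_url)   # URL of the notebook that trained the best model

model = summary.best_trial.load_model()  # load the best model, logged as an MLflow artifact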

Python API specification

The Python API provides functions to start classification and regression AutoML runs. Each function call trains a set of models and generates a trial notebook for each model.

Classification

databricks.automl.classify(
  dataset: Union[pyspark.DataFrame, pandas.DataFrame],
  *,
  target_col: str,
  primary_metric: Optional[str],
  timeout_minutes: Optional[int],
  max_trials: Optional[int]
) -> AutoMLSummary

Regression

databricks.automl.regress(
  dataset: Union[pyspark.DataFrame, pandas.DataFrame],
  *,
  target_col: str,
  primary_metric: Optional[str],
  timeout_minutes: Optional[int],
  max_trials: Optional[int]
) -> AutoMLSummary
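
For example, a regression run with the optional arguments set explicitly might look like the following; the table and column names are placeholders.

import databricks.automl

# Placeholder table and target column; substitute your own.
df = spark.table("<database-name>.<table-name>")

summary = databricks.automl.regress(
  df,
  target_col="<target-column>",
  primary_metric="rmse",   # one of "r2" (default), "mae", "rmse", "mse"
  timeout_minutes=30,      # wait at most 30 minutes for trials to complete
  max_trials=20            # run at most 20 trials (the default)
)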

Parameters

  • dataset (pyspark.DataFrame or pandas.DataFrame): Input DataFrame that contains training features and the target.
  • target_col (str): Column name for the target label.
  • primary_metric (str): Metric used to evaluate and rank model performance. Supported metrics for regression: "r2" (default), "mae", "rmse", "mse". Supported metrics for classification: "f1" (default), "log_loss", "precision", "accuracy", "roc_auc".
  • timeout_minutes (int): Optional. Maximum time to wait for AutoML trials to complete. If omitted, AutoML runs trials without any time restriction (default). An exception is thrown if the specified timeout is less than 5 minutes or is not enough to run at least one trial. Longer timeouts allow AutoML to run more trials and provide a model with better accuracy.
  • max_trials (int): Optional. Maximum number of trials to run. The default value is 20. When timeout_minutes is None, the maximum number of trials run to completion.

Returns

AutoMLSummary

Summary object for an AutoML classification or regression run that describes the metrics, parameters, and other details for each of the trials. You can also use this object to load the model trained by a specific trial.

Properties:

  • experiment (mlflow.entities.Experiment): The MLflow experiment used to log the trials.
  • trials (List[TrialInfo]): A list containing information about all the trials that were run.
  • best_trial (TrialInfo): Information about the trial that resulted in the best weighted score for the primary metric.
  • metric_distribution (str): The distribution of weighted scores for the primary metric across all trials.

TrialInfo

Summary object for each individual trial.

Properties:

  • notebook_path (str): The path to the generated notebook for this trial in the workspace.
  • notebook_url (str): The URL of the generated notebook for this trial.
  • mlflow_run_id (str): The MLflow run ID associated with this trial run.
  • metrics (Dict[str, float]): The metrics logged in MLflow for this trial.
  • params (Dict[str, str]): The parameters logged in MLflow that were used for this trial.
  • model_path (str): The MLflow artifact URL of the model trained in this trial.
  • model_description (str): Short description of the model and the hyperparameters used for training this model.
  • duration (str): Training duration in minutes.
  • preprocessors (str): Description of the preprocessors run before training the model.
  • evaluation_metric_score (float): Score of the primary metric, evaluated for the validation dataset.

Methods:

  • load_model(): Load the model generated in this trial, logged as an MLflow artifact.
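
A sketch of how a trial's trained model might be used for inference. The prediction input pdf is a placeholder for a pandas DataFrame with the same feature columns used in training, and passing model_path to mlflow.pyfunc.load_model assumes the artifact URL is a valid MLflow model URI.

import mlflow

best = summary.best_trial

# Option 1: load the trained model directly from the trial summary.
model = best.load_model()

# Option 2: load the model through MLflow from the artifact URL recorded for the trial
# (assumes model_path is a model URI accepted by mlflow.pyfunc.load_model).
pyfunc_model = mlflow.pyfunc.load_model(best.model_path)

# `pdf` is a placeholder pandas DataFrame of feature columns.
predictions = pyfunc_model.predict(pdf)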

API examples

Review these notebooks to get started with AutoML.

AutoML classification example notebook

AutoML regression example notebook

Known limitations

  • Only classification and regression are supported
  • Only the following feature types are supported:
    • Numeric (ByteType, ShortType, IntegerType, LongType, FloatType, and DoubleType)
    • Boolean
    • String (only categorical)
    • Timestamps (TimestampType, DateType)
  • The following feature types are not supported:
    • Images
    • Text
  • The data_dir parameter in the Python API, which specifies the DBFS path used to store intermediate data, is not supported in Databricks on Google Cloud. Instead, AutoML saves all intermediate data as MLflow artifacts.
  • Likewise, any information that you enter in the Data directory field in the UI is ignored. Notebooks are always saved as MLflow artifacts.