What is Mosaic AutoML?

Mosaic AutoML simplifies the process of applying machine learning to your datasets by automatically finding the best algorithm and hyperparameter configuration for you.

Provide your dataset and specify the type of machine learning problem, then AutoML does the following:

  1. Cleans and prepares your data.

  2. Orchestrates distributed model training and hyperparameter tuning across multiple algorithms.

  3. Evaluates models based on open source algorithms from scikit-learn, XGBoost, LightGBM, Prophet, and ARIMA, then identifies the best-performing model.

  4. Presents the results. AutoML also generates source code notebooks for each trial, allowing you to review, reproduce, and modify the code as needed.

Get started with AutoML experiments through a low-code UI or the Python API.
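
For example, here is a minimal sketch of starting a classification experiment with the Python API. The DataFrame name and target column below are placeholders, and automl.regress works analogously for regression problems:

```python
from databricks import automl

# Run an AutoML classification experiment on a Spark or pandas DataFrame.
# `train_df` and the target column "label" are placeholders for your own data.
summary = automl.classify(
    dataset=train_df,
    target_col="label",
    timeout_minutes=30,  # stop the experiment after roughly this many minutes
)

# The returned summary points at the best trial and its generated notebook.
print(summary.best_trial.metrics)
print(summary.best_trial.notebook_url)
```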

Requirements

  • Databricks Runtime 9.1 ML or above. For the general availability (GA) version, Databricks Runtime 10.4 LTS ML or above.

    • For time series forecasting, Databricks Runtime 10.0 ML or above.

    • With Databricks Runtime 9.1 LTS ML and above, AutoML depends on the databricks-automl-runtime package, which contains components that are useful outside of AutoML and also helps simplify the notebooks generated by AutoML training. databricks-automl-runtime is available on PyPI.

  • No additional libraries other than those preinstalled in Databricks Runtime for Machine Learning should be installed on the cluster.

    • Any modification (removal, upgrade, or downgrade) of existing library versions results in run failures due to incompatibility.

  • To access files in your workspace, you must have network ports 1017 and 1021 open for AutoML experiments. To open these ports or confirm they are open, review your cloud VPN firewall configuration and security group rules or contact your local cloud administrator. For additional information on workspace configuration and deployment, see Create a workspace.

  • Use a compute resource with a supported compute access mode. Not all compute access modes have access to Unity Catalog:

    | Compute access mode | AutoML support | Unity Catalog support |
    | --- | --- | --- |
    | Single user | Supported (must be the designated single user for the cluster) | Supported |
    | Shared access mode | Unsupported | Unsupported |
    | No isolation shared | Supported | Unsupported |

AutoML algorithms

Mosaic AutoML trains and evaluates models based on the algorithms in the following table.

Note

For classification and regression models, the decision tree, random forests, logistic regression, and linear regression with stochastic gradient descent algorithms are based on scikit-learn.

| Classification models | Regression models | Forecasting models |
| --- | --- | --- |
| Decision trees | Decision trees | Prophet |
| Random forests | Random forests | Auto-ARIMA (available in Databricks Runtime 10.3 ML and above) |
| Logistic regression | Linear regression with stochastic gradient descent | |
| XGBoost | XGBoost | |
| LightGBM | LightGBM | |
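
For time series problems, a minimal sketch of a forecasting experiment with the Python API is shown below. The DataFrame and column names are placeholders, and on supported runtime versions AutoML tries both Prophet and Auto-ARIMA models:

```python
from databricks import automl

# Sketch of a forecasting experiment. Names below are placeholders for your data.
summary = automl.forecast(
    dataset=sales_df,         # DataFrame containing a time column and a target column
    target_col="units_sold",  # value to forecast
    time_col="date",          # timestamp or date column
    frequency="D",            # granularity of the series (daily here)
    horizon=30,               # number of periods to forecast
    timeout_minutes=30,
)
print(summary.best_trial.metrics)
```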

Trial notebook generation

AutoML generates notebooks of the source code behind trials so you can review, reproduce, and modify the code as needed.

For forecasting experiments, AutoML-generated notebooks are automatically imported to your workspace for all trials of your experiment.

For classification and regression experiments, AutoML-generated notebooks for data exploration and the best trial in your experiment are automatically imported to your workspace. Generated notebooks for other experiment trials are saved as MLflow artifacts on DBFS instead of auto-imported into your workspace. For all trials besides the best trial, the notebook_path and notebook_url in the TrialInfo Python API are not set. If you need to use these notebooks, you can manually import them into your workspace with the AutoML experiment UI or the databricks.automl.import_notebook Python API.
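
As a sketch of that last option, the import_notebook call takes the MLflow artifact URI of a trial notebook and a destination workspace path; both values below are placeholders you would copy from the experiment UI or the trial's MLflow run:

```python
from databricks import automl

# Import a non-best trial notebook, stored as an MLflow artifact, into the workspace.
# Both arguments are placeholders; copy the real artifact URI from the trial's MLflow run.
result = automl.import_notebook(
    artifact_uri="<artifact-uri-of-the-trial-notebook>",
    path="/Users/<your-user>/imported_automl_trial",
)
print(result.path)  # workspace path of the imported notebook
print(result.url)   # link to open it in the workspace
```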

If you only need the data exploration notebook or the best trial notebook generated by AutoML, the Source column in the AutoML experiment UI contains the link to the generated notebook for the best trial.

If you need other generated notebooks, these are not automatically imported into the workspace. You can find them by clicking into each MLflow run; the IPython notebook is saved in the Artifacts section of the run page. You can download this notebook and import it into the workspace if downloading artifacts is enabled by your workspace administrators.
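
If you prefer to fetch such a notebook programmatically instead of through the UI, a hedged sketch using the MLflow client follows; the run ID and artifact path are placeholders, so check the run's Artifacts section for the notebook's actual location:

```python
import mlflow

# Download a trial notebook artifact from its MLflow run to a local directory.
# "<run-id>" and "notebooks" are placeholders; verify the real artifact path
# in the Artifacts section of the run page before running this.
local_path = mlflow.artifacts.download_artifacts(
    run_id="<run-id>",
    artifact_path="notebooks",
    dst_path="/tmp/automl_trial_notebook",
)
print(local_path)
```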

Shapley values (SHAP) for model explainability

Note

In Databricks Runtime 11.1 ML and below, SHAP plots are not generated if the dataset contains a datetime column.

The notebooks produced by AutoML regression and classification runs include code to calculate Shapley values. Shapley values are based on game theory and estimate the importance of each feature to a model’s predictions.

AutoML notebooks calculate Shapley values using the SHAP package. Because these calculations are highly memory-intensive, they are not performed by default.

To calculate and display Shapley values:

  1. Go to the Feature importance section in an AutoML-generated trial notebook.

  2. Set shap_enabled = True.

  3. Re-run the notebook.
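
For reference, the snippet below is a minimal sketch of the kind of SHAP computation the generated notebook runs when shap_enabled is True; model, X_train, and X_val are placeholders, and the generated code may differ between runtime versions:

```python
import shap

# Sample small background and explanation sets to keep memory usage bounded.
background = X_train.sample(n=min(100, len(X_train)), random_state=42)
to_explain = X_val.sample(n=min(100, len(X_val)), random_state=42)

# KernelExplainer works with any model that exposes a predict function.
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(to_explain)

# Summary plot of per-feature importance across the explained examples.
shap.summary_plot(shap_values, to_explain)
```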