Machine learning

This section includes example notebooks that show how to train models on Databricks with popular machine learning packages.

scikit-learn

scikit-learn is one of the most popular Python libraries for single-node machine learning. It is included in both Databricks Runtime and Databricks Runtime ML. See the Databricks Runtime release notes for the scikit-learn version included with your cluster’s runtime.
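
As a rough illustration of single-node training, the sketch below fits a classifier on the iris dataset that ships with scikit-learn. The dataset, the choice of LogisticRegression, and the split parameters are assumptions made for this example and are not taken from the example notebooks.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in dataset (an assumption for this sketch).
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for evaluation, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the model on the driver node; no Spark cluster resources are used.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Mean accuracy on the held-out rows.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Because scikit-learn runs on a single node, the training data has to fit in the memory of the machine that runs the notebook.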

MLlib

Apache Spark MLlib is the Apache Spark machine learning library. It provides common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives.

XGBoost

XGBoost is a popular machine learning library optimized for training gradient-boosted decision trees; it can also train random forests. It is included in Databricks Runtime ML. For information about installing XGBoost on Databricks Runtime, or installing a custom version on Databricks Runtime ML, see these instructions.

You can train XGBoost models on an individual machine or in a distributed fashion.