Train Spark ML models on Databricks Connect with `pyspark.ml.connect`

Preview

This article provides an example that demonstrates how to use the `pyspark.ml.connect` module to train Spark ML models with distributed training and run model inference on Databricks Connect.

What is `pyspark.ml.connect`?

Spark 3.5 introduces `pyspark.ml.connect`, which is designed to support Spark Connect mode and Databricks Connect. Learn more about Databricks Connect.

The `pyspark.ml.connect` module consists of common learning algorithms and utilities, including classification, feature transformers, ML pipelines, and cross validation. It provides interfaces similar to the legacy `pyspark.ml` module, but currently contains only a subset of the algorithms in `pyspark.ml`. The supported algorithms are listed below:

Classification algorithm: pyspark.ml.connect.classification.LogisticRegression
Feature transformers: pyspark.ml.connect.feature.MaxAbsScaler and pyspark.ml.connect.feature.StandardScaler
Evaluators: pyspark.ml.connect.evaluation.RegressionEvaluator, pyspark.ml.connect.evaluation.BinaryClassificationEvaluator, and pyspark.ml.connect.evaluation.MulticlassClassificationEvaluator
Pipeline: pyspark.ml.connect.pipeline.Pipeline
Model tuning: pyspark.ml.connect.tuning.CrossValidator
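
As a minimal sketch of how the classes listed above fit together, the following example chains a `StandardScaler` and a `LogisticRegression` into a `Pipeline`, fits it, and runs inference. The helper names (`sample_rows`, `build_pipeline`, `train_and_score`), the four sample rows, the `scaled_features` column name, and the hyperparameter values are illustrative assumptions, not part of the example notebook. The code assumes an environment that meets the requirements in this article and an existing `spark` session.

```python
def sample_rows():
    """Tiny made-up binary classification data as (features, label) pairs."""
    return [
        ([1.0, 2.0], 1),
        ([2.0, -1.0], 1),
        ([-3.0, -2.0], 0),
        ([-1.0, -2.0], 0),
    ]


def build_pipeline():
    """Build a scaler + logistic regression pipeline (hypothetical helper)."""
    # Imports are deferred so this module loads even where pyspark is absent.
    from pyspark.ml.connect.classification import LogisticRegression
    from pyspark.ml.connect.feature import StandardScaler
    from pyspark.ml.connect.pipeline import Pipeline

    scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
    # maxIter and learningRate are arbitrary values chosen for illustration.
    lr = LogisticRegression(
        featuresCol="scaled_features",
        labelCol="label",
        maxIter=20,
        learningRate=0.01,
    )
    return Pipeline(stages=[scaler, lr])


def train_and_score(spark):
    """Fit the pipeline on the sample data and return the scored DataFrame."""
    df = spark.createDataFrame(sample_rows(), schema=["features", "label"])
    model = build_pipeline().fit(df)  # distributed training on the cluster
    return model.transform(df)        # adds prediction columns
```

In a Databricks notebook, `spark` is already defined, so `train_and_score(spark).show()` would display the predictions. From a local Databricks Connect client, a session can typically be obtained first with `DatabricksSession.builder.getOrCreate()` from the `databricks.connect` package.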

Requirements

Set up Databricks Connect on your clusters. See Compute configuration for Databricks Connect.
Databricks Runtime 14.0 ML or higher installed.
Cluster access mode of Assigned.

Example notebook

The following notebook demonstrates how to use distributed ML on Databricks Connect:

Distributed ML on Databricks Connect


For reference information about the APIs in `pyspark.ml.connect`, Databricks recommends the Apache Spark API reference.
