Train Spark ML models on Databricks Connect with pyspark.ml.connect

Preview

This feature is in Public Preview.

This article provides an example that demonstrates how to use the pyspark.ml.connect module to perform distributed training of Spark ML models and run model inference on Databricks Connect.

What is pyspark.ml.connect?

Spark 3.5 introduces pyspark.ml.connect, which is designed to support Spark Connect mode and Databricks Connect. Learn more about Databricks Connect.

The pyspark.ml.connect module consists of common learning algorithms and utilities, including classification, feature transformers, ML pipelines, and cross validation. It provides interfaces similar to the legacy `pyspark.ml` module, but currently contains only a subset of the algorithms in pyspark.ml. The supported algorithms are listed below:

  • Classification algorithm: pyspark.ml.connect.classification.LogisticRegression

  • Feature transformers: pyspark.ml.connect.feature.MaxAbsScaler and pyspark.ml.connect.feature.StandardScaler

  • Evaluators: pyspark.ml.connect.RegressionEvaluator, pyspark.ml.connect.BinaryClassificationEvaluator, and pyspark.ml.connect.MulticlassClassificationEvaluator

  • Pipeline: pyspark.ml.connect.pipeline.Pipeline

  • Model tuning: pyspark.ml.connect.tuning.CrossValidator

Requirements

Example notebook

The following notebook demonstrates how to use Distributed ML on Databricks Connect:

Distributed ML on Databricks Connect


For reference information about APIs in pyspark.ml.connect, Databricks recommends the Apache Spark API reference.