This article answers common questions that come up when you migrate single-node workloads to Databricks.
I just created a 20-node Spark cluster and my pandas code doesn’t run any faster. What is going wrong?
Single-node libraries such as pandas do not become distributed when you run them on Databricks. Your pandas code still executes on a single node (the driver), so adding worker nodes does not speed it up. To use the whole cluster, rewrite your code using PySpark, the Apache Spark Python API.
Alternatively, you can use Pandas API on Spark, which allows you to use the pandas DataFrame API to access data in Apache Spark DataFrames.
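As a rough sketch of the pandas API on Spark option, the function below runs the same pandas-style aggregation on either backend. The column names, the sample data, and the `distributed` flag are invented for illustration, and `pyspark.pandas` is assumed to be available (it ships with Apache Spark 3.2 and later).

```python
# Hypothetical sample data; any dict of equal-length columns works.
SAMPLE_ROWS = {"key": ["a", "a", "b"], "value": [1.0, 3.0, 5.0]}


def mean_by_key(rows, distributed=True):
    """Return the mean of `value` per `key` using pandas-style syntax."""
    if distributed:
        # pandas API on Spark: same DataFrame API, executed by Spark.
        import pyspark.pandas as frame_api
    else:
        # Plain pandas: runs on a single machine.
        import pandas as frame_api
    df = frame_api.DataFrame(rows)
    return df.groupby("key")["value"].mean()
```

On a cluster, `mean_by_key(SAMPLE_ROWS)` returns a pandas-on-Spark Series. Call `.to_pandas()` on it only when the result is small enough to fit on the driver.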
There is an algorithm in sklearn that I love (such as DBSCAN), but Spark ML doesn’t support it. How can I use this algorithm and still take advantage of Spark?
- Use joblib-spark, an Apache Spark backend for joblib, to distribute tasks on a Spark cluster.
- Use a pandas user-defined function.
- For hyperparameter tuning, use Hyperopt.
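The joblib-spark option might look roughly like the sketch below, which runs several independent DBSCAN fits (one per `eps` value) as joblib tasks. The sample points and the helper names `count_clusters` and `sweep_eps` are hypothetical. Note that this distributes whole fits across the cluster; it does not distribute a single DBSCAN fit, so each task's data must still fit on one node.

```python
# Hypothetical 2-D points forming two tight pairs far apart.
POINTS = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]


def count_clusters(points, eps):
    """Fit DBSCAN once and return (eps, number of clusters found)."""
    from sklearn.cluster import DBSCAN

    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(points)
    # DBSCAN labels noise points as -1; they do not count as a cluster.
    return eps, len(set(labels) - {-1})


def sweep_eps(points, eps_values, backend="spark"):
    """Run one independent DBSCAN fit per eps value via joblib."""
    from joblib import Parallel, delayed, parallel_backend

    if backend == "spark":
        # joblib-spark makes "spark" available as a joblib backend.
        from joblibspark import register_spark

        register_spark()
    with parallel_backend(backend, n_jobs=len(eps_values)):
        return Parallel()(delayed(count_clusters)(points, e) for e in eps_values)
```

Passing `backend="threading"` runs the same sweep locally without Spark, which can be handy for checking the logic before moving to a cluster.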
What are my deployment options for Spark ML?
The best deployment option depends on the latency requirements of your application.
- For batch predictions, see Deploy and serve models and Model inference.
- For streaming applications, see Structured Streaming.
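To illustrate how the two options differ in code, here is a minimal sketch that applies a saved Spark ML `PipelineModel` first to a batch DataFrame and then to a streaming one. All paths, the Delta format, and the function names are assumptions for illustration, and not every Spark ML stage is guaranteed to support streaming input.

```python
def score_batch(spark, model_path, input_path, output_path):
    """Score a whole dataset at once; suited to high-latency, scheduled jobs."""
    from pyspark.ml import PipelineModel

    model = PipelineModel.load(model_path)
    df = spark.read.format("delta").load(input_path)
    model.transform(df).write.format("delta").mode("overwrite").save(output_path)


def score_stream(spark, model_path, input_path, output_path, checkpoint_path):
    """Score records as they arrive using Structured Streaming."""
    from pyspark.ml import PipelineModel

    model = PipelineModel.load(model_path)
    stream = spark.readStream.format("delta").load(input_path)
    # The checkpoint location lets the query resume after a restart.
    return (
        model.transform(stream)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .start(output_path)
    )
```

The model and the `transform` call are the same in both cases; only the way the input is read and the output is written changes.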
How can I install or update pandas or another library?
There are several ways to install or update a library.
- To install or update a library for all users on a cluster, see Cluster libraries.
- To make a Python library or a library version available only for a specific notebook, see Notebook-scoped Python libraries.
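After installing or updating a library, it can be useful to confirm which version the notebook's Python environment actually sees. The helper below is a small sketch using only the standard library; the name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None
```

In a notebook, a notebook-scoped install is typically done with the `%pip install` magic command in its own cell. Calling `installed_version("pandas")` afterwards shows the version in effect for that notebook.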
How can I get data into Databricks?
- Mounting. See Mount object storage to DBFS.
- Data tab. See Introduction to importing, reading, and modifying data.
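For the mounting option, a minimal sketch might look like the following. It assumes the `dbutils` and `spark` objects that Databricks notebooks provide, passed in explicitly here so the functions can be exercised outside a notebook. The helper names are hypothetical, and credentials (normally supplied through `extra_configs` or cluster configuration) are left out.

```python
def ensure_mounted(dbutils, source, mount_point, extra_configs=None):
    """Mount `source` at `mount_point` unless that mount point already exists."""
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point not in existing:
        dbutils.fs.mount(
            source=source,
            mount_point=mount_point,
            extra_configs=extra_configs or {},
        )
    return mount_point


def read_csv_from_mount(spark, mount_point, relative_path):
    """Read a CSV file with a header row from a mounted location."""
    path = f"{mount_point.rstrip('/')}/{relative_path.lstrip('/')}"
    return spark.read.option("header", "true").csv(path)
```

In a notebook this could be called as `ensure_mounted(dbutils, "s3a://example-bucket", "/mnt/example")` followed by `read_csv_from_mount(spark, "/mnt/example", "data/file.csv")`, where the bucket name and file path are placeholders.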