Migrate single-node workloads to Databricks

This article answers typical questions that come up when you migrate single-node workloads to Databricks.

I just created a 20-node Spark cluster and my pandas code doesn’t run any faster. What is going wrong?

If you are working with single-node libraries, they will not inherently become distributed when you switch to Databricks. You will need to rewrite your code using PySpark, the Apache Spark Python API.

Alternatively, you can use the pandas API on Spark, which lets you keep the familiar pandas DataFrame API while the work runs as distributed Spark jobs.
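As a minimal sketch, here is the same transformation written in plain pandas and in the pandas API on Spark; the file paths and column names are illustrative placeholders, not from this article:

```python
import pandas as pd
import pyspark.pandas as ps

# Plain pandas: runs entirely on the driver node, so adding
# workers to the cluster does not speed it up.
pdf = pd.read_csv("/dbfs/tmp/sales.csv")
pdf["revenue"] = pdf["price"] * pdf["quantity"]

# pandas API on Spark: the same DataFrame API, but operations
# execute as distributed Spark jobs across the cluster.
psdf = ps.read_csv("/tmp/sales.csv")
psdf["revenue"] = psdf["price"] * psdf["quantity"]

# Convert to a regular Spark DataFrame when you need the PySpark API.
sdf = psdf.to_spark()
```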

There is an algorithm in sklearn that I love, but Spark ML doesn’t support it (such as DBSCAN). How can I use this algorithm and still take advantage of Spark?
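One common pattern, sketched below rather than the only option, is to keep the algorithm single-node but run it in parallel over independent subsets of your data using applyInPandas. The table, grouping column, and DBSCAN parameters are hypothetical, and each group must fit in memory on a single executor:

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical table with columns: region (string), x, y (double).
df = spark.table("points")

def cluster_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs single-node scikit-learn on one region's data, on one executor.
    model = DBSCAN(eps=0.5, min_samples=5)
    pdf["cluster"] = model.fit_predict(pdf[["x", "y"]])
    return pdf

# Spark runs cluster_group once per region, in parallel across the cluster.
result = df.groupBy("region").applyInPandas(
    cluster_group,
    schema="region string, x double, y double, cluster long",
)
```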

What are my deployment options for Spark ML?

The best deployment option depends on the latency requirements of the application. For high-throughput batch or streaming scoring, you can apply the model directly to a Spark DataFrame; low-latency, request-at-a-time serving typically requires exporting the model or hosting it behind a model-serving endpoint.
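As a minimal sketch of the batch case, assuming a pipeline previously saved with Spark ML's persistence API; the model path and table names are hypothetical:

```python
from pyspark.ml import PipelineModel

# Load a previously trained and saved Spark ML pipeline.
model = PipelineModel.load("/tmp/models/my_pipeline")

# Batch scoring: apply the model to a whole table at once.
scored = model.transform(spark.table("new_data"))
scored.write.mode("overwrite").saveAsTable("predictions")
```

In many cases the same transform call can also be applied to a streaming DataFrame for near-real-time scoring.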

How can I install or update pandas or another library?

There are several ways to install or update a library: a notebook-scoped library applies only to the current notebook session, while a cluster-installed library is available to every notebook attached to that cluster.
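A minimal example of the notebook-scoped route, using the %pip magic command in a notebook cell:

```python
# Notebook-scoped install: applies only to the current notebook session.
# If pandas was already imported, restart the Python process for the
# new version to take effect.
%pip install --upgrade pandas
```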

How can I get data into Databricks?
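Options range from uploading small files through the workspace UI to reading directly from cloud object storage with Spark. As one illustrative sketch, reading a CSV file straight from object storage; the bucket path is a placeholder and assumes your cluster already has access to it:

```python
# Read a CSV file directly from cloud object storage into a Spark DataFrame.
# Adjust the path and credentials for your cloud provider.
df = spark.read.csv(
    "s3://my-bucket/data/events.csv",
    header=True,
    inferSchema=True,
)
display(df)
```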