Compute creation cheat sheet

This article provides clear, opinionated guidance for creating compute. Choosing the right compute type for your workflow can improve performance and reduce costs.

Each best practice below is followed by its impact.

If you are new to Databricks, start by using general all-purpose instance types

Selecting the appropriate instance type for the workload results in higher efficiency.

Use shared access mode unless your required functionality isn’t supported

Compute with shared access mode can be used by multiple users with data isolation among users.

Use the latest generation instance types if there is enough availability

The latest generation of instance types provides the best performance and latest features.

Set your on-demand and spot-instance balance based on how quickly you need your workload to run

Spot instances save on cost but can affect the overall run time of an operation if the spot instances are reclaimed.
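As a rough illustration of that trade-off, the sketch below builds the `aws_attributes` portion of a cluster specification. It assumes an AWS workspace and the Clusters API fields `first_on_demand`, `availability`, and `spot_bid_price_percent`. The helper name `spot_balance` and the `time_sensitive` flag are hypothetical, and the values are placeholders rather than recommendations.

```python
def spot_balance(num_workers: int, time_sensitive: bool) -> dict:
    """Return a hypothetical aws_attributes block for a cluster spec.

    Assumption: AWS workspace; field names follow the Clusters API.
    """
    if time_sensitive:
        # Keep every node on-demand so reclaimed spot capacity cannot
        # delay the run. first_on_demand counts the driver as well,
        # hence num_workers + 1.
        return {
            "first_on_demand": num_workers + 1,
            "availability": "ON_DEMAND",
        }
    # Only the driver is guaranteed on-demand; workers use spot instances
    # and fall back to on-demand when spot capacity is unavailable.
    return {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "spot_bid_price_percent": 100,
    }


# Hypothetical usage: a cost-sensitive job with 4 workers.
print(spot_balance(4, time_sensitive=False))
```

Keeping the driver on-demand in both branches is a deliberate choice in this sketch: losing a worker slows a job down, while losing the driver typically ends it.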

Choose the size of your nodes and the number of workers based on the types of operations your workload performs

For example, if you expect a lot of shuffles, it can be more efficient to use a single large node instead of multiple smaller nodes, because shuffled data then does not have to move across the network between nodes.

Run VACUUM on a cluster with auto-scaling set to 1-4 workers, where each worker has 8 cores.

Select a driver with between 8 and 32 cores. Increase the size of the driver if you get out-of-memory (OOM) errors.

VACUUM statements run in two phases, the second of which is driver-heavy. If the cluster is not sized appropriately, the operation can slow down and might not succeed.
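The sizing guidance above can be expressed as a small check. This is a minimal sketch, not part of any Databricks API: the function name `check_vacuum_cluster` and its parameters are hypothetical, and it assumes you already know the core counts of your chosen worker and driver instance types.

```python
def check_vacuum_cluster(min_workers: int, max_workers: int,
                         worker_cores: int, driver_cores: int) -> list[str]:
    """Return a list of deviations from the VACUUM sizing guidance."""
    problems = []
    # Auto-scaling range should stay within 1-4 workers.
    if min_workers < 1 or max_workers > 4 or min_workers > max_workers:
        problems.append("auto-scaling range should fall within 1-4 workers")
    # Each worker should have 8 cores.
    if worker_cores != 8:
        problems.append("each worker should have 8 cores")
    # The driver should have between 8 and 32 cores.
    if not 8 <= driver_cores <= 32:
        problems.append("driver should have between 8 and 32 cores")
    return problems


# Hypothetical cluster: auto-scale 1-4 workers, 8-core workers, 16-core driver.
print(check_vacuum_cluster(1, 4, 8, 16))  # → []
```

The check does not cover driver memory. If the second phase still fails with OOM errors, the guidance is to move to a larger driver within the 8-32 core range.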

Assess whether your batch workflow would benefit from Photon

Photon provides faster queries and reduces your total cost per workload.
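One way to assess the benefit is to compare how much more a Photon-enabled cluster costs per hour with how much faster your workload finishes on it. The sketch below is a simplified model under stated assumptions: it treats the hourly cost as a single blended rate, and the function name `photon_cost_ratio` and the example numbers are hypothetical. Measure the speedup on your own workload rather than assuming one.

```python
def photon_cost_ratio(rate_multiplier: float, speedup: float) -> float:
    """Cost of a Photon run relative to the same run without Photon.

    rate_multiplier: hourly cost with Photon divided by hourly cost
        without it (assumed input, taken from your own pricing).
    speedup: run time without Photon divided by run time with it
        (measured on your workload).
    A result below 1.0 means Photon lowers the total cost of the run.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return rate_multiplier / speedup


# Hypothetical numbers: Photon costs 2x per hour and runs the batch 3x faster.
print(round(photon_cost_ratio(2.0, 3.0), 2))  # → 0.67
```

In this model the break-even point is where the speedup equals the rate multiplier. Workloads that speed up by less than that finish sooner but cost more in total.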