Databricks for Scala developers

This article provides a guide to developing notebooks and jobs in Databricks using the Scala language. The first section provides links to tutorials for common workflows and tasks. The second section provides links to APIs, libraries, and key tools.

A basic workflow for getting started is:

Beyond this, you can branch out into more specific topics:

Tutorials

The tutorials below provide example code and notebooks to learn about common workflows. See Import a notebook for instructions on importing notebook examples into your workspace.

Reference

The below subsections list key features and tips to help you begin developing in Databricks with Scala.

Manage code with notebooks and Databricks Repos

Databricks notebooks support Scala. These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the notebook.

Tip

To completely reset the state of your notebook, it can be useful to restart the kernel. For Jupyter users, the “restart kernel” option in Jupyter corresponds to detaching and re-attaching a notebook in Databricks. To restart the kernel in a notebook, click on the cluster dropdown in the upper-left and click Detach & Re-attach. This detaches the notebook from your cluster and reattaches it, which restarts the process.

Databricks Repos allows users to synchronize notebooks and other files with Git repositories. Databricks Repos helps with code versioning and collaboration, and it can simplify importing a full repository of code into Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a remote Git repository. You can then open or create notebooks with the repository clone, attach the notebook to a cluster, and run the notebook.

Clusters and libraries

Databricks Clusters provide compute management for both single nodes and large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists generally begin work either by creating a cluster or using an existing shared cluster. Once you have access to a cluster, you can attach a notebook to the cluster or run a job on the cluster.

Databricks clusters use a Databricks Runtime, which provides many popular libraries out-of-the-box, including Apache Spark, Delta Lake, and more. You can also install additional third-party or custom libraries to use with notebooks and jobs.

Visualizations

Databricks Scala notebooks have built-in support for many types of visualizations. You can also use legacy visualizations:

Interoperability

This section describes features that support interoperability between Scala and SQL.

Jobs

You can automate Scala workloads as scheduled or triggered jobs in Databricks. Jobs can run notebooks and JARs.

  • For details on creating a job via the UI, see Create a job.

  • The Jobs API 2.1 allows you to create, edit, and delete jobs.

IDEs, developer tools, and APIs

In addition to developing Scala code within Databricks notebooks, you can develop externally using integrated development environments (IDEs) such as IntelliJ IDEA. To synchronize work between external development environments and Databricks, there are several options:

Databricks provides a full set of REST APIs which support automation and integration with external tooling. You can use APIs to manage resources like clusters and libraries, code and other workspace objects, workloads and jobs, and more. See REST API (latest).

For more information on IDEs, developer tools, and APIs, see Developer tools and guidance.

Additional resources