Databricks for R developers

This section provides a guide to developing notebooks and jobs in Databricks using the R language.

A basic workflow for getting started is:

  1. Import code: Either import your own code from files or Git repos, or try a tutorial listed below. Databricks recommends learning by using interactive Databricks notebooks.

  2. Run your code on a cluster: Either create a cluster of your own, or ensure you have permissions to use a shared cluster. Attach your notebook to the cluster, and run the notebook.

Beyond this, you can branch out into more specific topics:

Tutorials

The following tutorials provide example code and notebooks to learn about common workflows. See Import a notebook for instructions on importing notebook examples into your workspace.

Reference

The following subsections list key features and tips to help you begin developing in Databricks with R.

Databricks supports two APIs that provide an R interface to Apache Spark: SparkR and sparklyr.

SparkR

These articles provide an introduction and reference for SparkR. SparkR is an R interface to Apache Spark that provides a distributed data frame implementation. SparkR supports operations like selection, filtering, and aggregation (similar to R data frames) but on large datasets.
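As a brief, hedged illustration of these operations (assuming a Databricks notebook where SparkR is attached and a Spark session is already running; `faithful` is a dataset built into R):

```r
library(SparkR)

# Convert a local R data.frame into a distributed SparkDataFrame.
df <- createDataFrame(faithful)

# Filter and aggregate, much as you would with an R data frame.
long_eruptions <- filter(df, df$eruptions > 3)
head(agg(long_eruptions, avg_waiting = avg(long_eruptions$waiting)))
```

On a real dataset, the same code runs unchanged; only the data behind the SparkDataFrame is distributed.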

sparklyr

This article provides an introduction to sparklyr. sparklyr is an R interface to Apache Spark that provides functionality similar to dplyr, broom, and DBI.
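A minimal sketch of the dplyr-style workflow (assuming a Databricks notebook; the "databricks" connection method is provided by the Databricks Runtime):

```r
library(sparklyr)
library(dplyr)

# Connect to the cluster's Spark session.
sc <- spark_connect(method = "databricks")

# Copy a local data frame to Spark, then use familiar dplyr verbs;
# sparklyr translates them to Spark SQL behind the scenes.
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)
```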

Comparing SparkR and sparklyr

This article explains key similarities and differences between SparkR and sparklyr.

Work with DataFrames and tables with SparkR and sparklyr

This article describes how to use R, SparkR, sparklyr, and dplyr to work with R data.frames, Spark DataFrames, and Spark tables in Databricks.
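For example, a hedged sketch of moving between the three representations (assuming a Databricks notebook with SparkR attached; the view name is illustrative):

```r
library(SparkR)

sdf <- createDataFrame(mtcars)           # R data.frame -> SparkDataFrame
createOrReplaceTempView(sdf, "mtcars_v") # register as a temporary Spark table
result <- sql("SELECT cyl, COUNT(*) AS n FROM mtcars_v GROUP BY cyl")
local_df <- collect(result)              # SparkDataFrame -> R data.frame
```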

Manage code with notebooks and Databricks Repos

Databricks notebooks support R. These notebooks provide functionality similar to that of Jupyter, with additions such as built-in visualizations for big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the notebook.

Databricks Repos allows users to synchronize notebooks and other files with Git repositories. Databricks Repos helps with code versioning and collaboration, and it can simplify importing a full repository of code into Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a remote Git repository. You can then open or create notebooks with the repository clone, attach the notebook to a cluster, and run the notebook.

Clusters

Databricks Clusters provide compute management for both single nodes and large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists will generally begin work either by creating a cluster or using an existing shared cluster. Once you have access to a cluster, you can attach a notebook to the cluster or run a job on the cluster.

Single node R and distributed R

Databricks clusters consist of an Apache Spark driver node and zero or more Spark worker (also known as executor) nodes. The driver node maintains attached notebook state, maintains the SparkContext, interprets notebook and library commands, and runs the Spark master that coordinates with Spark executors. Worker nodes run the Spark executors, one Spark executor per worker node.

A single node cluster has one driver node and no worker nodes, with Spark running in local mode to support access to tables managed by Databricks. Single node clusters support RStudio, notebooks, libraries, and DBFS, and are useful for R projects that don’t depend on Spark for big data or parallel processing. See Single Node clusters.

For data sizes that R struggles to process on a single machine (many gigabytes to petabytes), use multi-node (distributed) clusters instead. Distributed clusters have one driver node and one or more worker nodes. Distributed clusters also support RStudio, notebooks, libraries, and DBFS; in addition, R packages such as SparkR and sparklyr are specifically designed to use distributed clusters through the SparkContext. These packages provide familiar SQL and DataFrame APIs, which let you assign and run Spark tasks and commands in parallel across worker nodes. To learn more about sparklyr and SparkR, see Comparing SparkR and sparklyr.

Some SparkR and sparklyr functions that take particular advantage of distributing related work across worker nodes include the following:

  • sparklyr::spark_apply: Runs arbitrary R code at scale within a cluster. This is especially useful for functionality that is available only in R, or in R packages that are not available in Apache Spark or other Spark packages.

  • SparkR::dapply: Applies the specified function to each partition of a SparkDataFrame.

  • SparkR::dapplyCollect: Applies the specified function to each partition of a SparkDataFrame and collects the results back to R as a data.frame.

  • SparkR::gapply: Groups a SparkDataFrame by using the specified columns and applies the specified R function to each group.

  • SparkR::gapplyCollect: Groups a SparkDataFrame by using the specified columns, applies the specified R function to each group, and collects the result back to R as a data.frame.

  • SparkR::spark.lapply: Runs the specified function over a list of elements, distributing the computations with Spark.

For examples, see the notebook Distributed R: User Defined Functions in Spark.
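As a minimal, hedged sketch of one of these functions (assuming SparkR is attached in a Databricks notebook), SparkR::spark.lapply distributes a pure-R computation across the workers:

```r
library(SparkR)

# Fit one k-means model per value of k, in parallel on the worker nodes;
# each task returns the total within-cluster sum of squares.
wss <- spark.lapply(c(2, 3, 4), function(k) {
  kmeans(x = iris[, 1:4], centers = k)$tot.withinss
})
```

Note that any packages the function uses must be installed on the worker nodes, for example as cluster libraries.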

Libraries

Databricks clusters use the Databricks Runtime, which provides many popular libraries out-of-the-box, including Apache Spark, Delta Lake, and more. You can also install additional third-party or custom R packages into libraries to use with notebooks and jobs.

Start with the default libraries in the Databricks Runtime. Use the Databricks Runtime for Machine Learning for machine learning workloads. For full lists of pre-installed libraries, see the “Installed R libraries” section for the target Databricks Runtime in Databricks runtime releases.

You can customize your environment by using Notebook-scoped R libraries, which allow you to modify your notebook or job environment with libraries from CRAN or other repositories. To do this, you can use the familiar install.packages function from utils. The following example installs the Arrow R package from the default CRAN repository:

install.packages("arrow")

If you need an older version than what is included in the Databricks Runtime, you can use a notebook to run the install_version function from devtools. The following example installs dplyr version 0.7.4 from CRAN:

require(devtools)

install_version(
  package = "dplyr",
  version = "0.7.4",
  repos   = "http://cran.r-project.org"
)

Packages installed this way are available across a cluster but are scoped to the user who installs them. This lets you install multiple versions of the same package on the same compute without creating package conflicts.

You can install other libraries as Cluster libraries as needed, for example from CRAN. To do this, in the cluster user interface, click Libraries > Install new > CRAN and specify the library’s name. This approach is especially important when you want to call user-defined functions with SparkR or sparklyr.

If you need an older version than the one in the default CRAN repository, enter the associated CRAN snapshot URL in the Install library dialog’s Repository field. For example, to install dplyr version 0.7.4, enter the CRAN snapshot URL https://cran.microsoft.com/snapshot/2017-09-29/. You can verify the installed version by running the packageVersion function in a notebook. For example, to get the installed version of dplyr:

packageVersion("dplyr")

# [1] '0.7.4'

For more details, see Libraries.

To install a custom package into a library:

  1. Build your custom package from the command line or by using RStudio.

  2. Use the Databricks CLI to copy the custom package file from your development machine over to DBFS for your Databricks workspace.

    For example:

    databricks fs cp /local/path/to/package/<custom_package>.tar.gz dbfs:/path/to/tar/file/
    
  3. Install the custom package into a library by running install.packages.

    For example, from a notebook in your workspace:

    install.packages(
      pkgs  = "/dbfs/path/to/tar/file/<custom_package>.tar.gz",
      type  = "source",
      repos = NULL
    )
    

    Or:

    %sh
    R CMD INSTALL /dbfs/path/to/tar/file/<custom_package>.tar.gz
    

After you install a custom package into a library in DBFS, you can add the library to the search path and then load the library with a single command.

For example:

# Add the library to the search path one time.
.libPaths(c("/dbfs/path/to/tar/file/", .libPaths()))

# Load the library. You do not need to add the library to the search path again.
library(<custom_package>)

To install a custom package as a library on each node in a cluster, you must use Cluster node initialization scripts.

Visualizations

Databricks R notebooks support various types of visualizations using the display function.
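For example (a minimal sketch; display is a Databricks notebook function, not part of base R, so it works only inside a notebook):

```r
library(SparkR)

# Render a SparkDataFrame as an interactive table or chart in the notebook.
display(createDataFrame(faithful))
```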

Jobs

You can automate R workloads as scheduled or triggered notebook jobs in Databricks. For details, see Create, run, and manage Databricks Jobs.

  • For details on creating a job via the UI, see Create a job.

  • The Jobs API 2.1 allows you to create, edit, and delete jobs.

Machine learning

Databricks supports a wide variety of machine learning (ML) workloads, including traditional ML on tabular data, deep learning for computer vision and natural language processing, recommendation systems, graph analytics, and more. For general information about machine learning on Databricks, see the Databricks Machine Learning guide.

For ML algorithms, you can use pre-installed libraries in the Databricks Runtime for Machine Learning. You can also install custom libraries.

For machine learning operations (MLOps), Databricks provides a managed service for the open source library MLflow. MLflow Tracking lets you record model development and save models in reusable formats. The MLflow Model Registry lets you manage and automate the promotion of models towards production. Jobs and model serving (with Serverless Real-Time Inference or Classic MLflow Model Serving) let you host models as batch and streaming jobs and as REST endpoints. For more information and examples, see the MLflow guide or the MLflow R API docs.
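As a brief, hedged sketch of MLflow Tracking from R (the mlflow package is pre-installed in Databricks Runtime for Machine Learning; the parameter and metric names are illustrative):

```r
library(mlflow)

# Log a parameter and a metric for a simple model inside one tracked run.
with(mlflow_start_run(), {
  mlflow_log_param("formula", "mpg ~ wt")
  model <- lm(mpg ~ wt, data = mtcars)
  mlflow_log_metric("rmse", sqrt(mean(residuals(model)^2)))
})
```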

R developer tools

In addition to Databricks notebooks, you can use the following R developer tools:

Additional resources