This section provides a guide to developing notebooks and jobs in Databricks using the R language.
A basic workflow for getting started is:
Import code: Either import your own code from files or Git repos, or try one of the tutorials listed below. Databricks recommends learning with interactive Databricks notebooks.
Run your code on a cluster: Either create a cluster of your own, or ensure you have permissions to use a shared cluster. Attach your notebook to the cluster, and run the notebook.
Beyond this, you can branch out into more specific topics:
Work with larger data sets using Apache Spark
Automate your workload as a job
Use machine learning to analyze your data
The following tutorials provide example code and notebooks to learn about common workflows. See Import a notebook for instructions on importing notebook examples into your workspace.
The following subsections list key features and tips to help you begin developing in Databricks with R.
These articles provide an introduction and reference for SparkR. SparkR is an R interface to Apache Spark that provides a distributed data frame implementation. SparkR supports operations like selection, filtering, and aggregation (similar to R data frames) but on large datasets.
This article explains key similarities and differences between SparkR and sparklyr.
This article describes how to use R, SparkR, sparklyr, and dplyr to work with R data.frames, Spark DataFrames, and Spark tables in Databricks.
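As a brief sketch of moving between these representations (assuming a notebook attached to a Databricks cluster with an active Spark session; the view name `mtcars_view` is illustrative):

```r
library(SparkR)
sparkR.session()

# Convert an R data.frame to a SparkDataFrame and back.
mtcars_sdf <- createDataFrame(mtcars)    # R data.frame -> SparkDataFrame
head(select(mtcars_sdf, "mpg", "cyl"))   # distributed column selection
mtcars_local <- collect(mtcars_sdf)      # SparkDataFrame -> local R data.frame

# Register the SparkDataFrame as a temporary view to query it with Spark SQL.
createOrReplaceTempView(mtcars_sdf, "mtcars_view")
head(sql("SELECT mpg, cyl FROM mtcars_view WHERE cyl = 6"))
```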
Databricks notebooks support R. These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the notebook.
Databricks Repos allows users to synchronize notebooks and other files with Git repositories. Databricks Repos helps with code versioning and collaboration, and it can simplify importing a full repository of code into Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a remote Git repository. You can then open or create notebooks with the repository clone, attach the notebook to a cluster, and run the notebook.
Databricks Clusters provide compute management for both single nodes and large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists will generally begin work either by creating a cluster or using an existing shared cluster. Once you have access to a cluster, you can attach a notebook to the cluster or run a job on the cluster.
For small workloads that require only a single node, data scientists can use Single Node clusters for cost savings.
For detailed tips, see Best practices: Cluster configuration.
Administrators can set up cluster policies to simplify and guide cluster creation.
Databricks clusters consist of an Apache Spark driver node and zero or more Spark worker (also known as executor) nodes. The driver node maintains attached notebook state, maintains the SparkContext, interprets notebook and library commands, and runs the Spark master that coordinates with Spark executors. Worker nodes run the Spark executors, one Spark executor per worker node.
A single node cluster has one driver node and no worker nodes, with Spark running in local mode to support access to tables managed by Databricks. Single node clusters support RStudio, notebooks, libraries, and DBFS, and are useful for R projects that don’t depend on Spark for big data or parallel processing. See Single Node clusters.
For data sizes that R struggles to process (many gigabytes, or petabytes), use multiple-node (distributed) clusters instead. Distributed clusters have one driver node and one or more worker nodes. Like single node clusters, they support RStudio, notebooks, libraries, and DBFS; in addition, R packages such as SparkR and sparklyr are designed specifically to use distributed clusters through the SparkContext. These packages provide familiar SQL and DataFrame APIs, which enable assigning and running various Spark tasks and commands in parallel across worker nodes. To learn more about sparklyr and SparkR, see Comparing SparkR and sparklyr.
Some SparkR and sparklyr functions that take particular advantage of distributing related work across worker nodes include the following:
sparklyr::spark_apply: Runs arbitrary R code at scale within a cluster. This is especially useful for using functionality that is available only in R, or in R packages that are not available in Apache Spark or other Spark packages.
SparkR::dapply: Applies the specified function to each partition of a SparkDataFrame.
SparkR::dapplyCollect: Applies the specified function to each partition of a SparkDataFrame and collects the results back to R as a data.frame.
SparkR::gapply: Groups a SparkDataFrame by using the specified columns and applies the specified R function to each group.
SparkR::gapplyCollect: Groups a SparkDataFrame by using the specified columns, applies the specified R function to each group, and collects the result back to R as a data.frame.
SparkR::spark.lapply: Runs the specified function over a list of elements, distributing the computations with Spark.
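As a minimal sketch of the first of these, sparklyr::spark_apply (assuming a Databricks notebook, where a sparklyr connection can be made with the databricks method; note that sparklyr replaces dots in column names with underscores):

```r
library(sparklyr)

# Connect to the cluster's Spark session from a Databricks notebook.
sc <- spark_connect(method = "databricks")

# Copy a small local data set to Spark, then run arbitrary R code on each partition.
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)
result <- spark_apply(iris_tbl, function(df) {
  # This R function runs on the workers; df is a local data.frame partition.
  df$Sepal_Ratio <- df$Sepal_Length / df$Sepal_Width
  df
})
head(result)
```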
For examples, see the notebook Distributed R: User Defined Functions in Spark.
Databricks clusters use the Databricks Runtime, which provides many popular libraries out-of-the-box, including Apache Spark, Delta Lake, and more. You can also install additional third-party or custom R packages into libraries to use with notebooks and jobs.
Start with the default libraries in the Databricks Runtime. Use the Introduction to Databricks Runtime for Machine Learning for machine learning workloads. For full lists of pre-installed libraries, see the “Installed R libraries” section for the target Databricks Runtime in Databricks Runtime release notes versions and compatibility.
You can customize your environment by using Notebook-scoped R libraries, which allow you to modify your notebook or job environment with libraries from CRAN or other repositories. To do this, you can use the familiar install.packages function from utils. The following example installs the Arrow R package from the default CRAN repository:
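A minimal sketch of that call, run in a notebook cell (this downloads from CRAN, so the cluster needs network access):

```r
# Install the arrow package from the default CRAN repository.
# Notebook-scoped installs like this affect only the current notebook session.
install.packages("arrow")
```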
If you need an older version of a package than the one included in the Databricks Runtime, you can use a notebook to run the install_version function from devtools. The following example installs dplyr version 0.7.4 from CRAN:

require(devtools)

install_version(
  package = "dplyr",
  version = "0.7.4",
  repos = "http://cran.r-project.org"
)
Packages installed this way are available across a cluster but are scoped to the user who installs them. This enables you to install multiple versions of the same package on the same compute without creating package conflicts.
You can install other libraries as Cluster libraries as needed, for example from CRAN. To do this, in the cluster user interface, click Libraries > Install new > CRAN and specify the library’s name. This approach is especially important when you want to call user-defined functions with SparkR or sparklyr.
For more details, see Libraries.
To install a custom package into a library:
Build your custom package from the command line or by using RStudio.
Copy the package file to DBFS, for example by using the Databricks CLI:

databricks fs cp /local/path/to/package/<custom-package>.tar.gz dbfs:/path/to/tar/file/
The preceding example applies to Databricks CLI versions 0.205 and above.
Install the custom package into a library by running install.packages.
For example, from a notebook in your workspace:
install.packages(
  pkgs = "/dbfs/path/to/tar/file/<custom-package>.tar.gz",
  type = "source",
  repos = NULL
)
Alternatively, from a notebook cell:

%sh
R CMD INSTALL /dbfs/path/to/tar/file/<custom-package>.tar.gz
After you install a custom package into a library in DBFS, you can add the library to the search path and then load the library with a single command.
# Add the library to the search path one time.
.libPaths(c("/dbfs/path/to/tar/file/", .libPaths()))

# Load the library. You do not need to add the library to the search path again.
library(<custom-package>)
To install a custom package as a library on each node in a cluster, you must use an init script. See What are init scripts?.
Databricks R notebooks support various types of visualizations by using the display function.
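As a minimal sketch (assuming a Databricks notebook, where the display function is available):

```r
library(SparkR)
sparkR.session()

# Create a SparkDataFrame from the built-in faithful data set
# and render it as an interactive table, which can be switched to a chart.
df <- createDataFrame(faithful)
display(df)
```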
You can automate R workloads as scheduled or triggered notebook jobs. See Create and run Databricks Jobs.
Databricks supports a wide variety of machine learning (ML) workloads, including traditional ML on tabular data, deep learning for computer vision and natural language processing, recommendation systems, graph analytics, and more. For general information about machine learning on Databricks, see the Introduction to Databricks Machine Learning.
For ML algorithms, you can use pre-installed libraries in the Introduction to Databricks Runtime for Machine Learning. You can also install custom libraries.
For machine learning operations (MLOps), Databricks provides a managed service for the open source library MLflow. MLflow Tracking lets you record model development and save models in reusable formats; the MLflow Model Registry lets you manage and automate the promotion of models towards production; and Jobs allow hosting models as batch and streaming jobs. For more information and examples, see the MLflow guide or the MLflow R API docs.
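A minimal sketch of experiment tracking with the MLflow R API (assuming the mlflow R package is installed and configured; the parameter and metric names are illustrative):

```r
library(mlflow)

# Record one model-development run: log a hyperparameter and a result metric.
# The run is automatically ended when the with() block exits.
with(mlflow_start_run(), {
  mlflow_log_param("alpha", 0.5)
  mlflow_log_metric("rmse", 0.82)
})
```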
In addition to Databricks notebooks, you can also use the following R developer tools:
In Databricks Runtime 12.0 and above, R sessions can be customized by using site-wide profile (.Rprofile) files. R notebooks source the file as R code during startup. To modify the file, find the value of R_HOME and modify $R_HOME/etc/Rprofile.site. Note that Databricks has added configuration to the file to ensure proper functionality for hosted RStudio on Databricks. Removing any of it may cause RStudio to not work as expected.
In Databricks Runtime 11.3 and below, this behavior can be enabled by setting the environment variable