Libraries

To make third-party or custom code available to notebooks and jobs running on your clusters, you can install a library. Libraries can be written in Python, Java, Scala, and R. You can upload Java, Scala, and Python libraries and point to external packages in PyPI, Maven, and CRAN repositories.

Important

Libraries uploaded using the library UI are stored in the DBFS root. All workspace users can modify data and files stored in the DBFS root. You can avoid this by uploading libraries to workspace files or Unity Catalog volumes, by using libraries in cloud object storage, or by using library package repositories.

This article focuses on performing library tasks in the workspace UI. You can also manage libraries using the CLI or the Libraries API.
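
For example, here is a minimal sketch of installing a PyPI package on a running cluster through the Libraries API, using Python and the requests package. The workspace URL, token, cluster ID, and package are placeholders; replace them with your own values.

    import os
    import requests

    # Workspace URL and personal access token, assumed to be set as environment variables.
    host = os.environ["DATABRICKS_HOST"]    # e.g. https://<your-workspace-url>
    token = os.environ["DATABRICKS_TOKEN"]

    payload = {
        "cluster_id": "1234-567890-abcde123",  # placeholder cluster ID
        "libraries": [
            {"pypi": {"package": "simplejson==3.19.2"}}
        ],
    }

    # POST to the Libraries API install endpoint.
    response = requests.post(
        f"{host}/api/2.0/libraries/install",
        headers={"Authorization": f"Bearer {token}"},
        json=payload,
    )
    response.raise_for_status()

The same library objects (pypi, maven, cran, whl, egg) are used when you attach libraries to a job through the Jobs API.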

Databricks includes many common libraries in Databricks Runtime. To see which libraries are included in Databricks Runtime, look at the System Environment subsection of the Databricks Runtime release notes for your Databricks Runtime version.

Important

Databricks does not invoke Python atexit functions when your notebook or job completes processing. If you use a Python library that registers atexit handlers, you must ensure your code calls required functions before exiting.
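
For example, if a library defers cleanup work to an atexit handler, call that cleanup yourself as the last step of the notebook or job. A minimal sketch with a hypothetical handler:

    import atexit

    def flush_results():
        # Hypothetical cleanup work that a library might defer to interpreter exit.
        print("flushing buffered results")

    # Registering the handler alone is not enough on Databricks, because atexit
    # functions are not invoked when the notebook or job completes.
    atexit.register(flush_results)

    # Instead, call the cleanup explicitly before the notebook or job exits.
    flush_results()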

Installing Python eggs is deprecated and will be removed in a future Databricks Runtime release. Use Python wheels or install packages from PyPI instead.

Note

Unity Catalog has some limitations on library usage. For details, see Cluster libraries.

You can install libraries in three modes: cluster-installed, notebook-scoped, and workspace.

  • Cluster libraries can be used by all notebooks running on a cluster. You can install a cluster library directly from the following sources:

    • A public repository such as PyPI, Maven, or CRAN.

    • A cloud object storage location.

    • A workspace library stored in workspace files, in a Unity Catalog volume, or in DBFS.

    • Uploading library files from your local machine.

      • Note: Libraries installed directly from upload are stored in the DBFS root.

  • Notebook-scoped libraries, available for Python and R, allow you to install libraries and create an environment scoped to a notebook session. These libraries do not affect other notebooks running on the same cluster. Notebook-scoped libraries do not persist and must be reinstalled for each session. Use notebook-scoped libraries when you need a custom environment for a specific notebook, as shown in the example after this list.

  • Workspace libraries serve as a local repository from which you can create cluster-installed libraries. A workspace library might be custom code created by your organization, or might be a particular version of an open-source library that your organization has standardized on. The package can be downloaded from a remote repository or an object storage location.
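
For example, a notebook-scoped install of a pinned package version with %pip (the package and version are placeholders). The library is available only to the current notebook session and must be reinstalled in each new session.

    %pip install matplotlib==3.8.4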

Python environment management

The following table provides an overview of options you can use to install Python libraries in Databricks.

| Python package source | Notebook-scoped libraries with %pip | Cluster libraries | Job libraries with Jobs API |
| --- | --- | --- | --- |
| PyPI | Use %pip install. See example. | Select PyPI as the source. | Add a new pypi object to the job libraries and specify the package field. |
| Private PyPI mirror, such as Nexus or Artifactory | Use %pip install with the --index-url option. Secret management is available. See example. | Not supported. | Not supported. |
| VCS, such as GitHub, with raw source | Use %pip install and specify the repository URL as the package name. See example. | Select PyPI as the source and specify the repository URL as the package name. | Add a new pypi object to the job libraries and specify the repository URL as the package field. |
| Private VCS with raw source | Use %pip install and specify the repository URL with basic authentication as the package name. Secret management is available. See example. | Not supported. | Not supported. |
| File path | Use %pip install. See example. | Select File path/GCS as the source. | Add a new egg or whl object to the job libraries and specify the file path as the package field. |
| GCS | Use %pip install together with a pre-signed URL. Paths with the GCS protocol gs:// are not supported. | Select File path/GCS as the source. | Add a new egg or whl object to the job libraries and specify the GCS path as the package field. |
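
The following notebook commands sketch the %pip forms referenced in the table, in row order: a PyPI package, a private index, a Git repository, a local file path, and a pre-signed GCS URL. All package names, URLs, repositories, and file paths are placeholders; run each command in its own cell.

    %pip install simplejson==3.19.2
    %pip install --index-url https://nexus.example.com/repository/pypi/simple/ my-internal-package
    %pip install git+https://github.com/databricks/databricks-cli
    %pip install /dbfs/FileStore/wheels/my_package-0.1.0-py3-none-any.whl
    %pip install "https://storage.googleapis.com/my-bucket/my_package-0.1.0-py3-none-any.whl?X-Goog-Signature=<signature>"

For job libraries, the same sources are expressed as objects in the libraries array of the Jobs API request, for example {"pypi": {"package": "simplejson==3.19.2"}} or {"whl": "dbfs:/FileStore/wheels/my_package-0.1.0-py3-none-any.whl"}.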

Python library precedence

You might encounter a situation where you need to override the version of a built-in library, or have a custom library whose name conflicts with another library installed on the cluster. When you run import <library>, the library with the highest precedence is imported.

Important

Libraries stored in workspace files have different precedence depending on how they are added to the Python sys.path. Databricks Repos adds the current working directory to the path before all other libraries, while notebooks outside Repos add the current working directory after other libraries are installed. If you manually append workspace directories to your path, these always have the lowest precedence.
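
You can inspect the interpreter path to see where these entries fall. A minimal sketch (the appended workspace path is hypothetical):

    import sys

    # A manually appended workspace directory lands at the end of sys.path,
    # which is why it always has the lowest precedence.
    sys.path.append("/Workspace/Users/someone@example.com/my_project")  # hypothetical path

    for entry in sys.path:
        print(entry)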

The following list orders library sources from highest to lowest precedence; a lower number means higher precedence.

  1. Libraries in the current working directory (Repos only).

  2. Notebook-scoped libraries (%pip install in notebooks).

  3. Cluster libraries (using the UI, CLI, or API).

  4. Libraries included in Databricks Runtime.

    • Libraries installed with init scripts might resolve before or after built-in libraries, depending on how they are installed. Databricks does not recommend installing libraries with init scripts.

  5. Libraries in the current working directory (not in Repos).

  6. Workspace files appended to the sys.path.
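
To confirm which copy of a library actually resolved, check the imported module's file location and distribution version. A minimal sketch using pandas, one of the libraries bundled in Databricks Runtime, as an example:

    import importlib.metadata
    import pandas

    # The module path shows which sys.path entry (and therefore which
    # precedence level) supplied the library.
    print(pandas.__file__)

    # The distribution version shows whether a built-in or an overriding install won.
    print(importlib.metadata.version("pandas"))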