Access Azure Data Lake Storage Gen2 using OAuth 2.0 with an Azure service principal

Note

This article mentions the CLI, which is not available on this release of Databricks on Google Cloud.

You can securely access data in an Azure Data Lake Storage Gen2 (ADLS Gen2) account using OAuth 2.0 with an Azure Active Directory (Azure AD) application service principal for authentication.

You can provide access to multiple workspace users with different permissions, and you access the data directly through the Azure Blob File System (ABFS) driver.

This article describes how to create an Azure AD application and service principal and how to use that service principal to access data directly in an ADLS Gen2 storage account. The following is an overview of the tasks this article walks through:

  1. Register an Azure AD application. Registering the application creates an associated service principal that is used to access the storage account.
  2. Create a secret scope in your Databricks workspace. The secret scope securely stores the client secret associated with the Azure AD application.
  3. Save the client secret associated with the Azure AD application in the secret scope. The client secret is required to authenticate to the storage account; storing it in the secret scope lets you use it without referencing it directly in configuration (see the sketch after this list).
  4. Assign roles to the application to provide the service principal with the required permissions to access the ADLS Gen2 storage account.
  5. Create one or more containers inside the storage account. Like directories in a filesystem, containers provide a way to organize objects in an Azure storage account. You’ll need to create one or more containers before you can access an ADLS Gen2 storage account.
  6. Authenticate and access the ADLS Gen2 storage account through direct access.
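
Because the Databricks CLI is not available on this release, steps 2 and 3 can be done with the Databricks Secrets REST API instead. The following is a minimal sketch, assuming a hypothetical workspace URL, a personal access token in the DATABRICKS_TOKEN environment variable, and the placeholder scope and key names used later in this article:

import os
import requests

workspace_url = "https://<databricks-instance>"  # replace with your workspace URL
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Step 2: create a secret scope to hold the client secret.
requests.post(
    f"{workspace_url}/api/2.0/secrets/scopes/create",
    headers=headers,
    json={"scope": "<scope-name>"},
).raise_for_status()

# Step 3: store the Azure AD application's client secret in the scope.
requests.post(
    f"{workspace_url}/api/2.0/secrets/put",
    headers=headers,
    json={"scope": "<scope-name>", "key": "<service-credential-key-name>", "string_value": "<client-secret>"},
).raise_for_status()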

Requirements

Register an Azure Active Directory application

Registering an Azure AD application creates a service principal; assigning the appropriate roles (described in the next section) gives that service principal access to ADLS Gen2 storage resources.

  1. In the Azure portal, go to the Azure Active Directory service.

  2. Under Manage, click App Registrations.

  3. Click + New registration. Enter a name for the application and click Register.

  4. Click Certificates & Secrets.

  5. Click + New client secret.

  6. Add a description for the secret and click Add.

  7. Copy and save the value for the new secret.

  8. In the application registration overview, copy and save the Application (client) ID and Directory (tenant) ID.

    App registration overview
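
Before continuing, you can optionally confirm the registration works by requesting a token with the client credentials grant. The following is a minimal sketch, not part of the portal workflow, using the values you copied in the steps above:

import requests

tenant_id = "<directory-id>"       # Directory (tenant) ID
client_id = "<application-id>"     # Application (client) ID
client_secret = "<client-secret>"  # the secret value saved above

# Client credentials grant against the same Azure AD endpoint used in the Spark configurations later.
response = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": "https://storage.azure.com/",
    },
)
response.raise_for_status()
print("Token acquired, expires in", response.json()["expires_in"], "seconds")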

Assign roles

You control access to storage resources by assigning roles to the Azure AD application registration on the storage account. This example assigns the Storage Blob Data Contributor role for the ADLS Gen2 storage account. You may need to assign other roles depending on your specific requirements.

  1. In the Azure portal, go to the Storage accounts service.

  2. Select the ADLS Gen2 account to use with this application registration.

  3. Click Access Control (IAM).

  4. Click + Add and select Add role assignment from the dropdown menu.

  5. Set the Select field to the Azure AD application name and set Role to Storage Blob Data Contributor.

  6. Click Save.

    Assign application roles

Create a container

Like directories in a filesystem, containers provide a way to organize objects in an Azure storage account. You’ll need to create one or more containers before you can access an ADLS Gen2 storage account. You can create a container directly in a Databricks notebook (a sketch follows the portal steps below) or through the Azure command-line interface, the Azure API, or the Azure portal. To create a container through the portal:

  1. In the Azure portal, go to Storage accounts.

  2. Select your ADLS Gen2 account and click Containers.

  3. Click + Container.

  4. Enter a name for your container and click Create.

    Create a container
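
As an alternative to the portal, the following is a minimal sketch of creating a container from a Databricks notebook. It assumes the OAuth session configurations shown in the next section have already been set for the storage account, and relies on the ABFS driver option fs.azure.createRemoteFileSystemDuringInitialization to create the container (file system) on first access:

# Allow the ABFS driver to create the container if it does not exist yet.
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")

# Listing the container root triggers its creation.
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/")

# Turn the option back off once the container exists.
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")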

Access ADLS Gen2 directly

The way you pass credentials to access storage resources directly depends on whether you plan to use the DataFrame or Dataset API, or the RDD API.

DataFrame or Dataset API

If you are using Spark DataFrame or Dataset APIs, Databricks recommends that you set your account credentials in your notebook’s session configs:

spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

Replace

  • <storage-account-name> with the name of the ADLS Gen2 storage account.
  • <application-id> with the Application (client) ID for the Azure Active Directory application.
  • <scope-name> with the Databricks secret scope name.
  • <service-credential-key-name> with the name of the key containing the client secret.
  • <directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
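
Because the same five configurations are repeated for every storage account, you can wrap them in a small helper. This is a convenience sketch; the function name and parameters are illustrative and not part of any Databricks API:

def set_adls_oauth_configs(storage_account, application_id, directory_id, scope_name, key_name):
    """Set session-level OAuth configurations for one ADLS Gen2 storage account."""
    suffix = f"{storage_account}.dfs.core.windows.net"
    spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{suffix}",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", application_id)
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}",
                   dbutils.secrets.get(scope=scope_name, key=key_name))
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{suffix}",
                   f"https://login.microsoftonline.com/{directory_id}/oauth2/token")

set_adls_oauth_configs("<storage-account-name>", "<application-id>", "<directory-id>",
                       "<scope-name>", "<service-credential-key-name>")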

RDD API

If you use the RDD API to access ADLS Gen2, you cannot access Hadoop configuration options set using spark.conf.set(...). Instead, specify the Hadoop configuration options as Spark configs when you create the cluster. You must add the spark.hadoop. prefix to the Hadoop configuration keys to propagate them to the Hadoop configurations used by your RDD jobs.

Warning

These credentials are available to all users who access the cluster.

spark.hadoop.fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net <service-credential>
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token

Replace

  • <storage-account-name> with the name of the ADLS Gen2 storage account.
  • <application-id> with the Application (client) ID for the Azure Active Directory application.
  • <service-credential> with the value of the client secret.
  • <directory-id> with the Directory (tenant) ID for the Azure Active Directory application.

Use standard Spark and Databricks APIs to read from the storage account:

val df = spark.read.parquet("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")

dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")

Replace

  • <container-name> with the name of a container in the ADLS Gen2 storage account.
  • <storage-account-name> with the ADLS Gen2 storage account name.
  • <directory-name> with an optional path in the storage account.
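
If you are using the RDD API, the same paths work once the cluster-level spark.hadoop.* configurations above are in place. The following is a minimal sketch, assuming text data exists at the hypothetical path shown:

# Session-level spark.conf.set values are not visible to RDD jobs; the cluster-level
# spark.hadoop.* configurations above are required here.
rdd = spark.sparkContext.textFile(
    "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>/events.txt"
)
print(rdd.count())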

Example notebook

This notebook demonstrates using a service principal to:

  1. Authenticate to an ADLS Gen2 storage account.
  2. Mount a filesystem in the storage account (see the sketch after this list).
  3. Write a JSON file containing Internet of Things (IoT) data to the new container.
  4. List files using direct access and through the mount point.
  5. Read and display the IoT file using direct access and through the mount point.
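
The mount in step 2 passes the same OAuth settings through extra_configs. The following is a minimal sketch of that step, assuming the placeholder names used throughout this article; the mount point is illustrative:

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token",
}

# Mount the container at an illustrative mount point.
dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point="/mnt/adls-gen2",
    extra_configs=configs,
)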

ADLS Gen2 OAuth 2.0 with Azure service principals notebook
