Google Cloud Storage

This article describes how to read from and write to Google Cloud Storage (GCS) buckets in Databricks. To read from or write to a GCS bucket, you must create a service account, grant it access to the bucket, and associate the cluster with that service account when you create the cluster.

The way to associate a cluster with a service account depends on how you connect to a GCS bucket:

  • Using DBFS: You must use the service account email address.
  • Directly: You can use the service account email address (recommended approach) or a key that you generate for the service account.

Access a GCS bucket through DBFS

Step 1: Set up Google Cloud service account using Google Cloud Console

You must create a service account for the Databricks cluster. Databricks recommends giving this service account only the privileges it needs to perform its tasks.

Important

The service account must be in the Google Cloud project that you used to set up the Databricks workspace.

  1. Click IAM and Admin in the left navigation pane.

  2. Click Service Accounts.

  3. Click + CREATE SERVICE ACCOUNT.

  4. Enter the service account name and description.

  5. Click CREATE.

  6. Click CONTINUE.

  7. Click DONE.

  8. Navigate to the Google Cloud Console list of service accounts and select a service account.

    Copy the associated email address. You will need it when you set up Databricks clusters.
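
If you prefer to create the service account programmatically rather than in the console, the following is a minimal Python sketch using the Google IAM API client (google-api-python-client and google-auth). The project ID, account ID, and display name are placeholders, and the credentials used must be allowed to create service accounts in that project.

import google.auth
from googleapiclient import discovery

# Uses Application Default Credentials; "<project-id>" is a placeholder.
credentials, _ = google.auth.default()
iam = discovery.build("iam", "v1", credentials=credentials)

account = (
    iam.projects()
    .serviceAccounts()
    .create(
        name="projects/<project-id>",
        body={
            "accountId": "databricks-gcs",  # placeholder account ID
            "serviceAccount": {"displayName": "Databricks GCS access"},
        },
    )
    .execute()
)

# The returned email address is what you enter later in the Google Service Account field.
print(account["email"])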

Step 2: Configure your GCS bucket

Create a bucket

If you do not already have a bucket, create one:

  1. Click Storage in the left navigation pane.

  2. Click CREATE BUCKET.

  3. Click CREATE.
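
If you prefer to create the bucket programmatically, the following is a minimal sketch using the google-cloud-storage Python client; the project ID, bucket name, and location are placeholders.

from google.cloud import storage

# "<project-id>", "<bucket-name>", and the location are placeholders.
client = storage.Client(project="<project-id>")
bucket = client.create_bucket("<bucket-name>", location="US")
print("Created bucket " + bucket.name)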

Configure the bucket

  1. Configure the bucket details.

  2. Click the Permissions tab.

  3. Next to the Permissions label, click ADD.

  4. Grant the service account the Storage Admin role (under Cloud Storage roles) on the bucket. A programmatic alternative is sketched after these steps.

  5. Click SAVE.
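
The same grant can be made programmatically with the google-cloud-storage client by appending a binding to the bucket's IAM policy. A minimal sketch, with placeholder project, bucket, and service account values:

from google.cloud import storage

client = storage.Client(project="<project-id>")
bucket = client.bucket("<bucket-name>")

# Fetch the bucket's IAM policy and add a binding that grants the
# Storage Admin role to the service account created in Step 1.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.admin",
        "members": {"serviceAccount:<service-account-email>"},
    }
)
bucket.set_iam_policy(policy)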

Step 3: Set up a Databricks cluster

When you configure your cluster, expand Advanced Options and set the Google Service Account field to your service account email address.
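
If you create clusters through the Clusters API instead of the UI, the service account email address goes in the cluster specification. The sketch below is illustrative only: it assumes your workspace's Clusters API accepts a gcp_attributes.google_service_account field, and the workspace URL, token, runtime version, and node type are all placeholders.

import requests

host = "https://<workspace-url>"    # placeholder workspace URL
token = "<personal-access-token>"   # placeholder API token

cluster_spec = {
    "cluster_name": "gcs-demo",
    "spark_version": "<runtime-version>",  # placeholder runtime
    "node_type_id": "<node-type>",         # placeholder node type
    "num_workers": 1,
    # Assumption: the workspace exposes gcp_attributes.google_service_account.
    "gcp_attributes": {"google_service_account": "<service-account-email>"},
}

response = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
response.raise_for_status()
print(response.json()["cluster_id"])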

Step 4: Mount the bucket

You can mount a bucket to Databricks File System (DBFS). The mount is a pointer to a GCS location, so the data is never synced locally.

Python

bucket_name = "<bucket-name>"
mount_name = "<mount-name>"
dbutils.fs.mount("gs://%s" % bucket_name, "/mnt/%s" % mount_name)
display(dbutils.fs.ls("/mnt/%s" % mount_name))

Scala

val bucketName = "<bucket-name>"
val mountName = "<mount-name>"

dbutils.fs.mount(s"gs://$bucketName", s"/mnt/$mountName")
display(dbutils.fs.ls(s"/mnt/$mountName"))
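
Once the mount is in place, paths under /mnt/<mount-name> behave like any other DBFS path. A minimal Python sketch (the data path under the mount is a placeholder), including how to detach the mount when you are done:

# Read a Parquet dataset stored somewhere under the mount point.
df = spark.read.format("parquet").load("/mnt/<mount-name>/<path-to-data>")
display(df)

# Detach the mount when it is no longer needed.
dbutils.fs.unmount("/mnt/<mount-name>")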

Access a GCS bucket directly

To read from and write to a bucket directly, you can either set the service account email address on the cluster or configure a key in the cluster's Spark config.

Step 1: Set up Google Cloud service account using Google Cloud Console

You must create a service account for the Databricks cluster. Databricks recommends giving this service account the least privileges needed to perform its tasks.

  1. Click IAM and Admin in the left navigation pane.

  2. Click Service Accounts.

  3. Click + CREATE SERVICE ACCOUNT.

  4. Enter the service account name and description.

  5. Click CREATE.

  6. Click CONTINUE.

  7. Click DONE.

  8. Get the service account email address or generate a key for the service account.

    Note

    Databricks recommends using the service account email address because no keys are involved, so there is no risk of leaking them. One reason to use a key is if the service account must be in a different Google Cloud project than the one used to create the workspace.

    • Service account email address: Navigate to the Google Cloud Console list of service accounts. Select a service account. Copy the email address associated with it. You will need it in the cluster setup page.

      Important

      If you use the service account email address approach, the service account must be in the same Google Cloud project that you used to set up the Databricks workspace.

    • Key: Create a key. See Create a key to access GCS bucket directly.

Create a key to access GCS bucket directly

Warning

The JSON key you generate for the service account is a private key that should be shared only with authorized users, because it controls access to datasets and resources in your Google Cloud account.

  1. In the Google Cloud console, in the service accounts list, click the newly created account.

  2. In the Keys section, click ADD KEY > Create new key.

  3. Accept the JSON key type.

  4. Click CREATE. The key file is downloaded to your computer.

Step 2: Configure the GCS bucket

Create a bucket

If you do not already have a bucket, create one:

  1. Click Storage in the left navigation pane.

  2. Click CREATE BUCKET.

  3. Click CREATE.

Configure the bucket

  1. Configure the bucket details.

  2. Click the Permissions tab.

  3. Next to the Permissions label, click ADD.

  4. Grant the service account the Storage Admin role (under Cloud Storage roles) on the bucket.

  5. Click SAVE.

Step 3: Set up a Databricks cluster

When you configure your cluster:

  1. In the Databricks Runtime Version drop-down, select 7.3 LTS or above.

  2. You can authenticate with the service account email address or a key that you generate for the service account.

    • Service account email address: Expand Advanced Options and set Google Service Account field to your service account email address.

    • Key: In the Spark Config tab, add the following Spark configuration and replace the items in angle brackets (such as <client_email>) with the values of the fields with those exact names in your key JSON file. For <private_key>, copy the entire contents of the multi-line value, including quotes. A small helper for extracting these values from the key file appears after this configuration.

      spark.hadoop.google.cloud.auth.service.account.enable true
      spark.hadoop.fs.gs.auth.service.account.email <client_email>
      spark.hadoop.fs.gs.project.id <project_id>
      spark.hadoop.fs.gs.auth.service.account.private.key <private_key>
      spark.hadoop.fs.gs.auth.service.account.private.key.id <private_key_id>
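
To avoid copy-and-paste mistakes with the key approach, you can print the relevant fields straight from the downloaded key file with a small Python helper; "key.json" is a placeholder for wherever you saved the key.

import json

# "key.json" is a placeholder for the key file downloaded in
# "Create a key to access GCS bucket directly".
with open("key.json") as f:
    key = json.load(f)

# These are the exact field names referenced in the Spark configuration above.
for field in ("client_email", "project_id", "private_key", "private_key_id"):
    print(field + ": " + key[field])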
      

Step 4: Usage

To read from the GCS bucket, use a Spark read command in any supported format, for example:

df = spark.read.format("parquet").load("gs://<bucket-name>/<path>")

To write to the GCS bucket, use a Spark write command in any supported format, for example:

df.write.format("parquet").mode("<mode>").save("gs://<bucket-name>/<path>")

Replace <bucket-name> with the name of the bucket you created in Step 2: Configure the GCS bucket.

Example notebooks

Read from Google Cloud Storage notebook

Write to Google Cloud Storage notebook