Limit network egress for your workspace using a firewall

By default, Google Cloud allows all egress traffic from your VPC. Optionally, you can configure your workspace to limit egress to only the essential services and destinations that your organization requires. To allow the essential services, you need to modify the Google Cloud network (VPC) and the firewall that Databricks creates for your workspace. For some services, you can set up Private Google Access to communicate with Google APIs over private IP addresses.

For the latest updates on how these Google Cloud features work, refer to the Google Cloud documentation.

If you want to configure a corporate domain name firewall to connect to the Databricks web application, instead see Configure domain name firewall rules.

For a general overview of Databricks architecture including the definitions of classic compute plane and control plane, see Databricks architecture overview.

Firewall configuration overview

If you want to limit egress from the classic compute plane, you must make two general types of configuration changes to allow required connections that would otherwise be blocked:

  • Private Google Access for GCR and GCS: By default, when a Compute Engine virtual machine (VM) such as a Databricks cluster node lacks an external (public) IP address assigned to its network interface, it can send packets only to other internal IP address destinations. You can allow these VMs to connect to the set of external IP addresses used by Google APIs and services by enabling Private Google Access (PGA) on the subnet that the VM’s network interface uses. With PGA, direct communication between the Databricks compute plane and these Google services uses Google’s internal network rather than the public internet. This affects outgoing connections from the compute plane to Google Container Registry (GCR) and Google Cloud Storage (GCS).

  • Firewall rules for control plane services: After you block egress by modifying the built-in firewall, you must allow several essential services by adding new firewall rules:

    • The egress to the web application and REST APIs on port 443.

    • The egress to the secure cluster connectivity relay on port 443.

    • The egress to the default Hive metastore on port 3306. If you deploy your own external metastore, you do not need to add network rules for the unused default metastore.

    • The egress on ports 8443-8451 to support upcoming features.

    The IP address ranges for the web application, secure cluster connectivity relay, and the default Hive metastore vary by region. For the values for your region, see the table in IP addresses and domains.

This article includes examples of Google Cloud configuration using the gcloud command line interface (CLI). To use these examples, you must install the Google Cloud SDK.

Control plane service endpoint IP addresses by region

If you want to configure a firewall to block egress, you must define new VPC egress firewall rules and routes to allow essential services that are hosted in the Databricks control plane. Endpoint IP addresses vary by the Google Cloud region of your Databricks workspace. Use the table in IP addresses and domains to get the port and IP address of the control plane ingresses for your workspace’s Google Cloud region.

Step 1: Plan your network sizing

If you plan to use large clusters or many workspaces, ensure that your workspaces will have sufficient IP space to run Databricks jobs by calculating your GKE subnet ranges by using the network sizing calculator.

Configure the GKE subnets used by your Databricks workspace accurately because you cannot change them after your workspace is deployed. If the address ranges for your Databricks subnets are too small, then the workspace exhausts its IP space, which in turn causes your Databricks jobs to fail. To determine the address range sizes that you need, Databricks provides a calculator in the form of a Microsoft Excel spreadsheet. See Calculate subnet sizes for a new workspace.
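
As a rough sanity check before you open the calculator, the following sketch shows the arithmetic behind CIDR range sizes. It is not the Databricks calculator and does not model how Databricks or GKE consume addresses. The function names are hypothetical, and the subtraction of four addresses assumes the Google Cloud behavior of reserving four addresses in each primary IPv4 subnet range.

```shell
#!/bin/sh
# Print the total number of IPv4 addresses in a CIDR block with the given
# prefix length (for example, 24 for a /24).
cidr_total() {
  echo $(( 1 << (32 - $1) ))
}

# Print the number of addresses left for VMs in a primary subnet range.
# Assumption: Google Cloud reserves four addresses in each primary IPv4 range.
cidr_usable() {
  echo $(( $(cidr_total "$1") - 4 ))
}

cidr_total 29    # prints 8
cidr_usable 24   # prints 252
```

Compare these totals with the node, pod, and service counts that the calculator reports for your planned cluster sizes.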

Step 2: Create a workspace

  1. Log in to the account console.

  2. Create a new Databricks workspace.

    • In the Advanced Configuration section of the form to create the workspace, you must use the default settings for Enable private cluster, which causes the workspace to use a private GKE cluster. For a private GKE cluster, Databricks compute instances have no public IP addresses.

    • Write down all IP subnet settings that appear in the Advanced Configuration section during workspace creation. There are defaults but you can override them. For details see Create a workspace using the account console.

      Firewall subnets
  3. Ensure the Databricks VPC and subnets in your Google account have Private Google Access (PGA) enabled.

    Firewall PGA
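
If you prefer the CLI to the console for item 3, a minimal sketch might look like the following. The subnet name and region are hypothetical placeholders, and the run helper is an illustration that only prints each command unless you set DRY_RUN=0, so you can review the command before it changes anything.

```shell
#!/bin/sh
# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Enable Private Google Access on one subnet.
# $1 = subnet name (hypothetical), $2 = region (hypothetical).
enable_pga() {
  run gcloud compute networks subnets update "$1" \
    --region "$2" \
    --enable-private-ip-google-access
}

# Show the Private Google Access setting of one subnet.
check_pga() {
  run gcloud compute networks subnets describe "$1" \
    --region "$2" \
    --format "get(privateIpGoogleAccess)"
}

enable_pga my-databricks-subnet us-central1
check_pga my-databricks-subnet us-central1
```

In dry-run mode the helper prints arguments without shell quoting, so quote the --format value again if you copy the printed command into a terminal.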

Step 3: Add VPC firewall rules

You must now add firewall rules to the firewall that Databricks created for your new workspace.

  1. Get your VPC name. Databricks created this VPC in your Google account as part of workspace creation.

    1. From the account console workspace page, click on your workspace to launch it.

    2. Copy the numeric part of the URL after ?o=. For example, if the workspace URL is https://1676665108650415.5.gcp.databricks.com/?o=1676665108650415#, the workspace ID is 1676665108650415.

    3. The VPC for your workspace has the name databricks-managed-<workspace-ID>. For example, databricks-managed-1676665108650415. Write down this value so you can use it with CLI commands.

  2. If you want to use the Google Cloud console to add firewall rules:

    1. Go to the VPCs page on Google Cloud console.

    2. Click on the VPC name for the network that matches your workspace’s VPC name.

    3. Click on the FIREWALL RULES tab.

  3. Add a rule to block egress to 0.0.0.0/0, which is a special syntax that means all external IP addresses.

    By default, Google Cloud allows all egress traffic from your VPC. You need to create a rule blocking this traffic. Ensure that the priority for this rule is set to a large number so that Google Cloud applies it only after evaluating all other rules. For example, you could set this rule to priority 1100 and then set your egress allow rule priorities in the range 1000-1099.

    Important

    If you incorrectly use a low priority number for this rule (a lower number means higher precedence), Google Cloud denies network traffic to essential Databricks and GKE services, which would render your workspace unusable.

    Because of this rule, any traffic going to internal IP space (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) needs an additional egress rule to allow the cluster to reach those IP addresses.

    You can use the following gcloud CLI command to create this rule. Replace <vpc-name> with your VPC name.

    gcloud compute firewall-rules create deny-egress \
    --action DENY \
    --rules all \
    --destination-ranges 0.0.0.0/0 \
    --direction EGRESS \
    --priority 1100 \
    --network <vpc-name>
    
  4. Add a rule to allow ingress from Google health checks.

    You can use the following gcloud CLI command to create this rule. Replace <vpc-name> with your VPC name.

    gcloud compute firewall-rules create from-gcp-healthcheck \
    --action ALLOW \
    --rules tcp:80,tcp:443 \
    --source-ranges 130.211.0.0/22,35.191.0.0/16 \
    --direction INGRESS \
    --priority 1010 \
    --network <vpc-name>
    
  5. Add a rule to allow egress to Google health checks.

    You can use the following gcloud CLI command to create this rule. Replace <vpc-name> with your VPC name.

    gcloud compute firewall-rules create to-gcp-healthcheck \
    --action ALLOW \
    --rules tcp:80,tcp:443 \
    --destination-ranges 130.211.0.0/22,35.191.0.0/16 \
    --direction EGRESS \
    --priority 1000 \
    --network <vpc-name>
    
  6. Add a rule to allow egress to restricted APIs (restricted.googleapis.com), which is required to reach the GKE API at a fixed IP address range.

    You can use the following gcloud CLI command to create this rule. Replace <vpc-name> with your VPC name.

    gcloud compute firewall-rules create to-google-apis \
    --action ALLOW \
    --rules all \
    --destination-ranges 199.36.153.4/30 \
    --direction EGRESS \
    --priority 1000 \
    --network <vpc-name>
    
  7. Add a rule to allow egress to the IP range for GKE master resources. There is a default GKE master IP range, but you can override the values during workspace creation. See Create a workspace using the account console.

    You can use the following gcloud CLI command to create this rule. Replace <vpc-name> with your VPC name. Replace X.X.X.X/28 with your IP range for GKE master resources.

    gcloud compute firewall-rules create to-gke-master \
    --action ALLOW \
    --rules tcp:443,tcp:10250 \
    --destination-ranges X.X.X.X/28 \
    --direction EGRESS \
    --priority 1000 \
    --network <vpc-name>
    
  8. Add a rule to allow intra-subnet communication (egress) used by the Databricks workspace. There are defaults for these subnets that are visible during workspace creation in the Advanced Configuration section. You need to allow egress to all three of the subnet ranges that are labelled in the user interface as Primary IP range for GKE nodes, Secondary IP range for GKE pods, and Secondary IP range for GKE services.

    You can use the following gcloud CLI command to create this rule. Replace <vpc-name> with your VPC name. Replace X.X.X.X/X,X.X.X.X/X,X.X.X.X/X with your three Databricks subnets that you provided during workspace creation.

    gcloud compute firewall-rules create to-gke-nodes-subnet \
    --action ALLOW \
    --rules all \
    --destination-ranges X.X.X.X/X,X.X.X.X/X,X.X.X.X/X \
    --direction EGRESS \
    --priority 1000 \
    --network <vpc-name>
    
  9. Add a rule to allow egress to Databricks control plane services. Use the table in IP addresses and domains to determine the correct values for each supported Databricks region for these services.

    Allow connections to the main web application and the secure cluster connectivity (SCC) relay on port 443. Use the following gcloud CLI command to create this rule. Replace <vpc-name> with your VPC name. Replace X.X.X.X with your regional ingress address for the web application. Replace Y.Y.Y.Y with your regional IP address range for the SCC relay.

    gcloud compute firewall-rules create to-databricks-control-plane \
    --action ALLOW \
    --rules tcp:443 \
    --destination-ranges X.X.X.X/32,Y.Y.Y.Y/32 \
    --direction EGRESS \
    --priority 1000 \
    --network <vpc-name>
    
  10. If you use the default metastore, add a rule to allow connection to the metastore on port 3306. Use the following gcloud CLI command to create this rule. Replace <vpc-name> with your VPC name. Replace X.X.X.X with your regional IP address range for the default metastore.

    gcloud compute firewall-rules create to-databricks-managed-hive \
    --action ALLOW \
    --rules tcp:3306 \
    --destination-ranges X.X.X.X/32 \
    --direction EGRESS \
    --priority 1000 \
    --network <vpc-name>
    

    Important

    If you deployed your own metastore instance, create a rule to allow connection to it instead of the default metastore.

  11. Add any other network egress rules that your organization requires to allow required connections to data sources or other systems.

    Important

    For custom egress ALLOW rules, set priority numbers in the range 1000-1099. It is important for the priority number to be lower than the value you used to block all egress, which is set to priority 1100 in the earlier steps.
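
The VPC name from step 1 follows directly from the workspace URL, so you can derive it in a script instead of copying it by hand. The following is a minimal sketch. The function names are hypothetical, and it assumes the workspace URL contains the ?o= parameter as in the example in step 1.

```shell
#!/bin/sh
# Extract the numeric workspace ID that follows "?o=" in a workspace URL.
workspace_id_from_url() {
  echo "$1" | sed -n 's/.*[?&]o=\([0-9][0-9]*\).*/\1/p'
}

# Build the VPC name that Databricks uses for a workspace ID.
vpc_name_for_workspace() {
  echo "databricks-managed-$1"
}

url='https://1676665108650415.5.gcp.databricks.com/?o=1676665108650415#'
id=$(workspace_id_from_url "$url")
vpc_name_for_workspace "$id"   # prints databricks-managed-1676665108650415
```

You can pass the resulting value wherever the commands in this step use <vpc-name>.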

Your firewall rules should look similar to the following, though you must use the regional IP address for the control plane services for the region of your workspace.

Firewall rules
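
Because a lower priority number takes precedence in Google Cloud, and a deny rule wins over an allow rule of equal priority, a quick check of the priorities you plan to use can catch mistakes before you create any rules. The following sketch illustrates that comparison. The function name is hypothetical, and 1100 is the deny-egress priority used in this article.

```shell
#!/bin/sh
# Succeed only if an ALLOW rule takes precedence over the deny-all rule.
# In Google Cloud, a lower number means a higher priority, and at equal
# priority a DENY rule overrides an ALLOW rule.
# $1 = priority of the allow rule, $2 = priority of the deny-all egress rule.
allow_wins() {
  [ "$1" -lt "$2" ]
}

DENY_PRIORITY=1100   # value used for deny-egress in this article
for p in 1000 1099 1100 1200; do
  if allow_wins "$p" "$DENY_PRIORITY"; then
    echo "priority $p: takes precedence over deny-egress"
  else
    echo "priority $p: blocked by deny-egress"
  fi
done
```

With these inputs, only 1000 and 1099 are reported as taking precedence, which matches the 1000-1099 range recommended for custom egress rules.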

Step 4: Update VPC routes

Google Cloud routes define the paths that network traffic takes from a virtual machine (VM) instance to other destinations. In a VPC network, a route consists of a single destination prefix in CIDR format and a single next hop.

  1. Remove the default route that was automatically created in the VPC, which is 0.0.0.0/0 with the default internet gateway as the next hop. Use the console UI to remove this route to clear any paths to the internet.

    1. Go to the routes list in Google Cloud console.

    2. Select the route to 0.0.0.0/0 and click DELETE.

      Delete route
  2. Create a route in the VPC to send traffic to the restricted Google APIs subnet (199.36.153.4/30). Set the next hop as default-internet-gateway as per the requirement for accessing private Google APIs.

    You can use the following gcloud CLI command to create this route. Replace <vpc-name> with your VPC name. Replace <project-name> with your project name.

    gcloud beta compute routes create route-to-google-apis \
    --project=<project-name> --description="Route to Google restricted APIs" \
    --network=<vpc-name> \
    --priority=1000 \
    --destination-range=199.36.153.4/30 \
    --next-hop-gateway=default-internet-gateway
    
  3. Create a route to reach Databricks control plane services. Get the values for your region in IP addresses and domains.

    You can use the following gcloud CLI command to create the route to the web application. Replace <vpc-name> with your VPC name. Replace <project-name> with your project name. Replace X.X.X.X/32 with the regional IP address range for the web application.

    gcloud beta compute routes create route-to-databricks-webapp \
    --project=<project-name> --description="Route to Databricks web app" \
    --network=<vpc-name> \
    --priority=1000 \
    --destination-range=X.X.X.X/32 \
    --next-hop-gateway=default-internet-gateway
    

    You can use the following gcloud CLI command to create the route to the secure cluster connectivity relay. Replace <vpc-name> with your VPC name. Replace <project-name> with your project name. Replace X.X.X.X/32 with the regional IP address range for the SCC relay.

    gcloud beta compute routes create route-to-databricks-scc-service \
    --project=<project-name> --description="Route to Databricks SCC relay" \
    --network=<vpc-name> \
    --priority=1000 \
    --destination-range=X.X.X.X/32 \
    --next-hop-gateway=default-internet-gateway
    

    If you use the default metastore, use the following gcloud CLI command to create the route to the default metastore. Replace <vpc-name> with your VPC name. Replace <project-name> with your project name. Replace X.X.X.X/32 with the regional IP address range for the metastore.

    gcloud beta compute routes create route-to-databricks-metastore \
    --project=<project-name> --description="Route to default metastore" \
    --network=<vpc-name> \
    --priority=1000 \
    --destination-range=X.X.X.X/32 \
    --next-hop-gateway=default-internet-gateway
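
Route names must be unique within a project, so each control plane destination needs its own route name. The following sketch wraps the command shown above in a function that derives the name from a suffix. The suffixes, VPC name, and project name are hypothetical, the addresses come from the documentation-only 192.0.2.0/24 range and must be replaced with the values for your region, and the run helper only prints each command unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create one route to a control plane destination through the default
# internet gateway.
# $1 = route name suffix, $2 = destination CIDR, $3 = VPC name, $4 = project.
create_route() {
  run gcloud beta compute routes create "route-to-databricks-$1" \
    --project="$4" \
    --network="$3" \
    --priority=1000 \
    --destination-range="$2" \
    --next-hop-gateway=default-internet-gateway
}

# Placeholder addresses (192.0.2.0/24 is reserved for documentation).
create_route webapp    192.0.2.10/32 my-vpc my-project
create_route scc-relay 192.0.2.20/32 my-vpc my-project
create_route metastore 192.0.2.30/32 my-vpc my-project
```

Skip the metastore call if you do not use the default metastore.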
    

Step 5: Create DNS zone for Google APIs so you can enable Private Google Access for GCS and GCR

You must now create a DNS zone for Google APIs, which is required to enable Private Google Access (PGA) from the classic compute plane to the two Google APIs for GCS and GCR. For details about enabling PGA, see the Google documentation article Configure Private Google Access.

  1. Customize the DNS in the VPC for the domain googleapis.com using Cloud DNS. Typically googleapis.com resolves to public IP addresses, but for Private Google Access you must use restricted.googleapis.com. This domain resolves to a specific range, 199.36.153.4/30, which is accessible only within the Google Cloud network.

    The Databricks classic compute plane GKE cluster requires access to GCS and GCR. For PGA, you need to ensure that the access to restricted APIs is resolved or else GKE does not start successfully.

    You can use the following gcloud commands to create the DNS zone. Replace <vpc-name> with your VPC name.

    gcloud dns managed-zones create google-apis \
    --description "private zone for Google APIs" \
    --dns-name googleapis.com \
    --visibility private \
    --networks <vpc-name>
    
    gcloud dns record-sets transaction start --zone google-apis
    
    gcloud dns record-sets transaction add restricted.googleapis.com. \
    --name "*.googleapis.com." \
    --ttl 300 \
    --type CNAME \
    --zone google-apis
    
    gcloud dns record-sets transaction add "199.36.153.4" "199.36.153.5" "199.36.153.6" "199.36.153.7" \
    --name restricted.googleapis.com. \
    --ttl 300 \
    --type A \
    --zone google-apis
    
    gcloud dns record-sets transaction execute --zone google-apis
    

    The objects in Google Cloud Console look similar to the following:

    Zone for API
  2. Make similar zone settings for the GCR.io domain to send traffic to restricted Google APIs subnet for reaching the container registry. Without these changes, the GKE cluster won’t stand up successfully since it won’t be able to pull down all the necessary Databricks runtime containers.

    You can use the following gcloud commands to create these DNS zones. Replace <vpc-name> with your VPC name.

    gcloud dns managed-zones create gcr-io \
    --description "private zone for GCR.io" \
    --dns-name gcr.io \
    --visibility private \
    --networks <vpc-name>
    
    gcloud dns record-sets transaction start --zone gcr-io
    
    gcloud dns record-sets transaction add gcr.io. \
    --name "*.gcr.io." \
    --ttl 300 \
    --type CNAME \
    --zone gcr-io
    
    gcloud dns record-sets transaction add "199.36.153.4" "199.36.153.5" "199.36.153.6" "199.36.153.7" \
    --name gcr.io. \
    --ttl 300 \
    --type A \
    --zone gcr-io
    
    gcloud dns record-sets transaction execute --zone gcr-io
    

    The objects in Google Cloud Console look similar to the following:

    Zone for GCR
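
To confirm that the private zones take effect, you can resolve a Google API hostname from a VM inside the VPC and check that every answer falls in 199.36.153.4/30. The following is a minimal sketch with hypothetical function names. It assumes the getent utility is available on the VM, as it is on most Linux distributions, and it gives meaningful results only when run from inside the VPC, because private zones are not visible elsewhere.

```shell
#!/bin/sh
# Succeed if the IPv4 address is one of the four addresses in 199.36.153.4/30.
is_restricted_vip() {
  case "$1" in
    199.36.153.4|199.36.153.5|199.36.153.6|199.36.153.7) return 0 ;;
    *) return 1 ;;
  esac
}

# Resolve a hostname and label each returned IPv4 address.
# Assumption: getent is installed on the machine running this check.
check_host() {
  getent ahostsv4 "$1" | awk '{print $1}' | sort -u | while read -r ip; do
    if is_restricted_vip "$ip"; then
      echo "$1 -> $ip (restricted range)"
    else
      echo "$1 -> $ip (NOT in restricted range)"
    fi
  done
}

# Example calls, to be run from a VM in the workspace VPC:
#   check_host storage.googleapis.com
#   check_host gcr.io
```

Any address reported as outside the restricted range suggests that the corresponding zone or record set is missing or not attached to the VPC.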

Step 6: Validate configuration

  1. Create a Databricks cluster and start it.

  2. Wait to confirm that it starts successfully and has state Running.

    Validate cluster
  3. Go to the GKE workload page on Google Cloud Console.

  4. In the Cluster picker, choose your cluster. The cluster for your workspace is named db- followed by the Databricks workspace ID, which is a large number.

  5. Confirm that all services show status OK.

    Validate services

Troubleshooting

Cannot start GKE cluster

If the GKE cluster does not start, DNS might be misconfigured for the VPC that your Databricks workspace uses and that has Private Google Access enabled. If nodes cannot resolve googleapis.com hostnames, node registration fails, which in turn causes GKE cluster creation to fail.

You can view logs by checking the serial port logs of the corresponding GKE nodes. You might see errors like the following:

    [ 203.604558] configure.sh[948]: curl: (6) Couldn't resolve host 'storage.googleapis.com'
    [ 203.605422] configure.sh[948]: == Failed to download https://storage.googleapis.com/gke-release/npd-custom-plugins/v1.0.1/npd-custom-plugins-v1.0.1.tar.gz. Retrying. ==
    [ 203.614049] configure.sh[948]: https://www.googleapis.com/auth/devstorage.read_write

These errors indicate that the subnet this cluster uses has Private Google Access enabled but the setup is incomplete. Complete the Private Google Access setup by following the instructions in the Google PGA documentation.
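
To scan a long serial port log for this failure, you can filter it for the hostnames that could not be resolved. The following is a minimal sketch with a hypothetical function name. It assumes each curl error appears on a single log line in the format shown above.

```shell
#!/bin/sh
# Read serial port log text on stdin and print each hostname that curl
# could not resolve, one per line, without duplicates.
unresolved_hosts() {
  sed -n "s/.*Couldn't resolve host '\([^']*\)'.*/\1/p" | sort -u
}

# Usage with a sample line in the format shown above.
printf "%s\n" "[ 203.604558] configure.sh[948]: curl: (6) Couldn't resolve host 'storage.googleapis.com'" | unresolved_hosts
# prints storage.googleapis.com

# Against a real node (instance name and zone are placeholders):
#   gcloud compute instances get-serial-port-output <instance-name> --zone <zone> | unresolved_hosts
```

If the output lists googleapis.com or gcr.io hostnames, review the DNS zones from Step 5.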

GKE cluster stops working or cannot start after previously starting successfully

The GKE cluster might stop working or fail to start because the Compute Engine default service account on the project hosting GKE is misconfigured. This happens when the service account that is attached to the nodes of the problematic cluster is in the disabled state or lacks the minimum required roles. The service account is <project-number>-compute@developer.gserviceaccount.com, where <project-number> is the numeric project number rather than the project ID.

If the service account is disabled, node registration can fail.

Try to enable the service account by following this Google help center article, and then recreate the cluster or node pool to see if that resolves the issue.

You can also use the IAM and Admin page in Google Cloud Console to inspect the service accounts and the corresponding nodes of the problematic cluster.

By default, the Compute Engine default service account is created with the IAM project editor role. You could change it to the bare minimum required role as follows:

  1. Get current roles that are assigned to the default compute service account.

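    # {{project-number}} is the numeric project number, which prefixes the default compute service account.
    gcloud projects get-iam-policy {{project-id}} \
    --flatten="bindings[].members" \
    --format='table(bindings.role)' \
    --filter="bindings.members:{{project-number}}-compute@developer.gserviceaccount.com"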
    
  2. Add the required minimum roles to the default compute service account.

    gcloud projects add-iam-policy-binding {{project-id}} \
    --member "serviceAccount:{{project-number}}-compute@developer.gserviceaccount.com" \
    --role roles/logging.logWriter
    
    gcloud projects add-iam-policy-binding {{project-id}} \
    --member "serviceAccount:{{project-number}}-compute@developer.gserviceaccount.com" \
    --role roles/monitoring.metricWriter
    
    gcloud projects add-iam-policy-binding {{project-id}} \
    --member "serviceAccount:{{project-number}}-compute@developer.gserviceaccount.com" \
    --role roles/monitoring.viewer
    
    gcloud projects add-iam-policy-binding {{project-id}} \
    --member "serviceAccount:{{project-number}}-compute@developer.gserviceaccount.com" \
    --role roles/stackdriver.resourceMetadata.writer
    
  3. Remove the project editor role from the default compute service account.

    gcloud projects remove-iam-policy-binding {{project-id}} \
    --member "serviceAccount:{{project-number}}-compute@developer.gserviceaccount.com" \
    --role roles/editor
    
  4. Validate roles assigned to the default compute service account.

    gcloud projects get-iam-policy {{project-id}} \
    --flatten="bindings[].members" \
    --format='table(bindings.role)' \
    --filter="bindings.members:{{project-number}}-compute@developer.gserviceaccount.com"
    
  5. Ensure that this service account has the right set of roles on the project.
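
For the final check, you can compare the roles reported by the validation command against the four roles added above. The following is a minimal sketch with hypothetical names. It reads one role per line on stdin and ignores any line that is not an exact match for a required role, so the table header printed by the validation command does not affect the result.

```shell
#!/bin/sh
# The four minimum roles that the steps above grant.
REQUIRED_ROLES="roles/logging.logWriter
roles/monitoring.metricWriter
roles/monitoring.viewer
roles/stackdriver.resourceMetadata.writer"

# Read the currently assigned roles on stdin (one per line) and print
# every required role that is absent.
missing_roles() {
  current=$(cat)
  echo "$REQUIRED_ROLES" | while read -r role; do
    if ! printf '%s\n' "$current" | grep -qxF -- "$role"; then
      echo "$role"
    fi
  done
}

# Usage with sample input in which two required roles are absent.
printf '%s\n' roles/logging.logWriter roles/monitoring.viewer | missing_roles
# prints:
# roles/monitoring.metricWriter
# roles/stackdriver.resourceMetadata.writer
```

Piping the output of the validation command in step 4 into missing_roles prints nothing when all four roles are present.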