Databricks SQL Connector for Python

General Availability

This library is Generally Available.

The Databricks SQL Connector for Python is a Python library that allows you to use Python code to run SQL commands on Databricks clusters. The Databricks SQL Connector for Python is easier to set up and use than similar Python libraries such as pyodbc. This library follows PEP 249 – Python Database API Specification v2.0.

Requirements

  • A development machine running Python 3.7 or higher.

  • An existing cluster.

Get started

  • Gather the following information for the cluster that you want to use:

    • The server hostname of the cluster. You can get this from the Server Hostname value in the Advanced Options > JDBC/ODBC tab for your cluster.

    • The HTTP path of the cluster. You can get this from the HTTP Path value in the Advanced Options > JDBC/ODBC tab for your cluster. The cluster ID is the string of characters following the final forward slash character in the HTTP Path value. For example, if the HTTP Path value is sql/protocolv1/o/1234567890123456/1234-567890-test123, the cluster ID is 1234-567890-test123. For more information, see Cluster URL and ID.

    • Your Databricks personal access token for the workspace for the cluster. Databricks personal access tokens typically start with dapi.

  • Install the Databricks SQL Connector for Python library on your development machine by running pip install databricks-sql-connector.

Query data

The following code example demonstrates how to call the Databricks SQL Connector for Python to run a basic SQL command on a cluster. This command returns the first two rows from the specified database table.

from databricks import sql

with sql.connect(server_hostname="<server-hostname>",
                 http_path="<http-path>",
                 access_token="<access-token>") as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM <database-name>.<table-name> LIMIT 2")
        result = cursor.fetchall()

        for row in result:
            print(row)

Replace:

  • <server-hostname> with the Server Hostname value.

  • <http-path> with the HTTP Path value.

  • <access-token> with your Databricks personal access token.

  • <database-name> with the name of the database to query.

  • <table-name> with the name of the table to query.

Insert data

The following example demonstrates how to insert small amounts of data (thousands of rows):

from databricks import sql

with sql.connect(server_hostname="...", http_path="...", access_token="...") as connection:
  with connection.cursor() as cursor:
    cursor.execute("CREATE TABLE IF NOT EXISTS squares (x int, x_squared int)")
    squares = [(i, i * i) for i in range(100)]
    values = ",".join([f"({x}, {y})" for (x, y) in squares])

    cursor.execute(f"INSERT INTO squares VALUES {values}")

For large amounts of data, you should first upload the data to cloud storage and then execute the COPY INTO command.
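As a sketch, a COPY INTO statement can be issued through the connector like any other SQL command; the table name, storage path, and format options below are placeholders, not values from your workspace:

```python
# Hedged sketch: load data that is already staged in cloud storage with
# COPY INTO. The table name and storage path are placeholders.

def copy_into_sql(table: str, source_path: str, file_format: str = "CSV") -> str:
    # COPY INTO skips files it has already loaded, so re-running is safe.
    return (
        f"COPY INTO {table} "
        f"FROM '{source_path}' "
        f"FILEFORMAT = {file_format} "
        "FORMAT_OPTIONS ('header' = 'true')"
    )

def load_staged_data(connection, table: str, source_path: str) -> None:
    # `connection` is a databricks.sql Connection, as in the examples above.
    with connection.cursor() as cursor:
        cursor.execute(copy_into_sql(table, source_path))
```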

Query metadata

There are dedicated methods for retrieving metadata. The following example retrieves metadata about columns in a sample table:

from databricks import sql

with sql.connect(server_hostname="...", http_path="...", access_token="...") as connection:
  with connection.cursor() as cursor:
    cursor.columns(schema_name="default", table_name="squares")
    print(cursor.fetchall())

Cursor and connection management

It is best practice to close connections and cursors when you have finished with them. This frees resources on Databricks clusters.

You can use a context manager (the with syntax used in previous examples) to manage the resources, or explicitly call close:

from databricks import sql

connection = sql.connect(server_hostname="...", http_path="...", access_token="...")
cursor = connection.cursor()

cursor.execute("SELECT * from range(10)")
print(cursor.fetchall())

cursor.close()
connection.close()

Configure logging

The Databricks SQL Connector uses Python's standard logging module. You can configure the logging level as follows:

import logging

from databricks import sql

logging.getLogger('databricks.sql').setLevel(logging.DEBUG)
sql.connect(...)
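A slightly fuller setup (an illustration, not the only option) routes the connector's debug output to a file with timestamps; the file name here is a placeholder:

```python
import logging

# Send the connector's debug output to a file with timestamps.
# "databricks.sql" is the logger name used in the example above.
# delay=True postpones creating the file until the first log record.
handler = logging.FileHandler("databricks_sql.log", delay=True)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
)

logger = logging.getLogger("databricks.sql")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
```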

API reference

Package

databricks-sql-connector

Usage: pip install databricks-sql-connector

See also databricks-sql-connector in the Python Package Index (PyPI).

Module

databricks.sql

Usage: from databricks import sql

Methods

connect method

Creates a connection to a database.

Returns a Connection object.

Parameters

server_hostname

Type: str

The server hostname for the cluster. To get the server hostname, see the instructions earlier in this article.

This parameter is required.

Example: 1234567890123456.7.gcp.databricks.com

http_path

Type: str

The HTTP path of the cluster. To get the HTTP path, see the instructions earlier in this article.

This parameter is required.

Example: sql/protocolv1/o/1234567890123456/1234-567890-test123

access_token

Type: str

Your Databricks personal access token for the workspace for the cluster. To create a token, see the instructions earlier in this article.

This parameter is required.

Example: dapi...<the-remaining-portion-of-your-token>

session_configuration

Type: dict[str, Any]

A dictionary of Spark session configuration parameters. Setting a configuration is equivalent to using the SET key=val SQL command. Run the SQL command SET -v to get a full list of available configurations.

Defaults to None.

This parameter is optional.

Example: {"spark.sql.variable.substitute": True}
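For example, the following sketch enables variable substitution for the session; the connection values are placeholders:

```python
# Sketch: pass session settings at connect time. SESSION_CONF applies the
# same settings as running SET key=val would.
SESSION_CONF = {"spark.sql.variable.substitute": True}

def open_session(server_hostname: str, http_path: str, access_token: str):
    # Import deferred so the settings above can be inspected without the
    # connector installed.
    from databricks import sql
    return sql.connect(
        server_hostname=server_hostname,
        http_path=http_path,
        access_token=access_token,
        session_configuration=SESSION_CONF,
    )
```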

http_headers

Type: List[Tuple[str, str]]

Additional (key, value) pairs to set in HTTP headers on every RPC request the client makes. Typical usage will not set any extra HTTP headers. Defaults to None.

This parameter is optional.

Since version 2.0

catalog

Type: str

Initial catalog to use for the connection. Defaults to None (in which case the default catalog, typically hive_metastore, is used).

This parameter is optional.

Since version 2.0

schema

Type: str

Initial schema to use for the connection. Defaults to None (in which case the default schema, default, is used).

This parameter is optional.

Since version 2.0

Classes

Connection class

Represents a connection to a database.

Methods
close method

Closes the connection to the database and releases all associated resources on the server. Any additional calls to this connection will throw an Error.

No parameters.

No return value.

cursor method

Returns a mechanism that enables traversal over the records in a database.

No parameters.

Returns a Cursor object.

Cursor class

Attributes
arraysize attribute

Used with the fetchmany method, specifies the internal buffer size, which is also how many rows are actually fetched from the server at a time. The default value is 10000. For narrow results (results in which each row does not contain a lot of data), you should increase this value for better performance.

Read-write access.

description attribute

Contains a Python list of tuple objects. Each of these tuple objects contains 7 values, with the first 2 items of each tuple object containing information describing a single result column as follows:

  • name: The name of the column.

  • type_code: A string representing the type of the column. For example, an integer column will have a type code of int.

The remaining 5 items of each 7-item tuple object are not implemented, and their values are not defined. They will typically be returned as 4 None values followed by a single True value.

Read-only access.
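For example, this small helper (hypothetical, not part of the connector) pulls out just the two defined fields from each entry:

```python
# Helper (not part of the connector): reduce cursor.description to the two
# defined fields of each 7-item tuple, name and type_code.

def column_summary(description):
    return [(col[0], col[1]) for col in description]

# A description shaped like the one the connector returns:
desc = [
    ("id", "int", None, None, None, None, True),
    ("name", "string", None, None, None, None, True),
]
print(column_summary(desc))  # [('id', 'int'), ('name', 'string')]
```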

Methods
cancel method

Interrupts the running of any database query or command that the cursor has started. To release the associated resources on the server, call the close method after calling the cancel method.

No parameters.

No return value.

close method

Closes the cursor and releases the associated resources on the server. Closing an already closed cursor might throw an error.

No parameters.

No return value.

execute method

Prepares and then runs a database query or command.

No return value.

Parameters

operation

Type: str

The query or command to prepare and then run.

This parameter is required.

Example without the parameters parameter:

cursor.execute(
 'SELECT * FROM default.diamonds WHERE cut="Ideal" LIMIT 2'
)

Example with the parameters parameter:

cursor.execute(
 'SELECT * FROM default.diamonds WHERE cut=%(cut_type)s LIMIT 2',
 { 'cut_type': 'Ideal' }
)

parameters

Type: dict

A dictionary of named parameters to substitute into the operation parameter.

This parameter is optional. The default is None.

executemany method

Prepares and then runs a database query or command using all parameter sequences in the seq_of_parameters argument. Only the final result set is retained.

No return value.

Parameters

operation

Type: str

The query or command to prepare and then run.

This parameter is required.

seq_of_parameters

Type: list of dict

A sequence of many sets of parameter values to use with the operation parameter.

This parameter is required.
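As a sketch, the earlier squares example can be rewritten with executemany and named parameter markers; the table name is a placeholder, and `connection` is a databricks.sql Connection as in the earlier examples:

```python
# Sketch: insert rows with executemany using named parameter markers.

def squares_params(n):
    # One parameter dict per row, matching the %(x)s / %(x_squared)s markers.
    return [{"x": i, "x_squared": i * i} for i in range(n)]

def insert_squares(connection, n=100):
    with connection.cursor() as cursor:
        cursor.execute("CREATE TABLE IF NOT EXISTS squares (x int, x_squared int)")
        cursor.executemany(
            "INSERT INTO squares VALUES (%(x)s, %(x_squared)s)",
            squares_params(n),
        )
```

Because only the final result set is retained, this pattern suits write-only statements such as INSERT.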

catalogs method

Executes a metadata query about the catalogs. Actual results should then be fetched using fetchmany or fetchall. Important fields in the result set include:

  • Field name: TABLE_CAT. Type: str. The name of the catalog.

No parameters.

No return value.

Since version 1.0

schemas method

Executes a metadata query about the schemas. Actual results should then be fetched using fetchmany or fetchall. Important fields in the result set include:

  • Field name: TABLE_SCHEM. Type: str. The name of the schema.

  • Field name: TABLE_CATALOG. Type: str. The catalog to which the schema belongs.

No return value.

Since version 1.0

Parameters

catalog_name

Type: str

A catalog name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

schema_name

Type: str

A schema name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

tables method

Executes a metadata query about tables and views. Actual results should then be fetched using fetchmany or fetchall. Important fields in the result set include:

  • Field name: TABLE_CAT. Type: str. The catalog to which the table belongs.

  • Field name: TABLE_SCHEM. Type: str. The schema to which the table belongs.

  • Field name: TABLE_NAME. Type: str. The name of the table.

  • Field name: TABLE_TYPE. Type: str. The kind of relation, for example VIEW or TABLE (applies to Databricks Runtime 10.2 and above; prior versions of the Databricks Runtime return an empty string).

No return value.

Since version 1.0

Parameters

catalog_name

Type: str

A catalog name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

schema_name

Type: str

A schema name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

table_name

Type: str

A table name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

table_types

Type: List[str]

A list of table types to match, for example TABLE or VIEW.

This parameter is optional.
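For illustration, a small helper (hypothetical, not part of the connector) can turn the result set into a plain list of table names; TABLE_NAME is the third field in each row, per the field list above:

```python
# Helper (hypothetical): list table names in a schema using cursor.tables.

def list_table_names(cursor, schema_name="default", table_name="%"):
    # tables() populates the cursor's result set; fetch to read it.
    cursor.tables(schema_name=schema_name, table_name=table_name)
    return [row[2] for row in cursor.fetchall()]
```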

columns method

Executes a metadata query about the columns. Actual results should then be fetched using fetchmany or fetchall. Important fields in the result set include:

  • Field name: TABLE_CAT. Type: str. The catalog to which the column belongs.

  • Field name: TABLE_SCHEM. Type: str. The schema to which the column belongs.

  • Field name: TABLE_NAME. Type: str. The name of the table to which the column belongs.

  • Field name: COLUMN_NAME. Type: str. The name of the column.

No return value.

Since version 1.0

Parameters

catalog_name

Type: str

A catalog name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

schema_name

Type: str

A schema name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

table_name

Type: str

A table name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

column_name

Type: str

A column name to retrieve information about. The % character is interpreted as a wildcard.

This parameter is optional.

fetchall method

Gets all (or all remaining) rows of a query.

No parameters.

Returns all (or all remaining) rows of the query as a Python list of Row objects.

Throws an Error if the previous call to the execute method did not return any data or no execute call has yet been made.

fetchmany method

Gets the next rows of a query.

Returns up to size (or the arraysize attribute if size is not specified) of the next rows of a query as a Python list of Row objects. If there are fewer than size rows left to be fetched, all remaining rows will be returned.

Throws an Error if the previous call to the execute method did not return any data or no execute call has yet been made.

Parameters

size

Type: int

The number of next rows to get.

This parameter is optional. If not specified, the value of the arraysize attribute is used.

Example: cursor.fetchmany(10)
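One common pattern (a sketch, not part of the connector API) streams a large result set in batches, so at most size rows are held in memory at a time:

```python
# Sketch: consume a large result set in fetchmany-sized batches.

def iter_batches(cursor, size=10_000):
    # fetchmany returns an empty list once the result set is exhausted.
    while True:
        batch = cursor.fetchmany(size)
        if not batch:
            return
        yield batch
```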

fetchone method

Gets the next row of the dataset.

No parameters.

Returns the next row of the dataset as a Python tuple object, or returns None if there is no more available data.

Throws an Error if the previous call to the execute method did not return any data or no execute call has yet been made.

fetchall_arrow method

Gets all (or all remaining) rows of a query, as a PyArrow Table object. Queries returning very large amounts of data should use fetchmany_arrow instead to reduce memory consumption.

No parameters.

Returns all (or all remaining) rows of the query as a PyArrow table.

Throws an Error if the previous call to the execute method did not return any data or no execute call has yet been made.

Since version 2.0
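PyArrow tables convert readily to pandas DataFrames; a sketch, assuming pyarrow and pandas are installed:

```python
# Sketch: fetch a whole result set as a PyArrow Table and hand it to pandas.
# Assumes pyarrow and pandas are available in the environment.

def result_to_pandas(cursor):
    table = cursor.fetchall_arrow()  # pyarrow.Table
    return table.to_pandas()         # pandas.DataFrame
```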

fetchmany_arrow method

Gets the next rows of a query as a PyArrow Table object.

Returns up to the size argument (or the arraysize attribute if size is not specified) of the next rows of a query as a PyArrow Table object.

Throws an Error if the previous call to the execute method did not return any data or no execute call has yet been made.

Since version 2.0

Parameters

size

Type: int

The number of next rows to get.

This parameter is optional. If not specified, the value of the arraysize attribute is used.

Example: cursor.fetchmany_arrow(10)

Row class

The Row class is a tuple-like data structure that represents an individual result row. If the row contains a column named "my_column", you can access that field as row.my_column. You can also use numeric indices to access fields, for example row[0]. If the column name is not a valid Python attribute name (for example, it begins with a digit), you can access the field as row["1_my_column"].

Since version 1.0

Methods
asDict method

Return a dictionary representation of the row, which is indexed by field names. If there are duplicate field names, one of the duplicate fields (but only one) will be returned in the dictionary. Which duplicate field is returned is not defined.

No parameters.

Returns a dict of fields.
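For illustration, a small helper (hypothetical) converts a list of result rows into plain dictionaries, for example before serializing to JSON:

```python
# Helper (hypothetical): convert result rows to plain dicts via asDict().

def rows_as_dicts(rows):
    return [row.asDict() for row in rows]
```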

Type conversions

The following list maps Apache Spark SQL data types to their Python data type equivalents:

  • array → str

  • bigint → int

  • binary → bytearray

  • boolean → bool

  • date → datetime.date

  • decimal → decimal.Decimal

  • double → float

  • int → int

  • map → str

  • null → NoneType

  • smallint → int

  • string → str

  • struct → str

  • timestamp → datetime.datetime

  • tinyint → int
Troubleshooting

tokenAuthWrapperInvalidAccessToken: Invalid access token message

Issue: When you run your code, you see a message similar to Error during request to server: tokenAuthWrapperInvalidAccessToken: Invalid access token.

Possible cause: The value passed to access_token is not a valid Databricks personal access token.

Recommended fix: Check that the value passed to access_token is correct and try again.

gaierror(8, 'nodename nor servname provided, or not known') message

Issue: When you run your code, you see a message similar to Error during request to server: gaierror(8, 'nodename nor servname provided, or not known').

Possible cause: The value passed to server_hostname is not the correct host name.

Recommended fix: Check that the value passed to server_hostname is correct and try again.

For more information on finding the server hostname, see Retrieve the connection details.

IpAclError message

Issue: When you run your code, you see the message Error during request to server: IpAclValidation when you try to use the connector on a Databricks notebook.

Possible cause: You may have IP allow listing enabled for the Databricks workspace. With IP allow listing, connections from Spark clusters back to the control plane are not allowed by default.

Recommended fix: Ask your administrator to add the data plane subnet to the IP allow list.

Additional resources

For more information, see: