Connect Python and pyodbc to Databricks
You can connect from your local Python code through ODBC to data in a Databricks cluster or SQL warehouse. To do this, you can use the open source Python code module pyodbc.
Follow these instructions to install, configure, and use pyodbc.
For more information about pyodbc, see the pyodbc Wiki.
Note
Databricks offers the Databricks SQL Connector for Python as an alternative to pyodbc. The Databricks SQL Connector for Python is easier to set up and use, and has a more robust set of coding constructs, than pyodbc. However, pyodbc may have better performance when fetching query results above 10 MB.
These instructions were tested with Databricks ODBC driver 2.7.5, pyodbc 5.0.1, and unixODBC 2.3.12.
Requirements
A local development machine running one of the following:
macOS
Windows
A Unix or Linux distribution that supports .rpm or .deb files
pip.
For Unix, Linux, or macOS, Homebrew.
A Databricks cluster, a Databricks SQL warehouse, or both. For more information, see Compute configuration reference and Connect to a SQL warehouse.
Step 1: Download, install, and configure software
In this step, you download and install the Databricks ODBC driver, the unixodbc package, and the pyodbc module. (The pyodbc module requires the unixodbc package on Unix, Linux, and macOS.) You also configure an ODBC Data Source Name (DSN) to authenticate with and connect to your cluster or SQL warehouse.
Download and install the Databricks ODBC driver and configure an ODBC DSN for your operating system.
For Unix, Linux, and macOS, install the unixodbc package: from the terminal, use Homebrew to run the command brew install unixodbc. For more information, see unixodbc on the Homebrew website.
Install the pyodbc module: from the terminal or command prompt, use pip to run the command pip install pyodbc. For more information, see pyodbc on the PyPI website and Install in the pyodbc Wiki.
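Before moving on, it can help to confirm that pyodbc imports cleanly and that the ODBC driver manager can see your DSN. The following is a minimal sketch, not part of the official instructions: describe_dsn is a hypothetical helper name, and <dsn-name> is a placeholder for the DSN that you configured. It relies on pyodbc.drivers() and pyodbc.dataSources(), which list the installed ODBC drivers and the configured DSNs.

```python
def describe_dsn(dsn_name, data_sources):
    """Return a short status message for a DSN name.

    data_sources is a mapping of DSN name to driver description,
    in the shape that pyodbc.dataSources() returns.
    """
    if dsn_name in data_sources:
        return f"DSN '{dsn_name}' found (driver: {data_sources[dsn_name]})"
    known = ", ".join(sorted(data_sources)) or "none"
    return f"DSN '{dsn_name}' not found. Visible DSNs: {known}"


if __name__ == "__main__":
    try:
        import pyodbc
    except ImportError as exc:
        # On Unix, Linux, and macOS this can also mean unixodbc is missing.
        print(f"pyodbc could not be imported: {exc}")
    else:
        print("Installed ODBC drivers:", pyodbc.drivers())
        # "<dsn-name>" is a placeholder; replace it with your DSN's name.
        print(describe_dsn("<dsn-name>", pyodbc.dataSources()))
```

If the DSN is reported as not found, the DSN may have been defined in a location that the driver manager does not read, so recheck the DSN configuration from the previous step.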
Step 2: Test your configuration
In this step, you write and run Python code to use your Databricks cluster or Databricks SQL warehouse to query the trips table in the samples catalog’s nyctaxi schema and display the results.
Create a file named pyodbc-demo.py with the following content. Replace <dsn-name> with the name of the ODBC DSN that you created earlier, save the file, and then run the file with your Python interpreter.
import pyodbc
# Connect to the Databricks cluster by using the
# Data Source Name (DSN) that you created earlier.
conn = pyodbc.connect("DSN=<dsn-name>", autocommit=True)
# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute("SELECT * FROM samples.nyctaxi.trips")
# Print the rows retrieved from the query.
for row in cursor.fetchall():
    print(row)
To speed up running the code, start the cluster that corresponds to the HTTPPath setting in your DSN.
Run the pyodbc-demo.py file with your Python interpreter. Information about the table’s rows is displayed.
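The test script calls fetchall(), which loads every row into memory before printing. For larger result sets, one possible variation is to read rows in batches with the cursor's fetchmany() method. The sketch below is illustrative only: iter_rows, print_trips, batch_size, and limit are hypothetical names, and the LIMIT clause is added here just to keep the test output short.

```python
def iter_rows(cursor, batch_size=1000):
    """Yield rows from an executed cursor, fetching batch_size rows at a time."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield from batch


def print_trips(dsn_name, limit=10):
    """Connect through the named DSN and print up to `limit` trip rows."""
    import pyodbc  # imported lazily so iter_rows can be used without it

    conn = pyodbc.connect(f"DSN={dsn_name}", autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute(f"SELECT * FROM samples.nyctaxi.trips LIMIT {int(limit)}")
        for row in iter_rows(cursor):
            print(row)
    finally:
        # Release the connection even if the query fails.
        conn.close()
```

To try it, call print_trips with the name of your DSN. The batch size of 1000 is an arbitrary starting point, not a tuned value.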
Next steps
To run the Python test code against a different cluster or SQL warehouse, create a different DSN and change <dsn-name> to the DSN’s name.
To run the Python test code with a different SQL query, change the execute command string.
Using a DSN-less connection
As an alternative to using a DSN name, you can specify the connection settings inline. The following example shows how to use a DSN-less connection string for Databricks personal access token authentication. This example assumes that you have the following environment variables:
Set DATABRICKS_SERVER_HOSTNAME to the workspace instance name, for example 1234567890123456.7.gcp.databricks.com.
Set DATABRICKS_HTTP_PATH to the HTTP Path value for the target cluster or SQL warehouse in the workspace. To get the HTTP Path value, see Get connection details for a Databricks compute resource.
Set DATABRICKS_TOKEN to the Databricks personal access token for the target user. To create a personal access token, see Databricks personal access tokens for workspace users.
To set environment variables, see your operating system’s documentation.
import pyodbc
import os
# The Driver path below is the macOS location of the driver library;
# adjust it for your operating system.
conn = pyodbc.connect(
    "Driver=/Library/simba/spark/lib/libsparkodbc_sb64-universal.dylib;" +
    f"Host={os.getenv('DATABRICKS_SERVER_HOSTNAME')};" +
    "Port=443;" +
    f"HTTPPath={os.getenv('DATABRICKS_HTTP_PATH')};" +
    "SSL=1;" +
    "ThriftTransport=2;" +
    "AuthMech=3;" +
    "UID=token;" +
    f"PWD={os.getenv('DATABRICKS_TOKEN')}",
    autocommit = True
)
# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute("SELECT * FROM samples.nyctaxi.trips")
# Print the rows retrieved from the query.
for row in cursor.fetchall():
    print(row)
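When an environment variable is unset, os.getenv returns None, and the connection string silently ends up containing the text "None". One way to catch this earlier is to build the string in a helper that checks the three environment variables listed above before connecting. This is a sketch under stated assumptions: build_connection_string, REQUIRED_VARS, and the driver_path parameter are hypothetical names, and the key-value settings simply mirror the example above.

```python
import os

REQUIRED_VARS = (
    "DATABRICKS_SERVER_HOSTNAME",
    "DATABRICKS_HTTP_PATH",
    "DATABRICKS_TOKEN",
)


def build_connection_string(driver_path, env=None):
    """Build a DSN-less connection string, failing early on missing settings."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    settings = {
        "Driver": driver_path,
        "Host": env["DATABRICKS_SERVER_HOSTNAME"],
        "Port": "443",
        "HTTPPath": env["DATABRICKS_HTTP_PATH"],
        "SSL": "1",
        "ThriftTransport": "2",
        "AuthMech": "3",
        "UID": "token",
        "PWD": env["DATABRICKS_TOKEN"],
    }
    # ODBC connection strings are semicolon-separated key=value pairs.
    return ";".join(f"{key}={value}" for key, value in settings.items())
```

The result can be passed to pyodbc.connect in place of the concatenated string, for example pyodbc.connect(build_connection_string(driver_path), autocommit=True), where driver_path is the location of the driver library on your machine.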
Troubleshooting
This section addresses common issues when using pyodbc with Databricks.
Unicode decode error
Issue: You receive an error message similar to the following:
<class 'pyodbc.Error'> returned a result with an error set
Traceback (most recent call last):
  File "/Users/user/.pyenv/versions/3.7.5/lib/python3.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 2112-2113: illegal UTF-16 surrogate
Cause: An issue exists in pyodbc version 4.0.31 or below that could manifest with such symptoms when running queries that return columns with long names or a long error message. The issue has been fixed in a newer version of pyodbc.
Solution: Upgrade your installation of pyodbc to version 4.0.32 or above.
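To check whether an environment is affected, you can compare the installed pyodbc version against 4.0.32. The following is a rough sketch that assumes Python 3.8 or above (for importlib.metadata). The names parse_version, is_affected, and installed_pyodbc_version are hypothetical, and the parser only reads the leading numeric components of the version string.

```python
import re
from importlib import metadata

# First pyodbc release that contains the fix, per the Solution above.
FIXED_IN = (4, 0, 32)


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"Unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_affected(version_text):
    """Return True if the version predates the release that contains the fix."""
    return parse_version(version_text) < FIXED_IN


def installed_pyodbc_version():
    """Return the installed pyodbc version, or None if it is not installed."""
    try:
        return metadata.version("pyodbc")
    except metadata.PackageNotFoundError:
        return None
```

If is_affected(installed_pyodbc_version()) returns True, upgrading with pip install --upgrade pyodbc should bring in a fixed release.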
General troubleshooting
See Issues in the mkleehammer/pyodbc repository on GitHub.