Use a Python wheel in a Databricks job

A Python wheel is a standard way to package and distribute the files required to run a Python application. Using the Python wheel task, you can ensure fast and reliable installation of Python code in your Databricks jobs. This article provides an example of creating a Python wheel and a job that runs the application packaged in the Python wheel. In this example, you will:

  • Create the Python files defining an example application.

  • Bundle the example files into a Python wheel.

  • Create a job to run the Python wheel.

  • Run the job and view the results.

Before you begin

You need the following to complete this example:

  • Python3

  • The Python wheel and setuptools packages. You can install them with pip, for example:

    pip install wheel setuptools
    

Step 1: Create a local directory for the example

Create a local directory to hold the example code and generated artifacts, for example, databricks_wheel_test.
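If you prefer to script this step, the following is a minimal sketch that creates the example directory together with the my_test_code package directory used in later steps. The function name create_example_layout is hypothetical and not part of the original example.

```python
from pathlib import Path


def create_example_layout(root: Path) -> Path:
    """Create <root>/my_test_code and return the package directory."""
    package_dir = root / "my_test_code"
    # parents=True also creates <root>; exist_ok=True makes reruns harmless.
    package_dir.mkdir(parents=True, exist_ok=True)
    return package_dir
```

For example, create_example_layout(Path("databricks_wheel_test")) creates both directories relative to the current working directory.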

Step 2: Create the example Python script

The following Python example is a simple script that reads input arguments and prints out those arguments. Copy this script and save it to a path called my_test_code/__main__.py in the directory you created in the previous step.

"""
The entry point of the Python Wheel
"""

import sys

def main():
  # Print a greeting followed by the command-line arguments
  print('Hello from my func')
  print('Got arguments:')
  print(sys.argv)

if __name__ == '__main__':
  main()
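Before packaging, it can be useful to check locally what the script prints for a given set of arguments. The following sketch repeats the main function from the script above and adds a hypothetical helper, run_with_args, that temporarily replaces sys.argv and captures the printed output. The helper and the program name "my_test_code" used for sys.argv[0] are assumptions made for illustration.

```python
import contextlib
import io
import sys
from unittest import mock


def main():
    # Same body as the example script above.
    print('Hello from my func')
    print('Got arguments:')
    print(sys.argv)


def run_with_args(args):
    """Call main() with a temporary sys.argv and return what it printed."""
    buffer = io.StringIO()
    # sys.argv[0] is conventionally the program name; the rest are arguments.
    with mock.patch.object(sys, "argv", ["my_test_code", *args]):
        with contextlib.redirect_stdout(buffer):
            main()
    return buffer.getvalue()
```

Calling run_with_args(["first", "second"]) returns the three printed lines, with the last line showing the full argument list, including the program name.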

Step 3: Create a metadata file for the package

The following file contains metadata describing the package. Save this to a path called my_test_code/__init__.py in the directory you created in step 1.

__version__ = "0.0.1"
__author__ = "Databricks"

Step 4: Create the Python wheel

Converting the Python artifacts into a Python wheel requires specifying package metadata such as the package name and entry points. The following script defines this metadata.

Note

The entry_points defined in this script are used to run the package in the Databricks workflow. In each entry in entry_points, the text before = (in this example, run) is the name of the entry point. You use this name when you configure the Python wheel task.

  1. Save this script in a file named setup.py in the root of the directory you created in step 1:

    from setuptools import setup, find_packages
    
    import my_test_code
    
    setup(
      name='my_test_package',
      version=my_test_code.__version__,
      author=my_test_code.__author__,
      url='https://databricks.com',
      author_email='john.doe@databricks.com',
      description='my test wheel',
      packages=find_packages(include=['my_test_code']),
      entry_points={
        'group_1': 'run=my_test_code.__main__:main'
      },
      install_requires=[
        'setuptools'
      ]
    )
    
  2. Change into the directory you created in step 1, and run the following command to package your code into the Python wheel distribution:

    python3 setup.py bdist_wheel
    

This command creates the Python wheel and saves it to the dist/my_test_package-0.0.1-py3-none-any.whl file in your directory.
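To confirm that the entry point was recorded in the wheel, you can inspect the built file. A wheel is a ZIP archive, and setuptools writes entry points to an INI-style entry_points.txt file inside the package's .dist-info directory. The following is a minimal sketch of reading that file; the function name read_wheel_entry_points is hypothetical and not part of the original example.

```python
import configparser
import zipfile


def read_wheel_entry_points(wheel_path):
    """Return {group: {name: target}} read from a wheel's entry_points.txt."""
    with zipfile.ZipFile(wheel_path) as wheel:
        # Metadata lives in a "<name>-<version>.dist-info/" directory.
        candidates = [
            name for name in wheel.namelist()
            if name.endswith(".dist-info/entry_points.txt")
        ]
        if not candidates:
            return {}
        text = wheel.read(candidates[0]).decode("utf-8")

    # Only "=" separates names from targets; targets themselves contain ":".
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep entry point names case-sensitive
    parser.read_string(text)
    return {group: dict(parser[group]) for group in parser.sections()}
```

For the wheel built in this step, the result should contain the group_1 group with a run entry that points to my_test_code.__main__:main.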

Step 5: Create a Databricks job to run the Python wheel

  1. Go to your Databricks landing page and do one of the following:

    • In the sidebar, click Workflows and click Create Job.

    • In the sidebar, click New and select Job from the menu.

  2. In the task dialog box that appears on the Tasks tab, replace Add a name for your job… with your job name, for example, Python wheel example.

  3. In Task name, enter a name for the task, for example, python_wheel_task.

  4. In Type, select Python wheel.

  5. In Package name, enter my_test_package. The package name is the value assigned to the name variable in the setup.py script.

  6. In Entry point, enter run. The entry point is one of the values specified in the entry_points collection in the setup.py script. In this example, run is the only entry point defined.

  7. In Cluster, select a compatible cluster. See Compute compatibility with libraries and init scripts.

  8. Click Add under Dependent Libraries. In the Add dependent library dialog, with Workspace selected, drag the my_test_package-0.0.1-py3-none-any.whl file created in step 4 into the dialog’s Drop file here area.

  9. Click Add.

  10. In Parameters, select Positional arguments or Keyword arguments to enter the key and the value of each parameter. Both positional and keyword arguments are passed to the Python wheel task as command-line arguments.

    • To enter positional arguments, enter parameters as a JSON-formatted array of strings, for example: ["first argument","first value","second argument","second value"].

    • To enter keyword arguments, click + Add and enter a key and value. Click + Add again to enter more arguments.

  11. Click Save task.
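The same task configuration can also be expressed as job settings for the Jobs API. The following sketch builds a request body under the assumption that it follows the Jobs API 2.1 job-creation schema (python_wheel_task with package_name, entry_point, and parameters, plus a whl library entry). The wheel path and cluster ID are placeholders, and build_job_settings is a hypothetical helper. Verify the field names against the Jobs API reference before relying on them.

```python
import json


def build_job_settings(wheel_path, cluster_id, parameters):
    """Return a dict shaped like a Jobs API 2.1 create-job request body."""
    return {
        "name": "Python wheel example",
        "tasks": [
            {
                "task_key": "python_wheel_task",
                "existing_cluster_id": cluster_id,
                "python_wheel_task": {
                    # Values taken from setup.py in step 4.
                    "package_name": "my_test_package",
                    "entry_point": "run",
                    # Positional arguments, passed as a list of strings.
                    "parameters": list(parameters),
                },
                "libraries": [{"whl": wheel_path}],
            }
        ],
    }


# Placeholder values: replace with your own wheel location and cluster ID.
settings = build_job_settings(
    wheel_path="/Workspace/path/to/my_test_package-0.0.1-py3-none-any.whl",
    cluster_id="<your-cluster-id>",
    parameters=["first argument", "first value"],
)
print(json.dumps(settings, indent=2))
```

This sketch only builds and prints the settings; it does not call any Databricks endpoint.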

Step 6: Run the job and view the job run details

Click Run Now to run the workflow. To view details for the run, click View run in the Triggered run pop-up or click the link in the Start time column for the run in the job runs view.

When the run completes, the output displays in the Output panel, including the arguments passed to the task.
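Because the parameters reach the entry point as command-line arguments, a real application would typically parse them rather than print sys.argv. The following is a minimal sketch using argparse. It assumes that keyword arguments arrive in a --key=value or --key value form, both of which argparse accepts. The --first-argument option and the parse_task_args function are hypothetical names chosen for illustration.

```python
import argparse


def parse_task_args(argv):
    """Parse task parameters from a list such as sys.argv[1:]."""
    parser = argparse.ArgumentParser(prog="my_test_code")
    # Hypothetical keyword argument; not part of the original example.
    parser.add_argument("--first-argument", default=None)
    # Any remaining positional arguments are collected into a list.
    parser.add_argument("values", nargs="*")
    return parser.parse_args(argv)
```

Inside main, you could call parse_task_args(sys.argv[1:]) and then read args.first_argument and args.values instead of printing the raw argument list.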

Next steps

To learn more about creating and running Databricks jobs, see Create and run Databricks Jobs.