Run tests with pytest for the Databricks extension for Visual Studio Code

This article describes how to run tests using pytest with the Databricks extension for Visual Studio Code. See "What is the Databricks extension for Visual Studio Code?".

This information assumes that you have already installed and set up the Databricks extension for Visual Studio Code. See "Install the Databricks extension for Visual Studio Code".

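You can run pytest on local code that does not need a connection to a cluster in a remote Databricks workspace. For example, you might use pytest to test functions that accept and return PySpark DataFrames in local memory. To get started with pytest and run it locally, see "Get Started" in the pytest documentation.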

To run pytest on code in a remote Databricks workspace, do the following in your Visual Studio Code project:

Step 1: Create the tests

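Add a Python file with the following code, which contains the tests to run. This example assumes that the file is named spark_test.py and is at the root of your Visual Studio Code project. The file contains a pytest fixture, which makes the cluster's SparkSession (the entry point to Spark functionality on the cluster) available to the tests. It also contains a single test that checks whether a specified cell in a table contains a specified value. You can add your own tests to this file as needed.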

from pyspark.sql import SparkSession
import pytest

@pytest.fixture
def spark() -> SparkSession:
  # Create a SparkSession (the entry point to Spark functionality) on
  # the cluster in the remote Databricks workspace. Unit tests do not
  # have access to this SparkSession by default.
  return SparkSession.builder.getOrCreate()

# Now add your unit tests.

# For example, here is a unit test that must be run on the
# cluster in the remote Databricks workspace.
# This example determines whether the specified cell in the
# specified table contains the specified value. For example,
# the third column in the first row should contain the word "Ideal":
#
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# |_c0 | carat | cut   | color | clarity | depth | table | price | x    | y     | z    |
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# | 1  | 0.23  | Ideal | E     | SI2     | 61.5  | 55    | 326   | 3.95 | 3.98  | 2.43 |
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# ...
#
def test_spark(spark):
  spark.sql('USE default')
  data = spark.sql('SELECT * FROM diamonds')
  assert data.collect()[0][2] == 'Ideal'

Step 2: Create the pytest runner

Add a Python file with the following code, which instructs pytest to run the tests from the previous step. This example assumes that the file is named pytest_databricks.py and is at the root of your Visual Studio Code project.

import pytest
import os
import sys

# Run all tests in the connected directory in the remote Databricks workspace.
# By default, pytest collects tests from files named "test_*.py" or
# "*_test.py". Within each of these files, pytest runs each function
# with a function name beginning with "test_".

# Get the path to the directory for this file in the workspace.
dir_root = os.path.dirname(os.path.realpath(__file__))
# Switch to the root directory.
os.chdir(dir_root)

# Skip writing .pyc files to the bytecode cache on the cluster.
sys.dont_write_bytecode = True

# Now run pytest from the root directory, using the
# arguments that are supplied by your custom run configuration in
# your Visual Studio Code project. In this case, the custom run
# configuration JSON must contain these unique "program" and
# "args" objects:
#
# ...
# {
#   ...
#   "program": "${workspaceFolder}/path/to/this/file/in/workspace",
#   "args": ["/path/to/_test.py-files"]
# }
# ...
#
retcode = pytest.main(sys.argv[1:])
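
The runner relies on pytest's default discovery rules, so it can help to preview which files those rules would match before you sync the project. The following is a minimal sketch that only approximates pytest's default filename patterns with the standard library. It does not call pytest, and it ignores any python_files override in a pytest configuration file as well as directories that pytest skips. The name find_candidate_test_files is a hypothetical helper, not part of pytest or the Databricks extension.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# pytest's default filename patterns for test modules.
DEFAULT_PATTERNS = ("test_*.py", "*_test.py")


def find_candidate_test_files(root, patterns=DEFAULT_PATTERNS):
    """Return paths (relative to root) of .py files matching the patterns.

    Hypothetical helper: an approximation of pytest's default file
    discovery, not pytest's actual collection logic.
    """
    root = Path(root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*.py")
        if any(fnmatchcase(path.name, pattern) for pattern in patterns)
    )
```

For the example project, calling find_candidate_test_files(".") from the project root would be expected to include spark_test.py but not pytest_databricks.py, because the runner's filename begins with pytest_ rather than test_.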

Step 3: Create a custom run configuration

To instruct pytest to run your tests, you must create a custom run configuration. Use the existing Databricks cluster-based run configuration to create your own custom run configuration, as follows:

On the main menu, click [Run] > [Add Configuration].
In the Command Palette, select [Databricks].

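Visual Studio Code adds a .vscode/launch.json file to your project, if this file does not already exist.
Change the starter run configuration as follows, and then save the file:
- Change this run configuration's name from Run on Databricks to a unique display name for this configuration, in this example Unit Tests (on Databricks).
- Change program from ${file} to the path in the project that contains the test runner, in this example ${workspaceFolder}/pytest_databricks.py.
- Change args from [] to the path in the project that contains the files with your tests, in this example ["."].
Your launch.json file should look like this: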
```
{
  // Use IntelliSense to learn about possible attributes.
  // Hover to view descriptions of existing attributes.
  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
  "version": "0.2.0",
  "configurations": [
    {
      "type": "databricks",
      "request": "launch",
      "name": "Unit Tests (on Databricks)",
      "program": "${workspaceFolder}/pytest_databricks.py",
      "args": ["."],
      "env": {}
    }
  ]
}
```
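
Because launch.json allows comments, a plain JSON parser rejects the file as written. As a quick sanity check of the edited configuration, the following sketch drops whole-line // comments and then confirms that the type, program, and args values match this example. It assumes that comments appear only on their own lines, as in the file above. The names load_launch_config, find_configuration, and check_unit_test_config are hypothetical helpers, not part of Visual Studio Code or the Databricks extension.

```python
import json


def load_launch_config(text):
    """Parse launch.json text after dropping whole-line // comments.

    Assumption: comments occupy entire lines. Trailing comments and
    /* ... */ blocks are not handled by this sketch.
    """
    kept = [
        line for line in text.splitlines()
        if not line.lstrip().startswith("//")
    ]
    return json.loads("\n".join(kept))


def find_configuration(launch, name):
    """Return the run configuration with the given name, or None."""
    for config in launch.get("configurations", []):
        if config.get("name") == name:
            return config
    return None


def check_unit_test_config(text, name="Unit Tests (on Databricks)"):
    """Return a list of problems found; an empty list means none were found."""
    config = find_configuration(load_launch_config(text), name)
    if config is None:
        return [f"configuration {name!r} not found"]
    problems = []
    if config.get("type") != "databricks":
        problems.append("type should be 'databricks'")
    if not str(config.get("program", "")).endswith("pytest_databricks.py"):
        problems.append("program should point to pytest_databricks.py")
    if not config.get("args"):
        problems.append("args should list at least one test path")
    return problems
```

To use it, read the contents of .vscode/launch.json into a string and pass that string to check_unit_test_config. The check is deliberately narrow: it looks only at the three values that this example changes.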

Step 4: Run the tests

First, make sure that pytest is already installed on the cluster. For example, with the cluster's settings page open in your Databricks workspace, do the following:

On the [Libraries] tab, if pytest is listed, then pytest is already installed. If pytest is not listed, click [Install new].
For [Library Source], click [PyPI].
For [Package], enter pytest.
Click [Install].
Wait until [Status] changes from [Pending] to [Installed].
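
If you prefer to confirm the installation from code, a check along the following lines could be run in a notebook attached to the same cluster. This is a sketch that uses only the Python standard library (importlib.metadata, available in Python 3.8 and later). The name installed_version is a hypothetical helper, and the result reflects only the Python environment in which the code runs.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report whether pytest is visible to the current Python environment.
version = installed_version("pytest")
if version is None:
    print("pytest is not installed in this environment")
else:
    print(f"pytest {version} is installed")
```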

To run the tests, do the following from your Visual Studio Code project:

On the main menu, click [View] > [Run].
In the [Run and Debug] list, click [Unit Tests (on Databricks)], if it is not already selected.
Click the green arrow (Start Debugging) icon.

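The pytest results appear in the Debug Console ([View] > [Debug Console] on the main menu). For example, the following results show that at least one test was found in the spark_test.py file, and the dot (.) means that a single test was found and passed. (A failing test would show an F.)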

<date>, <time> - Creating execution context on cluster <cluster-id> ...
<date>, <time> - Synchronizing code to /Workspace/path/to/directory ...
<date>, <time> - Running /pytest_databricks.py ...
============================= test session starts ==============================
platform linux -- Python <version>, pytest-<version>, pluggy-<version>
rootdir: /Workspace/path/to/directory
collected 1 item

spark_test.py .                                                          [100%]

============================== 1 passed in 3.25s ===============================
<date>, <time> - Done (took 10818ms)
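
If you want to act on the outcome programmatically, for example after copying the Debug Console output into a file, the final summary line can be parsed. The following sketch assumes the summary format shown above: outcome counts followed by "in <seconds>s", between rows of = characters. The name parse_summary is a hypothetical helper, not a pytest API.

```python
import re

# Matches lines such as "===== 1 failed, 2 passed in 0.12s =====".
SUMMARY_RE = re.compile(
    r"^=+ (?P<body>.+) in (?P<seconds>\d+(?:\.\d+)?)s\b.* =+$"
)
# Matches each "<count> <outcome>" pair inside the summary body.
COUNT_RE = re.compile(r"(\d+) (\w+)")


def parse_summary(line):
    """Return (counts, seconds) for a pytest summary line, else None.

    `counts` maps an outcome label such as "passed" or "failed" to the
    number of tests reported with that outcome.
    """
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    counts = {
        label: int(number)
        for number, label in COUNT_RE.findall(match.group("body"))
    }
    return counts, float(match.group("seconds"))
```

A caller could treat a result whose counts include "failed" or "error" as an unsuccessful run. Lines that are not summary lines, such as the per-file progress line, return None.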