Use benchmarks in a Genie space

This article explains how to use benchmarks to evaluate the accuracy of your Genie space.

Overview

Benchmarks allow you to create a set of test questions that you can run to assess Genie’s overall response accuracy. A well-designed set of benchmarks covering the most frequently asked user questions helps evaluate the accuracy of your Genie space as you refine it.

Example benchmarks with accuracy reported on nine questions.

Add benchmark questions

Benchmark questions should reflect different ways of phrasing the common questions your users ask. You can use them to check Genie’s response to variations in question phrasing or different question formats.

When creating a benchmark question, you can optionally include a SQL query whose result set is the correct answer. During benchmark runs, accuracy is assessed by comparing the result set from your SQL query to the one generated by Genie.

To add a benchmark question, perform the following steps:

  1. Click the Benchmarks icon in the left sidebar in a Genie space.

  2. Click the Questions tab. Then, click Add benchmark.

  3. In the Question field, enter a benchmark question to test.

  4. (Optional) Enter the SQL statement that accurately answers the question you entered.

    Note

    This step is recommended. Only questions that include this example SQL statement can be automatically assessed for accuracy. Any questions that do not include a SQL Answer require manual review to be scored.

  5. (Optional) Click Run to run your query and view the results.

  6. When you’re finished editing, click Add benchmark.

  7. To update a question after saving, click the pencil (Edit) icon to open the Update question dialog.

Use benchmarks to test alternate question phrasings

When evaluating the accuracy of your Genie space, it’s important to structure tests to reflect realistic scenarios. Users may ask the same question in different ways. To fully assess accuracy, Databricks recommends adding multiple phrasings of the same question to your benchmarks, each with the same example SQL. Most Genie spaces should include 2 to 4 phrasings of each question.
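As a planning aid, the recommendation above can be sketched in Python: several phrasings are paired with one shared SQL answer before you enter them through the Add benchmark dialog. This is only an illustration. The `Benchmark` class, the `expand_phrasings` helper, and the `sales` table and its columns are hypothetical names, not part of Genie.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """One benchmark entry: a question plus the SQL that answers it."""
    question: str
    sql_answer: str


def expand_phrasings(phrasings, sql_answer):
    """Return one Benchmark per phrasing, all sharing the same SQL answer."""
    return [Benchmark(question=p, sql_answer=sql_answer) for p in phrasings]


# Hypothetical table and column names, for illustration only.
SQL = "SELECT SUM(amount) AS total_sales FROM sales WHERE year(order_date) = 2024"

benchmarks = expand_phrasings(
    [
        "What were total sales in 2024?",
        "How much did we sell in 2024?",
        "Total revenue for 2024",
    ],
    SQL,
)

for b in benchmarks:
    print(b.question, "->", b.sql_answer)
```

Keeping the SQL in one place makes it easier to update every phrasing consistently if the underlying query changes.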

Run benchmark questions

Users with at least CAN EDIT permissions in a Genie space can start a benchmark run at any time. A run automatically evaluates every benchmark question: it first submits the question to Genie, then compares Genie’s results against the results from the benchmark’s SQL Answer. One of the following labels is applied to each benchmark:

  • Good: Responses are marked with this label when the Genie-generated query result matches the results from the provided SQL Answer. When a response is marked Good, it means that the row values match exactly, regardless of sort order or column names.

  • Needs review: Responses are marked with this label when Genie cannot assess correctness or when Genie-generated query results do not match the results from the provided SQL Answer. If there are unexpected changes to a table’s dimensions in the generated response or the provided SQL Answer, the question may be marked for review. Any benchmark questions that do not include a SQL Answer must be reviewed manually.

  • Bad: Responses are never automatically labeled as Bad. If Genie-generated query results do not match the result set from the provided SQL Answer, the question is marked as Needs review. When you review those benchmarks, you can mark a result as Bad if you don’t think Genie’s generated query results answer the question.
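The labeling rules above can be approximated with a short Python sketch. It is not Genie’s actual implementation. It assumes each result set is a list of row tuples with column names already discarded, that values are compared by position within a row, and that a missing SQL Answer is represented by `None`. The function name `label_response` is hypothetical.

```python
from collections import Counter


def label_response(genie_rows, answer_rows):
    """Return "Good" or "Needs review" for one benchmark question.

    Rows are compared as multisets of tuples, so sort order is ignored.
    Column names never enter the comparison because only values are passed in.
    "Bad" is never returned: per the rules above, only a human reviewer
    assigns that label.
    """
    # No SQL Answer (or no Genie result) means correctness cannot be assessed.
    if genie_rows is None or answer_rows is None:
        return "Needs review"
    if Counter(map(tuple, genie_rows)) == Counter(map(tuple, answer_rows)):
        return "Good"
    return "Needs review"


# Same rows in a different order: treated as a match.
print(label_response([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))
# Different values: flagged for manual review rather than marked Bad.
print(label_response([(1,)], [(2,)]))
# No SQL Answer provided: always needs manual review.
print(label_response([(1,)], None))
```

A multiset comparison is used instead of a plain set so that duplicate rows still have to match in count.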

To run all benchmark questions:

  1. Click the Benchmarks icon in the Genie space sidebar near the left side of the screen.

  2. Click Run benchmarks to start the test run.

Note

If you close this page, the benchmark run automatically pauses. You can resume the test when you reopen the page.

Access benchmark evaluations

You can access all of your benchmark evaluations to track accuracy in your Genie space over time. When you click the Benchmarks icon in the left sidebar in a Genie space, a timestamped list of evaluation runs appears in the Evaluations tab. If no evaluation runs are found, see Add benchmark questions or Run benchmark questions.

Evaluations screen as described in the text that follows.

The Evaluations tab shows an overview of evaluations and their performance reported in the following categories:

  • Evaluation name: A timestamp that indicates when an evaluation run occurred. Click the timestamp to see details for that evaluation.

  • Execution status: Indicates if the evaluation is completed, paused, or unsuccessful. If an evaluation run includes benchmark questions that do not have predefined SQL answers, it is marked for review in this column.

  • Accuracy: A numeric assessment of accuracy across all benchmark questions. For evaluation runs that require manual review, an accuracy measure appears only after those questions have been reviewed.

  • Created by: Indicates the name of the user who ran the evaluation.
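To make the Accuracy column concrete, here is a minimal Python sketch. It assumes accuracy is the share of questions labeled Good, which this article does not state explicitly, and it withholds the score while any question still needs review. The function name `run_accuracy` is hypothetical.

```python
def run_accuracy(labels):
    """Return the fraction of "Good" labels, or None if the run is not fully reviewed.

    Assumption: accuracy = Good / total. The exact formula Genie uses
    is not documented here.
    """
    if not labels or any(label == "Needs review" for label in labels):
        return None  # score appears only after manual review is complete
    good = sum(1 for label in labels if label == "Good")
    return good / len(labels)


# Fully reviewed run: three of four responses are Good.
print(run_accuracy(["Good", "Bad", "Good", "Good"]))
# One question is still awaiting review, so no score is reported yet.
print(run_accuracy(["Good", "Needs review"]))
```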

Review individual evaluations

You can review individual evaluations to get a detailed look at each response. You can edit the assessment for any question and update any items that need manual review.

To review individual evaluations:

  1. Click the Benchmarks icon in the Genie space sidebar near the left side of the screen.

  2. Click the timestamp for any evaluation in the Evaluation name column to open a detailed view of that test run.

    A screen that shows the results of a single evaluation run. All questions are listed on the left. If applicable, individual questions are shown on the right with the model output and the ground truth output.

  3. Click a question near the left side of the screen to see the associated details. Use the evaluation detail screen to perform the next steps.

  4. Review and compare the Model output response with the Ground truth response.

    Note

    The results of these responses appear in the evaluation details for one week. After one week, the results are no longer visible. The generated SQL statement and the example SQL statement remain.

  5. Click the Edit icon on the label to edit the assessment.

    Mark each result as Good or Bad to get an accurate score for this evaluation.