ai_forecast function

Applies to: check marked yes Databricks SQL

Important

This functionality is in Public Preview. Reach out to your Databricks account team to participate in the preview.

ai_forecast() is a table-valued function designed to extrapolate time series data into the future. See Arguments for available arguments to configure this function.

Requirements

Syntax

ai_forecast(
  observed TABLE,
  horizon DATE | TIMESTAMP | STRING,
  time_col STRING,
  value_col STRING | ARRAY<STRING>,
  group_col STRING | ARRAY<STRING> | NULL DEFAULT NULL,
  prediction_interval_width DOUBLE DEFAULT 0.95,
  frequency STRING DEFAULT 'auto',
  seed INTEGER | NULL DEFAULT NULL,
  parameters STRING DEFAULT '{}'
)

Arguments

ai_forecast() can forecast any number of groups (see group_col) and up to 100 metrics (see value_col) within each group. The forecast frequency is the same for all metrics in a group but can be different across different groups (see frequency).

The following are available arguments for this function:

  • observed is the table-valued input that is used as training data for the forecasting procedure.

    • This input relation must contain one “time” column and one or more “value” columns. “Group” and “parameters” columns are optional. Any additional columns in the input relation are ignored.

  • horizon is a timestamp-castable quantity representing the right-exclusive end time of the forecasting results. Within a group (see group_col) forecast results span the time between the last observation and horizon. If horizon is less than the last observation time, then no results are generated.

  • time_col is a string referencing the “time column” in observed. The column referenced by time_col should be a DATE or a TIMESTAMP.

  • value_col is a string or an array of strings referencing value columns in observed. The columns referenced by this argument should be castable to DOUBLE.

  • group_col (optional) is a string or an array of strings representing the group columns in observed. If specified, group columns are used as partitioning criteria, and forecasts are generated for each group independently. If unspecified, the full input data is treated as a single group.

  • prediction_interval_width (optional) is a value between 0 and 1 representing the width of the prediction interval. Future values have a prediction_interval_width % probability of falling between {v}_upper and {v}_lower.

  • frequency (optional) is a time unit or pandas offset alias string specifying the time granularity of the forecast results. If unspecified, the forecast granularity is automatically inferred for each group independently. If a frequency value is specified, it is applied equally to all groups.

    • The inferred frequency within a group is the mode of the most recent observations. This is a convenience operation that is not tunable by the user.

    • As an example, a time series with 99 “mondays” and 1 “tuesday” results in the “week” being the inferred frequency.

  • seed (optional) is a number used to initialize any pseudorandom number generators used in the forecasting procedure.

  • parameters (optional) is a string-encoded JSON or the name of a column identifier that represents the parameterization of the forecasting procedure. Any combination of parameters can be specified in any order, for example, {“weekly_order”: 10, “global_cap”: 1000}. Any unspecified parameters are automatically determined based on the attributes of the training data. The following parameters are supported:

    • global_cap and global_floor can be used together or independently to define the possible domain of the metric values. {“global_floor”: 0}, for example, can be used to constrain a metric like cost to always be positive. These apply globally to the training data and the forecasted data, and can not be used to provide tight constraints on the forecasted values only.

    • daily_order and weekly_order set the fourier order of the daily and weekly seasonality components.

Returns

A new set of rows containing the forecasted data. The output schema will contain the time and group columns with their types unchanged. For example, if the input time column has type DATE, then the output time column type will also be DATE. For each value column there are three output columns with the pattern {v}_forecast, {v}_upper, and {v}_lower. Regardless of the input value types, the forecasted value columns are always type DOUBLE. The output table contains future values only, spanning the range of time between the end of the observed data until horizon.

See some examples of the schema inference performed by AI_FORECAST below:

Input table

Arguments

Output table

ts: TIMESTAMP val: DOUBLE

time_col => 'ts' value_col => 'val'

ts: TIMESTAMP val_forecast: DOUBLE val_upper: DOUBLE val_lower: DOUBLE

ds: DATE val BIGINT

time_col => 'ds' value_col => 'val'

ds: DATE val_forecast: DOUBLE val_upper: DOUBLE val_lower: DOUBLE

ts: TIMESTAMP dim1: STRING dollars: DECIMAL(10, 2)

time_col => 'ts' value_col => 'dollars' group_col => 'dim1'

ts: TIMESTAMP dim1: STRING dollars_forecast: DOUBLE dollars_upper: DOUBLE dollars_lower: DOUBLE

ts: TIMESTAMP dim1: STRING dim2: BIGINT dollars: DECIMAL(10, 2) users: BIGINT

time_col => 'ts' value_col => ARRAY('dollars', 'users') group_col => ARRAY('dim1', 'dim2')

ts: TIMESTAMP dim1: STRING dim2: BIGINT dollars_forecast: DOUBLE dollars_upper: DOUBLE dollars_lower: DOUBLE users_forecast: DOUBLE users_upper: DOUBLE users_lower: DOUBLE

Examples

The following example forecasts until a specified date:

WITH
aggregated AS (
  SELECT
    DATE(tpep_pickup_datetime) AS ds,
    SUM(fare_amount) AS revenue
  FROM
    samples.nyctaxi.trips
  GROUP BY
    1
)
SELECT * FROM AI_FORECAST(
  TABLE(aggregated),
  horizon => '2016-03-31',
  time_col => 'ds',
  value_col => 'revenue'
)

The following is a more complex example:

WITH
aggregated AS (
  SELECT
    DATE(tpep_pickup_datetime) AS ds,
    dropoff_zip,
    SUM(fare_amount) AS revenue,
    COUNT(*) AS n_trips
  FROM
    samples.nyctaxi.trips
  GROUP BY
    1, 2
),
spine AS (
  SELECT all_dates.ds, all_zipcodes.dropoff_zip
  FROM (SELECT DISTINCT ds FROM aggregated) all_dates
  CROSS JOIN (SELECT DISTINCT dropoff_zip FROM aggregated) all_zipcodes
)
SELECT * FROM AI_FORECAST(
  TABLE(
    SELECT
      spine.*,
      COALESCE(aggregated.revenue, 0) AS revenue,
      COALESCE(aggregated.n_trips, 0) AS n_trips
    FROM spine LEFT JOIN aggregated USING (ds, dropoff_zip)
  ),
  horizon => '2016-03-31',
  time_col => 'ds',
  value_col => ARRAY('revenue', 'n_trips'),
  group_col => 'dropoff_zip',
  prediction_interval_width => 0.9,
  parameters => '{"global_floor": 0}'
)

Note that it is very common for tables to not materialize 0s or empty entries. If the values of the missing entries can be inferred, for example 0, then these values should be coalesced prior to calling the forecast function. If the values are truly missing or unknown, then they can be left as NULL.

For very sparse data, it is best practice to coalesce missing values or provide a frequency value explicitly to avoid unexpected output from the “auto” frequency inference. As an example, “auto” frequency inference on two entries 14 days apart will infer a frequency of “14D” even if “real” frequency might be weekly with 1 missing value. Coalescing the missing entries removes this ambiguity.

Finally, we show an example where different forecast parameters are applied to different groups in the input table:

WITH past AS (
  SELECT
    CASE
      WHEN fare_amount < 30 THEN 'Under $30'
      ELSE '$30 or more'
    END AS revenue_bucket,
    CASE
      WHEN fare_amount < 30 THEN '{"daily_order": 0}'
      ELSE '{"daily_order": "auto"}'
    END AS parameters,
    DATE(tpep_pickup_datetime) AS ds,
    SUM(fare_amount) AS revenue
  FROM samples.nyctaxi.trips
  GROUP BY ALL
)
SELECT * FROM AI_FORECAST(
  TABLE(past),
  horizon => (SELECT MAX(ds) + INTERVAL 30 DAYS FROM past),
  time_col => 'ds',
  value_col => 'revenue',
  group_col => ARRAY('revenue_bucket'),
  parameters => 'parameters'
)

Note the use of a column identifier as the parameters argument. This enables users to store previously-determined parameter JSONs in a table and reuse them on new data.

Limitations

The following limitations apply during the preview:

  • The default forecasting procedure is a prophet-like piecewise linear and seasonality model. This is the only supported forecasting procedure available.

  • Error messages are delivered through the Python UDTF engine, and contain Python trace back information. The end of the trace back contains the actual error message.