Use time series feature tables with point-in-time support

Databricks Feature Store supports use cases that require point-in-time correctness.

The data used to train a model often has time dependencies built into it. For example, if you are training a model to predict which machines on a factory floor need maintenance, you might have historical datasets that contain sensor measurements and usage data for many machines, along with target labels that indicate if the machine needed service or not. The dataset might contain data for machines both before and after a maintenance service was performed.

When you build the model, you must consider only feature values up until the time of the observed target value (needs service or does not need service). If you do not explicitly take into account the timestamp of each observation, you might inadvertently use feature values measured after the timestamp of the target value for training. This is called “data leakage” and can negatively affect the model’s performance.

Time series feature tables include a timestamp key column that ensures that each row in the training dataset represents the latest known feature values as of the row’s timestamp. You should use time series feature tables whenever feature values change over time, for example with time series data, event-based data, or time-aggregated data.

Note

  • Point-in-time lookup functionality is sometimes referred to as “time travel”. The point-in-time functionality in Databricks Feature Store is not related to Delta Lake time travel.

  • To use point-in-time functionality, you must specify time-related keys using the timestamp_keys argument. This indicates that feature table rows should be joined by matching the most recent value for a particular primary key that is not later than the value of the timestamps_keys column, instead of joining based on an exact time match. If you designate a timestamp column as a primary key column, feature store does not apply point-in-time logic to the timestamp column during joins. Instead, it matches only rows with an exact time match instead of matching all rows prior to the timestamp.

How time series feature tables work

Suppose you have the following feature tables. This data is taken from the example notebook.

The tables contain sensor data measuring the temperature, relative humidity, ambient light, and carbon dioxide in a room. The ground truth table indicates if a person was present in the room. Each of the tables has a primary key (‘room’) and a timestamp key (‘ts’). For simplicity, only data for a single value of the primary key (‘0’) is shown.

example feature table data

The following figure illustrates how the timestamp key is used to ensure point-in-time correctness in a training dataset. Feature values are matched based on the primary key (not shown in the diagram) and the timestamp key, using an AS OF join. The AS OF join ensures that the most recent value of the feature at the time of the timestamp is used in the training set.

how point in time works

As shown in the figure, the training dataset includes the latest feature values for each sensor prior to the timestamp on the observed ground truth.

If you created a training dataset without taking into account the timestamp key, you might have a row with these feature values and observed ground truth:

temp

rh

light

co2

ground truth

15.8

32

212

630

0

However, this is not a valid observation for training, because the co2 reading of 630 was taken at 8:52, after the observation of the ground truth at 8:50. The future data is “leaking” into the training set, which will impair the model’s performance.

Requirements

Feature Store client v0.3.7 and above.

Create a time series feature table in Databricks Feature Store

To create a time series feature table, the DataFrame or schema must contain a column that you designate as the timestamp key.

fs = FeatureStoreClient()
# user_features_df DataFrame contains the following columns:
# - user_id
# - ts
# - purchases_30d
# - is_free_trial_active
fs.create_table(
  name="ads_team.user_features",
  keys="user_id",
  timestamp_keys="ts",
  features_df=user_features_df,
)

A time series feature table must have one timestamp key and cannot have any partition columns. The timestamp key column must be of TimestampType or DateType and cannot also be a primary key.

Databricks recommends that time series feature tables have no more than two primary key columns to ensure performant writes and lookups.

Update a time series feature table

When writing features to the time series feature tables, your DataFrame must supply values for all features of the feature table, unlike regular feature tables. This constraint reduces the sparsity of feature values across timestamps in the time series feature table.

fs = FeatureStoreClient()
# daily_users_batch_df DataFrame contains the following columns:
# - user_id
# - ts
# - purchases_30d
# - is_free_trial_active
fs.write_table(
  "ads_team.user_features",
  daily_users_batch_df,
  mode="merge"
)

Streaming writes to time series feature tables is supported.

Create a training set with a time series feature table

To perform a point-in-time lookup for feature values from a time series feature table, you must specify a timestamp_lookup_key in the feature’s FeatureLookup, which indicates the name of the DataFrame column that contains timestamps against which to lookup time series features. Databricks Feature Store retrieves the latest feature values prior to the timestamps specified in the DataFrame’s timestamp_lookup_key column and whose primary keys match the values in the DataFrame’s lookup_key columns, or null if no such feature value exists.

feature_lookups = [
  FeatureLookup(
    table_name="ads_team.user_features",
    feature_names=["purchases_30d", "is_free_trial_active"],
    lookup_key="u_id",
    timestamp_lookup_key="ad_impression_ts"
  ),
  FeatureLookup(
    table_name="ads_team.ad_features",
    feature_names=["sports_relevance", "food_relevance"],
    lookup_key="ad_id",
  )
]

# raw_clickstream DataFrame contains the following columns:
# - u_id
# - ad_id
# - ad_impression_ts
training_set = fs.create_training_set(
  raw_clickstream,
  feature_lookups=feature_lookups,
  exclude_columns=["u_id", "ad_id", "ad_impression_ts"],
  label="did_click",
)
training_df = training_set.load_df()

Any FeatureLookup on a time series feature table must be a point-in-time lookup, so it must specify a timestamp_lookup_key column to use in your DataFrame. Point-in-time lookup does not skip rows with null feature values stored in the time series feature table.

Score models with time series feature tables

When you score a model trained with features from time series feature tables, Databricks Feature Store retrieves the appropriate features using point-in-time lookups with metadata packaged with the model during training. The DataFrame you provide to FeatureStoreClient.score_batch must contain a timestamp column with the same name and DataType as the timestamp_lookup_key of the FeatureLookup provided to FeatureStoreClient.create_training_set.

Example notebook

Time series feature table example notebook

Open notebook in new tab