With time series feature tables, Databricks Feature Store supports time series and event-based use cases that require point-in-time correctness. You can designate a particular column as the feature table’s timestamp key and store historical feature values for a particular primary key value at different timestamps, each in a separate row. To retrieve the latest feature values as of a particular time for training or scoring a model, Databricks Feature Store supports point-in-time lookups against the time series feature table.
Point-in-time lookup functionality is sometimes referred to as “time travel”. The point-in-time functionality in Databricks Feature Store is not related to Delta Lake time travel.
Point-in-time lookups help avoid data leakage problems that can arise when a model is trained on feature values that are not available during real-time inference. Data leakage can introduce significant discrepancies in model performance between training and real-time inference. With time series feature tables, you can ensure that a model uses the latest features, based on timestamps you specify, for training.
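The leakage described above can be shown with a few rows of plain Python. This is only an illustrative sketch, not Feature Store code: the `history` rows and `event_ts` value are invented, and it assumes that a feature row stamped at or before the event time is eligible.

```python
from datetime import date

# Hypothetical purchases_30d history for one user: (timestamp, value).
history = [
    (date(2023, 1, 1), 2),
    (date(2023, 1, 10), 5),
    (date(2023, 2, 1), 9),  # recorded after the training event below
]

event_ts = date(2023, 1, 15)  # when the labeled event happened

# Leaky join: takes the newest value in the table, ignoring the event time.
leaky_value = max(history)[1]

# Point-in-time join: only rows at or before the event time are eligible.
eligible = [row for row in history if row[0] <= event_ts]
point_in_time_value = max(eligible)[1] if eligible else None

print(leaky_value, point_in_time_value)
```

The leaky join hands the model a value that did not exist when the event occurred, which is exactly the information that is missing at inference time.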
Consider using time series feature tables if your feature values change over time, for example with time series data, event-based data, or time-aggregated data.
To create a time series feature table, the DataFrame or schema must contain a column that you designate as the timestamp key.
    from databricks.feature_store import FeatureStoreClient

    fs = FeatureStoreClient()

    # user_features_df DataFrame contains the following columns:
    # - user_id
    # - ts
    # - purchases_30d
    # - is_free_trial_active
    fs.create_table(
        name="ads_team.user_features",
        primary_keys="user_id",
        timestamp_keys="ts",
        df=user_features_df,
    )
A time series feature table must have one timestamp key and cannot have any partition columns. The timestamp key column must be of TimestampType or DateType and cannot also be a primary key.
Databricks recommends that time series feature tables have no more than two primary key columns to ensure performant writes and lookups.
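The structural rules above can be summarized as a small pre-flight check. The helper below is hypothetical and is not part of the Feature Store API; it only inspects lists of column names and treats the two-primary-key guidance as a warning rather than an error.

```python
def check_time_series_keys(primary_keys, timestamp_keys, partition_columns=()):
    """Return a list of problems; an empty list means the layout passes."""
    problems = []
    if len(timestamp_keys) != 1:
        problems.append("exactly one timestamp key is required")
    if partition_columns:
        problems.append("partition columns are not allowed")
    overlap = sorted(set(primary_keys) & set(timestamp_keys))
    if overlap:
        problems.append(f"timestamp key also used as primary key: {overlap}")
    if len(primary_keys) > 2:
        # A performance recommendation, not a hard limit.
        problems.append("more than two primary keys may slow writes and lookups")
    return problems


print(check_time_series_keys(["user_id"], ["ts"]))
print(check_time_series_keys(["user_id", "ts"], ["ts"], ["region"]))
```

Running a check like this before creating the table surfaces layout mistakes early, without a round trip to the workspace.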
When writing features to a time series feature table, your DataFrame must supply values for all features of the feature table, unlike with regular feature tables. This constraint reduces the sparsity of feature values across timestamps in the time series feature table.
    fs = FeatureStoreClient()

    # daily_users_batch_df DataFrame contains the following columns:
    # - user_id
    # - ts
    # - purchases_30d
    # - is_free_trial_active
    fs.write_table(
        "ads_team.user_features",
        daily_users_batch_df,
        mode="merge",
    )
Streaming writes to time series feature tables are supported.
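To make the all-features requirement concrete, here is a minimal in-memory sketch of a merge-style write. It is not how Feature Store is implemented: `merge_batch` and the dict-based table are hypothetical, and the sketch assumes that a merge upserts rows keyed on the primary key plus the timestamp key.

```python
from datetime import date

FEATURES = ("purchases_30d", "is_free_trial_active")


def merge_batch(table, batch, features=FEATURES):
    """Upsert batch rows into table, keyed on (user_id, ts).

    table is a dict {(user_id, ts): {feature: value}}; batch is a list of
    dict rows. Raises ValueError if any row omits a feature column, in
    which case nothing is written.
    """
    # Validate the whole batch first so a bad row cannot cause a partial write.
    for row in batch:
        missing = [f for f in features if f not in row]
        if missing:
            raise ValueError(f"row for user {row.get('user_id')} lacks {missing}")
    for row in batch:
        key = (row["user_id"], row["ts"])
        table[key] = {f: row[f] for f in features}
    return table


table = {}
merge_batch(table, [
    {"user_id": 1, "ts": date(2023, 1, 1),
     "purchases_30d": 2, "is_free_trial_active": True},
])
print(len(table))
```

Rejecting incomplete rows up front keeps every (key, timestamp) row fully populated, which is the sparsity reduction the constraint is aiming for.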
To perform a point-in-time lookup for feature values from a time series feature table, you must specify a timestamp_lookup_key in the feature’s FeatureLookup. It names the DataFrame column that contains the timestamps against which to look up time series features. Databricks Feature Store retrieves the latest feature values prior to the timestamps specified in the DataFrame’s timestamp_lookup_key column and whose primary keys match the values in the DataFrame’s lookup_key columns, or null if no such feature value exists.
    from databricks.feature_store import FeatureLookup

    feature_lookups = [
        # Point-in-time lookup against the time series feature table.
        FeatureLookup(
            table_name="ads_team.user_features",
            feature_names=["purchases_30d", "is_free_trial_active"],
            lookup_key="u_id",
            timestamp_lookup_key="ad_impression_ts",
        ),
        # No timestamp_lookup_key: ad_features is looked up by key only.
        FeatureLookup(
            table_name="ads_team.ad_features",
            feature_names=["sports_relevance", "food_relevance"],
            lookup_key="ad_id",
        ),
    ]

    # raw_clickstream DataFrame contains the following columns:
    # - u_id
    # - ad_id
    # - ad_impression_ts
    training_set = fs.create_training_set(
        raw_clickstream,
        feature_lookups=feature_lookups,
        exclude_columns=["u_id", "ad_id", "ad_impression_ts"],
        label="did_click",
    )
    training_df = training_set.load_df()
Any FeatureLookup on a time series feature table must be a point-in-time lookup, so it must specify a timestamp_lookup_key column to use in your DataFrame. Point-in-time lookup does not skip rows with null feature values stored in the time series feature table.
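The lookup rule and its handling of nulls can be sketched in plain Python with the standard `bisect` module. This is an illustration of the semantics only, not the Feature Store implementation. The function name and sample history are invented, and the sketch assumes a row stamped exactly at the lookup time is eligible; if "prior to" is read strictly, `bisect_left` would be the matching choice.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_lookup(history, lookup_ts):
    """Return the feature value from the latest row at or before lookup_ts.

    history is a list of (timestamp, value) tuples for one primary key,
    sorted by timestamp. Returns None when no row qualifies. A stored None
    is returned as-is: older non-null rows are not consulted.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, lookup_ts)
    if idx == 0:
        return None
    return history[idx - 1][1]


# Hypothetical purchases_30d history for one user.
history = [
    (datetime(2023, 1, 1), 2),
    (datetime(2023, 1, 10), None),  # a null stored in the feature table
    (datetime(2023, 2, 1), 9),
]

print(point_in_time_lookup(history, datetime(2022, 12, 31)))  # None (no row yet)
print(point_in_time_lookup(history, datetime(2023, 1, 5)))    # 2
print(point_in_time_lookup(history, datetime(2023, 1, 15)))   # None (stored null)
print(point_in_time_lookup(history, datetime(2023, 3, 1)))    # 9
```

The third call is the case to watch: the lookup returns the stored null rather than falling back to the earlier value of 2, so nulls written to the table flow straight into training data.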
When you score a model trained with features from time series feature tables, Databricks Feature Store retrieves the appropriate features using point-in-time lookups with metadata packaged with the model during training. The DataFrame you provide to FeatureStoreClient.score_batch must contain a timestamp column with the same name and DataType as the timestamp_lookup_key of the FeatureLookup provided to