Building a machine learning pipeline for a recommender system usually involves the following stages:
This reference solution covers the stages shown in blue. (See Which topics are not covered?)
The notebook covers several tools provided on Databricks that simplify building a machine learning pipeline:
The data used in this notebook consists of the following Delta tables:
- user_profile: contains the user_id values and their static profiles
- item_profile: contains the item_id values and their static profiles
- user_item_interaction: contains events where a user interacts with an item. This table is randomly split into three Delta tables to build and evaluate the model:
This data format is common for recommendation problems. Some examples are:
- For ad recommenders, the items are ads and the user-item interactions are records of users clicking the ads.
- For online shopping recommenders, the items are products and the user-item interactions are records of user reviews or order history.
When you adapt this notebook to your dataset, you only need to save your data in the Delta tables and provide the table names and locations. The code for loading data can mostly be reused.
See the dataset generation notebook for details.
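The table layout and the random three-way split described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the notebook's data-loading code: the rows are toy data, the columns age_bucket, category, and label are hypothetical, and the split names (train, val, test) and the 80/10/10 fractions are assumed rather than taken from the notebook.

```python
import random

# Hypothetical toy rows standing in for the three Delta tables.
user_profile = [{"user_id": u, "age_bucket": u % 3} for u in range(4)]
item_profile = [{"item_id": i, "category": i % 2} for i in range(5)]
user_item_interaction = [
    {"user_id": u["user_id"], "item_id": i["item_id"],
     "label": (u["user_id"] + i["item_id"]) % 2}
    for u in user_profile
    for i in item_profile
]


def random_split(rows, fractions=(0.8, 0.1, 0.1), seed=42):
    """Shuffle rows and cut them into consecutive parts sized by `fractions`."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for k, frac in enumerate(fractions):
        # The last part takes the remainder so that no row is dropped.
        if k == len(fractions) - 1:
            end = len(shuffled)
        else:
            end = start + round(frac * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts


# Assumed split names; the notebook's actual table names may differ.
train, val, test = random_split(user_item_interaction)
```

On a Spark DataFrame, the same kind of split is usually done with `DataFrame.randomSplit`, which takes a list of weights and an optional seed.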
This notebook uses the wide-and-deep model (paper | TensorFlow implementation). This popular model combines a wide linear model, which memorizes specific feature co-occurrences, with a deep neural network, which generalizes to feature combinations not seen in training.
This model is just one of many deep learning models that can be used for recommendation problems, or in machine learning pipelines in general. The focus here is showing how to build the workflow. You can swap in different models for your own use case and tune the model for better evaluation metrics.
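To make the wide-and-deep idea concrete, here is a minimal forward-pass sketch in plain Python. It is not the TensorFlow implementation the notebook uses: the parameter names and the hand-picked weights are hypothetical, and a real model would learn the wide and deep parts jointly and typically feed embeddings of categorical features into the deep part.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def dense(v, weights, bias):
    """Fully connected layer: `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def wide_and_deep_predict(wide_x, deep_x, params):
    # Wide part: a linear model over (typically sparse or crossed) features.
    wide_logit = sum(w * x for w, x in zip(params["wide_w"], wide_x))
    # Deep part: a small feed-forward network over dense features.
    hidden = relu(dense(deep_x, params["h_w"], params["h_b"]))
    deep_logit = dense(hidden, [params["out_w"]], [0.0])[0]
    # The two logits are summed with a shared bias and passed through a sigmoid.
    logit = wide_logit + deep_logit + params["bias"]
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical hand-picked weights, for illustration only.
params = {
    "wide_w": [0.5, -0.25],
    "h_w": [[1.0, 0.0], [0.0, 1.0]],
    "h_b": [0.0, 0.0],
    "out_w": [0.5, 0.5],
    "bias": 0.0,
}
p = wide_and_deep_predict([1.0, 0.0], [1.0, -1.0], params)
```

Here `p` is the predicted probability that the user interacts with the item. Both parts contribute to a single logit, which is what allows them to be trained together.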
This notebook uses DBFS access to the local filesystem (FUSE mount) and is not supported on Databricks on Google Cloud as of this release.
To keep the notebook focused on showing how to implement a recommender system, the following stages are not covered. These stages are shown as gray blocks in the workflow diagram.
- Data collection and exploratory data analysis. See Data guide.
- Feature engineering. Feature engineering is an important part of a recommender system, and much information is available on this topic. This notebook assumes that you have a curated dataset containing user-item interactions. For details about the dataset used in this notebook, see Data. For more information about feature engineering, see the following resources:
- The Databricks Solution Accelerators notebooks Personalizing the Customer Experience with Recommendations show examples of feature engineering in a recommender system.
- See Preprocess data for examples of feature engineering with scikit-learn, MLlib, and transfer learning.
- Model tuning. Model tuning involves revising the code of the existing pipeline, including feature engineering, model structure, model hyperparameters, or even updating the data collection stage, to improve the model’s performance. For more information about tools for model tuning on Databricks, see Hyperparameter tuning and automated machine learning.
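As a rough illustration of the hyperparameter part of model tuning, the sketch below runs an exhaustive grid search in plain Python. It is an assumption-laden stand-in, not one of the Databricks tuning tools referred to above: `grid_search`, `toy_validation_loss`, and the search space are hypothetical, and the toy objective replaces the real step of training the model and measuring validation loss.

```python
import itertools


def grid_search(objective, space):
    """Evaluate `objective` on every combination in `space`; return the best."""
    names = sorted(space)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(space[n] for n in names)):
        candidate = dict(zip(names, values))
        score = objective(candidate)
        if score < best_score:  # lower is better (a loss)
            best_params, best_score = candidate, score
    return best_params, best_score


# Hypothetical stand-in for "train the model and return validation loss".
def toy_validation_loss(p):
    return (p["learning_rate"] - 0.01) ** 2 + (p["hidden_units"] - 64) ** 2 * 1e-6


space = {"learning_rate": [0.001, 0.01, 0.1], "hidden_units": [32, 64, 128]}
best_params, best_score = grid_search(toy_validation_loss, space)
```

Grid search is the simplest strategy to show. Because each evaluation means retraining the model, its cost grows multiplicatively with every added hyperparameter, which is why smarter search methods are usually preferred in practice.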