Recommendation Systems Notes

Introduction


Recommendation systems use machine learning models to determine user preferences based on a user's history or by analyzing the preferences of similar users. These systems typically involve documents (entities a system recommends, like movies or videos), queries (information needed to make recommendations, such as user data or location), and embeddings (mappings of queries or documents to a vector space called the embedding space).

A typical pipeline has three stages:
-Candidate generation: the process begins here, starting with a vast corpus and generating a smaller subset of candidates
-Scoring: in this step, the model ranks the candidates to select a smaller, more refined set of documents
-Re-ranking: this final step refines the recommendations

Many large-scale recommendation systems implement these processes in two main stages: retrieval and ranking



2-Stage Recommender Systems


-Retrieval: quickly narrows down the vast candidate pool (millions of items) to a smaller subset using scalable models such as matrix factorization or two-tower architectures
-Ranking: applies more sophisticated models incorporating richer feature sets, such as user context or item-specific details, to assign relevance scores and generate a personalized, ranked list


4-Stage Recommender Systems


-Retrieval: identifies an initial set of candidates, emphasizing efficiency and broad coverage
-Filtering: applies business rules to exclude ineligible or irrelevant items (out-of-stock products, regionally restricted content)
-Scoring: uses advanced algorithms, such as deep learning models, to predict interaction likelihood by analyzing user-item interactions and contextual features
-Ranking (or Ordering): finalizes the recommendation list, balancing relevance, diversity, novelty, and business goals

Eugene Yan’s 2-Stage (2x2) Model of a Recommender System


Eugene Yan’s 2x2 model identifies two main dimensions within discovery systems:
-Environment: offline vs. online
-Process: candidate retrieval vs. ranking

-the offline stage flows bottom-up, producing the necessary artifacts for the online environment, while the online stage processes requests left-to-right, following retrieval and ranking steps to return a final set of recommendations or search results


Offline Environment


-in the offline environment, batch processes handle tasks such as model training, creating embeddings for catalog items, and developing structures like approximate nearest neighbors (ANN) indices or knowledge graphs to identify item similarity. The offline stage may also involve loading item and user data into a feature store, which helps augment input data during ranking


Online Environment


-the online environment serves individual user requests in real time, utilizing artifacts generated offline (ANN indices, knowledge graphs, models). Here, input items or queries are transformed into embeddings, followed by two main steps:
-Candidate retrieval: quickly narrows down millions of items to a more manageable set of candidates, trading precision for speed
-Ranking: ranks the reduced set of items by relevance. This stage enables the inclusion of additional features such as item and user data or contextual information, which are computationally intensive but feasible due to the smaller candidate set



Candidate Generation


Candidate generation is the first stage of recommendation and typically works by relating user features to item features.

It’s the process of selecting a set of items that are likely to be recommended to a user. This process is typically performed after user preferences have been collected, and it involves filtering a large set of potential recommendations down to a more manageable number. The goal of candidate generation is to reduce the number of items that must be evaluated by the system, while still ensuring that the most relevant recommendations are included.

Given a query (user info), the model generates a set of relevant candidates (videos, movies).

There are two common candidate generation approaches:
-Content-based filtering: uses similarity between content to recommend new content. E.g., if a user watches corgi videos, the model will recommend more corgi videos
-Collaborative filtering: uses similarity between queries (two or more users) and items (videos, movies) to provide recommendations. E.g., if user A watches corgi videos and user A is similar to user B (in demographics and other areas), then the model can recommend corgi videos to user B even if user B has never watched one before



Sparse and Dense Features


Recommender systems typically deal with two kinds of features: dense and sparse. Dense features are continuous real values and low-cardinality categorical variables: movie ratings, release years, age, gender, number of likes in the past week, etc. Sparse features are high-cardinality categorical variables, and they are represented by embeddings: user ID, page ID, ad ID, movie genre, the list of actors in a film, etc.
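
A quick sketch of how each feature type typically enters a model (PyTorch; the specific features, IDs, and sizes here are made up):

```python
import torch
import torch.nn as nn

# Dense features: already numeric, used (almost) as-is.
dense = torch.tensor([[0.42, 31.0, 7.0]])  # e.g., avg. rating, age, likes this week

# Sparse features: high-cardinality IDs, mapped to learned embeddings.
num_users, embed_dim = 10_000, 16
user_embedding = nn.Embedding(num_users, embed_dim)

user_id = torch.tensor([1234])
user_vec = user_embedding(user_id)          # shape: (1, 16)

# A typical model concatenates the dense features with the embeddings.
model_input = torch.cat([dense, user_vec], dim=1)  # shape: (1, 19)
print(model_input.shape)
```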



Content-Based Filtering


Content-based filtering suggests items to users by analyzing the similarity between item features and the user’s known preferences: it recommends items that resemble those a user has previously liked or interacted with.
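
A minimal numpy sketch of the idea (the item feature matrix and liked items are toy values): build a user profile from liked items' features, then score every item by cosine similarity.

```python
import numpy as np

# Toy item feature vectors (e.g., one-hot genres plus a popularity score).
item_features = np.array([
    [1, 0, 0, 0.9],   # item 0: genre A, very popular
    [1, 0, 0, 0.2],   # item 1: genre A, niche
    [0, 1, 0, 0.8],   # item 2: genre B
    [0, 0, 1, 0.5],   # item 3: genre C
])

# User profile: average of the feature vectors of items the user liked.
liked_items = [0]
user_profile = item_features[liked_items].mean(axis=0)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([cosine_sim(user_profile, f) for f in item_features])
ranked = np.argsort(-scores)
print(ranked)  # items most similar to what the user already liked come first
```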

Content-based filtering offers several advantages over alternative recommendation approaches:
-Effective in data-limited environments: this method is particularly useful when there is limited or no data on user behavior, such as in new or niche domains with a small user base
-Personalized recommendations: since content-based filtering relies on an individual user’s preferences rather than the collective behavior of other users, it can provide highly personalized recommendations
-Scalability: the model does not require data from other users, focusing solely on the current user’s information, which makes it easier to scale
-Recommendation of niche items: it can recommend niche items tailored to each user’s unique preferences, items that might not be of interest to a broader audience
-Ability to recommend new items: the method can recommend newly introduced items without waiting for user interaction data, as the recommendations are based on the item’s inherent features
-Capturing unique user interests: content-based filtering excels at capturing and catering to the distinct interests of users by recommending items based on their previous engagements

Limitations of content-based filtering:
-Need for domain knowledge: features must often be manually engineered, so the effectiveness of the model is closely tied to the quality of these hand-engineered features
-Limited exploration of new interests: the model tends to recommend items within the user’s existing preferences, limiting its ability to introduce new or diverse interests and help users discover new ones
-Dependence on feature availability: the method may be ineffective where detailed item information is lacking or items possess few attributes or features



Collaborative Filtering


A recommendation system technique used to suggest items to users based on the behaviors and preferences of a broader user group. It operates on the principle that users with similar tastes are likely to enjoy similar items.

Unlike content-based filtering, collaborative filtering automatically learns embeddings without relying on manually engineered features.

Advantages of collaborative filtering
-Effectiveness in sparse data environments: collaborative filtering is particularly effective when there is limited information about item attributes, as it relies on user behavior rather than item features
-Scalability: it performs well in environments with a large and diverse user base, making it suitable for systems with extensive datasets
-Serendipitous Recommendations: the system can provide unexpected yet relevant recommendations, helping users discover items they might not have found independently
-No Domain Knowledge Required: it does not require specific knowledge of item features, eliminating the need for domain-specific expertise in engineering these features
-Efficiency: collaborative filtering models are generally faster and less computationally intensive than content-based filtering, as they do not require analysis of item-specific attributes


Disadvantages of collaborative filtering
-Data scarcity: it can struggle in situations where there is insufficient data on user behavior, limiting its effectiveness
-Unique preferences: the system may have difficulty making accurate recommendations for users with highly unique or niche preferences, as it relies on similarities between users
-Cold start problem: new items or users with limited data pose challenges for generating accurate recommendations due to the lack of historical interaction data
-Difficulty handling niche interests



Matrix Factorization (MF)


-Matrix Factorization (MF) is a simple embedding model. The algorithm performs a decomposition of the (sparse) user-item feedback matrix into the product of two (dense) lower-dimensional matrices. One matrix represents the user embeddings, while the other represents the item embeddings. In essence, the model learns to map each user to an embedding vector and similarly maps each item to an embedding vector, such that the distance between these vectors reflects their relevance.

-as part of training, we aim to produce user and item embedding matrices whose product is a good approximation of the feedback matrix, i.e., multiplying the two matrices recovers a close approximation of the original sparse matrix.

It takes a large, sparse user-item interaction matrix R and factorizes it into two lower-dimensional matrices: R ≈ U · Vᵀ, where U holds the user embeddings and V the item embeddings.

You multiply the full factor matrices to get the entire predicted matrix, but each individual prediction is just a dot product: the predicted score for user u and item i is the dot product between the u-th user vector and the i-th item vector. So the reconstruction is a matrix multiplication globally, and a user-item dot product at the level of individual predictions.

Why do this?
Fill in missing values:
-the original matrix is super sparse (most users haven’t rated most items)
-after factorization, the dot product between user and item embeddings gives a predicted rating or interaction score
Learn latent structure:
-learns hidden patterns behind why users like certain items
Enable recommendations:
-once you have embeddings for all users and items, you can recommend top-k items, find similar users/items via vector similarity
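
A tiny SGD-based matrix factorization sketch (numpy; the feedback matrix, learning rate, and embedding size are toy values). It trains only on observed entries and then fills in the zeros via dot products:

```python
import numpy as np

# Sparse feedback matrix R: 0 means "unobserved", not "disliked".
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

num_users, num_items = R.shape
k = 2                                    # embedding dimension
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(num_users, k))   # user embeddings
V = rng.normal(scale=0.1, size=(num_items, k))   # item embeddings

lr, reg = 0.01, 0.01
for _ in range(2000):
    for u, i in zip(*R.nonzero()):       # train only on observed entries
        err = R[u, i] - U[u] @ V[i]      # each prediction is a dot product
        u_row = U[u].copy()
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * u_row - reg * V[i])

print(np.round(U @ V.T, 2))  # dense approximation fills in the missing entries
```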



Neural Collaborative Filtering


A type of recommendation system that uses neural networks to learn user-item interactions and predict preferences. It generalizes matrix factorization, a common collaborative filtering technique, by replacing the dot product with a neural network architecture, which lets it capture more complex, non-linear relationships between users and items. The model takes a user ID and an item ID, looks up their embeddings, concatenates them, and feeds the result through an MLP. The output is the predicted interaction score (e.g., click probability).
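
A minimal sketch in PyTorch (the user/item counts and layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class NCF(nn.Module):
    """Neural collaborative filtering: concat embeddings, then an MLP."""
    def __init__(self, num_users, num_items, dim=16):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)
        self.item_emb = nn.Embedding(num_items, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, user_ids, item_ids):
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)  # predicted interaction prob

model = NCF(num_users=1000, num_items=5000)
prob = model(torch.tensor([42]), torch.tensor([1337]))
print(prob)  # e.g., predicted click probability
```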



Implicit vs. Explicit Feedback in RecSys


Explicit feedback is when users directly give ratings (e.g., 4 stars, thumbs up). The signal is clear, but the data is sparser.
Implicit feedback is inferred from behavior (e.g., clicks, watch time, purchases, skips). It is much more abundant, but noisy: just because someone watched something doesn’t mean they liked it.



Deep Neural Network Based Recommendations


-DNNs have more capacity to learn complex patterns in user behavior and item attributes
-one common approach for DNN-based recommendation is to use a matrix factorization model as a baseline and then incorporate additional layers of neural networks to capture more complex patterns in the user-item interactions
-another popular approach is to use a sequence modeling architecture, such as an RNN or a transformer network. These models can capture temporal dependencies in user behavior and item popularity, allowing for more accurate and personalized recommendations. For example, an RNN can model the sequence of items that a user has interacted with over time, then use this information to predict which item the user is likely to interact with next
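
A sketch of that sequence-model idea (PyTorch, GRU-based; the catalog size and dimensions are made up). The final hidden state of the sequence scores every item in the catalog as the possible next interaction:

```python
import torch
import torch.nn as nn

class NextItemGRU(nn.Module):
    """Scores the next item from a user's interaction sequence."""
    def __init__(self, num_items, dim=32):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, num_items)    # logits over the catalog

    def forward(self, item_seq):                # item_seq: (batch, seq_len)
        x = self.item_emb(item_seq)
        _, h = self.gru(x)                      # h: (1, batch, dim)
        return self.out(h.squeeze(0))           # (batch, num_items)

model = NextItemGRU(num_items=5000)
logits = model(torch.tensor([[11, 42, 7, 42, 99]]))   # one user's history
next_item = logits.argmax(dim=-1)                     # predicted next interaction
```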



Two-tower Model


-many online platforms, like YouTube, Facebook, and TikTok, use the two-tower model in their recommender systems
-the two-tower model consists of two sub-neural networks: query and item
-the query tower encodes user data; the item tower encodes product data
-the output of each tower is an embedding, i.e., a dense vector
-the similarity of a user and product pair is measured using the dot product
-the trained embeddings of query and item towers are stored for fast retrieval
In a two-tower architecture, there are two towers or columns: one tower represents the user and the other tower represents the item. Each tower consists of multiple layers of neurons, which can be fully connected or sparse. The user tower takes as input the user’s features, such as demographic information, browsing history, or social network connections, and processes them through the layers to produce a user embedding, which is a low-dimensional vector representation of the user. The item tower takes as input the item’s features, such as its genre, cast, or director, and processes them through the layers to produce an item embedding, which is a low-dimensional vector representation of the item. The user and item embeddings are then combined using a similarity function, such as dot product or cosine similarity, to produce a score that represents the predicted rating or likelihood of interaction between the user and the item.
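
A minimal two-tower sketch in PyTorch (tower widths, feature dimensions, and normalization choice are made up for illustration): each tower maps its raw features to an embedding, and a dot product scores user-item pairs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, in_dim, out_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-length embeddings

user_tower = Tower(in_dim=20)   # user/query features
item_tower = Tower(in_dim=50)   # item/product features

user_emb = user_tower(torch.randn(1, 20))
item_emb = item_tower(torch.randn(100, 50))   # a batch of candidate items

# Dot product (cosine here, since embeddings are normalized) scores the pairs.
scores = user_emb @ item_emb.T        # shape: (1, 100)
top_items = scores.topk(10).indices
```

Because the towers never see each other's inputs, item embeddings can be precomputed offline and only the user tower runs at serve time.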



Candidate Retrieval


-Now that you have an embedding model, how would you decide which items to recommend given a user?
-at serve time, given a query, you start by doing one of the following:
-for a matrix factorization model, the query (or user) embedding is known statically, and the system can simply look it up from the user embedding matrix
-for a DNN model, the system computes the query embedding at serve time by running the network on the feature vector x
-once you have the query embedding q, search for item embeddings that are close to q in the embedding space. This is a nearest neighbor problem.

Large-scale Retrieval
-to compute the nearest neighbors in the embedding space, the system can exhaustively score every potential candidate. Exhaustive scoring can be expensive for very large corpora, but you can use either of the following strategies to make it more efficient:
-if the query embedding is known statically, the system can perform exhaustive scoring offline, pre-computing and storing a list of the top candidates for each query. This is a common practice for related-item recommendation
-use approximate nearest neighbors (ANN)
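
A brute-force version of exhaustive scoring (numpy; the corpus size, dimension, and random embeddings are stand-ins). At scale you would swap this for an ANN library (e.g., FAISS, ScaNN, or an HNSW index):

```python
import numpy as np

# Assume item embeddings were produced offline by the trained model.
num_items, dim = 100_000, 64
rng = np.random.default_rng(0)
item_embs = rng.normal(size=(num_items, dim)).astype(np.float32)

def retrieve(query_emb, k=100):
    """Exhaustive scoring: dot product against every item, then top-k."""
    scores = item_embs @ query_emb
    top_k = np.argpartition(-scores, k)[:k]       # unordered top-k
    return top_k[np.argsort(-scores[top_k])]      # sorted by score

candidates = retrieve(rng.normal(size=dim).astype(np.float32))
```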



Metrics


Click-Through Rate (CTR)


Out of all the times we showed a particular item (or group of items), how often did users click on it?
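
As a formula: CTR = clicks / impressions. E.g., an item shown 1,000 times and clicked 20 times has a CTR of 20 / 1,000 = 2%.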


NDCG (Normalized Discounted Cumulative Gain)


NDCG measures how well your ranking model orders items, taking into account both relevance and position in the list. The idea: relevant items at the top of the list should get more credit than relevant items buried at the bottom. NDCG evaluates your model’s ordering, not just which items are present.
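
A small implementation (plain Python/numpy; the relevance lists are toy examples with graded relevance scores):

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain: position i is discounted by log2(i + 1)."""
    relevances = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(relevances) + 1)
    return np.sum(relevances / np.log2(positions + 1))

def ndcg(relevances):
    ideal = sorted(relevances, reverse=True)      # best possible ordering
    return dcg(relevances) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# Same relevant items, different order: NDCG rewards relevance at the top.
print(ndcg([3, 2, 0, 1]))  # ~0.99, good ordering
print(ndcg([0, 1, 2, 3]))  # ~0.61, relevant items buried at the bottom
```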


Recall@K


Out of all the relevant items for this user, how many did my system successfully surface in the top K ranked list?

In practice, relevance is defined from logs.

Let’s say you want to compute recall@10 offline. You could define “relevant” items as:
-items the user clicked in the past 24 hours
-items they watched for >30s in the past week
-items they gave high ratings to in past data
Then you ask: of these relevant items, how many appeared in our top-10 list?
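
A minimal sketch (plain Python; the item IDs are toy values):

```python
def recall_at_k(recommended, relevant, k=10):
    """Fraction of relevant items that appear in the top-k recommendations."""
    top_k = set(recommended[:k])
    if not relevant:
        return 0.0
    return len(top_k & set(relevant)) / len(relevant)

# relevant = items the user actually clicked/watched, taken from logs
print(recall_at_k(recommended=[5, 9, 2, 7, 1], relevant=[9, 3, 1], k=5))  # 2/3
```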


MAP (Mean Average Precision)


Measures how well the recommendation system ranks relevant items near the top of the recommendation list across all users. It’s useful when you want to reward systems that rank relevant items higher. You care not just what items are recommended, but where in the list they appear.
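
A small sketch (plain Python; toy item IDs). Average precision takes the precision at each rank where a relevant item appears, then averages over the user's relevant items; MAP averages that across users:

```python
def average_precision(recommended, relevant):
    """Precision at each hit, averaged over the user's relevant items."""
    relevant = set(relevant)
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank           # precision at this rank
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(all_recs, all_rels):
    return sum(average_precision(r, t)
               for r, t in zip(all_recs, all_rels)) / len(all_recs)

# Same hits, different positions: earlier hits score higher.
print(average_precision([9, 5, 1], [9, 1]))   # (1/1 + 2/3) / 2 ≈ 0.83
print(average_precision([5, 9, 1], [9, 1]))   # (1/2 + 2/3) / 2 ≈ 0.58
```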


NDCG vs. MAP intuition:


-NDCG is how well did you rank the most important items near the top?
-MAP is how many of your top-ranked items were correct, and how early did you get them?



Other stuff


Diversity is usually injected during re-ranking, not the initial ranking stage

Genre constraint: “no more than 2 items of the same genre in the top 5”
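
A greedy sketch of enforcing that kind of constraint during re-ranking (plain Python; the cap, list size, and genre labels are toy values):

```python
from collections import Counter

def rerank_with_genre_cap(ranked_items, genres, top_n=5, max_per_genre=2):
    """Greedy re-rank: keep score order but cap how many items share a genre."""
    result, counts, deferred = [], Counter(), []
    for item in ranked_items:
        if len(result) == top_n:
            break
        if counts[genres[item]] < max_per_genre:
            result.append(item)
            counts[genres[item]] += 1
        else:
            deferred.append(item)   # pushed below the top-n
    return result + deferred + ranked_items[len(result) + len(deferred):]

items = [0, 1, 2, 3, 4, 5]                    # already sorted by score
genres = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "C"}
print(rerank_with_genre_cap(items, genres))   # [0, 1, 3, 4, 5, 2]; item 2 demoted
```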

Dot product is efficient and works well when user and item embeddings live in the same space and relevance is a function of similarity. But in cases where preferences are more conditional or nonlinear, like job or dating recommendations, we’d use deep interaction models instead to model richer user-item relationships.

Popularity Bias: popular items get recommended more often. Less popular items are overlooked even if some users would love them. System tends to reinforce already popular content (rich get richer).

Exposure Bias: you can’t recommend what users have never been shown. Just because a user didn’t interact with something doesn’t mean they disliked it — maybe they just never saw it.

Both popularity and exposure bias lead to feedback loops and reduced discovery.

Bandits are designed to maximize reward over time — but if not properly regularized, they can actually become greedy over a few high-reward arms, which reduces diversity. If the bandit finds that a few items consistently perform well, it might just keep recommending those over and over. Result is that popular items get even more exposure, while other potentially good items never get a chance.



Ranking / Scoring


Scoring refers to the process of assigning a score or a rating to each item in the candidate pool based on its similarity to the user’s preferences or past behavior. The scoring function is used to determine the relevance of each item to the user, and items with higher scores are considered to be more relevant.

Ranking, on the other hand, is the process of ordering the items based on their scores. The items with the highest scores are ranked at the top of the recommendation list, while the items with the lowest scores are ranked at the bottom. The ranking process ensures that the most relevant items are presented to the user first.

Let’s look at different methods and techniques used for scoring.



Re-ranking


In the final phase of a recommendation system, the system has the capability to re-rank candidates by taking into account additional criteria or constraints that were not initially considered. This re-ranking stage allows the system to assess factors beyond the initial scoring, such as the diversity of items within the recommended set.




The wide and deep architecture combines a linear model (wide) and a deep neural network (deep), and trains them together to make predictions.

The “Wide” Part = memorization
-think of it like manual features + sparse cross-products
-implemented via logistic regression or linear models with sparse features

The “Deep” Part = generalization
-a deep neural network that takes in embeddings of user, item, and context
-learns nonlinear patterns and abstract interactions
-helps generalize to unseen combinations of features

The wide model helps memorize known patterns (co-viewed items, past behaviors). The deep model helps predict and generalize to new behaviors. During training, both are optimized together; the outputs are combined (via sum or weighted average).
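
A condensed sketch of the idea in PyTorch (feature dimensions and layer sizes are arbitrary, not the original paper's exact setup): the wide linear part and the deep MLP over embeddings are summed and trained jointly.

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, wide_dim, num_users, num_items, emb_dim=16):
        super().__init__()
        self.wide = nn.Linear(wide_dim, 1)      # linear model over sparse crosses
        self.user_emb = nn.Embedding(num_users, emb_dim)
        self.item_emb = nn.Embedding(num_items, emb_dim)
        self.deep = nn.Sequential(
            nn.Linear(2 * emb_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, wide_x, user_ids, item_ids):
        deep_x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        # Outputs are summed, then squashed; both parts train jointly.
        return torch.sigmoid(self.wide(wide_x) + self.deep(deep_x)).squeeze(-1)

model = WideAndDeep(wide_dim=100, num_users=1000, num_items=5000)
p = model(torch.randn(4, 100),
          torch.randint(0, 1000, (4,)),
          torch.randint(0, 5000, (4,)))
```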

Latent factor models embed users and items into a shared vector space, where their similarity reflects the likelihood of interaction. They’re commonly trained using matrix factorization and are powerful for collaborative filtering, especially when explicit user-item interaction data is available

Knowledge graphs in recommender systems represent entities like users, items, genres, actors, or creators as nodes, and their relationships (e.g., “acted in,” “belongs to genre,” “watched by”) as edges in a graph structure. This enables the model to capture rich, structured, and explainable connections beyond simple collaborative filtering. By traversing multi-hop paths (e.g., user → watched movie A → has actor X → acted in movie B), the system can recommend semantically relevant items, even in cold-start scenarios where historical interaction data is sparse. Knowledge graphs improve personalization, support content diversity, and enhance explainability, and are often used alongside traditional models or embedded into neural architectures like graph neural networks (GNNs) or meta-path based embeddings to power more intelligent and context-aware recommendations.

Pointwise ranking predicts absolute engagement scores for each item independently, using losses like cross-entropy or regression. Pairwise ranking, on the other hand, models the relative preference between item pairs — optimizing for which item should rank higher — and is often trained with BPR loss. Pairwise is more directly aligned with ranking quality, but pointwise is simpler and commonly used in production.
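
A sketch of the BPR loss under the usual setup (PyTorch; the embeddings here are random stand-ins for trained user/item vectors): the item the user interacted with should out-score a sampled negative item.

```python
import torch
import torch.nn.functional as F

def bpr_loss(user_emb, pos_item_emb, neg_item_emb):
    """Pairwise loss: the positive item should score above the negative item."""
    pos_score = (user_emb * pos_item_emb).sum(dim=-1)
    neg_score = (user_emb * neg_item_emb).sum(dim=-1)
    return -F.logsigmoid(pos_score - neg_score).mean()

u = torch.randn(8, 16)          # batch of user embeddings
i_pos = torch.randn(8, 16)      # items the users interacted with
i_neg = torch.randn(8, 16)      # sampled items they did not interact with
loss = bpr_loss(u, i_pos, i_neg)
```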



Multi-armed Bandits


Imagine you have several slot machines (or “arms”) and each one gives a different reward when pulled. The goal is to maximize the total reward by pulling the most rewarding arms. However, you don’t know beforehand which arm is the best, so you need to balance:
-Exploration: trying out different arms to learn which one gives the best reward
-Exploitation: pulling the arm that seems to give the best reward based on what you’ve learned so far
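
A minimal epsilon-greedy simulation of that trade-off (numpy; the per-arm reward probabilities, epsilon, and step count are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = [0.02, 0.05, 0.03]          # hidden per-arm reward probabilities
n_arms, eps = len(true_ctr), 0.1
counts, values = np.zeros(n_arms), np.zeros(n_arms)

for _ in range(10_000):
    if rng.random() < eps:                        # explore: random arm
        arm = rng.integers(n_arms)
    else:                                         # exploit: best estimate so far
        arm = int(np.argmax(values))
    reward = rng.random() < true_ctr[arm]         # simulate a click
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # running mean

print(np.round(values, 3))   # estimates converge toward the true CTRs
```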

Bandits offer advantages over batch machine learning and A/B testing methods, as they minimize regret by continuously learning and adapting recommendations without the need for extensive data collection or waiting for test results. They can provide better performance in situations with limited data or when dealing with long-tail or cold-start scenarios, where batch recommenders may favor popular items over potentially relevant but lesser-known options.



Contextual Bandits


Contextual bandits extend the multi-armed bandit problem by introducing context into the decision process. Rather than just choosing an arm based on past reward alone, contextual bandits take into account the current situation or “context” before making a decision.
-Context: additional information that can help in decision-making. For example, on a news website, the context might include user information such as location, browsing history, or device type
-Action: based on the context, the algorithm decides which action to take (which arm to pull, or which recommendation to make)
-Reward: after the action, the algorithm observes the reward (whether the user clicked the recommended item or not) and uses this feedback to improve future decisions
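
A small sketch of disjoint LinUCB, a classic contextual bandit algorithm (numpy; the arm count, feature dimension, and alpha are arbitrary): each arm keeps a linear reward estimate plus an uncertainty bonus that drives exploration.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: a linear reward model plus an uncertainty bonus per arm."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrix
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                             # reward estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # exploration bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

bandit = LinUCB(n_arms=3, dim=5)
x = np.random.default_rng(0).normal(size=5)    # user/context features
arm = bandit.select(x)                         # which item to recommend
bandit.update(arm, x, reward=1.0)              # observed click
```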

Contextual Bandits vs. Full Reinforcement Learning
-contextual bandits are a simplification of full reinforcement learning. In full RL, the agent needs to learn from sequences of actions and long-term rewards (i.e., the environment’s state changes over time). In contrast, contextual bandits only deal with one-time, immediate rewards and do not require learning long-term strategies. This makes them computationally simpler and ideal for situations where immediate feedback is available, like in recommender systems.

Online learning
Contextual bandits are highly suited for online learning, meaning they can adapt to user behavior in real-time. The system can continuously refine its recommendations as it gathers more feedback from the user, making it particularly effective in dynamic environments.




Recap: Offline and Online Environments


The offline environment largely hosts batch processes such as model training (e.g., representation learning, ranking), creating embeddings for catalog items, and building an approximate nearest neighbors (ANN) index or knowledge graph to find similar items. It may also include loading item and user data into a feature store that is used to augment input data during ranking.

The online environment then uses the artifacts generated (e.g., ANN indices, knowledge graphs, models, feature stores) to serve individual requests. A typical approach is converting the input item or search query into an embedding, followed by candidate retrieval and ranking. There are also other preprocessing steps (e.g., standardizing queries, tokenization, spell check) and post-processing steps (e.g., filtering undesirable items, business logic) though we won’t discuss them in this writeup.

Candidate retrieval is a fast but coarse step to narrow down millions of items into hundreds of candidates. We trade off precision for efficiency to quickly narrow the search space (e.g., from millions to hundreds, a 99.99% reduction) for the downstream ranking task. Most contemporary retrieval methods convert the input (i.e., item, search query) into an embedding before using ANN to find similar items, though some systems instead use graphs (DoorDash) or decision trees (LinkedIn).

Ranking is a slower—but more precise—step to score and rank top candidates. As we’re processing fewer items (i.e., hundreds instead of millions), we have room to add features that would have been infeasible in the retrieval step (due to compute and latency constraints). Such features include item and user data, and contextual information. We can also use more sophisticated models with more layers and parameters.

In the offline environment, data flows bottom-up, where we use training data and item/user data to create artifacts such as models, ANN indices, and feature stores. These artifacts are then loaded into the online environment (via the dashed arrows). In the online environment, each request flows left to right, through the retrieval and ranking steps before returning a set of results (e.g., recommendations, search results).