Recommender Systems - From Item Neighbors to Generative Sequence Models

A research-grounded story of how recommender systems went from item-item neighbors at Amazon to generative sequence models at Meta. Each wave shipped real business wins, hit a structural ceiling, and seeded the next. Curated through cornerstone papers (Linden 2003, Koren 2009, YouTube DNN 2016, DIN 2018, PinSage 2018, EBR 2020, TIGER 2023, HSTU 2024), industry deployments, and the failure modes that overturned consensus.

Two decades of recommender systems, told through cornerstone papers and the production stories behind them: from Amazon’s item-neighbor scale trick in 2003, through Netflix-Prize matrix factorization, YouTube’s two-stage DNN, attention over user behavior at Alibaba, web-scale graphs at Pinterest, hybrid embedding retrieval at Facebook, and the current generative-sequence reframing at Meta.

The throughline

A recommender system, at its core, ranks a catalog $\mathcal{I}$ of items for a user $u$ in a context $c$, choosing a small slate $S \subset \mathcal{I}$ to maximize some utility. The interesting history is not the equation; it is the sequence of constraints that forced the equation to change shape. Latency budgets. Catalog growth. Cold items. Sequential intent. Multi-objective business goals. Hallucination once the model started writing item IDs.

Each wave below shipped a real number on a real product, then hit a ceiling that motivated the next wave. The cornerstone papers are the load-bearing ones, the ones whose architectures showed up two years later inside Amazon, Netflix, YouTube, Alibaba, Pinterest, Facebook, ByteDance, and Spotify. The arc is consistent: an algorithmic idea, a deployment with a metric, a documented failure mode, a next idea.

Act I (2001-2003): precompute item similarities, decouple online cost from catalog size.
Act II (2006-2009): replace neighborhoods with low-rank latent factors; the Netflix Prize.
Act III (2016-2017): deep nets learn feature interactions; YouTube goes two-stage.
Act IV (2018-2019): self-attention over the user’s behavior sequence.
Act V (2018-2020): web-scale graph and two-tower retrieval, hybridized into existing inverted indexes.
Act VI (2018-2020): multi-objective ranking with MMoE / PLE, the DLRM industrial template.
Frontier (2023-2026): generative sequence models that emit semantic item IDs.

Figure: seven waves of recommender systems. Each act started from a documented failure mode of the previous one; each box in the original diagram names the cornerstone systems and years, with the ceiling that ended the wave noted beneath it.

Act I - Item Neighbors (2001-2003)

The cornerstone is Linden, Smith, York, Amazon.com Recommendations: Item-to-Item Collaborative Filtering (IEEE Internet Computing 2003). Amazon at the time had “more than 29 million customers” and several million products, and the dominant prior idea, user-user collaborative filtering, was infeasible: it scaled with the number of users.

The trick is to flip the axis. Compute, offline, a similarity matrix between every pair of items based on the customers who bought both. Online, take the user’s recent items, look up their precomputed neighbors, and aggregate. Cosine over item co-purchase vectors:

\[\text{sim}(i, j) \;=\; \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\|\mathbf{v}_i\| \, \|\mathbf{v}_j\|}, \qquad \mathbf{v}_i \in \mathbb{R}^{|U|}\]

where $\mathbf{v}_i$ has a $1$ in dimension $u$ if user $u$ bought item $i$. The paper’s load-bearing sentence: “our algorithm’s online computation scales independently of the number of customers and the number of items in the product catalog.” That property is what made realtime homepage recommendations possible for a catalog Amazon’s size.
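The offline/online split can be sketched in a few lines. This is a minimal dense-matrix illustration of the cosine formula above on toy data, not Amazon's production algorithm (which builds the table by iterating over customers and their purchases); the function names `item_similarity` and `recommend` and the toy `purchases` matrix are assumptions made for the example.

```python
import numpy as np

def item_similarity(purchases: np.ndarray) -> np.ndarray:
    """Offline step: cosine similarity between the item columns of a
    binary user-by-item purchase matrix."""
    purchases = purchases.astype(float)
    norms = np.linalg.norm(purchases, axis=0)
    norms[norms == 0] = 1.0              # items nobody bought: avoid 0/0
    sim = (purchases.T @ purchases) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)           # an item is not its own neighbor
    return sim

def recommend(sim: np.ndarray, user_items: list, k: int = 2) -> list:
    """Online step: sum the precomputed neighbor scores of the user's items.
    The cost depends on len(user_items), not on the number of customers."""
    scores = sim[user_items].sum(axis=0)
    scores[user_items] = -np.inf         # do not recommend what was already bought
    return np.argsort(-scores)[:k].tolist()

# Toy data: 4 users (rows) x 4 items (columns).
purchases = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
])
sim = item_similarity(purchases)
print(recommend(sim, user_items=[0], k=2))  # neighbors of item 0, best first
```

In a real system the similarity table would be sparse and truncated to the top neighbors per item; the dense matrix here only keeps the sketch short.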

Success story. Item-CF shipped as the engine behind “Customers who bought this also bought” and the personalized homepage. It is still the baseline that every new method has to beat on cold-cache scenarios.

Failure mode. The signal is narrow. New items have no co-purchase history (cold-item problem); the model has no notion of features, content, or time; and rare items collapse into popular-item neighborhoods. These cracks are exactly what matrix factorization tried to fix.

Act II - Matrix Factorization and the Netflix Prize (2006-2009)

Netflix released a benchmark in 2006: over 100M ratings from roughly 480K users on about 17.7K movies, with a held-out qualifying set whose ratings were withheld for scoring. The Cinematch baseline scored RMSE $0.9514$. A one-million-dollar prize would go to the first team to reach $0.8563$, a 10% relative improvement.

The cornerstone is Koren, Bell, Volinsky, Matrix Factorization Techniques for Recommender Systems (IEEE Computer 2009). The model is one line:

\[\hat{r}_{ui} \;=\; \mu + b_u + b_i + \mathbf{p}_u^\top \mathbf{q}_i, \qquad \mathbf{p}_u, \mathbf{q}_i \in \mathbb{R}^f\]

where $\mu$ is the global mean, $b_u, b_i$ are user and item biases, and $\mathbf{p}_u, \mathbf{q}_i$ are $f$-dimensional latent factors fit by minimizing regularized squared error. SVD++ folds in implicit feedback; timeSVD++ adds temporal drift terms; the paper notes that this family “is superior to classic nearest-neighbor techniques” and supports “implicit feedback, temporal effects, and confidence levels.”
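A minimal sketch of fitting the biased model above by stochastic gradient descent on regularized squared error. The hyperparameters (`f`, `lr`, `reg`, `epochs`), the function names, and the eight toy ratings are illustrative assumptions, not values from the paper, and the sketch omits the SVD++ and timeSVD++ extensions.

```python
import numpy as np

def train_biased_mf(ratings, n_users, n_items, f=8, lr=0.02, reg=0.02,
                    epochs=200, seed=0):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i by SGD.
    `ratings` is a list of (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))
    b_u, b_i = np.zeros(n_users), np.zeros(n_items)
    P = 0.1 * rng.standard_normal((n_users, f))
    Q = 0.1 * rng.standard_normal((n_items, f))
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_old = P[u].copy()          # use the pre-update p_u for the q_i step
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, b_u, b_i, P, Q

def predict(model, u, i):
    mu, b_u, b_i, P, Q = model
    return mu + b_u[u] + b_i[i] + P[u] @ Q[i]

def rmse(model, ratings):
    errs = [(r - predict(model, u, i)) ** 2 for u, i, r in ratings]
    return float(np.sqrt(np.mean(errs)))

# Toy data: (user, item, rating) triples, purely illustrative.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 4.0), (3, 2, 2.0)]
model = train_biased_mf(ratings, n_users=4, n_items=3)
baseline = float(np.std([r for _, _, r in ratings]))  # RMSE of predicting mu
print(f"train RMSE {rmse(model, ratings):.3f} vs global-mean RMSE {baseline:.3f}")
```

This only reports training error on a toy set; a faithful reproduction would hold out ratings and tune `f` and `reg` on them.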

Success story. BellKor’s 2007 Progress Prize submission was 8.43% better than Cinematch. The 2008 Progress Prize went to the merged team BellKor in BigChaos at 9.44%. The further-merged BellKor’s Pragmatic Chaos cleared 10% in 2009 and took the grand prize. The data-driven, low-rank latent-factor recipe became the dominant rating-prediction model for the next half-decade.

Failure mode. MF is bilinear in $(\mathbf{p}_u, \mathbf{q}_i)$ and explains only the rating signal. It cannot absorb arbitrary side features (device, time of day, category), cannot model non-linear feature interactions, and the global $\mathbf{p}_u$ is a fixed summary of the entire user history with no room for “what the user is doing right now.” Famously, Netflix never deployed the full Prize-winning ensemble in production; the production system moved to deep nets within a few years.

Act III - Deep Feature Interactions (2016-2017)

Three load-bearing systems land within a year of each other. The shared question: can a deep network replace the bilinear interaction of MF and the hand-crafted cross-features of logistic regression?

YouTube DNN (Covington et al., RecSys 2016)

The paper introduced the two-stage funnel that every video and feed product copied. Candidate generation reduces millions of videos to a few hundred via an extreme multiclass softmax; ranking sorts the shortlist by expected watch time, not click probability. The candidate-generation network predicts, at time $t$, which video the user will watch next:

\[P(w_t = i \mid U, C) \;=\; \frac{e^{\mathbf{v}_i^\top \mathbf{u}}}{\sum_{j \in V} e^{\mathbf{v}_j^\top \mathbf{u}}}\]

trained with sampled softmax over the catalog $V$. The ranking model is a weighted logistic regression where positives are weighted by observed watch time, so the predicted odds approximate $E[\text{watch time}]$. The paper explicitly motivates this against CTR, which it argues “promotes deceptive videos that the user does not complete (clickbait).”
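Why watch-time weighting turns the odds into an expected-watch-time estimate can be seen in a short derivation that follows the paper's argument. The symbols $N$ (training impressions), $k$ (clicked impressions), $T_i$ (watch time of the $i$-th clicked impression), and $P = k/N$ (click probability) are notation introduced here for the sketch. Positives carry weight $T_i$ and negatives carry unit weight, so the odds the logistic regression learns are

\[\text{odds} \;=\; \frac{\sum_{i=1}^{k} T_i}{N - k} \;=\; \frac{\frac{1}{N}\sum_{i=1}^{k} T_i}{1 - P} \;=\; \frac{E[T]}{1 - P} \;\approx\; E[T]\,(1 + P) \;\approx\; E[T]\]

where $E[T] = \frac{1}{N}\sum_{i=1}^{k} T_i$ is the expected watch time per impression (unclicked impressions contribute zero) and both approximations rely on $P$ being small. At serving time the ranking score is the exponentiated logit, which is exactly these learned odds.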

Wide and Deep (Cheng et al., DLRS 2016)

Cheng et al., Wide and Deep Learning for Recommender Systems (DLRS 2016, arXiv 1606.07792), deployed at Google Play, combined a wide linear model with hand-crafted crosses (memorization) and a deep MLP over embeddings (generalization), trained jointly. The success was concrete: Google Play app installs lifted in production A/B. The failure was the wide side: someone has to design the cross features.

DeepFM and NCF

Guo et al., DeepFM (IJCAI 2017, arXiv 1703.04247), removed Wide and Deep’s hand-crafted crosses by sharing one embedding table between a factorization machine (low-order interactions) and an MLP (high-order interactions). He et al., Neural Collaborative Filtering (WWW 2017, arXiv 1708.05031), argued the inner product was a limitation and replaced it with a learnable MLP:

\[\hat{y}_{ui} \;=\; \sigma\!\big( \mathbf{h}^\top \phi_{\text{MLP}}(\mathbf{p}_u, \mathbf{q}_i) \big)\]

The contrarian update. Rendle, Krichene, Zhang, Anderson, Neural Collaborative Filtering vs. Matrix Factorization Revisited (RecSys 2020, arXiv 2005.09683), showed that a properly tuned dot product beats NCF’s MLP across the same benchmarks. The paper did not refute the deep-net program in recsys; it refuted the specific claim that an MLP is a uniformly better interaction function than the inner product. Healthy correction. Worth knowing before designing your next ablation.

Failure mode for Act III. Even with deep nets, the user is still summarized as a single fixed embedding. A user who has watched ten cooking videos and then searched “guitar strings” looks identical in $\mathbf{u}$ to one who watched ten cooking videos and searched “spatula.” Sequence and intent are crushed into one vector. Act IV attacks exactly this.

Act IV - Attention and Sequences (2018-2019)

Deep Interest Network at Alibaba

Zhou et al., Deep Interest Network for Click-Through Rate Prediction (KDD 2018, arXiv 1706.06978). The architectural move is small and decisive: replace the fixed user-embedding pool with a local activation unit, an attention conditioned on the candidate ad $\mathbf{e}_a$:

\[\mathbf{u}(a) \;=\; \sum_{j=1}^{H} \alpha_j(a) \, \mathbf{e}_j, \qquad \alpha_j(a) = g(\mathbf{e}_j, \mathbf{e}_a)\]

So the user representation $\mathbf{u}(a)$ depends on which item we are scoring. A user with cooking and guitar history looks like a cook when scored against a wok and like a guitarist when scored against a capo. The paper reports DIN “has been successfully deployed in the online display advertising system in Alibaba, serving the main traffic.”
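A minimal sketch of the local activation unit as a forward pass. It assumes a small two-layer MLP for $g$ with random, untrained weights and uses an elementwise product as the interaction feature (a simplification chosen here; the paper's unit uses an outer-product feature). All names and sizes are illustrative.

```python
import numpy as np

def local_activation(history, candidate, W1, b1, w2, b2):
    """Candidate-conditioned user vector u(a) = sum_j alpha_j(a) * e_j.

    history:   (H, d) embeddings e_j of the user's past behaviors
    candidate: (d,)   embedding e_a of the item being scored
    W1, b1, w2, b2:   weights of a small MLP playing the role of g
    """
    cand = np.broadcast_to(candidate, history.shape)
    # Assumption: elementwise product as the behavior/candidate interaction.
    feats = np.concatenate([history, cand, history * cand], axis=1)  # (H, 3d)
    hidden = np.maximum(feats @ W1 + b1, 0.0)                        # ReLU
    alpha = hidden @ w2 + b2   # (H,), left unnormalized (no softmax), as in DIN
    return alpha @ history, alpha

rng = np.random.default_rng(0)
H, d, hidden_dim = 5, 4, 16
history = rng.standard_normal((H, d))
W1 = rng.standard_normal((3 * d, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal(hidden_dim)
b2 = 0.0

# Two hypothetical candidates scored against the same history.
wok, capo = rng.standard_normal(d), rng.standard_normal(d)
u_wok, alpha_wok = local_activation(history, wok, W1, b1, w2, b2)
u_capo, alpha_capo = local_activation(history, capo, W1, b1, w2, b2)
print(np.round(alpha_wok, 2), np.round(alpha_capo, 2))
```

The same history yields two different user vectors because the weights $\alpha_j$ are recomputed per candidate, which is the whole point of the unit.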

SASRec and BERT4Rec

Kang and McAuley, Self-Attentive Sequential Recommendation (ICDM 2018, arXiv 1808.09781), framed the problem as next-item prediction and applied a causal self-attention stack over the user’s interaction sequence. The authors explicitly position it as “balancing the trade-offs between Markov Chain and RNN methods”: Markov chains had locality but no long-range, RNNs had long-range but slow training and weak locality.

Sun et al., BERT4Rec (CIKM 2019, arXiv 1904.06690), swapped the causal mask for masked-item modeling on the sequence, treating recommendation like masked language modeling.
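The difference between the two models comes down to the attention mask. Below is a minimal single-head sketch of causal self-attention of the kind SASRec stacks, with random untrained weights; positional embeddings, multiple heads, layer norm, and the feed-forward block are all omitted, and the names are illustrative. Dropping the mask gives the bidirectional attention that BERT4Rec pairs with masked-item training.

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head attention where position t only attends to positions <= t.
    X: (L, d) item embeddings of one user's interaction sequence."""
    L = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    future = np.triu(np.ones((L, L), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # hide future items
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
L, d = 6, 4
X = rng.standard_normal((L, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out, weights = causal_self_attention(X, Wq, Wk, Wv)

# Changing the last item must not change any earlier position's output.
X_changed = X.copy()
X_changed[-1] += 1.0
out_changed, _ = causal_self_attention(X_changed, Wq, Wk, Wv)
print(np.allclose(out[:-1], out_changed[:-1]))
```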

Comparison.

Method	Year, venue	User representation	Strength
Item-CF	IEEE IC 2003	none, item neighbors	scale, simplicity
MF / SVD++	IEEE Comp 2009	static $\mathbf{p}_u$	latent generalization
YouTube DNN	RecSys 2016	static embedding from history	feature fusion
DIN	KDD 2018	attention conditioned on candidate	per-candidate intent
SASRec	ICDM 2018	causal self-attention over sequence	order + long-range
BERT4Rec	CIKM 2019	bidirectional MLM over sequence	denser supervision

Failure mode. Self-attention over the full user history is $O(L^2)$ in the sequence length $L$. Lifetime histories are tens of thousands of events long, so production sequence models truncate to recent windows and lose long-term taste signals. Pinterest’s TransActV2 line is essentially the story of stretching that window. Generative sequence models in the Frontier section attack the same constraint with attention operators redesigned for long action sequences.

Act V - Web-Scale Retrieval and Graphs (2018-2020)

The two-stage funnel of Act III put pressure on the first stage: how do you retrieve a few hundred candidates out of billions in under a few milliseconds?

PinSage at Pinterest

Ying et al., Graph Convolutional Neural Networks for Web-Scale Recommender Systems (KDD 2018, arXiv 1806.01973), trained a GCN on the Pinterest pin-board bipartite graph at “3 billion nodes and 18 billion edges,” roughly 10,000x typical academic GCN scale at the time. The trick is random-walk neighborhood sampling and importance-pooled aggregation, so a node embedding is computed from its sampled neighbors without materializing the full graph in memory:

\[\mathbf{h}_v^{(k)} \;=\; \sigma\!\Big( W^{(k)} \cdot \text{aggregate}\big(\{\mathbf{h}_u^{(k-1)} : u \in \mathcal{N}(v)\}\big) \Big)\]

The reported numbers are striking. Offline: a 40-percentage-point absolute (150% relative) hit-rate gain and a 22-point absolute (60% relative) MRR gain over the best baseline in the paper. Online A/B tests reported a roughly 30% relative engagement lift on Home Feed and Related Pin Ads, and a 25% impressions lift on Shop the Look. These are company-reported numbers, not independently audited, but they are sourced from both the peer-reviewed KDD paper and Pinterest’s own engineering writeup.

Two-tower retrieval and Facebook EBR

Yi et al., Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations (RecSys 2019), made the two-tower (user-encoder, item-encoder) approach with sampled softmax production-ready by correcting for the bias of sampling popular items as negatives.
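The correction itself is one subtraction. The sketch below computes an in-batch softmax loss for a two-tower model and optionally subtracts the log of each item's sampling probability from its logit (the "logQ" correction). It assumes the sampling probabilities are given; the paper estimates them from the training stream, which is not shown here. Function and variable names are illustrative.

```python
import numpy as np

def in_batch_softmax_loss(user_emb, item_emb, sampling_prob=None):
    """In-batch softmax: row b's positive is item b, every other item in the
    batch acts as a negative.

    user_emb, item_emb: (B, d) tower outputs for the B (user, item) pairs
    sampling_prob:      (B,) probability of each item appearing in a batch
    """
    logits = user_emb @ item_emb.T                       # (B, B)
    if sampling_prob is not None:
        # logQ correction: popular items show up as negatives more often,
        # so their logits are shifted down by log(sampling probability).
        logits = logits - np.log(sampling_prob)[None, :]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))

rng = np.random.default_rng(0)
B, d = 4, 3
users, items = rng.standard_normal((B, d)), rng.standard_normal((B, d))

plain = in_batch_softmax_loss(users, items)
uniform = in_batch_softmax_loss(users, items, np.full(B, 1.0 / B))
skewed = in_batch_softmax_loss(users, items, np.array([0.7, 0.1, 0.1, 0.1]))
print(plain, uniform, skewed)
```

With uniform sampling the correction shifts every logit by the same constant and the loss is unchanged; it only matters when item popularity is skewed.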

Huang et al., Embedding-Based Retrieval in Facebook Search (KDD 2020, arXiv 2006.11632), is the canonical industrial story. Two contributions worth quoting. First, “a unified embedding framework for personalized search that leverages the searcher’s social graph as contextual information.” Second, on engineering, “serve embedding-based retrieval in a typical search system based on an inverted index structure.” Facebook did not stand up a separate ANN service; they hybridized embedding retrieval into the existing inverted index (Unicorn), which is why the system works at Facebook scale without a parallel infrastructure.

Figure: the two-tower pattern that absorbed Act V. The item tower runs offline and the corpus is indexed in an ANN structure; the user tower runs online and the query is a single vector.

Failure mode. Two-tower retrieval cannot model fine-grained interactions between user and item because they never meet before the dot product. That is fine for first-stage retrieval; ranking still has to do the cross-feature work. Generative retrieval (Frontier) attacks this by emitting item IDs token-by-token, letting the model condition each ID token on the full user context.

Act VI - Multi-Task and DLRM (2018-2020)

YouTube’s 2016 paper had one objective: expected watch time. Real platforms have many. Watch time, like, share, comment, follow, save. Optimizing one at the cost of another is how you get rage-bait or, on TikTok, infinite shallow scroll.

MMoE and PLE

Ma et al., Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts (KDD 2018), proposes a shared bottom of $E$ experts and a per-task gating network. Task $k$’s representation is

\[\mathbf{h}_k \;=\; \sum_{e=1}^{E} g_k(\mathbf{x})_e \cdot f_e(\mathbf{x})\]

so each task learns its own soft selection over the shared experts. Tang et al., Progressive Layered Extraction (PLE) (RecSys 2020), formally separates task-shared and task-specific experts to combat the “negative transfer” failure mode of MMoE.
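A minimal forward-pass sketch of the MMoE gating formula, with single-layer ReLU experts and softmax gates. The weights are random and untrained, the per-task towers that would consume each $\mathbf{h}_k$ are omitted, and all names and sizes are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mmoe_forward(x, expert_W, gate_W):
    """x: (B, d) inputs; expert_W: (E, d, h) one linear layer per expert;
    gate_W: (K, d, E) one gating layer per task.
    Returns per-task representations h_k with shape (B, K, h) and the gates."""
    experts = np.maximum(np.einsum("bd,edh->beh", x, expert_W), 0.0)  # f_e(x)
    gates = softmax(np.einsum("bd,kde->bke", x, gate_W), axis=-1)     # g_k(x)
    h = np.einsum("bke,beh->bkh", gates, experts)  # sum_e g_k(x)_e * f_e(x)
    return h, gates

rng = np.random.default_rng(0)
B, d, E, hdim, K = 2, 3, 4, 5, 2
x = rng.standard_normal((B, d))
expert_W = rng.standard_normal((E, d, hdim))
gate_W = rng.standard_normal((K, d, E))
h, gates = mmoe_forward(x, expert_W, gate_W)
print(h.shape, gates.sum(axis=-1))
```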

Meta DLRM

Naumov et al., Deep Learning Recommendation Model for Personalization and Ranking (arXiv 1906.00091, 2019, not peer-reviewed but widely adopted as the industrial template). DLRM is a clean blueprint: embed each categorical feature, embed dense features through a bottom MLP, take the pairwise dot products between every embedding pair (interaction layer), concatenate with the dense vector, top MLP. It is the closest thing to a default architecture in the field.
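The interaction layer is the distinctive step, and it fits in a few lines. This sketch assumes the bottom-MLP output and the categorical embeddings share one dimension `d`, and that the dense vector takes part in the pairwise dot products alongside the embeddings; the function name and toy sizes are illustrative, and the bottom and top MLPs are left out.

```python
import numpy as np

def dlrm_interaction(dense_vec, sparse_embs):
    """dense_vec:   (d,)   output of the bottom MLP over dense features
    sparse_embs: (F, d) one embedding per categorical feature
    Returns the input to the top MLP: the dense vector concatenated with
    all distinct pairwise dot products."""
    vecs = np.vstack([dense_vec[None, :], sparse_embs])  # (F + 1, d)
    gram = vecs @ vecs.T                                 # all dot products
    rows, cols = np.triu_indices(vecs.shape[0], k=1)     # each pair once, no self-pairs
    return np.concatenate([dense_vec, gram[rows, cols]])

rng = np.random.default_rng(0)
F, d = 3, 4
dense_vec = rng.standard_normal(d)
sparse_embs = rng.standard_normal((F, d))
top_mlp_input = dlrm_interaction(dense_vec, sparse_embs)
print(top_mlp_input.shape)  # d + (F + 1) * F / 2 entries
```

The number of interaction terms grows quadratically in the number of features, which is one concrete way to see the feature ceiling described next.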

Failure mode. Even DLRM trained on years of logs hits a feature-ceiling: more crosses, more embedding rows, more experts, marginal lift. The HSTU paper (Frontier) makes the case that recsys has not had its scaling-law moment because feature-engineered DLRMs do not scale like Transformers do.

The Frontier - Generative Recommenders (2023-2026)

TIGER and semantic IDs

Rajput et al., Recommender Systems with Generative Retrieval (NeurIPS 2023). The trick is to give every item a semantic ID: a short tuple of discrete codes from a residual-quantization VAE over content embeddings. Then a sequence-to-sequence model is trained to emit the next item’s semantic-ID tokens given the user’s history:

\[P(\text{item}_{t+1} \mid \text{history}) \;=\; \prod_{k=1}^{K} P(c_k \mid c_{<k}, \text{history})\]

Generative retrieval removes the offline ANN index: top-$k$ candidates fall out of decoding (beam search over the semantic-ID vocabulary). Cold items get nontrivial probability via shared sub-IDs with their semantic neighbors.
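How a content embedding becomes a code tuple can be illustrated with plain residual quantization. The sketch uses small hand-set codebooks so the result is easy to follow; in TIGER the codebooks are learned inside an RQ-VAE, which is not shown. The function name, codebooks, and item vectors are assumptions for the example.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Greedy residual quantization: at each level pick the nearest codeword,
    subtract it, and quantize what is left at the next level.
    Returns the semantic-ID tuple and the final residual."""
    residual = np.asarray(x, dtype=float).copy()
    codes = []
    for codebook in codebooks:                      # each codebook: (V, d)
        dists = ((codebook - residual) ** 2).sum(axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - codebook[idx]
    return tuple(codes), residual

# Hand-set two-level codebooks (illustrative only, not learned).
codebooks = [
    np.array([[0.0, 0.0], [4.0, 4.0]]),               # coarse level
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),   # fine level
]
id_a, _ = residual_quantize([4.0, 5.0], codebooks)
id_b, _ = residual_quantize([5.0, 4.0], codebooks)
id_new, _ = residual_quantize([4.2, 4.9], codebooks)  # a hypothetical cold item
print(id_a, id_b, id_new)
```

Nearby items share the leading code, and an unseen item with similar content lands on an existing tuple prefix, which is the mechanism behind the cold-item claim above.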

HSTU and the scaling-law moment

Zhai et al., Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations (ICML 2024). HSTU is a decoder-only sequence model with a custom attention operator that scales recsys workloads efficiently to LLM-class size. The paper makes the scaling-law case: at a given compute budget, generative sequence recommenders trained on raw action sequences beat feature-engineered DLRMs, and the gap grows with scale. The authors report production deployment improving topline metrics in Meta surfaces (numbers are company-reported, primary venue is ICML).

LLMs as recommenders

P5 (Geng et al., RecSys 2022), LLaRA, A-LLMRec, and the broader LLM-as-recommender line of work fine-tune general LLMs to take a textual user history and emit a recommendation. Strong zero/few-shot generalization, strong cold-item behavior, brutal cost / latency at serving time. The current frontier is hybrid: small generative models on the hot path, LLMs in offline pipelines for explanation, query rewriting, and synthetic-data generation.

Open failure modes for the Frontier. Hallucinated items (semantic-ID decoding emits a code tuple that does not correspond to a real item), popularity collapse during decoding, and unclear cold-start behavior when the catalog rotates faster than the codebook can be refit.

Open Problems

Generative-decoding hallucination

A generative recommender that emits item IDs token-by-token can produce a semantic-ID tuple that decodes to no real item, or to an item that has been pulled from the catalog. We want a guarantee:

\[P\big(\text{decoded ID} \notin \mathcal{I}_{\text{live}} \mid \text{user}\big) \;\le\; \varepsilon\]

Current systems mask the decoder vocabulary at each step against a trie of live items. The remaining gap is what to do when masking pushes the model far off-distribution and quality drops.
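A minimal sketch of trie masking with greedy decoding. It assumes fixed-length semantic IDs, so no live ID is a prefix of another, and uses a hypothetical `toy_score_fn` in place of a real decoder; a production system would apply the same mask inside beam search.

```python
def build_trie(live_ids):
    """Nested-dict trie over the code tuples of items currently in the catalog."""
    root = {}
    for code in live_ids:
        node = root
        for token in code:
            node = node.setdefault(token, {})
    return root

def constrained_greedy_decode(score_fn, trie):
    """At each step keep only tokens that extend the prefix toward a live item,
    then take the best-scoring one."""
    prefix, node = [], trie
    while node:                       # an empty dict marks a complete live ID
        scores = score_fn(tuple(prefix))
        token = max(node, key=lambda t: scores.get(t, float("-inf")))
        prefix.append(token)
        node = node[token]
    return tuple(prefix)

# Hypothetical stand-in for the model: fixed token scores, whatever the prefix.
TOY_SCORES = {9: 3.0, 2: 2.5, 1: 2.0, 5: 1.0, 4: 0.7, 3: 0.2, 0: 0.1}

def toy_score_fn(prefix):
    return TOY_SCORES

live_ids = {(1, 2, 3), (1, 2, 4), (5, 0, 1)}
trie = build_trie(live_ids)

unconstrained = tuple(max(TOY_SCORES, key=TOY_SCORES.get) for _ in range(3))
constrained = constrained_greedy_decode(toy_score_fn, trie)
print(unconstrained in live_ids, constrained in live_ids)
```

The unconstrained decode follows the model's favorite token into a tuple that matches no item, while the masked decode is forced onto a live one at the price of ignoring the model's top choice.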

Cold-start under generative formulations

Classical pipelines can fall back on side features for new items: a two-tower item encoder consumes content features directly, and item-CF or MF deployments are typically paired with a content-based fallback. Generative recommenders concentrate probability on items whose semantic IDs the model has seen. Formally, we want the per-item retrieval recall as a function of training exposure $n_i$ to satisfy

\[\mathbb{E}[\text{Recall@}k \mid n_i = 0] \;\ge\; \alpha \cdot \mathbb{E}[\text{Recall@}k \mid n_i \gg 0]\]

with $\alpha$ close to one. Today, $\alpha \ll 1$ for many semantic-ID schemes.

Position, popularity, and exposure bias

Logged interaction data is selection-biased: a user can only click items the system already showed them. Counterfactual recsys aims to estimate $E[Y \mid \text{do}(\text{rank})]$ rather than $E[Y \mid \text{rank}]$. The standard unbiased estimator of a new policy $\pi$’s value, computed from logs collected under a logging policy $\pi_0$, is inverse propensity scoring (IPS):

\[\hat{R}(\pi) \;=\; \frac{1}{N} \sum_{n=1}^{N} \frac{\pi(a_n \mid x_n)}{\pi_0(a_n \mid x_n)} \, r_n\]

but variance explodes when $\pi_0$ has low probability on the actions $\pi$ wants to take. Practical bounds for offline policy evaluation on top-$k$ slates remain open.
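A minimal sketch of the IPS estimator above, together with its self-normalized variant (SNIPS), which trades a small bias for lower variance. It assumes the logging propensities are known and strictly positive; the `clip` parameter and the toy log are illustrative additions.

```python
import numpy as np

def ips(rewards, pi_new, pi_log, clip=None):
    """Inverse propensity scoring: mean of (pi / pi_0) * r over logged events.
    Optional weight clipping reduces variance and introduces bias."""
    weights = np.asarray(pi_new, dtype=float) / np.asarray(pi_log, dtype=float)
    if clip is not None:
        weights = np.minimum(weights, clip)
    return float(np.mean(weights * np.asarray(rewards, dtype=float)))

def snips(rewards, pi_new, pi_log):
    """Self-normalized IPS: divide by the sum of weights instead of N."""
    weights = np.asarray(pi_new, dtype=float) / np.asarray(pi_log, dtype=float)
    return float(np.sum(weights * np.asarray(rewards, dtype=float)) / np.sum(weights))

# Toy log: the logging policy picked each logged action with probability 0.5.
rewards = [1.0, 0.0, 1.0, 0.0]
pi_log = [0.5, 0.5, 0.5, 0.5]
# A new policy that puts all its mass on the actions that happened to be rewarded.
pi_new = [1.0, 0.0, 1.0, 0.0]

print(ips(rewards, pi_new, pi_log), snips(rewards, pi_new, pi_log))
print(ips(rewards, pi_log, pi_log))  # evaluating the logging policy itself
```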

Multi-objective: engagement vs. wellbeing

If $r_{\text{engage}}$ and $r_{\text{wellbeing}}$ trade off, what policy is correct? A scalarized $\lambda r_{\text{engage}} + (1-\lambda) r_{\text{wellbeing}}$ is convenient but normatively contested. Recent work prefers tracing the Pareto front, for example by sweeping the threshold $\tau$ in a constrained formulation:

\[\pi^\star \;=\; \arg\max_\pi \;\; E[r_{\text{engage}}] \quad \text{s.t.} \quad E[r_{\text{wellbeing}}] \ge \tau\]

Estimating the constraint side from logs is the hard part.

On-device and private personalization

Federated and differentially private recommenders move user models to the device and limit what the central model can learn about any one user. The $\varepsilon$-differential-privacy constraint on a randomized training mechanism $\mathcal{M}$ is that, for every set of outputs $S$,

\[\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]\]

for neighboring datasets $D, D'$. Quality under tight $\varepsilon$ on long-tail items is still well below the centralized baseline.

Scaling laws for recsys

The HSTU paper made the case that with the right architecture and raw action sequences, recsys obeys scaling laws like language. The conjecture is

\[L(N, D) \;\approx\; A \, N^{-\alpha} + B \, D^{-\beta} + L_\infty\]

with model parameters $N$ and tokens $D$. The exponents $\alpha, \beta$ for recsys, and how they vary across surfaces (e-commerce vs. short-video vs. search), are open.

Catalog drift and continual learning

Catalogs rotate. Items appear, disappear, change price, change category. A static-trained model degrades. We want a continual-learning bound: given a stream of catalog changes at rate $\rho$, the regret vs. an oracle that retrains every step is

\[\text{Regret}(T) \;\le\; \tilde{O}\!\big(\sqrt{\rho T}\big)\]

Today, production systems just retrain nightly. Doing better while keeping costs sane is open.

Fun Projects for Your Portfolio

Reproduce the Netflix-Prize MF curve on MovieLens

Implement biased MF, SVD++, and timeSVD++ from scratch. Train on MovieLens-25M, evaluate by RMSE, and produce the same kind of progress chart the BellKor team did. Headline metric:

\[\text{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} (r_{ui} - \hat{r}_{ui})^2}\]

Portfolio signal: you can write working numerical-optimization code and read primary IEEE Computer papers.

Two-tower retrieval with sampled softmax + ANN

Train a user-tower and item-tower on Amazon Reviews (or your own catalog), index items with FAISS, and report Recall@K and p99 retrieval latency. Headline metric:

\[\text{Recall@}K = \frac{1}{|U|} \sum_{u} \mathbb{1}[\text{positive}_u \in \text{top-}K(u)]\]

Portfolio signal: you can ship the building block behind every modern feed.

SASRec or BERT4Rec on a personal-history dataset

Reimplement SASRec with a causal Transformer, train it on a public sequence dataset (Beauty, Steam, ML-1M), and compare with BERT4Rec under matched compute. Report NDCG@10:

\[\text{NDCG@}K = \frac{1}{|U|} \sum_u \frac{\sum_{k=1}^{K} \frac{2^{r_{u,k}}-1}{\log_2(k+1)}}{\text{IDCG@}K}\]

Portfolio signal: comfort with Transformer training and sequence-evaluation protocols.
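A minimal per-user implementation of the NDCG@K formula above, to be averaged over users. It assumes `rels` lists the relevance grades of the whole ranked list in ranked order, so the ideal ordering can be computed from the same list; in the usual next-item protocol there is a single relevant item with grade 1.

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the first k positions (1-indexed)."""
    return sum((2 ** r - 1) / math.log2(pos + 1)
               for pos, r in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([1, 0, 0], k=3))  # held-out item ranked first
print(ndcg_at_k([0, 1, 0], k=3))  # held-out item ranked second
print(ndcg_at_k([0, 0, 1], k=2))  # held-out item outside the top k
```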

Generative retrieval with semantic IDs (TIGER-lite)

Train an RQ-VAE on item content embeddings to produce 3- or 4-tuple semantic IDs. Train a small encoder-decoder to predict the next item’s ID tuple given the user’s history. Report Recall@K with trie-masked beam search vs. unconstrained decoding. Headline metric: hallucination rate

\[H = \Pr[\text{decoded ID tuple} \notin \mathcal{I}]\]

Portfolio signal: hands-on with the generative-retrieval frontier.

MMoE / PLE on a public multi-task benchmark

On Tenrec or Avazu-style logs, train (a) a per-task MLP, (b) a shared-bottom MLP, (c) MMoE, (d) PLE. Report per-task AUC and document where you observe negative transfer:

\[\text{NT}(k) = \text{AUC}_k(\text{shared}) - \text{AUC}_k(\text{single-task})\]

Portfolio signal: you understand the practical reality that production rankers are multi-task systems.

Counterfactual evaluation with IPS

Take a logged dataset with a known logging policy. Train a new policy on the logs, then evaluate it both naively and with IPS / SNIPS. Compare to the true on-policy reward via a held-out A/B-style split. Headline metric:

\[\text{MSE}\big(\hat{R}_{\text{IPS}}(\pi), R(\pi)\big)\]

Portfolio signal: you can speak the offline-policy-evaluation language that any senior recsys interview will probe.

Cold-start adapter for two-tower retrieval

Take a trained two-tower model. Add a content-encoder adapter (text, image) that maps cold items into the item-tower space without retraining the user side. Evaluate Recall@K specifically on items with zero interactions during training:

\[\text{Recall@}K \mid n_i = 0\]

Portfolio signal: you know that the cold-start regime is where production systems actually live.

Calibrated ranker for engagement vs. quality

On a public dataset with both an engagement signal and a quality signal (e.g., MovieLens with completion + rating, or any dataset with dwell + bookmark), train a multi-objective ranker and produce its Pareto front by sweeping $\lambda$. Headline metric: hypervolume of the engagement / quality front:

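\[\text{HV}(\Pi) = \text{vol}\!\Big(\bigcup_{\pi \in \Pi}[z_{\text{e}}, r_{\text{e}}(\pi)] \times [z_{\text{q}}, r_{\text{q}}(\pi)]\Big)\]

where $(z_{\text{e}}, z_{\text{q}})$ is a fixed reference point that every policy in $\Pi$ dominates, so the region between the reference point and the front has finite volume.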

Portfolio signal: you treat recsys as a normative problem, not just a CTR-maximization problem.
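A minimal sketch of the two-objective hypervolume for a maximization problem. It measures the area dominated by the front relative to a reference point `ref`, which is an assumption of this sketch (without one, the dominated region is unbounded); the function name and the toy points are illustrative.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area of the region dominated by `points` and bounded below by `ref`.
    Each point is (engagement, quality); larger is better on both axes."""
    candidates = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    candidates.sort(key=lambda p: -p[0])   # sweep from highest engagement down
    area, best_quality = 0.0, ref[1]
    for engagement, quality in candidates:
        if quality > best_quality:         # dominated points add nothing
            area += (engagement - ref[0]) * (quality - best_quality)
            best_quality = quality
    return area

# Toy front from a hypothetical lambda sweep, plus one dominated policy.
front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
print(hypervolume_2d(front))
print(hypervolume_2d(front + [(1.0, 1.0)]))  # same area: the extra point is dominated
```

Comparing two rankers by hypervolume only makes sense when both use the same reference point and the same metric scales.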

This survey traced the recommender-systems arc from Linden, Smith, York’s item-item collaborative filtering at Amazon (IEEE Internet Computing 2003) to Zhai et al.’s trillion-parameter generative recommenders at Meta (ICML 2024), through seven distinct architectural waves. The active research frontier publishes at KDD, RecSys, NeurIPS, ICML, ICLR, WWW, and SIGIR; for live signals, the RecSys conference proceedings each fall and the ICML / NeurIPS workshops on generative recommendation and foundation models for personalization are the leaderboards to watch.