Part 2 of 6 | ← Part 1: Architecture | Part 3: Production Systems →
Ranking Models
The ranking stage scores candidates with a model trained to predict user engagement. Unlike retrieval, ranking models can afford to examine detailed feature interactions.
Problem Formulation
Ranking is typically framed as pointwise, pairwise, or listwise learning to rank:
| Approach | Loss Function | Pros | Cons |
|---|---|---|---|
| Pointwise | Cross-entropy, MSE on engagement labels | Simple; scales to large data | Ignores relative ordering |
| Pairwise | BPR, hinge loss on (positive, negative) pairs | Captures preference structure | Expensive pair sampling |
| Listwise | LambdaRank, softmax over slate | Directly optimizes ranking metrics | Complex; requires full slate |
Most production systems use pointwise classification (predicting P(click), P(like), P(share)) due to simplicity and scalability, with calibration layers to combine predictions into a single score.
Pointwise Loss Functions
For binary engagement labels $y \in \{0, 1\}$, the standard binary cross-entropy loss is:
$$ \mathcal{L}\_{\text{BCE}} = -\frac{1}{N} \sum\_{i=1}^{N} \left[ y\_i \log(\hat{p}\_i) + (1 - y\_i) \log(1 - \hat{p}\_i) \right] $$where $\hat{p}\_i = \sigma(f(\mathbf{x}\_i))$ is the predicted probability. For continuous labels like watch time, regression losses apply:
$$ \mathcal{L}\_{\text{MSE}} = \frac{1}{N} \sum\_{i=1}^{N} (y\_i - \hat{y}\_i)^2 $$Pairwise Loss Functions
Bayesian Personalized Ranking (BPR) optimizes pairwise preferences. For a positive item $i^+$ and negative item $i^-$:
$$ \mathcal{L}\_{\text{BPR}} = -\sum\_{(u, i^+, i^-)} \log \sigma\left( f(u, i^+) - f(u, i^-) \right) $$This pushes the score of engaged items above non-engaged items. The hinge loss variant introduces a margin $m$:
$$ \mathcal{L}\_{\text{hinge}} = \sum\_{(u, i^+, i^-)} \max\left( 0, m - f(u, i^+) + f(u, i^-) \right) $$Listwise Loss Functions
LambdaRank directly optimizes ranking metrics (NDCG, MAP) by defining gradients as:
$$ \lambda\_{ij} = \frac{\partial \mathcal{L}}{\partial s\_i} = -\sigma(s\_j - s\_i) \cdot |\Delta \text{NDCG}\_{ij}| $$where $|\Delta \text{NDCG}_{ij}|$ is the change in NDCG from swapping items $i$ and $j$. This gradient weighting focuses learning on swaps that most impact ranking quality.
The softmax cross-entropy over a list treats ranking as classification:
$$ \mathcal{L}\_{\text{listwise}} = -\sum\_{i: y\_i = 1} \log \frac{\exp(s\_i)}{\sum\_{j} \exp(s\_j)} $$Model Architectures
Deep & Cross Networks (DCN)
DCN explicitly models feature interactions via cross layers while maintaining a parallel deep network for implicit interactions. The cross layer computes:
$$ \mathbf{x}\_{l+1} = \mathbf{x}\_0 \mathbf{x}\_l^\top \mathbf{w}\_l + \mathbf{b}\_l + \mathbf{x}\_l $$where $\mathbf{x}\_0 \in \mathbb{R}^d$ is the input embedding, $\mathbf{w}\_l, \mathbf{b}\_l \in \mathbb{R}^d$ are learnable parameters, and the outer product $\mathbf{x}\_0 \mathbf{x}\_l^\top$ captures feature interactions. After $L$ cross layers, the model captures polynomial interactions up to degree $L+1$.
DCN-v2 replaces the vector $\mathbf{w}\_l$ with a matrix $\mathbf{W}\_l \in \mathbb{R}^{d \times d}$:
$$ \mathbf{x}\_{l+1} = \mathbf{x}\_0 \odot (\mathbf{W}\_l \mathbf{x}\_l + \mathbf{b}\_l) + \mathbf{x}\_l $$where $\odot$ denotes element-wise multiplication. This increases expressiveness at the cost of $O(d^2)$ parameters per layer.
The full DCN architecture combines cross and deep components:
$$ \hat{y} = \sigma\left( \mathbf{w}\_{\text{logit}}^\top [\mathbf{x}\_{\text{cross}}; \mathbf{x}\_{\text{deep}}] + b \right) $$where $[\cdot ; \cdot]$ denotes concatenation.
Wide & Deep
Google’s Wide & Deep combines a linear model (wide) for memorization with a deep neural network for generalization:
$$ P(y = 1 | \mathbf{x}) = \sigma\left( \mathbf{w}\_{\text{wide}}^\top [\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}\_{\text{deep}}^\top a^{(L)} + b \right) $$where $\phi(\mathbf{x})$ represents cross-product feature transformations (e.g., AND(gender=female, language=en)) and $a^{(L)}$ is the final hidden layer of the deep network:
$$ a^{(l+1)} = \text{ReLU}(\mathbf{W}^{(l)} a^{(l)} + \mathbf{b}^{(l)}) $$The wide component memorizes specific feature co-occurrences; the deep component generalizes to unseen combinations via learned embeddings.
Feature Interaction Layers
Factorization Machines (FM) model pairwise interactions efficiently:
$$ \hat{y}\_{\text{FM}} = w\_0 + \sum\_{i=1}^{n} w\_i x\_i + \sum\_{i=1}^{n} \sum\_{j=i+1}^{n} \langle \mathbf{v}\_i, \mathbf{v}\_j \rangle x\_i x\_j $$where $\mathbf{v}\_i \in \mathbb{R}^k$ is a latent factor for feature $i$. The interaction term can be computed in $O(nk)$ time:
$$ \sum\_{i=1}^{n} \sum\_{j=i+1}^{n} \langle \mathbf{v}\_i, \mathbf{v}\_j \rangle x\_i x\_j = \frac{1}{2} \left[ \left( \sum\_{i=1}^{n} \mathbf{v}\_i x\_i \right)^2 - \sum\_{i=1}^{n} \mathbf{v}\_i^2 x\_i^2 \right] $$DeepFM stacks FM with a deep network, sharing embedding vectors between both components.
Multi-Task Learning
Social media engagement is multi-faceted: clicks, likes, comments, shares, watch time, hides, reports. Multi-task models share lower layers while branching into task-specific heads.
Shared-Bottom Architecture: All tasks share the same feature extraction layers:
$$ \mathbf{h} = f\_{\text{shared}}(\mathbf{x}; \theta\_{\text{shared}}) $$$$ \hat{y}^{(t)} = g^{(t)}(\mathbf{h}; \theta^{(t)}) \quad \forall t \in \mathcal{T} $$
Multi-gate Mixture-of-Experts (MMoE): Tasks select from a set of expert networks via learned gating:
$$ \mathbf{h}^{(t)} = \sum\_{e=1}^{E} g^{(t)}\_e(\mathbf{x}) \cdot f\_e(\mathbf{x}) $$where $f\_e$ are expert networks, $g^{(t)}(\mathbf{x}) = \text{softmax}(\mathbf{W}^{(t)}\_g \mathbf{x})$ is a task-specific gating network, and $g^{(t)}\_e$ is the weight assigned to expert $e$ for task $t$.
The joint loss is a weighted sum:
$$ \mathcal{L}\_{\text{MTL}} = \sum\_{t \in \mathcal{T}} \alpha\_t \mathcal{L}^{(t)} $$Task weights $\alpha\_t$ can be fixed or learned dynamically (e.g., via uncertainty weighting or gradient balancing).
flowchart TB
Features --> Shared[Shared Layers]
Shared --> Click["P(click)"]
Shared --> Like["P(like)"]
Shared --> Comment["P(comment)"]
Shared --> Hide["P(hide)"]
Shared --> WatchTime["E(watch time)"]
Final scores combine task predictions via a weighted sum or learned aggregation, enabling the platform to tune for different objectives (e.g., prioritize meaningful interactions over passive consumption).
Sequence Modeling
User engagement history is sequential: the order of interactions carries signal. Sequence models capture temporal dynamics.
Recurrent Architectures
Long Short-Term Memory (LSTM) networks maintain a cell state $\mathbf{c}\_t$ and hidden state $\mathbf{h}\_t$ updated at each timestep:
$$ \begin{aligned} \mathbf{f}\_t &= \sigma(\mathbf{W}\_f [\mathbf{h}\_{t-1}, \mathbf{x}\_t] + \mathbf{b}\_f) & \text{(forget gate)} \\ \mathbf{i}\_t &= \sigma(\mathbf{W}\_i [\mathbf{h}\_{t-1}, \mathbf{x}\_t] + \mathbf{b}\_i) & \text{(input gate)} \\ \tilde{\mathbf{c}}\_t &= \tanh(\mathbf{W}\_c [\mathbf{h}\_{t-1}, \mathbf{x}\_t] + \mathbf{b}\_c) & \text{(candidate)} \\ \mathbf{c}\_t &= \mathbf{f}\_t \odot \mathbf{c}\_{t-1} + \mathbf{i}\_t \odot \tilde{\mathbf{c}}\_t & \text{(cell state)} \\ \mathbf{o}\_t &= \sigma(\mathbf{W}\_o [\mathbf{h}\_{t-1}, \mathbf{x}\_t] + \mathbf{b}\_o) & \text{(output gate)} \\ \mathbf{h}\_t &= \mathbf{o}\_t \odot \tanh(\mathbf{c}\_t) & \text{(hidden state)} \end{aligned} $$The final hidden state $\mathbf{h}\_T$ encodes the user’s sequential preferences.
Self-Attention for Sequences
Self-Attentive Sequential Recommendation (SASRec) applies transformer self-attention to interaction sequences. Given item embeddings $\mathbf{E} = [\mathbf{e}\_1, \ldots, \mathbf{e}\_n]$:
$$ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d\_k}} \right) \mathbf{V} $$where $\mathbf{Q} = \mathbf{E} \mathbf{W}^Q$, $\mathbf{K} = \mathbf{E} \mathbf{W}^K$, $\mathbf{V} = \mathbf{E} \mathbf{W}^V$ are query, key, and value projections.
Causal masking ensures position $i$ only attends to positions $\leq i$:
$$ \text{mask}\_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{otherwise} \end{cases} $$Multi-head attention projects into $H$ parallel subspaces:
$$ \text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}\_1, \ldots, \text{head}\_H) \mathbf{W}^O $$where $\text{head}\_h = \text{Attention}(\mathbf{Q} \mathbf{W}^Q\_h, \mathbf{K} \mathbf{W}^K\_h, \mathbf{V} \mathbf{W}^V\_h)$.
Temporal Position Encoding
Standard sinusoidal position encoding:
$$ \begin{aligned} \text{PE}\_{(pos, 2i)} &= \sin\left( \frac{pos}{10000^{2i/d}} \right) \\ \text{PE}\_{(pos, 2i+1)} &= \cos\left( \frac{pos}{10000^{2i/d}} \right) \end{aligned} $$For recommendation, time-aware embeddings encode absolute or relative timestamps:
$$ \mathbf{t}\_i = \text{MLP}\left( [\Delta t\_i, \sin(\omega \cdot t\_i), \cos(\omega \cdot t\_i)] \right) $$where $\Delta t\_i = t\_i - t\_{i-1}$ is the time gap and $\omega$ captures periodic patterns (hour-of-day, day-of-week).
BERT4Rec: Bidirectional Modeling
BERT4Rec uses masked language modeling for sequences: randomly mask items and predict them from context:
$$ \mathcal{L}\_{\text{MLM}} = -\sum\_{i \in \mathcal{M}} \log P(x\_i | \mathbf{x}\_{\setminus \mathcal{M}}) $$where $\mathcal{M}$ is the set of masked positions. Bidirectional attention (no causal mask) captures both past and future context, but requires different inference strategies (e.g., append [MASK] token for next-item prediction).
# Simplified Transformer-based user encoder (PyTorch-like pseudocode)
class UserSequenceEncoder(nn.Module):
def __init__(self, item_vocab_size, d_model, n_heads, n_layers):
self.item_embedding = nn.Embedding(item_vocab_size, d_model)
self.positional_encoding = PositionalEncoding(d_model)
self.transformer = nn.TransformerEncoder(
nn.TransformerEncoderLayer(d_model, n_heads),
num_layers=n_layers
)
self.output_proj = nn.Linear(d_model, d_model)
def forward(self, item_sequence, mask):
x = self.item_embedding(item_sequence)
x = self.positional_encoding(x)
x = self.transformer(x, src_key_padding_mask=mask)
# Pool over sequence (e.g., mean or [CLS] token)
return self.output_proj(x.mean(dim=1))
Re-ranking and Business Logic
Raw model scores optimize for predicted engagement, but the final feed must satisfy additional constraints:
Diversity
Showing ten similar posts degrades user experience. Diversity algorithms ensure variety.
Maximal Marginal Relevance (MMR)
MMR iteratively selects items that balance relevance and diversity:
$$ \text{MMR} = \arg\max_{i \in \mathcal{I} \setminus S} \left[ \lambda \cdot \text{rel}(i) - (1 - \lambda) \cdot \max_{j \in S} \text{sim}(i, j) \right] $$where $S$ is the already-selected set, $\text{rel}(i)$ is the relevance score, $\text{sim}(i, j)$ is similarity (e.g., cosine similarity of embeddings), and $\lambda \in [0, 1]$ controls the relevance-diversity trade-off.
Complexity: $O(k \cdot n)$ for selecting $k$ items from $n$ candidates, assuming similarity lookup is $O(1)$.
Determinantal Point Processes (DPP)
DPPs are probabilistic models that assign higher probability to diverse subsets. Given a positive semi-definite kernel matrix $\mathbf{L} \in \mathbb{R}^{n \times n}$:
$$ P(S) \propto \det(\mathbf{L}_S) $$where $\mathbf{L}_S$ is the submatrix indexed by $S$. The kernel encodes both quality (diagonal) and similarity (off-diagonal):
$$ L_{ij} = q_i \cdot \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j) \cdot q_j $$where $q_i$ is item quality (relevance score) and $\phi(\mathbf{x}_i)$ is the diversity feature vector.
Properties:
- Items with high quality are more likely to be selected.
- Similar items (high $\phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$) are unlikely to appear together.
- Exact sampling is $O(n^3)$; approximate methods (greedy, MCMC) are used in practice.
MAP Inference finds the most diverse high-quality set:
$$ S^* = \arg\max_{|S| = k} \det(\mathbf{L}_S) $$This is NP-hard but admits a $(1 - 1/e)$-approximation via greedy selection.
Submodular Optimization
A set function $F: 2^{\mathcal{I}} \rightarrow \mathbb{R}$ is submodular if it exhibits diminishing returns:
$$ F(S \cup \{i\}) - F(S) \geq F(T \cup \{i\}) - F(T) \quad \forall S \subseteq T, i \notin T $$Coverage functions capture diversity:
$$ F(S) = \sum\_{c \in \mathcal{C}} \max\_{i \in S} w\_{ic} $$where $\mathcal{C}$ is a set of categories/topics and $w\_{ic}$ is item $i$’s coverage of category $c$.
Greedy maximization of monotone submodular functions achieves a $(1 - 1/e) \approx 0.63$ approximation ratio:
$$ S\_{t+1} = S\_t \cup \left\lbrace \arg\max\_{i \notin S\_t} F(S\_t \cup \lbrace i \rbrace) - F(S\_t) \right\rbrace $$Novelty, Serendipity, and Exploration
While diversity ensures variety of topics/creators within a single feed, novelty and serendipity address different user needs:
Novelty: Items the user hasn’t encountered before. Pure relevance-based ranking can get stuck recommending the same high-scoring items. Novelty ensures forward progress through the catalog.
Serendipity: Surprising discoveries—items outside the user’s typical consumption pattern that they end up enjoying. A sports fan unexpectedly loving a cooking video. Serendipity combats filter bubbles and enables exploration.
| Dimension | Diversity | Novelty | Serendipity |
|---|---|---|---|
| Goal | Vary topics/creators within feed | Show unseen items | Enable surprising discoveries |
| Measurement | Topic/creator entropy | Impression history | Out-of-profile engagement |
| Risk | Low (still in user’s interests) | Medium (might not engage) | High (might actively dislike) |
| Frequency | Every session | Most sessions | Occasional “budget” |
Implementation strategies:
-
Novelty filtering:
unseen_candidates = [c for c in candidates if c.id not in user.seen_history]- Requires tracking user impression history (Bloom filter or distributed cache)
- Balance: Don’t filter TOO aggressively (catalog exhaustion for power users)
-
Serendipity injection:
- Reserve 5-10% of feed slots for “exploration” items
- Sample from topics user hasn’t engaged with recently
- Use contextual bandits (Thompson Sampling, LinUCB) to balance exploration vs. exploitation
where $\hat{\mu}\_i$ is estimated reward, $n\_i$ is times shown, $\beta$ controls exploration
-
Calibrated surprise scores:
- Train a model to predict: $P(\text{engage} | \text{item}, \text{user profile})$
- Compute unexpectedness: how different is this item from user’s historical profile?
- Serendipity score: $\text{relevance} \times \text{unexpectedness}$
- Reward items that are both relevant AND surprising
User-level adaptation:
- Explorers: Users who engage with diverse content → increase serendipity budget
- Conservatives: Users who stick to familiar topics → reduce serendipity, focus on relevance
- Cohort analysis: New users need more exploration; established users have known preferences
Measuring success:
- Immediate engagement: Do users click on serendipitous items?
- Session length: Does serendipity increase time spent?
- Long-term retention: Do users with more serendipity return more often?
- Regret signals: Do users hide/dismiss serendipitous items?
Fairness and Creator Economics
Platforms balance exposure across creators to maintain a healthy ecosystem:
- Exposure parity: Ensure new or small creators receive baseline visibility.
- Anti-concentration: Prevent a handful of viral posts from dominating feeds.
- Monetization constraints: Ensure ad load targets are met without degrading organic experience.
Policy Enforcement
Content policy violations (misinformation, hate speech, spam) must be filtered or demoted:
- Hard blocks: Remove items flagged by integrity classifiers.
- Soft demotion: Reduce scores for borderline content.
- Contextual warnings: Overlay labels on sensitive content.
Deduplication and Near-Duplicate Detection
Users experience fatigue when shown the same or nearly identical content repeatedly. Deduplication removes exact duplicates and near-duplicates before final ranking.
Exact Deduplication
The simplest case: multiple candidates point to the same underlying content.
Common causes:
- Cross-posting: Same content shared to multiple groups/pages
- Reposts/retweets: User shares an existing post
- Viral spread: Same image/video uploaded by multiple users
Solution: Content-based hashing
- Text: Hash normalized text (lowercased, punctuation removed) with SHA-256 or MurmurHash
- Images: Perceptual hashing (pHash, dHash) produces similar hashes for visually similar images despite format changes, crops, or compression
- Videos: Hash keyframes or audio fingerprints (Chromaprint, AudioSet embeddings)
At serving time, deduplicate by hash:
seen_hashes = set()
deduplicated = []
for item in ranked_candidates:
h = hash(item.content)
if h not in seen_hashes:
deduplicated.append(item)
seen_hashes.add(h)
Complexity: $O(n)$ with hash table lookups.
Semantic Near-Duplicate Detection
Harder problem: content that’s functionally identical but not byte-for-byte the same.
Examples:
- Same news story from different outlets (different wording, same facts)
- Meme templates with different captions
- Screenshots of tweets vs. embedded tweets
- Videos with different aspect ratios or watermarks
Embedding-based approach:
- Compute content embeddings (BERT for text, CLIP for images/video)
- For each candidate, check cosine similarity against already-selected items: $$ \text{sim}(i, j) = \frac{\mathbf{e}_i^\top \mathbf{e}_j}{\|\mathbf{e}_i\| \|\mathbf{e}_j\|} $$
- If $\text{sim}(i, j) > \tau$ (e.g., $\tau = 0.9$), mark as near-duplicate
Optimization: Use ANN indexes (HNSW, FAISS) to find near neighbors in embedding space without checking all pairs. Complexity: $O(n \log n)$ amortized.
flowchart LR
Candidates[Ranked Candidates] --> Hash[Content Hashing]
Hash --> ExactDedup[Exact Deduplication]
ExactDedup --> Embed[Embedding Lookup]
Embed --> ANN[ANN Similarity Search]
ANN --> NearDedup[Near-Duplicate Filter]
NearDedup --> Final[Deduplicated Feed]
Multi-Session Deduplication
Within a single feed request, deduplication is straightforward. Across sessions, it’s harder:
Challenges:
- State management: Must track what the user has already seen. This requires persistent storage (Redis, DynamoDB) keyed by user ID.
- Recency: Content seen days ago might be worth showing again; content seen minutes ago should be suppressed.
- False positives: Overly aggressive deduplication can starve feeds of content.
Strategies:
| Strategy | Approach | Trade-off |
|---|---|---|
| Session-local only | Deduplicate within current request only | Simple; allows repeats across sessions |
| Bloom filter | Probabilistic set membership; track seen item IDs | Space-efficient; false positives possible |
| Time-decayed cache | Store seen items with TTL (e.g., 24 hours) | Bounded memory; configurable recency |
| Impression logging | Log all impressions; query before ranking | Accurate; high storage cost |
Production pattern: Bloom filter for short-term deduplication (hours), with daily resets to avoid memory bloat.
Novelty vs. Diversity vs. Deduplication
These are related but distinct:
| Concept | Definition | Goal |
|---|---|---|
| Deduplication | Remove identical/near-identical content | Avoid literal repetition |
| Novelty | Ensure user hasn’t seen this item before | Prevent boring the user |
| Diversity | Vary topics, creators, formats | Prevent filter bubbles |
Deduplication is a prerequisite for novelty (can’t be novel if it’s a duplicate). Diversity operates at a higher semantic level (topic/creator variety, not content similarity).
Re-ranking as Constrained Optimization
Re-ranking involves multiple competing objectives (relevance, diversity, fairness, business constraints) that cannot be simultaneously maximized. Formal treatment requires constrained optimization or multi-objective frameworks.
Single-Objective Formulation with Constraints
The simplest formulation maximizes a weighted utility subject to hard constraints (Carbonell & Goldstein 1998; Agrawal et al. 2009):
$$ \max\_{\pi \in \Pi} \sum\_{i=1}^{k} r(\pi\_i) \cdot w\_{\text{pos}}(i) $$subject to:
- Diversity: $\text{sim}(\pi\_i, \pi\_j) < \theta \quad \forall i \neq j$
- Fairness: $\sum\_{i : \text{creator}(\pi\_i) = c} 1 \leq m\_c \quad \forall c \in \mathcal{C}$ (per-creator exposure cap)
- Freshness: $|\{i : \text{age}(\pi\_i) < 24h\}| \geq n$
- Policy: $\text{policy-score}(\pi\_i) > \tau \quad \forall i$
where $\Pi$ is the set of permutations of candidate items, $r(\pi\_i)$ is relevance score, and $w\_{\text{pos}}(i)$ is position-dependent weight.
Computational challenge: This is a constrained integer optimization problem (NP-hard). Practical solutions:
- Greedy approximation: MMR, submodular maximization ($1 - 1/e$ approximation)
- Relaxation: Continuous relaxation + rounding
- Constraint satisfaction: Filter candidates, then rank
Lagrangian Formulation
For soft constraints, use Lagrangian relaxation (Boyd & Vandenberghe 2004):
$$ \mathcal{L}(\pi, \lambda) = \sum\_{i=1}^{k} r(\pi\_i) \cdot w\_{\text{pos}}(i) - \sum\_{j} \lambda\_j g\_j(\pi) $$where $g\_j(\pi)$ are constraint violations (diversity deficit, fairness violation, etc.) and $\lambda\_j \geq 0$ are Lagrange multipliers.
Optimization: Solve the dual problem:
$$ \min\_{\lambda \geq 0} \max\_{\pi} \mathcal{L}(\pi, \lambda) $$Interpretation: Lagrange multipliers $\lambda\_j$ represent the “price” of violating constraint $j$. At optimum, if $\lambda\_j > 0$, constraint $j$ is active (binding).
Example - Diversity constraint:
$$ \mathcal{L}(\pi, \lambda) = \sum\_{i=1}^{k} r(\pi\_i) - \lambda \sum\_{iPareto Efficiency
When objectives genuinely conflict (e.g., relevance vs. diversity vs. fairness), no single ranking dominates on all metrics. The Pareto frontier characterizes the set of optimal trade-offs (Miettinen 1999).
Definition: Ranking $\pi$ is Pareto-optimal if no alternative $\pi'$ exists such that:
$$ f\_j(\pi') \geq f\_j(\pi) \quad \forall j \quad \text{and} \quad f\_k(\pi') > f\_k(\pi) \quad \text{for some } k $$where $f\_j$ are objective functions (e.g., $f\_1$ = relevance, $f\_2$ = diversity, $f\_3$ = fairness).
Scalarization: Convert to single-objective via weighted sum:
$$ \max\_{\pi} \sum\_{j=1}^{m} \alpha\_j f\_j(\pi), \quad \alpha\_j \geq 0, \sum\_j \alpha\_j = 1 $$Limitation: Linear scalarization cannot reach non-convex parts of Pareto frontier. Use Chebyshev scalarization for non-convex objectives:
$$ \max\_{\pi} \min\_{j} \left( \alpha\_j (f\_j(\pi) - z\_j^*) \right) $$where $z\_j^*$ is the ideal value of objective $j$.
Multi-objective re-ranking algorithm:
- Enumerate or sample candidate rankings $\Pi' \subset \Pi$
- Compute objective vector $\mathbf{f}(\pi) = (f\_1(\pi), \ldots, f\_m(\pi))$ for each $\pi \in \Pi'$
- Identify Pareto frontier via non-dominated sorting
- Select from frontier based on business priorities $\alpha$
Computing the Pareto Frontier
Characterization via KKT conditions: For differentiable objectives, Pareto-optimal $\pi^*$ satisfies:
$$ \nabla f\_j(\pi^{\*}) = \sum\_{k=1}^{m} \mu\_k \nabla f\_k(\pi^{\*}), \quad \mu\_k \geq 0, \sum\_k \mu\_k = 1 $$Equivalently, the gradients of all objectives lie in a common cone at the optimum.
Computing the frontier:
-
ε-constraint method: Fix $m-1$ objectives as constraints, optimize the remaining one:
$$ \max\_{\pi} f\_1(\pi) \quad \text{s.t.} \quad f\_j(\pi) \geq \epsilon\_j \quad \forall j > 1 $$Vary $(\epsilon\_2, \ldots, \epsilon\_m)$ to trace the frontier.
-
Weighted sum with systematic sweeps: Vary weights $\alpha$ on a grid and solve:
$$ \max\_{\pi} \sum\_{j} \alpha\_j f\_j(\pi) $$Each $\alpha$ yields a Pareto-optimal solution (for convex objectives).
-
Non-dominated sorting (Deb et al. 2002): For discrete $\Pi$:
- Start with all rankings $\Pi$
- Identify non-dominated set (Pareto front layer 1)
- Remove layer 1, repeat on remainder (layer 2), etc.
- Complexity: $O(m |\Pi|^2)$ comparisons
Example: Relevance-Diversity-Fairness
Three objectives:
- $f\_1(\pi) = \sum\_i r(\pi\_i) \cdot w(i)$ (relevance)
- $f\_2(\pi) = -\sum\_{i
- $f\_3(\pi) = \min\_{g} \frac{\sum\_{i \in g} \mathbb{1}[\pi\_i \in \text{top-}k]}{|g|}$ (min-group exposure fairness)
Pareto front: The set of $\pi$ such that improving one objective requires sacrificing another.
quadrantChart
title Relevance-Diversity Pareto Frontier
x-axis Low Diversity --> High Diversity
y-axis Low Relevance --> High Relevance
quadrant-1 Pareto Optimal
quadrant-2 High Relevance Focus
quadrant-3 Dominated
quadrant-4 High Diversity Focus
Point A: [0.20, 0.95]
Point B: [0.35, 0.88]
Point C: [0.48, 0.78]
Point D: [0.60, 0.65]
Point E: [0.72, 0.52]
Point F: [0.82, 0.38]
Point G: [0.90, 0.25]
Dominated X: [0.45, 0.45]
Dominated Y: [0.55, 0.35]
Points A-G form the Pareto frontier representing optimal trade-offs. Interior points (X, Y) are dominated—both objectives can be improved simultaneously.
Tuning Weights via Constrained Optimization
Problem: How to set $\alpha = (\alpha\_1, \ldots, \alpha\_m)$ in scalarized objective?
Approach 1: Online learning of weights
Run multi-armed bandit over weight configurations. Each arm $a$ corresponds to a weight vector $\alpha^{(a)}$. Observe long-term metric (e.g., 7-day retention) as reward:
$$ \alpha^* = \arg\max\_{\alpha} \mathbb{E}[\text{retention}(\alpha)] $$Use Thompson Sampling or GP-UCB over the weight simplex.
Approach 2: Inverse optimization (Bertsimas et al. 2015)
Given expert rankings $\{\pi\_1, \ldots, \pi\_N\}$ that achieve desirable outcomes, infer the weights that rationalize them:
$$ \min\_{\alpha} \sum\_{n=1}^{N} \left\| \pi\_n - \arg\max\_{\pi} \sum\_j \alpha\_j f\_j(\pi) \right\|^2 $$This is a bi-level optimization problem, typically solved via gradient-free methods or relaxations.
Approach 3: Constrained grid search with business metrics
Define acceptable ranges for each objective:
- Relevance: maintain $\geq 95\%$ of baseline
- Diversity: improve by $\geq 10\%$
- Fairness: achieve CED $\leq 1.2$ across groups
Search over $\alpha$ that satisfy these constraints, select the one maximizing a primary business metric (e.g., DAU, revenue).
Approach 4: Lagrangian dual ascent
For constraints $c\_j(\pi) \geq 0$, iteratively:
-
Primal step: Given $\lambda^{(t)}$, solve:
$$ \pi^{(t+1)} = \arg\max\_{\pi} \mathcal{L}(\pi, \lambda^{(t)}) = \arg\max\_{\pi} \left( f(\pi) - \sum\_j \lambda\_j^{(t)} c\_j(\pi) \right) $$ -
Dual step: Update multipliers:
$$ \lambda\_j^{(t+1)} = \max\left(0, \lambda\_j^{(t)} - \eta \nabla\_{\lambda\_j} \mathcal{L}(\pi^{(t+1)}, \lambda^{(t)}) \right) = \max\left(0, \lambda\_j^{(t)} + \eta c\_j(\pi^{(t+1)}) \right) $$If constraint $j$ is violated ($c\_j < 0$), increase $\lambda\_j$ to penalize violations more heavily.
Convergence: Under convexity and suitable step sizes $\eta$, converges to optimal $(\pi^*, \lambda^*)$ satisfying KKT conditions.
Practical note: Full Pareto optimization is expensive. Production systems typically use scalarization with tuned $\alpha$ (via A/B testing or offline policy evaluation) or sequential filtering (first enforce hard constraints, then optimize relevance).
flowchart LR
Scores[Ranking Scores] --> Rerank[Re-ranking]
Diversity[Diversity Constraints] --> Rerank
Policy[Policy Filters] --> Rerank
Business[Business Rules] --> Rerank
Rerank --> Final[Final Feed]
Real-Time Personalization
Static models trained on historical data capture long-term preferences but fundamentally lag behind user intent. A user who just searched for “running shoes” has different needs than their historical profile suggests—yet batch-trained models won’t reflect this until the next training cycle (hours or days later). Real-time personalization bridges this gap by adapting to in-session signals within milliseconds.
The challenge is architectural: batch systems optimize for throughput and can process billions of interactions overnight, while real-time systems must respond in <50ms at the 99th percentile. This forces different trade-offs across the stack.
What changes in real-time:
| Signal Type | Batch Latency | Real-Time Latency | Example |
|---|---|---|---|
| User features | Hours | <100ms | “User clicked 3 electronics items this session” |
| Item popularity | Daily | Minutes | “This video is trending right now” |
| Contextual state | N/A | Immediate | “User is on mobile, evening, in New York” |
| Social signals | Hours | Seconds | “User’s friend just liked this item” |
Why it matters:
- Intent shift: Users switch contexts mid-session. Someone researching laptops may pivot to laptop bags—real-time systems catch this; batch systems miss it entirely.
- Freshness: New items (breaking news, flash sales, new releases) need immediate exposure, not after the next batch job.
- Feedback loops: Showing an item and observing no click is signal. Real-time systems can demote ignored items within the same session.
- Competitive dynamics: In marketplaces and ads, the value of inventory changes by the second. Stale predictions leave money on the table.
Session-Based Recommendations
Within a session, the system observes which items the user engages with (or ignores) and adjusts subsequent recommendations:
- Contextual bandits: Balance exploration (showing uncertain items) with exploitation (showing high-confidence items). Thompson Sampling, LinUCB, and neural bandit variants are common.
- Session transformers: Encode the in-session interaction sequence and predict next-item relevance.
- Real-time feature updates: Increment engagement counters and recompute user embeddings on each interaction.
Streaming Infrastructure
Real-time personalization requires streaming pipelines that:
- Ingest interaction events (Kafka, Kinesis, Pub/Sub)
- Update user state in low-latency stores (Redis, DynamoDB, Bigtable)
- Trigger re-scoring or model inference on state changes
flowchart LR
App[User App] -->|events| Stream[Event Stream]
Stream --> Processor[Stream Processor]
Processor --> State[(User State Store)]
State --> Inference[Ranking Service]
Inference --> App
Explore-Exploit Trade-offs
Pure exploitation (always showing the highest-scoring items) leads to filter bubbles and prevents discovery of new content. The explore-exploit dilemma is formalized via multi-armed bandits.
Multi-Armed Bandit Framework
Each item $i$ is an “arm” with unknown reward distribution. At each round $t$, the algorithm selects arm $a_t$ and observes reward $r_t \sim P_{a_t}$. The goal is to minimize cumulative regret:
$$ R(T) = \sum_{t=1}^{T} \left( \mu^* - \mu_{a_t} \right) = T \mu^* - \sum_{t=1}^{T} \mu_{a_t} $$where $\mu^* = \max_i \mu_i$ is the optimal arm’s expected reward.
ε-Greedy
With probability $\epsilon$, select a random arm; otherwise select the arm with highest empirical mean:
$$ a_t = \begin{cases} \text{Uniform}(\mathcal{I}) & \text{with probability } \epsilon \\ \arg\max_i \hat{\mu}_i & \text{otherwise} \end{cases} $$Regret: $R(T) = O(\epsilon T + K/\epsilon)$ where $K$ is the number of arms. Optimal $\epsilon = O(T^{-1/3})$ yields $R(T) = O(T^{2/3})$.
Upper Confidence Bound (UCB)
UCB selects the arm with highest upper confidence bound:
$$ a\_t = \arg\max\_i \left( \hat{\mu}\_i + \sqrt{\frac{2 \ln t}{n\_i}} \right) $$where $\hat{\mu}\_i$ is the empirical mean, $n\_i$ is the number of times arm $i$ has been pulled, and the second term is the confidence bonus.
Regret bound: $R(T) = O\left( \sqrt{KT \ln T} \right)$, which is order-optimal (Auer et al. 2002).
Proof sketch:
Assumptions:
- Rewards are 1-sub-Gaussian: $\mathbb{E}[\exp(\lambda(r - \mu))] \leq \exp(\lambda^2/2)$ for all $\lambda$
- Rewards are i.i.d. within each arm
- Support: $r\_t \in [0, 1]$
Key steps:
Define $\Delta\_i = \mu^* - \mu\_i$ as the suboptimality gap for arm $i$. The expected number of times arm $i$ is pulled by time $T$ is bounded by decomposing regret:
$$ \mathbb{E}[n\_i(T)] \leq \frac{8 \ln T}{\Delta\_i^2} + 1 + \frac{\pi^2}{3} $$Intuition: UCB explores arm $i$ only while its confidence bound overlaps with the optimal arm. By Hoeffding’s inequality, the empirical mean $\hat{\mu}\_i$ concentrates around $\mu\_i$ at rate $O(1/\sqrt{n\_i})$. The confidence bonus $\sqrt{2 \ln t / n\_i}$ shrinks similarly, ensuring:
$$ P(\mu\_i \leq \hat{\mu}\_i + \sqrt{2 \ln t / n\_i}) \geq 1 - t^{-4} $$After $n\_i \approx 8 \ln T / \Delta\_i^2$ pulls, the bonus becomes smaller than the gap $\Delta\_i$, and arm $i$ is no longer competitive. Summing over all suboptimal arms:
$$ R(T) = \sum\_{i : \Delta\_i > 0} \Delta\_i \mathbb{E}[n\_i(T)] = O\left( \sum\_{i} \frac{\ln T}{\Delta\_i} \right) = O\left( \sqrt{KT \ln T} \right) $$where the last step uses $\sum\_i 1/\Delta\_i \leq \sqrt{K/\Delta\_{\min}}$ and $\Delta\_{\min} = \Omega(1/\sqrt{T})$ in worst case.
Tightness: This bound is minimax-optimal (Lai & Robbins 1985); no algorithm can achieve better worst-case regret for all problem instances.
Thompson Sampling
Thompson Sampling maintains a posterior distribution $P(\mu\_i | \mathcal{D})$ over each arm’s reward. At each round:
- Sample $\tilde{\mu}\_i \sim P(\mu\_i | \mathcal{D})$ for each arm
- Select $a\_t = \arg\max\_i \tilde{\mu}\_i$
- Update posterior with observed reward
For Bernoulli rewards with Beta prior $\text{Beta}(\alpha\_i, \beta\_i)$:
$$ P(\mu\_i | \mathcal{D}) = \text{Beta}(\alpha\_i + s\_i, \beta\_i + f\_i) $$where $s\_i$ and $f\_i$ are successes and failures for arm $i$.
Regret bound: $R(T) = O\left( \sqrt{KT \ln T} \right)$, matching UCB with better empirical performance.
Contextual Bandits
In recommendation, context (user features, time, session state) affects optimal actions. Contextual bandits model reward as $r = f(\mathbf{x}, a) + \epsilon$.
LinUCB assumes linear rewards:
$$ \mathbb{E}[r | \mathbf{x}, a] = \boldsymbol{\theta}_a^\top \mathbf{x} $$The algorithm maintains per-arm ridge regression and selects:
$$ a\_t = \arg\max\_a \left( \hat{\boldsymbol{\theta}}\_a^\top \mathbf{x}\_t + \alpha \sqrt{\mathbf{x}\_t^\top \mathbf{A}\_a^{-1} \mathbf{x}\_t} \right) $$where $\mathbf{A}\_a = \mathbf{X}\_a^\top \mathbf{X}\_a + \lambda \mathbf{I}$ is the design matrix, and $\hat{\boldsymbol{\theta}}\_a = \mathbf{A}\_a^{-1} \mathbf{X}\_a^\top \mathbf{r}\_a$ is the ridge regression estimate.
Regret bound (Li et al. 2010): Under assumptions:
- $\|\mathbf{x}\_t\| \leq L$ for all $t$ (bounded features)
- $\|\boldsymbol{\theta}\_a\| \leq S$ for all $a$ (bounded parameters)
- Noise $\eta\_t$ is conditionally $R$-sub-Gaussian: $\mathbb{E}[\exp(\lambda \eta\_t) | \mathcal{F}\_{t-1}] \leq \exp(\lambda^2 R^2/2)$
The cumulative regret satisfies:
$$ R(T) = O\left( d \sqrt{T \ln(1 + T L^2/\lambda)} \right) $$where $d$ is feature dimension, and $\alpha = R \sqrt{\ln(2T/\delta)} + \sqrt{\lambda} S$ is the exploration bonus.
Proof intuition: The confidence ellipsoid $\{\boldsymbol{\theta} : \|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\_a\|\_{\mathbf{A}\_a} \leq \beta\}$ contains the true $\boldsymbol{\theta}\_a$ with high probability by self-normalized concentration (Abbasi-Yadkori et al. 2011). The term $\sqrt{\mathbf{x}\_t^\top \mathbf{A}\_a^{-1} \mathbf{x}\_t}$ is the width of this ellipsoid in direction $\mathbf{x}\_t$. Regret accumulates only when the algorithm explores regions of high uncertainty, which shrinks as data accumulates.
Neural bandits extend this to non-linear reward functions using neural networks with uncertainty estimates (dropout, ensembles, or neural tangent kernel approximations). Regret analysis for neural bandits requires additional assumptions on function class complexity (e.g., eluder dimension, NTK approximation quality).
| Algorithm | Regret Bound | Computation | Assumptions |
|---|---|---|---|
| ε-greedy | $O(T^{2/3})$ | Low | None |
| UCB | $O(\sqrt{KT \ln T})$ | Low | Sub-Gaussian rewards |
| Thompson Sampling | $O(\sqrt{KT \ln T})$ | Medium | Prior specification |
| LinUCB | $O(d\sqrt{T \ln T})$ | Medium | Linear rewards |
| Neural Bandits | Varies | High | Non-linear; uncertainty estimation |