Recommendation Systems Part 2: Ranking & Re-ranking

Part 2 of 6 | ← Part 1: Architecture | Part 3: Production Systems →

Ranking Models

The ranking stage scores candidates with a model trained to predict user engagement. Unlike retrieval, ranking models can afford to examine detailed feature interactions.

Problem Formulation

Ranking is typically framed as pointwise, pairwise, or listwise learning to rank:

Approach	Loss Function	Pros	Cons
Pointwise	Cross-entropy, MSE on engagement labels	Simple; scales to large data	Ignores relative ordering
Pairwise	BPR, hinge loss on (positive, negative) pairs	Captures preference structure	Expensive pair sampling
Listwise	LambdaRank, softmax over slate	Directly optimizes ranking metrics	Complex; requires full slate

Most production systems use pointwise classification (predicting P(click), P(like), P(share)) due to simplicity and scalability, with calibration layers to combine predictions into a single score.

Pointwise Loss Functions

For binary engagement labels $y \in \{0, 1\}$, the standard binary cross-entropy loss is:

$$ \mathcal{L}\_{\text{BCE}} = -\frac{1}{N} \sum\_{i=1}^{N} \left[ y\_i \log(\hat{p}\_i) + (1 - y\_i) \log(1 - \hat{p}\_i) \right] $$

where $\hat{p}\_i = \sigma(f(\mathbf{x}\_i))$ is the predicted probability. For continuous labels like watch time, regression losses apply:

$$ \mathcal{L}\_{\text{MSE}} = \frac{1}{N} \sum\_{i=1}^{N} (y\_i - \hat{y}\_i)^2 $$

Pairwise Loss Functions

Bayesian Personalized Ranking (BPR) optimizes pairwise preferences. For a positive item $i^+$ and negative item $i^-$:

$$ \mathcal{L}\_{\text{BPR}} = -\sum\_{(u, i^+, i^-)} \log \sigma\left( f(u, i^+) - f(u, i^-) \right) $$

This pushes the score of engaged items above non-engaged items. The hinge loss variant introduces a margin $m$:

$$ \mathcal{L}\_{\text{hinge}} = \sum\_{(u, i^+, i^-)} \max\left( 0, m - f(u, i^+) + f(u, i^-) \right) $$

Listwise Loss Functions

LambdaRank directly optimizes ranking metrics (NDCG, MAP) by defining gradients as:

$$ \lambda\_{ij} = \frac{\partial \mathcal{L}}{\partial s\_i} = -\sigma(s\_j - s\_i) \cdot |\Delta \text{NDCG}\_{ij}| $$

where $|\Delta \text{NDCG}_{ij}|$ is the change in NDCG from swapping items $i$ and $j$. This gradient weighting focuses learning on swaps that most impact ranking quality.

The softmax cross-entropy over a list treats ranking as classification:

$$ \mathcal{L}\_{\text{listwise}} = -\sum\_{i: y\_i = 1} \log \frac{\exp(s\_i)}{\sum\_{j} \exp(s\_j)} $$

Model Architectures

Deep & Cross Networks (DCN)

DCN explicitly models feature interactions via cross layers while maintaining a parallel deep network for implicit interactions. The cross layer computes:

$$ \mathbf{x}\_{l+1} = \mathbf{x}\_0 \mathbf{x}\_l^\top \mathbf{w}\_l + \mathbf{b}\_l + \mathbf{x}\_l $$

where $\mathbf{x}\_0 \in \mathbb{R}^d$ is the input embedding, $\mathbf{w}\_l, \mathbf{b}\_l \in \mathbb{R}^d$ are learnable parameters, and the outer product $\mathbf{x}\_0 \mathbf{x}\_l^\top$ captures feature interactions. After $L$ cross layers, the model captures polynomial interactions up to degree $L+1$.

DCN-v2 replaces the vector $\mathbf{w}\_l$ with a matrix $\mathbf{W}\_l \in \mathbb{R}^{d \times d}$:

$$ \mathbf{x}\_{l+1} = \mathbf{x}\_0 \odot (\mathbf{W}\_l \mathbf{x}\_l + \mathbf{b}\_l) + \mathbf{x}\_l $$

where $\odot$ denotes element-wise multiplication. This increases expressiveness at the cost of $O(d^2)$ parameters per layer.

The full DCN architecture combines cross and deep components:

$$ \hat{y} = \sigma\left( \mathbf{w}\_{\text{logit}}^\top [\mathbf{x}\_{\text{cross}}; \mathbf{x}\_{\text{deep}}] + b \right) $$

where $[\cdot ; \cdot]$ denotes concatenation.

Wide & Deep

Google’s Wide & Deep combines a linear model (wide) for memorization with a deep neural network for generalization:

$$ P(y = 1 | \mathbf{x}) = \sigma\left( \mathbf{w}\_{\text{wide}}^\top [\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}\_{\text{deep}}^\top a^{(L)} + b \right) $$

where $\phi(\mathbf{x})$ represents cross-product feature transformations (e.g., AND(gender=female, language=en)) and $a^{(L)}$ is the final hidden layer of the deep network:

$$ a^{(l+1)} = \text{ReLU}(\mathbf{W}^{(l)} a^{(l)} + \mathbf{b}^{(l)}) $$

The wide component memorizes specific feature co-occurrences; the deep component generalizes to unseen combinations via learned embeddings.

Feature Interaction Layers

Factorization Machines (FM) model pairwise interactions efficiently:

$$ \hat{y}\_{\text{FM}} = w\_0 + \sum\_{i=1}^{n} w\_i x\_i + \sum\_{i=1}^{n} \sum\_{j=i+1}^{n} \langle \mathbf{v}\_i, \mathbf{v}\_j \rangle x\_i x\_j $$

where $\mathbf{v}\_i \in \mathbb{R}^k$ is a latent factor for feature $i$. The interaction term can be computed in $O(nk)$ time:

$$ \sum\_{i=1}^{n} \sum\_{j=i+1}^{n} \langle \mathbf{v}\_i, \mathbf{v}\_j \rangle x\_i x\_j = \frac{1}{2} \left[ \left( \sum\_{i=1}^{n} \mathbf{v}\_i x\_i \right)^2 - \sum\_{i=1}^{n} \mathbf{v}\_i^2 x\_i^2 \right] $$

DeepFM stacks FM with a deep network, sharing embedding vectors between both components.

Multi-Task Learning

Social media engagement is multi-faceted: clicks, likes, comments, shares, watch time, hides, reports. Multi-task models share lower layers while branching into task-specific heads.

Shared-Bottom Architecture: All tasks share the same feature extraction layers:

$$ \mathbf{h} = f\_{\text{shared}}(\mathbf{x}; \theta\_{\text{shared}}) $$

$$ \hat{y}^{(t)} = g^{(t)}(\mathbf{h}; \theta^{(t)}) \quad \forall t \in \mathcal{T} $$

Multi-gate Mixture-of-Experts (MMoE): Tasks select from a set of expert networks via learned gating:

$$ \mathbf{h}^{(t)} = \sum\_{e=1}^{E} g^{(t)}\_e(\mathbf{x}) \cdot f\_e(\mathbf{x}) $$

where $f\_e$ are expert networks, $g^{(t)}(\mathbf{x}) = \text{softmax}(\mathbf{W}^{(t)}\_g \mathbf{x})$ is a task-specific gating network, and $g^{(t)}\_e$ is the weight assigned to expert $e$ for task $t$.

The joint loss is a weighted sum:

$$ \mathcal{L}\_{\text{MTL}} = \sum\_{t \in \mathcal{T}} \alpha\_t \mathcal{L}^{(t)} $$

Task weights $\alpha\_t$ can be fixed or learned dynamically (e.g., via uncertainty weighting or gradient balancing).

flowchart TB
    Features --> Shared[Shared Layers]
    Shared --> Click["P(click)"]
    Shared --> Like["P(like)"]
    Shared --> Comment["P(comment)"]
    Shared --> Hide["P(hide)"]
    Shared --> WatchTime["E(watch time)"]

Final scores combine task predictions via a weighted sum or learned aggregation, enabling the platform to tune for different objectives (e.g., prioritize meaningful interactions over passive consumption).

Sequence Modeling

User engagement history is sequential: the order of interactions carries signal. Sequence models capture temporal dynamics.

Recurrent Architectures

Long Short-Term Memory (LSTM) networks maintain a cell state $\mathbf{c}\_t$ and hidden state $\mathbf{h}\_t$ updated at each timestep:

$$ \begin{aligned} \mathbf{f}\_t &= \sigma(\mathbf{W}\_f [\mathbf{h}\_{t-1}, \mathbf{x}\_t] + \mathbf{b}\_f) & \text{(forget gate)} \\ \mathbf{i}\_t &= \sigma(\mathbf{W}\_i [\mathbf{h}\_{t-1}, \mathbf{x}\_t] + \mathbf{b}\_i) & \text{(input gate)} \\ \tilde{\mathbf{c}}\_t &= \tanh(\mathbf{W}\_c [\mathbf{h}\_{t-1}, \mathbf{x}\_t] + \mathbf{b}\_c) & \text{(candidate)} \\ \mathbf{c}\_t &= \mathbf{f}\_t \odot \mathbf{c}\_{t-1} + \mathbf{i}\_t \odot \tilde{\mathbf{c}}\_t & \text{(cell state)} \\ \mathbf{o}\_t &= \sigma(\mathbf{W}\_o [\mathbf{h}\_{t-1}, \mathbf{x}\_t] + \mathbf{b}\_o) & \text{(output gate)} \\ \mathbf{h}\_t &= \mathbf{o}\_t \odot \tanh(\mathbf{c}\_t) & \text{(hidden state)} \end{aligned} $$

The final hidden state $\mathbf{h}\_T$ encodes the user’s sequential preferences.

Self-Attention for Sequences

Self-Attentive Sequential Recommendation (SASRec) applies transformer self-attention to interaction sequences. Given item embeddings $\mathbf{E} = [\mathbf{e}\_1, \ldots, \mathbf{e}\_n]$:

$$ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d\_k}} \right) \mathbf{V} $$

where $\mathbf{Q} = \mathbf{E} \mathbf{W}^Q$, $\mathbf{K} = \mathbf{E} \mathbf{W}^K$, $\mathbf{V} = \mathbf{E} \mathbf{W}^V$ are query, key, and value projections.

Causal masking ensures position $i$ only attends to positions $\leq i$:

$$ \text{mask}\_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{otherwise} \end{cases} $$

Multi-head attention projects into $H$ parallel subspaces:

$$ \text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}\_1, \ldots, \text{head}\_H) \mathbf{W}^O $$

where $\text{head}\_h = \text{Attention}(\mathbf{Q} \mathbf{W}^Q\_h, \mathbf{K} \mathbf{W}^K\_h, \mathbf{V} \mathbf{W}^V\_h)$.

Temporal Position Encoding

Standard sinusoidal position encoding:

$$ \begin{aligned} \text{PE}\_{(pos, 2i)} &= \sin\left( \frac{pos}{10000^{2i/d}} \right) \\ \text{PE}\_{(pos, 2i+1)} &= \cos\left( \frac{pos}{10000^{2i/d}} \right) \end{aligned} $$

For recommendation, time-aware embeddings encode absolute or relative timestamps:

$$ \mathbf{t}\_i = \text{MLP}\left( [\Delta t\_i, \sin(\omega \cdot t\_i), \cos(\omega \cdot t\_i)] \right) $$

where $\Delta t\_i = t\_i - t\_{i-1}$ is the time gap and $\omega$ captures periodic patterns (hour-of-day, day-of-week).

BERT4Rec: Bidirectional Modeling

BERT4Rec uses masked language modeling for sequences: randomly mask items and predict them from context:

$$ \mathcal{L}\_{\text{MLM}} = -\sum\_{i \in \mathcal{M}} \log P(x\_i | \mathbf{x}\_{\setminus \mathcal{M}}) $$

where $\mathcal{M}$ is the set of masked positions. Bidirectional attention (no causal mask) captures both past and future context, but requires different inference strategies (e.g., append [MASK] token for next-item prediction).

# Simplified Transformer-based user encoder (PyTorch-like pseudocode)
class UserSequenceEncoder(nn.Module):
    def __init__(self, item_vocab_size, d_model, n_heads, n_layers):
        self.item_embedding = nn.Embedding(item_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads),
            num_layers=n_layers
        )
        self.output_proj = nn.Linear(d_model, d_model)

    def forward(self, item_sequence, mask):
        x = self.item_embedding(item_sequence)
        x = self.positional_encoding(x)
        x = self.transformer(x, src_key_padding_mask=mask)
        # Pool over sequence (e.g., mean or [CLS] token)
        return self.output_proj(x.mean(dim=1))

Re-ranking and Business Logic

Raw model scores optimize for predicted engagement, but the final feed must satisfy additional constraints:

Diversity

Showing ten similar posts degrades user experience. Diversity algorithms ensure variety.

Maximal Marginal Relevance (MMR)

MMR iteratively selects items that balance relevance and diversity:

$$ \text{MMR} = \arg\max_{i \in \mathcal{I} \setminus S} \left[ \lambda \cdot \text{rel}(i) - (1 - \lambda) \cdot \max_{j \in S} \text{sim}(i, j) \right] $$

where $S$ is the already-selected set, $\text{rel}(i)$ is the relevance score, $\text{sim}(i, j)$ is similarity (e.g., cosine similarity of embeddings), and $\lambda \in [0, 1]$ controls the relevance-diversity trade-off.

Complexity: $O(k \cdot n)$ for selecting $k$ items from $n$ candidates, assuming similarity lookup is $O(1)$.

Determinantal Point Processes (DPP)

DPPs are probabilistic models that assign higher probability to diverse subsets. Given a positive semi-definite kernel matrix $\mathbf{L} \in \mathbb{R}^{n \times n}$:

$$ P(S) \propto \det(\mathbf{L}_S) $$

where $\mathbf{L}_S$ is the submatrix indexed by $S$. The kernel encodes both quality (diagonal) and similarity (off-diagonal):

$$ L_{ij} = q_i \cdot \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j) \cdot q_j $$

where $q_i$ is item quality (relevance score) and $\phi(\mathbf{x}_i)$ is the diversity feature vector.

Properties:

Items with high quality are more likely to be selected.
Similar items (high $\phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$) are unlikely to appear together.
Exact sampling is $O(n^3)$; approximate methods (greedy, MCMC) are used in practice.

MAP Inference finds the most diverse high-quality set:

$$ S^* = \arg\max_{|S| = k} \det(\mathbf{L}_S) $$

This is NP-hard but admits a $(1 - 1/e)$-approximation via greedy selection.

Submodular Optimization

A set function $F: 2^{\mathcal{I}} \rightarrow \mathbb{R}$ is submodular if it exhibits diminishing returns:

$$ F(S \cup \{i\}) - F(S) \geq F(T \cup \{i\}) - F(T) \quad \forall S \subseteq T, i \notin T $$

Coverage functions capture diversity:

$$ F(S) = \sum\_{c \in \mathcal{C}} \max\_{i \in S} w\_{ic} $$

where $\mathcal{C}$ is a set of categories/topics and $w\_{ic}$ is item $i$’s coverage of category $c$.

Greedy maximization of monotone submodular functions achieves a $(1 - 1/e) \approx 0.63$ approximation ratio:

$$ S\_{t+1} = S\_t \cup \left\lbrace \arg\max\_{i \notin S\_t} F(S\_t \cup \lbrace i \rbrace) - F(S\_t) \right\rbrace $$

Novelty, Serendipity, and Exploration

While diversity ensures variety of topics/creators within a single feed, novelty and serendipity address different user needs:

Novelty: Items the user hasn’t encountered before. Pure relevance-based ranking can get stuck recommending the same high-scoring items. Novelty ensures forward progress through the catalog.

Serendipity: Surprising discoveries—items outside the user’s typical consumption pattern that they end up enjoying. A sports fan unexpectedly loving a cooking video. Serendipity combats filter bubbles and enables exploration.

Dimension	Diversity	Novelty	Serendipity
Goal	Vary topics/creators within feed	Show unseen items	Enable surprising discoveries
Measurement	Topic/creator entropy	Impression history	Out-of-profile engagement
Risk	Low (still in user’s interests)	Medium (might not engage)	High (might actively dislike)
Frequency	Every session	Most sessions	Occasional “budget”

Implementation strategies:

Novelty filtering:
```
unseen_candidates = [c for c in candidates if c.id not in user.seen_history]
```
- Requires tracking user impression history (Bloom filter or distributed cache)
- Balance: Don’t filter TOO aggressively (catalog exhaustion for power users)
Serendipity injection:
- Reserve 5-10% of feed slots for “exploration” items
- Sample from topics user hasn’t engaged with recently
- Use contextual bandits (Thompson Sampling, LinUCB) to balance exploration vs. exploitation
$$ \text{UCB}\_i = \hat{\mu}\_i + \beta \sqrt{\frac{\ln t}{n\_i}} $$
where $\hat{\mu}\_i$ is estimated reward, $n\_i$ is times shown, $\beta$ controls exploration
Calibrated surprise scores:
- Train a model to predict: $P(\text{engage} | \text{item}, \text{user profile})$
- Compute unexpectedness: how different is this item from user’s historical profile?
- Serendipity score: $\text{relevance} \times \text{unexpectedness}$
- Reward items that are both relevant AND surprising

User-level adaptation:

Explorers: Users who engage with diverse content → increase serendipity budget
Conservatives: Users who stick to familiar topics → reduce serendipity, focus on relevance
Cohort analysis: New users need more exploration; established users have known preferences

Measuring success:

Immediate engagement: Do users click on serendipitous items?
Session length: Does serendipity increase time spent?
Long-term retention: Do users with more serendipity return more often?
Regret signals: Do users hide/dismiss serendipitous items?

Fairness and Creator Economics

Platforms balance exposure across creators to maintain a healthy ecosystem:

Exposure parity: Ensure new or small creators receive baseline visibility.
Anti-concentration: Prevent a handful of viral posts from dominating feeds.
Monetization constraints: Ensure ad load targets are met without degrading organic experience.

Policy Enforcement

Content policy violations (misinformation, hate speech, spam) must be filtered or demoted:

Hard blocks: Remove items flagged by integrity classifiers.
Soft demotion: Reduce scores for borderline content.
Contextual warnings: Overlay labels on sensitive content.

Deduplication and Near-Duplicate Detection

Users experience fatigue when shown the same or nearly identical content repeatedly. Deduplication removes exact duplicates and near-duplicates before final ranking.

Exact Deduplication

The simplest case: multiple candidates point to the same underlying content.

Common causes:

Cross-posting: Same content shared to multiple groups/pages
Reposts/retweets: User shares an existing post
Viral spread: Same image/video uploaded by multiple users

Solution: Content-based hashing

Text: Hash normalized text (lowercased, punctuation removed) with SHA-256 or MurmurHash
Images: Perceptual hashing (pHash, dHash) produces similar hashes for visually similar images despite format changes, crops, or compression
Videos: Hash keyframes or audio fingerprints (Chromaprint, AudioSet embeddings)

At serving time, deduplicate by hash:

seen_hashes = set()
deduplicated = []
for item in ranked_candidates:
    h = hash(item.content)
    if h not in seen_hashes:
        deduplicated.append(item)
        seen_hashes.add(h)

Complexity: $O(n)$ with hash table lookups.

Semantic Near-Duplicate Detection

Harder problem: content that’s functionally identical but not byte-for-byte the same.

Examples:

Same news story from different outlets (different wording, same facts)
Meme templates with different captions
Screenshots of tweets vs. embedded tweets
Videos with different aspect ratios or watermarks

Embedding-based approach:

Compute content embeddings (BERT for text, CLIP for images/video)
For each candidate, check cosine similarity against already-selected items: $$ \text{sim}(i, j) = \frac{\mathbf{e}_i^\top \mathbf{e}_j}{\|\mathbf{e}_i\| \|\mathbf{e}_j\|} $$
If $\text{sim}(i, j) > \tau$ (e.g., $\tau = 0.9$), mark as near-duplicate

Optimization: Use ANN indexes (HNSW, FAISS) to find near neighbors in embedding space without checking all pairs. Complexity: $O(n \log n)$ amortized.

flowchart LR
    Candidates[Ranked Candidates] --> Hash[Content Hashing]
    Hash --> ExactDedup[Exact Deduplication]

    ExactDedup --> Embed[Embedding Lookup]
    Embed --> ANN[ANN Similarity Search]
    ANN --> NearDedup[Near-Duplicate Filter]

    NearDedup --> Final[Deduplicated Feed]

Multi-Session Deduplication

Within a single feed request, deduplication is straightforward. Across sessions, it’s harder:

Challenges:

State management: Must track what the user has already seen. This requires persistent storage (Redis, DynamoDB) keyed by user ID.
Recency: Content seen days ago might be worth showing again; content seen minutes ago should be suppressed.
False positives: Overly aggressive deduplication can starve feeds of content.

Strategies:

Strategy	Approach	Trade-off
Session-local only	Deduplicate within current request only	Simple; allows repeats across sessions
Bloom filter	Probabilistic set membership; track seen item IDs	Space-efficient; false positives possible
Time-decayed cache	Store seen items with TTL (e.g., 24 hours)	Bounded memory; configurable recency
Impression logging	Log all impressions; query before ranking	Accurate; high storage cost

Production pattern: Bloom filter for short-term deduplication (hours), with daily resets to avoid memory bloat.

Novelty vs. Diversity vs. Deduplication

These are related but distinct:

Concept	Definition	Goal
Deduplication	Remove identical/near-identical content	Avoid literal repetition
Novelty	Ensure user hasn’t seen this item before	Prevent boring the user
Diversity	Vary topics, creators, formats	Prevent filter bubbles

Deduplication is a prerequisite for novelty (can’t be novel if it’s a duplicate). Diversity operates at a higher semantic level (topic/creator variety, not content similarity).

Re-ranking as Constrained Optimization

Re-ranking involves multiple competing objectives (relevance, diversity, fairness, business constraints) that cannot be simultaneously maximized. Formal treatment requires constrained optimization or multi-objective frameworks.

Single-Objective Formulation with Constraints

The simplest formulation maximizes a weighted utility subject to hard constraints (Carbonell & Goldstein 1998; Agrawal et al. 2009):

$$ \max\_{\pi \in \Pi} \sum\_{i=1}^{k} r(\pi\_i) \cdot w\_{\text{pos}}(i) $$

subject to:

Diversity: $\text{sim}(\pi\_i, \pi\_j) < \theta \quad \forall i \neq j$
Fairness: $\sum\_{i : \text{creator}(\pi\_i) = c} 1 \leq m\_c \quad \forall c \in \mathcal{C}$ (per-creator exposure cap)
Freshness: $|\{i : \text{age}(\pi\_i) < 24h\}| \geq n$
Policy: $\text{policy-score}(\pi\_i) > \tau \quad \forall i$

where $\Pi$ is the set of permutations of candidate items, $r(\pi\_i)$ is relevance score, and $w\_{\text{pos}}(i)$ is position-dependent weight.

Computational challenge: This is a constrained integer optimization problem (NP-hard). Practical solutions:

Greedy approximation: MMR, submodular maximization ($1 - 1/e$ approximation)
Relaxation: Continuous relaxation + rounding
Constraint satisfaction: Filter candidates, then rank

Lagrangian Formulation

For soft constraints, use Lagrangian relaxation (Boyd & Vandenberghe 2004):

$$ \mathcal{L}(\pi, \lambda) = \sum\_{i=1}^{k} r(\pi\_i) \cdot w\_{\text{pos}}(i) - \sum\_{j} \lambda\_j g\_j(\pi) $$

where $g\_j(\pi)$ are constraint violations (diversity deficit, fairness violation, etc.) and $\lambda\_j \geq 0$ are Lagrange multipliers.

Optimization: Solve the dual problem:

$$ \min\_{\lambda \geq 0} \max\_{\pi} \mathcal{L}(\pi, \lambda) $$

Interpretation: Lagrange multipliers $\lambda\_j$ represent the “price” of violating constraint $j$. At optimum, if $\lambda\_j > 0$, constraint $j$ is active (binding).

Example - Diversity constraint:

$$ \mathcal{L}(\pi, \lambda) = \sum\_{i=1}^{k} r(\pi\_i) - \lambda \sum\_{iIncreasing $\lambda$ penalizes similar items more heavily, enforcing diversity.

Pareto Efficiency

When objectives genuinely conflict (e.g., relevance vs. diversity vs. fairness), no single ranking dominates on all metrics. The Pareto frontier characterizes the set of optimal trade-offs (Miettinen 1999).

Definition: Ranking $\pi$ is Pareto-optimal if no alternative $\pi'$ exists such that:

$$ f\_j(\pi') \geq f\_j(\pi) \quad \forall j \quad \text{and} \quad f\_k(\pi') > f\_k(\pi) \quad \text{for some } k $$

where $f\_j$ are objective functions (e.g., $f\_1$ = relevance, $f\_2$ = diversity, $f\_3$ = fairness).

Scalarization: Convert to single-objective via weighted sum:

$$ \max\_{\pi} \sum\_{j=1}^{m} \alpha\_j f\_j(\pi), \quad \alpha\_j \geq 0, \sum\_j \alpha\_j = 1 $$

Limitation: Linear scalarization cannot reach non-convex parts of Pareto frontier. Use Chebyshev scalarization for non-convex objectives:

$$ \max\_{\pi} \min\_{j} \left( \alpha\_j (f\_j(\pi) - z\_j^*) \right) $$

where $z\_j^*$ is the ideal value of objective $j$.

Multi-objective re-ranking algorithm:

Enumerate or sample candidate rankings $\Pi' \subset \Pi$
Compute objective vector $\mathbf{f}(\pi) = (f\_1(\pi), \ldots, f\_m(\pi))$ for each $\pi \in \Pi'$
Identify Pareto frontier via non-dominated sorting
Select from frontier based on business priorities $\alpha$

Computing the Pareto Frontier

Characterization via KKT conditions: For differentiable objectives, Pareto-optimal $\pi^*$ satisfies:

$$ \nabla f\_j(\pi^{\*}) = \sum\_{k=1}^{m} \mu\_k \nabla f\_k(\pi^{\*}), \quad \mu\_k \geq 0, \sum\_k \mu\_k = 1 $$

Equivalently, the gradients of all objectives lie in a common cone at the optimum.

Computing the frontier:

ε-constraint method: Fix $m-1$ objectives as constraints, optimize the remaining one:
$$ \max\_{\pi} f\_1(\pi) \quad \text{s.t.} \quad f\_j(\pi) \geq \epsilon\_j \quad \forall j > 1 $$
Vary $(\epsilon\_2, \ldots, \epsilon\_m)$ to trace the frontier.
Weighted sum with systematic sweeps: Vary weights $\alpha$ on a grid and solve:
$$ \max\_{\pi} \sum\_{j} \alpha\_j f\_j(\pi) $$
Each $\alpha$ yields a Pareto-optimal solution (for convex objectives).
Non-dominated sorting (Deb et al. 2002): For discrete $\Pi$:
- Start with all rankings $\Pi$
- Identify non-dominated set (Pareto front layer 1)
- Remove layer 1, repeat on remainder (layer 2), etc.
- Complexity: $O(m |\Pi|^2)$ comparisons

Example: Relevance-Diversity-Fairness

Three objectives:

$f\_1(\pi) = \sum\_i r(\pi\_i) \cdot w(i)$ (relevance)
$f\_2(\pi) = -\sum\_{i
$f\_3(\pi) = \min\_{g} \frac{\sum\_{i \in g} \mathbb{1}[\pi\_i \in \text{top-}k]}{|g|}$ (min-group exposure fairness)

Pareto front: The set of $\pi$ such that improving one objective requires sacrificing another.

quadrantChart
    title Relevance-Diversity Pareto Frontier
    x-axis Low Diversity --> High Diversity
    y-axis Low Relevance --> High Relevance
    quadrant-1 Pareto Optimal
    quadrant-2 High Relevance Focus
    quadrant-3 Dominated
    quadrant-4 High Diversity Focus
    Point A: [0.20, 0.95]
    Point B: [0.35, 0.88]
    Point C: [0.48, 0.78]
    Point D: [0.60, 0.65]
    Point E: [0.72, 0.52]
    Point F: [0.82, 0.38]
    Point G: [0.90, 0.25]
    Dominated X: [0.45, 0.45]
    Dominated Y: [0.55, 0.35]

Points A-G form the Pareto frontier representing optimal trade-offs. Interior points (X, Y) are dominated—both objectives can be improved simultaneously.

Tuning Weights via Constrained Optimization

Problem: How to set $\alpha = (\alpha\_1, \ldots, \alpha\_m)$ in scalarized objective?

Approach 1: Online learning of weights

Run multi-armed bandit over weight configurations. Each arm $a$ corresponds to a weight vector $\alpha^{(a)}$. Observe long-term metric (e.g., 7-day retention) as reward:

$$ \alpha^* = \arg\max\_{\alpha} \mathbb{E}[\text{retention}(\alpha)] $$

Use Thompson Sampling or GP-UCB over the weight simplex.

Approach 2: Inverse optimization (Bertsimas et al. 2015)

Given expert rankings $\{\pi\_1, \ldots, \pi\_N\}$ that achieve desirable outcomes, infer the weights that rationalize them:

$$ \min\_{\alpha} \sum\_{n=1}^{N} \left\| \pi\_n - \arg\max\_{\pi} \sum\_j \alpha\_j f\_j(\pi) \right\|^2 $$

This is a bi-level optimization problem, typically solved via gradient-free methods or relaxations.

Approach 3: Constrained grid search with business metrics

Define acceptable ranges for each objective:

Relevance: maintain $\geq 95\%$ of baseline
Diversity: improve by $\geq 10\%$
Fairness: achieve CED $\leq 1.2$ across groups

Search over $\alpha$ that satisfy these constraints, select the one maximizing a primary business metric (e.g., DAU, revenue).

Approach 4: Lagrangian dual ascent

For constraints $c\_j(\pi) \geq 0$, iteratively:

Primal step: Given $\lambda^{(t)}$, solve:
$$ \pi^{(t+1)} = \arg\max\_{\pi} \mathcal{L}(\pi, \lambda^{(t)}) = \arg\max\_{\pi} \left( f(\pi) - \sum\_j \lambda\_j^{(t)} c\_j(\pi) \right) $$
Dual step: Update multipliers:
$$ \lambda\_j^{(t+1)} = \max\left(0, \lambda\_j^{(t)} - \eta \nabla\_{\lambda\_j} \mathcal{L}(\pi^{(t+1)}, \lambda^{(t)}) \right) = \max\left(0, \lambda\_j^{(t)} + \eta c\_j(\pi^{(t+1)}) \right) $$
If constraint $j$ is violated ($c\_j < 0$), increase $\lambda\_j$ to penalize violations more heavily.

Convergence: Under convexity and suitable step sizes $\eta$, converges to optimal $(\pi^*, \lambda^*)$ satisfying KKT conditions.

Practical note: Full Pareto optimization is expensive. Production systems typically use scalarization with tuned $\alpha$ (via A/B testing or offline policy evaluation) or sequential filtering (first enforce hard constraints, then optimize relevance).

flowchart LR
    Scores[Ranking Scores] --> Rerank[Re-ranking]
    Diversity[Diversity Constraints] --> Rerank
    Policy[Policy Filters] --> Rerank
    Business[Business Rules] --> Rerank
    Rerank --> Final[Final Feed]

Real-Time Personalization

Static models trained on historical data capture long-term preferences but fundamentally lag behind user intent. A user who just searched for “running shoes” has different needs than their historical profile suggests—yet batch-trained models won’t reflect this until the next training cycle (hours or days later). Real-time personalization bridges this gap by adapting to in-session signals within milliseconds.

The challenge is architectural: batch systems optimize for throughput and can process billions of interactions overnight, while real-time systems must respond in <50ms at the 99th percentile. This forces different trade-offs across the stack.

What changes in real-time:

Signal Type	Batch Latency	Real-Time Latency	Example
User features	Hours	<100ms	“User clicked 3 electronics items this session”
Item popularity	Daily	Minutes	“This video is trending right now”
Contextual state	N/A	Immediate	“User is on mobile, evening, in New York”
Social signals	Hours	Seconds	“User’s friend just liked this item”

Why it matters:

Intent shift: Users switch contexts mid-session. Someone researching laptops may pivot to laptop bags—real-time systems catch this; batch systems miss it entirely.
Freshness: New items (breaking news, flash sales, new releases) need immediate exposure, not after the next batch job.
Feedback loops: Showing an item and observing no click is signal. Real-time systems can demote ignored items within the same session.
Competitive dynamics: In marketplaces and ads, the value of inventory changes by the second. Stale predictions leave money on the table.

Session-Based Recommendations

Within a session, the system observes which items the user engages with (or ignores) and adjusts subsequent recommendations:

Contextual bandits: Balance exploration (showing uncertain items) with exploitation (showing high-confidence items). Thompson Sampling, LinUCB, and neural bandit variants are common.
Session transformers: Encode the in-session interaction sequence and predict next-item relevance.
Real-time feature updates: Increment engagement counters and recompute user embeddings on each interaction.

Streaming Infrastructure

Real-time personalization requires streaming pipelines that:

Ingest interaction events (Kafka, Kinesis, Pub/Sub)
Update user state in low-latency stores (Redis, DynamoDB, Bigtable)
Trigger re-scoring or model inference on state changes

flowchart LR
    App[User App] -->|events| Stream[Event Stream]
    Stream --> Processor[Stream Processor]
    Processor --> State[(User State Store)]
    State --> Inference[Ranking Service]
    Inference --> App

Explore-Exploit Trade-offs

Pure exploitation (always showing the highest-scoring items) leads to filter bubbles and prevents discovery of new content. The explore-exploit dilemma is formalized via multi-armed bandits.

Multi-Armed Bandit Framework

Each item $i$ is an “arm” with unknown reward distribution. At each round $t$, the algorithm selects arm $a_t$ and observes reward $r_t \sim P_{a_t}$. The goal is to minimize cumulative regret:

$$ R(T) = \sum_{t=1}^{T} \left( \mu^* - \mu_{a_t} \right) = T \mu^* - \sum_{t=1}^{T} \mu_{a_t} $$

where $\mu^* = \max_i \mu_i$ is the optimal arm’s expected reward.

ε-Greedy

With probability $\epsilon$, select a random arm; otherwise select the arm with highest empirical mean:

$$ a_t = \begin{cases} \text{Uniform}(\mathcal{I}) & \text{with probability } \epsilon \\ \arg\max_i \hat{\mu}_i & \text{otherwise} \end{cases} $$

Regret: $R(T) = O(\epsilon T + K/\epsilon)$ where $K$ is the number of arms. Optimal $\epsilon = O(T^{-1/3})$ yields $R(T) = O(T^{2/3})$.

Upper Confidence Bound (UCB)

UCB selects the arm with highest upper confidence bound:

$$ a\_t = \arg\max\_i \left( \hat{\mu}\_i + \sqrt{\frac{2 \ln t}{n\_i}} \right) $$

where $\hat{\mu}\_i$ is the empirical mean, $n\_i$ is the number of times arm $i$ has been pulled, and the second term is the confidence bonus.

Regret bound: $R(T) = O\left( \sqrt{KT \ln T} \right)$, which is order-optimal (Auer et al. 2002).

Proof sketch:

Assumptions:

Rewards are 1-sub-Gaussian: $\mathbb{E}[\exp(\lambda(r - \mu))] \leq \exp(\lambda^2/2)$ for all $\lambda$
Rewards are i.i.d. within each arm
Support: $r\_t \in [0, 1]$

Key steps:

Define $\Delta\_i = \mu^* - \mu\_i$ as the suboptimality gap for arm $i$. The expected number of times arm $i$ is pulled by time $T$ is bounded by decomposing regret:

$$ \mathbb{E}[n\_i(T)] \leq \frac{8 \ln T}{\Delta\_i^2} + 1 + \frac{\pi^2}{3} $$

Intuition: UCB explores arm $i$ only while its confidence bound overlaps with the optimal arm. By Hoeffding’s inequality, the empirical mean $\hat{\mu}\_i$ concentrates around $\mu\_i$ at rate $O(1/\sqrt{n\_i})$. The confidence bonus $\sqrt{2 \ln t / n\_i}$ shrinks similarly, ensuring:

$$ P(\mu\_i \leq \hat{\mu}\_i + \sqrt{2 \ln t / n\_i}) \geq 1 - t^{-4} $$

After $n\_i \approx 8 \ln T / \Delta\_i^2$ pulls, the bonus becomes smaller than the gap $\Delta\_i$, and arm $i$ is no longer competitive. Summing over all suboptimal arms:

$$ R(T) = \sum\_{i : \Delta\_i > 0} \Delta\_i \mathbb{E}[n\_i(T)] = O\left( \sum\_{i} \frac{\ln T}{\Delta\_i} \right) = O\left( \sqrt{KT \ln T} \right) $$

where the last step uses $\sum\_i 1/\Delta\_i \leq \sqrt{K/\Delta\_{\min}}$ and $\Delta\_{\min} = \Omega(1/\sqrt{T})$ in worst case.

Tightness: This bound is minimax-optimal (Lai & Robbins 1985); no algorithm can achieve better worst-case regret for all problem instances.

Thompson Sampling

Thompson Sampling maintains a posterior distribution $P(\mu\_i | \mathcal{D})$ over each arm’s reward. At each round:

Sample $\tilde{\mu}\_i \sim P(\mu\_i | \mathcal{D})$ for each arm
Select $a\_t = \arg\max\_i \tilde{\mu}\_i$
Update posterior with observed reward

For Bernoulli rewards with Beta prior $\text{Beta}(\alpha\_i, \beta\_i)$:

$$ P(\mu\_i | \mathcal{D}) = \text{Beta}(\alpha\_i + s\_i, \beta\_i + f\_i) $$

where $s\_i$ and $f\_i$ are successes and failures for arm $i$.

Regret bound: $R(T) = O\left( \sqrt{KT \ln T} \right)$, matching UCB with better empirical performance.

Contextual Bandits

In recommendation, context (user features, time, session state) affects optimal actions. Contextual bandits model reward as $r = f(\mathbf{x}, a) + \epsilon$.

LinUCB assumes linear rewards:

$$ \mathbb{E}[r | \mathbf{x}, a] = \boldsymbol{\theta}_a^\top \mathbf{x} $$

The algorithm maintains per-arm ridge regression and selects:

$$ a\_t = \arg\max\_a \left( \hat{\boldsymbol{\theta}}\_a^\top \mathbf{x}\_t + \alpha \sqrt{\mathbf{x}\_t^\top \mathbf{A}\_a^{-1} \mathbf{x}\_t} \right) $$

where $\mathbf{A}\_a = \mathbf{X}\_a^\top \mathbf{X}\_a + \lambda \mathbf{I}$ is the design matrix, and $\hat{\boldsymbol{\theta}}\_a = \mathbf{A}\_a^{-1} \mathbf{X}\_a^\top \mathbf{r}\_a$ is the ridge regression estimate.

Regret bound (Li et al. 2010): Under assumptions:

$\|\mathbf{x}\_t\| \leq L$ for all $t$ (bounded features)
$\|\boldsymbol{\theta}\_a\| \leq S$ for all $a$ (bounded parameters)
Noise $\eta\_t$ is conditionally $R$-sub-Gaussian: $\mathbb{E}[\exp(\lambda \eta\_t) | \mathcal{F}\_{t-1}] \leq \exp(\lambda^2 R^2/2)$

The cumulative regret satisfies:

$$ R(T) = O\left( d \sqrt{T \ln(1 + T L^2/\lambda)} \right) $$

where $d$ is feature dimension, and $\alpha = R \sqrt{\ln(2T/\delta)} + \sqrt{\lambda} S$ is the exploration bonus.

Proof intuition: The confidence ellipsoid $\{\boldsymbol{\theta} : \|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\_a\|\_{\mathbf{A}\_a} \leq \beta\}$ contains the true $\boldsymbol{\theta}\_a$ with high probability by self-normalized concentration (Abbasi-Yadkori et al. 2011). The term $\sqrt{\mathbf{x}\_t^\top \mathbf{A}\_a^{-1} \mathbf{x}\_t}$ is the width of this ellipsoid in direction $\mathbf{x}\_t$. Regret accumulates only when the algorithm explores regions of high uncertainty, which shrinks as data accumulates.

Neural bandits extend this to non-linear reward functions using neural networks with uncertainty estimates (dropout, ensembles, or neural tangent kernel approximations). Regret analysis for neural bandits requires additional assumptions on function class complexity (e.g., eluder dimension, NTK approximation quality).

Algorithm	Regret Bound	Computation	Assumptions
ε-greedy	$O(T^{2/3})$	Low	None
UCB	$O(\sqrt{KT \ln T})$	Low	Sub-Gaussian rewards
Thompson Sampling	$O(\sqrt{KT \ln T})$	Medium	Prior specification
LinUCB	$O(d\sqrt{T \ln T})$	Medium	Linear rewards
Neural Bandits	Varies	High	Non-linear; uncertainty estimation

Questions or feedback?