Recommendation Systems Part 3: Production Systems


Evaluation and Metrics

Recommendation systems require rigorous evaluation across offline, online, and long-term dimensions.

Offline Metrics

| Metric | Definition | Use Case |
| --- | --- | --- |
| AUC-ROC | Area under ROC curve for engagement prediction | Pointwise model quality |
| Log-loss | Cross-entropy of predicted probabilities | Calibration quality |
| NDCG@k | Normalized discounted cumulative gain at rank k | Ranking quality |
| Recall@k | Fraction of relevant items in top-k | Retrieval coverage |
| Hit Rate | Whether the engaged item appears in top-k | Retrieval success |

Offline metrics use held-out interaction logs; they are necessary but not sufficient for production decisions.

Ranking Metric Definitions

Discounted Cumulative Gain (DCG) rewards relevant items at higher positions:

$$ \text{DCG}@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)} $$

where $rel_i$ is the relevance grade of the item at rank $i$. Normalized DCG (NDCG) divides by the ideal DCG:

$$ \text{NDCG}@k = \frac{\text{DCG}@k}{\text{IDCG}@k} $$

where $\text{IDCG}@k$ is DCG of the optimal ranking.

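Average Precision (AP) averages the precision at each rank where a relevant item appears; Mean Average Precision (MAP) is the mean of AP over all queries (or users):

$$ \text{AP} = \frac{1}{|\text{Rel}|} \sum_{k=1}^{n} P@k \cdot \mathbb{1}[rel_k = 1] $$

where $|\text{Rel}|$ is the number of relevant items, $n$ is the length of the ranked list, and $P@k = \frac{|\text{relevant in top-}k|}{k}$.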

Mean Reciprocal Rank (MRR) captures the position of the first relevant item:

$$ \text{MRR} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{1}{\text{rank}_q} $$

where $\text{rank}_q$ is the rank of the first relevant item for query $q$.
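The ranking metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, relevance grades are passed in ranked order, and `average_precision` assumes every relevant item appears somewhere in the ranked list (so $|\text{Rel}|$ equals the number of hits).

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k with the exponential gain (2^rel - 1) and log2 position discount."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given order divided by DCG of the ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def average_precision(binary_rels):
    """AP for one ranked list of 0/1 labels (assumes all relevant items are listed)."""
    hits, total = 0, 0.0
    for k, rel in enumerate(binary_rels, start=1):
        if rel:
            hits += 1
            total += hits / k  # P@k at a relevant position
    return total / hits if hits else 0.0

def mean_reciprocal_rank(ranked_lists):
    """MRR over queries; a query with no relevant item contributes 0."""
    reciprocal_ranks = []
    for rels in ranked_lists:
        rank = next((i for i, r in enumerate(rels, start=1) if r), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```

Calling `ndcg_at_k([3, 2, 1], k=3)` on an already ideal ordering returns 1.0, while reversing the list lowers the score because high grades are discounted at lower positions.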

Calibration Metrics

A model is calibrated if predicted probabilities match empirical frequencies:

$$ P(y = 1 | \hat{p} = p) = p \quad \forall p \in [0, 1] $$

Expected Calibration Error (ECE) bins predictions and measures deviation:

$$ \text{ECE} = \sum_{b=1}^{B} \frac{|B_b|}{n} \left| \text{acc}(B_b) - \text{conf}(B_b) \right| $$

where $n$ is the total number of predictions, $\text{acc}(B_b)$ is the empirical positive rate (accuracy) in bin $b$, and $\text{conf}(B_b)$ is the average predicted probability (confidence) in that bin.
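A minimal sketch of ECE for a binary engagement model, assuming equal-width bins over $[0, 1]$ (the bin count and the function name are illustrative choices, not something the formula prescribes):

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width bins; labels are 0/1 engagement outcomes."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry zero weight
        conf = sum(p for p, _ in members) / len(members)  # mean predicted probability
        acc = sum(y for _, y in members) / len(members)   # empirical positive rate
        ece += len(members) / n * abs(acc - conf)
    return ece
```

A model that predicts 0.8 for ten impressions of which eight are clicked scores an ECE of roughly zero; predicting 0.9 when only one in four is clicked gives 0.65.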

Online Metrics (A/B Testing)

Online experiments measure causal impact on user behavior:

| Metric Category | Examples |
| --- | --- |
| Engagement | CTR, likes/user, comments/user, watch time |
| Retention | DAU, WAU, session frequency, churn rate |
| Quality | Survey satisfaction, content diversity consumed |
| Creator health | Posts created, follower growth, monetization |
| Platform safety | Reports, policy violations surfaced |

Statistical Framework

Let $Y_i(1)$ and $Y_i(0)$ denote potential outcomes for user $i$ under treatment and control. The Average Treatment Effect (ATE) is:

$$ \tau = \mathbb{E}[Y_i(1) - Y_i(0)] $$

Randomization ensures unbiased estimation: $\hat{\tau} = \bar{Y}_T - \bar{Y}_C$.

Variance estimation for the difference in means:

$$ \text{Var}(\hat{\tau}) = \frac{\sigma_T^2}{n_T} + \frac{\sigma_C^2}{n_C} $$

Confidence interval (asymptotic normal):

$$ \hat{\tau} \pm z_{1-\alpha/2} \sqrt{\text{Var}(\hat{\tau})} $$

Sample Size Calculation

For detecting effect size $\delta$ with power $1 - \beta$ at significance level $\alpha$:

$$ n = 2 \left( \frac{z_{1-\alpha/2} + z_{1-\beta}}{\delta / \sigma} \right)^2 $$

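where $n$ is the number of users required in each group (assuming equal allocation and equal variance), $\sigma$ is the standard deviation of the metric, and $\delta / \sigma$ is the standardized effect size.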
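The interval and sample-size formulas above can be checked numerically with the standard library. The sketch below uses hypothetical function names, assumes equal-sized groups for the sample-size calculation, and relies on the normal approximation throughout.

```python
import math
from statistics import NormalDist, mean, variance

def ate_confidence_interval(treatment, control, alpha=0.05):
    """Difference in means with an asymptotic normal confidence interval."""
    tau_hat = mean(treatment) - mean(control)
    std_err = math.sqrt(variance(treatment) / len(treatment)
                        + variance(control) / len(control))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return tau_hat, (tau_hat - z * std_err, tau_hat + z * std_err)

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Users needed per group to detect an absolute effect delta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / (delta / sigma)) ** 2)
```

For a standardized effect of 0.1 at $\alpha = 0.05$ and 80% power, `sample_size_per_group(0.1, 1.0)` gives 1570 users per group; halving the detectable effect roughly quadruples that number.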

Multiple Testing Correction

Testing many metrics inflates false positive rates. Corrections include:

  • Bonferroni: $\alpha' = \alpha / m$ for $m$ tests (conservative)
  • Benjamini-Hochberg: Controls false discovery rate (FDR) $\leq \alpha$
  • Sequential testing: Controls the error from repeatedly peeking at results (many looks rather than many metrics) using alpha-spending functions (e.g., O’Brien-Fleming)
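As a sketch of how the first two corrections differ in practice, the functions below take a list of p-values (one per metric) and return which hypotheses are rejected. The function names are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected
```

With p-values `[0.001, 0.02, 0.04, 0.3]`, Bonferroni rejects only the first, while Benjamini-Hochberg rejects the first two, which illustrates why Bonferroni is described as conservative.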

Variance Reduction Techniques

  • Stratification: Partition users by pre-experiment covariates; average within-stratum estimates.
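  • CUPED (Controlled-experiment Using Pre-Experiment Data): Adjust each outcome using a pre-experiment covariate $X$ (typically the same metric measured before the experiment):
$$ \tilde{Y}_i = Y_i - \theta (X_i - \bar{X}) $$

where $\theta = \text{Cov}(Y, X) / \text{Var}(X)$ minimizes the variance of $\tilde{Y}$. The adjusted variance is $(1 - \rho^2)\,\text{Var}(Y)$, a reduction by a fraction $\rho^2$, where $\rho$ is the correlation between $Y$ and $X$. Because $X$ is measured before assignment, the adjustment leaves the expected treatment effect unchanged.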
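A minimal CUPED sketch using only the standard library. It assumes `y` holds in-experiment outcomes and `x` the matching pre-experiment values for the same users; in a real experiment $\theta$ would be estimated on pooled treatment and control data.

```python
from statistics import mean, variance

def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes and the estimated theta."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)  # OLS slope of y on x
    adjusted = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
    return adjusted, theta
```

The adjusted outcomes keep the same mean as `y` but have lower variance whenever `x` and `y` are correlated, so the same experiment yields a tighter confidence interval.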

Network Interference and SUTVA Violations

Standard A/B testing assumes the Stable Unit Treatment Value Assumption (SUTVA): a user’s outcome depends only on their own treatment assignment, not on others’ assignments. Social recommendation systems routinely violate this assumption.

Why interference occurs:

| Mechanism | Example | Bias Direction |
| --- | --- | --- |
| Content spillover | Treatment users share recommended content with control users | Dilutes treatment effect |
| Social influence | Treatment user’s increased engagement affects friends’ feeds | Inflates treatment effect |
| Competition effects | Treatment users consume content, reducing availability for control | Complicates interpretation |
| Creator response | Creators adapt to treatment group’s engagement patterns | Long-term ecosystem shift |

Formal model (Hudgens & Halloran 2008; Aronow & Samii 2017):

Let $G = (V, E)$ be the social network, $Z\_i \in \{0,1\}$ be the treatment assignment for user $i$, and $\mathbf{Z} = (Z\_1, \ldots, Z\_n)$ be the full assignment vector. The potential outcome $Y\_i(\mathbf{Z})$ depends on the entire vector $\mathbf{Z}$, not just $Z\_i$.

General interference model: Define neighborhoods $\mathcal{N}(i) \subset V$ (e.g., friends, k-hop neighbors). Assume outcomes depend only on own treatment and neighborhood treatments:

$$ Y\_i(\mathbf{Z}) = Y\_i(Z\_i, \mathbf{Z}\_{\mathcal{N}(i)}) $$

Treatment effect with interference: The individual treatment effect compares:

$$ \tau_i(\mathbf{Z}_{\mathcal{N}(i)}) = Y_i(1, \mathbf{Z}_{\mathcal{N}(i)}) - Y_i(0, \mathbf{Z}_{\mathcal{N}(i)}) $$

which now depends on neighbors’ assignments $\mathbf{Z}_{\mathcal{N}(i)}$.

Exposure mappings (Aronow & Samii 2017): Partition assignment vectors into exposure conditions. For binary treatment with $k=1$ neighborhood:

  • Direct exposure: $Z\_i = 1$ (user gets treatment)
  • Indirect exposure: $Z\_i = 0$ but $\exists j \in \mathcal{N}(i) : Z\_j = 1$ (friend gets treatment)
  • No exposure: $Z\_i = 0$ and $Z\_j = 0 \ \forall j \in \mathcal{N}(i)$

Spillover effects:

$$ \text{Spillover}\_i = Y\_i(0, \mathbf{1}\_{\mathcal{N}(i)}) - Y\_i(0, \mathbf{0}\_{\mathcal{N}(i)}) $$

compares control users with all-treated vs. all-control neighbors.
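The three exposure conditions can be computed directly from an assignment vector and an adjacency list. The sketch below assumes a 1-hop neighborhood and uses hypothetical names (`assignment` maps user to 0/1, `neighbors` maps user to a list of friends).

```python
def exposure_condition(user, assignment, neighbors):
    """Classify a user as 'direct', 'indirect', or 'none' under a 1-hop exposure mapping."""
    if assignment[user] == 1:
        return "direct"
    if any(assignment[friend] == 1 for friend in neighbors.get(user, ())):
        return "indirect"
    return "none"

def exposure_groups(assignment, neighbors):
    """Group all users by exposure condition."""
    groups = {"direct": [], "indirect": [], "none": []}
    for user in assignment:
        groups[exposure_condition(user, assignment, neighbors)].append(user)
    return groups
```

Comparing mean outcomes of the "indirect" and "none" groups gives a rough read on spillover, with the caveat that exposure probabilities differ by degree and would need reweighting in a careful analysis.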

Consequences of ignoring interference:

  • Correlated outcomes: Users in the same social cluster have correlated metrics even under random assignment. Standard confidence intervals are too narrow.
  • Biased estimates: If treatment “leaks” to control via social connections, the measured effect underestimates the true effect.
  • Irreproducible results: Effects measured in A/B tests don’t replicate at full rollout because interference patterns change.

Mitigation strategies:

| Strategy | Approach | Trade-off |
| --- | --- | --- |
| Cluster randomization | Randomize at community/graph-cluster level instead of user level | Fewer effective units; higher variance |
| Ego-network experiments | Randomize treatment of a user’s entire ego network (friends + friends-of-friends) | Complex implementation; ethical concerns |
| Geographic randomization | Randomize by region where social graphs are denser within than across | Confounds with regional effects |
| Causal graph modeling | Explicitly model interference structure; adjust estimates | Requires correct interference model |
| Switchback experiments | Alternate treatment/control over time rather than users | Carryover effects; time confounds |

Detecting interference:

  • Distance-to-treatment analysis: Plot control users’ outcomes against their graph distance to nearest treatment user. Correlation suggests spillover.
  • Cluster-level variance: If between-cluster variance $\gg$ within-cluster variance, user-level standard errors are underestimated.
  • Rollout discontinuities: Compare estimated effect at 1% rollout vs. 50% rollout. Large discrepancies suggest interference.

For recommendation systems with strong social components, ignoring network effects can lead to shipping changes that perform worse at scale than in the A/B test—or missing changes that would have succeeded.

Long-Term Effects

Short-term engagement gains may harm long-term retention (e.g., clickbait). Platforms track:

  • Cohort retention curves: Do users exposed to the new model return at the same rate after 7/30/90 days?
  • User satisfaction surveys: NPS, CSAT, qualitative feedback.
  • Ecosystem health: Creator churn, content quality trends.

Causal Inference for Long-Term Effects

Difference-in-Differences (DiD) compares trends before and after treatment:

$$ \hat{\tau}\_{\text{DiD}} = (\bar{Y}\_{T,\text{post}} - \bar{Y}\_{T,\text{pre}}) - (\bar{Y}\_{C,\text{post}} - \bar{Y}\_{C,\text{pre}}) $$

Assumes parallel trends in absence of treatment.
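As a worked example of the DiD estimator, the sketch below takes per-unit outcomes for each group and period (function and argument names are illustrative):

```python
from statistics import mean

def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(ctrl_post) - mean(ctrl_pre)
    return treated_change - control_change
```

If treated users go from 10 to 13 sessions per week while control users go from 10 to 11, the estimate is 2: one of the three extra sessions is attributed to the shared trend rather than the treatment.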

Synthetic Control constructs a weighted combination of control units to match pre-treatment outcomes:

$$ \hat{Y}\_{T,t}^{(0)} = \sum\_{j \in \text{control}} w\_j Y\_{j,t} $$

Treatment effect: $\hat{\tau}\_t = Y\_{T,t} - \hat{Y}\_{T,t}^{(0)}$.

Instrumental Variables (IV) addresses selection bias when treatment is endogenous:

$$ \hat{\tau}\_{\text{IV}} = \frac{\text{Cov}(Y, Z)}{\text{Cov}(D, Z)} $$

where $Z$ is an instrument affecting outcome $Y$ only through treatment $D$.


Training Infrastructure

Training recommendation models at scale requires specialized infrastructure.

Data Pipeline

flowchart LR
    Logs[(Interaction Logs)] --> ETL[ETL / Feature Join]
    ETL --> Training[Training Data]
    Labels[(Label Generation)] --> Training
    Training --> Shuffle[Global Shuffle]
    Shuffle --> Shards[(Sharded TFRecords)]

Key considerations:

  • Label generation: Define positive/negative labels (e.g., click = positive, impression without click = negative). Handle implicit feedback (no explicit dislikes).
  • Negative sampling: With billions of items, most items are never shown. Sample negatives from impressions, random items, or in-batch negatives.
  • Point-in-time joins: Join features as they existed at interaction time to avoid leakage.
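A point-in-time join can be sketched as a lookup of the latest feature snapshot at or before each interaction timestamp. The example assumes snapshots for one entity are sorted by timestamp; the names are hypothetical.

```python
from bisect import bisect_right

def point_in_time_lookup(snapshots, event_ts):
    """Return the feature value valid at event_ts.

    snapshots: list of (timestamp, value) sorted by timestamp.
    Returns None if no snapshot existed yet, which avoids leaking future values.
    """
    timestamps = [ts for ts, _ in snapshots]
    idx = bisect_right(timestamps, event_ts) - 1
    return snapshots[idx][1] if idx >= 0 else None

def join_features(interactions, snapshots):
    """Attach the as-of feature value to each (timestamp, label) interaction."""
    return [(ts, label, point_in_time_lookup(snapshots, ts))
            for ts, label in interactions]
```

An interaction at time 25 joined against snapshots at times 10, 20, and 30 picks up the value from time 20, never the later one.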

Distributed Training

Models with billions of parameters and terabytes of training data require distributed training:

| Approach | Description | Use Case |
| --- | --- | --- |
| Data parallelism | Replicate model; partition data | Dense models |
| Model parallelism | Partition model across devices | Very large models |
| Embedding sharding | Distribute embedding tables across parameter servers | Large vocabulary (users, items) |
| Pipeline parallelism | Overlap forward/backward passes across micro-batches | Deep models |

Frameworks like TensorFlow and PyTorch (with DeepSpeed or FSDP), along with custom stacks (e.g., Meta’s training systems for DLRM-style models, Google’s TPU pods), enable training at this scale.

Model Compression and Serving Efficiency

Production recommendation models face strict latency and cost constraints. A model that achieves 1% higher engagement but adds 50ms of latency may degrade the user experience enough that it never ships. Compression and serving optimization are not afterthoughts; they are first-class concerns.

Quantization

Quantization reduces numerical precision, trading model accuracy for speed and memory.

Precision levels:

| Precision | Bits | Range | Speedup | Accuracy Impact |
| --- | --- | --- | --- | --- |
| FP32 (float) | 32 | ~$10^{-38}$ to $10^{38}$ | 1x (baseline) | Baseline |
| FP16 (half) | 16 | ~$10^{-8}$ to $6.5 \times 10^4$ | 2-3x | <0.5% degradation |
| INT8 (integer) | 8 | -128 to 127 | 4-5x | 1-2% degradation |
| INT4 | 4 | -8 to 7 | 8-10x | 3-5% degradation |

Quantization-aware training (QAT): Simulate quantization during training by adding fake-quant nodes. The model learns to be robust to precision loss.

Post-training quantization (PTQ): Quantize trained model weights without retraining. Requires calibration dataset to determine quantization scales:

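$$ x_{\text{int}} = \text{round}\left( \frac{x_{\text{float}} - z}{s} \right) $$

where $s$ is the scale factor and $z$ is the zero-point (the offset, here in floating-point units, that maps to integer 0). The result is clamped to the integer range, and dequantization recovers $x_{\text{float}} \approx s \cdot x_{\text{int}} + z$.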
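A minimal sketch of this affine scheme, following the formula's convention that $z$ is a floating-point offset and dequantization is $s \cdot x_{\text{int}} + z$. Calibration here simply uses the min and max of a sample, which is one simple choice among several; the function names and the unsigned output range are assumptions for illustration.

```python
def calibrate(values, n_bits=8):
    """Pick scale and zero-point so that [min, max] maps onto [0, 2^n_bits - 1]."""
    lo, hi = min(values), max(values)
    levels = 2 ** n_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return scale, lo  # the zero-point z = lo maps to integer 0

def quantize(x, scale, zero_point, n_bits=8):
    """x_int = round((x - z) / s), clamped to the unsigned integer range."""
    q = round((x - zero_point) / scale)
    return max(0, min(2 ** n_bits - 1, q))

def dequantize(q, scale, zero_point):
    """Approximate reconstruction of the original float."""
    return scale * q + zero_point
```

For in-range values the round-trip error is bounded by half the scale, which is why a calibration set that reflects the true activation range matters: outliers widen the range and coarsen every other value.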

Dynamic vs. static quantization:

  • Static: Quantize both weights and activations offline
  • Dynamic: Quantize weights offline; activations quantized on-the-fly (less speedup but better accuracy)

Per-channel quantization: Use different scales per output channel (more memory, better accuracy).

Tools: TensorFlow Lite, PyTorch Quantization, ONNX Runtime, TensorRT.

Knowledge Distillation

Train a small “student” model to mimic a large “teacher” model.

Setup:

  • Teacher: Large, accurate model (e.g., 1B parameters)
  • Student: Small, fast model (e.g., 100M parameters)
  • Training: Student learns from teacher’s soft targets (logits), not just hard labels

Distillation loss:

$$ \mathcal{L}_{\text{distill}} = \alpha \cdot \text{KL}(P_{\text{teacher}} \,\|\, P_{\text{student}}) + (1 - \alpha) \cdot \mathcal{L}_{\text{CE}}(y, P_{\text{student}}) $$

where $P$ are softmax probabilities, $y$ are ground-truth labels, and $\alpha \in [0, 1]$ balances teacher guidance vs. label supervision. The KL term is taken from teacher to student, so the student is penalized for assigning low probability where the teacher assigns high probability.

Temperature scaling: Soften probability distributions to expose dark knowledge:

$$ P\_i = \frac{\exp(z\_i / T)}{\sum\_j \exp(z\_j / T)} $$

Higher temperature $T$ (e.g., $T=3$) smooths distribution; student learns relative rankings, not just top-1.
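The loss and temperature scaling can be sketched in plain Python for a single example. This sketch uses $\text{KL}(P_{\text{teacher}} \,\|\, P_{\text{student}})$ on temperature-softened distributions, the direction used in the original distillation formulation, and omits the $T^2$ gradient rescaling that some implementations add. Function names and default values are illustrative.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax (max-subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, T=3.0):
    """alpha * KL(teacher || student) at temperature T + (1 - alpha) * cross-entropy."""
    soft = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
    hard = -math.log(softmax(student_logits)[label])  # CE against the true label
    return alpha * soft + (1 - alpha) * hard
```

Raising `T` flattens both distributions, so the soft term puts more weight on how the teacher ranks the non-top classes.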

Multi-task distillation: Teacher model outputs multiple heads (click, like, share). Student learns all tasks from teacher.

Benefits:

  • 5-10x speedup with <2% accuracy loss
  • Smaller model footprint (fits in CPU cache)
  • Easier deployment (no GPU required)

Model Pruning

Remove unimportant weights or neurons to reduce model size.

Magnitude-based pruning: Remove weights with smallest absolute values:

$$ \text{Prune}(W) = \begin{cases} W_{ij} & \text{if } |W_{ij}| > \tau \\ 0 & \text{otherwise} \end{cases} $$
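A minimal sketch of magnitude-based pruning on a weight matrix stored as nested lists, with a helper that picks $\tau$ for a target sparsity. Names are hypothetical, and ties at the threshold are pruned, so the achieved sparsity can slightly exceed the target.

```python
def threshold_for_sparsity(weights, sparsity):
    """Return tau such that roughly `sparsity` of the weights have |w| <= tau."""
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)
    return magnitudes[k - 1] if k > 0 else -1.0  # -1.0 keeps every weight

def magnitude_prune(weights, tau):
    """Zero out every weight whose absolute value does not exceed tau."""
    return [[w if abs(w) > tau else 0.0 for w in row] for row in weights]
```

For `[[0.5, -0.1], [0.05, -2.0]]` at 50% sparsity the threshold is 0.1, leaving only 0.5 and -2.0 non-zero.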

Structured pruning: Remove entire channels, layers, or attention heads (hardware-friendly).

Iterative pruning: Prune → retrain → prune → retrain. Gradual pruning maintains accuracy better than one-shot.

Pruning ratios: Recommendation models can often be pruned 30-50% with <1% accuracy loss.

Serving Infrastructure

flowchart TB
    subgraph Client ["Client Layer"]
        App[User App] --> LB[Load Balancer]
    end

    subgraph Serving ["Serving Layer"]
        LB --> Gateway[API Gateway]
        Gateway --> FeatureService[Feature Service]
        Gateway --> ModelService[Model Inference]

        FeatureService --> FeatureCache[(Redis: Feature Cache)]
        FeatureService --> FeatureStore[(Feature Store)]

        ModelService --> EmbedCache[(Embedding Cache)]
        ModelService --> ModelServer[Model Server Pool]
    end

    subgraph Backends ["Backend Compute"]
        ModelServer --> GPU1[GPU Pod 1]
        ModelServer --> GPU2[GPU Pod N]
        ModelServer --> CPU[CPU Fallback]
    end

    subgraph Offline ["Offline Pipelines"]
        Batch[Batch Jobs] --> PreComp[(Precomputed Embeddings)]
        PreComp --> EmbedCache
    end

Model serving frameworks:

| Framework | Strengths | Use Case |
| --- | --- | --- |
| TensorFlow Serving | Production-grade, model versioning, batching | TensorFlow models |
| TorchServe | PyTorch native, multi-model serving | PyTorch models |
| Triton Inference Server | Multi-framework, GPU optimization, dynamic batching | Heterogeneous stacks |
| ONNX Runtime | Cross-platform, lightweight, quantization support | Edge deployment, CPU serving |
| Ray Serve | Python-native, autoscaling, multi-model | Rapid prototyping, Python pipelines |

Caching Strategies

Embedding caches:

  • User embeddings: Cache recent users (LRU eviction). Hit rate: 70-90%.
  • Item embeddings: Cache popular items (weighted LRU). Hit rate: 80-95%.
  • Storage: Redis cluster with 100-500GB memory.

Result caches:

  • Cache final recommendations for deterministic requests (e.g., logged-out homepage).
  • TTL: 5-60 minutes depending on freshness requirements.
  • Invalidation: Triggered by user actions or model updates.

Feature caches:

  • Precompute static user/item features (demographic, historical aggregates).
  • Update cadence: hourly to daily.

Cache hit economics:

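  • Cached response: <1ms latency, $0.0001 cost per request
  • Model inference: 50ms latency, $0.01 cost per request
  • 90% cache hit rate: average cost falls to about $0.0011 per request (0.9 × $0.0001 + 0.1 × $0.01), roughly a 9x saving on compute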
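The arithmetic behind these figures is a weighted average over hits and misses. The sketch below plugs in the illustrative costs and latencies listed above (treating "<1ms" as 1ms); the function names are hypothetical.

```python
def expected_cost(hit_rate, cached_cost=0.0001, inference_cost=0.01):
    """Average cost per request given a cache hit rate."""
    return hit_rate * cached_cost + (1 - hit_rate) * inference_cost

def expected_latency_ms(hit_rate, cached_ms=1.0, inference_ms=50.0):
    """Average latency per request given a cache hit rate."""
    return hit_rate * cached_ms + (1 - hit_rate) * inference_ms

savings_factor = expected_cost(0.0) / expected_cost(0.9)  # a bit over 9x
```

At a 90% hit rate the average cost is about $0.0011 and the average latency about 5.9ms. Most of what remains comes from the 10% of misses, so further gains depend mainly on raising the hit rate or making inference itself cheaper.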

Batching and Throughput Optimization

Dynamic batching: Aggregate requests that arrive within a time window (e.g., 10ms) into a single batch.

Benefits:

  • GPU utilization: 20% (no batching) → 80% (batched)
  • Amortized kernel launch overhead
  • Higher throughput (requests/second)

Trade-off:

  • Adds queuing delay (P99 latency increases)
  • Not suitable for ultra-low-latency applications

Batch size tuning:

  • Too small: Underutilize GPU
  • Too large: OOM errors, increased latency
  • Typical: 16-128 for ranking models

GPU vs. CPU Trade-offs

| Metric | GPU | CPU |
| --- | --- | --- |
| Throughput | High (1000s req/s) | Low (10s req/s) |
| Latency (P50) | 5-20ms | 50-200ms |
| Latency (P99) | 20-50ms | 200-500ms |
| Cost per inference | $0.001-0.01 | $0.0001-0.001 |
| Idle cost | High (GPU sits idle) | Low (CPU multi-tenant) |
| Model size limit | GPU memory (16-80GB) | System memory (100s GB) |

Decision heuristic:

  • High traffic (>1M req/day): GPU worth it
  • Low traffic or bursty: CPU + autoscaling
  • Extreme latency requirements (<10ms): GPU usually required for large models; small or quantized models may still meet the budget on CPU
  • Cost-sensitive: Quantized CPU models

Autoscaling and Capacity Planning

Metrics to monitor:

  • Request rate (req/s)
  • P50/P95/P99 latency
  • GPU/CPU utilization
  • Queue depth (requests waiting)

Scaling triggers:

  • Scale up: P99 latency > SLA for >2 minutes
  • Scale down: CPU utilization < 30% for >10 minutes

Capacity planning:

  • Provision for 2x peak traffic (headroom for traffic spikes)
  • Account for zone failures (N+2 redundancy)
  • Reserve capacity for model experiments (10-20% of fleet)

Regional Deployment and Latency

Deploying models close to users reduces network latency.

| Strategy | Latency Impact | Cost Impact |
| --- | --- | --- |
| Single region | +50-200ms cross-region | Lowest (1 deployment) |
| Multi-region (replicated) | <20ms within region | 3-10x (duplicate infrastructure) |
| Edge deployment | <10ms | Highest (edge compute expensive) |

Recommendation: Multi-region for global platforms; edge for latency-critical features (e.g., real-time notifications).

Model Lifecycle Management

Production recommendation models are never “done”—they evolve continuously. Managing this lifecycle without breaking serving systems requires careful orchestration.

Model Registry and Versioning

Model registry: Centralized store for trained models with metadata:

| Metadata | Purpose | Example |
| --- | --- | --- |
| Model ID | Unique identifier | ranking-v2.3.1-20250121 |
| Training data | Dataset version, date range | interactions-2025-01-01-to-2025-01-15 |
| Metrics | Offline validation metrics | AUC: 0.742, Precision@10: 0.31 |
| Framework | TensorFlow, PyTorch, JAX | pytorch-2.1.0 |
| Lineage | Parent model, training code hash | git:abc123, parent:v2.3.0 |
| Approval status | Human-reviewed, A/B tested | approved-for-prod |

Popular tools: MLflow, Weights & Biases, proprietary systems.

Versioning strategy:

  • Semantic versioning: major.minor.patch
  • Major: Architecture changes (two-tower → transformer)
  • Minor: Feature additions, dataset updates
  • Patch: Bugfixes, retraining on same data

Shadow Traffic and Gradual Rollout

Never deploy a new model directly to 100% of users. Use staged rollout:

1. Shadow mode:

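import logging
from concurrent.futures import ThreadPoolExecutor

# prod_model, shadow_model and log_shadow_metrics are assumed to be defined elsewhere
shadow_pool = ThreadPoolExecutor(max_workers=4)

def run_shadow(user, context, recs_prod):
    # Shadow failures are logged and never reach the user
    try:
        recs_shadow = shadow_model.predict(user, context)
        log_shadow_metrics(recs_shadow, recs_prod)
    except Exception:
        logging.exception("shadow model failed")

def serve_request(user, context):
    recs_prod = prod_model.predict(user, context)  # production model
    shadow_pool.submit(run_shadow, user, context, recs_prod)  # off the request path
    return recs_prod  # user sees prod results only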

Purpose:

  • Validate shadow model latency (P99 < SLA)
  • Compare predictions (correlation, overlap)
  • Catch serving bugs before user impact

Duration: 24-72 hours to collect sufficient data.

2. Canary deployment:

Route 1-5% of traffic to new model; monitor closely:

flowchart LR
    Traffic[User Traffic] --> Router{Traffic Router}
    Router -->|"95%"| Prod[Prod Model v1]
    Router -->|"5%"| Canary[Canary Model v2]

    Prod --> Monitor[Monitoring]
    Canary --> Monitor

    Monitor --> Alert{Metrics OK?}
    Alert -->|"yes"| Proceed[Increase to 10%]
    Alert -->|"no"| Rollback[Rollback]

Metrics to watch:

  • Engagement (CTR, time spent)
  • Latency (P50, P95, P99)
  • Error rate
  • User complaints / feedback

Auto-rollback triggers:

  • P99 latency > SLA + 20%
  • Error rate > 1%
  • Engagement drop > 5%

3. Gradual rollout:

If canary looks good, increase percentage: 5% → 10% → 25% → 50% → 100% over days.
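One common way to implement the percentage ramp is deterministic hash bucketing, so a user stays in the same arm as the percentage grows. The sketch below is an assumption about how such a router could work, not a description of any particular platform; the salt string and function names are hypothetical.

```python
import hashlib

def rollout_bucket(user_id, salt="ranking-v2-rollout"):
    """Map a user to a stable bucket in [0, 100) using a salted hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100

def use_new_model(user_id, rollout_percent, salt="ranking-v2-rollout"):
    """True if the user falls inside the current rollout percentage."""
    return rollout_bucket(user_id, salt) < rollout_percent
```

Because the bucket is fixed per user, everyone included at 5% remains included at 10% and beyond, which keeps the experience consistent and makes cohort comparisons across stages cleaner. Changing the salt reshuffles users for the next rollout.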

4. A/B testing:

For major changes, run multi-week A/B test at 50/50 split to measure long-term impact before full rollout.

Rollback Procedures

Models fail in production. Fast rollback saves engagement.

Automatic rollback:

  • Healthchecks fail (OOM, crashes)
  • Latency SLA violations persist >5 minutes
  • Error rate > threshold

Manual rollback:

  • Engagement metrics drop significantly
  • User reports of bad recommendations spike
  • Discovered bug in feature computation

Rollback mechanism:

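from functools import lru_cache

# app, config and model_registry are assumed to be defined elsewhere

@lru_cache(maxsize=4)
def get_model(version):
    # Keep recently used versions in memory so requests (and rollbacks) skip the load
    return model_registry.load(version)

# Traffic router checks the active model version on every request
@app.route('/recommend/<user_id>')
def recommend(user_id):
    active_version = config.get('active_model_version')  # e.g. v2.3.1
    return get_model(active_version).predict(user_id)

# Rollback via config update (no code deploy)
config.set('active_model_version', 'v2.3.0')  # rollback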

Fast rollback: Config change + cache invalidation < 30 seconds.

Model Retirement

Old models consume storage and confuse debugging. Retirement policy:

| Model Status | Retention | Rationale |
| --- | --- | --- |
| Active production | Indefinite | Currently serving traffic |
| Previous version | 30 days | Rollback target |
| Older versions | 90 days | Debugging historical issues |
| Experimental | 7 days | Failed experiments |

Exceptions: Keep models used in published research or regulatory audits.

Multi-Model Serving

Production systems often serve multiple models simultaneously:

Use cases:

| Pattern | Example | Rationale |
| --- | --- | --- |
| Model per surface | Home feed uses model A; search uses model B | Different objectives |
| Model per user segment | New users get cold-start model; active users get personalized model | Different data availability |
| Ensemble | Rank using average of 3 models | Robustness, better accuracy |
| Market-specific | US uses model A; India uses model B | Localization |

Serving infrastructure:

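class ModelRouter:
    def __init__(self):
        # load_model is assumed to return a ready-to-serve model object
        self.models = {
            'home_feed': load_model('home-v2.1.0'),
            'search': load_model('search-v1.8.3'),
            'cold_start': load_model('coldstart-v3.0.1'),
        }

    def route(self, request):
        if request.surface == 'home':
            # Users with little history get the cold-start model
            if request.user.interaction_count < 10:
                return self.models['cold_start']
            return self.models['home_feed']
        if request.surface == 'search':
            return self.models['search']
        # Fail loudly instead of silently returning None
        raise ValueError(f"unknown surface: {request.surface!r}")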

Continuous Training and Retraining Cadence

Models degrade as user behavior shifts. Retraining keeps them fresh.

Retraining strategies:

| Frequency | Pros | Cons / Notes |
| --- | --- | --- |
| Daily | Captures latest trends | Expensive; risk of overfitting to noise |
| Weekly | Balance freshness and stability | Standard for most systems |
| Monthly | Stable; avoids churn | Stale for fast-moving platforms |
| Event-triggered | React to distribution shifts | Complex to implement |

Continuous training: Train incrementally on new data without full retraining:

  • Warm-start from previous checkpoint
  • Train only on last 7 days of data
  • Use learning rate decay to avoid catastrophic forgetting

Trade-off: Incremental training drifts from optimal; full retraining is expensive. Common pattern: incremental training weekly, full retraining monthly.

Model Deprecation and Migration

Migrating users from old to new model architecture:

Challenge: Incompatible model signatures (features changed, output format different).

Migration strategy:

  1. Dual-write phase: Log features for both old and new models
  2. Parallel serving: Serve old model; compute new model predictions in shadow
  3. Gradual cutover: Route increasing % of traffic to new model
  4. Deprecation: Remove old model after 100% migration

Duration: 4-8 weeks for major migrations to ensure stability.


Feedback Loops and Model Drift

Recommendation systems are closed-loop: the model influences which items users see, which in turn generates the training data for future models.

Positive Feedback Loops

Items recommended more often accumulate more engagement data, making them appear even more relevant. This creates rich-get-richer dynamics:

  • Popular items dominate recommendations.
  • New items struggle to gain visibility.
  • User preferences appear to converge (filter bubbles).

Mitigation Strategies

| Strategy | Description |
| --- | --- |
| Exploration | Inject random or uncertain items to gather signal |
| Propensity scoring | Weight training examples by inverse probability of being shown |
| Counterfactual learning | Train on logged data with importance sampling corrections |
| Freshness boosts | Artificially elevate new items to gather initial signal |
| Randomized experiments | Continuously run A/B tests to measure unbiased performance |

Counterfactual Learning from Logged Bandit Feedback

Logged interaction data is biased: users only see items selected by the logging policy $\pi\_0$. Naively training on this data learns to mimic $\pi\_0$ rather than optimize reward.

Inverse Propensity Scoring (IPS)

Let $\pi\_0(a | x)$ be the probability that the logging policy showed item $a$ given context $x$. The unbiased estimate of a new policy $\pi$’s value is:

$$ \hat{V}\_{\text{IPS}}(\pi) = \frac{1}{n} \sum\_{i=1}^{n} \frac{\pi(a\_i | x\_i)}{\pi\_0(a\_i | x\_i)} r\_i $$

where $r\_i$ is the observed reward. The importance weight $w\_i = \frac{\pi(a\_i | x\_i)}{\pi\_0(a\_i | x\_i)}$ corrects for selection bias.

Statistical Properties:

Unbiasedness: Under the overlap (common support) assumption:

$$ \pi(a | x) > 0 \implies \pi\_0(a | x) > 0 \quad \forall x, a $$

the IPS estimator is unbiased (Horvitz & Thompson 1952):

$$ \mathbb{E}[\hat{V}\_{\text{IPS}}(\pi)] = \mathbb{E}\_{x \sim p(x)} \left[ \mathbb{E}\_{a \sim \pi\_0(\cdot | x)} \left[ \frac{\pi(a | x)}{\pi\_0(a | x)} r(x, a) \right] \right] = \mathbb{E}\_{x, a \sim \pi} [r(x, a)] = V(\pi) $$

Variance: The variance grows with the mismatch between policies. Writing $w(x, a) = \frac{\pi(a | x)}{\pi_0(a | x)}$ and $r(x, a) = \mathbb{E}[r | x, a]$, the law of total variance gives:

$$ \text{Var}(\hat{V}_{\text{IPS}}) = \frac{1}{n} \mathbb{E}_{x \sim p(x),\, a \sim \pi_0(\cdot | x)} \left[ w(x, a)^2 \, \text{Var}(r | x, a) \right] + \frac{1}{n} \text{Var}_{x \sim p(x),\, a \sim \pi_0(\cdot | x)} \left[ w(x, a) \, r(x, a) \right] $$

The first term is reward noise amplified by the squared weights; the second is the variability of the weighted mean reward across contexts and logged actions.

Key insight: Variance explodes when $\pi\_0(a | x) \to 0$ for actions where $\pi(a | x)$ is large. In the worst case, $\text{Var}(\hat{V}\_{\text{IPS}}) = O(n^{-1} w\_{\max}^2)$ where $w\_{\max} = \max\_{x, a} \frac{\pi(a | x)}{\pi\_0(a | x)}$.

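Concentration: By Hoeffding’s inequality, if rewards are bounded $r \in [0, R]$ and importance weights are bounded $w \leq M$, then with probability at least $1 - \delta$:

$$ |\hat{V}_{\text{IPS}}(\pi) - V(\pi)| \leq MR \sqrt{\frac{\ln(2/\delta)}{2n}} $$

If the bound on $w$ is imposed by clipping rather than satisfied by the policies themselves, the same inequality holds around the expectation of the clipped estimator, which differs from $V(\pi)$ by the clipping bias.

Positivity violation: If overlap fails ($\pi(a | x) > 0$ but $\pi_0(a | x) = 0$ for some $(x, a)$), the weight is undefined and the action never appears in the logs, so IPS silently omits its contribution and is biased. The estimate is informative only on the region where both policies overlap.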

Clipped IPS and Self-Normalized IPS

To reduce variance, clipped IPS bounds the importance weights:

$$ \hat{V}\_{\text{clipped}}(\pi) = \frac{1}{n} \sum\_{i=1}^{n} \min\left( M, \frac{\pi(a\_i | x\_i)}{\pi\_0(a\_i | x\_i)} \right) r\_i $$

Self-normalized IPS normalizes by the sum of weights:

$$ \hat{V}\_{\text{SNIPS}}(\pi) = \frac{\sum\_{i=1}^{n} w\_i r\_i}{\sum\_{i=1}^{n} w\_i} $$

This is biased but has lower variance and is invariant to scaling of propensities.
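The three estimators differ only in how the importance weights are used. A minimal sketch, assuming each log record is a tuple `(pi_new, pi_0, reward)` holding the target policy's probability of the logged action, the logging propensity, and the observed reward (a hypothetical record layout):

```python
def importance_weights(logs):
    """w_i = pi(a_i | x_i) / pi_0(a_i | x_i) for each logged record."""
    return [pi_new / pi_0 for pi_new, pi_0, _ in logs]

def ips(logs):
    """Vanilla IPS: unbiased under overlap, potentially high variance."""
    weights = importance_weights(logs)
    return sum(w * r for w, (_, _, r) in zip(weights, logs)) / len(logs)

def clipped_ips(logs, max_weight):
    """Clipped IPS: caps each weight at max_weight (M), trading bias for variance."""
    weights = importance_weights(logs)
    return sum(min(max_weight, w) * r for w, (_, _, r) in zip(weights, logs)) / len(logs)

def snips(logs):
    """Self-normalized IPS: divides by the sum of weights instead of n."""
    weights = importance_weights(logs)
    return sum(w * r for w, (_, _, r) in zip(weights, logs)) / sum(weights)
```

With 0/1 rewards, vanilla and clipped IPS can return values above 1 when large weights land on rewarded records, whereas SNIPS always stays within the observed reward range because it is a weighted average.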

Doubly Robust Estimation

Doubly robust (DR) combines a reward model $\hat{r}(x, a)$ with IPS:

$$ \hat{V}\_{\text{DR}}(\pi) = \frac{1}{n} \sum\_{i=1}^{n} \left[ \hat{r}(x\_i, \pi) + w\_i (r\_i - \hat{r}(x\_i, a\_i)) \right] $$

where $\hat{r}(x_i, \pi) = \sum_a \pi(a | x_i) \hat{r}(x_i, a)$.
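A minimal sketch of the DR estimator for a small discrete action set. The record layout `(x, a, r, pi_0)` and the callables `reward_model(x, a)` and `target_policy(a, x)` are hypothetical interfaces chosen for illustration.

```python
def doubly_robust(logs, actions, reward_model, target_policy):
    """DR estimate of the target policy's value.

    logs: list of (x, a, r, pi_0) with pi_0 the logging propensity of a given x.
    reward_model(x, a): predicted reward r_hat(x, a).
    target_policy(a, x): probability pi(a | x).
    """
    total = 0.0
    for x, a, r, pi_0 in logs:
        # Direct-method term: model-predicted reward under the target policy
        direct = sum(target_policy(b, x) * reward_model(x, b) for b in actions)
        # IPS correction applied to the model's residual on the logged action
        w = target_policy(a, x) / pi_0
        total += direct + w * (r - reward_model(x, a))
    return total / len(logs)
```

Two sanity checks follow from the formula: with a perfect reward model the correction term is zero and DR equals the direct-method estimate, and with a reward model that always predicts zero DR reduces to plain IPS.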

Statistical properties (Dudík et al. 2011):

Double robustness: The estimator is unbiased if either the propensity model $\pi_0$ or the reward model $\hat{r}$ is correctly specified:

$$ \mathbb{E}[\hat{V}\_{\text{DR}}(\pi)] = V(\pi) \quad \text{if } \pi\_0 = \pi\_0^{\*} \text{ or } \hat{r} = r^{\*} $$

This follows from:

$$ \mathbb{E}[\hat{V}\_{\text{DR}}(\pi)] = \mathbb{E}\_{x} \left[ \hat{r}(x, \pi) \right] + \mathbb{E}\_{x, a \sim \pi\_0} \left[ w(x, a) (r(x, a) - \hat{r}(x, a)) \right] $$

If $\hat{r} = r^{\*}$, the second term vanishes. If $\pi\_0 = \pi\_0^{\*}$, the second term equals $\mathbb{E}\_{x, a \sim \pi} [r(x, a)] - \mathbb{E}\_{x} [\hat{r}(x, \pi)]$ which corrects for model error.

Variance: When the propensities are known and $\hat{r}$ is close to $r$, the variance of DR is approximately:

$$ \text{Var}(\hat{V}\_{\text{DR}}) = \frac{1}{n} \mathbb{E}\_{x, a \sim \pi\_0} \left[ w^2(x, a) (r(x, a) - \hat{r}(x, a))^2 \right] + \frac{1}{n} \text{Var}\_{x} [\hat{r}(x, \pi)] $$

Key insight: When $\hat{r}$ is accurate, the first term (weighted residual variance) is small, giving DR much lower variance than IPS. The variance scales with the squared residuals weighted by $w^2$, not the squared rewards.

Variance reduction condition: DR has lower MSE than IPS when:

$$ \mathbb{E}\_{x, a \sim \pi\_0} \left[ w^2(x, a) \cdot (\hat{r}(x, a) - r(x, a))^2 \right] < \mathbb{E}\_{x, a \sim \pi\_0} \left[ w^2(x, a) \cdot r^2(x, a) \right] - \left( \mathbb{E}\_{x, a \sim \pi} [r(x, a)] \right)^2 $$

In practice, this holds when the reward model captures even 20-30% of the variance in rewards.

Covariance structure: The covariance between model error and importance weights affects bias-variance tradeoff:

$$ \text{Cov}(\hat{r}(x, a), w(x, a)) \neq 0 \implies \text{additional bias/variance} $$

If the reward model is trained on the logged data, model errors may correlate with propensity scores (e.g., model overfits high-propensity actions), introducing subtle bias.

Policy Learning from Logged Data

To learn a policy directly, optimize the IPS-weighted objective:

$$ \hat{\pi} = \arg\max\_{\pi \in \Pi} \frac{1}{n} \sum\_{i=1}^{n} \frac{\pi(a\_i | x\_i)}{\pi\_0(a\_i | x\_i)} r\_i $$

Regularization and variance reduction are critical for stable optimization.

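POEM (Policy Optimizer for Exponential Models; Swaminathan & Joachims 2015) adds an empirical variance penalty to the IPS objective. Its self-normalized variant (Norm-POEM) optimizes:

$$ \hat{\pi} = \arg\max_{\pi \in \Pi} \frac{\sum_{i=1}^{n} w_i r_i}{\sum_{i=1}^{n} w_i} - \lambda \cdot \text{Var}_w $$

where $\text{Var}_w$ denotes the empirical variance of the importance-weighted rewards and $\lambda$ sets the strength of the penalty.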

Position Bias Correction

In recommendation, position affects click probability independently of relevance. Let $P(\text{click} | \text{relevant}, \text{position } k) = \theta\_k$ be the position-dependent examination probability. The observed click rate is:

$$ P(\text{click}) = P(\text{relevant}) \cdot P(\text{examined} | \text{position}) $$

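Click models estimate $\theta_k$ (directly in the position-based model; the cascade and dependent click models instead derive examination from clicks at earlier positions), and the estimate is used to debias training:

$$ \tilde{r}_i = \frac{r_i}{\hat{\theta}_{k_i}} $$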

This inverse propensity weighting corrects for position bias.
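A minimal sketch of this reweighting, assuming examination probabilities have already been estimated and are available as a dict from position to $\hat{\theta}_k$. The floor `min_theta` is an added assumption (not part of the formula above) that caps the weight on rarely examined positions to keep variance in check.

```python
def debias_clicks(clicks, positions, theta, min_theta=0.05):
    """Reweight each click by the inverse examination probability of its position.

    clicks: list of 0/1 click labels.
    positions: list of display positions, aligned with clicks.
    theta: dict mapping position -> estimated examination probability.
    """
    return [click / max(theta[pos], min_theta)
            for click, pos in zip(clicks, positions)]
```

A click at a position examined only a quarter of the time counts four times as much as a click at the top position, compensating for the impressions at that position that were never actually seen.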

Model Drift

User preferences and content distribution shift over time. Models degrade if not retrained:

  • Concept drift: The relationship between features and engagement changes.
  • Data drift: Feature distributions shift (e.g., new content formats, user demographics).

Continuous training pipelines retrain models daily or even hourly on fresh data. Monitoring systems track prediction calibration and trigger alerts on drift.

flowchart LR
    Production[Production Traffic] --> Logs[(Interaction Logs)]
    Logs --> Pipeline[Training Pipeline]
    Pipeline --> NewModel[New Model]
    NewModel --> Validation[Validation]
    Validation -->|pass| Deploy[Model Serving]
    Validation -->|fail| Alert[Alert / Rollback]
    Deploy --> Production

Cold Start and Exploration

New users and new items lack interaction history, making personalization difficult. This isn’t a corner case—it’s a constant reality. Platforms with growth add millions of new users monthly, and content platforms ingest thousands of new items per hour. At any given moment, a significant fraction of traffic involves cold entities.

The cold start problem is actually three distinct problems:

  1. New user cold start: A user signs up with no interaction history. What do you show them on their first session? Their first impression determines whether they become an active user or churn.

  2. New item cold start: A creator uploads content, a seller lists a product, or a news article is published. Without engagement data, collaborative filtering produces no signal. The item sits in a chicken-and-egg trap: it can’t rank well without engagement, and it can’t get engagement without ranking well.

  3. System cold start: A new recommendation system launches with no historical data at all. This is rare but occurs when entering new markets or building entirely new product surfaces.

Why cold start is hard:

Collaborative filtering—the backbone of most recommendation systems—relies on the assumption that users who agreed in the past will agree in the future. But cold entities have no past. Content-based methods provide a fallback, but they typically underperform collaborative methods by 10-30% in engagement metrics.

The business stakes:

| Scenario | Impact |
| --- | --- |
| Poor new user experience | 40-60% of users who churn do so in their first week |
| New item neglect | Creators leave platforms where their content doesn’t get discovered |
| Stale catalog dominance | Popular items accumulate engagement, new items can’t compete |

The solutions involve careful orchestration of exploration budgets, content understanding, and graceful degradation strategies.

New User Cold Start

Bayesian perspective: A new user $u$ has unknown preference parameters $\boldsymbol{\theta}_u$. Each strategy corresponds to choosing a different prior distribution $P(\boldsymbol{\theta}_u | \text{context})$:

| Strategy | Bayesian Interpretation | Prior Strength |
| --- | --- | --- |
| Onboarding surveys | $P(\boldsymbol{\theta}\_u \mid \text{stated interests})$: Strong informative prior from explicit preferences | High variance reduction: $\sigma^2\_{\text{prior}} \approx 0.3\sigma^2\_{\text{pop}}$ |
| Demographic priors | $P(\boldsymbol{\theta}\_u \mid \text{age, location})$: Hierarchical model conditions on demographics | Medium: $\sigma^2\_{\text{prior}} \approx 0.6\sigma^2\_{\text{pop}}$ |
| Social bootstrapping | $P(\boldsymbol{\theta}\_u \mid \boldsymbol{\theta}\_{\text{friends}})$: Prior centers on social network’s preferences | High for dense networks: $\sigma^2\_{\text{prior}} \approx 0.4\sigma^2\_{\text{pop}}$ |
| Exploration-heavy | $P(\boldsymbol{\theta}\_u) = \mathcal{N}(\boldsymbol{\mu}\_{\text{pop}}, \boldsymbol{\Sigma}\_{\text{pop}})$: Broad uninformative prior, rapid learning | Low: $\sigma^2\_{\text{prior}} = \sigma^2\_{\text{pop}}$ |
| Popularity fallback | $P(\boldsymbol{\theta}\_u) = \delta(\boldsymbol{\mu}\_{\text{pop}})$: Point estimate at population mean | Maximum bias, zero variance |

Prior-data tradeoff: The expected squared error after $k$ interactions is:

$$ \text{MSE}(k) = \underbrace{\text{Bias}^2(\text{prior})}\_{\text{wrong prior}} + \underbrace{O(d/k)}\_{\text{estimation error}} $$

Strong priors reduce initial error but introduce bias if misspecified, and that bias takes more interactions to wash out. Weak priors start with higher error but let the observed data dominate sooner.
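
To make the tradeoff concrete, here is a small sketch for the scalar case: one preference value, a Gaussian prior, and Gaussian rating noise. The function name, the noise variance, and the idealized noise-free ratings are assumptions for illustration only.

```python
def posterior_mean(prior_mean: float, prior_var: float,
                   ratings: list, noise_var: float = 1.0) -> float:
    """Posterior mean of a scalar preference under a Gaussian prior and Gaussian noise."""
    prior_precision = 1.0 / prior_var
    data_precision = len(ratings) / noise_var
    weighted = prior_mean * prior_precision + sum(ratings) / noise_var
    return weighted / (prior_precision + data_precision)


# True preference is 1.0, but both priors are centered on 0.0 (misspecified).
ratings = [1.0] * 4
strong = posterior_mean(0.0, prior_var=0.25, ratings=ratings)  # prior precision 4
weak = posterior_mean(0.0, prior_var=4.0, ratings=ratings)     # prior precision 0.25
print(strong, weak)
```

After four interactions the strong prior still holds the estimate halfway between the wrong prior mean and the data, while the weak prior has moved most of the way to the observed value.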

New Item Cold Start

Bayesian perspective: A new item $i$ has unknown quality/appeal parameters $\mathbf{v}_i$. Prior distributions leverage side information:

| Strategy | Bayesian Interpretation | Prior Precision |
| --- | --- | --- |
| Content-based features | $P(\mathbf{v}\_i \mid \mathbf{c}\_i) = \mathcal{N}(\mathbf{M} \mathbf{c}\_i, \boldsymbol{\Sigma}\_{\text{content}})$ where $\mathbf{c}\_i$ are content features | Depends on content informativeness: $R^2 \in [0.3, 0.7]$ typically |
| Creator signals | $P(\mathbf{v}\_i \mid \text{creator}\_i) = \mathcal{N}(\bar{\mathbf{v}}\_{\text{creator}}, \boldsymbol{\Sigma}\_{\text{within-creator}})$ | High for established creators: $\sigma^2\_{\text{within}} \ll \sigma^2\_{\text{pop}}$ |
| Exploration allocation | Allocate traffic $\propto \text{Var}(\mathbf{v}\_i)$ to reduce posterior uncertainty | Information gain optimization: $\arg\max\_i H[P(\mathbf{v}\_i)]$ |
| Bandits (UCB/Thompson) | Upper confidence bound: $\hat{r}\_i + \beta \sigma\_i$ where $\sigma\_i^2 = \text{Var}(\mathbf{v}\_i \mid \mathcal{D}\_i)$ | Exploration bonus shrinks as $\sigma\_i^2 \to 0$ |
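
The bandit row can be sketched with a deliberately simplified model: instead of a Gaussian posterior over an embedding, each item gets a Beta posterior over its click rate. That substitution, the uniform Beta(1, 1) prior, and the value of `beta` are assumptions made to keep the example short.

```python
import math
import random


def beta_posterior(clicks: int, impressions: int, a0: float = 1.0, b0: float = 1.0):
    """Beta(a, b) posterior over an item's click rate under a Beta(a0, b0) prior."""
    return a0 + clicks, b0 + impressions - clicks


def ucb_score(clicks: int, impressions: int, beta: float = 2.0) -> float:
    """Posterior mean plus beta posterior standard deviations."""
    a, b = beta_posterior(clicks, impressions)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean + beta * math.sqrt(var)


def thompson_score(clicks: int, impressions: int, rng: random.Random) -> float:
    """One posterior sample; ranking by samples explores uncertain items."""
    a, b = beta_posterior(clicks, impressions)
    return rng.betavariate(a, b)


# Same observed click rate (10%), very different amounts of evidence.
print(ucb_score(1, 10), ucb_score(1000, 10000))
```

The item with 10 impressions gets a much larger exploration bonus than the item with 10,000, and the bonus shrinks as impressions accumulate.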

Content prior strength: Define the content prior ratio:

$$ \rho\_{\text{content}} = \frac{\text{Var}(\mathbb{E}[\mathbf{v}\_i | \mathbf{c}\_i])}{\text{Var}(\mathbf{v}\_i)} $$
  • $\rho \to 1$: Content fully predicts quality (e.g., news headlines, product specs)
  • $\rho \to 0$: Content uninformative (e.g., abstract art, niche humor)

Sample complexity for new items: To achieve prediction accuracy $\epsilon$, need:

$$ k\_i = O\left( \frac{d(1 - \rho\_{\text{content}})}{\epsilon^2} \right) \quad \text{impressions} $$

High $\rho_{\text{content}}$ (strong content features) dramatically reduces cold-start sample requirements.
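
One way to estimate the content prior ratio from logged data, sketched under strong simplifying assumptions: quality is a scalar and the only content feature is a category label, so the ratio becomes between-category variance over total variance. The function names are hypothetical, and `relative_impressions` evaluates the bound only up to its unknown constant.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def content_prior_ratio(categories, qualities):
    """Var(E[v | c]) / Var(v) for a scalar quality v and a categorical feature c."""
    groups = defaultdict(list)
    for c, v in zip(categories, qualities):
        groups[c].append(v)
    overall = fmean(qualities)
    # Between-group variance, weighting each group mean by its size.
    between = sum(len(g) * (fmean(g) - overall) ** 2 for g in groups.values()) / len(qualities)
    return between / pvariance(qualities)


def relative_impressions(rho, d, eps):
    """d * (1 - rho) / eps^2, i.e. the bound up to its unknown constant."""
    return d * (1.0 - rho) / eps ** 2


print(content_prior_ratio(["a", "a", "b", "b"], [1.0, 1.0, 3.0, 3.0]))  # 1.0
print(relative_impressions(0.8, d=64, eps=0.1) / relative_impressions(0.2, d=64, eps=0.1))
```

Under the bound, moving from a ratio of 0.2 to 0.8 cuts the required impressions to roughly a quarter.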

Hybrid Approaches

Hybrid models combine collaborative signals (when available) with content-based features (always available). As interaction data accumulates, the model smoothly transitions from content-based to collaborative predictions.
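
A minimal sketch of such a transition, assuming a simple count-based blending weight. The weight `n / (n + k0)` and the value of `k0` are illustrative choices, not a method stated in this article; production systems often learn the blend instead.

```python
def hybrid_score(content_score: float, collab_score: float,
                 n_interactions: int, k0: float = 20.0) -> float:
    """Blend that shifts weight to the collaborative score as data accumulates."""
    alpha = n_interactions / (n_interactions + k0)  # 0 when cold, approaches 1 when warm
    return (1.0 - alpha) * content_score + alpha * collab_score


for n in (0, 20, 2000):
    print(n, hybrid_score(content_score=0.2, collab_score=0.8, n_interactions=n))
```

With no interactions the score is purely content-based; at `n = k0` the two signals are weighted equally.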

Bayesian Cold-Start Framework

Cold-start can be formalized as Bayesian inference under uncertainty (Agarwal & Chen 2009; Stern et al. 2009). New users/items have unknown parameters; we maintain posterior distributions and update as data arrives.

Hierarchical Bayesian Model for New Users

Prior specification: New user $u$ has latent preference vector $\boldsymbol{\theta}\_u \in \mathbb{R}^d$. Without interaction data, use a population-level prior:

$$ \boldsymbol{\theta}\_u \sim \mathcal{N}(\boldsymbol{\mu}\_{\text{pop}}, \boldsymbol{\Sigma}\_{\text{pop}}) $$

where $\boldsymbol{\mu}\_{\text{pop}}$ and $\boldsymbol{\Sigma}\_{\text{pop}}$ are learned from existing users.

Hierarchical structure: For users with demographic features $\mathbf{x}\_u$, use feature-dependent priors:

$$ \boldsymbol{\theta}\_u | \mathbf{x}\_u \sim \mathcal{N}(\mathbf{W} \mathbf{x}\_u, \boldsymbol{\Sigma}) $$

where $\mathbf{W}$ maps demographics to expected preferences.

Posterior update: After observing interactions $\mathcal{D}\_u = \{(i\_1, r\_1), \ldots, (i\_k, r\_k)\}$, update via Bayes’ rule:

$$ P(\boldsymbol{\theta}\_u | \mathcal{D}\_u) \propto P(\mathcal{D}\_u | \boldsymbol{\theta}\_u) P(\boldsymbol{\theta}\_u | \mathbf{x}\_u) $$

For linear Gaussian models, this yields closed-form posteriors. For matrix factorization:

$$ P(\boldsymbol{\theta}\_u | \mathcal{D}\_u) = \mathcal{N}(\boldsymbol{\mu}\_u^{\text{post}}, \boldsymbol{\Sigma}\_u^{\text{post}}) $$

where $\sigma^2$ is the observation noise variance and:

$$ \boldsymbol{\Sigma}\_u^{\text{post}} = \left( \boldsymbol{\Sigma}^{-1} + \sum\_{i \in \mathcal{D}\_u} \mathbf{v}\_i \mathbf{v}\_i^\top / \sigma^2 \right)^{-1} $$

$$ \boldsymbol{\mu}\_u^{\text{post}} = \boldsymbol{\Sigma}\_u^{\text{post}} \left( \boldsymbol{\Sigma}^{-1} \mathbf{W} \mathbf{x}\_u + \sum\_{i \in \mathcal{D}\_u} r\_i \mathbf{v}\_i / \sigma^2 \right) $$

Recommendation: For new user, predict expected rating:

$$ \hat{r}\_{ui} = \mathbb{E}[\boldsymbol{\theta}\_u^\top \mathbf{v}\_i | \mathcal{D}\_u] = \boldsymbol{\mu}\_u^{\text{post} \top} \mathbf{v}\_i $$

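Uncertainty quantification: The posterior covariance $\boldsymbol{\Sigma}\_u^{\text{post}}$ captures uncertainty. The predictive variance of item $i$ is $\mathbf{v}\_i^\top \boldsymbol{\Sigma}\_u^{\text{post}} \mathbf{v}\_i$; items where it is high can be prioritized for exploration, for example by scoring with a sample of $\boldsymbol{\theta}\_u$ drawn from the posterior (Thompson Sampling).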
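
The posterior update above can be applied one interaction at a time. The sketch below uses the Sherman-Morrison identity, which gives the same result as the batch formulas without inverting a matrix. It assumes fixed, known item embeddings and a symmetric covariance; the helper names and the two-dimensional prior are illustrative.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mat_vec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def update_posterior(mu: Vector, sigma: Matrix, v: Vector, r: float,
                     noise_var: float = 1.0) -> Tuple[Vector, Matrix]:
    """Fold one observation (item embedding v, rating r) into N(mu, sigma)."""
    sv = mat_vec(sigma, v)                   # Sigma v
    s = noise_var + dot(v, sv)               # predictive variance of r
    residual = r - dot(mu, v)
    new_mu = [m + g * residual / s for m, g in zip(mu, sv)]
    # Sherman-Morrison: (Sigma^-1 + v v^T / noise_var)^-1 without a matrix inverse.
    new_sigma = [[sigma[i][j] - sv[i] * sv[j] / s for j in range(len(v))]
                 for i in range(len(v))]
    return new_mu, new_sigma


def predict(mu: Vector, sigma: Matrix, v: Vector) -> Tuple[float, float]:
    """Expected rating mu^T v and its variance v^T Sigma v (excluding noise)."""
    return dot(mu, v), dot(v, mat_vec(sigma, v))


# Prior: W x_u assumed to be the zero vector, Sigma the identity (illustrative values).
mu, sigma = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
mu, sigma = update_posterior(mu, sigma, v=[1.0, 0.0], r=1.0)
print(mu, sigma)  # [0.5, 0.0] [[0.5, 0.0], [0.0, 1.0]]
```

After one rating along the first axis, uncertainty halves in that direction and is unchanged in the other, so an item aligned with the second axis still has predictive variance 1.0 and remains the better exploration candidate.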

Sample Complexity Analysis

Question: How many interactions $k$ are needed before the cold-start user’s predictions match warm-start quality?

Bound: Under linear Gaussian model, the prediction error decreases as:

$$ \mathbb{E}\left[ \|\hat{r}\_u - r\_u\|^2 \right] = O\left( \frac{d}{k} \right) $$

where $d$ is the latent dimension. To match warm-start error, need $k = \Omega(d)$ interactions.

Implication: High-dimensional user embeddings ($d > 100$) require substantial data before surpassing population priors. Content-based features reduce effective dimension.

Bayesian New Item Model

Similarly, for new item $i$ with content features $\mathbf{c}\_i$:

$$ \mathbf{v}\_i | \mathbf{c}\_i \sim \mathcal{N}(\mathbf{M} \mathbf{c}\_i, \boldsymbol{\Sigma}\_{\text{item}}) $$

where $\mathbf{M}$ is a learned content-to-embedding matrix.

Content-based prior strength: The ratio:

$$ \rho = \frac{\|\mathbf{M} \mathbf{c}\_i\|}{\|\boldsymbol{\Sigma}\_{\text{item}}\|} $$

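controls prior strength. High $\rho$ means content features are informative; low $\rho$ means high uncertainty. This per-item ratio is a signal-to-noise heuristic with no upper bound, so it is not the same quantity as the variance-explained ratio $\rho\_{\text{content}}$ defined under New Item Cold Start.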

Active learning: Select new items to show based on information gain:

$$ i^* = \arg\max\_i \text{IG}(\mathbf{v}\_i | \mathcal{D}) = \arg\max\_i \left[ H(\mathbf{v}\_i | \mathcal{D}\_{\text{before}}) - H(\mathbf{v}\_i | \mathcal{D}\_{\text{after}}) \right] $$

where $H(\cdot)$ is entropy. This favors items with high uncertainty whose embeddings will be refined most by feedback.
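
For a Gaussian item posterior and a linear-Gaussian rating model, the entropy difference has a closed form: showing item $i$ to a user with known preferences $\boldsymbol{\theta}$ yields $\tfrac{1}{2} \ln(1 + \boldsymbol{\theta}^\top \boldsymbol{\Sigma}\_i \boldsymbol{\theta} / \sigma^2)$ nats. The sketch below uses that expression; treating the user vector as known, and the function and item names, are assumptions for illustration.

```python
import math


def information_gain(theta, sigma_item, noise_var=1.0):
    """Entropy reduction (nats) in v_i from one rating by a user with known preferences theta."""
    sigma_theta = [sum(s * t for s, t in zip(row, theta)) for row in sigma_item]
    projected_var = sum(t * st for t, st in zip(theta, sigma_theta))
    return 0.5 * math.log(1.0 + projected_var / noise_var)


def pick_item(theta, item_covariances, noise_var=1.0):
    """Return the item id with the largest expected information gain."""
    return max(item_covariances,
               key=lambda i: information_gain(theta, item_covariances[i], noise_var))


# Hypothetical posteriors: a brand-new item and one with plenty of feedback.
covariances = {
    "fresh": [[1.0, 0.0], [0.0, 1.0]],
    "seen": [[0.1, 0.0], [0.0, 0.1]],
}
print(pick_item([1.0, 0.0], covariances))  # fresh
```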

Practical Approximations

Full Bayesian inference is intractable at scale. Practical systems use:

  • MAP estimation: Replace full posterior with point estimate (mode)
  • Variational inference: Approximate posterior with simpler distribution (mean-field)
  • Particle filters: Represent posterior via Monte Carlo samples
  • Neural amortization: Train neural networks to predict posteriors directly from data

Internationalization and Cross-Market Challenges

Expanding to new countries or languages presents a form of “market cold start”—no local data, different user preferences, distinct content ecosystems. Global platforms must handle this systematically.

Language and Content Understanding

Multilingual embeddings:

  • Cross-lingual transfer: Train models on high-resource languages (English, Chinese), transfer to low-resource languages (Swahili, Tagalog)
  • Multilingual encoders: mBERT, XLM-R, LaBSE produce shared embedding spaces across 100+ languages
  • Language-specific fine-tuning: General multilingual models underperform language-specific models by 5-15% but are practical when labeled data is scarce

Translation challenges:

  • Machine translation quality varies (English↔Spanish: high quality; English↔Khmer: moderate)
  • Idiomatic expressions, slang, and cultural references don’t translate directly
  • Translation adds latency (50-200ms per item)

Content availability:

  • Content gap: Popular items in US may not exist in Indonesia
  • Local content bootstrapping: Incentivize local creators; can’t rely on cross-border content alone
  • Language imbalance: English content dominates; other languages are underrepresented

Cross-Cultural Preferences

User behavior varies significantly across cultures:

| Dimension | Example Variation |
| --- | --- |
| Content preferences | US: short-form video; Japan: manga and text posts; India: regional language content |
| Engagement patterns | Some cultures share openly; others lurk and consume passively |
| Social graph structure | Tight family networks (Middle East) vs. loose acquaintance networks (US) |
| Trust signals | Verified badges matter more in low-trust environments |

Implications:

  • Can’t train one global model and assume it works everywhere
  • Engagement metrics have different distributions across markets
  • Content moderation policies must respect local norms

Market-Specific Models vs. Shared Models

| Approach | Pros | Cons |
| --- | --- | --- |
| Single global model | Simplicity; cross-market transfer learning | Underperforms in each market; ignores cultural nuance |
| Per-market models | Optimized for local preferences | Requires data/compute per market; no transfer learning; cold start for new markets |
| Market adapters | Shared backbone + lightweight market-specific layers | Best of both worlds |

Market adapter pattern:

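import torch
from torch import nn


class MarketAdaptedRanker(nn.Module):
    """Shared backbone plus one lightweight adapter per market."""

    def __init__(self, backbone: nn.Module, markets=("US", "IN", "BR"), dim=512):
        super().__init__()  # must run before submodules are assigned
        # Shared across markets, e.g. TransformerEncoder(...)
        self.shared_backbone = backbone
        # nn.ModuleDict registers adapter weights; a plain dict would hide
        # them from the optimizer, .to(device), and state_dict().
        self.market_adapters = nn.ModuleDict(
            {market: nn.Linear(dim, dim) for market in markets}
        )
        # Assumed scoring head: maps the adapted representation to one score.
        self.score_head = nn.Linear(dim, 1)

    def forward(self, features: torch.Tensor, market: str) -> torch.Tensor:
        shared_repr = self.shared_backbone(features)
        adapted = self.market_adapters[market](shared_repr)
        return self.score_head(adapted)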

Training strategy: Pre-train on all markets; fine-tune adapters per-market.

Regulatory and Infrastructure Challenges

Data residency: GDPR (EU), LGPD (Brazil), and local laws may require data to stay in-country. This fragments training data and complicates model updates.

Content regulations: What’s acceptable in one country may be illegal in another:

  • Political speech restrictions (China, Middle East)
  • Hate speech definitions vary
  • Misinformation standards differ

Infrastructure: Low-bandwidth regions (rural India, sub-Saharan Africa) require:

  • Lightweight models (quantized, distilled)
  • Aggressive caching
  • Offline-first design

Launch Strategy for New Markets

Phase 1: Pre-launch (before local users)

  1. Deploy content-based models (no interaction data needed)
  2. Bootstrap content from creators in similar markets
  3. Translate popular global content
  4. Set up local trust & safety team

Phase 2: Soft launch (limited users)

  1. Invite local influencers and creators
  2. Collect interaction data
  3. Train initial collaborative models
  4. Test culturally-specific features

Phase 3: Scale (general availability)

  1. Switch to hybrid models (content + collaborative)
  2. Deploy market-specific ranking
  3. Monitor engagement, retention, creator health
  4. Iterate based on local feedback

Metrics to track:

  • New user activation (% who engage in first session)
  • Creator supply (posts per day)
  • Content diversity (avoid relying on translated content)
  • Localization bugs (date/time formats, currency, RTL layout)
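
Two of these metrics are straightforward to compute from logs. The sketch below assumes a hypothetical schema: a mapping from new user id to first-session engagement count, and impression records carrying an `origin` field that is either `"local"` or `"translated"`.

```python
from typing import Dict, List


def activation_rate(first_session_events: Dict[str, int]) -> float:
    """Share of new users with at least one engagement in their first session."""
    engaged = sum(1 for n in first_session_events.values() if n > 0)
    return engaged / len(first_session_events)


def local_content_share(impressions: List[dict]) -> float:
    """Share of impressions served from locally created (not translated) content."""
    local = sum(1 for imp in impressions if imp["origin"] == "local")
    return local / len(impressions)


print(activation_rate({"u1": 3, "u2": 0, "u3": 1, "u4": 0}))  # 0.5
```

A falling local content share after launch would suggest the market still depends on translated content.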

Cross-Border Content Recommendations

Should a US user see content from India? Depends:

Arguments for:

  • Serendipity: Exposure to global perspectives
  • Content supply: More content = better recommendations
  • Creator reach: Helps creators find international audiences

Arguments against:

  • Language barriers: Most users don’t want non-native language content
  • Cultural relevance: Content may not resonate
  • Latency: Fetching cross-region content adds latency

Heuristic: Allow cross-border for visual content (short videos, images with minimal text), restrict for text-heavy content unless user explicitly engages with that language.
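
The heuristic translates into a short eligibility filter. The `Item` and `User` fields below are hypothetical, and how "text-heavy" and "explicitly engages" are measured is left as an assumption.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Item:
    market: str
    language: str
    text_heavy: bool


@dataclass
class User:
    market: str
    engaged_languages: Set[str] = field(default_factory=set)


def allow_cross_border(item: Item, user: User) -> bool:
    """Apply the visual-versus-text rule to one candidate item."""
    if item.market == user.market:
        return True  # not cross-border, nothing to restrict
    if not item.text_heavy:
        return True  # short videos and images with minimal text
    # Text-heavy content only if the user has engaged with its language.
    return item.language in user.engaged_languages


us_user = User(market="US", engaged_languages={"en"})
print(allow_cross_border(Item("IN", "hi", text_heavy=False), us_user))  # True
print(allow_cross_border(Item("IN", "hi", text_heavy=True), us_user))   # False
```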