Recommendation Systems Part 3: Production Systems

Evaluation and Metrics

Recommendation systems require rigorous evaluation across offline, online, and long-term dimensions.

Offline Metrics

Metric	Definition	Use Case
AUC-ROC	Area under ROC curve for engagement prediction	Pointwise model quality
Log-loss	Cross-entropy of predicted probabilities	Calibration quality
NDCG@k	Normalized discounted cumulative gain at rank k	Ranking quality
Recall@k	Fraction of all relevant items that appear in the top-k	Retrieval coverage
Hit Rate	Whether the engaged item appears in top-k	Retrieval success

Offline metrics use held-out interaction logs; they are necessary but not sufficient for production decisions.

Ranking Metric Definitions

Discounted Cumulative Gain (DCG) rewards relevant items at higher positions:

$$ \text{DCG}@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)} $$

where $rel_i$ is the relevance grade of the item at rank $i$. Normalized DCG (NDCG) divides by the ideal DCG:

$$ \text{NDCG}@k = \frac{\text{DCG}@k}{\text{IDCG}@k} $$

where $\text{IDCG}@k$ is DCG of the optimal ranking.
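As a minimal sketch of the two formulas above, the following computes DCG@k and NDCG@k for one ranked list. It assumes the list contains every judged item, so the ideal ranking is simply the same grades sorted in descending order; the example grades are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k with gain 2^rel - 1 and discount log2(i + 1), ranks starting at 1."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given order divided by DCG of the ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

ranked_grades = [3, 2, 0, 1]  # hypothetical relevance grades, in ranked order
print(ndcg_at_k(ranked_grades, k=4))
```

Swapping the last two items would produce the ideal order and an NDCG of exactly 1.0; a list with no relevant items is scored 0.0 here by convention.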

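Average Precision (AP) averages the precision at the rank of each relevant item for a single query; Mean Average Precision (MAP) is the mean of AP over all queries:

$$ \text{AP} = \frac{1}{|\text{Rel}|} \sum_{k=1}^{n} P@k \cdot \mathbb{1}[rel_k = 1] $$

where $|\text{Rel}|$ is the number of relevant items, $n$ is the length of the ranked list, and $P@k = \frac{|\text{relevant in top-}k|}{k}$.

Mean Reciprocal Rank (MRR) captures the position of the first relevant item:

$$ \text{MRR} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{1}{\text{rank}_q} $$

where $\text{rank}_q$ is the rank of the first relevant item for query $q$.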
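The AP and MRR definitions can be sketched as follows. This is an illustrative implementation that assumes binary relevance flags and that every relevant item appears somewhere in the ranked list (so the number of hits equals the number of relevant items); queries with no relevant item contribute 0.

```python
def average_precision(relevant_flags):
    """AP for one ranked list of 0/1 relevance flags."""
    hits, precision_sum = 0, 0.0
    for k, flag in enumerate(relevant_flags, start=1):
        if flag:
            hits += 1
            precision_sum += hits / k  # P@k evaluated at each relevant rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """MAP: mean of AP over queries."""
    return sum(average_precision(flags) for flags in ranked_lists) / len(ranked_lists)

def mean_reciprocal_rank(ranked_lists):
    """MRR: mean of 1 / (rank of first relevant item) over queries."""
    total = 0.0
    for flags in ranked_lists:
        first = next((k for k, f in enumerate(flags, start=1) if f), None)
        total += 1.0 / first if first else 0.0
    return total / len(ranked_lists)

queries = [[0, 1, 0], [1, 0, 0]]  # hypothetical relevance flags per query
print(mean_average_precision(queries), mean_reciprocal_rank(queries))
```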

Calibration Metrics

A model is calibrated if predicted probabilities match empirical frequencies:

$$ P(y = 1 | \hat{p} = p) = p \quad \forall p \in [0, 1] $$

Expected Calibration Error (ECE) bins predictions and measures deviation:

$$ \text{ECE} = \sum_{b=1}^{B} \frac{|B_b|}{n} \left| \text{acc}(B_b) - \text{conf}(B_b) \right| $$

where $\text{acc}(B_b)$ is the empirical positive rate (accuracy) in bin $b$ and $\text{conf}(B_b)$ is the average predicted probability (confidence) in that bin.
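A small sketch of ECE with equal-width bins, assuming binary labels and predicted probabilities in $[0, 1]$. The number of bins and the binning scheme are choices made for this example, not something the definition fixes.

```python
def expected_calibration_error(probs, labels, num_bins=10):
    """ECE with equal-width bins over [0, 1]."""
    n = len(probs)
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * num_bins), num_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        conf = sum(p for p, _ in members) / len(members)  # average predicted probability
        acc = sum(y for _, y in members) / len(members)   # empirical positive rate
        ece += len(members) / n * abs(acc - conf)
    return ece

# Hypothetical predictions: slightly under-confident at both ends
print(expected_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1]))
```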

Online Metrics (A/B Testing)

Online experiments measure causal impact on user behavior:

Metric Category	Examples
Engagement	CTR, likes/user, comments/user, watch time
Retention	DAU, WAU, session frequency, churn rate
Quality	Survey satisfaction, content diversity consumed
Creator health	Posts created, follower growth, monetization
Platform safety	Reports, policy violations surfaced

Statistical Framework

Let $Y_i(1)$ and $Y_i(0)$ denote potential outcomes for user $i$ under treatment and control. The Average Treatment Effect (ATE) is:

$$ \tau = \mathbb{E}[Y_i(1) - Y_i(0)] $$

Randomization ensures unbiased estimation: $\hat{\tau} = \bar{Y}_T - \bar{Y}_C$.

Variance estimation for the difference in means:

$$ \text{Var}(\hat{\tau}) = \frac{\sigma_T^2}{n_T} + \frac{\sigma_C^2}{n_C} $$

Confidence interval (asymptotic normal):

$$ \hat{\tau} \pm z_{1-\alpha/2} \sqrt{\text{Var}(\hat{\tau})} $$
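The estimator, its variance, and the normal-approximation interval can be combined in a few lines. This sketch uses only the standard library and sample variances in place of $\sigma_T^2$ and $\sigma_C^2$; the input values are hypothetical.

```python
from statistics import NormalDist, mean, variance

def ate_confidence_interval(treatment, control, alpha=0.05):
    """Difference in means with an asymptotic normal confidence interval."""
    tau_hat = mean(treatment) - mean(control)
    var_hat = variance(treatment) / len(treatment) + variance(control) / len(control)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * var_hat ** 0.5
    return tau_hat, (tau_hat - half_width, tau_hat + half_width)

tau_hat, (low, high) = ate_confidence_interval(
    [1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0, 3.0]
)
print(tau_hat, low, high)
```

With samples this small the interval is wide and includes zero; the normal approximation is only reasonable at the sample sizes typical of online experiments.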

Sample Size Calculation

For detecting effect size $\delta$ with power $1 - \beta$ at two-sided significance level $\alpha$, the required sample size per group (assuming equal allocation and equal variance) is:

$$ n = 2 \left( \frac{z_{1-\alpha/2} + z_{1-\beta}}{\delta / \sigma} \right)^2 $$

where $\sigma$ is the standard deviation and $\delta / \sigma$ is the standardized effect size.
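A quick worked version of the formula, treating $n$ as the sample size per group. The default $\alpha$ and power are conventional choices assumed for this example.

```python
import math
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Users needed in each arm to detect an absolute effect `delta`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / (delta / sigma)) ** 2)

# Standardized effect of 0.1 at alpha = 0.05 and 80% power
print(sample_size_per_group(delta=0.1, sigma=1.0))  # 1570
```

Because $n$ scales with $1 / \delta^2$, halving the detectable effect roughly quadruples the required sample.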

Multiple Testing Correction

Testing many metrics inflates false positive rates. Corrections include:

Bonferroni: $\alpha' = \alpha / m$ for $m$ tests (conservative)
Benjamini-Hochberg: Controls false discovery rate (FDR) $\leq \alpha$
Sequential testing: Allows interim looks at results while controlling the overall false positive rate via alpha-spending functions (e.g., O’Brien-Fleming)
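To make the first two corrections concrete, here is a sketch of Bonferroni and the Benjamini-Hochberg step-up procedure over a list of p-values. The p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha / m."""
    return [p <= alpha / len(p_values) for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected

p_values = [0.001, 0.012, 0.025, 0.039, 0.20]  # hypothetical per-metric p-values
print(bonferroni(p_values))          # [True, False, False, False, False]
print(benjamini_hochberg(p_values))  # [True, True, True, True, False]
```

On the same inputs Bonferroni keeps one discovery while Benjamini-Hochberg keeps four, which illustrates why the former is described as conservative.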

Variance Reduction Techniques

Stratification: Partition users by pre-experiment covariates; average within-stratum estimates.
CUPED (Controlled-experiment Using Pre-Experiment Data): Regress outcome on pre-experiment metric:

$$ \tilde{Y}_i = Y_i - \theta (X_i - \bar{X}) $$

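where $\theta = \text{Cov}(Y, X) / \text{Var}(X)$ minimizes the variance of $\tilde{Y}$. The adjusted variance is $\text{Var}(\tilde{Y}) = (1 - \rho^2)\,\text{Var}(Y)$, i.e. a reduction by a fraction $\rho^2$, where $\rho$ is the correlation between $Y$ and $X$.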
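A minimal sketch of the CUPED adjustment, using $\theta = \text{Cov}(Y, X) / \text{Var}(X)$ estimated from the data. In a real experiment $\theta$ would usually be estimated on data pooled across arms; the numbers below are hypothetical.

```python
from statistics import mean, variance

def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes given pre-experiment covariate x."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

pre = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical pre-experiment metric
post = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical in-experiment metric
adjusted = cuped_adjust(post, pre)
print(variance(post), variance(adjusted))
```

The adjustment leaves the mean unchanged and shrinks the variance sharply here because the two metrics are almost perfectly correlated.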

Network Interference and SUTVA Violations

Standard A/B testing assumes the Stable Unit Treatment Value Assumption (SUTVA): a user’s outcome depends only on their own treatment assignment, not on others’ assignments. Social recommendation systems routinely violate this assumption.

Why interference occurs:

Mechanism	Example	Bias Direction
Content spillover	Treatment users share recommended content with control users	Dilutes treatment effect
Social influence	Treatment user’s increased engagement affects friends’ feeds	Inflates treatment effect
Competition effects	Treatment users consume content, reducing availability for control	Complicates interpretation
Creator response	Creators adapt to treatment group’s engagement patterns	Long-term ecosystem shift

Formal model (Hudgens & Halloran 2008; Aronow & Samii 2017):

Let $G = (V, E)$ be the social network, $Z_i \in \{0,1\}$ be the treatment assignment for user $i$, and $\mathbf{Z} = (Z_1, \ldots, Z_n)$ be the full assignment vector. The potential outcome $Y_i(\mathbf{Z})$ depends on the entire vector $\mathbf{Z}$, not just $Z_i$.

General interference model: Define neighborhoods $\mathcal{N}(i) \subset V$ (e.g., friends, k-hop neighbors). Assume outcomes depend only on own treatment and neighborhood treatments:

$$ Y_i(\mathbf{Z}) = Y_i(Z_i, \mathbf{Z}_{\mathcal{N}(i)}) $$

Treatment effect with interference: The individual treatment effect compares:

$$ \tau_i(\mathbf{Z}_{\mathcal{N}(i)}) = Y_i(1, \mathbf{Z}_{\mathcal{N}(i)}) - Y_i(0, \mathbf{Z}_{\mathcal{N}(i)}) $$

which now depends on neighbors’ assignments $\mathbf{Z}_{\mathcal{N}(i)}$.

Exposure mappings (Aronow & Samii 2017): Partition assignment vectors into exposure conditions. For binary treatment with a 1-hop neighborhood:

Direct exposure: $Z_i = 1$ (user gets treatment)
Indirect exposure: $Z_i = 0$ but $\exists j \in \mathcal{N}(i) : Z_j = 1$ (friend gets treatment)
No exposure: $Z_i = 0$ and $Z_j = 0 \ \forall j \in \mathcal{N}(i)$
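The three exposure conditions can be assigned mechanically from the graph and the assignment vector. The sketch below assumes an adjacency dict for 1-hop neighborhoods; the user IDs and data layout are hypothetical.

```python
def exposure_condition(user, assignment, neighbors):
    """Classify a user as 'direct', 'indirect', or 'none' under 1-hop exposure."""
    if assignment[user] == 1:
        return "direct"
    if any(assignment[j] == 1 for j in neighbors.get(user, ())):
        return "indirect"
    return "none"

# Hypothetical friendship graph (adjacency lists) and treatment assignment
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
assignment = {"a": 1, "b": 0, "c": 0, "d": 0}
print({u: exposure_condition(u, assignment, neighbors) for u in assignment})
```

Comparing outcomes of "indirect" and "none" users gives a rough view of spillover, though the exposure groups are not randomized in the same way as the direct assignment, so such a comparison needs care.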

Spillover effects:

$$ \text{Spillover}_i = Y_i(0, \mathbf{1}_{\mathcal{N}(i)}) - Y_i(0, \mathbf{0}_{\mathcal{N}(i)}) $$

compares control users with all-treated vs. all-control neighbors.

Consequences of ignoring interference:

Correlated outcomes: Users in the same social cluster have correlated metrics even under random assignment. Standard confidence intervals are too narrow.
Biased estimates: If treatment “leaks” to control via social connections, the measured effect underestimates the true effect.
Irreproducible results: Effects measured in A/B tests don’t replicate at full rollout because interference patterns change.

Mitigation strategies:

Strategy	Approach	Trade-off
Cluster randomization	Randomize at community/graph-cluster level instead of user level	Fewer effective units; higher variance
Ego-network experiments	Randomize treatment of a user’s entire ego network (friends + friends-of-friends)	Complex implementation; ethical concerns
Geographic randomization	Randomize by region where social graphs are denser within than across	Confounds with regional effects
Causal graph modeling	Explicitly model interference structure; adjust estimates	Requires correct interference model
Switchback experiments	Alternate treatment/control over time rather than users	Carryover effects; time confounds

Detecting interference:

Distance-to-treatment analysis: Plot control users’ outcomes against their graph distance to nearest treatment user. Correlation suggests spillover.
Cluster-level variance: If between-cluster variance ≫ within-cluster variance, standard errors are underestimated.
Rollout discontinuities: Compare estimated effect at 1% rollout vs. 50% rollout. Large discrepancies suggest interference.

For recommendation systems with strong social components, ignoring network effects can lead to shipping changes that perform worse at scale than in the A/B test—or missing changes that would have succeeded.

Long-Term Effects

Short-term engagement gains may harm long-term retention (e.g., clickbait). Platforms track:

Cohort retention curves: Do users exposed to the new model return at the same rate after 7/30/90 days?
User satisfaction surveys: NPS, CSAT, qualitative feedback.
Ecosystem health: Creator churn, content quality trends.

Causal Inference for Long-Term Effects

Difference-in-Differences (DiD) compares the before/after change in the treated group with the before/after change in the control group:

$$ \hat{\tau}_{\text{DiD}} = (\bar{Y}_{T,\text{post}} - \bar{Y}_{T,\text{pre}}) - (\bar{Y}_{C,\text{post}} - \bar{Y}_{C,\text{pre}}) $$

This assumes parallel trends: absent treatment, both groups would have changed by the same amount.

Synthetic Control constructs a weighted combination of control units to match pre-treatment outcomes:

$$ \hat{Y}_{T,t}^{(0)} = \sum_{j \in \text{control}} w_j Y_{j,t} $$

Treatment effect: $\hat{\tau}_t = Y_{T,t} - \hat{Y}_{T,t}^{(0)}$.

Instrumental Variables (IV) addresses selection bias when treatment is endogenous:

$$ \hat{\tau}_{\text{IV}} = \frac{\text{Cov}(Y, Z)}{\text{Cov}(D, Z)} $$

where $Z$ is an instrument that is correlated with treatment $D$ and affects outcome $Y$ only through $D$.

Training Infrastructure

Training recommendation models at scale requires specialized infrastructure.

Data Pipeline

flowchart LR
    Logs[(Interaction Logs)] --> ETL[ETL / Feature Join]
    ETL --> Training[Training Data]
    Labels[(Label Generation)] --> Training
    Training --> Shuffle[Global Shuffle]
    Shuffle --> Shards[(Sharded TFRecords)]

Key considerations:

Label generation: Define positive/negative labels (e.g., click = positive, impression without click = negative). Handle implicit feedback (no explicit dislikes).
Negative sampling: With billions of items, most items are never shown. Sample negatives from impressions, random items, or in-batch negatives.
Point-in-time joins: Join features as they existed at interaction time to avoid leakage.
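As an illustration of the point-in-time join idea, the sketch below looks up the latest feature value recorded at or before an interaction's timestamp. The data layout (a timestamp-sorted list of `(timestamp, value)` pairs per entity) is an assumption made for this example; production systems typically do this inside the feature store or with an as-of join.

```python
from bisect import bisect_right

def point_in_time_lookup(feature_history, event_time):
    """Return the most recent feature value with timestamp <= event_time.

    feature_history: list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet at event_time.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, event_time)
    return feature_history[idx - 1][1] if idx > 0 else None

# Hypothetical history of a user's 7-day click count
history = [(10, 3), (20, 5), (30, 9)]
print(point_in_time_lookup(history, 25))  # 5: the value from t=30 must not leak in
```

If a feature can be updated by the very event being labeled, a strict inequality (`bisect_left`) is the safer choice.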

Distributed Training

Models with billions of parameters and terabytes of training data require distributed training:

Approach	Description	Use Case
Data parallelism	Replicate model; partition data	Dense models
Model parallelism	Partition model across devices	Very large models
Embedding sharding	Distribute embedding tables across parameter servers	Large vocabulary (users, items)
Pipeline parallelism	Split layers into stages on different devices; overlap forward/backward passes across micro-batches	Deep models

Frameworks such as TensorFlow and PyTorch (with DeepSpeed or FSDP), along with custom in-house systems (e.g., Meta’s training stack for DLRM-style models, Google’s TPU pods), enable training at this scale.

Model Compression and Serving Efficiency

Production recommendation models face strict latency and cost constraints. A model that achieves 1% higher engagement but adds 50ms latency can degrade user experience enough that it fails to ship. Compression and serving optimization are not afterthoughts; they are first-class concerns.

Quantization

Quantization reduces numerical precision, trading model accuracy for speed and memory.

Precision levels:

Precision	Bits	Range	Speedup	Accuracy Impact
FP32 (float)	32	~$10^{-38}$ to $10^{38}$	1x baseline	Baseline
FP16 (half)	16	~$10^{-8}$ to $6.5 \times 10^4$	2-3x	<0.5% degradation
INT8 (integer)	8	-128 to 127	4-5x	1-2% degradation
INT4	4	-8 to 7	8-10x	3-5% degradation

Quantization-aware training (QAT): Simulate quantization during training by adding fake-quant nodes. The model learns to be robust to precision loss.

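Post-training quantization (PTQ): Quantize trained model weights without retraining. Static PTQ requires a calibration dataset to determine activation ranges and hence quantization scales. With the common affine scheme:

$$ x_{\text{int}} = \text{clamp}\left( \text{round}\left( \frac{x_{\text{float}}}{s} \right) + z,\ q_{\min},\ q_{\max} \right) $$

where $s$ is the scale factor, $z$ is the integer zero-point (the quantized value that represents 0.0), and $[q_{\min}, q_{\max}]$ is the integer range (e.g., $[-128, 127]$ for INT8).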
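A small sketch of affine INT8 quantization for scalars. It follows the widely used convention $q = \text{round}(x / s) + z$ with an integer zero-point and dequantization $x \approx s\,(q - z)$; the calibration range passed in is a hypothetical min/max observed on calibration data, and real toolkits offer more careful range estimators.

```python
def quantization_params(x_min, x_max, q_min=-128, q_max=127):
    """Derive scale and integer zero-point from an observed float range."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # range must contain 0.0
    scale = (x_max - x_min) / (q_max - q_min)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))  # clamp to the integer range

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

scale, zero_point = quantization_params(-1.0, 3.0)  # hypothetical calibration range
q = quantize(1.234, scale, zero_point)
print(q, dequantize(q, scale, zero_point))
```

For values inside the calibrated range the round-trip error is at most about half a quantization step (`scale / 2`); values outside the range saturate at the clamp.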

Dynamic vs. static quantization:

Static: Quantize both weights and activations offline
Dynamic: Quantize weights offline; activations quantized on-the-fly (less speedup but better accuracy)

Per-channel quantization: Use different scales per output channel (more memory, better accuracy).

Tools: TensorFlow Lite, PyTorch Quantization, ONNX Runtime, TensorRT.

Knowledge Distillation

Train a small “student” model to mimic a large “teacher” model.

Setup:

Teacher: Large, accurate model (e.g., 1B parameters)
Student: Small, fast model (e.g., 100M parameters)
Training: Student learns from teacher’s soft targets (logits), not just hard labels

Distillation loss:

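$$ \mathcal{L}_{\text{distill}} = \alpha \cdot \text{KL}(P_{\text{teacher}} \| P_{\text{student}}) + (1 - \alpha) \cdot \mathcal{L}_{\text{CE}}(y, P_{\text{student}}) $$

where $P$ are softmax probabilities, $y$ are ground-truth labels, and $\alpha \in [0, 1]$ balances teacher guidance vs. label supervision. The KL term is taken from the teacher’s distribution to the student’s, so the student is penalized for assigning low probability where the teacher assigns high probability.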

Temperature scaling: Soften probability distributions to expose dark knowledge:

$$ P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} $$

Higher temperature $T$ (e.g., $T=3$) smooths distribution; student learns relative rankings, not just top-1.
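The loss and temperature scaling can be sketched for a single example in plain Python. This version takes the KL term from the teacher to the student, $\text{KL}(P_{\text{teacher}} \| P_{\text{student}})$, which is the direction commonly used for distillation, and applies the temperature only to the soft-target term; the optional $T^2$ rescaling of that term is omitted. The logits are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=3.0):
    soft_term = kl_divergence(
        softmax(teacher_logits, temperature), softmax(student_logits, temperature)
    )
    hard_term = -math.log(softmax(student_logits)[label])  # cross-entropy on the label
    return alpha * soft_term + (1 - alpha) * hard_term

teacher = [2.0, 1.0, 0.0]  # hypothetical logits
print(softmax(teacher, 1.0), softmax(teacher, 3.0))
print(distillation_loss([1.5, 1.2, 0.1], teacher, label=0))
```

Comparing the two softmax outputs shows how a higher temperature flattens the teacher's distribution, exposing more of its relative preferences among non-top items.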

Multi-task distillation: Teacher model outputs multiple heads (click, like, share). Student learns all tasks from teacher.

Benefits:

Speedups of roughly 5-10x, often with <2% accuracy loss
Smaller memory footprint (lower RAM and bandwidth requirements)
Easier deployment (the student can often be served on CPU)

Model Pruning

Remove unimportant weights or neurons to reduce model size.

Magnitude-based pruning: Remove weights with smallest absolute values:

$$ \text{Prune}(W)_{ij} = \begin{cases} W_{ij} & \text{if } |W_{ij}| > \tau \\ 0 & \text{otherwise} \end{cases} $$

where $\tau$ is the magnitude threshold, typically chosen to hit a target sparsity.
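As a sketch of magnitude-based pruning, the function below derives the threshold from a target sparsity and zeroes every weight at or below it. It works on a nested list for clarity; a real implementation would operate on tensors and usually keep a binary mask so pruned weights stay zero during retraining.

```python
def magnitude_prune(weights, sparsity):
    """Zero out roughly the `sparsity` fraction of weights with smallest |w|."""
    magnitudes = sorted(abs(w) for row in weights for w in row)
    num_pruned = int(len(magnitudes) * sparsity)
    if num_pruned == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[num_pruned - 1]  # largest magnitude that gets pruned
    return [[w if abs(w) > threshold else 0.0 for w in row] for row in weights]

W = [[0.5, -0.1], [0.05, -2.0]]  # hypothetical weight matrix
print(magnitude_prune(W, sparsity=0.5))  # [[0.5, 0.0], [0.0, -2.0]]
```

Ties at the threshold are all pruned, so the realized sparsity can slightly exceed the target.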

Structured pruning: Remove entire channels, layers, or attention heads (hardware-friendly).

Iterative pruning: Prune → retrain → prune → retrain. Gradual pruning maintains accuracy better than one-shot.

Pruning ratios: Recommendation models can often be pruned 30-50% with <1% accuracy loss.

Serving Infrastructure

flowchart TB
    subgraph Client ["Client Layer"]
        App[User App] --> LB[Load Balancer]
    end

    subgraph Serving ["Serving Layer"]
        LB --> Gateway[API Gateway]
        Gateway --> FeatureService[Feature Service]
        Gateway --> ModelService[Model Inference]

        FeatureService --> FeatureCache[(Redis: Feature Cache)]
        FeatureService --> FeatureStore[(Feature Store)]

        ModelService --> EmbedCache[(Embedding Cache)]
        ModelService --> ModelServer[Model Server Pool]
    end

    subgraph Backends ["Backend Compute"]
        ModelServer --> GPU1[GPU Pod 1]
        ModelServer --> GPU2[GPU Pod N]
        ModelServer --> CPU[CPU Fallback]
    end

    subgraph Offline ["Offline Pipelines"]
        Batch[Batch Jobs] --> PreComp[(Precomputed Embeddings)]
        PreComp --> EmbedCache
    end

Model serving frameworks:

Framework	Strengths	Use Case
TensorFlow Serving	Production-grade, model versioning, batching	TensorFlow models
TorchServe	PyTorch native, multi-model serving	PyTorch models
Triton Inference Server	Multi-framework, GPU optimization, dynamic batching	Heterogeneous stacks
ONNX Runtime	Cross-platform, lightweight, quantization support	Edge deployment, CPU serving
Ray Serve	Python-native, autoscaling, multi-model	Rapid prototyping, Python pipelines

Caching Strategies

Embedding caches:

User embeddings: Cache recent users (LRU eviction). Hit rate: 70-90%.
Item embeddings: Cache popular items (weighted LRU). Hit rate: 80-95%.
Storage: Redis cluster with 100-500GB memory.

Result caches:

Cache final recommendations for deterministic requests (e.g., logged-out homepage).
TTL: 5-60 minutes depending on freshness requirements.
Invalidation: Triggered by user actions or model updates.

Feature caches:

Precompute static user/item features (demographic, historical aggregates).
Update cadence: hourly to daily.

Cache hit economics:

Cached response: <1ms latency, $0.0001 cost
Model inference: 50ms latency, $0.01 cost
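90% cache hit rate: average cost ≈ 0.9 × $0.0001 + 0.1 × $0.01 ≈ $0.0011 per request, roughly 9x cheaper than running inference on every request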
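The economics above are a weighted average of the hit and miss paths, which is easy to check directly. The cost and latency figures are the illustrative ones from the list, with 1ms used as a stand-in for the "<1ms" cached latency.

```python
def blended(hit_rate, hit_value, miss_value):
    """Expected per-request value given a cache hit rate."""
    return hit_rate * hit_value + (1 - hit_rate) * miss_value

cost = blended(0.9, hit_value=0.0001, miss_value=0.01)  # dollars per request
latency = blended(0.9, hit_value=1.0, miss_value=50.0)  # milliseconds
print(cost, 0.01 / cost, latency)
```

With these figures the saving is a little over 9x, and it is dominated by the miss path: going from a 90% to a 95% hit rate nearly halves the average cost again.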

Batching and Throughput Optimization

Dynamic batching: Aggregate requests that arrive within a time window (e.g., 10ms) into a single batch.

Benefits:

GPU utilization: 20% (no batching) → 80% (batched)
Amortized kernel launch overhead
Higher throughput (requests/second)

Trade-off:

Adds queuing delay (P99 latency increases)
Not suitable for ultra-low-latency applications

Batch size tuning:

Too small: Underutilize GPU
Too large: OOM errors, increased latency
Typical: 16-128 for ranking models

GPU vs. CPU Trade-offs

Metric	GPU	CPU
Throughput	High (1000s req/s)	Low (10s req/s)
Latency (P50)	5-20ms	50-200ms
Latency (P99)	20-50ms	200-500ms
Cost per inference	$0.001-0.01	$0.0001-0.001
Idle cost	High (GPU sits idle)	Low (CPU multi-tenant)
Model size limit	GPU memory (16-80GB)	System memory (100s GB)

Decision heuristic:

High traffic (>1M req/day): GPU worth it
Low traffic or bursty: CPU + autoscaling
Extreme latency requirements (<10ms): GPU mandatory
Cost-sensitive: Quantized CPU models

Autoscaling and Capacity Planning

Metrics to monitor:

Request rate (req/s)
P50/P95/P99 latency
GPU/CPU utilization
Queue depth (requests waiting)

Scaling triggers:

Scale up: P99 latency > SLA for >2 minutes
Scale down: CPU utilization < 30% for >10 minutes

Capacity planning:

Provision for 2x peak traffic (headroom for traffic spikes)
Account for zone failures (N+2 redundancy)
Reserve capacity for model experiments (10-20% of fleet)

Regional Deployment and Latency

Deploying models close to users reduces network latency.

Strategy	Latency Impact	Cost Impact
Single region	+50-200ms cross-region	Lowest (1 deployment)
Multi-region (replicated)	<20ms within region	3-10x (duplicate infrastructure)
Edge deployment	<10ms	Highest (edge compute expensive)

Recommendation: Multi-region for global platforms; edge for latency-critical features (e.g., real-time notifications).

Model Lifecycle Management

Production recommendation models are never “done”—they evolve continuously. Managing this lifecycle without breaking serving systems requires careful orchestration.

Model Registry and Versioning

Model registry: Centralized store for trained models with metadata:

Metadata	Purpose	Example
Model ID	Unique identifier	`ranking-v2.3.1-20250121`
Training data	Dataset version, date range	`interactions-2025-01-01-to-2025-01-15`
Metrics	Offline validation metrics	AUC: 0.742, Precision@10: 0.31
Framework	TensorFlow, PyTorch, JAX	`pytorch-2.1.0`
Lineage	Parent model, training code hash	git:`abc123`, parent:`v2.3.0`
Approval status	Human-reviewed, A/B tested	`approved-for-prod`

Popular tools: MLflow, Weights & Biases, proprietary systems.

Versioning strategy:

Semantic versioning: major.minor.patch
Major: Architecture changes (two-tower → transformer)
Minor: Feature additions, dataset updates
Patch: Bugfixes, retraining on same data

Shadow Traffic and Gradual Rollout

Never deploy a new model directly to 100% of users. Use staged rollout:

1. Shadow mode:

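import logging
from concurrent.futures import ThreadPoolExecutor

_shadow_pool = ThreadPoolExecutor(max_workers=4)

def _run_shadow(user, context, recs_prod):
    # Shadow failures are logged and swallowed so they never reach the user
    try:
        recs_shadow = shadow_model.predict(user, context)
        log_shadow_metrics(recs_shadow, recs_prod)
    except Exception:
        logging.exception("shadow model failed")

def serve_request(user, context):
    # Production model: the only result the user sees
    recs_prod = prod_model.predict(user, context)
    # Shadow model runs off the request path, so it adds no user-facing latency
    _shadow_pool.submit(_run_shadow, user, context, recs_prod)
    return recs_prod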

Purpose:

Validate shadow model latency (P99 < SLA)
Compare predictions (correlation, overlap)
Catch serving bugs before user impact

Duration: 24-72 hours to collect sufficient data.

2. Canary deployment:

Route 1-5% of traffic to new model; monitor closely:

flowchart LR
    Traffic[User Traffic] --> Router{Traffic Router}
    Router -->|"95%"| Prod[Prod Model v1]
    Router -->|"5%"| Canary[Canary Model v2]

    Prod --> Monitor[Monitoring]
    Canary --> Monitor

    Monitor --> Alert{Metrics OK?}
    Alert -->|"yes"| Proceed[Increase to 10%]
    Alert -->|"no"| Rollback[Rollback]

Metrics to watch:

Engagement (CTR, time spent)
Latency (P50, P95, P99)
Error rate
User complaints / feedback

Auto-rollback triggers:

P99 latency > SLA + 20%
Error rate > 1%
Engagement drop > 5%
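The three triggers can be expressed as a simple check over canary and baseline metrics. This is an illustrative sketch: the `ModelMetrics` container and field names are hypothetical, and the engagement drop is interpreted as relative to the production baseline.

```python
from dataclasses import dataclass

@dataclass
class ModelMetrics:
    p99_latency_ms: float
    error_rate: float  # fraction of failed requests, e.g. 0.004
    engagement: float  # e.g. CTR over the monitoring window

def rollback_reasons(canary, baseline, latency_sla_ms):
    """Return the list of triggered rollback conditions (empty means healthy)."""
    reasons = []
    if canary.p99_latency_ms > 1.2 * latency_sla_ms:
        reasons.append("p99 latency above SLA + 20%")
    if canary.error_rate > 0.01:
        reasons.append("error rate above 1%")
    if baseline.engagement > 0 and canary.engagement < 0.95 * baseline.engagement:
        reasons.append("engagement drop above 5%")
    return reasons

baseline = ModelMetrics(p99_latency_ms=80, error_rate=0.002, engagement=0.10)
canary = ModelMetrics(p99_latency_ms=130, error_rate=0.02, engagement=0.09)
print(rollback_reasons(canary, baseline, latency_sla_ms=100))
```

In practice the engagement comparison would also need a minimum sample size or a significance check, since a 5% swing is within noise at low canary traffic.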

3. Gradual rollout:

If canary looks good, increase percentage: 5% → 10% → 25% → 50% → 100% over days.
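One common way to implement a percentage rollout is deterministic hash bucketing, sketched below. Each user maps to a stable bucket in 0-99, so anyone who saw the new model at 5% still sees it at 10% and beyond. The salt string and function names are hypothetical.

```python
import hashlib

def rollout_bucket(user_id, salt="ranking-v2-rollout"):
    """Map a user to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100

def use_new_model(user_id, rollout_percent, salt="ranking-v2-rollout"):
    """True if this user falls inside the current rollout percentage."""
    return rollout_bucket(user_id, salt) < rollout_percent

exposed_at_5 = [u for u in range(10_000) if use_new_model(u, 5)]
# Every user exposed at 5% remains exposed at 25%
print(all(use_new_model(u, 25) for u in exposed_at_5), len(exposed_at_5))
```

Using a different salt per experiment keeps bucket assignments independent across rollouts, so the same users are not always the first to receive new models.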

4. A/B testing:

For major changes, run multi-week A/B test at 50/50 split to measure long-term impact before full rollout.

Rollback Procedures

Models fail in production. Fast rollback saves engagement.

Automatic rollback:

Healthchecks fail (OOM, crashes)
Latency SLA violations persist >5 minutes
Error rate > threshold

Manual rollback:

Engagement metrics drop significantly
User reports of bad recommendations spike
Discovered bug in feature computation

Rollback mechanism:

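from functools import lru_cache
from flask import Flask, jsonify

app = Flask(__name__)

@lru_cache(maxsize=4)
def get_model(version):
    # Load each version once; the rollback target stays warm in memory
    return model_registry.load(version)

# Traffic router checks the active model version on every request
@app.route('/recommend/<user_id>')
def recommend(user_id):
    active_version = config.get('active_model_version')  # e.g. v2.3.1
    return jsonify(get_model(active_version).predict(user_id))

# Rollback via config update (no code deploy)
config.set('active_model_version', 'v2.3.0')  # rollback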

Fast rollback: Config change + cache invalidation < 30 seconds.

Model Retirement

Old models consume storage and confuse debugging. Retirement policy:

Model Status	Retention	Rationale
Active production	Indefinite	Currently serving traffic
Previous version	30 days	Rollback target
Older versions	90 days	Debugging historical issues
Experimental	7 days	Failed experiments

Exceptions: Keep models used in published research or regulatory audits.

Multi-Model Serving

Production systems often serve multiple models simultaneously:

Use cases:

Pattern	Example	Rationale
Model per surface	Home feed uses model A; search uses model B	Different objectives
Model per user segment	New users get cold-start model; active users get personalized model	Different data availability
Ensemble	Rank using average of 3 models	Robustness, better accuracy
Market-specific	US uses model A; India uses model B	Localization

Serving infrastructure:

class ModelRouter:
    def __init__(self):
        self.models = {
            'home_feed': load_model('home-v2.1.0'),
            'search': load_model('search-v1.8.3'),
            'cold_start': load_model('coldstart-v3.0.1'),
        }

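    def route(self, request):
        # Pick the model for this request's surface and user segment
        if request.surface == 'home':
            if request.user.interaction_count < 10:
                return self.models['cold_start']  # too little history to personalize
            return self.models['home_feed']
        if request.surface == 'search':
            return self.models['search']
        # Fail loudly instead of silently returning None for unknown surfaces
        raise ValueError(f"no model configured for surface {request.surface!r}")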

Continuous Training and Retraining Cadence

Models degrade as user behavior shifts. Retraining keeps them fresh.

Retraining strategies:

Frequency	Pros	Cons
Daily	Captures latest trends	Expensive; risk of overfitting to noise
Weekly	Balances freshness and stability; a common default	Can lag fast-moving trends by several days
Monthly	Stable; avoids churn	Stale for fast-moving platforms
Event-triggered	React to distribution shifts	Complex to implement

Continuous training: Train incrementally on new data without full retraining:

Warm-start from previous checkpoint
Train only on last 7 days of data
Use learning rate decay to avoid catastrophic forgetting

Trade-off: Incremental training drifts from optimal; full retraining is expensive. Common pattern: incremental training weekly, full retraining monthly.

Model Deprecation and Migration

Migrating users from old to new model architecture:

Challenge: Incompatible model signatures (features changed, output format different).

Migration strategy:

Dual-write phase: Log features for both old and new models
Parallel serving: Serve old model; compute new model predictions in shadow
Gradual cutover: Route increasing % of traffic to new model
Deprecation: Remove old model after 100% migration

Duration: 4-8 weeks for major migrations to ensure stability.

Feedback Loops and Model Drift

Recommendation systems are closed-loop: the model influences which items users see, which in turn generates the training data for future models.

Positive Feedback Loops

Items recommended more often accumulate more engagement data, making them appear even more relevant. This creates rich-get-richer dynamics:

Popular items dominate recommendations.
New items struggle to gain visibility.
User preferences appear to converge (filter bubbles).

Mitigation Strategies

Strategy	Description
Exploration	Inject random or uncertain items to gather signal
Propensity scoring	Weight training examples by inverse probability of being shown
Counterfactual learning	Train on logged data with importance sampling corrections
Freshness boosts	Artificially elevate new items to gather initial signal
Randomized experiments	Continuously run A/B tests to measure unbiased performance

Counterfactual Learning from Logged Bandit Feedback

Logged interaction data is biased: users only see items selected by the logging policy $\pi_0$. Naively training on this data learns to mimic $\pi_0$ rather than optimize reward.

Inverse Propensity Scoring (IPS)

Let $\pi_0(a | x)$ be the probability that the logging policy showed item $a$ given context $x$. The IPS estimate of a new policy $\pi$’s value is:

$$ \hat{V}_{\text{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i | x_i)}{\pi_0(a_i | x_i)} r_i $$

where $r_i$ is the observed reward. The importance weight $w_i = \frac{\pi(a_i | x_i)}{\pi_0(a_i | x_i)}$ corrects for selection bias.

Statistical Properties:

Unbiasedness: Under the overlap (common support) assumption:

$$ \pi(a | x) > 0 \implies \pi_0(a | x) > 0 \quad \forall x, a $$

the IPS estimator is unbiased (Horvitz & Thompson 1952):

$$ \mathbb{E}[\hat{V}_{\text{IPS}}(\pi)] = \mathbb{E}_{x \sim p(x)} \left[ \mathbb{E}_{a \sim \pi_0(\cdot | x)} \left[ \frac{\pi(a | x)}{\pi_0(a | x)} r(x, a) \right] \right] = \mathbb{E}_{x, a \sim \pi} [r(x, a)] = V(\pi) $$

Variance: The variance grows with the mismatch between policies. Writing $w(x, a) = \frac{\pi(a | x)}{\pi_0(a | x)}$ and $r(x, a)$ for the mean reward, the law of total variance gives:

$$ \text{Var}(\hat{V}_{\text{IPS}}) = \frac{1}{n} \left( \mathbb{E}_{x, a \sim \pi_0} \left[ w^2(x, a) \, \text{Var}(r | x, a) \right] + \mathbb{E}_{x} \left[ \text{Var}_{a \sim \pi_0(\cdot | x)} \left[ w(x, a) \, r(x, a) \right] \right] + \text{Var}_{x} \left[ \mathbb{E}_{a \sim \pi_0(\cdot | x)} \left[ w(x, a) \, r(x, a) \right] \right] \right) $$

The three terms capture reward noise, randomness in the logged action, and variation across contexts.

Key insight: Variance explodes when $\pi_0(a | x) \to 0$ for actions where $\pi(a | x)$ is large. In the worst case, $\text{Var}(\hat{V}_{\text{IPS}}) = O(n^{-1} w_{\max}^2)$ where $w_{\max} = \max_{x, a} \frac{\pi(a | x)}{\pi_0(a | x)}$.

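Concentration: By Hoeffding’s inequality, if rewards are bounded $r \in [0, R]$ and importance weights satisfy $w \leq M$, then with probability at least $1 - \delta$:

$$ |\hat{V}_{\text{IPS}}(\pi) - V(\pi)| \leq MR \sqrt{\frac{\ln(2/\delta)}{2n}} $$

If the bound $M$ is enforced by clipping, the same inequality holds around the expectation of the clipped estimator, which is itself biased for $V(\pi)$.

Positivity violation: If overlap fails ($\pi(a | x) > 0$ but $\pi_0(a | x) = 0$ for some $(x, a)$), those actions never appear in the logs, so IPS cannot account for their reward and is biased. The estimate is only reliable on the region where both policies overlap.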

Clipped IPS and Self-Normalized IPS

To reduce variance, clipped IPS bounds the importance weights:

$$ \hat{V}_{\text{clipped}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \min\left( M, \frac{\pi(a_i | x_i)}{\pi_0(a_i | x_i)} \right) r_i $$

Self-normalized IPS normalizes by the sum of weights:

$$ \hat{V}_{\text{SNIPS}}(\pi) = \frac{\sum_{i=1}^{n} w_i r_i}{\sum_{i=1}^{n} w_i} $$

This is biased but has lower variance and is invariant to scaling of propensities.
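The three estimators differ only in how they use the importance weights, which a short sketch makes visible. The propensities and rewards below are hypothetical; in practice $\pi_0$ must be logged at serving time or estimated.

```python
def importance_weights(target_probs, logging_probs):
    """w_i = pi(a_i | x_i) / pi_0(a_i | x_i) for each logged interaction."""
    return [p / p0 for p, p0 in zip(target_probs, logging_probs)]

def ips(weights, rewards):
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)

def clipped_ips(weights, rewards, max_weight):
    return sum(min(max_weight, w) * r for w, r in zip(weights, rewards)) / len(rewards)

def snips(weights, rewards):
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)

# Hypothetical log: one action is rare under the logging policy (weight of 9)
w = importance_weights([0.5, 0.2, 0.9], [0.25, 0.4, 0.1])
r = [1.0, 0.0, 1.0]
print(ips(w, r), clipped_ips(w, r, max_weight=5.0), snips(w, r))
```

On this tiny log the plain IPS estimate exceeds the maximum possible reward of 1.0, a symptom of its variance, while SNIPS always stays within the range of observed rewards.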

Doubly Robust Estimation

Doubly robust (DR) combines a reward model $\hat{r}(x, a)$ with IPS:

$$ \hat{V}_{\text{DR}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{r}(x_i, \pi) + w_i (r_i - \hat{r}(x_i, a_i)) \right] $$

where $\hat{r}(x_i, \pi) = \sum_a \pi(a | x_i) \hat{r}(x_i, a)$.
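A compact sketch of the DR estimator over a finite action set. The function signatures (`target_policy(a, x)` returning a probability, `reward_model(x, a)` returning a predicted reward) and the log format are assumptions made for this example.

```python
def doubly_robust(logged, target_policy, reward_model, actions):
    """DR estimate of the target policy's value.

    logged: iterable of (x, a, r, p0) tuples, where p0 = pi_0(a | x).
    """
    total, n = 0.0, 0
    for x, a, r, p0 in logged:
        # Model-based term: expected predicted reward under the target policy
        direct = sum(target_policy(b, x) * reward_model(x, b) for b in actions)
        # IPS correction on the model's residual for the logged action
        w = target_policy(a, x) / p0
        total += direct + w * (r - reward_model(x, a))
        n += 1
    return total / n

actions = [0, 1]
always_one = lambda a, x: 1.0 if a == 1 else 0.0  # hypothetical target policy
model = lambda x, a: 0.2 + 0.5 * a                # hypothetical reward model
log = [(None, 1, 0.7, 0.5), (None, 0, 0.2, 0.5)]
print(doubly_robust(log, always_one, model, actions))
```

Setting the reward model to zero everywhere makes the expression collapse to plain IPS, which is a useful sanity check on an implementation.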

Statistical properties (Dudík et al. 2011):

Double robustness: The estimator is unbiased if either the propensity model $\pi_0$ or the reward model $\hat{r}$ is correctly specified:

$$ \mathbb{E}[\hat{V}_{\text{DR}}(\pi)] = V(\pi) \quad \text{if } \pi_0 = \pi_0^{*} \text{ or } \hat{r} = r^{*} $$

This follows from:

$$ \mathbb{E}[\hat{V}_{\text{DR}}(\pi)] = \mathbb{E}_{x} \left[ \hat{r}(x, \pi) \right] + \mathbb{E}_{x, a \sim \pi_0} \left[ w(x, a) (r(x, a) - \hat{r}(x, a)) \right] $$

If $\hat{r} = r^{*}$, the second term vanishes. If $\pi_0 = \pi_0^{*}$, the second term equals $\mathbb{E}_{x, a \sim \pi} [r(x, a)] - \mathbb{E}_{x} [\hat{r}(x, \pi)]$, which corrects for model error.

Variance: When $\hat{r}$ is close to the true mean reward, the variance of DR is approximately:

$$ \text{Var}(\hat{V}_{\text{DR}}) \approx \frac{1}{n} \mathbb{E}_{x, a \sim \pi_0} \left[ w^2(x, a) (r(x, a) - \hat{r}(x, a))^2 \right] + \frac{1}{n} \text{Var}_{x} [\hat{r}(x, \pi)] $$

Key insight: When $\hat{r}$ is accurate, the first term (weighted residual variance) is small, giving DR much lower variance than IPS. The variance scales with the squared residuals weighted by $w^2$, not the squared rewards.

Variance reduction condition: DR has lower MSE than IPS when:

$$ \mathbb{E}_{x, a \sim \pi_0} \left[ w^2(x, a) \cdot (\hat{r}(x, a) - r(x, a))^2 \right] < \mathbb{E}_{x, a \sim \pi_0} \left[ w^2(x, a) \cdot r^2(x, a) \right] - \left( \mathbb{E}_{x, a \sim \pi} [r(x, a)] \right)^2 $$

In practice, this holds when the reward model captures even 20-30% of the variance in rewards.

Covariance structure: The covariance between model error and importance weights affects bias-variance tradeoff:

$$ \text{Cov}(\hat{r}(x, a), w(x, a)) \neq 0 \implies \text{additional bias/variance} $$

If the reward model is trained on the logged data, model errors may correlate with propensity scores (e.g., model overfits high-propensity actions), introducing subtle bias.

Policy Learning from Logged Data

To learn a policy directly, optimize the IPS-weighted objective:

$$ \hat{\pi} = \arg\max_{\pi \in \Pi} \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i | x_i)}{\pi_0(a_i | x_i)} r_i $$

Regularization and variance reduction are critical for stable optimization.

POEM (Policy Optimizer for Exponential Models; Swaminathan & Joachims 2015) penalizes the empirical variance of the importance-weighted rewards. Its self-normalized variant optimizes:

$$ \hat{\pi} = \arg\max_{\pi \in \Pi} \frac{\sum_{i=1}^{n} w_i r_i}{\sum_{i=1}^{n} w_i} - \lambda \cdot \text{Var}_w $$

where $\text{Var}_w$ is the empirical variance of the weighted rewards and $\lambda \geq 0$ controls the strength of the penalty.

Position Bias Correction

In recommendation, position affects click probability independently of relevance. Let $\theta_k = P(\text{examined} | \text{position } k)$ be the position-dependent examination probability. Under the examination hypothesis, a user clicks only if they examine an item and find it relevant, so the observed click rate at position $k$ is:

$$ P(\text{click} | \text{position } k) = P(\text{relevant}) \cdot \theta_k $$

Click models (position-based model, cascade model, dependent click model) estimate $\theta_k$ and use it to debias training:

$$ \tilde{r}_i = \frac{r_i}{\hat{\theta}_{k_i}} $$

This inverse propensity weighting corrects for position bias.
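The correction amounts to dividing each click by the estimated examination probability of the position where it was shown. The sketch below assumes the $\hat{\theta}_k$ values are already estimated (the numbers are hypothetical) and adds a weight cap, a common practical safeguard that is not part of the formula above.

```python
def debias_clicks(clicks, positions, examination_probs, max_weight=10.0):
    """Reweight clicks by inverse examination probability of their position."""
    return [
        click * min(max_weight, 1.0 / examination_probs[pos])
        for click, pos in zip(clicks, positions)
    ]

theta_hat = {1: 1.0, 2: 0.5, 3: 0.25}  # hypothetical examination probabilities
print(debias_clicks([1, 0, 1], [1, 2, 3], theta_hat))  # [1.0, 0.0, 4.0]
```

A click at position 3 counts four times as much as a click at position 1, reflecting that the item was far less likely to be seen at all. Non-clicks stay at zero under this scheme.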

Model Drift

User preferences and content distribution shift over time. Models degrade if not retrained:

Concept drift: The relationship between features and engagement changes.
Data drift: Feature distributions shift (e.g., new content formats, user demographics).

Continuous training pipelines retrain models daily or even hourly on fresh data. Monitoring systems track prediction calibration and trigger alerts on drift.

flowchart LR
    Production[Production Traffic] --> Logs[(Interaction Logs)]
    Logs --> Pipeline[Training Pipeline]
    Pipeline --> NewModel[New Model]
    NewModel --> Validation[Validation]
    Validation -->|pass| Deploy[Model Serving]
    Validation -->|fail| Alert[Alert / Rollback]
    Deploy --> Production

Cold Start and Exploration

New users and new items lack interaction history, making personalization difficult. This isn’t a corner case—it’s a constant reality. Platforms with growth add millions of new users monthly, and content platforms ingest thousands of new items per hour. At any given moment, a significant fraction of traffic involves cold entities.

The cold start problem is actually three distinct problems:

New user cold start: A user signs up with no interaction history. What do you show them on their first session? Their first impression determines whether they become an active user or churn.
New item cold start: A creator uploads content, a seller lists a product, or a news article is published. Without engagement data, collaborative filtering produces no signal. The item sits in a chicken-and-egg trap: it can’t rank well without engagement, and it can’t get engagement without ranking well.
System cold start: A new recommendation system launches with no historical data at all. This is rare but occurs when entering new markets or building entirely new product surfaces.

Why cold start is hard:

Collaborative filtering—the backbone of most recommendation systems—relies on the assumption that users who agreed in the past will agree in the future. But cold entities have no past. Content-based methods provide a fallback, but they typically underperform collaborative methods by 10-30% in engagement metrics.

The business stakes:

Scenario	Impact
Poor new user experience	40-60% of users who churn do so in their first week
New item neglect	Creators leave platforms where their content doesn’t get discovered
Stale catalog dominance	Popular items accumulate engagement, new items can’t compete

The solutions involve careful orchestration of exploration budgets, content understanding, and graceful degradation strategies.

New User Cold Start

Bayesian perspective: A new user $u$ has unknown preference parameters $\boldsymbol{\theta}_u$. Each strategy corresponds to choosing a different prior distribution $P(\boldsymbol{\theta}_u | \text{context})$:

Strategy	Bayesian Interpretation	Prior Strength
Onboarding surveys	$P(\boldsymbol{\theta}\_u \mid \text{stated interests})$: Strong informative prior from explicit preferences	High variance reduction: $\sigma^2\_{\text{prior}} \approx 0.3\sigma^2\_{\text{pop}}$
Demographic priors	$P(\boldsymbol{\theta}\_u \mid \text{age, location})$: Hierarchical model conditions on demographics	Medium: $\sigma^2\_{\text{prior}} \approx 0.6\sigma^2\_{\text{pop}}$
Social bootstrapping	$P(\boldsymbol{\theta}\_u \mid \boldsymbol{\theta}\_{\text{friends}})$: Prior centers on social network’s preferences	High for dense networks: $\sigma^2\_{\text{prior}} \approx 0.4\sigma^2\_{\text{pop}}$
Exploration-heavy	$P(\boldsymbol{\theta}\_u) = \mathcal{N}(\boldsymbol{\mu}\_{\text{pop}}, \boldsymbol{\Sigma}\_{\text{pop}})$: Broad uninformative prior, rapid learning	Low: $\sigma^2\_{\text{prior}} = \sigma^2\_{\text{pop}}$
Popularity fallback	$P(\boldsymbol{\theta}\_u) = \delta(\boldsymbol{\mu}\_{\text{pop}})$: Point estimate at population mean	Maximum bias, zero variance

Prior-data tradeoff: The expected squared error after $k$ interactions is:

$$ \text{MSE}(k) = \underbrace{\text{Bias}^2(\text{prior})}\_{\text{wrong prior}} + \underbrace{O(d/k)}\_{\text{estimation error}} $$

Strong priors reduce initial error but introduce bias if misspecified. Weak priors start with higher error but are overridden by data sooner, so they recover faster when the prior is wrong.
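To make the tradeoff concrete, here is a small sketch under a simplifying assumption: a single scalar preference with a Gaussian prior and Gaussian observation noise, where the MSE of the posterior mean has a closed form. The function name and the example numbers are illustrative. In this conjugate model the bias term itself shrinks as $k$ grows, so the decomposition above is best read as a description of the early-interaction regime.

```python
def posterior_mean_mse(k: int, prior_var: float, prior_error: float, noise_var: float = 1.0) -> float:
    """Frequentist MSE of the posterior mean after k observations (scalar Gaussian model).

    prior_error is (prior mean - true preference); prior_var is the prior variance.
    """
    w = noise_var / (noise_var + k * prior_var)  # weight still placed on the prior mean
    bias_sq = (w * prior_error) ** 2
    variance = (1.0 - w) ** 2 * noise_var / k if k > 0 else 0.0
    return bias_sq + variance

# A strong prior (variance 0.3) that happens to be right beats a weak one early on...
strong_right_k1 = posterior_mean_mse(k=1, prior_var=0.3, prior_error=0.0)
weak_right_k1 = posterior_mean_mse(k=1, prior_var=1.0, prior_error=0.0)
# ...but when both priors are off by 1.0, the weak prior recovers sooner.
strong_wrong_k10 = posterior_mean_mse(k=10, prior_var=0.3, prior_error=1.0)
weak_wrong_k10 = posterior_mean_mse(k=10, prior_var=1.0, prior_error=1.0)
```

With these numbers the misspecified strong prior still carries an MSE of about 0.12 after ten interactions, against about 0.09 for the weak prior.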

New Item Cold Start

Bayesian perspective: A new item $i$ has unknown quality/appeal parameters $\mathbf{v}_i$. Prior distributions leverage side information:

Strategy	Bayesian Interpretation	Prior Precision
Content-based features	$P(\mathbf{v}\_i \mid \mathbf{c}\_i) = \mathcal{N}(\mathbf{M} \mathbf{c}\_i, \boldsymbol{\Sigma}\_{\text{content}})$ where $\mathbf{c}\_i$ are content features	Depends on content informativeness: $R^2 \in [0.3, 0.7]$ typically
Creator signals	$P(\mathbf{v}\_i \mid \text{creator}\_i) = \mathcal{N}(\bar{\mathbf{v}}\_{\text{creator}}, \boldsymbol{\Sigma}\_{\text{within-creator}})$	High for established creators: $\sigma^2\_{\text{within}} \ll \sigma^2\_{\text{pop}}$
Exploration allocation	Allocate traffic $\propto \text{Var}(\mathbf{v}\_i)$ to reduce posterior uncertainty	Information gain optimization: $\arg\max\_i H[P(\mathbf{v}\_i)]$
Bandits (UCB/Thompson)	Upper confidence bound: $\hat{r}\_i + \beta \sigma\_i$ where $\sigma\_i^2 = \text{Var}(\mathbf{v}\_i \mid \mathcal{D}\_i)$	Exploration bonus shrinks as $\sigma\_i^2 \to 0$
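The upper confidence bound in the last row can be sketched as follows. This assumes, purely for illustration, that each item's appeal is summarized by a click-through rate with a Beta posterior; the `ItemStats` type, the prior pseudo-counts, and $\beta = 2$ are hypothetical choices.

```python
import math
from dataclasses import dataclass

@dataclass
class ItemStats:
    item_id: str
    clicks: int
    impressions: int
    prior_a: float = 1.0  # assumed Beta prior pseudo-clicks (e.g. from content features)
    prior_b: float = 9.0  # assumed Beta prior pseudo-skips

def ucb_score(item: ItemStats, beta: float = 2.0) -> float:
    """Posterior mean CTR plus beta times the posterior standard deviation."""
    a = item.prior_a + item.clicks
    b = item.prior_b + (item.impressions - item.clicks)
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1.0))
    return mean + beta * math.sqrt(var)

items = [
    ItemStats("established", clicks=1000, impressions=10000),
    ItemStats("brand_new", clicks=0, impressions=0),
]
ranked = sorted(items, key=ucb_score, reverse=True)
```

Both items have the same posterior mean of 0.1, but the new item's wide posterior gives it a much larger bonus, so it ranks first until impressions shrink its uncertainty.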

Content prior strength: Define the content prior ratio:

$$ \rho\_{\text{content}} = \frac{\text{Var}(\mathbb{E}[\mathbf{v}\_i | \mathbf{c}\_i])}{\text{Var}(\mathbf{v}\_i)} $$

$\rho \to 1$: Content fully predicts quality (e.g., news headlines, product specs)
$\rho \to 0$: Content uninformative (e.g., abstract art, niche humor)

Sample complexity for new items: Achieving prediction accuracy $\epsilon$ requires:

$$ k\_i = O\left( \frac{d(1 - \rho\_{\text{content}})}{\epsilon^2} \right) \quad \text{impressions} $$

High $\rho_{\text{content}}$ (strong content features) dramatically reduces cold-start sample requirements.
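A quick back-of-the-envelope evaluation of the bound above. The constant hidden by $O(\cdot)$ is unknown, so the parameter `c` below is a placeholder set to 1 and the resulting counts are only meaningful relative to each other.

```python
import math

def impressions_needed(d: int, rho_content: float, epsilon: float, c: float = 1.0) -> int:
    """Evaluate c * d * (1 - rho) / epsilon^2; c stands in for the constant hidden by O(.)."""
    if not 0.0 <= rho_content <= 1.0:
        raise ValueError("rho_content must lie in [0, 1]")
    return math.ceil(c * d * (1.0 - rho_content) / epsilon ** 2)

weak_content = impressions_needed(d=64, rho_content=0.25, epsilon=0.5)    # 192
strong_content = impressions_needed(d=64, rho_content=0.75, epsilon=0.5)  # 64
```

Raising $\rho\_{\text{content}}$ from 0.25 to 0.75 cuts the requirement by a factor of three in this example, whatever the true constant is.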

Hybrid Approaches

Hybrid models combine collaborative signals (when available) with content-based features (always available). As interaction data accumulates, the model smoothly transitions from content-based to collaborative predictions.
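One simple way to realize this smooth transition, offered as a sketch rather than a standard recipe: weight the collaborative score by $\alpha = n / (n + k\_0)$, where $n$ is the number of observed interactions and $k\_0$ is a hypothetical half-saturation constant that would be tuned per product.

```python
def hybrid_score(content_score: float, collab_score: float,
                 n_interactions: int, k0: float = 20.0) -> float:
    """Blend content and collaborative scores; the collaborative weight grows with data."""
    alpha = n_interactions / (n_interactions + k0)
    return (1.0 - alpha) * content_score + alpha * collab_score

cold = hybrid_score(content_score=0.8, collab_score=0.2, n_interactions=0)      # pure content
warming = hybrid_score(content_score=0.8, collab_score=0.2, n_interactions=20)  # equal blend
warm = hybrid_score(content_score=0.8, collab_score=0.2, n_interactions=2000)   # mostly collaborative
```

Learned gating networks serve the same purpose in larger systems; the fixed schedule here just makes the transition explicit.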

Bayesian Cold-Start Framework

Cold-start can be formalized as Bayesian inference under uncertainty (Agarwal & Chen 2009; Stern et al. 2009). New users/items have unknown parameters; we maintain posterior distributions and update as data arrives.

Hierarchical Bayesian Model for New Users

Prior specification: New user $u$ has latent preference vector $\boldsymbol{\theta}\_u \in \mathbb{R}^d$. Without interaction data, use a population-level prior:

$$ \boldsymbol{\theta}\_u \sim \mathcal{N}(\boldsymbol{\mu}\_{\text{pop}}, \boldsymbol{\Sigma}\_{\text{pop}}) $$

where $\boldsymbol{\mu}\_{\text{pop}}$ and $\boldsymbol{\Sigma}\_{\text{pop}}$ are learned from existing users.

Hierarchical structure: For users with demographic features $\mathbf{x}\_u$, use feature-dependent priors:

$$ \boldsymbol{\theta}\_u | \mathbf{x}\_u \sim \mathcal{N}(\mathbf{W} \mathbf{x}\_u, \boldsymbol{\Sigma}) $$

where $\mathbf{W}$ maps demographics to expected preferences.

Posterior update: After observing interactions $\mathcal{D}\_u = \{(i\_1, r\_1), \ldots, (i\_k, r\_k)\}$, update via Bayes’ rule:

$$ P(\boldsymbol{\theta}\_u | \mathcal{D}\_u) \propto P(\mathcal{D}\_u | \boldsymbol{\theta}\_u) P(\boldsymbol{\theta}\_u | \mathbf{x}\_u) $$

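For linear Gaussian models, this yields closed-form posteriors. For matrix factorization with item embeddings $\mathbf{v}\_i$ held fixed and Gaussian observation noise of variance $\sigma^2$: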

$$ P(\boldsymbol{\theta}\_u | \mathcal{D}\_u) = \mathcal{N}(\boldsymbol{\mu}\_u^{\text{post}}, \boldsymbol{\Sigma}\_u^{\text{post}}) $$

where:

$$ \boldsymbol{\Sigma}\_u^{\text{post}} = \left( \boldsymbol{\Sigma}^{-1} + \sum\_{i \in \mathcal{D}\_u} \mathbf{v}\_i \mathbf{v}\_i^\top / \sigma^2 \right)^{-1} $$

$$ \boldsymbol{\mu}\_u^{\text{post}} = \boldsymbol{\Sigma}\_u^{\text{post}} \left( \boldsymbol{\Sigma}^{-1} \mathbf{W} \mathbf{x}\_u + \sum\_{i \in \mathcal{D}\_u} r\_i \mathbf{v}\_i / \sigma^2 \right) $$

Recommendation: For a new user, predict the expected rating:

$$ \hat{r}\_{ui} = \mathbb{E}[\boldsymbol{\theta}\_u^\top \mathbf{v}\_i | \mathcal{D}\_u] = \boldsymbol{\mu}\_u^{\text{post} \top} \mathbf{v}\_i $$

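Uncertainty quantification: The posterior covariance $\boldsymbol{\Sigma}\_u^{\text{post}}$ captures uncertainty. Items with high predictive variance $\mathbf{v}\_i^\top \boldsymbol{\Sigma}\_u^{\text{post}} \mathbf{v}\_i$ can be prioritized for exploration, for example with a UCB bonus or by sampling $\boldsymbol{\theta}\_u$ from the posterior (Thompson Sampling).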
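The closed-form update and prediction above translate almost line for line into NumPy. This is a sketch under the stated linear Gaussian assumptions; the function name, the two-dimensional toy data, and the unit noise variance are illustrative.

```python
import numpy as np

def posterior_update(prior_mean, prior_cov, item_vecs, ratings, noise_var=1.0):
    """Closed-form Gaussian posterior over theta_u given rated items.

    prior_mean: (d,) vector W x_u; prior_cov: (d, d) matrix Sigma;
    item_vecs: (k, d) matrix whose rows are v_i; ratings: (k,) vector of r_i.
    """
    prior_precision = np.linalg.inv(prior_cov)
    # V^T V equals the sum of outer products v_i v_i^T over rated items.
    post_precision = prior_precision + item_vecs.T @ item_vecs / noise_var
    post_cov = np.linalg.inv(post_precision)
    # V^T r equals the sum of r_i * v_i over rated items.
    post_mean = post_cov @ (prior_precision @ prior_mean + item_vecs.T @ ratings / noise_var)
    return post_mean, post_cov

d = 2
prior_mean = np.zeros(d)  # stands in for W x_u
prior_cov = np.eye(d)
V = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])  # three rated items along the first latent axis
r = np.array([1.0, 1.0, 1.0])
mu_post, cov_post = posterior_update(prior_mean, prior_cov, V, r)

new_item = np.array([0.5, 0.5])
predicted = mu_post @ new_item             # expected rating for an unseen item
pred_var = new_item @ cov_post @ new_item  # predictive variance contributed by theta_u
```

Because all three observations lie along the first latent axis, the posterior variance shrinks only in that direction; the second axis keeps its prior variance, which is exactly the kind of uncertainty an exploration policy would target.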

Sample Complexity Analysis

Question: How many interactions $k$ are needed before the cold-start user’s predictions match warm-start quality?

Bound: Under a linear Gaussian model, the prediction error decreases as:

$$ \mathbb{E}\left[ \|\hat{r}\_u - r\_u\|^2 \right] = O\left( \frac{d}{k} \right) $$

where $d$ is the latent dimension. Matching warm-start error therefore requires $k = \Omega(d)$ interactions.

Implication: High-dimensional user embeddings ($d > 100$) require substantial data before surpassing population priors. Content-based features reduce effective dimension.

Bayesian New Item Model

Similarly, for new item $i$ with content features $\mathbf{c}\_i$:

$$ \mathbf{v}\_i | \mathbf{c}\_i \sim \mathcal{N}(\mathbf{M} \mathbf{c}\_i, \mathbf{\Sigma}\_{\text{item}}) $$

where $\mathbf{M}$ is a learned content-to-embedding matrix.

Content-based prior strength: The ratio:

$$ \rho = \frac{\|\mathbf{M} \mathbf{c}\_i\|}{\|\boldsymbol{\Sigma}\_{\text{item}}\|} $$

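controls prior strength. High $\rho$ means the content-predicted mean is large relative to the residual uncertainty, so content features are informative; low $\rho$ means high uncertainty. Unlike $\rho\_{\text{content}}$ defined earlier, this signal-to-noise style ratio is not bounded by 1.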

Active learning: Select new items to show based on information gain:

$$ i^* = \arg\max\_i \text{IG}(\mathbf{v}\_i | \mathcal{D}) = \arg\max\_i \left[ H(\mathbf{v}\_i | \mathcal{D}\_{\text{before}}) - H(\mathbf{v}\_i | \mathcal{D}\_{\text{after}}) \right] $$

where $H(\cdot)$ is entropy. This favors items with high uncertainty whose embeddings will be refined most by feedback.
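For intuition, the selection rule can be sketched in a deliberately simplified setting: each item's appeal is treated as a scalar with a Gaussian posterior, so the entropy reduction from one noisy observation has the closed form $\tfrac{1}{2}\ln(1 + \tau^2/\sigma^2)$ for posterior variance $\tau^2$ and noise variance $\sigma^2$. The item names and variances below are made up.

```python
import math
from typing import Dict

def expected_info_gain(prior_var: float, noise_var: float = 1.0) -> float:
    """Entropy reduction (in nats) from one noisy observation of a scalar Gaussian."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    return 0.5 * math.log(prior_var / post_var)

def pick_item_to_explore(posterior_vars: Dict[str, float], noise_var: float = 1.0) -> str:
    """Return the item whose next impression is expected to be most informative."""
    return max(posterior_vars, key=lambda i: expected_info_gain(posterior_vars[i], noise_var))

# Hypothetical current posterior variances per item.
candidates = {"new_upload": 1.0, "week_old": 0.2, "established": 0.01}
chosen = pick_item_to_explore(candidates)
```

In this scalar case information gain is monotone in the current variance, so the rule reduces to exploring the most uncertain item; with vector embeddings the gain also depends on which users see the item.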

Practical Approximations

Full Bayesian inference is intractable at scale. Practical systems use:

MAP estimation: Replace full posterior with point estimate (mode)
Variational inference: Approximate posterior with simpler distribution (mean-field)
Particle filters: Represent posterior via Monte Carlo samples
Neural amortization: Train neural networks to predict posteriors directly from data

Internationalization and Cross-Market Challenges

Expanding to new countries or languages presents a form of “market cold start”—no local data, different user preferences, distinct content ecosystems. Global platforms must handle this systematically.

Language and Content Understanding

Multilingual embeddings:

Cross-lingual transfer: Train models on high-resource languages (English, Chinese), transfer to low-resource languages (Swahili, Tagalog)
Multilingual encoders: mBERT, XLM-R, LaBSE produce shared embedding spaces across 100+ languages
Language-specific fine-tuning: General multilingual models underperform language-specific models by 5-15% but are practical when labeled data is scarce

Translation challenges:

Machine translation quality varies (English↔Spanish: high quality; English↔Khmer: moderate)
Idiomatic expressions, slang, and cultural references don’t translate directly
Translation adds latency (50-200ms per item)

Content availability:

Content gap: Popular items in US may not exist in Indonesia
Local content bootstrapping: Incentivize local creators; can’t rely on cross-border content alone
Language imbalance: English content dominates; other languages are underrepresented

Cross-Cultural Preferences

User behavior varies significantly across cultures:

Dimension	Example Variation
Content preferences	US: short-form video; Japan: manga and text posts; India: regional language content
Engagement patterns	Some cultures share openly; others lurk and consume passively
Social graph structure	Tight family networks (Middle East) vs. loose acquaintance networks (US)
Trust signals	Verified badges matter more in low-trust environments

Implications:

Can’t train one global model and assume it works everywhere
Engagement metrics have different distributions across markets
Content moderation policies must respect local norms

Market-Specific Models vs. Shared Models

Approach	Pros	Cons
Single global model	Simplicity; cross-market transfer learning	Underperforms in each market; ignores cultural nuance
Per-market models	Optimized for local preferences	Requires data/compute per market; no transfer learning; cold start for new markets
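Market adapters	Shared backbone + lightweight market-specific layers; keeps cross-market transfer while allowing local tuning	More moving parts to train and serve; each adapter still needs local data for fine-tuning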

Market adapter pattern:

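import torch
from torch import nn

class MarketAdaptedRanker(nn.Module):
    def __init__(self, backbone: nn.Module, markets=("US", "IN", "BR"), dim: int = 512):
        super().__init__()  # must run before submodules are assigned
        # Shared across all markets, e.g. a Transformer encoder producing dim-sized outputs
        self.shared_backbone = backbone
        # nn.ModuleDict (not a plain dict) registers adapter weights as model parameters
        self.market_adapters = nn.ModuleDict({m: nn.Linear(dim, dim) for m in markets})
        self.score_head = nn.Linear(dim, 1)  # shared scoring head

    def forward(self, features: torch.Tensor, market: str) -> torch.Tensor:
        shared_repr = self.shared_backbone(features)
        adapted = self.market_adapters[market](shared_repr)
        return self.score_head(adapted)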

Training strategy: Pre-train on all markets; fine-tune adapters per-market.
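The fine-tuning half of that strategy might look like the sketch below, which freezes everything except one market's adapter. `TinyAdapterRanker` is a hypothetical, scaled-down stand-in for the market-adapted ranker so the example runs quickly; the random data, optimizer, and learning rate are likewise placeholders.

```python
import torch
from torch import nn

class TinyAdapterRanker(nn.Module):
    """Hypothetical small stand-in: shared backbone, per-market adapters, shared head."""
    def __init__(self, dim: int = 8, markets=("US", "IN", "BR")):
        super().__init__()
        self.shared_backbone = nn.Linear(dim, dim)
        self.market_adapters = nn.ModuleDict({m: nn.Linear(dim, dim) for m in markets})
        self.score_head = nn.Linear(dim, 1)

    def forward(self, features: torch.Tensor, market: str) -> torch.Tensor:
        return self.score_head(self.market_adapters[market](self.shared_backbone(features)))

def adapter_parameters(model: TinyAdapterRanker, market: str):
    """Freeze every parameter, then unfreeze only the chosen market's adapter."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.market_adapters[market].parameters():
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

torch.manual_seed(0)
model = TinyAdapterRanker()
optimizer = torch.optim.SGD(adapter_parameters(model, "IN"), lr=0.1)
backbone_before = model.shared_backbone.weight.detach().clone()
adapter_before = model.market_adapters["IN"].weight.detach().clone()

# One fine-tuning step on placeholder data for the "IN" market.
features, labels = torch.randn(16, 8), torch.rand(16, 1)
loss = nn.functional.mse_loss(model(features, "IN"), labels)
loss.backward()
optimizer.step()
```

After the step only the "IN" adapter has moved; the backbone, the scoring head, and the other markets' adapters keep their pre-trained weights, which is what lets one backbone serve every market.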

Regulatory and Infrastructure Challenges

Data residency: GDPR (EU) and LGPD (Brazil) restrict cross-border transfers of personal data, and some local laws require data to stay in-country. This fragments training data and complicates model updates.

Content regulations: What’s acceptable in one country may be illegal in another:

Political speech restrictions (China, Middle East)
Hate speech definitions vary
Misinformation standards differ

Infrastructure: Low-bandwidth regions (rural India, sub-Saharan Africa) require:

Lightweight models (quantized, distilled)
Aggressive caching
Offline-first design

Launch Strategy for New Markets

Phase 1: Pre-launch (before local users)

Deploy content-based models (no interaction data needed)
Bootstrap content from creators in similar markets
Translate popular global content
Set up local trust & safety team

Phase 2: Soft launch (limited users)

Invite local influencers and creators
Collect interaction data
Train initial collaborative models
Test culturally-specific features

Phase 3: Scale (general availability)

Switch to hybrid models (content + collaborative)
Deploy market-specific ranking
Monitor engagement, retention, creator health
Iterate based on local feedback

Metrics to track:

New user activation (% who engage in first session)
Creator supply (posts per day)
Content diversity (avoid relying on translated content)
Localization bugs (date/time formats, currency, RTL layout)

Cross-Border Content Recommendations

Should a US user see content from India? Depends:

Arguments for:

Serendipity: Exposure to global perspectives
Content supply: More content = better recommendations
Creator reach: Helps creators find international audiences

Arguments against:

Language barriers: Most users don’t want non-native language content
Cultural relevance: Content may not resonate
Latency: Fetching cross-region content adds latency

Heuristic: Allow cross-border for visual content (short videos, images with minimal text), restrict for text-heavy content unless user explicitly engages with that language.
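The heuristic can be written down as a small eligibility filter. The `Item` fields, the format labels, and the notion of "languages the user has engaged with" are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Set

VISUAL_FORMATS = {"short_video", "image"}  # assumed labels for low-text formats

@dataclass
class Item:
    country: str
    language: str
    content_format: str

def allow_cross_border(item: Item, user_country: str, engaged_languages: Set[str]) -> bool:
    """Apply the heuristic: visual content crosses borders, text needs a language match."""
    if item.country == user_country:
        return True  # domestic content is always eligible
    if item.content_format in VISUAL_FORMATS:
        return True  # visual content travels well across languages
    # Text-heavy content: only if the user has engaged with that language before.
    return item.language in engaged_languages

video = Item(country="IN", language="hi", content_format="short_video")
article = Item(country="IN", language="hi", content_format="article")
us_english_only = {"en"}
us_hindi_reader = {"en", "hi"}
```

Here `video` is eligible for any US user, while `article` is eligible only for the user whose engagement history includes Hindi.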
