Recommendation Systems Part 5: Implementation & Operations

Part 5 of 6

Case Study: Feed Ranking Architecture

The following synthesizes public disclosures from major platforms (Meta, TikTok, YouTube, Twitter/X) into a representative architecture.

Request Flow

sequenceDiagram
    participant Client
    participant Gateway
    participant Retrieval
    participant Ranking
    participant Reranking
    participant Collator

    Client->>Gateway: Feed request
    Gateway->>Retrieval: Get candidates (user_id, context)
    Retrieval-->>Gateway: 5000 candidates
    Gateway->>Ranking: Score candidates
    Ranking-->>Gateway: 500 scored items
    Gateway->>Reranking: Apply diversity, policy
    Reranking-->>Gateway: 100 items
    Gateway->>Collator: Mix organic + ads + notifications
    Collator-->>Gateway: 50 items
    Gateway-->>Client: Feed response (paginated)

Component Details

Component	Implementation
Retrieval	Two-tower model (user/item embeddings) + graph-based (friends’ posts) + trending
Ranking	Multi-task DCN with 100+ features; outputs P(click), P(like), P(share), P(hide), E(watch_time)
Re-ranking	MMR for diversity; policy filters; creator frequency caps
Feed Collator	Interleaves organic posts, ads (from separate auction), and system notifications
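
The MMR (maximal marginal relevance) step in the re-ranking row can be sketched as a greedy selection loop. This is a minimal illustration, not the implementation any platform has disclosed: the function names, the cosine similarity measure, and the default trade-off `lam=0.7` are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_rerank(relevance, embeddings, k, lam=0.7):
    """Greedy MMR: pick k items, trading relevance against redundancy.

    relevance:  dict item_id -> ranking score
    embeddings: dict item_id -> vector used to measure similarity
    lam:        1.0 = pure relevance, 0.0 = pure diversity (assumed default)
    """
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:
        def mmr_score(item):
            # Penalize similarity to the closest already-selected item.
            max_sim = max(
                (cosine(embeddings[item], embeddings[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[item] - (1 - lam) * max_sim
        best = max(sorted(remaining), key=mmr_score)  # sorted for deterministic ties
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: "a" and "b" are near-duplicates, "c" is different but less relevant.
relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
diverse = mmr_rerank(relevance, embeddings, k=2, lam=0.5)
greedy = mmr_rerank(relevance, embeddings, k=2, lam=1.0)
```

With `lam=0.5` the duplicate "b" is skipped in favor of "c"; with `lam=1.0` the loop reduces to plain sorting by relevance.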

Latency Budget

Stage	Target Latency (P99)
Feature fetch	10 ms
Retrieval	30 ms
Ranking (5,000 candidates in, 500 out)	50 ms
Re-ranking	10 ms
Total	< 150 ms

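The per-stage targets sum to 100 ms, which leaves about 50 ms of the end-to-end budget for work not itemized above, such as network hops and feed collation. Caching, precomputation, and model optimization keep end-to-end latency within budget.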

Ads Integration and Monetization Lanes

The feed collator orchestrates multiple content streams—organic recommendations, sponsored content, and system notifications—while respecting user experience constraints and revenue targets.

Ad Auction Hand-off

Ads flow through a separate auction system before reaching the feed collator:

Ad retrieval: Candidate ads are retrieved based on user targeting criteria (demographics, interests, lookalike audiences).
Ad ranking: A separate model predicts P(click) and P(conversion); bids are weighted by these predictions to give an expected value per thousand impressions (for a per-conversion bid, eCPM = 1000 × bid × P(click) × P(conversion)).
Auction: Generalized second-price or VCG auction determines winning ads and prices.
Eligibility filtering: Policy checks (frequency caps, brand safety, user opt-outs) filter ineligible ads.
Hand-off to feed collator: Winning ads with their auction prices are passed to the feed collator.
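
The ranking and pricing steps above can be sketched as a generalized second-price auction. This is a simplified illustration under stated assumptions: bids are per conversion, the ranking score is bid × P(click) × P(conversion) (proportional to the eCPM expression in the text, which is all ranking needs), and `AdCandidate`, `run_gsp_auction`, and `reserve_score` are hypothetical names. Real auctions add reserve prices, quality adjustments, and pacing.

```python
from dataclasses import dataclass

@dataclass
class AdCandidate:
    ad_id: str
    bid: float            # advertiser bid per conversion (assumed unit)
    p_click: float
    p_conversion: float

    @property
    def quality(self) -> float:
        """Predicted conversions per impression."""
        return self.p_click * self.p_conversion

    @property
    def expected_value(self) -> float:
        """Expected revenue per impression; proportional to eCPM."""
        return self.bid * self.quality

def run_gsp_auction(candidates, num_slots, reserve_score=0.0):
    """Rank by expected value; each winner pays the smallest bid that keeps its rank."""
    ranked = sorted(candidates, key=lambda ad: ad.expected_value, reverse=True)
    ranked = [ad for ad in ranked if ad.expected_value > reserve_score and ad.quality > 0]
    winners = []
    for i, ad in enumerate(ranked[:num_slots]):
        # Score of the next-ranked ad (or the reserve if there is none).
        next_score = ranked[i + 1].expected_value if i + 1 < len(ranked) else reserve_score
        price = next_score / ad.quality  # bid at which this ad would tie the next one
        winners.append((ad.ad_id, price))
    return winners

ads = [
    AdCandidate("A", bid=10.0, p_click=0.10, p_conversion=0.50),
    AdCandidate("B", bid=8.0, p_click=0.20, p_conversion=0.25),
    AdCandidate("C", bid=20.0, p_click=0.05, p_conversion=0.20),
]
winners = run_gsp_auction(ads, num_slots=2)
```

In this toy run "A" wins the first slot and pays about 8.0 per conversion (the bid at which it would tie "B"), and "B" pays about 4.0. Both prices are below the winners' own bids, which is the defining property of second-price rules.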

Blending Rules and Constraints

The feed collator applies a rule-based policy to interleave content types:

Constraint	Rule	Rationale
Ad load cap	Max 1 ad per 5 organic posts	Preserve user experience
Ad spacing	Minimum 3 organic posts between ads	Avoid ad fatigue
First-position protection	No ads in positions 1–2	Ensure organic engagement on session start
Notification slots	Reserved positions (e.g., position 4) for system notifications	Guarantee delivery of friend requests, milestones
Session pacing	Reduce ad frequency after N ads shown	Diminishing returns on ad attention

These rules are typically expressed as hard constraints in an integer program or enforced via greedy slot allocation.
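
A greedy slot allocator for the constraints in the table might look like the sketch below. The rule interpretations are assumptions: "max 1 ad per 5 organic posts" is treated as a running ratio, positions are 1-indexed, and the function and parameter names are hypothetical. Session pacing is omitted for brevity.

```python
def blend_feed(organic, ads, notifications, length,
               first_ad_position=3, min_gap=3, organic_per_ad=5,
               notification_position=4):
    """Fill 1-indexed positions greedily, placing an ad only when every rule allows it."""
    organic, ads, notifications = list(organic), list(ads), list(notifications)
    feed = []
    organic_count = ad_count = 0
    organic_since_ad = min_gap  # no previous ad, so the gap rule starts satisfied
    for position in range(1, length + 1):
        # Reserved notification slot takes priority over everything else.
        if position == notification_position and notifications:
            feed.append(("notification", notifications.pop(0)))
            continue
        ad_allowed = (
            bool(ads)
            and position >= first_ad_position                       # first-position protection
            and organic_since_ad >= min_gap                         # ad spacing
            and (ad_count + 1) * organic_per_ad <= organic_count    # ad load cap
        )
        if ad_allowed:
            feed.append(("ad", ads.pop(0)))
            ad_count += 1
            organic_since_ad = 0
        elif organic:
            feed.append(("organic", organic.pop(0)))
            organic_count += 1
            organic_since_ad += 1
        else:
            break  # out of organic content; stop rather than stack ads
    return feed

feed = blend_feed(
    organic=[f"o{i}" for i in range(1, 13)],
    ads=["a1", "a2"],
    notifications=["n1"],
    length=14,
)
kinds = [kind for kind, _ in feed]
```

Under this interpretation the load cap is the binding rule, so the first ad lands at position 7 and the second at position 13. An integer-program formulation would instead optimize placement globally, at higher latency cost.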

Exposure Caps and Frequency Management

Multiple cap types prevent over-exposure:

Per-ad caps: Maximum impressions per user per day (e.g., 3)
Per-advertiser caps: Maximum impressions across all ads from one advertiser
Category caps: Limit ads from sensitive categories (e.g., max 1 political ad per session)
Global ad load: Platform-wide target (e.g., 15% of impressions are ads)

Cap state is maintained in low-latency stores (Redis, Memcached) and checked at both auction and blending time.
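
The per-ad and per-advertiser checks can be illustrated with an in-memory counter standing in for the low-latency store. Everything here is a sketch: `FrequencyCapStore`, the key layout, and the per-advertiser cap of 10 are assumptions for the example, and a production system would use the store's atomic increment and key expiry rather than a Python dict.

```python
class FrequencyCapStore:
    """In-memory stand-in for a store such as Redis or Memcached (illustrative only)."""

    def __init__(self, per_ad_daily_cap=3, per_advertiser_daily_cap=10):
        self.per_ad_daily_cap = per_ad_daily_cap
        self.per_advertiser_daily_cap = per_advertiser_daily_cap
        self._counts = {}  # key -> impressions so far

    @staticmethod
    def _key(scope, user_id, entity_id, day):
        # Including the day in the key makes counters reset naturally at day boundaries.
        return f"cap:{scope}:{user_id}:{entity_id}:{day}"

    def is_eligible(self, user_id, ad_id, advertiser_id, day):
        """True if showing this ad would stay within both caps."""
        ad_seen = self._counts.get(self._key("ad", user_id, ad_id, day), 0)
        adv_seen = self._counts.get(self._key("advertiser", user_id, advertiser_id, day), 0)
        return (ad_seen < self.per_ad_daily_cap
                and adv_seen < self.per_advertiser_daily_cap)

    def record_impression(self, user_id, ad_id, advertiser_id, day):
        """Increment both counters after the ad is actually shown."""
        for key in (self._key("ad", user_id, ad_id, day),
                    self._key("advertiser", user_id, advertiser_id, day)):
            self._counts[key] = self._counts.get(key, 0) + 1

store = FrequencyCapStore(per_ad_daily_cap=3, per_advertiser_daily_cap=10)
for _ in range(3):
    store.record_impression("u1", "ad42", "acme", "2024-01-01")
```

Calling `is_eligible` at both auction and blending time, as the text describes, guards against an ad winning the auction and then being capped by an impression recorded in between.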

Budget Pacing

Advertisers set daily or campaign budgets. The system paces spend to avoid exhausting budgets early:

$$ m_{\text{pace}}(t) = \frac{B_{\text{remaining}}}{I_{\text{expected}}(t)} $$

where $m_{\text{pace}}$ is the pacing multiplier, $B_{\text{remaining}}$ is remaining budget, and $I_{\text{expected}}(t)$ is expected remaining impressions at time $t$.

Bids are scaled by the pacing multiplier, throttling delivery when ahead of pace and accelerating when behind. Pacing algorithms balance budget utilization against opportunity cost of missing high-value impressions.
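
One way to turn the pacing formula into a bid adjustment is sketched below. As written, the formula yields a spend target per expected impression; dividing it by the average cost per impression observed so far gives a dimensionless multiplier. That normalization, the clamp at `max_multiplier`, and all function names are assumptions added for this example, not something the formula specifies.

```python
def target_spend_per_impression(budget_remaining, expected_remaining_impressions):
    """The formula above: remaining budget divided by expected remaining impressions."""
    if expected_remaining_impressions <= 0:
        return 0.0
    return budget_remaining / expected_remaining_impressions

def paced_bid(base_bid, budget_remaining, expected_remaining_impressions,
              avg_cost_per_impression, max_multiplier=2.0):
    """Scale a bid down when spending ahead of pace and up when behind.

    Assumption: multiplier = target spend per impression / observed average
    cost per impression, clamped to [0, max_multiplier].
    """
    if budget_remaining <= 0:
        return 0.0          # budget exhausted: stop bidding
    if avg_cost_per_impression <= 0:
        return base_bid     # no spend history yet: bid unadjusted
    target = target_spend_per_impression(budget_remaining, expected_remaining_impressions)
    multiplier = min(max_multiplier, target / avg_cost_per_impression)
    return base_bid * multiplier

# Ahead of pace: only 0.005 per impression left to spend, but paying 0.01 on average.
throttled = paced_bid(1.0, budget_remaining=50.0,
                      expected_remaining_impressions=10_000,
                      avg_cost_per_impression=0.01)
# Behind pace: 0.03 per impression available, so the multiplier hits the clamp.
boosted = paced_bid(1.0, budget_remaining=300.0,
                    expected_remaining_impressions=10_000,
                    avg_cost_per_impression=0.01)
```

The clamp keeps a campaign that is far behind pace from bidding without limit, which is one simple way to express the opportunity-cost trade-off mentioned above.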

User Experience Guardrails

Beyond ad-specific rules, guardrails protect overall feed quality:

Engagement floor: If predicted engagement for a blended feed falls below threshold, reduce ad load
Diversity requirements: Ads cannot displace diversity-critical organic content (e.g., posts from close friends)
Scroll velocity triggers: Adjust ad frequency based on scroll velocity (fast scrollers see fewer ads)
Session quality signals: If user shows disengagement signals (quick bounces, hides), reduce ad load for remainder of session

flowchart TB
    subgraph Organic Pipeline
        Retrieval --> Ranking --> Reranking
    end

    subgraph Ads Pipeline
        AdRetrieval[Ad Retrieval] --> AdRanking[Ad Ranking] --> Auction
    end

    Reranking --> Collator
    Auction --> Collator
    Notifications --> Collator
    Collator --> Constraints[Constraint Checker]
    Constraints --> Feed[Final Feed]

Revenue vs. Experience Trade-offs

The feed collator balances competing objectives:

$$ \text{CollatorScore} = \alpha \cdot \text{OrganicValue} + \beta \cdot \text{AdRevenue} - \gamma \cdot \text{ExperiencePenalty} $$

where OrganicValue captures long-term engagement (retention proxy), AdRevenue is immediate monetization, and ExperiencePenalty penalizes aggressive ad loads. Weights ($\alpha$, $\beta$, $\gamma$) are tuned via online experiments measuring long-term user outcomes.
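
As a small worked example, the score can be evaluated for a few candidate blends and the highest-scoring one chosen. The input values below are made up purely to show the arithmetic, and the names `collator_score` and `candidate_blends` are hypothetical.

```python
def collator_score(organic_value, ad_revenue, experience_penalty,
                   alpha=1.0, beta=1.0, gamma=1.0):
    """CollatorScore = alpha * OrganicValue + beta * AdRevenue - gamma * ExperiencePenalty."""
    return alpha * organic_value + beta * ad_revenue - gamma * experience_penalty

# Illustrative (invented) estimates for three ad-load options.
candidate_blends = {
    "no_ads": {"organic_value": 10.0, "ad_revenue": 0.0, "experience_penalty": 0.0},
    "light":  {"organic_value": 9.5,  "ad_revenue": 2.0, "experience_penalty": 0.5},
    "heavy":  {"organic_value": 8.0,  "ad_revenue": 4.0, "experience_penalty": 3.0},
}

scores = {name: collator_score(**terms) for name, terms in candidate_blends.items()}
best_blend = max(scores, key=scores.get)

# Raising beta shifts the balance toward revenue.
revenue_heavy = {
    name: collator_score(**terms, beta=3.0) for name, terms in candidate_blends.items()
}
```

With equal weights the light blend scores 11.0 against 10.0 and 9.0. With `beta=3.0` the heavy blend scores 17.0 and overtakes it, which is why the weights are tuned against long-term outcomes rather than set by hand.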

Implementation Example: Two-Tower Retrieval

The following Python code sketches a two-tower model for candidate retrieval using TensorFlow/Keras conventions.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TwoTowerModel(keras.Model):
    def __init__(self, user_feature_dim, item_feature_dim, embedding_dim=128):
        super().__init__()

        # User tower
        self.user_encoder = keras.Sequential([
            layers.Dense(256, activation='relu'),
            layers.Dense(128, activation='relu'),
            layers.Dense(embedding_dim),
            layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))
        ])

        # Item tower
        self.item_encoder = keras.Sequential([
            layers.Dense(256, activation='relu'),
            layers.Dense(128, activation='relu'),
            layers.Dense(embedding_dim),
            layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))
        ])

        self.temperature = tf.Variable(0.05, trainable=True)

    def call(self, inputs):
        user_features, item_features = inputs
        user_embedding = self.user_encoder(user_features)
        item_embedding = self.item_encoder(item_features)
        return user_embedding, item_embedding

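    def compute_loss(self, user_emb, item_emb):
        # NOTE: this shadows keras.Model.compute_loss, which has a different
        # signature; it is meant for the custom training loop below, not model.fit().
        # In-batch negatives: each item in the batch serves as a negative for the other users.
        # scores[i, j] = similarity(user i, item j) / temperature
        scores = tf.matmul(user_emb, item_emb, transpose_b=True) / self.temperature
        batch_size = tf.shape(scores)[0]
        labels = tf.range(batch_size)  # Diagonal entries are the positive pairs
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=scores)
        return tf.reduce_mean(loss)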

# Training loop sketch
model = TwoTowerModel(user_feature_dim=64, item_feature_dim=128)
optimizer = keras.optimizers.Adam(learning_rate=1e-3)

for batch in training_dataset:
    user_features, item_features = batch
    with tf.GradientTape() as tape:
        user_emb, item_emb = model([user_features, item_features])
        loss = model.compute_loss(user_emb, item_emb)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

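# After training: build ANN index over item embeddings.
# all_item_features and build_hnsw_index are placeholders, not library APIs;
# substitute an index builder from, e.g., FAISS or ScaNN.
item_embeddings = model.item_encoder(all_item_features)
ann_index = build_hnsw_index(item_embeddings)

# Serving: embed user, query ANN index
def retrieve_candidates(user_features, k=1000):
    user_emb = model.user_encoder(user_features)
    # Return order depends on the library: FAISS's search() returns
    # (distances, indices), so check before unpacking.
    candidates, scores = ann_index.search(user_emb, k)
    return candidates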

Implementation Example: Ranking Model

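A simplified ranking model with multi-task outputs. A shared MLP stands in for the DCN cross layers, and the imports from the previous example are reused: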

class RankingModel(keras.Model):
    def __init__(self, feature_dim, hidden_dims=(512, 256, 128)):
        super().__init__()

        # Shared layers
        self.shared = keras.Sequential([
            layers.Dense(dim, activation='relu')
            for dim in hidden_dims
        ])

        # Task-specific heads
        self.click_head = layers.Dense(1, activation='sigmoid', name='p_click')
        self.like_head = layers.Dense(1, activation='sigmoid', name='p_like')
        self.share_head = layers.Dense(1, activation='sigmoid', name='p_share')
        self.hide_head = layers.Dense(1, activation='sigmoid', name='p_hide')
        self.watch_head = layers.Dense(1, activation='linear', name='watch_time')

    def call(self, features):
        shared_repr = self.shared(features)
        return {
            'p_click': self.click_head(shared_repr),
            'p_like': self.like_head(shared_repr),
            'p_share': self.share_head(shared_repr),
            'p_hide': self.hide_head(shared_repr),
            'watch_time': self.watch_head(shared_repr)
        }

    def compute_final_score(self, predictions, weights):
        """Combine task predictions into final ranking score."""
        score = (
            weights['click'] * predictions['p_click'] +
            weights['like'] * predictions['p_like'] +
            weights['share'] * predictions['p_share'] -
            weights['hide'] * predictions['p_hide'] +
            # The linear head can go negative; clamp at 0 so log1p stays finite
            weights['watch'] * tf.math.log1p(tf.nn.relu(predictions['watch_time']))
        )
        return score

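# Scoring at serving time (ranking_model is a trained RankingModel instance;
# candidate_features has shape [num_candidates, feature_dim])
weights = {'click': 1.0, 'like': 2.0, 'share': 3.0, 'hide': 5.0, 'watch': 0.1}
predictions = ranking_model(candidate_features)
scores = ranking_model.compute_final_score(predictions, weights)
# Each head outputs shape [num_candidates, 1]; squeeze to 1-D so argsort
# orders candidates instead of sorting along the size-1 last axis.
ranked_candidates = tf.argsort(tf.squeeze(scores, axis=-1), direction='DESCENDING')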

Monitoring and Observability

Recommendation systems fail silently. Unlike a crashed service that triggers immediate alerts, a degraded model continues serving predictions—just bad ones. Users see irrelevant content, engagement drops gradually, and by the time anyone notices, millions of poor experiences have already shipped.

Why recommendation monitoring is uniquely hard:

No ground truth at serving time: You predict P(click) = 0.3, but you won’t know if you were right until the user acts (or doesn’t). Feedback arrives minutes to hours later.
Counterfactual blindness: You only observe outcomes for items you showed. If the model stops recommending a category, you’ll never see engagement on those items—so you can’t tell if the model is wrong.
Slow degradation: A feature pipeline breaking doesn’t crash the system; it just makes predictions worse. A model drifting out of calibration costs engagement but doesn’t page anyone.
Interaction effects: Problems emerge from combinations. A new item embedding + a changed feature distribution + a shifted user population might each look fine in isolation.

What to monitor:

Layer	What Can Go Wrong	Observable Signal
Infrastructure	Latency spikes, partial outages, memory pressure	P99 latency, error rates, resource utilization
Feature pipeline	Stale features, missing values, distribution shift	Feature freshness, null rates, PSI (Population Stability Index) drift scores
Model	Calibration drift, prediction collapse, segment degradation	Predicted vs actual, score distributions, per-cohort metrics
Business	User dissatisfaction, creator churn, revenue drop	Engagement rates, retention curves, revenue per session

The goal is catching problems at the earliest layer—a feature going stale should alert before it causes calibration drift, which should alert before engagement drops.
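
The PSI (Population Stability Index) drift score listed for the feature pipeline layer can be computed as in the sketch below. It bins a baseline sample, applies the same bins to a recent sample, and sums (actual − expected) × ln(actual / expected) over bins. The equal-width binning, the `eps` floor for empty bins, and the function name are assumptions made for the example.

```python
import math

def psi(expected, actual, num_bins=10, eps=1e-6):
    """Population Stability Index of `actual` relative to the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width for constant features

    def bin_fractions(values):
        counts = [0] * num_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), num_bins - 1)] += 1  # clamp out-of-range values
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]        # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]   # mass moved into the upper half
no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, though thresholds are usually tuned per feature.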

Who monitors what:

Role	Primary Focus	Typical Response
ML Platform on-call	Infrastructure health, serving latency, feature pipeline freshness	Rollback, failover, pipeline restart
Ranking team	Model calibration, prediction distributions, A/B test guardrails	Model rollback, emergency retraining, feature investigation
Product/Growth	Business metrics, funnel conversion, segment-level engagement	Escalate to ranking, pause experiments, adjust objectives
Trust & Safety	Policy violations, integrity signals, harmful content amplification	Content takedown, model constraints, ranking suppression
Data Science	Long-term trends, cohort analysis, causal attribution	Root cause analysis, strategic recommendations

In practice, these boundaries blur. A CTR drop might start with Growth noticing a metric, escalate to Ranking for model investigation, trace back to Platform finding a stale feature, and ultimately require Data Science to understand the causal chain.

Model Metrics

Metric	Description	Alert Threshold
Prediction distribution	Histogram of model scores	Shift from baseline
Calibration	Predicted vs actual engagement rate	> 5% deviation
Feature coverage	Fraction of requests with all features present	< 99%
Inference latency	P50/P95/P99 model inference time	> SLA
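
The calibration row can be checked with a simple ratio of mean predicted probability to observed rate over a window of logged impressions. The sketch below reads the "> 5% deviation" threshold as a relative deviation of that ratio from 1.0, which is an assumption; the function names are hypothetical.

```python
def calibration_ratio(predicted_probs, outcomes):
    """Mean predicted rate divided by observed rate; 1.0 means well calibrated on average."""
    if not predicted_probs or len(predicted_probs) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    mean_predicted = sum(predicted_probs) / len(predicted_probs)
    observed_rate = sum(outcomes) / len(outcomes)
    if observed_rate == 0:
        # No positives observed: any positive prediction is over-prediction.
        return float("inf") if mean_predicted > 0 else 1.0
    return mean_predicted / observed_rate

def is_miscalibrated(predicted_probs, outcomes, tolerance=0.05):
    """True if the calibration ratio deviates from 1.0 by more than `tolerance`."""
    return abs(calibration_ratio(predicted_probs, outcomes) - 1.0) > tolerance

# 100 impressions with 30 clicks.
outcomes = [1] * 30 + [0] * 70
well_calibrated = is_miscalibrated([0.3] * 100, outcomes)   # predicts 30%
over_predicting = is_miscalibrated([0.4] * 100, outcomes)   # predicts 40%
```

An aggregate ratio can hide offsetting errors across segments, so the same check is typically run per cohort or per score bucket as well.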

Business Metrics

Metric	Description	Alert Threshold
CTR	Click-through rate	> 10% drop
Engagement rate	Likes + comments + shares per impression	> 10% drop
Session duration	Average time spent per session	> 10% drop
Diversity	Unique creators/topics per user per day	< baseline
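
The "> 10% drop" thresholds can be evaluated as a relative change against a baseline window. This is a minimal sketch: the metric names, the sample values, and the choice of baseline are illustrative assumptions, and a production check would also account for seasonality and statistical noise.

```python
def relative_drop(baseline, current):
    """Fractional decrease from baseline; negative values mean the metric rose."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline

def metrics_breaching(baseline_metrics, current_metrics, max_drop=0.10):
    """Names of metrics whose relative drop exceeds max_drop, sorted for stable output."""
    return sorted(
        name
        for name, base in baseline_metrics.items()
        if name in current_metrics
        and relative_drop(base, current_metrics[name]) > max_drop
    )

# Invented example values for one comparison window.
baseline = {"ctr": 0.050, "engagement_rate": 0.020, "session_duration_s": 600.0}
current = {"ctr": 0.044, "engagement_rate": 0.019, "session_duration_s": 520.0}
breaches = metrics_breaching(baseline, current)
```

Here CTR fell 12% and session duration about 13%, so both breach the threshold, while the 5% fall in engagement rate does not.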

Dashboards and Alerts

flowchart LR
    Services --> Metrics[Metrics Collector]
    Metrics --> TSDB[(Time-series DB)]
    TSDB --> Grafana[Dashboards]
    TSDB --> Alerts[Alert Manager]
    Alerts --> Oncall[On-call Engineer]
