Recommendation Systems Part 5: Implementation & Operations

Part 5 of 6 | ← Part 4: Ethics & Safety | Part 6: Advanced Topics →

Case Study: Feed Ranking Architecture

The following synthesizes public disclosures from major platforms (Meta, TikTok, YouTube, Twitter/X) into a representative architecture.

Request Flow

sequenceDiagram
    participant Client
    participant Gateway
    participant Retrieval
    participant Ranking
    participant Reranking
    participant Collator

    Client->>Gateway: Feed request
    Gateway->>Retrieval: Get candidates (user_id, context)
    Retrieval-->>Gateway: 5000 candidates
    Gateway->>Ranking: Score candidates
    Ranking-->>Gateway: 500 scored items
    Gateway->>Reranking: Apply diversity, policy
    Reranking-->>Gateway: 100 items
    Gateway->>Collator: Mix organic + ads + notifications
    Collator-->>Gateway: 50 items
    Gateway-->>Client: Feed response (paginated)

Component Details

| Component | Implementation |
|---|---|
| Retrieval | Two-tower model (user/item embeddings) + graph-based (friends’ posts) + trending |
| Ranking | Multi-task DCN with 100+ features; outputs P(click), P(like), P(share), P(hide), E(watch_time) |
| Re-ranking | MMR for diversity; policy filters; creator frequency caps |
| Feed Collator | Interleaves organic posts, ads (from separate auction), and system notifications |
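
To make the re-ranking row concrete, here is a minimal sketch of greedy Maximal Marginal Relevance (MMR) in plain Python. The tuple layout `(item_id, relevance, embedding)`, the `lam` trade-off weight, and the use of cosine similarity are illustrative assumptions, not details taken from any platform's disclosure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_rerank(items, k, lam=0.7):
    """Greedy MMR over `items`, a list of (item_id, relevance, embedding).

    Each step picks the item maximizing
        lam * relevance - (1 - lam) * max similarity to already-selected items.
    """
    selected = []
    remaining = list(items)
    while remaining and len(selected) < k:
        def mmr_score(cand):
            _, rel, emb = cand
            max_sim = max((cosine(emb, s[2]) for s in selected), default=0.0)
            return lam * rel - (1.0 - lam) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [item_id for item_id, _, _ in selected]

candidates = [
    ("a", 0.90, [1.0, 0.0]),
    ("b", 0.85, [1.0, 0.0]),  # near-duplicate of "a"
    ("c", 0.60, [0.0, 1.0]),  # different topic
]
print(mmr_rerank(candidates, k=2))  # ['a', 'c']
```

With `lam=1.0` the diversity term vanishes and the result is a plain relevance sort, which is a handy sanity check when tuning.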

Latency Budget

| Stage | Target Latency (P99) |
|---|---|
| Feature fetch | 10 ms |
| Retrieval | 30 ms |
| Ranking (500 items) | 50 ms |
| Re-ranking | 10 ms |
| Total | < 150 ms |

Caching, precomputation, and model optimization keep end-to-end latency within budget.
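
As a rough illustration of how the budget can be checked, the sketch below sums the stage targets from the table and flags stages whose observed latency exceeds its target. Treating the stage targets as additive is a simplification (tail latencies do not add exactly), and the function names and observed values are hypothetical.

```python
# Stage targets in milliseconds, copied from the latency budget table.
STAGE_BUDGET_MS = {
    "feature_fetch": 10,
    "retrieval": 30,
    "ranking": 50,
    "reranking": 10,
}
TOTAL_BUDGET_MS = 150

def headroom_ms(stage_budget, total_budget):
    """Budget left for everything not listed (network, collation, serialization)."""
    return total_budget - sum(stage_budget.values())

def over_budget_stages(observed_ms, stage_budget):
    """Names of stages whose observed latency exceeds its target."""
    return sorted(
        stage for stage, ms in observed_ms.items()
        if ms > stage_budget.get(stage, float("inf"))
    )

print(headroom_ms(STAGE_BUDGET_MS, TOTAL_BUDGET_MS))  # 50
print(over_budget_stages({"retrieval": 42, "ranking": 48}, STAGE_BUDGET_MS))  # ['retrieval']
```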

Ads Integration and Monetization Lanes

The feed collator orchestrates multiple content streams—organic recommendations, sponsored content, and system notifications—while respecting user experience constraints and revenue targets.

Ad Auction Hand-off

Ads flow through a separate auction system before reaching the feed collator:

  1. Ad retrieval: Candidate ads are retrieved based on user targeting criteria (demographics, interests, lookalike audiences).
  2. Ad ranking: A separate model predicts P(click) and P(conversion | click); each bid is converted to an expected value per impression (for a per-conversion bid, bid × P(click) × P(conversion | click); eCPM is that value × 1,000).
  3. Auction: Generalized second-price or VCG auction determines winning ads and prices.
  4. Eligibility filtering: Policy checks (frequency caps, brand safety, user opt-outs) filter ineligible ads.
  5. Hand-off to feed collator: Winning ads with their auction prices are passed to the feed collator.
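
The ranking and auction steps above can be sketched as a small generalized second-price (GSP) auction. This is a simplified illustration, assuming per-conversion bids and a hypothetical `AdCandidate` record; real systems add reserve prices per slot, position effects, and policy filtering.

```python
from dataclasses import dataclass

@dataclass
class AdCandidate:
    ad_id: str
    bid: float           # assumed to be a bid per conversion
    p_click: float
    p_conversion: float  # P(conversion | click)

    @property
    def quality(self):
        """Probability that one impression ends in a conversion."""
        return self.p_click * self.p_conversion

    @property
    def expected_value(self):
        """Expected value per impression (multiply by 1,000 for eCPM)."""
        return self.bid * self.quality

def run_gsp_auction(candidates, num_slots, reserve=0.0):
    """Rank by expected value; each winner pays the smallest bid that would
    still beat the next-ranked ad (or the reserve, if there is none)."""
    ranked = sorted(candidates, key=lambda ad: ad.expected_value, reverse=True)
    winners = []
    for i, ad in enumerate(ranked[:num_slots]):
        if ad.quality <= 0 or ad.expected_value < reserve:
            break
        next_value = ranked[i + 1].expected_value if i + 1 < len(ranked) else reserve
        price = max(next_value, reserve) / ad.quality
        winners.append((ad.ad_id, price))
    return winners

ads = [
    AdCandidate("A", bid=10.0, p_click=0.05, p_conversion=0.1),
    AdCandidate("B", bid=20.0, p_click=0.02, p_conversion=0.1),
    AdCandidate("C", bid=5.0, p_click=0.04, p_conversion=0.1),
]
for ad_id, price in run_gsp_auction(ads, num_slots=2):
    print(ad_id, round(price, 2))
```

Note that "B" bids more than "A" but ranks lower because its predicted click rate is smaller, and that no winner pays more than its own bid.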

Blending Rules and Constraints

The feed collator applies a rule-based policy to interleave content types:

| Constraint | Rule | Rationale |
|---|---|---|
| Ad load cap | Max 1 ad per 5 organic posts | Preserve user experience |
| Ad spacing | Minimum 3 organic posts between ads | Avoid ad fatigue |
| First-position protection | No ads in positions 1–2 | Ensure organic engagement on session start |
| Notification slots | Reserved positions (e.g., position 4) for system notifications | Guarantee delivery of friend requests, milestones |
| Session pacing | Reduce ad frequency after N ads shown | Diminishing returns on ad attention |

These rules are typically expressed as hard constraints in an integer program or enforced via greedy slot allocation.
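
A minimal sketch of the greedy slot-allocation option follows. It encodes four of the rules from the table (first-position protection, ad spacing, ad load cap, one reserved notification slot); the function name, parameters, and the decision to fill positions strictly in order are assumptions made for illustration.

```python
def blend_feed(organic, ads, notifications, feed_len=20,
               first_ad_position=3, min_gap=3, organic_per_ad=5,
               notification_position=4):
    """Fill positions 1..feed_len in order, placing an ad only when every
    hard constraint allows it and falling back to organic otherwise."""
    feed = []
    organic_q, ad_q, notif_q = list(organic), list(ads), list(notifications)
    organic_count = ad_count = 0
    organic_since_ad = min_gap  # spacing rule does not block the first ad
    while len(feed) < feed_len and (organic_q or notif_q):
        position = len(feed) + 1
        if position == notification_position and notif_q:
            feed.append(("notification", notif_q.pop(0)))
            continue
        ad_allowed = (
            bool(ad_q)
            and position >= first_ad_position                      # no ads in positions 1-2
            and organic_since_ad >= min_gap                        # spacing between ads
            and (ad_count + 1) * organic_per_ad <= organic_count   # load cap
        )
        if ad_allowed:
            feed.append(("ad", ad_q.pop(0)))
            ad_count += 1
            organic_since_ad = 0
        elif organic_q:
            feed.append(("organic", organic_q.pop(0)))
            organic_count += 1
            organic_since_ad += 1
        else:
            break
    return feed

feed = blend_feed([f"post{i}" for i in range(12)], ["ad1", "ad2", "ad3"], ["notif1"], feed_len=15)
ad_positions = [i + 1 for i, (kind, _) in enumerate(feed) if kind == "ad"]
print(ad_positions)  # [7, 13]
```

Under these settings the load cap, not the spacing rule, is the binding constraint: the first ad appears only after five organic posts have been shown.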

Exposure Caps and Frequency Management

Multiple cap types prevent over-exposure:

  • Per-ad caps: Maximum impressions per user per day (e.g., 3)
  • Per-advertiser caps: Maximum impressions across all ads from one advertiser
  • Category caps: Limit ads from sensitive categories (e.g., max 1 political ad per session)
  • Global ad load: Platform-wide target (e.g., 15% of impressions are ads)

Cap state is maintained in low-latency stores (Redis, Memcached) and checked at both auction and blending time.
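
The cap check itself is simple; the sketch below uses an in-memory dictionary as a stand-in for Redis or Memcached so that it runs without external services. The class name, key layout, and the per-advertiser cap of 6 are hypothetical; only the per-ad cap of 3 per day comes from the example above.

```python
class FrequencyCapStore:
    """In-memory stand-in for a low-latency counter store (assumption:
    a production system would use atomic increments with per-day expiry)."""

    def __init__(self, per_ad_cap=3, per_advertiser_cap=6):
        self.per_ad_cap = per_ad_cap
        self.per_advertiser_cap = per_advertiser_cap
        self._counts = {}

    @staticmethod
    def _keys(user_id, ad_id, advertiser_id, day):
        # One counter per (user, ad, day) and one per (user, advertiser, day).
        return (
            ("ad", user_id, ad_id, day),
            ("advertiser", user_id, advertiser_id, day),
        )

    def is_eligible(self, user_id, ad_id, advertiser_id, day):
        ad_key, advertiser_key = self._keys(user_id, ad_id, advertiser_id, day)
        return (
            self._counts.get(ad_key, 0) < self.per_ad_cap
            and self._counts.get(advertiser_key, 0) < self.per_advertiser_cap
        )

    def record_impression(self, user_id, ad_id, advertiser_id, day):
        for key in self._keys(user_id, ad_id, advertiser_id, day):
            self._counts[key] = self._counts.get(key, 0) + 1

store = FrequencyCapStore()
for _ in range(3):
    store.record_impression("u1", "ad42", "acme", "2024-01-01")
print(store.is_eligible("u1", "ad42", "acme", "2024-01-01"))  # False: per-ad cap reached
print(store.is_eligible("u1", "ad43", "acme", "2024-01-01"))  # True: advertiser cap not reached
```

Checking eligibility at auction time and recording at blending (or render) time is one way to honor the "checked at both" requirement without double-counting.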

Budget Pacing

Advertisers set daily or campaign budgets. The system paces spend to avoid exhausting budgets early:

$$ m_{\text{pace}}(t) = \frac{B_{\text{remaining}}}{I_{\text{expected}}(t)} $$

where $m_{\text{pace}}$ is the pacing multiplier, $B_{\text{remaining}}$ is remaining budget, and $I_{\text{expected}}(t)$ is expected remaining impressions at time $t$.

Bids are scaled by the pacing multiplier, throttling delivery when ahead of pace and accelerating when behind. Pacing algorithms balance budget utilization against opportunity cost of missing high-value impressions.
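
A small numeric sketch of pacing follows. The ratio in the formula has units of currency per impression, so this sketch additionally divides by an expected cost per impression to obtain a dimensionless bid multiplier and clamps the result; that normalization, the clamp bounds, and all names are assumptions for illustration rather than part of the formula above.

```python
def target_spend_per_impression(budget_remaining, expected_impressions):
    """The ratio from the pacing formula: remaining budget spread over
    the impressions still expected before the budget period ends."""
    if expected_impressions <= 0:
        return 0.0
    return budget_remaining / expected_impressions

def pacing_multiplier(budget_remaining, expected_impressions,
                      expected_cost_per_impression, floor=0.0, ceiling=2.0):
    """Dimensionless bid multiplier: below 1 throttles delivery (ahead of
    pace), above 1 accelerates it (behind pace)."""
    if expected_cost_per_impression <= 0:
        return ceiling
    raw = (
        target_spend_per_impression(budget_remaining, expected_impressions)
        / expected_cost_per_impression
    )
    return max(floor, min(ceiling, raw))

# Ahead of pace: only $50 left for 10,000 impressions that cost about $0.01 each.
print(pacing_multiplier(50.0, 10_000, 0.01))
# Behind pace: $300 left for the same traffic, so the multiplier hits the ceiling.
print(pacing_multiplier(300.0, 10_000, 0.01))
```

The ceiling reflects the opportunity-cost trade-off: without it, a campaign far behind pace would overbid on whatever traffic happens to arrive next.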

User Experience Guardrails

Beyond ad-specific rules, guardrails protect overall feed quality:

  • Engagement floor: If predicted engagement for a blended feed falls below threshold, reduce ad load
  • Diversity requirements: Ads cannot displace diversity-critical organic content (e.g., posts from close friends)
  • Scroll depth triggers: Adjust ad frequency based on scroll velocity (fast scrollers see fewer ads)
  • Session quality signals: If user shows disengagement signals (quick bounces, hides), reduce ad load for remainder of session

flowchart TB
    subgraph Organic Pipeline
        Retrieval --> Ranking --> Reranking
    end

    subgraph Ads Pipeline
        AdRetrieval[Ad Retrieval] --> AdRanking[Ad Ranking] --> Auction
    end

    Reranking --> Collator
    Auction --> Collator
    Notifications --> Collator
    Collator --> Constraints[Constraint Checker]
    Constraints --> Feed[Final Feed]

Revenue vs. Experience Trade-offs

The feed collator balances competing objectives:

$$ \text{CollatorScore} = \alpha \cdot \text{OrganicValue} + \beta \cdot \text{AdRevenue} - \gamma \cdot \text{ExperiencePenalty} $$

where OrganicValue captures long-term engagement (retention proxy), AdRevenue is immediate monetization, and ExperiencePenalty penalizes aggressive ad loads. Weights ($\alpha$, $\beta$, $\gamma$) are tuned via online experiments measuring long-term user outcomes.
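
As a worked example of the scoring formula, the sketch below evaluates three hypothetical feed layouts and picks the one with the highest collator score. The weights and the per-layout values are invented for illustration; in practice they would come from models and experiments.

```python
def collator_score(organic_value, ad_revenue, experience_penalty,
                   alpha=1.0, beta=0.5, gamma=2.0):
    """alpha * OrganicValue + beta * AdRevenue - gamma * ExperiencePenalty."""
    return alpha * organic_value + beta * ad_revenue - gamma * experience_penalty

# Hypothetical candidate layouts for the same request.
candidate_feeds = {
    "no_ads":    {"organic_value": 10.0, "ad_revenue": 0.0, "experience_penalty": 0.0},
    "light_ads": {"organic_value": 9.5,  "ad_revenue": 2.0, "experience_penalty": 0.2},
    "heavy_ads": {"organic_value": 8.0,  "ad_revenue": 5.0, "experience_penalty": 1.5},
}

scores = {name: collator_score(**parts) for name, parts in candidate_feeds.items()}
best = max(scores, key=scores.get)
print(best)  # light_ads
```

Here the heavy layout earns the most revenue but loses once the experience penalty is applied, which is the behavior the $\gamma$ term is meant to produce.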


Implementation Example: Two-Tower Retrieval

The following Python code sketches a two-tower model for candidate retrieval using TensorFlow/Keras conventions.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TwoTowerModel(keras.Model):
    def __init__(self, user_feature_dim, item_feature_dim, embedding_dim=128):
        super().__init__()

        # User tower
        self.user_encoder = keras.Sequential([
            layers.Dense(256, activation='relu'),
            layers.Dense(128, activation='relu'),
            layers.Dense(embedding_dim),
            layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))
        ])

        # Item tower
        self.item_encoder = keras.Sequential([
            layers.Dense(256, activation='relu'),
            layers.Dense(128, activation='relu'),
            layers.Dense(embedding_dim),
            layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))
        ])

        self.temperature = tf.Variable(0.05, trainable=True)

    def call(self, inputs):
        user_features, item_features = inputs
        user_embedding = self.user_encoder(user_features)
        item_embedding = self.item_encoder(item_features)
        return user_embedding, item_embedding

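    def compute_loss(self, user_emb, item_emb):
        # In-batch negatives: for each user, the items paired with the other
        # users in the batch act as negatives.
        # Note: recent Keras versions define Model.compute_loss with a different
        # signature; this override is fine for the custom loop below but would
        # conflict with model.fit().
        scores = tf.matmul(user_emb, item_emb, transpose_b=True) / self.temperature
        batch_size = tf.shape(scores)[0]
        labels = tf.range(batch_size)  # Diagonal = positive pairs
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=scores)
        return tf.reduce_mean(loss)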

# Training loop sketch
model = TwoTowerModel(user_feature_dim=64, item_feature_dim=128)
optimizer = keras.optimizers.Adam(learning_rate=1e-3)

for batch in training_dataset:
    user_features, item_features = batch
    with tf.GradientTape() as tape:
        user_emb, item_emb = model([user_features, item_features])
        loss = model.compute_loss(user_emb, item_emb)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

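# After training: build ANN index over item embeddings.
# `all_item_features` and `build_hnsw_index` are placeholders for the full item
# feature matrix and a thin wrapper around an ANN library (e.g., FAISS, ScaNN).
item_embeddings = model.item_encoder(all_item_features)
ann_index = build_hnsw_index(item_embeddings)

# Serving: embed user, query ANN index
def retrieve_candidates(user_features, k=1000):
    user_emb = model.user_encoder(user_features)
    # Assumes the wrapper returns (ids, scores). Raw FAISS `index.search`
    # returns (distances, indices), so check the order for your library.
    candidates, scores = ann_index.search(user_emb, k)
    return candidates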

Implementation Example: Ranking Model

A simplified ranking model with multi-task outputs:

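import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class RankingModel(keras.Model):
    # A tuple default avoids sharing one mutable list across instances.
    def __init__(self, feature_dim, hidden_dims=(512, 256, 128)):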
        super().__init__()

        # Shared layers
        self.shared = keras.Sequential([
            layers.Dense(dim, activation='relu')
            for dim in hidden_dims
        ])

        # Task-specific heads
        self.click_head = layers.Dense(1, activation='sigmoid', name='p_click')
        self.like_head = layers.Dense(1, activation='sigmoid', name='p_like')
        self.share_head = layers.Dense(1, activation='sigmoid', name='p_share')
        self.hide_head = layers.Dense(1, activation='sigmoid', name='p_hide')
        self.watch_head = layers.Dense(1, activation='linear', name='watch_time')

    def call(self, features):
        shared_repr = self.shared(features)
        return {
            'p_click': self.click_head(shared_repr),
            'p_like': self.like_head(shared_repr),
            'p_share': self.share_head(shared_repr),
            'p_hide': self.hide_head(shared_repr),
            'watch_time': self.watch_head(shared_repr)
        }

    def compute_final_score(self, predictions, weights):
        """Combine task predictions into final ranking score."""
        # The watch-time head is linear and can go negative; clamp at zero so
        # log1p never receives values <= -1 (which would yield NaN or -inf).
        watch_time = tf.nn.relu(predictions['watch_time'])
        score = (
            weights['click'] * predictions['p_click'] +
            weights['like'] * predictions['p_like'] +
            weights['share'] * predictions['p_share'] -
            weights['hide'] * predictions['p_hide'] +
            weights['watch'] * tf.math.log1p(watch_time)
        )
        return score

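# Scoring at serving time (`ranking_model` and `candidate_features` are assumed to exist)
weights = {'click': 1.0, 'like': 2.0, 'share': 3.0, 'hide': 5.0, 'watch': 0.1}
predictions = ranking_model(candidate_features)
scores = ranking_model.compute_final_score(predictions, weights)
# Each head outputs shape (num_candidates, 1); drop the last axis before sorting,
# otherwise argsort runs over an axis of length 1 and returns all zeros.
ranked_candidates = tf.argsort(tf.squeeze(scores, axis=-1), direction='DESCENDING')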

Monitoring and Observability

Recommendation systems fail silently. Unlike a crashed service that triggers immediate alerts, a degraded model continues serving predictions—just bad ones. Users see irrelevant content, engagement drops gradually, and by the time anyone notices, millions of poor experiences have already shipped.

Why recommendation monitoring is uniquely hard:

  • No ground truth at serving time: You predict P(click) = 0.3, but you won’t know if you were right until the user acts (or doesn’t). Feedback arrives minutes to hours later.
  • Counterfactual blindness: You only observe outcomes for items you showed. If the model stops recommending a category, you’ll never see engagement on those items—so you can’t tell if the model is wrong.
  • Slow degradation: A feature pipeline breaking doesn’t crash the system; it just makes predictions worse. A model drifting out of calibration costs engagement but doesn’t page anyone.
  • Interaction effects: Problems emerge from combinations. A new item embedding + a changed feature distribution + a shifted user population might each look fine in isolation.

What to monitor:

| Layer | What Can Go Wrong | Observable Signal |
|---|---|---|
| Infrastructure | Latency spikes, partial outages, memory pressure | P99 latency, error rates, resource utilization |
| Feature pipeline | Stale features, missing values, distribution shift | Feature freshness, null rates, PSI drift scores |
| Model | Calibration drift, prediction collapse, segment degradation | Predicted vs actual, score distributions, per-cohort metrics |
| Business | User dissatisfaction, creator churn, revenue drop | Engagement rates, retention curves, revenue per session |

The goal is catching problems at the earliest layer—a feature going stale should alert before it causes calibration drift, which should alert before engagement drops.
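
For the feature-pipeline layer, the PSI drift score mentioned in the table can be computed from two histograms of the same feature. The sketch below assumes both histograms use identical bin edges; the `eps` floor for empty bins and the example counts are illustrative choices.

```python
import math

def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI = sum over bins of (a - e) * ln(a / e), where e and a are the
    baseline and current bin fractions. Empty bins are floored at `eps`."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("histograms must share the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    if expected_total <= 0 or actual_total <= 0:
        raise ValueError("histograms must be non-empty")
    psi = 0.0
    for e_count, a_count in zip(expected_counts, actual_counts):
        e = max(e_count / expected_total, eps)
        a = max(a_count / actual_total, eps)
        psi += (a - e) * math.log(a / e)
    return psi

baseline = [250, 250, 250, 250]  # training-time histogram of one feature
current = [100, 200, 300, 400]   # histogram from recent serving traffic
print(round(population_stability_index(baseline, current), 3))
```

A commonly quoted rule of thumb treats values below 0.1 as stable and values above 0.25 as a significant shift, but thresholds are best calibrated per feature against historical variation.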

Who monitors what:

| Role | Primary Focus | Typical Response |
|---|---|---|
| ML Platform on-call | Infrastructure health, serving latency, feature pipeline freshness | Rollback, failover, pipeline restart |
| Ranking team | Model calibration, prediction distributions, A/B test guardrails | Model rollback, emergency retraining, feature investigation |
| Product/Growth | Business metrics, funnel conversion, segment-level engagement | Escalate to ranking, pause experiments, adjust objectives |
| Trust & Safety | Policy violations, integrity signals, harmful content amplification | Content takedown, model constraints, ranking suppression |
| Data Science | Long-term trends, cohort analysis, causal attribution | Root cause analysis, strategic recommendations |

In practice, these boundaries blur. A CTR drop might start with Growth noticing a metric, escalate to Ranking for model investigation, trace back to Platform finding a stale feature, and ultimately require Data Science to understand the causal chain.

Model Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| Prediction distribution | Histogram of model scores | Shift from baseline |
| Calibration | Predicted vs actual engagement rate | > 5% deviation |
| Feature coverage | Fraction of requests with all features present | < 99% |
| Inference latency | P50/P95/P99 model inference time | > SLA |
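
The calibration row can be checked once delayed feedback has been joined to logged predictions. This sketch compares the mean predicted probability with the observed rate and interprets the 5% threshold as a relative deviation, which is an assumption; the function names are hypothetical.

```python
def calibration_ratio(predicted_probs, outcomes):
    """Mean predicted probability divided by the observed positive rate.
    1.0 is perfectly calibrated on average; above 1.0 means over-prediction."""
    if not predicted_probs or len(predicted_probs) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    observed_rate = sum(outcomes) / len(outcomes)
    if observed_rate == 0:
        return float("inf")
    mean_predicted = sum(predicted_probs) / len(predicted_probs)
    return mean_predicted / observed_rate

def calibration_alert(predicted_probs, outcomes, tolerance=0.05):
    """True when the ratio deviates from 1.0 by more than `tolerance`."""
    return abs(calibration_ratio(predicted_probs, outcomes) - 1.0) > tolerance

clicks = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]        # observed click rate: 0.3
print(calibration_alert([0.3] * 10, clicks))   # False: predictions match the observed rate
print(calibration_alert([0.4] * 10, clicks))   # True: over-predicting by about a third
```

An aggregate ratio can hide offsetting errors, so the same check is usually repeated per score bucket and per cohort.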

Business Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| CTR | Click-through rate | > 10% drop |
| Engagement rate | Likes + comments + shares per impression | > 10% drop |
| Session duration | Average time spent per session | > 10% drop |
| Diversity | Unique creators/topics per user per day | < baseline |
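
The drop thresholds above translate into a simple relative comparison against a baseline. The sketch below is one way to express it; the metric names, baseline values, and the choice of a single shared threshold are illustrative assumptions.

```python
def relative_drop(baseline, current):
    """Fractional drop versus baseline (negative values mean an increase)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline

def breached_metrics(current, baselines, max_drop=0.10):
    """Names of metrics that fell by more than `max_drop` relative to baseline."""
    return sorted(
        name for name, value in current.items()
        if relative_drop(baselines[name], value) > max_drop
    )

baselines = {"ctr": 0.050, "engagement_rate": 0.020, "session_duration_s": 300.0}
today = {"ctr": 0.044, "engagement_rate": 0.019, "session_duration_s": 310.0}
print(breached_metrics(today, baselines))  # ['ctr']
```

In practice the baseline is usually a same-hour, same-weekday average so that normal daily and weekly cycles do not trigger alerts.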

Dashboards and Alerts

flowchart LR
    Services --> Metrics[Metrics Collector]
    Metrics --> TSDB[(Time-series DB)]
    TSDB --> Grafana[Dashboards]
    TSDB --> Alerts[Alert Manager]
    Alerts --> Oncall[On-call Engineer]