Part 5 of 6 | ← Part 4: Ethics & Safety | Part 6: Advanced Topics →
Case Study: Feed Ranking Architecture
The following synthesizes public disclosures from major platforms (Meta, TikTok, YouTube, Twitter/X) into a representative architecture.
Request Flow
sequenceDiagram
participant Client
participant Gateway
participant Retrieval
participant Ranking
participant Reranking
participant Collator
Client->>Gateway: Feed request
Gateway->>Retrieval: Get candidates (user_id, context)
Retrieval-->>Gateway: 5000 candidates
Gateway->>Ranking: Score candidates
Ranking-->>Gateway: 500 scored items
Gateway->>Reranking: Apply diversity, policy
Reranking-->>Gateway: 100 items
Gateway->>Collator: Mix organic + ads + notifications
Collator-->>Gateway: 50 items
Gateway-->>Client: Feed response (paginated)
Component Details
| Component | Implementation |
|---|---|
| Retrieval | Two-tower model (user/item embeddings) + graph-based (friends’ posts) + trending |
| Ranking | Multi-task DCN with 100+ features; outputs P(click), P(like), P(share), P(hide), E(watch_time) |
| Re-ranking | MMR for diversity; policy filters; creator frequency caps |
| Feed Collator | Interleaves organic posts, ads (from separate auction), and system notifications |
Latency Budget
| Stage | Target Latency (P99) |
|---|---|
| Feature fetch | 10 ms |
| Retrieval | 30 ms |
| Ranking (500 items) | 50 ms |
| Re-ranking | 10 ms |
| Total | < 150 ms |
Caching, precomputation, and model optimization keep end-to-end latency within budget.
Ads Integration and Monetization Lanes
The feed collator orchestrates multiple content streams—organic recommendations, sponsored content, and system notifications—while respecting user experience constraints and revenue targets.
Ad Auction Hand-off
Ads flow through a separate auction system before reaching the feed collator:
- Ad retrieval: Candidate ads are retrieved based on user targeting criteria (demographics, interests, lookalike audiences).
- Ad ranking: A separate model predicts P(click) and P(conversion); bids are adjusted by predicted value (eCPM = bid × P(click) × P(conversion)).
- Auction: Generalized second-price or VCG auction determines winning ads and prices.
- Eligibility filtering: Policy checks (frequency caps, brand safety, user opt-outs) filter ineligible ads.
- Hand-off to feed collator: Winning ads with their auction prices are passed to the feed feed collator.
Blending Rules and Constraints
The feed collator applies a rule-based policy to interleave content types:
| Constraint | Rule | Rationale |
|---|---|---|
| Ad load cap | Max 1 ad per 5 organic posts | Preserve user experience |
| Ad spacing | Minimum 3 organic posts between ads | Avoid ad fatigue |
| First-position protection | No ads in positions 1–2 | Ensure organic engagement on session start |
| Notification slots | Reserved positions (e.g., position 4) for system notifications | Guarantee delivery of friend requests, milestones |
| Session pacing | Reduce ad frequency after N ads shown | Diminishing returns on ad attention |
These rules are typically expressed as hard constraints in an integer program or enforced via greedy slot allocation.
Exposure Caps and Frequency Management
Multiple cap types prevent over-exposure:
- Per-ad caps: Maximum impressions per user per day (e.g., 3)
- Per-advertiser caps: Maximum impressions across all ads from one advertiser
- Category caps: Limit ads from sensitive categories (e.g., max 1 political ad per session)
- Global ad load: Platform-wide target (e.g., 15% of impressions are ads)
Cap state is maintained in low-latency stores (Redis, Memcached) and checked at both auction and blending time.
Budget Pacing
Advertisers set daily or campaign budgets. The system paces spend to avoid exhausting budgets early:
$$ m\_{\text{pace}}(t) = \frac{B\_{\text{remaining}}}{I\_{\text{expected}}(t)} $$where $m\_{\text{pace}}$ is the pacing multiplier, $B\_{\text{remaining}}$ is remaining budget, and $I\_{\text{expected}}(t)$ is expected remaining impressions at time $t$.
Bids are scaled by the pacing multiplier, throttling delivery when ahead of pace and accelerating when behind. Pacing algorithms balance budget utilization against opportunity cost of missing high-value impressions.
User Experience Guardrails
Beyond ad-specific rules, guardrails protect overall feed quality:
- Engagement floor: If predicted engagement for a blended feed falls below threshold, reduce ad load
- Diversity requirements: Ads cannot displace diversity-critical organic content (e.g., posts from close friends)
- Scroll depth triggers: Adjust ad frequency based on scroll velocity (fast scrollers see fewer ads)
- Session quality signals: If user shows disengagement signals (quick bounces, hides), reduce ad load for remainder of session
flowchart TB
subgraph Organic Pipeline
Retrieval --> Ranking --> Reranking
end
subgraph Ads Pipeline
AdRetrieval[Ad Retrieval] --> AdRanking[Ad Ranking] --> Auction
end
Reranking --> Collator
Auction --> Collator
Notifications --> Collator
Collator --> Constraints[Constraint Checker]
Constraints --> Feed[Final Feed]
Revenue vs. Experience Trade-offs
The feed collator balances competing objectives:
$$ \text{CollatorScore} = \alpha \cdot \text{OrganicValue} + \beta \cdot \text{AdRevenue} - \gamma \cdot \text{ExperiencePenalty} $$where OrganicValue captures long-term engagement (retention proxy), AdRevenue is immediate monetization, and ExperiencePenalty penalizes aggressive ad loads. Weights ($\alpha$, $\beta$, $\gamma$) are tuned via online experiments measuring long-term user outcomes.
Implementation Example: Two-Tower Retrieval
The following Python code sketches a two-tower model for candidate retrieval using TensorFlow/Keras conventions.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
class TwoTowerModel(keras.Model):
def __init__(self, user_feature_dim, item_feature_dim, embedding_dim=128):
super().__init__()
# User tower
self.user_encoder = keras.Sequential([
layers.Dense(256, activation='relu'),
layers.Dense(128, activation='relu'),
layers.Dense(embedding_dim),
layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))
])
# Item tower
self.item_encoder = keras.Sequential([
layers.Dense(256, activation='relu'),
layers.Dense(128, activation='relu'),
layers.Dense(embedding_dim),
layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))
])
self.temperature = tf.Variable(0.05, trainable=True)
def call(self, inputs):
user_features, item_features = inputs
user_embedding = self.user_encoder(user_features)
item_embedding = self.item_encoder(item_features)
return user_embedding, item_embedding
def compute_loss(self, user_emb, item_emb):
# In-batch negatives: each item in batch serves as negative for other users
scores = tf.matmul(user_emb, item_emb, transpose_b=True) / self.temperature
batch_size = tf.shape(scores)[0]
labels = tf.range(batch_size) # Diagonal = positive pairs
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels, scores)
return tf.reduce_mean(loss)
# Training loop sketch
model = TwoTowerModel(user_feature_dim=64, item_feature_dim=128)
optimizer = keras.optimizers.Adam(learning_rate=1e-3)
for batch in training_dataset:
user_features, item_features = batch
with tf.GradientTape() as tape:
user_emb, item_emb = model([user_features, item_features])
loss = model.compute_loss(user_emb, item_emb)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
# After training: build ANN index over item embeddings
item_embeddings = model.item_encoder(all_item_features)
ann_index = build_hnsw_index(item_embeddings) # e.g., FAISS, ScaNN
# Serving: embed user, query ANN index
def retrieve_candidates(user_features, k=1000):
user_emb = model.user_encoder(user_features)
candidates, scores = ann_index.search(user_emb, k)
return candidates
Implementation Example: Ranking Model
A simplified ranking model with multi-task outputs:
class RankingModel(keras.Model):
def __init__(self, feature_dim, hidden_dims=[512, 256, 128]):
super().__init__()
# Shared layers
self.shared = keras.Sequential([
layers.Dense(dim, activation='relu')
for dim in hidden_dims
])
# Task-specific heads
self.click_head = layers.Dense(1, activation='sigmoid', name='p_click')
self.like_head = layers.Dense(1, activation='sigmoid', name='p_like')
self.share_head = layers.Dense(1, activation='sigmoid', name='p_share')
self.hide_head = layers.Dense(1, activation='sigmoid', name='p_hide')
self.watch_head = layers.Dense(1, activation='linear', name='watch_time')
def call(self, features):
shared_repr = self.shared(features)
return {
'p_click': self.click_head(shared_repr),
'p_like': self.like_head(shared_repr),
'p_share': self.share_head(shared_repr),
'p_hide': self.hide_head(shared_repr),
'watch_time': self.watch_head(shared_repr)
}
def compute_final_score(self, predictions, weights):
"""Combine task predictions into final ranking score."""
score = (
weights['click'] * predictions['p_click'] +
weights['like'] * predictions['p_like'] +
weights['share'] * predictions['p_share'] -
weights['hide'] * predictions['p_hide'] +
weights['watch'] * tf.math.log1p(predictions['watch_time'])
)
return score
# Scoring at serving time
weights = {'click': 1.0, 'like': 2.0, 'share': 3.0, 'hide': 5.0, 'watch': 0.1}
predictions = ranking_model(candidate_features)
scores = ranking_model.compute_final_score(predictions, weights)
ranked_candidates = tf.argsort(scores, direction='DESCENDING')
Monitoring and Observability
Recommendation systems fail silently. Unlike a crashed service that triggers immediate alerts, a degraded model continues serving predictions—just bad ones. Users see irrelevant content, engagement drops gradually, and by the time anyone notices, millions of poor experiences have already shipped.
Why recommendation monitoring is uniquely hard:
- No ground truth at serving time: You predict P(click) = 0.3, but you won’t know if you were right until the user acts (or doesn’t). Feedback arrives minutes to hours later.
- Counterfactual blindness: You only observe outcomes for items you showed. If the model stops recommending a category, you’ll never see engagement on those items—so you can’t tell if the model is wrong.
- Slow degradation: A feature pipeline breaking doesn’t crash the system; it just makes predictions worse. A model drifting out of calibration costs engagement but doesn’t page anyone.
- Interaction effects: Problems emerge from combinations. A new item embedding + a changed feature distribution + a shifted user population might each look fine in isolation.
What to monitor:
| Layer | What Can Go Wrong | Observable Signal |
|---|---|---|
| Infrastructure | Latency spikes, partial outages, memory pressure | P99 latency, error rates, resource utilization |
| Feature pipeline | Stale features, missing values, distribution shift | Feature freshness, null rates, PSI drift scores |
| Model | Calibration drift, prediction collapse, segment degradation | Predicted vs actual, score distributions, per-cohort metrics |
| Business | User dissatisfaction, creator churn, revenue drop | Engagement rates, retention curves, revenue per session |
The goal is catching problems at the earliest layer—a feature going stale should alert before it causes calibration drift, which should alert before engagement drops.
Who monitors what:
| Role | Primary Focus | Typical Response |
|---|---|---|
| ML Platform on-call | Infrastructure health, serving latency, feature pipeline freshness | Rollback, failover, pipeline restart |
| Ranking team | Model calibration, prediction distributions, A/B test guardrails | Model rollback, emergency retraining, feature investigation |
| Product/Growth | Business metrics, funnel conversion, segment-level engagement | Escalate to ranking, pause experiments, adjust objectives |
| Trust & Safety | Policy violations, integrity signals, harmful content amplification | Content takedown, model constraints, ranking suppression |
| Data Science | Long-term trends, cohort analysis, causal attribution | Root cause analysis, strategic recommendations |
In practice, these boundaries blur. A CTR drop might start with Growth noticing a metric, escalate to Ranking for model investigation, trace back to Platform finding a stale feature, and ultimately require Data Science to understand the causal chain.
Model Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Prediction distribution | Histogram of model scores | Shift from baseline |
| Calibration | Predicted vs actual engagement rate | > 5% deviation |
| Feature coverage | Fraction of requests with all features present | < 99% |
| Inference latency | P50/P95/P99 model inference time | > SLA |
Business Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| CTR | Click-through rate | > 10% drop |
| Engagement rate | Likes + comments + shares per impression | > 10% drop |
| Session duration | Average time spent per session | > 10% drop |
| Diversity | Unique creators/topics per user per day | < baseline |
Dashboards and Alerts
flowchart LR
Services --> Metrics[Metrics Collector]
Metrics --> TSDB[(Time-series DB)]
TSDB --> Grafana[Dashboards]
TSDB --> Alerts[Alert Manager]
Alerts --> Oncall[On-call Engineer]