Part 1 of 6 | Part 2: Ranking & Re-ranking →
Introduction
Social media platforms generate value by connecting users with content they find engaging. The recommendation system—the algorithmic machinery that selects which posts, videos, or accounts to surface—is the engine of this engagement. A well-designed recommender balances multiple objectives: user satisfaction, content diversity, creator economics, and platform health. This article examines the architecture of production-grade recommendation systems, drawing on published designs from major platforms while maintaining a principled, systems-oriented perspective.
The core problem is deceptively simple: given a user $u$ and a corpus of candidate items $\mathcal{I}$, produce a ranked list $\hat{\mathcal{I}} \subset \mathcal{I}$ that maximizes some utility function $U(u, \hat{\mathcal{I}})$. In practice, the item corpus may exceed $10^9$ items, latency budgets are measured in tens of milliseconds, and the utility function encodes competing business objectives that shift over time.
Formal Problem Definition
Let $\mathcal{U}$ denote the user space and $\mathcal{I}$ the item space. A recommendation system learns a scoring function $f: \mathcal{U} \times \mathcal{I} \rightarrow \mathbb{R}$ such that for user $u$ and items $i, j \in \mathcal{I}$:
$$ f(u, i) > f(u, j) \iff u \text{ prefers } i \text{ over } j $$The system produces a ranking $\pi_u: [k] \rightarrow \mathcal{I}$ that orders the top-$k$ items by predicted preference. The objective is to maximize expected utility:
$$ \max_f \mathbb{E}_{u \sim P(\mathcal{U})} \left[ U(u, \pi_u^f) \right] $$where $\pi_u^f$ is the ranking induced by $f$ and $U$ is a utility function encoding engagement, satisfaction, or business value.
Utility Decomposition
In practice, utility decomposes into multiple objectives. Let $y^{(t)}\_{u,i} \in \{0,1\}$ indicate whether user $u$ performed action $t$ (click, like, share, etc.) on item $i$. The platform optimizes a weighted combination:
$$ U(u, \pi) = \sum\_{t \in \mathcal{T}} w\_t \sum\_{k=1}^{K} \gamma\_k \cdot \mathbb{E}[y^{(t)}\_{u, \pi(k)}] $$where $w\_t$ are task weights (encoding relative value of clicks vs. shares), $\gamma\_k$ is a position discount (users are less likely to see lower-ranked items), and the expectation is over user behavior given the recommendation.
The position discount often follows a Discounted Cumulative Gain (DCG) formulation:
$$ \gamma\_k = \frac{1}{\log\_2(k + 1)} $$or an exponential decay $\gamma\_k = \alpha^{k-1}$ for some $\alpha \in (0,1)$.
flowchart LR
User[User Request] --> CG[Candidate Generation]
CG --> |1000s of candidates| Ranking[Ranking Model]
Ranking --> |100s scored| Rerank[Re-ranking & Blending]
Rerank --> |10s displayed| Feed[User Feed]
subgraph Offline
Logs[(Interaction Logs)] --> Training[Model Training]
Training --> Models[(Model Registry)]
end
Models --> CG
Models --> Ranking
System Architecture Overview
Production recommendation systems decompose into distinct stages, each trading off recall against precision and latency:
| Stage | Input Size | Output Size | Latency Budget | Primary Goal |
|---|---|---|---|---|
| Candidate Generation | $10^9$ items | $10^3$–$10^4$ | 10–50 ms | High recall; retrieve plausible items |
| Ranking | $10^3$–$10^4$ | $10^2$–$10^3$ | 20–100 ms | Precise scoring; predict engagement |
| Re-ranking | $10^2$–$10^3$ | $10^1$–$10^2$ | 5–20 ms | Diversity, business rules, policy |
| Blending | Multiple sources | Final feed | 5–10 ms | Mix organic, ads, notifications |
Latency budget caveats: These targets assume steady-state operation. In practice, ANN index maintenance creates periodic latency pressure:
| Maintenance Operation | Impact | Mitigation |
|---|---|---|
| Incremental insertion | +5–15% P99 latency during bursts | Rate-limit insertions; batch during off-peak |
| Index shard rebuild | Temporary capacity loss (one shard offline) | Blue-green deployment; over-provision replicas |
| Codebook retraining | CPU contention on index servers | Schedule during low-traffic windows |
| Embedding refresh propagation | Stale results until propagation completes | Staleness budgets; freshness monitoring |
Under-provisioning that ignores maintenance overhead leads to SLA violations during index updates. Capacity planning should account for 15–20% headroom beyond steady-state requirements.
This funnel architecture enables each stage to apply increasingly expensive computation to a shrinking candidate set. The design mirrors information retrieval pipelines: cheap filters narrow the search space before expensive models score the survivors.
Candidate Generation
Candidate generation retrieves a manageable subset of items that might interest the user. Common approaches include:
- Collaborative filtering: Find users similar to $u$ (via matrix factorization, nearest-neighbor search) and recommend items they engaged with. Item-based variants identify items similar to those $u$ has liked.
- Content-based retrieval: Embed items and user preferences into a shared vector space; retrieve items whose embeddings are close to the user’s interest vector.
- Graph-based retrieval: Traverse the social graph (friends, followers, group memberships) to surface content from connected nodes.
- Popularity and recency: Baseline retrievers that surface trending or fresh content regardless of personalization.
Most systems blend multiple retrievers, each producing a candidate stream that is merged downstream.
flowchart TB
subgraph Retrievers
CF[Collaborative Filtering]
CB[Content-Based]
Graph[Graph Traversal]
Trending[Trending / Recency]
end
CF --> Merge[Candidate Merge]
CB --> Merge
Graph --> Merge
Trending --> Merge
Merge --> Dedup[Deduplication]
Dedup --> RankingStage[To Ranking]
Embedding-Based Retrieval
Modern systems encode users and items as dense vectors (embeddings) and cast candidate generation as approximate nearest neighbor (ANN) search. Given user embedding $\mathbf{u} \in \mathbb{R}^d$ and item embeddings $\{\mathbf{v}_i\}$, retrieval finds:
$$ \text{top-}k = \arg\max_{i \in \mathcal{I}} \, \text{sim}(\mathbf{u}, \mathbf{v}_i) $$where $\text{sim}$ is typically cosine similarity or inner product. ANN indices (HNSW, IVF, ScaNN) achieve sub-linear query time, enabling retrieval over billion-scale corpora in milliseconds.
Two-tower models are a prevalent architecture: one neural network encodes the user (from features like history, demographics, context), another encodes the item (from content, metadata, engagement statistics). The towers are trained jointly to maximize similarity for positive (user, item) pairs.
flowchart LR
subgraph User Tower
UF[User Features] --> UE[User Encoder]
UE --> UV[User Vector]
end
subgraph Item Tower
IF[Item Features] --> IE[Item Encoder]
IE --> IV[Item Vector]
end
UV --> Sim[Similarity]
IV --> Sim
Sim --> Score[Retrieval Score]
Matrix Factorization Foundations
The theoretical underpinning of embedding-based retrieval is matrix factorization. Let $\mathbf{R} \in \mathbb{R}^{m \times n}$ be the (sparse) user-item interaction matrix where $R_{ui} = 1$ if user $u$ engaged with item $i$, with $m = \lvert\mathcal{U}\rvert$ users and $n = \lvert\mathcal{I}\rvert$ items. Matrix factorization seeks low-rank factors $\mathbf{P} \in \mathbb{R}^{m \times d}$ and $\mathbf{Q} \in \mathbb{R}^{n \times d}$ such that:
$$ \mathbf{R} \approx \mathbf{P} \mathbf{Q}^\top $$The row $\mathbf{p}\_u$ is the user embedding; $\mathbf{q}\_i$ is the item embedding. The predicted affinity is $\hat{r}\_{ui} = \mathbf{p}\_u^\top \mathbf{q}\_i$.
Weighted Alternating Least Squares (WALS)
For implicit feedback (binary interactions), the Weighted Regularized Matrix Factorization objective is:
$$ \min\_{\mathbf{P}, \mathbf{Q}} \sum\_{(u,i) \in \Omega} c\_{ui} (r\_{ui} - \mathbf{p}\_u^\top \mathbf{q}\_i)^2 + \lambda \left( \|\mathbf{P}\|\_F^2 + \|\mathbf{Q}\|\_F^2 \right) $$where $\Omega$ is the set of observed interactions, $c\_{ui}$ is a confidence weight (higher for explicit engagement, lower for implicit exposure), and $\lambda$ is the regularization coefficient. The alternating least squares (ALS) algorithm fixes $\mathbf{Q}$ and solves for $\mathbf{P}$ in closed form:
$$ \mathbf{p}\_u = \left( \mathbf{Q}^\top \mathbf{C}\_u \mathbf{Q} + \lambda \mathbf{I} \right)^{-1} \mathbf{Q}^\top \mathbf{C}\_u \mathbf{r}\_u $$where $\mathbf{C}\_u = \text{diag}(c\_{u1}, \ldots, c\_{un})$ is the diagonal confidence matrix for user $u$. The procedure alternates between updating $\mathbf{P}$ and $\mathbf{Q}$ until convergence.
Two-Tower Training Objectives
Neural two-tower models extend matrix factorization with non-linear encoders. Let $\phi\_\theta: \mathcal{X}\_u \rightarrow \mathbb{R}^d$ be the user encoder and $\psi\_\phi: \mathcal{X}\_i \rightarrow \mathbb{R}^d$ be the item encoder, parameterized by neural networks. The similarity score is:
$$ s(u, i) = \frac{\phi_\theta(x_u)^\top \psi_\phi(x_i)}{\tau} $$where $\tau > 0$ is a temperature hyperparameter controlling the sharpness of the distribution.
Contrastive Loss (InfoNCE)
Given a batch of $N$ positive pairs $\{(u\_j, i\_j^+)\}\_{j=1}^N$, the InfoNCE loss treats other items in the batch as negatives:
$$ \mathcal{L}\_{\text{InfoNCE}} = -\frac{1}{N} \sum\_{j=1}^{N} \log \frac{\exp(s(u\_j, i\_j^+))}{\sum\_{k=1}^{N} \exp(s(u\_j, i\_k^+))} $$This is equivalent to an $(N-1)$-way classification problem where the model must identify the positive item among in-batch negatives. The gradient with respect to the item encoder pushes positive pairs closer and negative pairs apart in the embedding space.
Sampled Softmax
For large item vocabularies, full softmax is intractable. Sampled softmax approximates the partition function by sampling $K \ll n$ negative items (where $n$ is the total number of items):
$$ \mathcal{L}\_{\text{sampled}} = -\log \frac{\exp(s(u, i^+))}{\exp(s(u, i^+)) + \sum\_{k=1}^{K} \exp(s(u, i\_k^-))} $$where negatives $\{i\_k^-\}$ are drawn from a proposal distribution $Q(i)$ (often item popularity). Importance weighting corrects for sampling bias:
$$ \mathcal{L}\_{\text{corrected}} = -\log \frac{\exp(s(u, i^+))}{\exp(s(u, i^+)) + \sum\_{k=1}^{K} \frac{\exp(s(u, i\_k^-))}{K \cdot Q(i\_k^-)}} $$Negative Sampling Strategies
With billions of items, exhaustive negative sampling is impossible. Training requires strategic selection of negatives that are informative but computationally feasible.
Why negative sampling matters:
- Computational efficiency: Full softmax over billions of items is intractable
- Learning signal: Model quality depends on negative quality—too easy or too hard negatives degrade training
- Data imbalance: Positive examples are rare (clicks on 0.01% of impressions); naive sampling creates severe class imbalance
Negative sampling strategies:
| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Random | Sample uniformly from item corpus | Simple, unbiased | Too easy; model doesn’t learn |
| Popularity-biased | Sample proportional to item popularity $P(i) \propto \text{count}(i)^\alpha$ | Matches serving distribution | Still relatively easy |
| In-batch | Treat other positives in batch as negatives | Free (no extra sampling) | Limited diversity (batch size ~1000s) |
| Impression-based | Sample from items shown but not clicked | Reflects user decision boundary | Requires logging all impressions |
| Hard negatives | Mine items similar to positives but not engaged | Maximally informative | Expensive to compute; can be too hard |
Popularity-biased sampling:
Uniform sampling over-represents rare items. Popularity-biased sampling corrects this:
$$ P(i^-) \propto (\text{popularity}(i))^\alpha $$where $\alpha \in [0, 1]$ controls the bias strength. Common values: $\alpha = 0.75$.
In-batch negatives:
Efficient trick for contrastive learning. Given a batch of $(u_j, i_j^+)$ pairs, treat $i_k^+$ ($k \neq j$) as negatives for user $u_j$:
$$ \mathcal{L}\_{\text{in-batch}} = -\frac{1}{N} \sum\_{j=1}^{N} \log \frac{\exp(s(u\_j, i\_j^+))}{\sum\_{k=1}^{N} \exp(s(u\_j, i\_k^+))} $$Advantages:
- Zero sampling overhead
- Naturally diverse (batch contains varied positives)
- Gradient computation is parallelizable
Limitations:
- Negatives are actually other users’ positives (not true negatives)
- Batch size limits negative count (typically 256-2048)
Cached cross-batch negatives:
Maintain a memory bank of recent embeddings. For each positive $(u, i^+)$, sample negatives from the memory bank:
memory_bank = queue(capacity=100K) # FIFO queue
for batch in dataloader:
negatives = sample(memory_bank, k=100)
loss = contrastive_loss(batch, negatives)
memory_bank.enqueue(batch.embeddings)
Increases effective negative count from batch size (~1K) to memory bank size (~100K) with minimal overhead.
Hard Negative Mining
Random negatives are often too easy; the model achieves low loss without learning fine-grained distinctions. Hard negative mining selects negatives that are challenging:
1. In-batch hard negatives:
Within each batch, select the negative with highest score for each positive:
$$ i\_{\text{hard}}^- = \arg\max\_{k \neq j} s(u\_j, i\_k^+) $$Trade-off: Focusing only on hardest negatives can cause training instability (model gets stuck on outliers).
2. Semi-hard negatives:
Select negatives that are harder than the positive, but not the absolute hardest:
$$ i\_{\text{semi-hard}}^- = \arg\max\_{k : s(u, i\_k^-) > s(u, i^+) - m} s(u, i\_k^-) $$where $m$ is a margin. Semi-hard negatives are informative but not adversarial.
3. ANN-based mining:
Periodically mine hard negatives offline:
- Encode all items with current model
- For each positive $(u, i^+)$, query ANN index for nearest neighbors to $i^+$
- Filter out items user has engaged with
- Store hard negatives for training
Frequency: Every epoch or every few gradient steps. Mining too frequently wastes computation; too infrequently uses stale hard negatives.
4. Dynamic hard negative mining:
During training, maintain an ANN index of item embeddings. At each batch:
- Compute batch embeddings
- Query ANN for hard negatives on-the-fly
- Compute loss with mined negatives
Cost: Adds 5-10ms per batch (ANN query overhead). Only feasible with fast ANN implementations (HNSW, ScaNN).
5. Curriculum learning:
Gradually increase negative difficulty:
- Epoch 1-10: Random negatives
- Epoch 11-20: Popularity-biased negatives
- Epoch 21+: Hard negatives
Prevents model from getting stuck in bad local minima early in training.
Optimal negative count:
More negatives improve model quality but increase compute cost. Empirical findings:
| Negative Count | Relative Accuracy | Training Time |
|---|---|---|
| 10 | Baseline | 1x |
| 100 | +5% | 3x |
| 1000 | +8% | 10x |
| 10000 | +9% | 40x |
Diminishing returns beyond 1000 negatives. Production systems typically use 100-1000 negatives per positive.
Loss formulation with hard negatives:
$$ \mathcal{L}\_{\text{hard}} = -\log \frac{\exp(s(u, i^+))}{\exp(s(u, i^+)) + \sum\_{k \in \text{hard}(u)} \exp(s(u, i\_k^-))} $$Debiasing hard negatives:
Hard negatives are not sampled uniformly, which biases gradients. Correct via importance weighting:
$$ \mathcal{L}\_{\text{corrected}} = -\log \frac{\exp(s(u, i^+))}{\exp(s(u, i^+)) + \sum\_{k} \frac{\exp(s(u, i\_k^-))}{P\_{\text{sample}}(i\_k^-)}} $$where $P\_{\text{sample}}(i\_k^-)$ is the probability of sampling negative $i\_k^-$.
Approximate Nearest Neighbor Search
Exact $k$-NN search requires $O(n \cdot d)$ time per query—infeasible for billion-scale corpora. ANN algorithms trade exactness for speed.
Locality-Sensitive Hashing (LSH)
LSH hashes similar vectors to the same bucket with high probability. For cosine similarity, random hyperplane hashing uses:
$$ h_{\mathbf{r}}(\mathbf{v}) = \text{sign}(\mathbf{r}^\top \mathbf{v}) $$where $\mathbf{r}$ is a random unit vector. The probability of collision is:
$$ P[h_{\mathbf{r}}(\mathbf{u}) = h_{\mathbf{r}}(\mathbf{v})] = 1 - \frac{\theta(\mathbf{u}, \mathbf{v})}{\pi} $$where $\theta$ is the angle between vectors. Multiple hash tables with independent hash functions increase recall.
Hierarchical Navigable Small Worlds (HNSW)
HNSW constructs a multi-layer graph where each node connects to its approximate nearest neighbors. Search begins at the top layer (sparse, long-range connections) and descends to lower layers (dense, local connections). Query complexity is $O(\log n)$ with high probability for well-constructed graphs.
Inverted File Index (IVF)
IVF partitions the embedding space into $C$ Voronoi cells via $k$-means clustering. At query time, only items in the nearest $n_{\text{probe}}$ cells are searched:
$$ \text{Candidates} = \bigcup_{c \in \text{top-}n_{\text{probe}}(\mathbf{q})} \text{Cell}_c $$Product quantization (PQ) further compresses vectors within cells, enabling memory-efficient search over billions of items.
| Algorithm | Index Build | Query Time | Memory | Recall |
|---|---|---|---|---|
| Exact k-NN | $O(1)$ | $O(n \cdot d)$ | $O(n \cdot d)$ | 100% |
| LSH | $O(n \cdot L)$ | $O(L \cdot B)$ | $O(n \cdot L)$ | 80–95% |
| HNSW | $O(n \log n)$ | $O(\log n)$ | $O(n \cdot M)$ | 95–99% |
| IVF-PQ | $O(n \cdot k)$ | $O(n\_{\text{probe}} \cdot n/C)$ | $O(n \cdot m)$ | 90–98% |
Index Maintenance and Freshness
Production ANN indexes must balance query performance against freshness—new items and updated embeddings must become discoverable within acceptable latency bounds. Several strategies address this operational challenge:
Incremental Updates: Some index structures (notably HNSW) support online insertion of new vectors without full rebuilds. New items are inserted into the graph by finding their approximate nearest neighbors and establishing bidirectional edges. However, incremental insertion degrades index quality over time as the graph structure drifts from optimal.
Tiered Indexing: A common pattern separates the corpus into tiers by age:
| Tier | Content Age | Index Type | Update Cadence |
|---|---|---|---|
| Hot | < 1 hour | Brute-force or streaming index | Real-time |
| Warm | 1–24 hours | Incrementally updated HNSW | Minutes |
| Cold | > 24 hours | Batch-rebuilt IVF-PQ | Daily |
Query routing merges results across tiers, applying freshness boosts to hot-tier candidates. This architecture ensures new content is immediately retrievable while maintaining efficient search over the long tail.
Staleness Budgets: Teams define acceptable staleness windows per use case. A “For You” feed might tolerate 15-minute embedding lag, while a search typeahead requires sub-second freshness. Monitoring dashboards track:
- Embedding lag: Time between item creation and index availability
- Index coverage: Fraction of eligible items present in the index
- Recall degradation: Periodic evaluation against ground-truth rankings
Rebuild and Re-sharding: Full index rebuilds occur on a scheduled cadence (daily or weekly) to restore optimal graph structure. As the corpus grows, re-sharding distributes the index across more machines. Strategies include:
- Blue-green deployment: Build new index offline; swap atomically when ready
- Rolling updates: Replace shards incrementally to avoid query disruption
- Incremental PQ codebook updates: Retrain product quantization codebooks on recent embedding distributions
Real-Time Embedding Updates: When user or item embeddings change (e.g., after new interactions), the system must propagate updates:
- Lazy refresh: Re-embed on next query; cache result
- Background refresh: Async job re-computes embeddings and patches the index
- Streaming updates: Kafka-based pipeline pushes embedding deltas to index shards
The choice depends on update frequency and latency requirements. User embeddings (changing with every interaction) often use lazy refresh, while item embeddings (more stable) use batch updates.
Graph-Based Retrieval
Social media platforms naturally form graphs: users follow users, users engage with items, items share topics. Graph-based methods leverage this structure.
Graph Neural Networks (GNNs)
Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a graph with $N$ nodes and node features $\mathbf{X} \in \mathbb{R}^{N \times d}$. GNNs learn node representations via message passing:
$$ \mathbf{h}\_v^{(l+1)} = \text{UPDATE}\left( \mathbf{h}\_v^{(l)}, \text{AGGREGATE}\left( \{ \mathbf{h}\_u^{(l)} : u \in \mathcal{N}(v) \} \right) \right) $$Graph Convolutional Networks (GCN) use spectral convolution:
$$ \mathbf{H}^{(l+1)} = \sigma\left( \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2} \mathbf{H}^{(l)} \mathbf{W}^{(l)} \right) $$where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ is the adjacency matrix with self-loops, $\tilde{\mathbf{D}}$ is the degree matrix, and $\mathbf{W}^{(l)}$ is a learnable weight matrix.
GraphSAGE samples and aggregates neighborhoods:
$$ \mathbf{h}\_v^{(l+1)} = \sigma\left( \mathbf{W}^{(l)} \cdot \text{CONCAT}\left( \mathbf{h}\_v^{(l)}, \text{AGG}(\{ \mathbf{h}\_u^{(l)} : u \in \mathcal{N}\_{\text{sample}}(v) \}) \right) \right) $$Aggregators include mean, LSTM, and pooling. Sampling bounds computation cost.
Bipartite Graph Models
User-item interactions form bipartite graphs. LightGCN simplifies GCN for collaborative filtering:
$$ \mathbf{e}\_u^{(l+1)} = \sum\_{i \in \mathcal{N}\_u} \frac{1}{\sqrt{d\_u} \sqrt{d\_i}} \mathbf{e}\_i^{(l)} $$$$ \mathbf{e}\_i^{(l+1)} = \sum\_{u \in \mathcal{N}\_i} \frac{1}{\sqrt{d\_i} \sqrt{d\_u}} \mathbf{e}\_u^{(l)} $$
where $d\_u = \lvert\mathcal{N}\_u\rvert$ and $d\_i = \lvert\mathcal{N}\_i\rvert$ are the degrees of user $u$ and item $i$.
Final embeddings combine all layers: $\mathbf{e}\_u = \sum\_{l=0}^{L} \alpha\_l \mathbf{e}\_u^{(l)}$.
Random Walk Methods
DeepWalk and Node2Vec learn embeddings via skip-gram on random walks:
$$ \max \sum\_{v \in \mathcal{V}} \sum\_{c \in \mathcal{N}\_R(v)} \log P(c | \phi(v)) $$where $\mathcal{N}\_R(v)$ is the context window from random walks starting at $v$, and $P(c | \phi(v)) = \frac{\exp(\phi(c)^\top \phi(v))}{\sum\_{u} \exp(\phi(u)^\top \phi(v))}$.
Node2Vec biases walks with parameters $p$ (return) and $q$ (in-out) to balance BFS (local) and DFS (global) exploration:
$$ P(c\_i = x | c\_{i-1} = v) = \begin{cases} \frac{1}{p} & \text{if } d\_{tx} = 0 \text{ (return to } t \text{)} \\ 1 & \text{if } d\_{tx} = 1 \\ \frac{1}{q} & \text{if } d\_{tx} = 2 \end{cases} $$where $t$ is the previous node and $d\_{tx}$ is the shortest path distance.
Personalized PageRank (PPR)
PPR computes stationary distribution of random walks with restarts:
$$ \mathbf{p}\_u = \alpha \mathbf{e}\_u + (1 - \alpha) \mathbf{A}^\top \mathbf{D}^{-1} \mathbf{p}\_u $$where $\alpha$ is the restart probability and $\mathbf{e}\_u$ is the indicator vector for user $u$. Items with high $\mathbf{p}\_u[i]$ are candidates. Efficient approximations (push-based, Monte Carlo) enable real-time computation.
Content Understanding Pipeline
Before items can be retrieved, ranked, or embedded, their raw content must be processed into machine-readable features. Social media platforms ingest diverse content—text posts, images, videos, links—each requiring specialized processing pipelines.
Text Processing
Text content flows through multiple stages:
-
Extraction and cleaning:
- Strip HTML/markdown formatting
- Normalize Unicode (handle emojis, accented characters, RTL scripts)
- Remove or flag excessive punctuation, all-caps text (spam signals)
-
Language detection:
- Classify language (fastText, CLD3) for routing to language-specific pipelines
- Handle code-switching (multilingual posts)
-
Tokenization and normalization:
- Subword tokenization (BPE, WordPiece, SentencePiece) for transformer models
- Hashtag splitting (#MachineLearning → “machine”, “learning”)
- URL extraction and canonicalization
-
Entity extraction:
- Named Entity Recognition (NER) for people, places, organizations
- Mention/tag parsing (@username)
- Topic extraction via keyword models or zero-shot classifiers
-
Embedding generation:
- Pre-trained encoders: BERT, RoBERTa, sentence-transformers
- Domain-specific fine-tuning on social media text
- Multilingual models (XLM-R, mBERT) for cross-lingual retrieval
flowchart LR
Raw[Raw Text] --> Clean[Clean & Normalize]
Clean --> LangDetect[Language Detection]
LangDetect --> Tokenize[Tokenization]
Tokenize --> NER[Entity Extraction]
Tokenize --> Encoder[Text Encoder]
Encoder --> TextEmbed[Text Embedding]
NER --> Features[Feature Store]
TextEmbed --> Features
Image Processing
Images dominate social media consumption—Instagram, Pinterest, and TikTok feeds are fundamentally visual experiences, while even text-centric platforms like Twitter/X see 2-3× higher engagement on posts with images. For recommendation systems, images provide rich semantic signals beyond what text captions convey: the visual composition, objects, scenes, people, emotions, and aesthetic quality all influence user preferences. A user who engages with sunset beach photos has different interests than one who engages with architectural photography, even if both caption their posts with generic hashtags like “#beautiful”.
Processing images for recommendation systems serves multiple purposes: content understanding (what’s in the image?), similarity matching (find visually similar content), safety filtering (detect policy violations), and quality scoring (surface aesthetically pleasing content). Modern systems extract dense visual embeddings that capture semantic meaning, enabling both exact matching (duplicate detection) and approximate matching (similar content retrieval). These embeddings feed into the same retrieval and ranking pipelines as text features, allowing multimodal fusion where visual, textual, and behavioral signals jointly determine relevance.
Images require visual understanding for content-based retrieval and safety:
- Preprocessing:
- Resize, crop, normalize pixel values
- Augmentation during training (rotation, color jitter, crops)
- Aspect ratio handling
- Feature extraction:
- Pre-trained CNNs: ResNet, EfficientNet
- Vision transformers: ViT, Swin
- CLIP for joint vision-language embeddings
- Scene understanding:
- Object detection (YOLO, Faster R-CNN) for salient objects
- Scene classification (indoor/outdoor, activity type)
- Aesthetic quality scoring
- OCR and text-in-image:
- Extract text overlays (memes, infographics)
- Feed to text pipeline for additional signals
- Safety classifiers:
- NSFW detection
- Violence/gore detection
- Hate symbol recognition
Video Processing
Video has become the dominant content format on social media, accounting for over 80% of internet traffic by 2024. Platforms like TikTok, YouTube Shorts, Instagram Reels, and Snapchat Spotlight are entirely video-first, while traditional feeds increasingly prioritize video over static content. The shift is driven by engagement economics: users spend 5-10× longer watching videos than viewing images, and video posts generate substantially higher interaction rates.
For recommendation systems, video presents unique challenges and opportunities. Unlike static images, video encodes temporal dynamics—actions, transitions, narrative arcs—that are critical to understanding content. A cooking video’s appeal depends on the progression from ingredients to finished dish; a comedy sketch relies on timing and punchlines. Audio is equally essential: music tracks drive viral trends (TikTok’s sound-based discovery), speech content conveys meaning beyond visuals, and production quality (background noise, audio clarity) affects perceived professionalism.
The computational challenge is formidable: a 60-second video at 30fps contains 1,800 frames, each requiring image-level processing, plus audio tracks, plus temporal modeling. Processing costs are 10-100× higher than images, creating latency and infrastructure bottlenecks. Production systems must balance comprehensiveness (capture all relevant signals) against efficiency (process billions of videos within reasonable cost). This leads to tiered processing strategies: fast, cheap models run on all content for safety and basic categorization, while expensive deep models run selectively or offline for high-value content.
Video adds temporal and audio dimensions:
-
Frame sampling:
- Uniform sampling: Extract frames at fixed intervals (e.g., 1 fps)
- Shot boundary detection: Sample at scene transitions
- Keyframe extraction: Select representative frames per shot
-
Visual encoding:
- Per-frame embeddings (CNN or ViT)
- Temporal aggregation: average pooling, LSTM, transformer over frame sequence
- Action recognition models (I3D, SlowFast) for activity classification
-
Audio processing:
- Audio extraction and resampling
- Speech-to-text (Whisper, wav2vec) for transcription
- Music/sound classification
- Audio embeddings (VGGish, CLAP) for audio-visual fusion
-
Multimodal fusion:
- Concatenate visual, audio, text embeddings
- Cross-modal attention (video frames attend to audio, text)
- Late fusion: train modality-specific models, combine scores
flowchart TB
Video[Video Upload] --> FrameSample[Frame Sampling]
Video --> AudioExtract[Audio Extraction]
FrameSample --> VisualEnc[Visual Encoder]
FrameSample --> OCR[OCR / Text-in-Video]
AudioExtract --> STT[Speech-to-Text]
AudioExtract --> AudioEnc[Audio Encoder]
VisualEnc --> Temporal[Temporal Aggregation]
Temporal --> VisualEmbed[Visual Embedding]
AudioEnc --> AudioEmbed[Audio Embedding]
STT --> TextPipe[Text Pipeline]
OCR --> TextPipe
TextPipe --> TextEmbed[Text Embedding]
VisualEmbed --> Fusion[Multimodal Fusion]
AudioEmbed --> Fusion
TextEmbed --> Fusion
Fusion --> FinalEmbed[Content Embedding]
Scalability and Cost Trade-offs
Processing billions of items requires careful resource allocation:
| Content Type | Processing Cost | Latency | Strategy |
|---|---|---|---|
| Text | Low | <10ms | Synchronous on upload |
| Image | Medium | 50-200ms | Async pipeline; cache embeddings |
| Short video (<1 min) | High | 1-5s | Async pipeline; batch processing |
| Long video (>5 min) | Very high | 10-60s | Offline batch; sample frames aggressively |
Optimization techniques:
- Lazy processing: Compute embeddings only for items that pass initial filters (spam, policy violations)
- Tiered processing: Fast safety classifiers run immediately; expensive embeddings computed async
- Model distillation: Deploy small, fast models for serving; large models for offline batch processing
- Caching: Embeddings rarely change; cache in K-V store (Redis, Memcached) keyed by content hash
- Incremental updates: When item metadata changes (edit, new comments), avoid reprocessing entire video
Feature Engineering
Feature engineering is the process of transforming raw data—user actions, content attributes, system logs—into numerical representations that machine learning models can consume. In social media recommendation systems, this transformation is critical: the same model architecture can produce radically different results depending on which features it sees. A ranking model that only observes “user clicked this post” performs far worse than one that knows “user watched 80% of a 2-minute cooking video at 2pm on mobile, having previously engaged with 15 other cooking videos this week, and the creator has 50k followers with 5% engagement rate.”
Features are the bridge between the messy reality of user behavior and the mathematical abstractions of prediction models. They must capture subtle signals—a user’s preference shift from travel content to home renovation after a life event, a creator’s quality improving over time, the contextual relevance of sports content during a championship game—while remaining computationally efficient to compute and serve at scale.
Why Feature Engineering Dominates in Social Media
Social media recommendations differ from other ML domains in several ways that make feature engineering especially critical:
Sparse behavioral signals: Most users interact with <0.01% of available content. A typical Facebook user might engage with 10-20 posts per day out of billions created. The model must infer preferences from these sparse signals, relying heavily on features that generalize: if a user likes cooking videos from creator A, which features predict they’ll like creator B? Shared topics, similar production style, overlapping audience demographics, or common hashtags all become crucial features.
Heterogeneous content types: Unlike Netflix (all videos) or Spotify (all music), social media mixes text posts, images, short videos, long videos, links, polls, and stories. Feature engineering must create comparable representations across modalities. A user who engages with fitness images should see fitness videos; the model needs features that capture “fitness interest” independent of format.
Temporal dynamics: User interests shift rapidly. Someone searching for “birthday cake recipes” today has different needs than their historical profile suggests. Features must balance long-term preferences (stored in user embeddings) with short-term intent (recent search queries, current session behavior). Real-time features—what the user did in the last 5 minutes—often outweigh historical features for predicting the next click.
Social context: Content spreads through social graphs. A post’s value depends not just on its content, but on who shared it, who else engaged with it, and whether it’s trending in your network. Features must encode network effects: “3 of your close friends liked this post” is a powerful signal. Graph-based features (mutual friends with creator, overlap in followed accounts) capture social proof.
Creator incentives: Social platforms balance user experience with creator economics. Features must encode not just “will the user like this?” but “should we show this content given our broader objectives?” Features like creator earnings, post monetization status, and historical policy violations inform decisions about surfacing content.
The Feature Engineering Pipeline
Raw data flows through several stages before becoming model inputs:
- Collection: Log user actions (clicks, likes, watch time), content attributes (text, images, metadata), and context (device, time, location)
- Transformation: Convert raw values into model-friendly representations (embeddings, normalizations, discretizations)
- Aggregation: Combine atomic events into statistical summaries (user’s average watch time, creator’s engagement rate over 7 days)
- Storage: Materialize features in low-latency stores for online serving and data lakes for offline training
- Serving: Retrieve features in <10ms during inference while ensuring consistency with training data
The challenge is maintaining this pipeline at social media scale: billions of users, trillions of feature values, and real-time updates as user behavior evolves. Feature stores have emerged as critical infrastructure, providing versioned, point-in-time correct features that prevent train-serve skew.
Features are the raw material of recommendation models. They encode signals about users, items, and the context of each request.
User Features
| Feature Category | Examples | Representation |
|---|---|---|
| Demographics | Age, location, language | Categorical embeddings |
| Engagement history | Liked posts, watch time, comments | Sequence embeddings (Transformer, LSTM) |
| Social graph | Follower count, friend overlap | Graph embeddings (GraphSAGE, GCN) |
| Session context | Time of day, device, referrer | Dense features |
| Long-term preferences | Topic affinities, creator subscriptions | Aggregated embeddings |
Item Features
| Feature Category | Examples | Representation |
|---|---|---|
| Content | Text, images, video frames | Pre-trained encoders (BERT, CLIP, ViT) |
| Metadata | Author, timestamp, hashtags | Categorical embeddings |
| Engagement statistics | Likes, shares, CTR | Normalized counts, log-transforms |
| Quality signals | Spam score, policy violations | Binary or continuous |
| Freshness | Age since publication | Decay functions |
Context Features
| Feature Category | Examples | Representation |
|---|---|---|
| Request context | Device type, network speed | Categorical |
| Temporal | Hour, day of week, holidays | Cyclical encoding (sin/cos) |
| Session state | Items already shown, scroll depth | Set or sequence embeddings |
| Real-time signals | Trending topics, breaking news | Dynamic embeddings |
Feature Stores
Production systems materialize features in feature stores that serve both offline training and online inference with consistent semantics. Key requirements:
- Point-in-time correctness: Training must use features as they existed at interaction time to avoid data leakage.
- Low-latency serving: Online inference requires feature retrieval in single-digit milliseconds.
- Backfill and versioning: Schema changes must propagate to historical data without corrupting training sets.
Platforms like Feast, Tecton, and internal systems at Meta/Google/TikTok address these requirements.
flowchart LR
Events[(Event Stream)] --> Compute[Feature Compute]
Compute --> Store[(Feature Store)]
Store --> Training[Offline Training]
Store --> Serving[Online Serving]
subgraph Feature Store
Offline[(Offline Store / Data Lake)]
Online[(Online Store / KV Cache)]
end
Training-Serving Skew Detection
A critical operational concern is training-serving skew—discrepancies between features computed offline (for training) and online (for inference). Skew causes silent model degradation: the model sees different feature distributions at serving time than it learned from.
Sources of Skew:
| Skew Type | Cause | Example |
|---|---|---|
| Schema drift | Feature definition changes between training and serving | Normalization logic updated in serving but not backfilled |
| Temporal leakage | Training uses future information unavailable at serving | Aggregating engagement counts that include the label event |
| Null/default handling | Different imputation strategies | Training drops nulls; serving uses zeros |
| Computation divergence | Separate codepaths for batch vs. streaming | Floating-point precision differences |
Validation Mechanisms:
- Shadow requests: Route a fraction of live traffic to a shadow pipeline that logs both online-computed and offline-reconstructed features. Compare distributions and flag divergence exceeding thresholds.
- Canary feature checks: Before model deployment, run inference on held-out examples using both training-time and serving-time feature values. Alert if prediction distributions differ significantly.
- Schema versioning: Tag each feature with a version; models declare required versions. Serving rejects requests if versions mismatch.
- Point-in-time audits: Periodically reconstruct historical features and compare against logged serving values. Statistical tests (KS test, PSI) detect drift.
Monitoring and Alerts:
| Metric | Definition | Alert Threshold |
|---|---|---|
| Feature null rate | Fraction of requests with missing features | > 1% for critical features |
| Distribution shift (PSI) | Population Stability Index vs. training baseline | > 0.1 |
| Latency percentiles | Feature fetch P50/P99 | > SLA |
| Coverage | Fraction of entities with fresh features | < 99% |
Dashboards surface feature health alongside model metrics, enabling rapid diagnosis when prediction quality degrades. Automated alerts trigger on null spikes or distribution shifts before downstream models are impacted.