Recommendation Systems Part 1: System Architecture & Feature Engineering

Part 1 of 6 | Part 2: Ranking & Re-ranking →

Introduction

Social media platforms generate value by connecting users with content they find engaging. The recommendation system—the algorithmic machinery that selects which posts, videos, or accounts to surface—is the engine of this engagement. A well-designed recommender balances multiple objectives: user satisfaction, content diversity, creator economics, and platform health. This article examines the architecture of production-grade recommendation systems, drawing on published designs from major platforms while maintaining a principled, systems-oriented perspective.

The core problem is deceptively simple: given a user $u$ and a corpus of candidate items $\mathcal{I}$, produce a ranked list $\hat{\mathcal{I}} \subset \mathcal{I}$ that maximizes some utility function $U(u, \hat{\mathcal{I}})$. In practice, the item corpus may exceed $10^9$ items, latency budgets are measured in tens of milliseconds, and the utility function encodes competing business objectives that shift over time.

Formal Problem Definition

Let $\mathcal{U}$ denote the user space and $\mathcal{I}$ the item space. A recommendation system learns a scoring function $f: \mathcal{U} \times \mathcal{I} \rightarrow \mathbb{R}$ such that for user $u$ and items $i, j \in \mathcal{I}$:

$$ f(u, i) > f(u, j) \iff u \text{ prefers } i \text{ over } j $$

The system produces a ranking $\pi_u: [k] \rightarrow \mathcal{I}$ that orders the top-$k$ items by predicted preference. The objective is to maximize expected utility:

$$ \max_f \mathbb{E}_{u \sim P(\mathcal{U})} \left[ U(u, \pi_u^f) \right] $$

where $\pi_u^f$ is the ranking induced by $f$ and $U$ is a utility function encoding engagement, satisfaction, or business value.

Utility Decomposition

In practice, utility decomposes into multiple objectives. Let $y^{(t)}\_{u,i} \in \{0,1\}$ indicate whether user $u$ performed action $t$ (click, like, share, etc.) on item $i$. The platform optimizes a weighted combination:

$$ U(u, \pi) = \sum\_{t \in \mathcal{T}} w\_t \sum\_{k=1}^{K} \gamma\_k \cdot \mathbb{E}[y^{(t)}\_{u, \pi(k)}] $$

where $w\_t$ are task weights (encoding relative value of clicks vs. shares), $\gamma\_k$ is a position discount (users are less likely to see lower-ranked items), and the expectation is over user behavior given the recommendation.

The position discount often follows a Discounted Cumulative Gain (DCG) formulation:

$$ \gamma\_k = \frac{1}{\log\_2(k + 1)} $$

or an exponential decay $\gamma\_k = \alpha^{k-1}$ for some $\alpha \in (0,1)$.

flowchart LR
    User[User Request] --> CG[Candidate Generation]
    CG --> |1000s of candidates| Ranking[Ranking Model]
    Ranking --> |100s scored| Rerank[Re-ranking & Blending]
    Rerank --> |10s displayed| Feed[User Feed]

    subgraph Offline
        Logs[(Interaction Logs)] --> Training[Model Training]
        Training --> Models[(Model Registry)]
    end

    Models --> CG
    Models --> Ranking

System Architecture Overview

Production recommendation systems decompose into distinct stages, each trading off recall against precision and latency:

Stage	Input Size	Output Size	Latency Budget	Primary Goal
Candidate Generation	$10^9$ items	$10^3$–$10^4$	10–50 ms	High recall; retrieve plausible items
Ranking	$10^3$–$10^4$	$10^2$–$10^3$	20–100 ms	Precise scoring; predict engagement
Re-ranking	$10^2$–$10^3$	$10^1$–$10^2$	5–20 ms	Diversity, business rules, policy
Blending	Multiple sources	Final feed	5–10 ms	Mix organic, ads, notifications

Latency budget caveats: These targets assume steady-state operation. In practice, ANN index maintenance creates periodic latency pressure:

Maintenance Operation	Impact	Mitigation
Incremental insertion	+5–15% P99 latency during bursts	Rate-limit insertions; batch during off-peak
Index shard rebuild	Temporary capacity loss (one shard offline)	Blue-green deployment; over-provision replicas
Codebook retraining	CPU contention on index servers	Schedule during low-traffic windows
Embedding refresh propagation	Stale results until propagation completes	Staleness budgets; freshness monitoring

Under-provisioning that ignores maintenance overhead leads to SLA violations during index updates. Capacity planning should account for 15–20% headroom beyond steady-state requirements.

This funnel architecture enables each stage to apply increasingly expensive computation to a shrinking candidate set. The design mirrors information retrieval pipelines: cheap filters narrow the search space before expensive models score the survivors.

Candidate Generation

Candidate generation retrieves a manageable subset of items that might interest the user. Common approaches include:

Collaborative filtering: Find users similar to $u$ (via matrix factorization, nearest-neighbor search) and recommend items they engaged with. Item-based variants identify items similar to those $u$ has liked.
Content-based retrieval: Embed items and user preferences into a shared vector space; retrieve items whose embeddings are close to the user’s interest vector.
Graph-based retrieval: Traverse the social graph (friends, followers, group memberships) to surface content from connected nodes.
Popularity and recency: Baseline retrievers that surface trending or fresh content regardless of personalization.

Most systems blend multiple retrievers, each producing a candidate stream that is merged downstream.

flowchart TB
    subgraph Retrievers
        CF[Collaborative Filtering]
        CB[Content-Based]
        Graph[Graph Traversal]
        Trending[Trending / Recency]
    end

    CF --> Merge[Candidate Merge]
    CB --> Merge
    Graph --> Merge
    Trending --> Merge
    Merge --> Dedup[Deduplication]
    Dedup --> RankingStage[To Ranking]

Embedding-Based Retrieval

Modern systems encode users and items as dense vectors (embeddings) and cast candidate generation as approximate nearest neighbor (ANN) search. Given user embedding $\mathbf{u} \in \mathbb{R}^d$ and item embeddings $\{\mathbf{v}_i\}$, retrieval finds:

$$ \text{top-}k = \arg\max_{i \in \mathcal{I}} \, \text{sim}(\mathbf{u}, \mathbf{v}_i) $$

where $\text{sim}$ is typically cosine similarity or inner product. ANN indices (HNSW, IVF, ScaNN) achieve sub-linear query time, enabling retrieval over billion-scale corpora in milliseconds.

Two-tower models are a prevalent architecture: one neural network encodes the user (from features like history, demographics, context), another encodes the item (from content, metadata, engagement statistics). The towers are trained jointly to maximize similarity for positive (user, item) pairs.

flowchart LR
    subgraph User Tower
        UF[User Features] --> UE[User Encoder]
        UE --> UV[User Vector]
    end

    subgraph Item Tower
        IF[Item Features] --> IE[Item Encoder]
        IE --> IV[Item Vector]
    end

    UV --> Sim[Similarity]
    IV --> Sim
    Sim --> Score[Retrieval Score]

Matrix Factorization Foundations

The theoretical underpinning of embedding-based retrieval is matrix factorization. Let $\mathbf{R} \in \mathbb{R}^{m \times n}$ be the (sparse) user-item interaction matrix where $R_{ui} = 1$ if user $u$ engaged with item $i$, with $m = \lvert\mathcal{U}\rvert$ users and $n = \lvert\mathcal{I}\rvert$ items. Matrix factorization seeks low-rank factors $\mathbf{P} \in \mathbb{R}^{m \times d}$ and $\mathbf{Q} \in \mathbb{R}^{n \times d}$ such that:

$$ \mathbf{R} \approx \mathbf{P} \mathbf{Q}^\top $$

The row $\mathbf{p}\_u$ is the user embedding; $\mathbf{q}\_i$ is the item embedding. The predicted affinity is $\hat{r}\_{ui} = \mathbf{p}\_u^\top \mathbf{q}\_i$.

Weighted Alternating Least Squares (WALS)

For implicit feedback (binary interactions), the Weighted Regularized Matrix Factorization objective is:

$$ \min\_{\mathbf{P}, \mathbf{Q}} \sum\_{(u,i) \in \Omega} c\_{ui} (r\_{ui} - \mathbf{p}\_u^\top \mathbf{q}\_i)^2 + \lambda \left( \|\mathbf{P}\|\_F^2 + \|\mathbf{Q}\|\_F^2 \right) $$

where $\Omega$ is the set of observed interactions, $c\_{ui}$ is a confidence weight (higher for explicit engagement, lower for implicit exposure), and $\lambda$ is the regularization coefficient. The alternating least squares (ALS) algorithm fixes $\mathbf{Q}$ and solves for $\mathbf{P}$ in closed form:

$$ \mathbf{p}\_u = \left( \mathbf{Q}^\top \mathbf{C}\_u \mathbf{Q} + \lambda \mathbf{I} \right)^{-1} \mathbf{Q}^\top \mathbf{C}\_u \mathbf{r}\_u $$

where $\mathbf{C}\_u = \text{diag}(c\_{u1}, \ldots, c\_{un})$ is the diagonal confidence matrix for user $u$. The procedure alternates between updating $\mathbf{P}$ and $\mathbf{Q}$ until convergence.

Two-Tower Training Objectives

Neural two-tower models extend matrix factorization with non-linear encoders. Let $\phi\_\theta: \mathcal{X}\_u \rightarrow \mathbb{R}^d$ be the user encoder and $\psi\_\phi: \mathcal{X}\_i \rightarrow \mathbb{R}^d$ be the item encoder, parameterized by neural networks. The similarity score is:

$$ s(u, i) = \frac{\phi_\theta(x_u)^\top \psi_\phi(x_i)}{\tau} $$

where $\tau > 0$ is a temperature hyperparameter controlling the sharpness of the distribution.

Contrastive Loss (InfoNCE)

Given a batch of $N$ positive pairs $\{(u\_j, i\_j^+)\}\_{j=1}^N$, the InfoNCE loss treats other items in the batch as negatives:

$$ \mathcal{L}\_{\text{InfoNCE}} = -\frac{1}{N} \sum\_{j=1}^{N} \log \frac{\exp(s(u\_j, i\_j^+))}{\sum\_{k=1}^{N} \exp(s(u\_j, i\_k^+))} $$

This is equivalent to an $(N-1)$-way classification problem where the model must identify the positive item among in-batch negatives. The gradient with respect to the item encoder pushes positive pairs closer and negative pairs apart in the embedding space.

Sampled Softmax

For large item vocabularies, full softmax is intractable. Sampled softmax approximates the partition function by sampling $K \ll n$ negative items (where $n$ is the total number of items):

$$ \mathcal{L}\_{\text{sampled}} = -\log \frac{\exp(s(u, i^+))}{\exp(s(u, i^+)) + \sum\_{k=1}^{K} \exp(s(u, i\_k^-))} $$

where negatives $\{i\_k^-\}$ are drawn from a proposal distribution $Q(i)$ (often item popularity). Importance weighting corrects for sampling bias:

$$ \mathcal{L}\_{\text{corrected}} = -\log \frac{\exp(s(u, i^+))}{\exp(s(u, i^+)) + \sum\_{k=1}^{K} \frac{\exp(s(u, i\_k^-))}{K \cdot Q(i\_k^-)}} $$

Negative Sampling Strategies

With billions of items, exhaustive negative sampling is impossible. Training requires strategic selection of negatives that are informative but computationally feasible.

Why negative sampling matters:

Computational efficiency: Full softmax over billions of items is intractable
Learning signal: Model quality depends on negative quality—too easy or too hard negatives degrade training
Data imbalance: Positive examples are rare (clicks on 0.01% of impressions); naive sampling creates severe class imbalance

Negative sampling strategies:

Strategy	Mechanism	Pros	Cons
Random	Sample uniformly from item corpus	Simple, unbiased	Too easy; model doesn’t learn
Popularity-biased	Sample proportional to item popularity $P(i) \propto \text{count}(i)^\alpha$	Matches serving distribution	Still relatively easy
In-batch	Treat other positives in batch as negatives	Free (no extra sampling)	Limited diversity (batch size ~1000s)
Impression-based	Sample from items shown but not clicked	Reflects user decision boundary	Requires logging all impressions
Hard negatives	Mine items similar to positives but not engaged	Maximally informative	Expensive to compute; can be too hard

Popularity-biased sampling:

Uniform sampling over-represents rare items. Popularity-biased sampling corrects this:

$$ P(i^-) \propto (\text{popularity}(i))^\alpha $$

where $\alpha \in [0, 1]$ controls the bias strength. Common values: $\alpha = 0.75$.

In-batch negatives:

Efficient trick for contrastive learning. Given a batch of $(u_j, i_j^+)$ pairs, treat $i_k^+$ ($k \neq j$) as negatives for user $u_j$:

$$ \mathcal{L}\_{\text{in-batch}} = -\frac{1}{N} \sum\_{j=1}^{N} \log \frac{\exp(s(u\_j, i\_j^+))}{\sum\_{k=1}^{N} \exp(s(u\_j, i\_k^+))} $$

Advantages:

Zero sampling overhead
Naturally diverse (batch contains varied positives)
Gradient computation is parallelizable

Limitations:

Negatives are actually other users’ positives (not true negatives)
Batch size limits negative count (typically 256-2048)

Cached cross-batch negatives:

Maintain a memory bank of recent embeddings. For each positive $(u, i^+)$, sample negatives from the memory bank:

memory_bank = queue(capacity=100K)  # FIFO queue
for batch in dataloader:
    negatives = sample(memory_bank, k=100)
    loss = contrastive_loss(batch, negatives)
    memory_bank.enqueue(batch.embeddings)

Increases effective negative count from batch size (~1K) to memory bank size (~100K) with minimal overhead.

Hard Negative Mining

Random negatives are often too easy; the model achieves low loss without learning fine-grained distinctions. Hard negative mining selects negatives that are challenging:

1. In-batch hard negatives:

Within each batch, select the negative with highest score for each positive:

$$ i\_{\text{hard}}^- = \arg\max\_{k \neq j} s(u\_j, i\_k^+) $$

Trade-off: Focusing only on hardest negatives can cause training instability (model gets stuck on outliers).

2. Semi-hard negatives:

Select negatives that are harder than the positive, but not the absolute hardest:

$$ i\_{\text{semi-hard}}^- = \arg\max\_{k : s(u, i\_k^-) > s(u, i^+) - m} s(u, i\_k^-) $$

where $m$ is a margin. Semi-hard negatives are informative but not adversarial.

3. ANN-based mining:

Periodically mine hard negatives offline:

Encode all items with current model
For each positive $(u, i^+)$, query ANN index for nearest neighbors to $i^+$
Filter out items user has engaged with
Store hard negatives for training

Frequency: Every epoch or every few gradient steps. Mining too frequently wastes computation; too infrequently uses stale hard negatives.

4. Dynamic hard negative mining:

During training, maintain an ANN index of item embeddings. At each batch:

Compute batch embeddings
Query ANN for hard negatives on-the-fly
Compute loss with mined negatives

Cost: Adds 5-10ms per batch (ANN query overhead). Only feasible with fast ANN implementations (HNSW, ScaNN).

5. Curriculum learning:

Gradually increase negative difficulty:

Epoch 1-10: Random negatives
Epoch 11-20: Popularity-biased negatives
Epoch 21+: Hard negatives

Prevents model from getting stuck in bad local minima early in training.

Optimal negative count:

More negatives improve model quality but increase compute cost. Empirical findings:

Negative Count	Relative Accuracy	Training Time
10	Baseline	1x
100	+5%	3x
1000	+8%	10x
10000	+9%	40x

Diminishing returns beyond 1000 negatives. Production systems typically use 100-1000 negatives per positive.

Loss formulation with hard negatives:

$$ \mathcal{L}\_{\text{hard}} = -\log \frac{\exp(s(u, i^+))}{\exp(s(u, i^+)) + \sum\_{k \in \text{hard}(u)} \exp(s(u, i\_k^-))} $$

Debiasing hard negatives:

Hard negatives are not sampled uniformly, which biases gradients. Correct via importance weighting:

$$ \mathcal{L}\_{\text{corrected}} = -\log \frac{\exp(s(u, i^+))}{\exp(s(u, i^+)) + \sum\_{k} \frac{\exp(s(u, i\_k^-))}{P\_{\text{sample}}(i\_k^-)}} $$

where $P\_{\text{sample}}(i\_k^-)$ is the probability of sampling negative $i\_k^-$.

Approximate Nearest Neighbor Search

Exact $k$-NN search requires $O(n \cdot d)$ time per query—infeasible for billion-scale corpora. ANN algorithms trade exactness for speed.

Locality-Sensitive Hashing (LSH)

LSH hashes similar vectors to the same bucket with high probability. For cosine similarity, random hyperplane hashing uses:

$$ h_{\mathbf{r}}(\mathbf{v}) = \text{sign}(\mathbf{r}^\top \mathbf{v}) $$

where $\mathbf{r}$ is a random unit vector. The probability of collision is:

$$ P[h_{\mathbf{r}}(\mathbf{u}) = h_{\mathbf{r}}(\mathbf{v})] = 1 - \frac{\theta(\mathbf{u}, \mathbf{v})}{\pi} $$

where $\theta$ is the angle between vectors. Multiple hash tables with independent hash functions increase recall.

Hierarchical Navigable Small Worlds (HNSW)

HNSW constructs a multi-layer graph where each node connects to its approximate nearest neighbors. Search begins at the top layer (sparse, long-range connections) and descends to lower layers (dense, local connections). Query complexity is $O(\log n)$ with high probability for well-constructed graphs.

Inverted File Index (IVF)

IVF partitions the embedding space into $C$ Voronoi cells via $k$-means clustering. At query time, only items in the nearest $n_{\text{probe}}$ cells are searched:

$$ \text{Candidates} = \bigcup_{c \in \text{top-}n_{\text{probe}}(\mathbf{q})} \text{Cell}_c $$

Product quantization (PQ) further compresses vectors within cells, enabling memory-efficient search over billions of items.

Algorithm	Index Build	Query Time	Memory	Recall
Exact k-NN	$O(1)$	$O(n \cdot d)$	$O(n \cdot d)$	100%
LSH	$O(n \cdot L)$	$O(L \cdot B)$	$O(n \cdot L)$	80–95%
HNSW	$O(n \log n)$	$O(\log n)$	$O(n \cdot M)$	95–99%
IVF-PQ	$O(n \cdot k)$	$O(n\_{\text{probe}} \cdot n/C)$	$O(n \cdot m)$	90–98%

Index Maintenance and Freshness

Production ANN indexes must balance query performance against freshness—new items and updated embeddings must become discoverable within acceptable latency bounds. Several strategies address this operational challenge:

Incremental Updates: Some index structures (notably HNSW) support online insertion of new vectors without full rebuilds. New items are inserted into the graph by finding their approximate nearest neighbors and establishing bidirectional edges. However, incremental insertion degrades index quality over time as the graph structure drifts from optimal.

Tiered Indexing: A common pattern separates the corpus into tiers by age:

Tier	Content Age	Index Type	Update Cadence
Hot	< 1 hour	Brute-force or streaming index	Real-time
Warm	1–24 hours	Incrementally updated HNSW	Minutes
Cold	> 24 hours	Batch-rebuilt IVF-PQ	Daily

Query routing merges results across tiers, applying freshness boosts to hot-tier candidates. This architecture ensures new content is immediately retrievable while maintaining efficient search over the long tail.

Staleness Budgets: Teams define acceptable staleness windows per use case. A “For You” feed might tolerate 15-minute embedding lag, while a search typeahead requires sub-second freshness. Monitoring dashboards track:

Embedding lag: Time between item creation and index availability
Index coverage: Fraction of eligible items present in the index
Recall degradation: Periodic evaluation against ground-truth rankings

Rebuild and Re-sharding: Full index rebuilds occur on a scheduled cadence (daily or weekly) to restore optimal graph structure. As the corpus grows, re-sharding distributes the index across more machines. Strategies include:

Blue-green deployment: Build new index offline; swap atomically when ready
Rolling updates: Replace shards incrementally to avoid query disruption
Incremental PQ codebook updates: Retrain product quantization codebooks on recent embedding distributions

Real-Time Embedding Updates: When user or item embeddings change (e.g., after new interactions), the system must propagate updates:

Lazy refresh: Re-embed on next query; cache result
Background refresh: Async job re-computes embeddings and patches the index
Streaming updates: Kafka-based pipeline pushes embedding deltas to index shards

The choice depends on update frequency and latency requirements. User embeddings (changing with every interaction) often use lazy refresh, while item embeddings (more stable) use batch updates.

Graph-Based Retrieval

Social media platforms naturally form graphs: users follow users, users engage with items, items share topics. Graph-based methods leverage this structure.

Graph Neural Networks (GNNs)

Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a graph with $N$ nodes and node features $\mathbf{X} \in \mathbb{R}^{N \times d}$. GNNs learn node representations via message passing:

$$ \mathbf{h}\_v^{(l+1)} = \text{UPDATE}\left( \mathbf{h}\_v^{(l)}, \text{AGGREGATE}\left( \{ \mathbf{h}\_u^{(l)} : u \in \mathcal{N}(v) \} \right) \right) $$

Graph Convolutional Networks (GCN) use spectral convolution:

$$ \mathbf{H}^{(l+1)} = \sigma\left( \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2} \mathbf{H}^{(l)} \mathbf{W}^{(l)} \right) $$

where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ is the adjacency matrix with self-loops, $\tilde{\mathbf{D}}$ is the degree matrix, and $\mathbf{W}^{(l)}$ is a learnable weight matrix.

GraphSAGE samples and aggregates neighborhoods:

$$ \mathbf{h}\_v^{(l+1)} = \sigma\left( \mathbf{W}^{(l)} \cdot \text{CONCAT}\left( \mathbf{h}\_v^{(l)}, \text{AGG}(\{ \mathbf{h}\_u^{(l)} : u \in \mathcal{N}\_{\text{sample}}(v) \}) \right) \right) $$

Aggregators include mean, LSTM, and pooling. Sampling bounds computation cost.

Bipartite Graph Models

User-item interactions form bipartite graphs. LightGCN simplifies GCN for collaborative filtering:

$$ \mathbf{e}\_u^{(l+1)} = \sum\_{i \in \mathcal{N}\_u} \frac{1}{\sqrt{d\_u} \sqrt{d\_i}} \mathbf{e}\_i^{(l)} $$

$$ \mathbf{e}\_i^{(l+1)} = \sum\_{u \in \mathcal{N}\_i} \frac{1}{\sqrt{d\_i} \sqrt{d\_u}} \mathbf{e}\_u^{(l)} $$

where $d\_u = \lvert\mathcal{N}\_u\rvert$ and $d\_i = \lvert\mathcal{N}\_i\rvert$ are the degrees of user $u$ and item $i$.

Final embeddings combine all layers: $\mathbf{e}\_u = \sum\_{l=0}^{L} \alpha\_l \mathbf{e}\_u^{(l)}$.

Random Walk Methods

DeepWalk and Node2Vec learn embeddings via skip-gram on random walks:

$$ \max \sum\_{v \in \mathcal{V}} \sum\_{c \in \mathcal{N}\_R(v)} \log P(c | \phi(v)) $$

where $\mathcal{N}\_R(v)$ is the context window from random walks starting at $v$, and $P(c | \phi(v)) = \frac{\exp(\phi(c)^\top \phi(v))}{\sum\_{u} \exp(\phi(u)^\top \phi(v))}$.

Node2Vec biases walks with parameters $p$ (return) and $q$ (in-out) to balance BFS (local) and DFS (global) exploration:

$$ P(c\_i = x | c\_{i-1} = v) = \begin{cases} \frac{1}{p} & \text{if } d\_{tx} = 0 \text{ (return to } t \text{)} \\ 1 & \text{if } d\_{tx} = 1 \\ \frac{1}{q} & \text{if } d\_{tx} = 2 \end{cases} $$

where $t$ is the previous node and $d\_{tx}$ is the shortest path distance.

Personalized PageRank (PPR)

PPR computes stationary distribution of random walks with restarts:

$$ \mathbf{p}\_u = \alpha \mathbf{e}\_u + (1 - \alpha) \mathbf{A}^\top \mathbf{D}^{-1} \mathbf{p}\_u $$

where $\alpha$ is the restart probability and $\mathbf{e}\_u$ is the indicator vector for user $u$. Items with high $\mathbf{p}\_u[i]$ are candidates. Efficient approximations (push-based, Monte Carlo) enable real-time computation.

Content Understanding Pipeline

Before items can be retrieved, ranked, or embedded, their raw content must be processed into machine-readable features. Social media platforms ingest diverse content—text posts, images, videos, links—each requiring specialized processing pipelines.

Text Processing

Text content flows through multiple stages:

Extraction and cleaning:
- Strip HTML/markdown formatting
- Normalize Unicode (handle emojis, accented characters, RTL scripts)
- Remove or flag excessive punctuation, all-caps text (spam signals)
Language detection:
- Classify language (fastText, CLD3) for routing to language-specific pipelines
- Handle code-switching (multilingual posts)
Tokenization and normalization:
- Subword tokenization (BPE, WordPiece, SentencePiece) for transformer models
- Hashtag splitting (#MachineLearning → “machine”, “learning”)
- URL extraction and canonicalization
Entity extraction:
- Named Entity Recognition (NER) for people, places, organizations
- Mention/tag parsing (@username)
- Topic extraction via keyword models or zero-shot classifiers
Embedding generation:
- Pre-trained encoders: BERT, RoBERTa, sentence-transformers
- Domain-specific fine-tuning on social media text
- Multilingual models (XLM-R, mBERT) for cross-lingual retrieval

flowchart LR
    Raw[Raw Text] --> Clean[Clean & Normalize]
    Clean --> LangDetect[Language Detection]
    LangDetect --> Tokenize[Tokenization]
    Tokenize --> NER[Entity Extraction]
    Tokenize --> Encoder[Text Encoder]
    Encoder --> TextEmbed[Text Embedding]
    NER --> Features[Feature Store]
    TextEmbed --> Features

Image Processing

Images dominate social media consumption—Instagram, Pinterest, and TikTok feeds are fundamentally visual experiences, while even text-centric platforms like Twitter/X see 2-3× higher engagement on posts with images. For recommendation systems, images provide rich semantic signals beyond what text captions convey: the visual composition, objects, scenes, people, emotions, and aesthetic quality all influence user preferences. A user who engages with sunset beach photos has different interests than one who engages with architectural photography, even if both caption their posts with generic hashtags like “#beautiful”.

Processing images for recommendation systems serves multiple purposes: content understanding (what’s in the image?), similarity matching (find visually similar content), safety filtering (detect policy violations), and quality scoring (surface aesthetically pleasing content). Modern systems extract dense visual embeddings that capture semantic meaning, enabling both exact matching (duplicate detection) and approximate matching (similar content retrieval). These embeddings feed into the same retrieval and ranking pipelines as text features, allowing multimodal fusion where visual, textual, and behavioral signals jointly determine relevance.

Images require visual understanding for content-based retrieval and safety:

Preprocessing:
- Resize, crop, normalize pixel values
- Augmentation during training (rotation, color jitter, crops)
- Aspect ratio handling
Feature extraction:
- Pre-trained CNNs: ResNet, EfficientNet
- Vision transformers: ViT, Swin
- CLIP for joint vision-language embeddings
Scene understanding:
- Object detection (YOLO, Faster R-CNN) for salient objects
- Scene classification (indoor/outdoor, activity type)
- Aesthetic quality scoring
OCR and text-in-image:
- Extract text overlays (memes, infographics)
- Feed to text pipeline for additional signals
Safety classifiers:
- NSFW detection
- Violence/gore detection
- Hate symbol recognition

Video Processing

Video has become the dominant content format on social media, accounting for over 80% of internet traffic by 2024. Platforms like TikTok, YouTube Shorts, Instagram Reels, and Snapchat Spotlight are entirely video-first, while traditional feeds increasingly prioritize video over static content. The shift is driven by engagement economics: users spend 5-10× longer watching videos than viewing images, and video posts generate substantially higher interaction rates.

For recommendation systems, video presents unique challenges and opportunities. Unlike static images, video encodes temporal dynamics—actions, transitions, narrative arcs—that are critical to understanding content. A cooking video’s appeal depends on the progression from ingredients to finished dish; a comedy sketch relies on timing and punchlines. Audio is equally essential: music tracks drive viral trends (TikTok’s sound-based discovery), speech content conveys meaning beyond visuals, and production quality (background noise, audio clarity) affects perceived professionalism.

The computational challenge is formidable: a 60-second video at 30fps contains 1,800 frames, each requiring image-level processing, plus audio tracks, plus temporal modeling. Processing costs are 10-100× higher than images, creating latency and infrastructure bottlenecks. Production systems must balance comprehensiveness (capture all relevant signals) against efficiency (process billions of videos within reasonable cost). This leads to tiered processing strategies: fast, cheap models run on all content for safety and basic categorization, while expensive deep models run selectively or offline for high-value content.

Video adds temporal and audio dimensions:

Frame sampling:
- Uniform sampling: Extract frames at fixed intervals (e.g., 1 fps)
- Shot boundary detection: Sample at scene transitions
- Keyframe extraction: Select representative frames per shot
Visual encoding:
- Per-frame embeddings (CNN or ViT)
- Temporal aggregation: average pooling, LSTM, transformer over frame sequence
- Action recognition models (I3D, SlowFast) for activity classification
Audio processing:
- Audio extraction and resampling
- Speech-to-text (Whisper, wav2vec) for transcription
- Music/sound classification
- Audio embeddings (VGGish, CLAP) for audio-visual fusion
Multimodal fusion:
- Concatenate visual, audio, text embeddings
- Cross-modal attention (video frames attend to audio, text)
- Late fusion: train modality-specific models, combine scores

flowchart TB
    Video[Video Upload] --> FrameSample[Frame Sampling]
    Video --> AudioExtract[Audio Extraction]

    FrameSample --> VisualEnc[Visual Encoder]
    FrameSample --> OCR[OCR / Text-in-Video]

    AudioExtract --> STT[Speech-to-Text]
    AudioExtract --> AudioEnc[Audio Encoder]

    VisualEnc --> Temporal[Temporal Aggregation]
    Temporal --> VisualEmbed[Visual Embedding]

    AudioEnc --> AudioEmbed[Audio Embedding]
    STT --> TextPipe[Text Pipeline]
    OCR --> TextPipe
    TextPipe --> TextEmbed[Text Embedding]

    VisualEmbed --> Fusion[Multimodal Fusion]
    AudioEmbed --> Fusion
    TextEmbed --> Fusion
    Fusion --> FinalEmbed[Content Embedding]

Scalability and Cost Trade-offs

Processing billions of items requires careful resource allocation:

Content Type	Processing Cost	Latency	Strategy
Text	Low	<10ms	Synchronous on upload
Image	Medium	50-200ms	Async pipeline; cache embeddings
Short video (<1 min)	High	1-5s	Async pipeline; batch processing
Long video (>5 min)	Very high	10-60s	Offline batch; sample frames aggressively

Optimization techniques:

Lazy processing: Compute embeddings only for items that pass initial filters (spam, policy violations)
Tiered processing: Fast safety classifiers run immediately; expensive embeddings computed async
Model distillation: Deploy small, fast models for serving; large models for offline batch processing
Caching: Embeddings rarely change; cache in K-V store (Redis, Memcached) keyed by content hash
Incremental updates: When item metadata changes (edit, new comments), avoid reprocessing entire video

Feature Engineering

Feature engineering is the process of transforming raw data—user actions, content attributes, system logs—into numerical representations that machine learning models can consume. In social media recommendation systems, this transformation is critical: the same model architecture can produce radically different results depending on which features it sees. A ranking model that only observes “user clicked this post” performs far worse than one that knows “user watched 80% of a 2-minute cooking video at 2pm on mobile, having previously engaged with 15 other cooking videos this week, and the creator has 50k followers with 5% engagement rate.”

Features are the bridge between the messy reality of user behavior and the mathematical abstractions of prediction models. They must capture subtle signals—a user’s preference shift from travel content to home renovation after a life event, a creator’s quality improving over time, the contextual relevance of sports content during a championship game—while remaining computationally efficient to compute and serve at scale.

Social media recommendations differ from other ML domains in several ways that make feature engineering especially critical:

Sparse behavioral signals: Most users interact with <0.01% of available content. A typical Facebook user might engage with 10-20 posts per day out of billions created. The model must infer preferences from these sparse signals, relying heavily on features that generalize: if a user likes cooking videos from creator A, which features predict they’ll like creator B? Shared topics, similar production style, overlapping audience demographics, or common hashtags all become crucial features.

Heterogeneous content types: Unlike Netflix (all videos) or Spotify (all music), social media mixes text posts, images, short videos, long videos, links, polls, and stories. Feature engineering must create comparable representations across modalities. A user who engages with fitness images should see fitness videos; the model needs features that capture “fitness interest” independent of format.

Temporal dynamics: User interests shift rapidly. Someone searching for “birthday cake recipes” today has different needs than their historical profile suggests. Features must balance long-term preferences (stored in user embeddings) with short-term intent (recent search queries, current session behavior). Real-time features—what the user did in the last 5 minutes—often outweigh historical features for predicting the next click.

Social context: Content spreads through social graphs. A post’s value depends not just on its content, but on who shared it, who else engaged with it, and whether it’s trending in your network. Features must encode network effects: “3 of your close friends liked this post” is a powerful signal. Graph-based features (mutual friends with creator, overlap in followed accounts) capture social proof.

Creator incentives: Social platforms balance user experience with creator economics. Features must encode not just “will the user like this?” but “should we show this content given our broader objectives?” Features like creator earnings, post monetization status, and historical policy violations inform decisions about surfacing content.

The Feature Engineering Pipeline

Raw data flows through several stages before becoming model inputs:

Collection: Log user actions (clicks, likes, watch time), content attributes (text, images, metadata), and context (device, time, location)
Transformation: Convert raw values into model-friendly representations (embeddings, normalizations, discretizations)
Aggregation: Combine atomic events into statistical summaries (user’s average watch time, creator’s engagement rate over 7 days)
Storage: Materialize features in low-latency stores for online serving and data lakes for offline training
Serving: Retrieve features in <10ms during inference while ensuring consistency with training data

The challenge is maintaining this pipeline at social media scale: billions of users, trillions of feature values, and real-time updates as user behavior evolves. Feature stores have emerged as critical infrastructure, providing versioned, point-in-time correct features that prevent train-serve skew.

Features are the raw material of recommendation models. They encode signals about users, items, and the context of each request.

User Features

Feature Category	Examples	Representation
Demographics	Age, location, language	Categorical embeddings
Engagement history	Liked posts, watch time, comments	Sequence embeddings (Transformer, LSTM)
Social graph	Follower count, friend overlap	Graph embeddings (GraphSAGE, GCN)
Session context	Time of day, device, referrer	Dense features
Long-term preferences	Topic affinities, creator subscriptions	Aggregated embeddings

Item Features

Feature Category	Examples	Representation
Content	Text, images, video frames	Pre-trained encoders (BERT, CLIP, ViT)
Metadata	Author, timestamp, hashtags	Categorical embeddings
Engagement statistics	Likes, shares, CTR	Normalized counts, log-transforms
Quality signals	Spam score, policy violations	Binary or continuous
Freshness	Age since publication	Decay functions

Context Features

Feature Category	Examples	Representation
Request context	Device type, network speed	Categorical
Temporal	Hour, day of week, holidays	Cyclical encoding (sin/cos)
Session state	Items already shown, scroll depth	Set or sequence embeddings
Real-time signals	Trending topics, breaking news	Dynamic embeddings

Feature Stores

Production systems materialize features in feature stores that serve both offline training and online inference with consistent semantics. Key requirements:

Point-in-time correctness: Training must use features as they existed at interaction time to avoid data leakage.
Low-latency serving: Online inference requires feature retrieval in single-digit milliseconds.
Backfill and versioning: Schema changes must propagate to historical data without corrupting training sets.

Platforms like Feast, Tecton, and internal systems at Meta/Google/TikTok address these requirements.

flowchart LR
    Events[(Event Stream)] --> Compute[Feature Compute]
    Compute --> Store[(Feature Store)]
    Store --> Training[Offline Training]
    Store --> Serving[Online Serving]

    subgraph Feature Store
        Offline[(Offline Store / Data Lake)]
        Online[(Online Store / KV Cache)]
    end

Training-Serving Skew Detection

A critical operational concern is training-serving skew—discrepancies between features computed offline (for training) and online (for inference). Skew causes silent model degradation: the model sees different feature distributions at serving time than it learned from.

Sources of Skew:

Skew Type	Cause	Example
Schema drift	Feature definition changes between training and serving	Normalization logic updated in serving but not backfilled
Temporal leakage	Training uses future information unavailable at serving	Aggregating engagement counts that include the label event
Null/default handling	Different imputation strategies	Training drops nulls; serving uses zeros
Computation divergence	Separate codepaths for batch vs. streaming	Floating-point precision differences

Validation Mechanisms:

Shadow requests: Route a fraction of live traffic to a shadow pipeline that logs both online-computed and offline-reconstructed features. Compare distributions and flag divergence exceeding thresholds.
Canary feature checks: Before model deployment, run inference on held-out examples using both training-time and serving-time feature values. Alert if prediction distributions differ significantly.
Schema versioning: Tag each feature with a version; models declare required versions. Serving rejects requests if versions mismatch.
Point-in-time audits: Periodically reconstruct historical features and compare against logged serving values. Statistical tests (KS test, PSI) detect drift.

Monitoring and Alerts:

Metric	Definition	Alert Threshold
Feature null rate	Fraction of requests with missing features	> 1% for critical features
Distribution shift (PSI)	Population Stability Index vs. training baseline	> 0.1
Latency percentiles	Feature fetch P50/P99	> SLA
Coverage	Fraction of entities with fresh features	< 99%

Dashboards surface feature health alongside model metrics, enabling rapid diagnosis when prediction quality degrades. Automated alerts trigger on null spikes or distribution shifts before downstream models are impacted.

Questions or feedback?