Caching Strategies

Caching Strategies


Questions or feedback?

I'd love to hear your thoughts on this article. Feel free to reach out:

Caching is a classical systems technique that reduces access latency by retaining temporary replicas of data closer to the point of consumption. A typical deployment comprises a backing store (for example, a relational database) that maintains the authoritative copy and one or more cache layers that materialize frequently accessed subsets. Although caching shares certain characteristics with colocation and replication, it introduces its own spectrum of latency, consistency, and operational trade-offs. This note surveys the conceptual foundations of caching and examines the strategies that practitioners employ in production systems.

When to Cache

Caching is most attractive when the application workload satisfies one or more of the following conditions:

  • Limited transactional needs. Workloads that can be expressed through key–value lookups rarely require the full expressiveness of SQL or distributed transactions. Examples include user-profile retrieval or session state management.
  • Immutability of the incumbent system. Legacy systems, vendor-managed databases, or services without replication hooks often cannot be modified easily. Introducing an external cache layer may therefore be the only viable latency optimization.
  • Resource constraints. Replicating an entire dataset to every point of presence is often infeasible because of storage or compute limits. Caches allow operators to stage only the “hot” subset in scarce memory close to the client.

In short, caching is a pragmatic response when redesigning the system is impossible or cost prohibitive, yet latency must be reduced.

Architectural Overview

Caching reuses previously computed results: the client first consults the cache; on a miss, it retrieves the data from the backing store, then optionally inserts it into the cache for future access. Two fundamental dimensions characterize caches:

  • Persistence. In-memory caches (e.g., process-local structures or Redis without snapshots) are non-persistent and must be rebuilt after a restart. Persistent caches periodically flush to disk, eliminating cold-start penalties but introducing additional complexity.
  • Transactional semantics. Transactional caches participate in isolation protocols; non-transactional caches trade isolation for simplicity, risking exposure to stale data.

The vocabulary of cache operation mirrors that of classical memory hierarchies: a successful lookup is a cache hit, whereas an unsuccessful lookup is a cache miss. Systems often track negative cache entries—explicit records indicating that a key does not exist in the backing store—to avoid repeatedly querying for known-absent results. DNS resolvers, for instance, cache NXDOMAIN responses to prevent amplification attacks and reduce resolver load. Negative caching requires careful TTL selection: entries must expire quickly enough to reflect legitimate additions to the backing store, yet persist long enough to absorb bursts of repeated lookups for non-existent keys. A common heuristic sets negative TTLs to one-tenth or one-fifth of positive TTLs. Implementations must also decide whether negative entries consume the same eviction budget as positive entries; some systems maintain a separate negative cache with its own size limits to prevent cache pollution from enumeration attacks or typo-driven misses.

Caching Strategies

The behavior of a caching layer depends primarily on its read/write strategy. The following subsections summarize the most prevalent approaches and highlight their operational consequences.

Cache-aside (Lazy Loading)

In the cache-aside pattern the cache remains passive. The application interrogates the cache first, and on a miss it is responsible for fetching the data from the backing store and populating the cache. Figure 1 illustrates the control flow.

sequenceDiagram
    participant App as Application
    participant Cache
    participant DB as Backing Store

    App->>Cache: GET key
    alt hit
        Cache-->>App: value
    else miss
        Cache-->>App: miss
        App->>DB: fetch key
        DB-->>App: value
        App->>Cache: SET key,value
        Cache-->>App: value
    end

Cache-aside requires minimal coordination between cache and database, but it affords no transactional guarantees. Concurrent misses can also stampede the database, inflating tail latency. Mitigation typically involves request coalescing, per-key locks, or “single flight” primitives.

The following Go snippet demonstrates a cache-aside implementation with single-flight protection to prevent concurrent misses from overwhelming the backing store:

import (
    "context"
    "time"
    "golang.org/x/sync/singleflight"
)

type CacheAside[K comparable, V any] struct {
    cache  Cache[K, V]
    db     Database[K, V]
    group  singleflight.Group
    ttl    time.Duration
}

func (c *CacheAside[K, V]) Get(ctx context.Context, key K) (V, error) {
    // Attempt cache lookup
    if val, ok := c.cache.Get(key); ok {
        return val, nil
    }

    // Single-flight: coalesce concurrent requests for the same key
    result, err, _ := c.group.Do(string(fmt.Sprint(key)), func() (any, error) {
        // Double-check cache after acquiring the flight
        if val, ok := c.cache.Get(key); ok {
            return val, nil
        }

        // Fetch from backing store
        val, err := c.db.Fetch(ctx, key)
        if err != nil {
            return *new(V), err
        }

        // Populate cache
        c.cache.Set(key, val, c.ttl)
        return val, nil
    })

    if err != nil {
        return *new(V), err
    }
    return result.(V), nil
}

This pattern ensures that even under high concurrency, only one goroutine fetches a missing key from the database; all other waiters receive the same result.

Read-through Caching

Read-through caches actively participate in miss handling. If the cache cannot locate a key, it queries the backing store on behalf of the application, inserts the result, and returns the value. Consequently, from the client’s perspective a cache miss is invisible.

sequenceDiagram
    participant App as Application
    participant Cache
    participant DB as Backing Store

    App->>Cache: GET key
    alt hit
        Cache-->>App: value
    else miss
        Cache->>DB: fetch key
        DB-->>Cache: value
        Cache->>Cache: store key,value
        Cache-->>App: value
    end

Read-through caches reduce application complexity but require deeper integration because the cache must understand how to query and serialize backing data. They can also perform refresh-ahead updates to hide miss latency by proactively reloading soon-to-expire keys.

Write-through Caching

Write-through caching exposes write operations through the cache. Each update is synchronously propagated to the backing store before the cache acknowledges success.

sequenceDiagram
    participant App as Application
    participant Cache
    participant DB as Backing Store

    App->>Cache: SET key,value
    Cache->>Cache: update entry
    Cache->>DB: write key,value
    DB-->>Cache: ack
    Cache-->>App: ack

The benefit is mechanical sympathy between cache and database—data cannot diverge. The drawback is that write latency is dominated by the backing store’s commit cost, which may erode the very performance gains caching sought to deliver.

Write-behind (Write-back) Caching

Write-behind caches acknowledge updates immediately and flush them to the backing store asynchronously.

sequenceDiagram
    participant App as Application
    participant Cache
    participant DB as Backing Store

    App->>Cache: SET key,value
    Cache->>Cache: update entry
    Cache-->>App: ack (immediate)
    Cache-->>DB: async flush/batch

This strategy minimizes write latency and allows batching, but it sacrifices transactional semantics and introduces durability risk: failures prior to flushing can result in data loss. Production deployments mitigate this risk through several mechanisms:

  • Write-ahead logging (WAL). The cache appends each write to a durable log before acknowledging; on recovery, uncommitted log entries are replayed to the backing store. Redis AOF persistence operates on this principle.
  • Replication before acknowledgment. Writes are synchronously replicated to one or more peer cache nodes before the client receives confirmation, ensuring that at least one replica survives single-node failures.
  • Batched flush with checkpointing. The cache accumulates writes in memory and periodically flushes batches to the backing store, recording a checkpoint offset. Recovery replays from the last checkpoint. This approach trades some durability (the batch window) for throughput.
  • Hybrid policies. Some systems (e.g., storage controllers) classify writes by criticality: metadata uses write-through, while bulk data uses write-behind with replication.

Operators must quantify the acceptable data-loss window and provision replication or WAL accordingly; write-behind without such safeguards is appropriate only for ephemeral or reconstructible data.

Client-side Caching

Client-side caching embeds the cache within the consumer (browser, mobile device, or edge process). Read-through plus write-behind is common, yielding excellent latency and offline resilience. However, local memory overhead increases and global consistency degrades because clients rarely interact directly with the database.

Distributed Caching

Distributed caches deploy multiple cache instances to reduce geographic latency or to scale throughput. Designers must decide how to partition the keyspace (e.g., consistent hashing), whether to replicate partitions for availability, and how to coordinate invalidations. Distributed caches inherit the complexity of partitioned and replicated systems, including the risk of split brain and clock skew.

Strategy Comparison

The following table summarizes the trade-offs among the principal caching strategies:

Strategy Read Latency Write Latency Consistency Complexity Typical Use Cases
Cache-aside Miss: high N/A (app writes to DB) Eventual; stale reads possible Low General-purpose; microservices
Read-through Miss: hidden N/A Eventual Medium ORM integrations; CDN edges
Write-through Hit: low High (sync to DB) Strong Medium Financial systems; metadata stores
Write-behind Hit: low Low (async) Eventual; data loss risk High High-throughput ingestion; logging
Client-side Very low Varies Weak; local divergence Medium Mobile apps; browser caches
Distributed Low (network hop) Varies Configurable High Global services; CDN; session stores

Strategy selection depends on the relative importance of read latency, write latency, consistency guarantees, and operational complexity. Many production systems combine strategies—for example, write-through for critical paths and write-behind for audit logs—to balance these concerns.

Cache Coherency

Whenever the same datum is cached at multiple sites, one must reason about coherency—the property that all caches eventually reflect updates. Hardware designers long ago addressed this challenge with protocols such as MESI (Modified, Exclusive, Shared, Invalid). In MESI, each cache line resides in exactly one of four states:

State Meaning
Modified (M) Line is dirty; this cache holds the only valid copy. Memory is stale.
Exclusive (E) Line is clean; this cache holds the only copy, but it matches memory.
Shared (S) Line is clean and may exist in other caches; all copies match memory.
Invalid (I) Line is not present or has been invalidated; any access is a cache miss.

State transitions occur in response to local processor actions (reads, writes) and bus transactions (snooped reads or invalidations from other cores). The following diagram illustrates a typical sequence: Core 1 loads a line exclusively, Core 2 subsequently reads the same line (both transition to Shared), and finally Core 2 writes the line (Core 1’s copy is invalidated, Core 2’s copy becomes Modified).

sequenceDiagram
    participant Core1
    participant Core2
    participant Bus
    participant Memory

    Core1->>Memory: Read X (miss)
    Memory-->>Core1: X
    Note over Core1: State = Exclusive
    Core2->>Bus: Read X (miss, snoop)
    Bus->>Core1: Snoop read
    Core1-->>Bus: Supply X
    Note over Core1: State = Shared
    Bus-->>Core2: X
    Note over Core2: State = Shared
    Core2->>Bus: Write X
    Bus->>Core1: Invalidate X
    Note over Core1: State = Invalid
    Note over Core2: State = Modified

The MESI protocol guarantees that a write is visible to all cores before any subsequent read can observe stale data—a property called write propagation. Extensions such as MOESI (adding an Owned state) and MESIF (adding a Forward state) optimize specific bus topologies. Understanding hardware coherency is instructive because distributed software caches face analogous trade-offs, albeit with network latencies orders of magnitude larger than cache-line transfers.

Distributed software caches often accept weaker guarantees to avoid the overhead of strict coherency. Typical mechanisms include broadcast invalidations (via pub/sub channels), lease-based ownership, or compare-and-set versioning. Figure 2 sketches a pub/sub-based invalidation pipeline.

flowchart LR
    DB[(Backing Store)] -->|Update key| PubSub[Invalidation Channel]
    PubSub --> CacheA
    PubSub --> CacheB
    CacheA --> ClientsA
    CacheB --> ClientsB

Coherency ensures convergence but says nothing about the ordering of operations; it must therefore be considered alongside the system’s consistency model (e.g., linearizability versus eventual consistency).

Cache Hit Ratio and Working Sets

The cache hit ratio quantifies cache effectiveness:

$$ \text{HitRatio} = \frac{\text{cache hits}}{\text{cache hits} + \text{cache misses}}. $$

Higher hit ratios correspond to lower average latency, but tail latency remains dictated by miss performance because misses incur backing-store costs. As a design pattern, maximize the cache hit ratio by acting on three fronts:

  1. Capacity planning around the working set. If a service routinely touches 500 GB of product metadata (its working set) but the cache only accommodates 50 GB, the cache will perpetually evict useful entries. Either increase cache capacity, compress entries, or reorganize the workload so that only the hottest 50 GB is needed at any instant.

    Estimating the working set requires empirical measurement. Common techniques include:

    • Stack-distance profiling. Instrument the application to record the reuse distance (number of distinct accesses between consecutive references to the same key) for each request. The cumulative distribution of reuse distances directly yields the expected hit ratio for any cache size: a cache of size $C$ captures all reuses with distance $\leq C$.
    • Sampling-based estimation. Rather than tracking every access, sample a fraction (e.g., 1%) and extrapolate. Reservoir sampling or count-min sketches approximate frequency distributions with bounded memory overhead.
    • Miss-rate curves. Many caches (e.g., Caffeine, Redis with MEMORY DOCTOR) expose metrics that allow operators to plot hit ratio against allocated memory. Increasing capacity yields diminishing returns once the curve flattens—this inflection point approximates the working-set boundary.
    • Simulation. Replay production traces against cache simulators configured with varying capacities and policies. Tools such as libCacheSim or Mimircache automate this analysis.

    Once the working set is characterized, operators must account for per-entry overhead (hash-table buckets, linked-list pointers, TTL metadata). A 64-byte payload may consume 120 bytes in Redis due to SDS strings and dict-entry structures. Ignoring overhead leads to under-provisioning and premature eviction.

  2. Policy alignment. Choose eviction policies that mirror the workload’s locality profile. An analytical reporting job that scans the entire dataset should not rely on LRU; a recommendation service with recurring hot keys might benefit from LFU or TinyLFU hybrids. Misaligned policies can do more harm than good.

  3. Access-pattern shaping. Order operations to maximize locality. For example, a machine-learning feature store may process user cohorts in batches; ensuring that all stages operate on the same cohort before moving on keeps that cohort’s data hot in cache. Similarly, ETL pipelines can chunk large tables to avoid blowing out limited cache space.

These tactics should be complemented with miss-path optimizations—indexes, replicas, precomputed materialized views—so that inevitable misses do not dominate tail latency. Collectively they determine whether caching delivers on its promise of low-latency service.

Cache-hit Algorithms

The literature often decomposes “cache-hit algorithms” into four cooperative facets: lookup structures, replacement policies, admission control, and prefetching. Analytical models then quantify the resulting hit rate.

Lookup Structures

  • Direct-mapped caches (CPU L1/L2) map each address to a single cache line via modulo arithmetic. Hit detection is a single tag comparison—ideal for one-cycle access but prone to conflict misses.
  • Set-associative / fully associative caches allow a block to reside in multiple ways; the hardware probes all ways in parallel. Associativity reduces conflicts at the cost of extra comparators.
  • Software caches (Redis, CDN edges, database query caches) typically rely on hash tables augmented with Bloom filters (to avoid needless backend reads), consistent hashing (to balance shards), or tries/prefix trees (URL/path caches). Lookup remains amortized O(1), but implementation must avoid pathological collisions and false negatives.

Replacement Families

Replacement logic maximizes future hits by deciding which entry to evict. Classical categories include:

  • Stack algorithms (LRU, MRU, 2Q, LIRS) that maintain a total ordering.
  • Clock algorithms (CLOCK, CLOCK-Pro) that approximate LRU with low overhead.
  • Adaptive and RL-inspired algorithms (ARC, CAR, LeCaR, TinyLFU, S3-FIFO, SIEVE) that learn workload characteristics dynamically.

Belady’s optimal algorithm (OPT) provides an upper bound on achievable hit rate but requires clairvoyance—it evicts the entry whose next access is furthest in the future. Online algorithms are judged by their competitive ratio relative to OPT: an algorithm $A$ is $c$-competitive if, for any request sequence $\sigma$, the number of misses incurred by $A$ is at most $c$ times the misses incurred by OPT (plus an additive constant). Sleator and Tarjan (1985) proved that LRU is $k$-competitive for a cache of size $k$, meaning LRU never incurs more than $k$ times the misses of OPT on any workload. This bound is tight in the worst case but pessimistic for workloads exhibiting locality; in practice, LRU often approaches OPT on traces with high temporal locality. No deterministic online algorithm can achieve a competitive ratio better than $k$, establishing a fundamental limit on cache performance without knowledge of future accesses.

Admission Control

Not every item should enter the cache. Admission policies prevent cache pollution:

  • Bloom-filter-based admission rejects once-seen items.
  • TinyLFU admission (used by Caffeine’s W-TinyLFU) consults a frequency sketch before admitting an object into the protected region.
  • Probabilistic admission accepts only every (k)-th object, damping streaming workloads.

Admission is often more important than eviction for workloads with high churn.

Prefetching

Prefetching increases hit rate by populating data before it is requested:

  • Sequential prefetchers detect stride patterns (e.g., next block at address (+\Delta)).
  • Correlation prefetchers learn transitions between addresses via history tables.
  • Semantic/ML prefetchers leverage query plans, user behavior, or machine-learning predictions (Netflix internal systems, large-scale filesystems) to anticipate access.

Aggressive prefetching can improve hit rate but risks fetching useless data, thereby increasing miss penalties.

Analytical Models

Academic analysis models hit rate via reuse-distance distributions, stack-distance profiles, or assumed access distributions (Zipfian). Concepts include:

  • Belady’s OPT as the theoretical maximum hit rate.
  • Competitive analysis for online policies.
  • Stack distance / reuse distance as predictors of hit probability.

These models inform policy choice and admission parameters.

Multi-level and Distributed Behavior

In multi-level CPU caches, L1 hit rates exceed 95%, with L2/L3 absorbing remaining misses. Inclusive versus exclusive hierarchies influence how hits propagate across levels. In distributed caches and CDNs, hit semantics expand to local hits (same node), remote hits (peer node via DHT), and origin misses (backend). Algorithms like consistent hashing, hierarchical caching, and sharding-aware LRU attempt to maximize locality at each tier.

Cache Replacement Policies

Because caches are finite, they require eviction policies that decide which entry to remove when capacity is exhausted.

Least Recently Used (LRU)

LRU evicts the entry that has not been referenced for the longest time and therefore suits workloads with pronounced temporal locality—recently accessed items are likely to be requested again soon. Operating-system page caches, Memcached clusters, Redis (default policy), and CDN edge nodes routinely employ LRU or approximations such as CLOCK or segmented LRU. In those environments user navigation patterns exhibit locality (adjacent pages, repeated API calls), which keeps the LRU working set stable. Because LRU maintains a total order of accesses, it also offers predictable behavior for capacity planning: doubling cache size typically doubles the time horizon of protected entries.

stateDiagram-v2
    [*] --> Cache
    Cache --> Cache: get(A)
    Cache --> Cache: get(C)
    Cache --> Cache: get(D)
    note right of Cache: Evict least recently used key

Least Frequently Used (LFU)

LFU removes the entry with the lowest access frequency, favoring globally popular items even if their requests are bursty. This policy excels when popularity distributions are heavy-tailed, as in streaming-media catalogs, app-store listings, or social-network feeds. Modern CDN platforms (Cloudflare, Akamai) use TinyLFU hybrids to ensure that “top talkers” remain cached even when traffic patterns oscillate. Redis 4.0 introduced LFU eviction for the same reason: highly popular keys remain protected even if they fall briefly idle.

stateDiagram-v2
    [*] --> Counters
    Counters --> Counters: get(A) (hits++)
    Counters --> Counters: get(C) (hits++)
    Counters --> Counters: get(D)
    note right of Counters: Evict item with smallest hit count

FIFO and SIEVE

First-in, first-out (FIFO) evicts the oldest element regardless of access history. Although naïve FIFO underperforms on locality-heavy workloads, it remains attractive in hardware contexts (NICs, SSD controllers, embedded caches) where predictable O(1) bookkeeping is essential. Recent work (e.g., SIEVE, Lazy Promotion) augments FIFO with visited bits and a single rotating pointer that scans for unvisited entries. These variants retain the computational simplicity of FIFO while approximating LRU behavior in practice. Large-scale storage systems and CDN prototypes have begun experimenting with such schemes to reduce metadata overhead per cached object.

stateDiagram-v2
    [*] --> Queue
    Queue --> Queue: get(A) -> mark visited
    Queue --> Queue: get(C) -> mark visited
    Queue --> Queue: get(D)
    note right of Queue
        hand scans queue
        visited -> clear bit
        first unvisited -> evict
    end note

These policies eschew per-access promotion, reducing bookkeeping while retaining elements that demonstrate utility between eviction cycles.

ARC, CAR, and TinyLFU Hybrids

Beyond canonical policies, adaptive hybrids capture both recency and frequency:

  • ARC/CAR (IBM research, later open-sourced) maintain separate LRU lists for recent and frequent entries along with “ghost” lists that track recency/frequency of recently evicted items. They continuously rebalance memory between the two lists, yielding near-optimal hit rates across shifting workloads. ARC influences ZFS, IBM storage arrays, and PostgreSQL derivatives.
  • TinyLFU / W-TinyLFU (used in Caffeine, Couchbase DCP caches) pair an admission filter (Count-Min Sketch) with a segmented LRU stack. Only objects predicted to be popular are admitted to the protected segment, drastically reducing churn under high request rates. Google, Netflix, and LinkedIn have published case studies showing W-TinyLFU’s superiority over pure LRU in recommender systems.

These adaptive schemes epitomize modern “cache-hit algorithms”: they integrate admission control, frequency estimation, and recency tracking to approach Belady’s OPT without incurring prohibitive overhead.

Policy Selection Guidelines

  • LRU – ideal for microservice response caches, JVM/CLR object caches, and web browsers where temporal locality dominates. Example: the Linux page cache and NGINX proxy cache rely on LRU-like behavior.
  • LFU (or TinyLFU hybrids) – appropriate for workloads with persistent hot keys such as recommendation services, news feeds, or CDN asset caches. Example: Redis LFU mode and Cloudflare’s cache shield.
  • FIFO/SIEVE – best for high-throughput or resource-constrained environments (routers, embedded devices, massive CDN nodes) where simple pointer manipulation outperforms maintaining LRU lists.

Production systems often adopt adaptive policies—ARC in IBM storage arrays and ZFS, 2Q in PostgreSQL’s buffer manager, or Google’s Adaptive Replacement in web caches—to balance recency and frequency without manual tuning.

Cache Invalidation Mechanisms

Beyond TTL-based expiry, many systems require explicit invalidation to keep caches synchronized with mutable backing stores. Common strategies include:

  • Write-through invalidation. Because every write flows through the cache, the cache can synchronously update or delete entries. Datastores such as DynamoDB Accelerator (DAX) rely on this pattern.
  • Write-behind with dependency tracking. Caches maintain dependency graphs or version vectors; when a backing row changes, the cache invalidates all derived entries. Oracle Coherence and SQL Server AlwaysOn rely on change notification to invalidate dependent cached views.
  • Pub/sub invalidation. A dedicated channel (Redis pub/sub, Kafka topics, SNS fan-out) distributes invalidation messages to cache nodes. This scales to CDNs where origin servers broadcast purge messages for specific URLs.
  • Lease-based protocols. A cache obtains a lease for a key; when the lease expires or a writer revokes it, readers must revalidate. Google’s Chubby and Facebook’s TAO use variants of leases to bound staleness.
  • Versioned writes / compare-and-set. Applications attach version numbers to cached values; updates succeed only if the cache still stores the expected version, preventing ABA-style anomalies.

The choice depends on update frequency, fan-out, and tolerance for stale reads. Systems with strict correctness requirements often combine multiple techniques (e.g., TTL + pub/sub) to hedge against missed signals.

flowchart LR
    Writer --> Cache
    Cache --> DB
    DB --> Event[Change Event]
    Event --> PubSub
    PubSub --> Cache1
    PubSub --> Cache2
    Cache1 --> Clients1
    Cache2 --> Clients2

Cache Stampede Mitigation

Cache-aside architectures are vulnerable to stampedes (also called dogpile or thundering herd) when many clients simultaneously miss the same key and hammer the backing store. Mitigation techniques include:

  • Per-key locking / single-flight. Only one request recomputes a missing value; others block or serve stale data. Implemented via mutexes (memcached CAS, Redis SETNX locks) or libraries such as Go’s singleflight.
  • Request coalescing proxies. An upstream layer aggregates identical requests (e.g., Envoy’s circuit breaking + hedging) so only one reaches the service/cache.
  • Stale-while-revalidate. Serve slightly stale data while asynchronously refreshing the cache. HTTP caching (stale-while-revalidate directive) and CDN edges adopt this pattern to protect origins.
  • Jittered TTLs / probabilistic early refresh. Randomizing expiration times or refreshing before TTL expiry distributes refresh load over time, preventing synchronized misses.
  • Circuit breakers and backpressure. When the backing store degrades, caches temporarily deny refreshes and continue serving stale data, prioritizing availability.

Academic analyses model stampedes as synchronized Poisson arrivals; mitigation reduces the effective concurrency at the database and hence flattens tail latency.

sequenceDiagram
    participant ClientA
    participant ClientB
    participant Cache
    participant DB

    ClientA->>Cache: GET key
    alt miss
        Cache->>Cache: acquire lock
        Cache->>DB: recompute
        DB-->>Cache: value
        Cache-->>ClientA: value
        Cache->>Cache: release lock
    end
    ClientB->>Cache: GET key
    Cache-->>ClientB: wait/serve stale

Consistency Semantics

Caching complicates application-level consistency guarantees. Designers must align cache behavior with the system’s required semantics:

  • Read-your-writes. Achieved through write-through caching, session stickiness, or cache invalidation coupled with client metadata indicating the freshest version.
  • Monotonic reads / bounded staleness. Leases, version clocks, or TTL upper bounds ensure clients do not observe time-traveling data. Google Spanner’s max_staleness and Cosmos DB’s bounded-staleness modes exemplify this integration.
  • Causal or eventual consistency. Many social networks (e.g., Facebook TAO) accept eventual consistency in caches, relying on asynchronous invalidation. Applications must detect and reconcile anomalies (e.g., missing likes) gracefully.
  • Transactional coherence. Some caches (Hibernate second-level cache, Oracle TimesTen) participate in two-phase commit or use write-through with transactional isolation so that cached entries honor serializability.

Formal treatment often models caches as replica lag. Consistency choices dictate invalidation mechanisms and user-visible anomalies.

sequenceDiagram
    participant Client
    participant Cache
    participant DB

    Client->>Cache: WRITE key=value
    Cache->>DB: propagate write (write-through)
    DB-->>Cache: ack
    Cache-->>Client: ack
    Client->>Cache: READ key
    Cache-->>Client: latest value (read-your-writes)

Instrumentation and Adaptive Tuning

Operational success hinges on continuous measurement:

  • Metrics. Track per-key and aggregate hit ratios, miss latency, evictions, admission rejections, and stampede counters. Histograms (P50/P95/P99) capture tail behavior.
  • Tracing. Propagate correlation IDs through cache hits and misses to visualize end-to-end impact (OpenTelemetry spans with cache attributes).
  • Autosizing. Some systems (AWS ElastiCache Auto Scaling, Google Cloud Memorystore) adjust cache node sizes based on monitored pressure. Adaptive caches (Caffeine) expose knobs to tune admission thresholds dynamically.
  • Alerting. Sudden drops in hit ratio or spikes in miss latency often signal invalidation bugs or upstream outages. Instrumentation should differentiate between local, remote, and origin misses in hierarchical caches.

Academic work leverages reuse-distance profiling and adaptive control loops to tune cache parameters automatically; production deployments increasingly adopt similar telemetry-driven feedback.

flowchart LR
    Cache --> Metrics[Metrics Exporter]
    Metrics --> TSDB[Time-series DB]
    TSDB --> Dashboards
    TSDB --> Alerts
    Traces[Tracing Agent] --> Collector --> Dashboards

Security and Multi-tenancy

Shared caches must enforce tenant isolation and guard against data leakage:

  • Namespace isolation. Prefixing keys with tenant IDs or using logical partitions prevents cross-tenant reads. Systems like Redis ACLs and Memcached SASL enable access control at the cache layer.
  • Eviction fairness. When tenants share capacity, weighted quotas or token buckets prevent noisy neighbors from evicting others’ data disproportionately.
  • PII handling and encryption. Sensitive payloads may require encryption-at-rest within the cache or avoidance of caching altogether. CDNs deploy signed cookies/URLs to ensure only authorized users hit cached assets.
  • Side-channel mitigation. Timing attacks (cache probing) and cache-poisoning attacks require validation, request authentication, and sometimes cache busting (e.g., including user-specific headers in cache keys).

Regulated industries often pair caches with audit logging and per-tenant eviction statistics to demonstrate compliance.

flowchart LR
    TenantA -->|namespace A| Cache
    TenantB -->|namespace B| Cache
    Cache -->|quota enforcement| Monitor
    Monitor --> Reports

Write-around Policies and Cache Warming

Write-around caches deliberately bypass caching on writes, inserting only data that has been read at least once. This prevents caches from filling with write-only data (e.g., log ingestion pipelines). Many storage controllers implement a hybrid: sequential writes use write-around, while random writes use write-through.

Cache warming addresses cold-start latency:

  • Snapshot/restore. Persist cache contents periodically (Redis RDB/AOF) and reload on restart.
  • Preload jobs. Issue queries during deployment to populate hot keys (e.g., Netflix pre-warms edge caches with predicted popular titles).
  • Streaming warmup. Use change-data-capture (CDC) feeds to continuously refresh caches before traffic arrives.

These practices reduce the initial miss storm after failover or deploys and maintain predictable latency profiles.

sequenceDiagram
    participant Client
    participant Cache
    participant DB

    Client->>Cache: WRITE (write-around)
    Cache-->>Client: ack
    Cache->>DB: forward write
    Note over Cache: cache not polluted
    Cache->>DB: preload hot keys
    DB-->>Cache: warm data

Time-to-Live (TTL)

Eviction policies determine which entries to discard; time-to-live (TTL) controls when entries expire regardless of capacity. Each cache entry carries an expiration timestamp. When the TTL elapses the cache discards the entry or refreshes it from the backing store.

Common applications include DNS (long TTLs for static records, short TTLs for load-balanced endpoints) and edge caching (short TTLs to propagate configuration changes). TTL is attractive because it implements invalidation without database coordination, but choosing an appropriate value is delicate: long TTLs boost hit ratio at the expense of freshness, whereas short TTLs guarantee freshness but elevate miss rates.

Implementation strategies may evaluate TTL lazily (at read time) or through periodic sweeps. Some caches adopt sliding TTLs that extend lifetimes upon access.

sequenceDiagram
    participant App
    participant Cache
    participant DB

    App->>Cache: GET key
    alt entry fresh
        Cache-->>App: value
    else expired
        Cache-->>App: miss/refresh
        Cache->>DB: fetch key
        DB-->>Cache: value
        Cache->>Cache: reset TTL
        Cache-->>App: value
    end

Materialized Views as Database-managed Caches

Databases can internalize caching via materialized views, which store the results of queries for rapid retrieval. Consider the creation of a daily trade summary view:

CREATE MATERIALIZED VIEW daily_trade_summary AS
SELECT symbol,
       date_trunc('day', trade_time) AS day,
       SUM(volume) AS daily_volume,
       AVG(price)  AS avg_price
FROM trades
GROUP BY symbol, day;
flowchart LR
    trades[(trades table)] -->|aggregate| view[(daily_trade_summary materialized view)]
    writes --> trades
    writes --> view
    analysts --> view

Materialized views inherit the database’s query planner, indexes, and transactional guarantees. Vendors offer manual refresh (explicit recomputation) or incremental refresh (propagating base-table updates). Systems such as Noria generalize the concept by constructing dataflow graphs that maintain query results incrementally outside the primary database, improving latency for complex analytic queries at the cost of eventual consistency.

Memoization

Memoization applies caching at the level of computation rather than storage. A memoized function maintains a mapping from input arguments to previously computed outputs. Subsequent invocations with the same inputs retrieve the stored result, avoiding redundant computation.

flowchart LR
    subgraph Function Cache
    Param[Input args]
    Result[Output]
    end
    Param -->|lookup key| Cache[(Memo table)]
    Cache -->|hit| Result
    Cache -->|miss| Func[Compute]
    Func --> Cache
    Func --> Result

Memoization is especially potent in recursive dynamic-programming algorithms or client-side API wrappers. The primary trade-off is increased memory consumption and the requirement that functions be referentially transparent or accompanied by explicit invalidation rules.

The following Python example illustrates a bounded memoization decorator that evicts entries using LRU semantics to prevent unbounded memory growth:

from functools import lru_cache
from typing import TypeVar, Callable, ParamSpec

P = ParamSpec('P')
R = TypeVar('R')

def memoize(maxsize: int = 128) -> Callable[[Callable[P, R]], Callable[P, R]]:
    """
    Decorator that memoizes function results with LRU eviction.

    Args:
        maxsize: Maximum number of cached results. None for unbounded.

    The decorated function must have hashable arguments.
    """
    def decorator(func: Callable[P, R]) -> Callable[P, R]:
        cached = lru_cache(maxsize=maxsize)(func)
        cached.cache_info = cached.cache_info  # Expose hit/miss statistics
        cached.cache_clear = cached.cache_clear  # Allow manual invalidation
        return cached
    return decorator

@memoize(maxsize=1024)
def fibonacci(n: int) -> int:
    """Compute the nth Fibonacci number with O(n) unique computations."""
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

# Usage: fibonacci(100) computes in microseconds due to memoization
# fibonacci.cache_info() returns CacheInfo(hits=98, misses=101, ...)

For multi-threaded environments, lru_cache provides thread-safe access. In distributed systems, memoization can be extended to remote caches (Redis, Memcached) by serializing arguments into cache keys, though this introduces network latency and requires careful consideration of cache coherency across processes.

HTTP Caching

The HTTP protocol provides a rich vocabulary for cache control that governs behavior across browsers, CDN edges, and reverse proxies. Understanding these headers is essential for web application performance.

Cache-Control Directives

The Cache-Control header is the primary mechanism for specifying caching policy:

Directive Meaning
max-age=N Response may be cached for N seconds from the time of the request.
s-maxage=N Like max-age, but applies only to shared caches (CDNs, proxies).
no-cache Cache must revalidate with origin before serving; does not prevent caching.
no-store Response must not be stored in any cache; use for sensitive data.
private Response is user-specific; shared caches must not store it.
public Response may be cached by any cache, even if normally non-cacheable.
immutable Response will not change during its freshness lifetime; skip revalidation.
stale-while-revalidate=N Serve stale response while asynchronously revalidating for up to N seconds.
stale-if-error=N Serve stale response if origin returns 5xx or is unreachable, for up to N seconds.

Directives may be combined: Cache-Control: public, max-age=31536000, immutable is typical for versioned static assets (JS bundles, images with content hashes).

Conditional Requests and Revalidation

When a cached response expires, clients issue conditional requests to check whether the content has changed:

  • ETag / If-None-Match. The origin includes an ETag (entity tag) header—typically a hash or version identifier—with the response. On revalidation, the client sends If-None-Match: <etag>. If the resource is unchanged, the origin returns 304 Not Modified without a body, saving bandwidth.
  • Last-Modified / If-Modified-Since. The origin includes Last-Modified with a timestamp. The client sends If-Modified-Since: <timestamp> on revalidation. A 304 response indicates the resource is unchanged.

ETags are preferred for precision; Last-Modified is simpler but has second-granularity limitations.

sequenceDiagram
    participant Client
    participant Cache
    participant Origin

    Client->>Cache: GET /resource
    Cache->>Origin: GET /resource (cache miss)
    Origin-->>Cache: 200 OK, ETag: "abc123"
    Cache-->>Client: 200 OK

    Note over Cache: TTL expires

    Client->>Cache: GET /resource
    Cache->>Origin: GET /resource, If-None-Match: "abc123"
    Origin-->>Cache: 304 Not Modified
    Cache->>Cache: Reset TTL
    Cache-->>Client: 200 OK (from cache)

Vary Header and Cache Key Composition

The Vary header instructs caches to partition stored responses by the values of specified request headers. For example, Vary: Accept-Encoding ensures that gzip-compressed and uncompressed responses are cached separately. Common variations include:

  • Vary: Accept-Encoding — Compression negotiation
  • Vary: Accept-Language — Localized content
  • Vary: Cookie — User-specific responses (effectively disables shared caching)

Incorrect Vary configuration leads to cache pollution (storing too many variants) or incorrect responses (serving the wrong variant). CDNs often normalize headers (e.g., collapsing Accept-Encoding: gzip, deflate to a canonical form) to improve hit rates.

CDN-Specific Extensions

CDN providers extend HTTP caching with proprietary headers:

  • Surrogate-Control (Fastly, Akamai): Overrides Cache-Control for edge caches without affecting browser caching.
  • CDN-Cache-Control (Cloudflare): Similar purpose, standardization in progress.
  • Cache-Tag / Surrogate-Key: Allows tagging responses for targeted purging (e.g., invalidate all responses tagged product-123).

These extensions enable fine-grained control over edge behavior while maintaining standard HTTP semantics for downstream clients.

Serialization and Cache Efficiency

The choice of serialization format significantly impacts cache memory efficiency, CPU overhead, and network bandwidth when transmitting cached data between tiers.

Format Trade-offs

Format Size Parse Speed Schema Evolution Human Readable
JSON Large Moderate Flexible Yes
Protocol Buffers Small Fast Backward/forward compatible No
MessagePack Small Fast Flexible No
Avro Small Fast Schema registry required No
FlatBuffers Smallest Zero-copy Limited No

For high-throughput caches, binary formats (Protocol Buffers, MessagePack) typically reduce memory footprint by 30–70% compared to JSON while also decreasing serialization CPU time. Zero-copy formats like FlatBuffers eliminate deserialization entirely for read-heavy workloads, though they complicate schema evolution.

Compression

Layering compression atop serialization further reduces memory consumption:

  • LZ4: Extremely fast compression/decompression; modest compression ratio (2–3×). Ideal for latency-sensitive caches.
  • Zstandard (zstd): Better compression ratio (3–5×) with tunable speed/ratio trade-off. Dictionary mode excels for small, similar objects.
  • Snappy: Google’s format, similar profile to LZ4; widely supported.

The optimal configuration depends on the ratio of memory cost to CPU cost. Caches storing large objects (images, documents) benefit more from compression than caches storing small key-value pairs where per-entry overhead dominates.

Key Design

Cache keys must uniquely identify cached content while remaining compact:

  • Hash-based keys: SHA-256 or xxHash of request parameters produces fixed-size keys but sacrifices debuggability.
  • Structured keys: Concatenating namespace, version, and identifiers (e.g., user:v2:12345:profile) preserves semantics and enables prefix-based invalidation.
  • Normalization: Canonicalizing query parameters, case-folding, and sorting ensure equivalent requests map to the same key.

Key length affects memory overhead and lookup performance; keys exceeding ~256 bytes often warrant hashing.

Multi-Tier Cache Architectures

Production systems frequently deploy caches in hierarchical tiers to balance latency, hit rate, and cost.

Canonical Three-Tier Architecture

flowchart LR
    subgraph Edge ["Edge Tier (CDN)"]
        E1[PoP 1]
        E2[PoP 2]
        E3[PoP N]
    end
    subgraph Regional ["Regional Tier"]
        R1[Regional Cache A]
        R2[Regional Cache B]
    end
    subgraph Origin ["Origin Tier"]
        O1[(Origin Database)]
        O2[Application Servers]
    end

    Client --> E1
    Client --> E2
    E1 --> R1
    E2 --> R1
    E3 --> R2
    R1 --> O2
    R2 --> O2
    O2 --> O1
  • Edge tier (CDN PoPs): Geographically distributed; serves static assets and cacheable API responses with sub-10ms latency. High volume, low hit rate per node but high aggregate hit rate.
  • Regional tier: Aggregates misses from multiple edge nodes; larger capacity, higher hit rate. Reduces origin load by absorbing the “long tail” of requests that miss at individual edges.
  • Origin tier: Database and application servers. Handles cache misses and generates authoritative responses.

Request Coalescing Across Tiers

Multi-tier architectures amplify the stampede problem: a cache miss at the edge can trigger simultaneous requests to the regional tier, which in turn stampede the origin. Mitigations include:

  • Tier-aware single-flight: Each tier independently coalesces requests for the same key.
  • Collapsed forwarding: CDNs (Varnish, Fastly) hold concurrent requests for the same object until the first completes, then serve all waiters from the response.
  • Request hedging with cancellation: Issue parallel requests to multiple regional nodes; cancel redundant requests once one succeeds.
sequenceDiagram
    participant C1 as Client 1
    participant C2 as Client 2
    participant C3 as Client 3
    participant Edge as Edge Cache
    participant Regional as Regional Cache
    participant Origin as Origin Server

    Note over C1,C3: Simultaneous requests for same key
    C1->>Edge: GET /video/123
    C2->>Edge: GET /video/123
    C3->>Edge: GET /video/123

    Edge->>Edge: Coalesce requests (single-flight)
    Edge->>Regional: GET /video/123 (single request)

    Regional->>Regional: Coalesce requests
    Regional->>Origin: GET /video/123 (single request)

    Origin-->>Regional: response
    Regional->>Regional: Cache response
    Regional-->>Edge: response

    Edge->>Edge: Cache response
    Edge-->>C1: response
    Edge-->>C2: response (same data)
    Edge-->>C3: response (same data)

    Note over Edge,Origin: Only 1 request propagated per tier

Consistency Propagation

Invalidations must propagate through all tiers to prevent serving stale data:

  • Push-based invalidation: Origin publishes invalidation events to a message bus; each tier subscribes and purges affected keys. Latency depends on message propagation delay.
  • TTL hierarchy: Edge TTLs are shorter than regional TTLs, which are shorter than origin TTLs. This bounds staleness without explicit invalidation but sacrifices hit rate.
  • Purge APIs: CDNs expose APIs to purge by URL, tag, or prefix. Orchestration systems trigger purges after database commits.
flowchart TB
    subgraph Origin
        DB[(Database)]
        App[Application Server]
    end

    subgraph Message Bus
        PubSub[Invalidation Channel<br/>Kafka/Redis Pub-Sub]
    end

    subgraph Regional Tier
        R1[Regional Cache 1]
        R2[Regional Cache 2]
    end

    subgraph Edge Tier
        E1[Edge PoP 1]
        E2[Edge PoP 2]
        E3[Edge PoP 3]
        E4[Edge PoP 4]
    end

    App -->|1. Write update| DB
    DB -->|2. Publish invalidation| PubSub
    PubSub -->|3a. Invalidate key| R1
    PubSub -->|3b. Invalidate key| R2
    PubSub -->|3c. Invalidate key| E1
    PubSub -->|3d. Invalidate key| E2
    PubSub -->|3e. Invalidate key| E3
    PubSub -->|3f. Invalidate key| E4

    R1 -.purge.-> R1
    R2 -.purge.-> R2
    E1 -.purge.-> E1
    E2 -.purge.-> E2
    E3 -.purge.-> E3
    E4 -.purge.-> E4

    style PubSub fill:#f9f,stroke:#333,stroke-width:2px

TTL hierarchy example:

  • Edge tier: TTL = 60 seconds
  • Regional tier: TTL = 300 seconds (5 minutes)
  • Origin tier: TTL = 3600 seconds (1 hour)

This ensures edge caches refresh more frequently, bounding maximum staleness to 60 seconds even if invalidation messages are delayed.

The CAP theorem constrains multi-tier caches: strong consistency requires synchronous invalidation (sacrificing availability during partitions), while high availability permits temporary staleness.

Testing Cache Behavior

Caches introduce subtle failure modes—stale data, inconsistent state, stampedes—that require systematic testing.

Functional Testing

  • Hit/miss correctness: Verify that identical requests return cached responses and that cache keys correctly discriminate between different requests.
  • TTL behavior: Confirm that entries expire at the expected time. Use time-mocking libraries or configurable clocks to accelerate TTL expiration in tests.
  • Invalidation correctness: After a write, verify that subsequent reads observe the updated value (or a cache miss triggering a fresh fetch).
  • Negative caching: Ensure that lookups for non-existent keys are cached and expire appropriately.

Concurrency Testing

  • Stampede simulation: Issue many concurrent requests for the same uncached key; verify that only one backend request occurs (if single-flight is implemented).
  • Race conditions: Test interleaved reads and writes to detect inconsistencies from non-atomic cache operations.
  • Eviction under pressure: Fill the cache to capacity and verify that eviction policies behave as expected (LRU evicts oldest, LFU evicts least frequent, etc.).

Chaos and Fault Injection

  • Cache node failure: Kill cache nodes and verify that the system degrades gracefully (increased latency, not errors).
  • Network partitions: Simulate partitions between cache tiers; verify that stale data is served (if acceptable) or requests fail fast.
  • Clock skew: Introduce clock drift between nodes to test TTL and lease-based protocols.

Performance Testing

  • Latency distribution: Measure P50, P95, P99 latencies for cache hits and misses separately. Tail latency often reveals contention or GC issues.
  • Throughput under load: Ramp request rate until the cache saturates; identify the bottleneck (CPU, memory bandwidth, network).
  • Working-set simulation: Replay production access patterns (or synthetic Zipfian distributions) to validate hit-rate predictions.

Observability in Tests

Instrument tests to emit the same metrics as production (hit ratio, miss latency, eviction rate). Assertions on these metrics catch regressions that functional tests might miss—for example, a code change that accidentally disables caching would pass functional tests but exhibit a 0% hit ratio.

Consistent Hashing

Distributed caches partition the keyspace across multiple nodes to scale throughput and capacity. Consistent hashing provides a deterministic mapping from keys to nodes that minimizes data movement when the node set changes.

Classical Consistent Hashing

The algorithm treats both keys and nodes as points on a fixed-size hash ring (e.g., $[0, 2^{32})$). A key $k$ is assigned to the first node encountered when traveling clockwise from $\text{hash}(k)$.

Hash function: $h: \text{Key} \to [0, M)$ where $M = 2^{32}$ or $2^{64}$.

Node assignment:

$$ \text{node}(k) = \arg\min\_{n \in N} \{ h(n) \mid h(n) \geq h(k) \} \cup \{ \min\_{n \in N} h(n) \} $$

When a node joins or leaves, only keys in the affected arc move, yielding $O(K/N)$ relocations where $K$ is total keys and $N$ is the node count.

graph TD
    subgraph Hash Ring
    K1[Key A: hash=10] --> N1[Node 1: hash=15]
    K2[Key B: hash=42] --> N2[Node 2: hash=50]
    K3[Key C: hash=88] --> N3[Node 3: hash=90]
    end

Virtual Nodes

A naive ring with $N$ nodes exhibits high variance in load distribution. The standard deviation of load is:

$$ \sigma\_{\text{load}} = \Theta\left( \sqrt{\frac{K}{N}} \right) $$

This causes hotspots where unlucky nodes receive disproportionate load. Virtual nodes mitigate this: each physical node $n$ is replicated into $V$ virtual nodes with distinct hashes $h(n, 1), h(n, 2), \ldots, h(n, V)$.

graph TB
    subgraph Hash Ring with Virtual Nodes
        VN1["Node A-v1<br/>(hash=10)"]
        VN2["Node B-v1<br/>(hash=25)"]
        VN3["Node A-v2<br/>(hash=40)"]
        VN4["Node C-v1<br/>(hash=55)"]
        VN5["Node B-v2<br/>(hash=70)"]
        VN6["Node A-v3<br/>(hash=85)"]
        VN7["Node C-v2<br/>(hash=95)"]
    end

    K1["Key X<br/>hash=15"] --> VN2
    K2["Key Y<br/>hash=50"] --> VN4
    K3["Key Z<br/>hash=80"] --> VN6

    style VN1 fill:#f96,stroke:#333
    style VN3 fill:#f96,stroke:#333
    style VN6 fill:#f96,stroke:#333
    style VN2 fill:#9f6,stroke:#333
    style VN5 fill:#9f6,stroke:#333
    style VN4 fill:#69f,stroke:#333
    style VN7 fill:#69f,stroke:#333

Visualization: 3 physical nodes (A, B, C) each mapped to multiple virtual nodes. Keys distribute more evenly because virtual nodes spread across the ring. Node A is red, Node B is green, Node C is blue.

Load variance with virtual nodes:

$$ \sigma\_{\text{load}} = O\left( \sqrt{\frac{K}{NV}} \right) $$

Empirically, $V = 100$ to $V = 500$ virtual nodes per physical node yields balanced distributions (standard deviation $< 5\%$ of mean load).

Trade-off: More virtual nodes improve balance but increase metadata overhead (routing tables, gossip protocol state). Redis Cluster uses 16,384 hash slots as a middle ground.

Jump Consistent Hash

Google’s jump consistent hash achieves $O(1)$ memory (no ring storage) and $O(\log N)$ computation:

$$ \text{JumpHash}(k, N) \to n \in [0, N) $$

Algorithm:

def jump_hash(key: int, num_buckets: int) -> int:
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = ((key * 2862933555777941757) + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b

Properties:

  • Minimal movement: Adding a node $N+1$ relocates exactly $\frac{K}{N+1}$ keys (optimal).
  • No virtual nodes needed: Hash function inherently balances load.
  • Limitation: Nodes cannot be removed individually; only supports sequential addition.

Used in Google’s load balancers and Maglev.

Rendezvous (Highest Random Weight) Hashing

HRW hashing computes a score for each node and assigns the key to the highest-scoring node:

$$ \text{node}(k) = \arg\max\_{n \in N} \text{hash}(k \oplus n) $$

where $\oplus$ denotes concatenation or XOR.

Advantages:

  • Arbitrary node changes: Adding or removing any node affects only $O(K/N)$ keys.
  • No ring metadata: Computation is stateless.

Disadvantages:

  • $O(N)$ hashes per lookup: Prohibitive for large $N$. Mitigated by caching or using consistent hashing for coarse sharding + HRW for fine-grained replica selection.

Variance: Load distribution is near-optimal:

$$ \mathbb{E}[\text{load}\_n] = \frac{K}{N}, \quad \sigma\_{\text{load}} = O\left( \sqrt{\frac{K}{N}} \right) $$

Replication and Multi-Homing

For fault tolerance, each key is replicated to $R$ nodes. Common strategies:

  • Ring walk: Assign key to the next $R$ distinct physical nodes on the ring.
  • Independent hash functions: Use $R$ different hash functions to select $R$ independent nodes.

Replication increases load:

$$ \text{Expected load per node} = \frac{R \cdot K}{N} $$

Probability of data loss: If nodes fail independently with probability $p$, the probability that all $R$ replicas fail is:

$$ P(\text{data loss}) = p^R $$

For $R=3$ and $p=0.01$ (1% node failure rate), $P(\text{data loss}) = 10^{-6}$.

Cache Economics and Sizing Models

Caches consume memory, which has cost. Optimal cache sizing balances cache cost against miss cost.

Cost-Benefit Model

Let:

  • $C\_{\text{memory}}(S)$ = cost of cache of size $S$ (linear: $C\_{\text{memory}} = c\_m \cdot S$)
  • $L\_{\text{miss}}$ = average latency penalty per cache miss (database query time)
  • $\lambda$ = request rate (queries per second)
  • $h(S)$ = hit ratio as a function of cache size

Miss rate: $m(S) = 1 - h(S)$

Total cost per time unit:

$$ C\_{\text{total}}(S) = c\_m \cdot S + \lambda \cdot m(S) \cdot L\_{\text{miss}} \cdot c\_{\text{latency}} $$

where $c\_{\text{latency}}$ converts latency into cost (e.g., user dissatisfaction, SLA violations, compute time).

Optimal cache size minimizes total cost:

$$ S^{\*} = \arg\min\_{S} C\_{\text{total}}(S) $$

Marginal analysis: Increase cache size if the marginal cost of memory is less than the marginal benefit from reduced misses:

$$ c\_m < -\lambda \cdot L\_{\text{miss}} \cdot c\_{\text{latency}} \cdot \frac{dh(S)}{dS} $$

Cost optimization visualization:

graph TB
    subgraph Cost Curves
        direction TB
        A["Cache Size (S) →"]
        B["Cost →"]
    end

    subgraph Explanation
        Memory["Memory Cost: c_m × S<br/>(Linear, increasing)"]
        Miss["Miss Cost: λ × m(S) × L_miss × c_latency<br/>(Decreasing as cache grows)"]
        Total["Total Cost: Memory + Miss<br/>(U-shaped curve)"]
        Optimal["Optimal S*: Minimum total cost<br/>(Memory slope = Miss slope)"]
    end

    style Memory fill:#f99,stroke:#333
    style Miss fill:#99f,stroke:#333
    style Total fill:#9f9,stroke:#333
    style Optimal fill:#ff9,stroke:#333,stroke-width:3px

Key insight:

  • Small cache: Miss cost dominates (frequent DB queries)
  • Large cache: Memory cost dominates (expensive RAM, diminishing returns)
  • Optimal point $S^*$: Where marginal memory cost equals marginal miss cost reduction

Concrete example:

  • $c_m = \$10$/GB/month
  • $\lambda = 1000$ req/s
  • $L_{\text{miss}} = 50$ ms
  • $c_{\text{latency}} = \$0.001$/sec (compute cost)
  • $h(S) \approx \ln(S)/\ln(N)$ (Zipfian with $\alpha=1$)

For $N=10^9$ items:

  • At $S=10^5$: high miss cost, low memory cost → increase cache
  • At $S=10^7$: balanced → near optimal
  • At $S=10^9$: memory cost exceeds miss savings → over-provisioned

Working Set Model and Zipfian Distributions

Many workloads follow a Zipfian distribution where the $i$-th most popular item has access probability:

$$ p\_i = \frac{1/i^{\alpha}}{\sum\_{j=1}^{N} 1/j^{\alpha}} $$

where $\alpha$ is the skew parameter ($\alpha \approx 1$ for web caches, $\alpha \approx 0.8$ for CDN).

Hit ratio for a cache storing the top $S$ items:

$$ h(S) = \sum\_{i=1}^{S} p\_i = \frac{H\_S^{(\alpha)}}{H\_N^{(\alpha)}} $$

where $H\_n^{(\alpha)} = \sum\_{k=1}^{n} 1/k^{\alpha}$ is the generalized harmonic number.

Asymptotic behavior:

  • For $\alpha > 1$: $h(S) \to 1 - O(S^{1-\alpha})$ (rapidly saturates)
  • For $\alpha = 1$: $h(S) \approx \frac{\ln S}{\ln N}$ (logarithmic growth)
  • For $\alpha < 1$: Poor cache effectiveness; large $S$ needed

Example: For $\alpha = 1$, $N = 10^9$ items, $S = 10^6$:

$$ h(10^6) \approx \frac{\ln 10^6}{\ln 10^9} = \frac{13.8}{20.7} \approx 67\% $$

TCO Comparison: Cache vs. Database Replication

Scenario: 10,000 req/s, 50 ms database latency, 1 ms cache latency.

Option 1: No cache, scale database

  • Database must handle 10,000 req/s
  • Cost: $N\_{\text{DB}}$ replicas × $\$c_{\text{DB}}$/replica

Option 2: Cache with 80% hit ratio

  • Database handles 2,000 req/s (20% miss rate)
  • Cache handles 8,000 req/s
  • Cost: $c\_{\text{cache}} + N\_{\text{DB,reduced}}$ × $c\_{\text{DB}}$

Breakeven: Cache is cost-effective if:

$$ c\_{\text{cache}} < (N\_{\text{DB}} - N\_{\text{DB,reduced}}) \times c\_{\text{DB}} $$

Example numbers:

  • Database replica: $500/month
  • Cache (Redis 10 GB): $100/month
  • Without cache: need 5 DB replicas = $2,500/month
  • With cache: 1 DB replica + cache = $600/month
  • Savings: $1,900/month

When NOT to Cache

Caching has negative ROI when:

  1. High write rate: If $w/r > 0.5$ (write-to-read ratio), cache invalidation overhead exceeds benefits.
  2. Low access skew: Uniform access ($\alpha \approx 0$) yields $h(S) \approx S/N$, requiring impractically large $S$.
  3. Small dataset: If entire dataset fits in database memory, caching adds latency (extra network hop).
  4. Strict consistency requirements: Transactional workloads may require synchronous cache invalidation, negating latency gains.

Rule of thumb: Cache if $h(S) > 0.6$ and $w/r < 0.3$ for economically feasible $S$.

Geo-Distributed Cache Challenges

Global services deploy caches across multiple geographic regions to minimize latency for users worldwide. This introduces consistency, routing, and failure-handling complexities.

Latency Hierarchy

Latency breakdown for a user in San Francisco accessing content:

Tier Location Latency
L1: Browser cache Local device <1 ms
L2: Edge PoP San Francisco PoP 2–5 ms
L3: Regional cache US-West region 10–20 ms
L4: Origin US-East region 70–100 ms
L5: Database Primary data center 150–300 ms (with query time)
flowchart TD
    User[User in San Francisco] -->|<1ms| L1[L1: Browser Cache<br/>Hit Rate: 30%]
    L1 -->|miss| L2[L2: Edge PoP SF<br/>Latency: 2-5ms<br/>Hit Rate: 50%]
    L2 -->|miss| L3[L3: Regional Cache US-West<br/>Latency: 10-20ms<br/>Hit Rate: 80%]
    L3 -->|miss| L4[L4: Origin US-East<br/>Latency: 70-100ms<br/>Hit Rate: 90%]
    L4 -->|miss| L5[(L5: Database<br/>Latency: 150-300ms)]

    L1 -.hit.-> User
    L2 -.hit.-> L1
    L3 -.hit.-> L2
    L4 -.hit.-> L3
    L5 -.data.-> L4

    style L1 fill:#d4f1d4,stroke:#333
    style L2 fill:#b3e5b3,stroke:#333
    style L3 fill:#92d992,stroke:#333
    style L4 fill:#71cd71,stroke:#333
    style L5 fill:#50c150,stroke:#333

Waterfall: Request cascades down tiers until a hit occurs. Lower tiers (darker green) have higher latency but also higher aggregate hit rates.

Expected latency under cache hierarchy:

$$ E[L] = p\_1 L\_1 + (1-p\_1) p\_2 L\_2 + \ldots + \prod\_{i=1}^{4} (1-p\_i) \cdot L\_5 $$

where $p\_i$ is the hit rate at tier $i$.

Example: $p\_1=0.3, p\_2=0.5, p\_3=0.8, p\_4=0.9$:

$$ E[L] = 0.3(1) + 0.7(0.5)(5) + 0.7(0.5)(0.8)(15) + \ldots \approx 8.2 \text{ ms} $$

Cross-Region Consistency Models

Strong consistency: All regions observe writes in the same order. Requires consensus (Paxos, Raft).

Cost: Write latency $\geq$ RTT to majority of regions:

$$ L\_{\text{write}} \geq \frac{1}{2} \max\_{r\_i, r\_j \in R} \text{RTT}(r\_i, r\_j) $$

For global deployment (e.g., US-West, US-East, EU, Asia), $L\_{\text{write}} \geq 150$ ms.

Eventual consistency: Regions accept writes locally and asynchronously propagate updates.

Staleness bound: Under eventual consistency with propagation delay $\Delta$, a read may observe data up to $\Delta$ seconds stale.

Probabilistic staleness: If updates arrive as a Poisson process with rate $\lambda\_{\text{write}}$, the probability a read observes stale data:

$$ P(\text{stale}) = 1 - e^{-\lambda\_{\text{write}} \Delta} $$

For $\lambda\_{\text{write}} = 10$ updates/sec and $\Delta = 1$ sec:

$$ P(\text{stale}) = 1 - e^{-10} \approx 0.99995 $$

Most reads are stale! This is acceptable for social feeds but unacceptable for financial transactions.

Geo-Routing Strategies

DNS-based routing: Return geographically closest PoP based on client IP.

  • Pro: Simple, leverages existing infrastructure.
  • Con: Coarse granularity (BGP prefixes), ignores real-time congestion.

Anycast routing: Multiple PoPs advertise the same IP; BGP routes traffic to nearest.

  • Pro: Automatic failover, no DNS TTL delays.
  • Con: Routing changes can cause cache misses (different PoP = different cache).

Client-side selection: Client measures latency to all PoPs, selects fastest.

  • Pro: Accounts for real-time conditions.
  • Con: Client-side complexity, exposes topology.

Performance: Anycast reduces median latency by 20–40% vs. DNS in multi-region deployments.

Split-Brain Resolution

Network partitions can isolate cache regions, creating split brain where regions diverge.

Vector clocks track causality:

$$ \text{VC}(r\_i) = [c\_1, c\_2, \ldots, c\_n] $$

where $c\_i$ is the count of updates from region $i$.

Conflict detection: Updates $U\_1$ and $U\_2$ conflict if their vector clocks are incomparable:

$$ \neg (VC(U\_1) \leq VC(U\_2)) \land \neg (VC(U\_2) \leq VC(U\_1)) $$

Resolution strategies:

  • Last-write-wins (LWW): Use timestamp; risks data loss under clock skew.
  • Application-specific merge: E.g., shopping cart merges item sets.
  • CRDT (Conflict-free Replicated Data Types): Mathematical structures guaranteeing convergence (grow-only sets, PN-counters).

Production Case Studies

Facebook TAO

TAO (The Associations and Objects) is Facebook’s distributed cache for the social graph.

Architecture:

  • Objects: Users, photos, posts (stored in MySQL).
  • Associations: Friend relationships, likes, comments (edges in the graph).
  • Cache layer: Memcached fleet (thousands of nodes).
flowchart TB
    subgraph Primary Region
        App1[Application Servers]
        Cache1[Memcached Fleet]
        DB1[(MySQL Master)]
    end

    subgraph Secondary Region
        App2[Application Servers]
        Cache2[Memcached Fleet]
        DB2[(MySQL Follower)]
    end

    Users --> App1
    Users --> App2
    App1 --> Cache1
    App2 --> Cache2
    App1 -.write.-> DB1
    Cache1 -.miss.-> DB1
    Cache2 -.miss.-> DB2
    DB1 -->|async replication| DB2
    DB1 -.invalidation.-> Cache1
    DB1 -.invalidation.-> Cache2

Consistency model: Read-your-writes within a region; eventual across regions.

Implementation:

  • Writes go to master DB in primary region.
  • Invalidations sent via async replication to follower DBs and caches.
  • Lease protocol prevents stale cache population: cache must hold valid lease to serve data.

Lease protocol sequence:

sequenceDiagram
    participant Client
    participant Cache
    participant DB

    Client->>Cache: GET object_123
    alt cache miss
        Cache->>DB: fetch object_123 + request lease
        DB-->>Cache: data + lease (expires in 10s)
        Cache->>Cache: store data + lease token
        Cache-->>Client: data
    end

    Note over DB: Another client writes object_123
    DB->>Cache: invalidate object_123 lease
    Cache->>Cache: mark lease as invalid

    Client->>Cache: GET object_123
    alt lease invalid
        Cache-->>Client: miss (force refetch)
        Cache->>DB: fetch fresh data + new lease
        DB-->>Cache: updated data + new lease
        Cache-->>Client: updated data
    end

Performance:

  • Hit ratio: >99% for object reads, ~95% for association reads.
  • Latency: P50 < 1 ms (cache hit), P99 < 10 ms (includes occasional misses).

Failure mode: During cross-region partition, reads may return stale data but eventually converge.

Twitter Cache Architecture

Twitter uses multi-tier caching for timelines:

  1. Edge cache (PoP): Static assets (CSS, JS).
  2. Regional cache (Redis): Rendered timeline fragments.
  3. In-process cache (JVM): Hot user profiles.
flowchart TB
    Users[Users] --> Edge[Edge Cache - PoP]
    Edge --> Regional[Regional Cache - Redis]
    Regional --> AppServers[Application Servers]
    AppServers --> InProc[In-Process Cache - JVM]
    InProc --> TweetDB[(Tweet Database)]
    InProc --> GraphDB[(Social Graph DB)]

    Edge -.static assets.-> Users
    Regional -.timeline fragments.-> Edge
    InProc -.user profiles.-> AppServers

Timeline construction:

$$ \text{Timeline}(u) = \bigcup\_{f \in \text{following}(u)} \text{TopK}(\text{tweets}(f), k=20) $$

Caching strategy:

flowchart LR
    subgraph Fan-out on Write
        U1[User tweets] --> Cache1[Push to follower timelines]
        Cache1 --> F1[Follower 1 cache]
        Cache1 --> F2[Follower 2 cache]
        Cache1 --> F3[Follower N cache]
        Note1[Write amplification: W = follower_count]
    end

    subgraph Fan-out on Read
        U2[User requests timeline] --> Fetch[Fetch from followed users]
        Fetch --> T1[User 1 tweets]
        Fetch --> T2[User 2 tweets]
        Fetch --> T3[User M tweets]
        Fetch --> Merge[Merge & sort]
        Note2[Read latency: max of all fetches]
    end
  • Fan-out on write: When user $u$ tweets, push to timelines of all followers.
    • Write amplification: $W = \text{follower\_count}(u)$
    • For celebrities ($10^6$ followers), writes are batched and rate-limited.
  • Fan-out on read: Fetch tweets from followed users at read time.
    • Read latency: $L\_{\text{read}} = \max\_{f} L(\text{fetch tweets from } f)$
    • Mitigated by parallel fetch + timeout.

Hybrid approach: Fan-out-write for normal users; fan-out-read for celebrities.

Netflix Edge Caching

Netflix caches video chunks at thousands of Open Connect Appliances (edge servers in ISP networks).

flowchart TB
    subgraph ISP Network 1
        OCA1[Open Connect Appliance]
    end

    subgraph ISP Network 2
        OCA2[Open Connect Appliance]
    end

    subgraph ISP Network N
        OCA3[Open Connect Appliance]
    end

    Users1[Users] --> OCA1
    Users2[Users] --> OCA2
    Users3[Users] --> OCA3

    OCA1 -.cache miss.-> Origin[Netflix Origin Servers]
    OCA2 -.cache miss.-> Origin
    OCA3 -.cache miss.-> Origin

    ML[ML Prediction Model] -.popularity forecast.-> OCA1
    ML -.popularity forecast.-> OCA2
    ML -.popularity forecast.-> OCA3

    Origin --> Storage[(Content Storage)]

Cache decision function:

$$ \text{cache}(c) = \begin{cases} 1 & \text{if popularity}(c) \cdot \text{size}(c)^{-1} > \theta \\ 0 & \text{otherwise} \end{cases} $$

where $c$ is a video chunk, $\text{popularity}$ is predicted views, $\text{size}$ is storage cost, and $\theta$ is threshold.

Prediction model: Uses collaborative filtering + time-series analysis to forecast demand.

Performance:

  • 95% of traffic served from cache (no origin hit).
  • Byte hit ratio: 85% (accounting for video size skew).

Adaptive bitrate caching: Cache multiple resolutions; serve based on client bandwidth.

flowchart TD
    Request[Client requests video] --> Detect[Detect bandwidth]
    Detect --> Decision{Bandwidth?}

    Decision -->|High: >5 Mbps| Check4K{4K cached?}
    Decision -->|Medium: 2-5 Mbps| Check1080{1080p cached?}
    Decision -->|Low: <2 Mbps| Check720{720p cached?}

    Check4K -->|Yes| Serve4K[Serve 4K chunk]
    Check4K -->|No| Fetch4K[Fetch from origin] --> Serve4K

    Check1080 -->|Yes| Serve1080[Serve 1080p chunk]
    Check1080 -->|No| Fetch1080[Fetch from origin] --> Serve1080

    Check720 -->|Yes| Serve720[Serve 720p chunk]
    Check720 -->|No| Fetch720[Fetch from origin] --> Serve720

    Serve4K --> Client[Stream to client]
    Serve1080 --> Client
    Serve720 --> Client

Cache Migration and Versioning

Zero-Downtime Migration

Dual-write strategy:

  1. Phase 1: Shadow writes — Write to old and new cache; read from old.
  2. Phase 2: Backfill — Copy existing data from old to new cache.
  3. Phase 3: Dual reads — Read from new cache; fallback to old on miss.
  4. Phase 4: Cut over — Read only from new cache.
  5. Phase 5: Retire — Decommission old cache.
flowchart LR
    subgraph Phase 1: Shadow Writes
        App1[Application] -->|write| Old1[Old Cache]
        App1 -->|shadow write| New1[New Cache]
        App1 -.read.-> Old1
    end

    subgraph Phase 2: Backfill
        Old2[Old Cache] -->|copy all data| New2[New Cache]
        App2[Application] -.read.-> Old2
    end

    subgraph Phase 3: Dual Reads
        App3[Application] -.read primary.-> New3[New Cache]
        New3 -.miss fallback.-> Old3[Old Cache]
    end

    subgraph Phase 4: Cut Over
        App4[Application] -.read.-> New4[New Cache]
        Old4[Old Cache]
    end

    subgraph Phase 5: Retire
        App5[Application] -.read.-> New5[New Cache]
    end

    Phase1 --> Phase2
    Phase2 --> Phase3
    Phase3 --> Phase4
    Phase4 --> Phase5

    style New1 fill:#9f9,stroke:#333
    style New2 fill:#9f9,stroke:#333
    style New3 fill:#6f6,stroke:#333
    style New4 fill:#3f3,stroke:#333
    style New5 fill:#0f0,stroke:#333
    style Old4 fill:#f99,stroke:#333,stroke-dasharray: 5 5

Timeline: New cache gradually becomes authoritative (green intensifies) while old cache is phased out (red, dashed in Phase 4).

Consistency risk: During dual-write, writes may arrive out of order.

Mitigation: Attach version numbers; new cache accepts only newer versions:

$$ \text{accept}(k, v, t) = \begin{cases} \text{true} & \text{if } t > t\_{\text{cached}}(k) \\ \text{false} & \text{otherwise} \end{cases} $$

Schema Evolution

Serialization versioning: Embed schema version in cached values:

$$ \text{CacheEntry} = (\text{version}, \text{payload}) $$

Backward compatibility: New code must read old versions.

Forward compatibility: Old code should ignore unknown fields (Protocol Buffers, Avro support this).

Gradual rollout: Deploy new version with flag to control read/write format:

if feature_flag("use_v2_schema"):
    cache.set(key, serialize_v2(data))
else:
    cache.set(key, serialize_v1(data))

Advanced Monitoring Patterns

Anomaly Detection

Baseline hit ratio: Establish expected hit ratio $h\_0$ via historical average.

Anomaly threshold: Alert if current hit ratio $h(t)$ deviates:

$$ |h(t) - h\_0| > k \cdot \sigma\_h $$

where $\sigma\_h$ is the standard deviation of historical hit ratios and $k=3$ (three-sigma rule).

Exponentially weighted moving average (EWMA):

$$ \text{EWMA}\_t = \alpha \cdot h(t) + (1 - \alpha) \cdot \text{EWMA}\_{t-1} $$

Detects gradual drift. Typical $\alpha = 0.1$ (10% weight to current sample).

Change-point detection: Use cumulative sum (CUSUM) to detect shifts:

$$ S\_t = \max(0, S\_{t-1} + (h(t) - h\_0 - \epsilon)) $$

Alert when $S\_t > \theta$ (threshold).

Attribution Analysis

Per-tenant hit ratio:

$$ h\_i = \frac{\text{hits}\_i}{\text{hits}\_i + \text{misses}\_i} $$

Cache consumption: Track memory usage per tenant via key prefixes.

Eviction attribution: Log which tenant’s keys are evicted:

$$ r\_{\text{evict},i} = \frac{E\_i}{\Delta t} $$

Identifies tenants causing churn.

Correlation with Upstream Metrics

Miss latency correlation: Plot cache miss rate against database P99 latency.

Expected relationship (Little’s Law):

$$ L\_{\text{DB}} = \frac{N\_{\text{concurrent}}}{T\_{\text{throughput}}} \approx \lambda \cdot m(t) \cdot L\_{\text{query}} $$

If DB latency spikes without corresponding miss-rate increase, root cause is not cache.

Write Coalescing and Batching

Write Combining Buffers

Hardware caches use write combining: accumulate multiple writes to adjacent addresses, flush as a single transaction.

Software analog: Buffer cache writes; flush in batches.

Throughput gain:

$$ \text{Throughput} = \frac{B \cdot W}{L\_{\text{flush}} + B \cdot L\_{\text{write}}} $$

where $B$ is batch size, $W$ is write rate, $L\_{\text{flush}}$ is flush latency, $L\_{\text{write}}$ is per-write latency.

For $L\_{\text{flush}} = 10$ ms, $L\_{\text{write}} = 0.1$ ms, $B = 100$:

$$ \text{Throughput} = \frac{100 W}{10 + 100(0.1)} = \frac{100 W}{20} = 5W $$

Batching increases throughput $5\times$.

Invalidation Batching

Naive invalidation: Publish one message per key update.

Batched invalidation: Accumulate invalidations over window $\Delta t$; publish batch.

Staleness bound: Data can be stale for up to $\Delta t$.

Trade-off:

  • Small $\Delta t$: Low staleness, high message overhead.
  • Large $\Delta t$: High staleness, low overhead.

Optimal window minimizes cost:

$$ C(\Delta t) = c\_{\text{msg}} \cdot \frac{1}{\Delta t} + c\_{\text{stale}} \cdot \Delta t $$

Minimize:

$$ \Delta t^{\*} = \sqrt{\frac{c\_{\text{msg}}}{c\_{\text{stale}}}} $$

For $c\_{\text{msg}} = 0.01$, $c\_{\text{stale}} = 1$:

$$ \Delta t^{\*} = \sqrt{0.01} = 0.1 \text{ seconds} $$

Bloom Filter Invalidation

Before broadcasting invalidation for key $k$, check Bloom filter of recently accessed keys.

False positive rate:

$$ p\_{\text{fp}} = \left( 1 - e^{-kn/m} \right)^k $$

where $n$ is element count, $m$ is bit array size, $k$ is hash function count.

Benefit: Skip invalidations for cold keys, reducing broadcast traffic by 70–90%.

Concluding Remarks

Caching occupies the intersection of performance optimization and distributed systems design. Selecting an appropriate caching strategy requires balancing latency, consistency, storage cost, and implementation complexity. The techniques surveyed in this note—from cache-aside and TTLs to materialized views and memoization—provide a vocabulary for reasoning about these trade-offs in a principled, academically rigorous manner.