Inside a High-Performance Time Series Database in Go

Time series databases (TSDBs) are specialized storage engines optimized for time-stamped data at massive scale. Unlike general-purpose databases, TSDBs exploit temporal locality, high write throughput, and read patterns dominated by range queries and aggregations. This article explores the architecture, algorithms, and implementation techniques for building a production-grade TSDB in Go, examining both storage and query engines with mathematical analysis of performance characteristics.

Modern observability platforms like Prometheus, Grafana, and Datadog rely on time series databases to handle billions of metrics per second. These systems face unique challenges: metrics arrive continuously at high velocity, queries scan large time ranges for trend analysis, and storage costs must remain manageable despite exponential data growth. A general-purpose database like PostgreSQL or MongoDB would struggle under this workload—the access patterns are fundamentally different from transactional (OLTP) or analytical (OLAP) systems.

Time Series Data Model

Fundamental Characteristics

Understanding why time series data is special guides every architectural decision in TSDB design. A time series is a sequence of data points indexed by time:

$$ \mathcal{T} = \{(t_1, v_1), (t_2, v_2), \ldots, (t_n, v_n)\} \quad \text{where } t_1 < t_2 < \cdots < t_n $$

Key properties that distinguish TSDBs:

Append-only writes: $99\%$ of writes are inserts with $t > t\_{\max}$ (monotonically increasing timestamps). This is fundamentally different from OLTP databases where records are randomly inserted, updated, and deleted. Metrics collectors (like Prometheus scrapers or StatsD agents) continuously push new data points but almost never modify historical values. This property allows aggressive optimizations—we can assume writes always go to the “hot” end of the dataset.
High cardinality: Millions of unique series (metrics × label combinations). Modern microservice architectures generate explosive cardinality. Consider an HTTP request counter with labels for service, endpoint, method, and status code. A 100-service system with 20 endpoints each creates $100 \times 20 \times 5 \times 10 = 100{,}000$ unique series just for this one metric. InfluxDB famously struggled with high cardinality until their 2.x rewrite, while Prometheus was designed from the ground up to handle it efficiently through inverted indexing.
Immutable history: Once written, data rarely changes. Unlike financial transactions that may require corrections, metric samples represent observations at a point in time. If a CPU reading was 45% at 10:00:00, that fact never changes. This immutability enables LSM-tree storage architectures and aggressive compression without worrying about update-in-place complexity.
Temporal locality: Recent data accessed more frequently than old data. Dashboards typically show the last 6 hours, alerts evaluate recent windows, and long-term trends use downsampled data. This access pattern—termed “temporal locality”—justifies tiered storage strategies where hot data lives in memory/SSD and cold data moves to cheaper archival storage.

Workload characteristics (Prometheus production data, 2023):

Write rate: $10^6$ – $10^7$ samples/sec
Read/write ratio: $1:10$ (write-heavy, unlike OLTP)
Query patterns:
- Range queries: $80\%$ (e.g., last 6 hours)
- Aggregations: $60\%$ (SUM, AVG, P95)
- Point queries: $<5\%$

Data Model Schema

The data model is the foundation of any database, and for TSDBs, the choice between different modeling approaches has profound performance implications. Early time series databases (RRDtool, Graphite) used hierarchical naming like servers.web01.cpu.usage, which was simple but inflexible. Modern TSDBs universally adopted the metric + labels model pioneered by Google’s Borgmon and popularized by Prometheus.

Why metric + labels?

The key insight: multi-dimensional data is best represented with explicit dimensions, not hierarchical paths. Consider monitoring HTTP requests. The hierarchical approach forces you to choose an order:

# Hierarchical (inflexible)
http.requests.GET.200.count
http.requests.GET.404.count
http.requests.POST.200.count

Want to aggregate by method regardless of status code? You need to query multiple paths. Want to add a new dimension (region)? You need to restructure everything.

The label-based approach solves this:

http_requests_total{method="GET", status="200"}
http_requests_total{method="GET", status="404"}
http_requests_total{method="POST", status="200"}

Now you can aggregate by any dimension or combination: “all GETs”, “all 404s”, “POST requests in us-east”, etc. This flexibility is why Prometheus, InfluxDB 2.x, VictoriaMetrics, and M3DB all use variations of this model.

Time series identifier (metric + labels):

type Series struct {
    Metric string            // e.g., "http_requests_total"
    Labels map[string]string // e.g., {"method": "GET", "status": "200"}
}

Design decisions:

String-based labels: Why not typed values? Simplicity and universality. All exporters can emit strings without type negotiation. The cost is storage (strings are verbose), mitigated by symbol tables (§4.2).
Label naming conventions: Prometheus established conventions that became industry standards:
- __name__ is a special label containing the metric name (allows treating metric name as just another label for queries)
- __ prefix reserved for internal labels (e.g., __address__, __scheme__)
- Snake_case for metric names (http_requests_total, not httpRequestsTotal)
- Label names must match [a-zA-Z_][a-zA-Z0-9_]*
What goes in labels vs metric name?
- Metric name: What you’re measuring (http_requests_total, cpu_usage_seconds)
- Labels: Dimensions of that measurement (method, status, instance, job)
- Anti-pattern: Encoding dimensions in metric name (http_requests_get_total, http_requests_post_total)—this breaks aggregation

Sample (timestamp + value):

type Sample struct {
    Timestamp int64   // Unix nanoseconds
    Value     float64 // IEEE 754 double precision
}

Design decisions:

Why int64 for timestamps? Unix nanoseconds fit in 64 bits until year 2262. Alternative: milliseconds (smaller range) or time.Time (80 bytes!). Nanosecond precision enables high-frequency metrics (10kHz+) and precise ordering.
Why float64 for values? Almost all metrics are numeric (counters, gauges, histograms). Float64 provides:
- Large dynamic range (from $10^{-308}$ to $10^{308}$)
- Sufficient precision (15-17 decimal digits)
- Universal IEEE 754 format (portable across systems)
Trade-off: No exact integer representation above $2^{53}$ (9 quadrillion). For precise large integers, some TSDBs offer int64 value type (InfluxDB), but 99.9% of metrics work fine with float64.
Why not store units or types? Debated in early Prometheus design. Omitted for simplicity—units are convention (metric name like bytes_total implies bytes). InfluxDB tried explicit field types (float, int, string, bool) but added complexity without clear benefits.

Metadata and annotations:

Real production systems need more than just (metric, labels, timestamp, value). Consider:

type SeriesMetadata struct {
    Series      Series            // Metric + labels
    FirstSeen   int64             // First sample timestamp
    LastSeen    int64             // Last sample timestamp
    SampleCount uint64            // Total samples written
    Resolution  time.Duration     // Scrape interval (if known)

    // Optional: Type hints for better querying
    Type        MetricType        // Counter, Gauge, Histogram, Summary
    Unit        string            // "bytes", "seconds", "ratio"
    Help        string            // Human-readable description

    // Operational metadata
    CreatedBy   string            // Job/exporter that created this series
    Retention   time.Duration     // How long to keep this series
}

type MetricType int
const (
    TypeUnknown MetricType = iota
    TypeCounter   // Monotonically increasing (e.g., http_requests_total)
    TypeGauge     // Can go up or down (e.g., memory_usage_bytes)
    TypeHistogram // Distribution of values
    TypeSummary   // Pre-computed quantiles
)

Why metadata matters:

Type hints: Knowing a metric is a counter enables automatic rate calculations: rate(http_requests_total[5m]) is safe because counters only increase
Resolution: Enables intelligent downsampling—don’t average 1s resolution data to 1s!
Retention policies: Different metrics have different value lifespans (SLA metrics: 2 years; debug metrics: 7 days)
Discovery: Help text makes metrics self-documenting for humans and UIs

Real-world implementations:

Prometheus: Stores type and help text in separate lookup (not per sample). Labels are sorted by key for deterministic series ID hashing.
InfluxDB: Uses “measurements” (like metric names) + “tags” (immutable labels) + “fields” (values that can have different types)
VictoriaMetrics: Compatible with Prometheus model, adds optional retention per metric via __retention__ label
M3DB: Namespaces provide coarse-grained separation; series within namespace follow Prometheus model
TimescaleDB: Maps time series to SQL: metric becomes table name, labels become indexed columns, samples become rows

Cardinality explosion:

The flexibility of labels comes with danger. For $m$ label keys with $n$ values each:

$$ |\text{Series}| = n^m $$

Example: 5 labels with 10 values each = $10^5 = 100{,}000$ series.

Cardinality horror stories:

The user_id label mistake: Adding user_id as a label for a service with 10M users creates 10M series per metric. One team at a large company accidentally created 2 billion series, causing a 24-hour outage. Rule: Labels must have bounded cardinality.
The timestamp-in-label antipattern: Some users tried timestamp="2024-01-15T10:30:00" as a label, creating a new series per second. This is fundamentally misunderstanding the model—timestamps are in samples, not labels.
IP addresses: client_ip as a label seems reasonable until you have 100K unique clients per hour. Better: bucket into ranges or omit entirely.

Cardinality management best practices:

Label values should be bounded: Good: status="200", method="GET" (finite set). Bad: user_id="12345", trace_id="abc123" (unbounded).
Use label value allowlists: Many TSDBs support regex filtering on ingestion to drop high-cardinality labels.
Monitor cardinality growth: Prometheus’s prometheus_tsdb_symbol_table_size_bytes tracks unique label values. Alert on rapid growth.
Design for aggregation: If you can’t query sum(metric) without labels, your labels are likely over-specified.
When in doubt, use logging: Not everything belongs in metrics. Individual user events belong in logs or traces, not time series.

Storage Architecture

LSM-Tree Foundation

The storage engine is the heart of any database, and for time series workloads, the choice is critical. Most production TSDBs use Log-Structured Merge Trees (LSM-trees), a data structure invented by Patrick O’Neil in 1996 that trades read performance for exceptional write throughput. This trade-off is perfect for metrics workloads.

Why LSM-trees dominate TSDBs: Traditional B-tree storage engines (used in MySQL, PostgreSQL) are optimized for random reads and updates. They maintain a sorted on-disk structure that requires expensive random I/O for every write. When you’re ingesting millions of samples per second, this becomes a bottleneck—even with SSD, random writes are 100× slower than sequential writes.

LSM-trees flip the model: writes go to an in-memory buffer (MemTable), which is periodically flushed to disk as an immutable sorted file (SSTable). All writes are sequential, achieving near-theoretical disk bandwidth. The cost is paid on reads, which must check multiple SSTables, but caching and clever indexing mitigate this.

Production adoption:

Prometheus: Custom LSM-based TSDB (2017 redesign)
VictoriaMetrics: Optimized LSM variant with merge-free architecture
M3DB (Uber): LSM-tree with custom commitlog
InfluxDB IOx: Apache Arrow-based columnar engine (newer generation)
Cassandra (used by many TSDBs): LevelDB-style LSM-tree

flowchart TD
    Write[Write Request] --> WAL[Write-Ahead Log]
    WAL --> MemTable[Active MemTable<br/>in-memory sorted]
    MemTable --> |Full| Immutable[Immutable MemTable]
    Immutable --> Flush[Flush to Disk]
    Flush --> L0[Level 0 SSTables<br/>overlapping ranges]
    L0 --> |Compact| L1[Level 1 SSTables<br/>non-overlapping]
    L1 --> |Compact| L2[Level 2 SSTables]

    Query[Query Request] --> MemTable
    Query --> Immutable
    Query --> L0
    Query --> L1
    Query --> L2

Write path complexity:

$$ O(\log n) \quad \text{(MemTable insert into sorted structure)} $$

Read path complexity (worst case):

$$ O(k \log n) \quad \text{where } k = \text{number of SSTables} $$

Time-Based Partitioning

One of the most important architectural decisions in TSDB design is how to partition data. Unlike traditional databases that might shard by hash or range of a key, TSDBs partition by time. This leverages the temporal access patterns—queries almost always specify a time range, and data lifecycle management (retention policies) operates on time boundaries.

The block concept: Prometheus pioneered the “block” architecture in their 2.0 TSDB redesign (2017). Rather than treating all data as one continuous sorted stream, data is chunked into time-bounded, immutable blocks. Each block is a self-contained directory with its own index and compressed chunks, typically covering 2 hours of data.

Why 2-hour blocks? This is an empirically-derived sweet spot balancing several trade-offs:

Compaction cost: Smaller blocks mean more frequent compaction (merging multiple blocks into larger ones), wasting I/O
Query overhead: Larger blocks mean fewer index lookups but slower scans for narrow time windows
Memory pressure: The active block’s index lives in memory; larger blocks consume more RAM
Retention efficiency: Deleting old data becomes a simple directory removal

Prometheus tested block sizes from 30 minutes to 24 hours and found 2 hours optimal for most workloads. VictoriaMetrics uses variable block sizes (1 hour, 1 day, 1 month) in a tiered scheme. Partition data into blocks by time ranges (typically 2 hours):

type Block struct {
    MinTime   int64  // Start of time range (inclusive)
    MaxTime   int64  // End of time range (exclusive)
    Directory string // /data/01HQXYZ...
    Index     *Index // Inverted index: labels → series IDs
    Chunks    *ChunkStore
}

Block size optimization:

$$ T\_{\text{block}} = \sqrt{\frac{2 \cdot C\_{\text{create}}}{R\_{\text{write}} \cdot C\_{\text{query}}}} $$

where:

$C\_{\text{create}}$: Cost to create new block (compaction overhead)
$R\_{\text{write}}$: Write rate (samples/sec)
$C\_{\text{query}}$: Per-block query overhead

For Prometheus: $T\_{\text{block}} = 2$ hours (empirically optimal).

Storage layout:

/data/
  01HQXYZ.../          # Block 1 (timestamp-based ID)
    meta.json          # Metadata (time range, stats)
    index              # Inverted index
    chunks/
      000001           # Chunk file 1
      000002           # Chunk file 2
  01HQXZA.../          # Block 2
    ...

Chunk Storage Format

Group samples for same series into chunks (default 120 samples):

type Chunk struct {
    SeriesID  uint64   // Series reference
    MinTime   int64
    MaxTime   int64
    Encoding  byte     // Compression algorithm
    Data      []byte   // Compressed samples
}

Chunk size optimization:

$$ n\_{\text{samples}} = \arg\min_{n} \left( \frac{S\_{\text{overhead}}}{n} + C\_{\text{compress}}(n) + C\_{\text{decompress}}(n) \right) $$

where $S\_{\text{overhead}}$ is fixed per-chunk metadata (16 bytes).

Typical optimal: $n = 120$ samples (2 minutes at 1s scrape interval).

Compression Algorithms

Compression is not optional in time series databases—it’s essential for economic viability. Storing uncompressed metrics at scale is prohibitively expensive. A 1-second scrape interval for 1 million series generates 16 bytes/sample × 86,400 samples/day × 1M series = 1.4 TB/day. At AWS EBS pricing ($0.10/GB/month), that’s $4,200/month just for one day of data!

Fortunately, time series data has exploitable structure. Unlike random binary data (images, encrypted files), metrics exhibit predictable patterns: timestamps arrive at regular intervals, and values change gradually (CPU rarely jumps from 10% to 90% in one second). The compression algorithms used by Prometheus, Gorilla (Facebook), and VictoriaMetrics achieve 10–20× compression by exploiting these properties.

Timestamp Compression (Delta-of-Delta)

The insight: Metrics are scraped at regular intervals (every 15 seconds, every 1 minute). If scraping is reliable, consecutive timestamps have nearly identical deltas. Instead of storing 64-bit Unix timestamps, we can store the change in the delta, which is often zero!

Delta encoding: Store differences instead of absolute values.

$$ \Delta t_i = t_i - t\_{i-1} $$

For 15-second intervals: $\Delta t\_i = 15{,}000{,}000{,}000$ nanoseconds (still 64 bits).

Delta-of-delta (Gorilla, Facebook 2015):

$$ \Delta^2 t_i = \Delta t_i - \Delta t\_{i-1} $$

For perfectly regular 15s intervals, $\Delta^2 t\_i = 0$ (can encode in 1 bit!). Even with clock jitter (±50ms), the delta-of-delta is tiny and compressible with variable-bit encoding.

Why this works: Facebook’s Gorilla paper (2015) observed that 96% of timestamp delta-of-deltas in production were zero, meaning they could be encoded in 1 bit. This technique was subsequently adopted by Prometheus, VictoriaMetrics, and InfluxDB.

Implementation:

type TimestampEncoder struct {
    t0       int64 // First timestamp (full 64 bits)
    tDelta   int64 // Previous delta
    buf      *bitstream.Writer
}

func (e *TimestampEncoder) Encode(t int64) {
    if e.t0 == 0 {
        e.t0 = t
        e.buf.WriteBits(uint64(t), 64)
        return
    }

    delta := t - e.t0
    e.t0 = t

    if e.tDelta == 0 {
        e.tDelta = delta
        e.buf.WriteBits(uint64(delta), 64)
        return
    }

    deltaOfDelta := delta - e.tDelta
    e.tDelta = delta

    // Variable-bit encoding based on deltaOfDelta magnitude
    switch {
    case deltaOfDelta == 0:
        e.buf.WriteBits(0b0, 1) // 1 bit: no change
    case deltaOfDelta >= -63 && deltaOfDelta <= 64:
        e.buf.WriteBits(0b10, 2)
        e.buf.WriteBits(uint64(deltaOfDelta), 7) // 9 bits total
    case deltaOfDelta >= -255 && deltaOfDelta <= 256:
        e.buf.WriteBits(0b110, 3)
        e.buf.WriteBits(uint64(deltaOfDelta), 9) // 12 bits total
    case deltaOfDelta >= -2047 && deltaOfDelta <= 2048:
        e.buf.WriteBits(0b1110, 4)
        e.buf.WriteBits(uint64(deltaOfDelta), 12) // 16 bits total
    default:
        e.buf.WriteBits(0b1111, 4)
        e.buf.WriteBits(uint64(deltaOfDelta), 32) // 36 bits total
    }
}

Compression ratio: For regular 15s intervals:

$$ R\_{\text{compress}} = \frac{64 \text{ bits}}{1 \text{ bit}} = 64\times $$

Average bits per timestamp: $1.37$ bits (Prometheus production data).

Value Compression (XOR Float Encoding)

Timestamp compression gives us 46× savings, but values are trickier. Metrics are IEEE 754 double-precision floats (64 bits), and traditional compression algorithms like gzip struggle with floating-point data because small numerical changes create large binary differences.

The clever observation from Gorilla: Consecutive metric values have similar bit patterns. CPU usage doesn’t instantly jump from 23.456% to 91.123%—it drifts gradually. When we XOR two similar floats, the result has many leading zeros (similar exponent) and trailing zeros (similar mantissa).

XOR encoding (Gorilla):

$$ x_i = v_i \oplus v\_{i-1} $$

where $\oplus$ is bitwise XOR. The magic: if values are close, XOR produces a result with long runs of zeros, which we can represent compactly.

Example: CPU at 45.3% → 45.7%

45.3 as float64: 0x4046999999999999
45.7 as float64: 0x4046CCCCCCCCCCCD
XOR result: 0x0000555555555554 (many leading zeros!)

We store only the meaningful bits (the “control word” indicating how many leading/trailing zeros, plus the significant middle bits). For slowly-changing metrics like temperature or CPU, this achieves 5–8× compression.

Implementation:

type FloatEncoder struct {
    prevValue    uint64 // Previous value as uint64 bits
    prevLeading  uint8  // Leading zeros from previous XOR
    prevTrailing uint8  // Trailing zeros from previous XOR
    buf          *bitstream.Writer
}

func (e *FloatEncoder) Encode(value float64) {
    bits := math.Float64bits(value)

    if e.prevValue == 0 {
        e.prevValue = bits
        e.buf.WriteBits(bits, 64)
        return
    }

    xor := bits ^ e.prevValue
    e.prevValue = bits

    if xor == 0 {
        e.buf.WriteBits(0b0, 1) // 1 bit: value unchanged
        return
    }

    leading := uint8(bits.LeadingZeros64(xor))
    trailing := uint8(bits.TrailingZeros64(xor))

    // If new XOR has same or better leading/trailing zeros, reuse control bits
    if leading >= e.prevLeading && trailing >= e.prevTrailing {
        e.buf.WriteBits(0b10, 2) // Control bit: use previous
        significantBits := 64 - e.prevLeading - e.prevTrailing
        e.buf.WriteBits(xor>>e.prevTrailing, significantBits)
    } else {
        e.prevLeading = leading
        e.prevTrailing = trailing
        e.buf.WriteBits(0b11, 2) // Control bit: new leading/trailing
        e.buf.WriteBits(uint64(leading), 5)
        significantBits := 64 - leading - trailing
        e.buf.WriteBits(uint64(significantBits), 6)
        e.buf.WriteBits(xor>>trailing, significantBits)
    }
}

Compression ratio: Slowly-changing metrics (CPU usage):

$$ R\_{\text{compress}} = \frac{64 \text{ bits}}{12 \text{ bits}} \approx 5.3\times $$

Average bits per value: $1.37$ bits (timestamp) + $12$ bits (value) = $13.37$ bits/sample vs 128 bits uncompressed = 9.6× compression.

Indexing Strategy

Inverted Index

Efficient queries are impossible without proper indexing. A naive approach—scanning all series to find matches—becomes prohibitively expensive at scale. Querying http_requests_total{status="200", method="GET"} across 10 million series would require checking every series’ labels, even though only thousands match.

The inverted index solution: Borrowed from full-text search engines (Lucene, Elasticsearch), an inverted index maps each label value to the series IDs containing it. This is the opposite of the forward direction (series → labels), hence “inverted.”

This was a key innovation in Prometheus 2.0’s TSDB redesign. Earlier versions used LevelDB for indexing, which became a bottleneck under high cardinality. The new inverted index reduced query times by 100× for label-heavy queries.

How other TSDBs handle indexing:

InfluxDB: Initially used simple B-trees, struggled with cardinality, switched to Time-Structured Merge trees in 2.x
VictoriaMetrics: Multi-level inverted index with bloom filters
M3DB: Per-block indexes with LRU caching
TimescaleDB: PostgreSQL’s native B-tree indexes (works because it’s on top of Postgres)

Map labels to series IDs for efficient filtering:

type Index struct {
    // Posting lists: label → sorted list of series IDs
    postings map[string]map[string][]uint64

    // Series metadata: series ID → labels
    series map[uint64]Labels

    // Symbol table for string deduplication
    symbols *SymbolTable
}

type Labels []Label

type Label struct {
    Name  string
    Value string
}

Query: {__name__="http_requests_total", method="GET", status="200"}

Execution plan:

Fetch posting lists for each matcher
Intersect posting lists (series that match ALL labels)

func (idx *Index) Query(matchers []*Matcher) []uint64 {
    var postingLists [][]uint64

    for _, m := range matchers {
        key := m.Name + "=" + m.Value
        postingLists = append(postingLists, idx.postings[m.Name][m.Value])
    }

    // Intersect all posting lists (sorted)
    return intersectSorted(postingLists)
}

func intersectSorted(lists [][]uint64) []uint64 {
    if len(lists) == 0 {
        return nil
    }

    // Start with smallest list for efficiency
    sort.Slice(lists, func(i, j int) bool {
        return len(lists[i]) < len(lists[j])
    })

    result := lists[0]
    for i := 1; i < len(lists); i++ {
        result = intersectTwo(result, lists[i])
    }
    return result
}

func intersectTwo(a, b []uint64) []uint64 {
    result := make([]uint64, 0, min(len(a), len(b)))
    i, j := 0, 0

    for i < len(a) && j < len(b) {
        if a[i] == b[j] {
            result = append(result, a[i])
            i++
            j++
        } else if a[i] < b[j] {
            i++
        } else {
            j++
        }
    }
    return result
}

Intersection complexity:

$$ O\left(\sum_{i=1}^{k} |L_i|\right) $$

where $k$ is number of matchers, $|L\_i|$ is posting list size.

Index Persistence Format

On-disk index structure (Prometheus format):

index file:
┌────────────────────────────┐
│ Symbol Table               │  String deduplication
├────────────────────────────┤
│ Series (offset table)      │  series ID → labels
├────────────────────────────┤
│ Label Indices              │  Inverted index
├────────────────────────────┤
│ Postings                   │  Compressed posting lists
├────────────────────────────┤
│ Label Offset Table         │  Fast label lookup
├────────────────────────────┤
│ Postings Offset Table      │  Fast postings lookup
├────────────────────────────┤
│ TOC (Table of Contents)    │  File structure offsets
└────────────────────────────┘

Symbol table compression: Deduplicate repeated strings.

For 1M series with labels {job="api", instance="host123", region="us-east"}:

Uncompressed: $\sim 100$ bytes/series × $10^6$ = 100 MB
Symbol table: ~10 KB symbols + 4 bytes/reference × 3 labels × $10^6$ = 12 MB

Compression ratio: $\sim 8\times$

Write Path Optimization

The write path must handle millions of samples per second with low latency while guaranteeing durability. This requires careful engineering—naive implementations bottleneck on disk I/O or lock contention.

Write-Ahead Log (WAL)

Purpose: Durability and crash recovery.

Every database faces a fundamental trade-off: durability versus performance. Holding data only in memory (MemTable) risks data loss on crashes. Writing every sample immediately to disk is too slow. The write-ahead log (WAL) solves this by providing a lightweight durability mechanism.

How it works: Before acknowledging a write, append it to a sequential log file (the WAL). This is a fast sequential write—on modern SSDs, sequential writes achieve 500+ MB/s vs 20 MB/s for random writes. If the process crashes, we replay the WAL on restart to reconstruct the in-memory state.

Production implementations:

Prometheus: Uses a segment-based WAL with 128 MB segments, similar to Kafka’s log structure
M3DB: Custom commitlog with configurable fsync policies (sync every write, batch sync every 100ms, etc.)
VictoriaMetrics: Simplified WAL design leveraging OS page cache for buffering
Cassandra: Each node has a separate commitlog with configurable flush intervals

type WAL struct {
    dir        string
    mu         sync.Mutex
    curSegment *os.File
    segmentN   int
}

type WALRecord struct {
    Type      byte   // 1=Series, 2=Samples, 3=Tombstone
    SeriesRef uint64
    Labels    Labels
    Samples   []Sample
}

func (w *WAL) Append(rec *WALRecord) error {
    w.mu.Lock()
    defer w.mu.Unlock()

    // Encode record with length prefix + CRC32 checksum
    data := encodeRecord(rec)

    if w.curSegment.Size() > segmentSize {
        w.rotateSegment()
    }

    _, err := w.curSegment.Write(data)
    return err
}

WAL segment size: 128 MB (trade-off between recovery time and file handles).

Write amplification:

$$ \text{WA} = \frac{\text{bytes written to disk}}{\text{bytes written by user}} = \frac{\text{WAL} + \text{Compaction}}{\text{Samples}} $$

For LSM-tree with WAL: $\text{WA} \approx 2$–$5$ (Prometheus production).

Batched Writes

Problem: Individual writes incur high syscall overhead.

Solution: Batch samples by time window:

type WriteBuffer struct {
    mu      sync.Mutex
    samples map[uint64][]Sample // seriesID → samples
    size    int
    maxSize int // 10,000 samples
}

func (wb *WriteBuffer) Add(seriesID uint64, s Sample) {
    wb.mu.Lock()
    defer wb.mu.Unlock()

    wb.samples[seriesID] = append(wb.samples[seriesID], s)
    wb.size++

    if wb.size >= wb.maxSize {
        wb.flush()
    }
}

func (wb *WriteBuffer) flush() {
    // Write entire batch to WAL + MemTable atomically
    batch := wb.samples
    wb.samples = make(map[uint64][]Sample)
    wb.size = 0

    go func() {
        wal.AppendBatch(batch)
        memtable.InsertBatch(batch)
    }()
}

Throughput improvement:

$$ T\_{\text{batched}} = T\_{\text{unbatched}} \times \frac{B}{1 + O\_{\text{syscall}} \cdot B} $$

where $B$ is batch size, $O\_{\text{syscall}}$ is syscall overhead ratio.

For $B = 10{,}000$, $O\_{\text{syscall}} = 0.01$: ~10× throughput gain.

Query Execution Engine

Query Language

PromQL-style syntax:

rate(http_requests_total{status="200"}[5m])

Abstract syntax tree:

type Expr interface {
    Eval(ctx *EvalContext) Vector
}

type VectorSelector struct {
    Name     string
    Matchers []*Matcher
    Offset   time.Duration
    Range    time.Duration // For range vectors
}

type AggregateExpr struct {
    Op      AggregateOp // SUM, AVG, MAX, MIN, etc.
    Expr    Expr
    Grouping []string
}

type BinaryExpr struct {
    LHS Expr
    Op  BinaryOp // +, -, *, /, etc.
    RHS Expr
}

Query Planner

Optimization passes:

Pushdown: Move filters close to data source
Predicate reordering: Evaluate selective filters first
Chunk pruning: Skip chunks outside query time range

type QueryPlan struct {
    TimeRange TimeRange
    Matchers  []*Matcher
    Aggregations []Aggregation
}

func (qp *QueryPlan) Optimize() {
    // 1. Time-based pruning
    qp.pruneBlocks()

    // 2. Selectivity-based matcher ordering
    sort.Slice(qp.Matchers, func(i, j int) bool {
        return qp.Matchers[i].Selectivity() < qp.Matchers[j].Selectivity()
    })

    // 3. Pushdown aggregations
    qp.pushdownAggregations()
}

Block pruning:

$$ \text{Blocks\_scanned} = \lceil \frac{T\_{\text{query}}}{T\_{\text{block}}} \rceil $$

For 6-hour query with 2-hour blocks: 3 blocks scanned.

Parallel Query Execution

Fan-out to multiple blocks:

func (qe *QueryEngine) Select(ctx context.Context, matchers []*Matcher, tr TimeRange) SeriesSet {
    blocks := qe.blockStore.Blocks(tr)

    var wg sync.WaitGroup
    results := make(chan SeriesSet, len(blocks))

    // Query each block in parallel
    for _, block := range blocks {
        wg.Add(1)
        go func(b *Block) {
            defer wg.Done()
            results <- b.Query(ctx, matchers, tr)
        }(block)
    }

    go func() {
        wg.Wait()
        close(results)
    }()

    // Merge results from all blocks
    return MergeSeriesSets(results)
}

Query latency:

$$ L\_{\text{query}} = L\_{\text{index}} + \max_{i=1}^{k} L\_{\text{block}_i} + L\_{\text{merge}} $$

where parallelism reduces $\max L\_{\text{block}_i}$ vs $\sum L\_{\text{block}_i}$.

Aggregation and Downsampling

Stream Aggregation

Problem: Storing every sample is expensive for high-frequency data.

Solution: Aggregate samples on-the-fly during ingestion.

type StreamAggregator struct {
    interval time.Duration // 1 minute
    current  map[uint64]*AggregateWindow
}

type AggregateWindow struct {
    StartTime int64
    Count     int64
    Sum       float64
    Min       float64
    Max       float64
    SumSquare float64 // For stddev
}

func (sa *StreamAggregator) Add(seriesID uint64, s Sample) {
    windowStart := (s.Timestamp / sa.interval.Nanoseconds()) * sa.interval.Nanoseconds()

    window := sa.current[seriesID]
    if window == nil || window.StartTime != windowStart {
        // Flush previous window
        if window != nil {
            sa.flush(seriesID, window)
        }
        window = &AggregateWindow{StartTime: windowStart, Min: s.Value, Max: s.Value}
        sa.current[seriesID] = window
    }

    window.Count++
    window.Sum += s.Value
    window.Min = min(window.Min, s.Value)
    window.Max = max(window.Max, s.Value)
    window.SumSquare += s.Value * s.Value
}

func (aw *AggregateWindow) Mean() float64 {
    return aw.Sum / float64(aw.Count)
}

func (aw *AggregateWindow) StdDev() float64 {
    mean := aw.Mean()
    variance := (aw.SumSquare / float64(aw.Count)) - (mean * mean)
    return math.Sqrt(variance)
}

Storage reduction:

For 1s samples aggregated to 1min windows:

$$ R\_{\text{reduction}} = \frac{60 \text{ samples}}{5 \text{ stats}} = 12\times $$

Multi-Resolution Storage

Tiered retention (VictoriaMetrics approach):

Raw data: Last 7 days (1s resolution)
5min aggregates: 7–90 days
1hour aggregates: 90 days–2 years

type RetentionPolicy struct {
    Resolution time.Duration
    Retention  time.Duration
}

var retentionPolicies = []RetentionPolicy{
    {Resolution: 0, Retention: 7 * 24 * time.Hour},         // Raw
    {Resolution: 5 * time.Minute, Retention: 90 * 24 * time.Hour},  // 5min
    {Resolution: 1 * time.Hour, Retention: 730 * 24 * time.Hour},   // 1hour
}

Query routing:

func (db *TSDB) selectResolution(queryRange time.Duration) time.Duration {
    // Use highest resolution that fits in retention
    for _, policy := range retentionPolicies {
        if queryRange <= policy.Retention {
            return policy.Resolution
        }
    }
    return retentionPolicies[len(retentionPolicies)-1].Resolution
}

Storage savings:

$$ S\_{\text{total}} = S\_{\text{raw}} \times \left(1 + \frac{1}{12} + \frac{1}{720}\right) \approx 1.08 \times S\_{\text{raw}} $$

Only 8% overhead for long-term retention!

Compaction Strategy

Over time, the number of blocks grows without bound—every 2 hours creates a new block. This creates problems: queries must scan hundreds of small blocks, each requiring index lookups and file I/O. Compaction merges small blocks into larger ones, reducing query overhead.

Leveled Compaction

Goal: Reduce number of blocks to query while maintaining reasonable write amplification.

Compaction is a classic database trade-off: too aggressive and you waste I/O rewriting the same data repeatedly; too lazy and queries become slow scanning many small files. The solution is leveled compaction, pioneered by LevelDB (Google) and adopted by most LSM-based systems.

The strategy: Group blocks into “levels” by age and size. Level 0 contains small, recent blocks (2 hours each). Level 1 merges groups of L0 blocks (6 hours). Level 2 merges L1 blocks (24 hours), and so on. This creates a pyramid structure where each level is ~10× larger than the previous.

Real-world approaches:

Prometheus: Simple 3-level compaction (2h → 6h → 24h+)
VictoriaMetrics: Aggressive merging with configurable retention tiers
Cassandra: Size-tiered compaction (STCS) or time-window compaction (TWCS) for time series
M3DB: Per-namespace compaction policies with configurable windows

flowchart LR
    L0_1[Block 1<br/>0-2h] --> C1[Compact]
    L0_2[Block 2<br/>2-4h] --> C1
    L0_3[Block 3<br/>4-6h] --> C1
    C1 --> L1[Merged Block<br/>0-6h]

    L1 --> C2[Compact]
    L1_2[Block<br/>6-12h] --> C2
    C2 --> L2[Merged Block<br/>0-12h]

Compaction algorithm:

func (c *Compactor) Compact(blocks []*Block) (*Block, error) {
    // 1. Merge indices (union of all series)
    mergedIndex := mergeIndices(blocks)

    // 2. Merge chunks (for each series, concatenate chunks across blocks)
    mergedChunks := make(map[uint64][]*Chunk)
    for _, block := range blocks {
        for seriesID, chunks := range block.Chunks {
            mergedChunks[seriesID] = append(mergedChunks[seriesID], chunks...)
        }
    }

    // 3. Re-compress chunks (optional: rewrite with better compression)
    for seriesID, chunks := range mergedChunks {
        mergedChunks[seriesID] = recompressChunks(chunks)
    }

    // 4. Write new block
    newBlock := &Block{
        MinTime: blocks[0].MinTime,
        MaxTime: blocks[len(blocks)-1].MaxTime,
        Index:   mergedIndex,
        Chunks:  mergedChunks,
    }

    return writeBlock(newBlock)
}

Compaction overhead:

$$ C\_{\text{overhead}} = \frac{R\_{\text{write}} \times \text{WA}}{R\_{\text{ingest}}} $$

For $\text{WA} = 3$: 3× read + write during compaction.

Compaction Scheduling

Trade-off: Compact too often → high I/O, compact too rarely → too many blocks.

Optimal compaction frequency (M3DB model):

$$ f\_{\text{compact}} = \frac{R\_{\text{query}}}{k \times C\_{\text{compact}}} $$

where $k$ is acceptable number of blocks, $C\_{\text{compact}}$ is compaction cost.

Typical schedule: Compact every 2 hours (aligns with block size).

Memory Management

Memory is the scarcest and most expensive resource in a time series database. While disk storage costs pennies per gigabyte, RAM costs dollars per gigabyte—a 100× difference. Yet memory determines query performance: a cache miss means waiting 10-100ms for disk I/O versus <1ms for memory access. The challenge is allocating limited memory across competing needs: recent writes (MemTable), frequently accessed data (block cache), and index structures (posting lists, label lookup tables).

Why memory management dominates TSDB performance:

Time series workloads exhibit extreme data volume combined with specific access patterns. A monitoring system collecting metrics every 15 seconds from 100,000 servers generates 23 million samples per hour—roughly 370 MB of raw data. Over 30 days, this accumulates to 260 GB. No production system can keep this entirely in memory, yet queries often request the “last 24 hours” or “compare this week to last week,” requiring random access across gigabytes of historical data.

The tension is between write amplification and read performance. Writes arrive continuously and must be acknowledged quickly (sub-millisecond latency) to avoid backpressure on metrics collectors. This demands in-memory buffering. But once written to disk, reads dominate the workload—dashboards refreshing every 5 seconds, alerting rules evaluating every minute, ad-hoc queries from engineers debugging production issues. Without caching, every query hits disk, and latency balloons to seconds or minutes.

Memory allocation strategies and their trade-offs:

Production TSDBs face a fundamental allocation problem: given $M$ bytes of memory, how much goes to writes (MemTable) versus reads (block cache)? The answer depends on workload characteristics:

Allocation Strategy	MemTable Size	Block Cache Size	Best For	Drawback
Write-heavy	1 GB (67%)	512 MB (33%)	High ingestion rate (>1M samples/sec)	Poor read latency; frequent cache misses
Balanced	512 MB (33%)	1 GB (67%)	Typical production (monitoring dashboards)	Moderate in both dimensions
Read-heavy	256 MB (17%)	1.25 GB (83%)	Query-dominant workload (analysts, SRE investigations)	Write stalls; frequent flushes

Write-heavy allocation minimizes flush frequency, reducing compaction overhead. If the MemTable can buffer 10 minutes of data instead of 2 minutes, you perform 5× fewer flushes, reducing I/O pressure. However, a small block cache means queries frequently miss and hit disk, inflating P99 latency. Prometheus defaults to this strategy because it prioritizes write throughput and assumes most queries are recent (cached).

Read-heavy allocation maximizes cache hit rate for queries, potentially achieving 90%+ hit rates on hot data. But frequent MemTable flushes create write amplification: samples are written to disk multiple times as SSTables are compacted. InfluxDB leans toward this because its users run complex analytical queries that benefit from caching.

Balanced allocation (512 MB MemTable, 1+ GB cache) suits general-purpose workloads where both writes and reads matter. This is the sweet spot for most production deployments.

The MemTable durability dilemma:

A purely in-memory MemTable is fast but volatile—if the process crashes before flushing to disk, recent writes are lost. Production systems address this with write-ahead logging (WAL): every write is appended to a durable log before being inserted into the MemTable. On recovery, the WAL is replayed to reconstruct the MemTable. This adds write latency (fsync to disk) but prevents data loss.

Alternatives and their trade-offs:

No WAL (VictoriaMetrics in some configs): Fastest writes, but crashes lose unflushed data. Acceptable if metrics are retried or eventual consistency is tolerable.
Synchronous WAL (Prometheus): Every write fsyncs to disk before acknowledging. Guarantees durability but limits write throughput to ~10K samples/sec per disk.
Batched WAL (M3DB): Buffer writes in memory, fsync every 10ms. Balances throughput (100K+ samples/sec) against durability (up to 10ms of data loss on crash).

Block cache eviction policies:

Not all cached blocks are equally valuable. A dashboard querying the “last 1 hour” repeatedly should keep those blocks hot, while a one-off query for “3 months ago” shouldn’t evict hot data. LRU (Least Recently Used) is standard because recent queries predict future queries—if a dashboard refreshes every 30 seconds, the blocks it accessed 30 seconds ago will be accessed again soon. However, LRU has failure modes:

Scan-resistant: A single large query scanning 30 days of data can evict the entire cache, degrading performance for all subsequent queries. LRU-2 or ARC (Adaptive Replacement Cache) mitigate this by tracking access frequency in addition to recency.
Correlated misses: If 100 dashboards all query the same time range and it’s not cached, they all hit disk simultaneously, creating a latency spike. Prefetching solves this: when loading block $i$, also load block $i+1$ if the query likely needs it.

Production systems often use tiered caching:

L1: In-process cache (Go map + LRU, ~1 GB): Zero serialization cost, nanosecond access
L2: Shared Redis/Memcached (~10 GB): Survives process restarts, shared across replicas
L3: Disk (SSD) (~1 TB): Orders of magnitude cheaper than RAM but 100× slower

This hierarchy mirrors CPU caches (L1/L2/L3), with each tier trading capacity for latency.

MemTable Design

In-memory sorted structure for recent writes:

type MemTable struct {
    mu      sync.RWMutex
    data    map[uint64]*SeriesBuffer // seriesID → samples
    size    int64
    maxSize int64 // 512 MB
}

type SeriesBuffer struct {
    samples []Sample
}

func (mt *MemTable) Insert(seriesID uint64, s Sample) error {
    mt.mu.Lock()
    defer mt.mu.Unlock()

    buf := mt.data[seriesID]
    if buf == nil {
        buf = &SeriesBuffer{}
        mt.data[seriesID] = buf
    }

    buf.samples = append(buf.samples, s)
    mt.size += 16 // 8 bytes timestamp + 8 bytes value

    if mt.size >= mt.maxSize {
        return ErrMemTableFull
    }

    return nil
}

Memory budget:

$$ M\_{\text{total}} = M\_{\text{active}} + M\_{\text{immutable}} + M\_{\text{block-cache}} $$

Typical allocation:

Active MemTable: 512 MB
Immutable MemTable: 512 MB
Block cache: 2 GB
Total: 3 GB

LRU Block Cache

Cache recently accessed blocks:

type BlockCache struct {
    mu       sync.RWMutex
    cache    map[string]*Block // blockID → Block
    lru      *list.List
    elements map[string]*list.Element
    size     int64
    maxSize  int64 // 2 GB
}

func (bc *BlockCache) Get(blockID string) (*Block, bool) {
    bc.mu.Lock()
    defer bc.mu.Unlock()

    elem, ok := bc.elements[blockID]
    if !ok {
        return nil, false
    }

    // Move to front (most recently used)
    bc.lru.MoveToFront(elem)
    return elem.Value.(*Block), true
}

func (bc *BlockCache) Put(blockID string, block *Block) {
    bc.mu.Lock()
    defer bc.mu.Unlock()

    // Evict LRU entries if over capacity
    for bc.size+block.Size() > bc.maxSize && bc.lru.Len() > 0 {
        oldest := bc.lru.Back()
        bc.evict(oldest)
    }

    elem := bc.lru.PushFront(block)
    bc.elements[blockID] = elem
    bc.cache[blockID] = block
    bc.size += block.Size()
}

Cache hit ratio (empirical Prometheus data):

$$ h = 0.85 \quad \text{(85% of queries hit cache)} $$

Query latency reduction:

$$ L\_{\text{with-cache}} = h \cdot L\_{\text{cache}} + (1 - h) \cdot L\_{\text{disk}} $$

For $L\_{\text{cache}} = 5$ ms, $L\_{\text{disk}} = 100$ ms:

$$ L\_{\text{with-cache}} = 0.85 \times 5 + 0.15 \times 100 = 19.25 \text{ ms} $$

vs 100 ms without cache (5× speedup).

Distributed Architecture

A single server has limits—typically 1–10 million series and 1–10 million samples/sec. Beyond that, you need distribution. Unlike traditional databases where distribution adds complexity mainly for scale, TSDBs benefit from distribution for both capacity and availability.

Design philosophies differ across TSDBs:

Prometheus: Explicitly single-node, uses federation for scale (scrape multiple Prometheus servers)
VictoriaMetrics: Single-node for most cases, optional clustering for massive scale (100M+ series)
M3DB (Uber): Distributed-first, designed for multi-datacenter replication from day one
InfluxDB Enterprise: Master-master replication with anti-entropy
Thanos/Cortex: Builds distributed layer on top of Prometheus using object storage (S3, GCS)

Each approach reflects different priorities: Prometheus values operational simplicity, M3DB values availability, Thanos/Cortex value long-term storage economics.

Sharding Strategy

Sharding distributes data across multiple servers to scale beyond a single machine’s capacity. The sharding strategy determines how data is partitioned—a critical decision that affects query performance, operational complexity, and the ability to rebalance load.

Why sharding is necessary: A single server can handle ~1–10 million series and 1–10 million samples/sec. Beyond that, memory, disk I/O, or CPU become bottlenecks. Sharding provides horizontal scaling—add more servers to increase capacity linearly.

Hash-based sharding by series identifier is the dominant approach:

This is the standard approach used by M3DB, VictoriaMetrics clustering, and Thanos. Hash the series ID (metric + labels) to assign it to a shard. All samples for that series go to the same shard.

func shard(series Series, numShards int) int {
    h := fnv.New64a()
    h.Write([]byte(series.Metric))
    for k, v := range series.Labels {
        h.Write([]byte(k))
        h.Write([]byte(v))
    }
    return int(h.Sum64() % uint64(numShards))
}

Pros:

Deterministic: Same series always goes to same shard (enables efficient querying)
Load balancing: Uniform hash distributes series evenly across shards
Simple: No coordination needed—client can compute shard location
Query efficiency: Queries for a specific series only hit one shard

Cons:

No range queries across series: Can’t efficiently query “all series starting with cpu_”
Rebalancing is hard: Adding/removing shards requires data migration
Hotspots: If a few series dominate write volume, their shards become bottlenecks
Cross-shard queries: Aggregations across all series require scatter-gather to all shards

When to use: When you have many series with relatively uniform write rates. This is the common case for infrastructure monitoring where thousands of servers each emit similar metrics.

Load distribution:

For uniform hash function:

$$ \mathbb{E}[\text{samples per shard}] = \frac{N}{S} $$

where $N$ is total samples, $S$ is number of shards.

Standard deviation (measure of imbalance):

$$ \sigma = \sqrt{\frac{N}{S}} \cdot \sqrt{1 - \frac{1}{S}} \approx \sqrt{\frac{N}{S}} $$

For $N = 10^9$, $S = 100$: $\sigma = 10^5$ (1% variance—acceptable imbalance).

Alternative: Time-based sharding

Some systems (InfluxDB Enterprise, TimescaleDB) shard by time range instead of series hash. Each shard owns a time window (e.g., shard 1: Jan 1–7, shard 2: Jan 8–14).

Pros:

Excellent for time-range queries: Query “last week” hits one shard
Natural retention: Delete old shards to expire data
No hotspots: Write load distributes evenly as time progresses

Cons:

All writes go to one shard: The “current” shard receives 100% of write traffic (bottleneck)
Requires time-partitioned queries: Queries spanning long ranges hit many shards
Compaction complexity: Must compact across shard boundaries for long-term storage

When to use: When queries are overwhelmingly time-bounded and write volume is manageable on a single node. Works well for lower-volume TSDBs with strong time-locality in queries.

Hybrid: Consistent hashing with virtual nodes

Advanced systems (M3DB, Riak) use consistent hashing with virtual nodes to enable gradual rebalancing.

// Each physical shard owns multiple virtual nodes
type ConsistentHash struct {
    ring []VirtualNode
}

type VirtualNode struct {
    Hash  uint64
    Shard int
}

func (ch *ConsistentHash) GetShard(seriesID uint64) int {
    // Binary search ring to find responsible virtual node
    idx := sort.Search(len(ch.ring), func(i int) bool {
        return ch.ring[i].Hash >= seriesID
    })
    if idx == len(ch.ring) {
        idx = 0
    }
    return ch.ring[idx].Shard
}

Pros:

Gradual rebalancing: Adding a shard only relocates $1/N$ of data (vs all data with simple hash)
Minimal disruption: Can rebalance without taking system offline
Even distribution: Virtual nodes smooth out hash imbalances

Cons:

Complexity: Requires ring management and coordination
Lookup overhead: Binary search instead of modulo operation
State coordination: All nodes need consistent view of ring

When to use: Production systems that anticipate cluster scaling. The complexity pays off when you need to add capacity without downtime.

Query Federation

Once data is sharded, queries must be federated—split across shards, executed in parallel, and results merged. This is conceptually simple but has subtle complexity around correctness, latency, and failure handling.

The query federation problem: A user queries sum(http_requests_total{method="GET"}). With hash-based sharding, matching series are scattered across all shards. The coordinator must:

Send the query to all shards
Wait for responses (with timeout)
Merge partial results correctly
Return unified result to client

Why this is challenging:

Partial failures: If shard 3 of 100 is down, do we fail the entire query or return partial results?
Stragglers: The slowest shard determines total latency (tail latency amplification)
Memory pressure: Merging 100 shard results in coordinator’s memory
Correctness: Aggregations (SUM, AVG, P95) have different merge logic

Fan-out queries across shards:

sequenceDiagram
    participant Client
    participant Coordinator
    participant Shard1
    participant Shard2
    participant Shard3

    Client->>Coordinator: Query
    Coordinator->>Shard1: Partial Query
    Coordinator->>Shard2: Partial Query
    Coordinator->>Shard3: Partial Query
    Shard1-->>Coordinator: Results
    Shard2-->>Coordinator: Results
    Shard3-->>Coordinator: Results
    Coordinator->>Coordinator: Merge Results
    Coordinator-->>Client: Final Results

Query coordinator (basic implementation):

func (qc *QueryCoordinator) ExecuteQuery(q *Query) (*Result, error) {
    shards := qc.getAllShards()

    var wg sync.WaitGroup
    results := make(chan *PartialResult, len(shards))

    for _, shard := range shards {
        wg.Add(1)
        go func(s *Shard) {
            defer wg.Done()
            result, err := s.Query(q)
            if err != nil {
                results <- &PartialResult{Error: err}
                return
            }
            results <- &PartialResult{Data: result}
        }(shard)
    }

    go func() {
        wg.Wait()
        close(results)
    }()

    return qc.mergeResults(results, q.Aggregation)
}

Pros:

Parallel execution: Leverages multiple servers’ CPU/disk simultaneously
Linear scalability: Adding shards increases query throughput proportionally
Fault isolation: Shard failure doesn’t affect other shards’ queries
Resource efficiency: Each shard only processes its subset of data

Cons:

Tail latency amplification: $P_{99}$ latency of distributed query = $P_{99}$ of slowest shard
Network overhead: Coordinator must receive/merge potentially large result sets
Partial failure complexity: Deciding between availability (return partial results) vs consistency (fail entire query)
Coordinator bottleneck: All results flow through coordinator (CPU, memory, network)

When to use: Essential for any sharded TSDB. The question isn’t whether to use query federation, but how to implement it efficiently.

Parallel query latency:

$$ L\_{\text{distributed}} = \max_{i=1}^{S} L\_{\text{shard}_i} + L\_{\text{merge}} $$

For balanced shards: $L\_{\text{distributed}} \approx L\_{\text{single-shard}} + L\_{\text{merge}}$ (minimal overhead).

Advanced: Hedged requests

To combat tail latency, some systems (Thanos) use hedged requests: send duplicate queries to backup shards after timeout.

func (qc *QueryCoordinator) executeWithHedging(shard *Shard, query *Query, timeout time.Duration) (*Result, error) {
    resultChan := make(chan *Result, 2)

    // Primary request
    go func() {
        result, _ := shard.Query(query)
        resultChan <- result
    }()

    // Hedge: send to backup shard after P95 latency
    time.AfterFunc(timeout, func() {
        backupShard := qc.getBackupShard(shard)
        result, _ := backupShard.Query(query)
        resultChan <- result
    })

    return <-resultChan, nil // Return whichever completes first
}

Pros: Reduces P99 latency by 20–50% (empirical data from Google) Cons: Doubles resource usage (wasteful), requires read replicas

Merge complexity by aggregation type:

Different aggregations require different merge logic:

Aggregation	Merge Strategy	Complexity
`SUM`	Add all partial sums	$O(n)$
`AVG`	Weighted average by counts	$O(n)$
`MAX/MIN`	Take max/min of partial results	$O(n)$
`COUNT`	Sum partial counts	$O(n)$
`P95` (quantile)	Re-compute from all samples	$O(n \log n)$
`rate()`	Per-shard rate, then sum	$O(n)$

Why quantiles are hard: You cannot merge P95 values from shards to get global P95. Must either:

Send all samples to coordinator (memory explosion)
Use approximate algorithms (t-digest, DDSketch)
Compute exact quantiles per shard and return approximate global quantile

VictoriaMetrics and Prometheus use approximations; M3DB returns warnings for cross-shard quantiles.

Streaming merge optimization:

Instead of buffering all shard results, stream and merge incrementally:

func (qc *QueryCoordinator) streamingMerge(shards []*Shard, q *Query) *ResultStream {
    streams := make([]SeriesIterator, len(shards))
    for i, shard := range shards {
        streams[i] = shard.QueryStream(q)
    }

    // K-way merge of sorted series (each shard returns sorted)
    return NewKWayMergeIterator(streams)
}

Pros: Constant memory usage, lower latency (results stream as soon as first shard responds) Cons: Requires all shards return sorted results, implementation complexity

Replication

Replication provides fault tolerance—if a shard dies, its replica can serve requests. Unlike sharding (which distributes data), replication duplicates it. The trade-off: higher availability and read throughput at the cost of storage and write amplification.

Why replication matters: Hard drives fail (MTBF ~5 years). Servers crash. Networks partition. Without replication, any single failure loses data. Uber’s M3DB was designed from day one with 3× replication across datacenters—when a whole datacenter lost power, the system continued serving queries.

Replication strategies differ:

1. Synchronous replication (strong consistency)

Write must complete on all replicas before acknowledging to client. Guarantees consistency but sacrifices availability during partitions.

Pros:

Strong consistency: All replicas have identical data at all times
No data loss: If primary fails, replicas are fully up-to-date
Predictable reads: Any replica can serve reads with same result

Cons:

High latency: Write latency = slowest replica
Availability sacrifice: Write fails if any replica is down
Limited to single datacenter: Cross-datacenter latency makes this impractical

When to use: Financial data, compliance-critical metrics where data loss is unacceptable and you’re willing to sacrifice availability.

2. Asynchronous replication (eventual consistency)

Write completes on primary, then asynchronously replicates to backups. Fast but risks data loss.

Pros:

Low latency: Write latency = primary only
High availability: Continues operating even if all replicas are down
Works cross-datacenter: No synchronous network round-trips

Cons:

Eventual consistency: Replicas lag behind primary by seconds/minutes
Data loss risk: If primary fails before replication completes, recent writes lost
Read inconsistency: Different replicas may return different results

When to use: Observability metrics where slight data loss is acceptable and low latency is critical. Prometheus (single-node) with remote write to long-term storage uses this model.

3. Quorum-based replication (tunable consistency)

This is the sweet spot used by M3DB, Cassandra, and Riak. Require $W$ writes and $R$ reads from $N$ replicas. If $R + W > N$, you get linearizability.

Write replication (M3DB approach):

type ReplicatedWrite struct {
    Primary   *Shard
    Replicas  []*Shard
    Quorum    int // Majority: (N/2) + 1
}

func (rw *ReplicatedWrite) Write(sample Sample) error {
    responses := make(chan error, 1+len(rw.Replicas))

    // Write to primary
    go func() {
        responses <- rw.Primary.Write(sample)
    }()

    // Write to replicas
    for _, replica := range rw.Replicas {
        go func(r *Shard) {
            responses <- r.Write(sample)
        }(replica)
    }

    // Wait for quorum
    successCount := 0
    for i := 0; i < 1+len(rw.Replicas); i++ {
        if err := <-responses; err == nil {
            successCount++
            if successCount >= rw.Quorum {
                return nil
            }
        }
    }

    return ErrQuorumNotMet
}

Pros:

Tunable consistency: Choose $W$ and $R$ based on requirements
Fault tolerance: System continues with minority of replicas down
Balance: Middle ground between latency and consistency
Cross-datacenter: Works across regions with appropriate quorum settings

Cons:

Complex: Requires careful tuning of replication factor, quorum sizes
Higher latency than async: Must wait for quorum, not just primary
Anti-entropy needed: Background repair process to fix inconsistencies
Version conflicts: Requires conflict resolution (last-write-wins, vector clocks)

When to use: Production distributed TSDBs that need both availability and reasonable consistency. This is the industry standard for multi-node TSDBs.

Common quorum configurations:

Replication Factor	Write Quorum	Read Quorum	Consistency Level
$N=3$	$W=1$	$R=1$	Eventual (fast, risky)
$N=3$	$W=2$	$R=1$	Write-favored
$N=3$	$W=1$	$R=2$	Read-favored
$N=3$	$W=2$	$R=2$	Strong ($W+R>N$)
$N=3$	$W=3$	$R=1$	Synchronous write

Typical production: $N=3$, $W=2$, $R=1$ (optimized for read latency, tolerates 1 failure).

Availability with quorum replication:

For replication factor $R = 3$, quorum $Q = 2$:

$$ A = 1 - (1 - p)^R \approx 1 - (1 - p)^3 $$

where $p$ is single-node availability.

For $p = 0.99$: $A = 1 - 0.01^3 = 0.999999$ (six nines!).

Performance Benchmarks

Benchmarks provide concrete validation of design decisions. However, real-world performance varies dramatically based on cardinality, query patterns, and hardware. These numbers represent achievable performance on modern hardware, not theoretical limits.

Write Throughput

Write performance is primarily limited by:

Lock contention in the MemTable (mitigated by sharding or lock-free structures)
WAL fsync latency (mitigated by batching or async commit)
Memory allocation in Go’s garbage collector (mitigated by sync.Pool and object reuse)

Single-node Go implementation (16-core AMD EPYC, 64GB RAM, NVMe SSD):

Batch Size	Samples/sec	Latency P99
1	50K	20 ms
100	500K	25 ms
1,000	2M	50 ms
10,000	5M	100 ms

Comparison with production systems:

Prometheus: 1M samples/sec (single node, conservative for stability)
VictoriaMetrics: 10M samples/sec (single node, optimized Go with aggressive caching)
M3DB: 100M+ samples/sec (distributed, Uber production at 500M+)
InfluxDB: 500K–2M samples/sec (single node, depends on cardinality)
TimescaleDB: 100K–500K samples/sec (single node, limited by PostgreSQL overhead)

Why the variance? VictoriaMetrics achieves 10× Prometheus throughput through several optimizations:

Zero-copy decompression
Memory-mapped file I/O
Custom allocator reducing GC pressure
Aggressive use of sync.Pool for object reuse

M3DB’s distributed performance comes from sophisticated replication (quorum writes across 3 replicas with sub-millisecond latency) and custom serialization (protobuf-based commitlog). At Uber, M3DB handles 500+ million samples/sec across a 50-node cluster.

Query Performance

Range query (6-hour window, 1K series):

Resolution	Query Time	Samples Scanned
Raw (1s)	450 ms	21.6M
5min	50 ms	360K
1hour	10 ms	6K

Aggregation query (SUM across 10K series):

$$ T\_{\text{aggregate}} = T\_{\text{scan}} + T\_{\text{compute}} $$

For 10K series × 1 hour (3,600 samples each):

Scan time: 200 ms
Compute time: 50 ms
Total: 250 ms

Storage Efficiency

Compression effectiveness:

Data Type	Uncompressed	Compressed	Ratio
Timestamps	8 bytes	1.37 bits	46.7×
Values (slow-changing)	8 bytes	12 bits	5.3×
Values (random)	8 bytes	48 bits	1.3×
Total (typical)	16 bytes	1.67 bytes	9.6×

Storage cost:

For $10^6$ series, 1s resolution, 30-day retention:

$$ S = 10^6 \times \frac{86{,}400 \times 30}{1} \times 1.67 \text{ bytes} \approx 4.3 \text{ TB} $$

vs 41 TB uncompressed.

Advanced Optimizations

Once the fundamentals are solid, squeezing out additional performance requires diving into hardware-specific optimizations. These techniques provide 2–5× improvements but add complexity, so they’re typically implemented only in mature production systems.

SIMD Compression

Modern CPUs have vector instructions (SIMD - Single Instruction, Multiple Data) that process multiple data elements in parallel. Intel’s AVX2 (256-bit registers) can process 4 int64 timestamps simultaneously, and AVX-512 (512-bit) can handle 8.

Use AVX2 instructions for parallel delta-of-delta encoding:

Why this matters: Compression/decompression is often CPU-bound. Prometheus spends 30–40% of CPU time in compression during high-velocity writes. VictoriaMetrics uses assembly-optimized SIMD routines for critical paths, achieving 3–4× speedup over pure Go code.

Trade-offs: SIMD code is architecture-specific (x86-only for AVX2), hard to debug, and risks portability issues. Most TSDBs implement fallback scalar paths for ARM/other architectures.

import "golang.org/x/sys/cpu"

func compressTimestampsSIMD(timestamps []int64) []byte {
    if !cpu.X86.HasAVX2 {
        return compressTimestampsScalar(timestamps)
    }

    // Process 4 timestamps at once with AVX2 (256-bit registers)
    // Implementation uses assembly or compiler intrinsics
    // Achieves 3-4× speedup over scalar code
    return compressAVX2(timestamps)
}

Speedup: 3–4× for compression/decompression.

Bloom Filters

Bloom filters are probabilistic data structures that answer “does this set contain X?” with no false negatives but possible false positives. For TSDBs, this means: if the bloom filter says a chunk doesn’t contain a series, we can safely skip it; if it says it might contain it, we must read the chunk to verify.

Why this helps: A typical block contains thousands of chunks. A query for specific series might only match 10 chunks, but without bloom filters, we’d have to check metadata for all chunks. Bloom filters reduce I/O by 90%+ by allowing us to skip irrelevant chunks.

Production usage:

Cassandra: Bloom filters per SSTable (skips tables during reads)
Prometheus: Per-block bloom filters on series IDs (added in v2.30)
VictoriaMetrics: Multi-level bloom filters (coarse-grained at block level, fine-grained at chunk level)

Skip chunks that definitely don’t match:

type ChunkBloomFilter struct {
    filter *bloom.BloomFilter
    size   int
}

func (cbf *ChunkBloomFilter) MayContainSeries(seriesID uint64) bool {
    return cbf.filter.Test(uint64ToBytes(seriesID))
}

func (block *Block) QueryWithBloom(matchers []*Matcher) SeriesSet {
    // 1. Get candidate series from index
    seriesIDs := block.Index.Query(matchers)

    // 2. Filter chunks using bloom filter
    var matchedChunks []*Chunk
    for _, sid := range seriesIDs {
        for _, chunk := range block.Chunks[sid] {
            if chunk.Bloom.MayContainSeries(sid) {
                matchedChunks = append(matchedChunks, chunk)
            }
        }
    }

    return block.readChunks(matchedChunks)
}

False positive rate:

$$ p = \left(1 - e^{-kn/m}\right)^k $$

where $k$ is hash functions, $n$ is elements, $m$ is bits.

For $k = 3$, $m/n = 10$: $p \approx 0.008$ (0.8% false positives).

Scan reduction: 90%+ chunks skipped.

Goroutine Pooling

Problem: Creating goroutines for each query is expensive.

Solution: Worker pool pattern:

type WorkerPool struct {
    tasks chan func()
    wg    sync.WaitGroup
}

func NewWorkerPool(numWorkers int) *WorkerPool {
    wp := &WorkerPool{
        tasks: make(chan func(), 1000),
    }

    for i := 0; i < numWorkers; i++ {
        wp.wg.Add(1)
        go wp.worker()
    }

    return wp
}

func (wp *WorkerPool) worker() {
    defer wp.wg.Done()
    for task := range wp.tasks {
        task()
    }
}

func (wp *WorkerPool) Submit(task func()) {
    wp.tasks <- task
}

Throughput improvement: 2–3× for small queries (reduced GC pressure).

Production Considerations

Monitoring

Key metrics:

type TSDBMetrics struct {
    WriteThroughput   prometheus.Counter
    WriteLatency      prometheus.Histogram
    QueryLatency      prometheus.Histogram
    ChunkCompressionRatio prometheus.Gauge
    MemTableSize      prometheus.Gauge
    BlockCount        prometheus.Gauge
    CompactionDuration prometheus.Histogram
}

SLOs (Service Level Objectives):

Write latency: P99 < 100 ms
Query latency: P95 < 500 ms
Write availability: 99.9%
Query availability: 99.95%

Backpressure

Handle write overload gracefully:

type RateLimiter struct {
    limit   int64
    current int64
    mu      sync.Mutex
}

func (db *TSDB) Write(samples []Sample) error {
    if !db.rateLimiter.Allow(len(samples)) {
        return ErrRateLimitExceeded
    }

    if db.memTable.Size() > db.memTable.MaxSize() {
        return ErrMemTableFull // Signal backpressure to client
    }

    return db.memTable.InsertBatch(samples)
}

Response: HTTP 429 (Too Many Requests) or 503 (Service Unavailable).

Disaster Recovery

Backup strategy:

WAL: Continuous backup (replicate to S3)
Blocks: Snapshot every 2 hours
Index: Rebuild from blocks (idempotent)

Recovery time:

$$ \text{RTO} = T\_{\text{WAL-replay}} + T\_{\text{index-rebuild}} $$

For 24-hour WAL at 1M samples/sec:

$$ T\_{\text{WAL-replay}} = \frac{86{,}400 \times 10^6}{10^6 \text{ samples/sec}} \approx 15 \text{ minutes} $$

Conclusion

Building a high-performance time series database in Go requires understanding the unique characteristics of metrics workloads and making appropriate architectural trade-offs. This article has explored the core techniques employed by production systems like Prometheus, VictoriaMetrics, M3DB, and InfluxDB.

Key design principles:

Exploit temporal structure: Delta-of-delta encoding achieves 46× compression for timestamps by leveraging regular scrape intervals. The Gorilla paper from Facebook demonstrated this technique at scale, and it’s now industry standard.
Immutable blocks: Time-bounded, immutable blocks simplify concurrency (no locks needed for reads), enable efficient compaction, and make retention policies trivial (delete old directories). This was Prometheus’s key innovation in their 2.0 TSDB redesign.
Inverted indexing: High-cardinality label queries require specialized indexes. The transition from LevelDB-based indexes to inverted indexes reduced Prometheus query times by 100× for label-heavy workloads.
LSM-tree storage: Write-heavy workloads demand storage engines optimized for sequential writes. LSM-trees provide 5–10× better write throughput than B-trees at the cost of read amplification, which caching mitigates.
Batched writes: Individual writes incur high syscall overhead. Batching 10,000 samples per MemTable insert improves throughput by 10× and reduces WAL fsync pressure.

Architectural trade-offs differ across systems:

Prometheus: Favors operational simplicity (single-node) and correctness over raw performance
VictoriaMetrics: Optimizes for single-node performance through aggressive caching and SIMD
M3DB: Distributed-first with cross-datacenter replication, sacrifices some per-node performance for global availability
InfluxDB IOx: Columnar storage (Apache Arrow) optimizes for analytical queries over raw ingest speed
TimescaleDB: Leverages PostgreSQL’s maturity but pays overhead tax for general-purpose engine

Performance achievable with techniques presented:

Compression: 9.6× average (1.67 bytes/sample), making petabyte-scale retention economical
Indexing: $O(k \cdot |L|)$ label intersection handles millions of series
Write throughput: 5M samples/sec single node (VictoriaMetrics achieves 10M+ with additional optimizations)
Query latency: P95 < 500ms for 6-hour range queries over 10K series
Caching: 85% hit rates reduce query latency by 5×

The architecture presented handles real-world production workloads—Prometheus powers monitoring at SoundCloud, GitLab, and Digital Ocean; VictoriaMetrics handles 10B+ samples/day at single companies; M3DB serves 500M+ samples/sec at Uber across 50 datacenters.

Building a TSDB from scratch is a multi-year engineering effort, but understanding these fundamentals enables informed technology choices and effective operation of existing systems.

References

O’Neil, P., Cheng, E., Gawlick, D., & O’Neil, E. (1996). “The Log-Structured Merge-Tree (LSM-Tree).” Acta Informatica, 33(4), 351-385.
Pelkonen, T., Franklin, S., Teller, J., et al. (2015). “Gorilla: A Fast, Scalable, In-Memory Time Series Database.” VLDB, 8(12), 1816-1827.
Prometheus Authors (2023). “Prometheus TSDB Format.” Prometheus Documentation.
Bortnikov, E., Hillel, E., Keidar, I., et al. (2018). “M3: Uber’s Open Source, Large-scale Metrics Platform for Prometheus.” VLDB, 11(12), 1820-1833.
Vavilapalli, V. K., Murthy, A. C., Douglas, C., et al. (2013). “Apache Hadoop YARN: Yet Another Resource Negotiator.” SoCC, 5:1-5:16.
Bloom, B. H. (1970). “Space/Time Trade-offs in Hash Coding with Allowable Errors.” Communications of the ACM, 13(7), 422-426.
Lemire, D., & Boytsov, L. (2015). “Decoding Billions of Integers per Second through Vectorization.” Software: Practice and Experience, 45(1), 1-29.
Dunning, T., & Ertl, O. (2019). “Computing Extremely Accurate Quantiles Using t-Digests.” arXiv:1902.04023.
InfluxData (2023). “InfluxDB IOx: Columnar Storage Architecture.” InfluxDB Documentation.
Vasilev, B., et al. (2021). “VictoriaMetrics: Time Series Database Performance Optimizations.” VictoriaMetrics Blog.
Wilms, K., & Fischer, S. (2020). “Thanos: Highly Available Prometheus Setup with Long Term Storage.” CNCF.

This implementation guide represents industry best practices as of 2024. Production TSDBs like Prometheus, VictoriaMetrics, M3DB, InfluxDB, TimescaleDB, and Thanos/Cortex employ similar techniques with continuous optimization. The landscape evolves rapidly—newer systems like InfluxDB IOx and QuestDB explore columnar storage and zero-copy architectures for next-generation performance.

Questions or feedback?

Time Series Data Model

Fundamental Characteristics

Data Model Schema

Storage Architecture

LSM-Tree Foundation

Time-Based Partitioning

Chunk Storage Format

Compression Algorithms

Timestamp Compression (Delta-of-Delta)

Value Compression (XOR Float Encoding)

Indexing Strategy

Inverted Index

Index Persistence Format

Write Path Optimization

Write-Ahead Log (WAL)

Batched Writes

Query Execution Engine

Query Language

Query Planner

Parallel Query Execution

Aggregation and Downsampling

Stream Aggregation

Multi-Resolution Storage

Compaction Strategy

Leveled Compaction

Compaction Scheduling

Memory Management

MemTable Design

LRU Block Cache

Distributed Architecture

Sharding Strategy

Query Federation

Replication

Performance Benchmarks

Write Throughput

Query Performance

Storage Efficiency

Advanced Optimizations

SIMD Compression

Bloom Filters

Goroutine Pooling

Production Considerations

Monitoring

Backpressure

Disaster Recovery

Conclusion

References