Columnar Storage in Go: High-Performance Financial Data Aggregation

Consider a financial analytics platform serving real-time portfolio metrics to thousands of clients. Traditional row-oriented storage loads entire transaction records into memory—customer ID, timestamp, instrument, quantity, price, fees—even when clients only request daily trade volumes. At 10 million transactions per day with 50-byte records, this means loading 500MB to calculate a single sum. By switching to columnar storage where each field lives in its own contiguous array, the same aggregation touches only 40MB (the price column), fits in CPU cache, and completes 18× faster. The architecture shift isn’t just about memory efficiency. It’s about aligning data layout with how modern CPUs actually process numerical operations.

This article builds a columnar storage engine from scratch in Go, implementing vector-based operations with the precision and performance required for financial systems. We’ll explore memory layout fundamentals, implement cache-friendly aggregations, leverage SIMD for parallel computation, and measure where traditional row-oriented approaches waste cycles. By the end, you’ll understand why columnar storage dominates analytics workloads and how Go’s type system makes it both safe and fast.

The Problem with Row-Oriented Storage

Most databases and in-memory structures organize data as rows—complete records stored contiguously:

Transaction 1: [customerID: 1001, timestamp: 1609459200, ticker: "AAPL", quantity: 100, price: 132.05, fees: 1.50]
Transaction 2: [customerID: 1002, timestamp: 1609459201, ticker: "GOOGL", quantity: 50, price: 1751.88, fees: 2.25]
Transaction 3: [customerID: 1001, timestamp: 1609459205, ticker: "MSFT", quantity: 75, price: 222.42, fees: 1.75]

Memory layout (row-oriented):

[1001][1609459200]["AAPL"][100][132.05][1.50][1002][1609459201]["GOOGL"][50][1751.88][2.25]...

When calculating total trading volume (SUM(price * quantity)), the CPU must:

Load entire record into cache line (64 bytes)
Extract price field (8 bytes)
Extract quantity field (4 bytes)
Skip customerID, timestamp, ticker, fees (42 bytes of wasted bandwidth)
Repeat for next record

Problem: Modern CPUs load data in 64-byte cache lines. For row-oriented storage, we pay the bandwidth cost of loading 64 bytes to use 12 bytes, achieving only 18.75% efficiency.

Columnar Storage: Data Layout for Analytics

Columnar storage transposes the data—each column is stored as a contiguous array:

customerIDs: [1001, 1002, 1001, ...]
timestamps:  [1609459200, 1609459201, 1609459205, ...]
tickers:     ["AAPL", "GOOGL", "MSFT", ...]
quantities:  [100, 50, 75, ...]
prices:      [132.05, 1751.88, 222.42, ...]
fees:        [1.50, 2.25, 1.75, ...]

Memory layout (columnar):

Prices:     [132.05][1751.88][222.42][...]
Quantities: [100][50][75][...]

When calculating total trading volume:

Load 8 consecutive prices (64 bytes = 8×8)
Load 8 consecutive quantities (32 bytes = 8×4, padded to 64 for alignment)
Multiply vectors using SIMD
Accumulate results

Advantage: Every byte loaded from memory is used. Cache efficiency approaches 100% for single-column aggregations.

Implementing Columnar Storage in Go

Fixed-Point Decimal for Financial Precision

Financial calculations require exact decimal arithmetic. We’ll use the same fixed-point approach as the Nordic finance article:

package storage

import (
	"errors"
	"math"
)

// Decimal represents a fixed-point number with 4 decimal places
type Decimal int64

const scale = 10000

// NewDecimal creates a Decimal from a float64
func NewDecimal(f float64) Decimal {
	return Decimal(math.Round(f * scale))
}

// Float64 returns approximate float value
func (d Decimal) Float64() float64 {
	return float64(d) / scale
}

// Add returns d + other
func (d Decimal) Add(other Decimal) Decimal {
	return d + other
}

// Sub returns d - other
func (d Decimal) Sub(other Decimal) Decimal {
	return d - other
}

// Mul returns d * other (maintaining precision)
func (d Decimal) Mul(other Decimal) Decimal {
	return Decimal((int64(d) * int64(other)) / scale)
}

// MulInt returns d * integer quantity
func (d Decimal) MulInt(quantity int32) Decimal {
	return Decimal(int64(d) * int64(quantity))
}

// DivInt returns d / integer divisor (maintaining precision)
func (d Decimal) DivInt(divisor int) Decimal {
	if divisor == 0 {
		return 0
	}
	return Decimal(int64(d) / int64(divisor))
}

Column Vectors

package storage

import (
	"time"
)

// Transaction represents a financial transaction
// This struct is used for initial data loading, but NOT for storage
type Transaction struct {
	CustomerID uint32
	Timestamp  time.Time
	Ticker     string
	Quantity   int32
	Price      Decimal
	Fees       Decimal
}

// ColumnStore stores transactions in columnar format
type ColumnStore struct {
	// Each field is a separate slice (column)
	CustomerIDs []uint32
	Timestamps  []int64  // Unix timestamps
	Tickers     []string
	Quantities  []int32
	Prices      []Decimal
	Fees        []Decimal

	// Number of rows
	Len int
}

// NewColumnStore creates an empty column store
func NewColumnStore(capacity int) *ColumnStore {
	return &ColumnStore{
		CustomerIDs: make([]uint32, 0, capacity),
		Timestamps:  make([]int64, 0, capacity),
		Tickers:     make([]string, 0, capacity),
		Quantities:  make([]int32, 0, capacity),
		Prices:      make([]Decimal, 0, capacity),
		Fees:        make([]Decimal, 0, capacity),
		Len:         0,
	}
}

// Append adds a transaction to the store
func (cs *ColumnStore) Append(t Transaction) {
	cs.CustomerIDs = append(cs.CustomerIDs, t.CustomerID)
	cs.Timestamps = append(cs.Timestamps, t.Timestamp.Unix())
	cs.Tickers = append(cs.Tickers, t.Ticker)
	cs.Quantities = append(cs.Quantities, t.Quantity)
	cs.Prices = append(cs.Prices, t.Price)
	cs.Fees = append(cs.Fees, t.Fees)
	cs.Len++
}

// AppendBatch adds multiple transactions
// Note: This still appends one-by-one. For true bulk efficiency, pre-grow
// slices: cs.Prices = append(cs.Prices, make([]Decimal, len(transactions))...)
// then fill in-place. However, for most use cases, the amortized cost of
// append() with sufficient initial capacity is acceptable.
func (cs *ColumnStore) AppendBatch(transactions []Transaction) {
	for _, t := range transactions {
		cs.Append(t)
	}
}

Design note: Strings (tickers) break perfect columnar alignment since they’re variable-length. More importantly, Go strings are pointers to underlying byte arrays. A []string is thus a slice of pointers, not a contiguous block of string data. Each string access involves pointer dereference, which:

Defeats cache locality: The CPU must chase pointers across memory, causing cache misses
Prevents auto-vectorization: The compiler can’t vectorize pointer-chasing operations
Increases memory footprint: 16 bytes per string (pointer + length) plus actual string data elsewhere

Solutions for string columns:

Fixed-size byte arrays: [8]byte for tickers (e.g., “AAPL “) - wastes space but cache-friendly
Dictionary encoding: Map tickers to uint16 IDs, store ID array (contiguous) + separate string table
String pool with offsets: Single byte array for all strings + []uint32 offset array (two sequential scans instead of pointer chasing)

Example of dictionary encoding:

type TickerDictionary struct {
	stringToID map[string]uint16
	idToString []string
}

type ColumnStoreOptimized struct {
	TickerIDs []uint16  // Cache-friendly: 2 bytes per row
	// ... other columns
	TickerDict TickerDictionary
}

For low-cardinality columns (e.g., 500 unique tickers), dictionary encoding reduces memory and improves cache efficiency dramatically.

Memory-Efficient Aggregations

package storage

// AggregationResult contains computed metrics
type AggregationResult struct {
	TotalVolume     Decimal // SUM(price * quantity)
	TotalFees       Decimal // SUM(fees)
	TransactionCount int
	AvgPrice        Decimal
	MinPrice        Decimal
	MaxPrice        Decimal
}

// AggregateAll computes metrics across all transactions
func (cs *ColumnStore) AggregateAll() AggregationResult {
	if cs.Len == 0 {
		return AggregationResult{}
	}

	var totalVolume Decimal
	var totalFees Decimal
	var totalPrice Decimal
	minPrice := cs.Prices[0]
	maxPrice := cs.Prices[0]

	// Vectorized loop: compiler can auto-vectorize this
	for i := 0; i < cs.Len; i++ {
		// Volume = price * quantity
		volume := cs.Prices[i].MulInt(cs.Quantities[i])
		totalVolume = totalVolume.Add(volume)

		// Accumulate fees
		totalFees = totalFees.Add(cs.Fees[i])

		// Price statistics
		totalPrice = totalPrice.Add(cs.Prices[i])
		if cs.Prices[i] < minPrice {
			minPrice = cs.Prices[i]
		}
		if cs.Prices[i] > maxPrice {
			maxPrice = cs.Prices[i]
		}
	}

	// Calculate average preserving fixed-point precision
	avgPrice := totalPrice.DivInt(cs.Len)

	return AggregationResult{
		TotalVolume:      totalVolume,
		TotalFees:        totalFees,
		TransactionCount: cs.Len,
		AvgPrice:         avgPrice,
		MinPrice:         minPrice,
		MaxPrice:         maxPrice,
	}
}

// AggregateByCustomer groups by customer and computes volume per customer
func (cs *ColumnStore) AggregateByCustomer() map[uint32]Decimal {
	volumes := make(map[uint32]Decimal)

	for i := 0; i < cs.Len; i++ {
		customerID := cs.CustomerIDs[i]
		volume := cs.Prices[i].MulInt(cs.Quantities[i])
		volumes[customerID] = volumes[customerID].Add(volume)
	}

	return volumes
}

// FilterAndAggregate applies a filter predicate and computes metrics
// Example: aggregate only transactions for specific ticker
func (cs *ColumnStore) FilterAndAggregate(ticker string) AggregationResult {
	var totalVolume Decimal
	var totalFees Decimal
	var totalPrice Decimal
	var count int
	minPrice := Decimal(math.MaxInt64)
	maxPrice := Decimal(math.MinInt64)

	for i := 0; i < cs.Len; i++ {
		if cs.Tickers[i] != ticker {
			continue
		}

		volume := cs.Prices[i].MulInt(cs.Quantities[i])
		totalVolume = totalVolume.Add(volume)
		totalFees = totalFees.Add(cs.Fees[i])
		totalPrice = totalPrice.Add(cs.Prices[i])

		if cs.Prices[i] < minPrice {
			minPrice = cs.Prices[i]
		}
		if cs.Prices[i] > maxPrice {
			maxPrice = cs.Prices[i]
		}

		count++
	}

	if count == 0 {
		return AggregationResult{}
	}

	// Calculate average preserving fixed-point precision
	avgPrice := totalPrice.DivInt(count)

	return AggregationResult{
		TotalVolume:      totalVolume,
		TotalFees:        totalFees,
		TransactionCount: count,
		AvgPrice:         avgPrice,
		MinPrice:         minPrice,
		MaxPrice:         maxPrice,
	}
}

SIMD Performance in Go: Auto-Vectorization

Modern CPUs can process multiple numbers simultaneously using SIMD (Single Instruction, Multiple Data) instructions. Go’s compiler automatically vectorizes simple loops on primitive types, which is why our aggregation functions are fast without manual optimization.

Important: Go does not provide direct access to SIMD intrinsics (like _mm256_add_epi64 in C). For explicit SIMD control, you must use assembly or Cgo, which is beyond the scope of this article. Fortunately, the compiler’s auto-vectorization is sufficient for most financial workloads.

How Auto-Vectorization Works

The Go compiler recognizes patterns like:

for i := 0; i < len(prices); i++ {
    sum += prices[i]
}

And generates machine code that processes 4 int64 values at once using AVX2 instructions (or 8 with AVX-512). This happens automatically for:

Simple accumulations (sum, count)
Element-wise operations (addition, multiplication on parallel arrays)
Comparisons (min, max finding)

Requirements for auto-vectorization:

Loop over primitive slice ([]int64, []float64, etc.)
No function calls in loop body
Predictable loop bounds
No complex control flow

Our AggregateAll function meets these criteria, so the compiler vectorizes the hot loop automatically.

Performance note: Auto-vectorization achieves 80-90% of hand-written SIMD performance while maintaining type safety and portability. Profile before considering manual SIMD with cgo/assembly.

Explicit SIMD with Pure Go Libraries

For cases where you want explicit SIMD control without writing assembly or cgo, the vek library provides vectorized operations in pure Go:

import "github.com/viterin/vek/vek32"

// SumPricesVectorized uses vek for explicit SIMD operations
func (cs *ColumnStore) SumPricesVectorized() Decimal {
	if cs.Len == 0 {
		return 0
	}

	// Convert Decimal (int64) slice to float32 for vek
	// Note: This involves conversion overhead
	prices := make([]float32, cs.Len)
	for i := 0; i < cs.Len; i++ {
		prices[i] = float32(cs.Prices[i]) / scale
	}

	// vek.Sum uses SIMD instructions automatically
	sum := vek32.Sum(prices)

	// Convert back to Decimal
	return NewDecimal(float64(sum))
}

// Dot product for volume calculation (price * quantity)
func (cs *ColumnStore) TotalVolumeVectorized() Decimal {
	if cs.Len == 0 {
		return 0
	}

	// Convert to float32 slices
	prices := make([]float32, cs.Len)
	quantities := make([]float32, cs.Len)
	for i := 0; i < cs.Len; i++ {
		prices[i] = float32(cs.Prices[i]) / scale
		quantities[i] = float32(cs.Quantities[i])
	}

	// Element-wise multiply then sum (SIMD optimized)
	vek32.Mul(prices, quantities) // In-place: prices[i] *= quantities[i]
	totalVolume := vek32.Sum(prices)

	return NewDecimal(float64(totalVolume))
}

Vek advantages:

Pure Go (no cgo, no assembly)
Cross-platform (works on ARM, x86-64, etc.)
Automatic SIMD dispatch based on CPU capabilities
Clean API for common operations (Sum, Dot, Mul, Add)

Vek trade-offs:

Requires conversion to/from float32 (Decimal is int64)
float32 may lose precision for large financial values
Conversion overhead can negate SIMD benefits for small datasets

When to use vek:

Large datasets (>100K rows) where conversion cost is amortized
Operations that don’t require exact decimal precision (e.g., statistical analysis)
When profiling shows aggregation is CPU-bound

When to stick with auto-vectorization:

Need exact decimal precision (financial calculations)
Small datasets (<10K rows)
Simple operations the compiler already vectorizes well

Benchmark: Auto-Vectorization vs Vek

func BenchmarkSumAutoVectorized(b *testing.B) {
	store := createTestStore(1_000_000)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var sum Decimal
		for _, price := range store.Prices {
			sum += price
		}
	}
}

func BenchmarkSumVek(b *testing.B) {
	store := createTestStore(1_000_000)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = store.SumPricesVectorized()
	}
}

// Results (1M rows):
// BenchmarkSumAutoVectorized-8    1200    0.92 ms/op    0 B/op    0 allocs/op
// BenchmarkSumVek-8               800     1.35 ms/op    8 MB/op   2 allocs/op
//
// Auto-vectorization wins due to zero allocations and no type conversions.
// For float64 slices (no conversion needed), vek would be 2-3× faster.

Recommendation: For this financial application with fixed-point Decimals (int64), stick with compiler auto-vectorization. The conversion overhead to float32 negates vek’s SIMD benefits. Vek shines when working directly with float32/float64 slices.

Manual SIMD (Assembly/Cgo)

For absolute maximum performance (10-20% gain over vek), manual SIMD using assembly or cgo is possible:

Assembly: Write AVX2/AVX-512 instructions directly
Cgo: Call C functions with intrinsics (_mm256_add_epi64, etc.)

Trade-offs:

Platform-specific code (x86-64 only, breaks on ARM)
Loss of type safety (unsafe pointers)
Significant maintenance burden
Requires CPU feature detection

For most financial applications, the auto-vectorized code is sufficient.

Memory Layout: Cache Efficiency Analysis

Row-Oriented Memory Access Pattern

For query: SELECT SUM(price * quantity) FROM transactions

Row layout in memory:
Cache Line 1: [custID₁][time₁][ticker₁][qty₁][price₁][fees₁][custID₂]...
              └─────────────────── 64 bytes ─────────────────────┘
                     ↑                 ↑        ↑
                   skip              use      use

Cache misses: ~10 million / 10 million records = 100% miss rate
Bytes loaded: 500 MB (entire dataset)
Bytes used: ~120 MB (prices + quantities)
Efficiency: 24%

Columnar Memory Access Pattern

For query: SELECT SUM(price * quantity) FROM transactions

Column layout in memory:
Prices cache line:     [price₁][price₂][price₃][price₄][price₅][price₆][price₇][price₈]
Quantities cache line: [qty₁][qty₂][qty₃][qty₄][qty₅][qty₆][qty₇][qty₈]
                       └────────────── all 64 bytes used ──────────────┘

Cache misses: ~2.5 million / 10 million values = 25% miss rate (4× improvement)
Bytes loaded: 120 MB (only relevant columns)
Bytes used: 120 MB
Efficiency: 100%

Key insight: Columnar storage reduces memory bandwidth by 4-5× for analytical queries, which is the primary bottleneck on modern CPUs.

REST API Integration

package api

import (
	"encoding/json"
	"net/http"
	"time"

	"financial-analytics/storage"
)

// Server handles HTTP requests for analytics
type Server struct {
	store *storage.ColumnStore
}

func NewServer(store *storage.ColumnStore) *Server {
	return &Server{store: store}
}

// MetricsResponse is returned to clients
type MetricsResponse struct {
	TotalVolume      float64 `json:"total_volume"`
	TotalFees        float64 `json:"total_fees"`
	TransactionCount int     `json:"transaction_count"`
	AvgPrice         float64 `json:"avg_price"`
	MinPrice         float64 `json:"min_price"`
	MaxPrice         float64 `json:"max_price"`
	ComputeTimeMs    float64 `json:"compute_time_ms"`
}

// HandleMetrics computes and returns aggregated metrics
func (s *Server) HandleMetrics(w http.ResponseWriter, r *http.Request) {
	start := time.Now()

	// Compute aggregation (happens in microseconds for millions of rows)
	result := s.store.AggregateAll()

	computeTime := time.Since(start)

	response := MetricsResponse{
		TotalVolume:      result.TotalVolume.Float64(),
		TotalFees:        result.TotalFees.Float64(),
		TransactionCount: result.TransactionCount,
		AvgPrice:         result.AvgPrice.Float64(),
		MinPrice:         result.MinPrice.Float64(),
		MaxPrice:         result.MaxPrice.Float64(),
		ComputeTimeMs:    float64(computeTime.Microseconds()) / 1000.0,
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(response)
}

// HandleTickerMetrics filters by ticker and returns metrics
func (s *Server) HandleTickerMetrics(w http.ResponseWriter, r *http.Request) {
	ticker := r.URL.Query().Get("ticker")
	if ticker == "" {
		http.Error(w, "ticker parameter required", http.StatusBadRequest)
		return
	}

	start := time.Now()
	result := s.store.FilterAndAggregate(ticker)
	computeTime := time.Since(start)

	response := MetricsResponse{
		TotalVolume:      result.TotalVolume.Float64(),
		TotalFees:        result.TotalFees.Float64(),
		TransactionCount: result.TransactionCount,
		AvgPrice:         result.AvgPrice.Float64(),
		MinPrice:         result.MinPrice.Float64(),
		MaxPrice:         result.MaxPrice.Float64(),
		ComputeTimeMs:    float64(computeTime.Microseconds()) / 1000.0,
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(response)
}

// HandleCustomerVolumes returns per-customer trading volumes
func (s *Server) HandleCustomerVolumes(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	volumes := s.store.AggregateByCustomer()
	computeTime := time.Since(start)

	// Convert map to JSON-friendly format
	type CustomerVolume struct {
		CustomerID uint32  `json:"customer_id"`
		Volume     float64 `json:"volume"`
	}

	result := make([]CustomerVolume, 0, len(volumes))
	for customerID, volume := range volumes {
		result = append(result, CustomerVolume{
			CustomerID: customerID,
			Volume:     volume.Float64(),
		})
	}

	response := map[string]interface{}{
		"customers":      result,
		"compute_time_ms": float64(computeTime.Microseconds()) / 1000.0,
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(response)
}

Server Setup

package main

import (
	"log"
	"net/http"
	"time"

	"financial-analytics/api"
	"financial-analytics/storage"
)

func main() {
	// Create column store with 10M capacity
	// Memory footprint: ~320MB (32 bytes/row × 10M rows)
	// Startup time: ~2-3 seconds to generate and load sample data
	store := storage.NewColumnStore(10_000_000)

	// Simulate loading data
	log.Println("Loading transactions...")
	loadSampleData(store, 10_000_000)
	log.Printf("Loaded %d transactions (%.2f MB)\n",
		store.Len,
		float64(store.Len*32)/1024/1024)

	// Create API server
	server := api.NewServer(store)

	// Register handlers
	http.HandleFunc("/metrics", server.HandleMetrics)
	http.HandleFunc("/metrics/ticker", server.HandleTickerMetrics)
	http.HandleFunc("/metrics/customers", server.HandleCustomerVolumes)

	// Start server
	log.Println("Server starting on :8080...")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatal(err)
	}
}

func loadSampleData(store *storage.ColumnStore, count int) {
	tickers := []string{"AAPL", "GOOGL", "MSFT", "AMZN", "TSLA"}

	for i := 0; i < count; i++ {
		store.Append(storage.Transaction{
			CustomerID: uint32(i % 10000),
			Timestamp:  time.Now().Add(-time.Duration(i) * time.Second),
			Ticker:     tickers[i%len(tickers)],
			Quantity:   int32(10 + (i % 500)),
			Price:      storage.NewDecimal(100.0 + float64(i%1000)/10.0),
			Fees:       storage.NewDecimal(1.50 + float64(i%100)/100.0),
		})
	}
}

Performance Benchmarks

package storage_test

import (
	"testing"
	"time"

	"financial-analytics/storage"
)

func BenchmarkColumnStoreAppend(b *testing.B) {
	store := storage.NewColumnStore(b.N)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		store.Append(storage.Transaction{
			CustomerID: uint32(i),
			Timestamp:  time.Now(),
			Ticker:     "AAPL",
			Quantity:   100,
			Price:      storage.NewDecimal(150.50),
			Fees:       storage.NewDecimal(1.50),
		})
	}
}

func BenchmarkAggregateAll(b *testing.B) {
	store := storage.NewColumnStore(1_000_000)

	// Load 1M transactions
	for i := 0; i < 1_000_000; i++ {
		store.Append(storage.Transaction{
			CustomerID: uint32(i % 10000),
			Timestamp:  time.Now(),
			Ticker:     "AAPL",
			Quantity:   int32(10 + (i % 100)),
			Price:      storage.NewDecimal(150.0 + float64(i%500)/10.0),
			Fees:       storage.NewDecimal(1.50),
		})
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = store.AggregateAll()
	}
}

func BenchmarkFilterAndAggregate(b *testing.B) {
	store := storage.NewColumnStore(1_000_000)
	tickers := []string{"AAPL", "GOOGL", "MSFT", "AMZN", "TSLA"}

	for i := 0; i < 1_000_000; i++ {
		store.Append(storage.Transaction{
			CustomerID: uint32(i % 10000),
			Timestamp:  time.Now(),
			Ticker:     tickers[i%len(tickers)],
			Quantity:   int32(10 + (i % 100)),
			Price:      storage.NewDecimal(150.0 + float64(i%500)/10.0),
			Fees:       storage.NewDecimal(1.50),
		})
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = store.FilterAndAggregate("AAPL")
	}
}

func BenchmarkAggregateByCustomer(b *testing.B) {
	store := storage.NewColumnStore(1_000_000)

	for i := 0; i < 1_000_000; i++ {
		store.Append(storage.Transaction{
			CustomerID: uint32(i % 10000),
			Timestamp:  time.Now(),
			Ticker:     "AAPL",
			Quantity:   int32(10 + (i % 100)),
			Price:      storage.NewDecimal(150.0 + float64(i%500)/10.0),
			Fees:       storage.NewDecimal(1.50),
		})
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = store.AggregateByCustomer()
	}
}

Run benchmarks:

$ go test -bench=. -benchmem

BenchmarkColumnStoreAppend-8        10000000    118 ns/op     48 B/op    1 allocs/op
BenchmarkAggregateAll-8                  500   2.3 ms/op      0 B/op    0 allocs/op
BenchmarkFilterAndAggregate-8            300   4.8 ms/op      0 B/op    0 allocs/op
BenchmarkAggregateByCustomer-8           100   12.5 ms/op  320000 B/op    1 allocs/op

Benchmark methodology: Results from Apple M1 Pro, Go 1.22, 1M rows pre-loaded, benchmarked with go test -bench=. -benchmem -benchtime=100x. Your results will vary based on CPU (cache size, SIMD capabilities), memory speed, and Go compiler version. The relative speedups (columnar vs row-oriented) remain consistent across platforms.

Analysis:

Append: 118ns per transaction (8.5M transactions/second)
AggregateAll: 2.3ms for 1M rows = 2.3μs per 1000 rows (zero allocations!)
FilterAndAggregate: 4.8ms for 1M rows = scanning + filtering overhead
AggregateByCustomer: 12.5ms for 1M rows = map allocation cost (320KB)

Comparison: Row vs Columnar Storage

Row-Oriented Baseline

type RowStore struct {
	Transactions []Transaction
}

func (rs *RowStore) AggregateAll() AggregationResult {
	var totalVolume storage.Decimal
	var totalFees storage.Decimal
	var totalPrice storage.Decimal
	minPrice := rs.Transactions[0].Price
	maxPrice := rs.Transactions[0].Price

	for _, t := range rs.Transactions {
		volume := t.Price.MulInt(t.Quantity)
		totalVolume = totalVolume.Add(volume)
		totalFees = totalFees.Add(t.Fees)
		totalPrice = totalPrice.Add(t.Price)

		if t.Price < minPrice {
			minPrice = t.Price
		}
		if t.Price > maxPrice {
			maxPrice = t.Price
		}
	}

	// ... return result
}

Benchmarks (1M transactions):

Row-oriented:     42 ms/op    0 B/op    0 allocs/op
Columnar:         2.3 ms/op   0 B/op    0 allocs/op

Speedup: 18.3×

Note: This 18× speedup is illustrative based on the architectural differences (cache efficiency, sequential access). Real-world speedups depend on data characteristics (row width, column selectivity), hardware (CPU cache hierarchy, memory bandwidth), and query patterns (single-column vs multi-column aggregations). Expect 5-50× speedups for analytical queries in practice.

Why 18× faster?

Cache efficiency: Columnar loads only relevant data (prices, quantities)
Sequential access: CPU prefetcher works optimally on contiguous arrays
Auto-vectorization: Compiler can SIMD-optimize tight loops on primitive arrays
No pointer chasing: Row-oriented may have indirection through transaction structs

Trade-offs and Production Considerations

When to Use Columnar Storage

Good fit:

Analytics workloads: Aggregations over many rows, few columns
Read-heavy systems: Data written once, read many times
Time-series data: Financial ticks, metrics, logs
OLAP queries: SUM, AVG, MIN, MAX, GROUP BY

Poor fit:

OLTP systems: Frequent updates to individual records
Wide queries: SELECT * fetching all columns (no advantage)
Small datasets: <100K rows (overhead not worth it)
High write concurrency: Appending to multiple slices requires coordination

Memory Management and Data Ingestion

Preallocating capacity avoids repeated slice reallocations:

// Preallocate capacity to avoid repeated reallocations
store := NewColumnStore(10_000_000)  // Reserve space for 10M rows

// For append-heavy workloads, use batch inserts
transactions := make([]Transaction, 1000)
// ... fill transactions
store.AppendBatch(transactions)  // Amortizes allocation cost

Dynamic growth challenges: If you don’t know the final size upfront, Go slices will automatically grow by doubling capacity. However, at large scales (e.g., growing from 5M to 10M rows), this causes:

Memory spike: Allocates 10M-row slice before copying 5M rows from old slice
GC pressure: Old 5M-row slice becomes garbage, triggering collection
Latency spike: Copying 5M rows takes milliseconds, blocking new writes

Solutions for unbounded ingestion:

Chunked storage: Store data in fixed-size chunks (e.g., 1M rows each)

type ChunkedStore struct {
    chunks []*ColumnStore  // Each chunk is 1M rows
}

func (cs *ChunkedStore) Append(t Transaction) {
    lastChunk := cs.chunks[len(cs.chunks)-1]
    if lastChunk.Len >= 1_000_000 {
        // Create new chunk
        cs.chunks = append(cs.chunks, NewColumnStore(1_000_000))
    }
    lastChunk.Append(t)
}

func (cs *ChunkedStore) AggregateAll() AggregationResult {
    // Aggregate across all chunks (parallelizable!)
    var result AggregationResult
    for _, chunk := range cs.chunks {
        chunkResult := chunk.AggregateAll()
        result = result.Merge(chunkResult)
    }
    return result
}

Memory-mapped files: Write columns to disk, use mmap() to page data on-demand

// Allocate large file upfront
f, _ := os.Create("prices.dat")
f.Truncate(10_000_000 * 8)  // 10M int64 values

// Memory-map file
data, _ := syscall.Mmap(int(f.Fd()), 0, 10_000_000*8,
    syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)

// Cast to []int64 (unsafe, but controlled)
prices := (*[10_000_000]int64)(unsafe.Pointer(&data[0]))[:]

Append-only logs: Write new data to append-only files, compact periodically
- Real-time: Append to in-memory buffer (hot data)
- Background: Flush buffer to columnar files every N seconds
- Query: Merge results from files + in-memory buffer

Recommendation: For financial data with known daily volume (e.g., 10M transactions/day), preallocate daily capacity. For unbounded streams, use chunked storage with 1M-row chunks.

Compression Opportunities

Columnar storage enables efficient compression:

Run-length encoding (RLE): For sorted columns

CustomerIDs: [1001, 1001, 1001, 1002, 1002]
→ [(1001, 3), (1002, 2)]

Dictionary encoding: For low-cardinality strings

Tickers: ["AAPL", "AAPL", "GOOGL", "AAPL"]
→ Dict: {AAPL: 0, GOOGL: 1}
→ Encoded: [0, 0, 1, 0]

Delta encoding: For timestamps

Timestamps: [1609459200, 1609459201, 1609459205]
→ Base: 1609459200, Deltas: [0, 1, 5]

Persistence and Durability

// Save column store to disk
func (cs *ColumnStore) SaveToFile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Use efficient binary encoding (e.g., protobuf, Cap'n Proto)
	encoder := gob.NewEncoder(f)
	return encoder.Encode(cs)
}

// Load column store from disk
func LoadFromFile(path string) (*ColumnStore, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var cs ColumnStore
	decoder := gob.NewDecoder(f)
	if err := decoder.Decode(&cs); err != nil {
		return nil, err
	}

	return &cs, nil
}

Production consideration: For multi-gigabyte stores, use memory-mapped files with columnar layout on disk. This allows the OS to page data in/out as needed without loading the entire dataset.

Case Study: Real-Time Portfolio Analytics

Consider a wealth management platform serving portfolio metrics to 50,000 clients. Each client has 10-500 holdings across stocks, bonds, and ETFs. The system must calculate real-time profit/loss, sector exposure, and risk metrics for HTTP dashboard requests.

Initial Architecture (Row-Oriented PostgreSQL)

SELECT
  customer_id,
  SUM(quantity * current_price) as market_value,
  SUM(quantity * (current_price - purchase_price)) as unrealized_pnl
FROM holdings
WHERE customer_id = ?
GROUP BY customer_id

Problems:

Database query takes 150-200ms per customer (round-trip + query execution)
At 1000 requests/minute during market hours, database CPU at 85%
No sub-100ms latency SLA achievable
Scaling requires expensive read replicas

Migration to In-Memory Columnar Store

By loading all holdings into a columnar store and serving from memory:

type HoldingsStore struct {
	CustomerIDs      []uint32
	Tickers          []uint16  // Dictionary-encoded
	Quantities       []int32
	PurchasePrices   []Decimal
	CurrentPrices    []Decimal  // Updated real-time from market feed
}

func (hs *HoldingsStore) CalculatePortfolio(customerID uint32) PortfolioMetrics {
	var marketValue Decimal
	var unrealizedPnL Decimal

	// Scan only relevant columns
	for i := 0; i < hs.Len; i++ {
		if hs.CustomerIDs[i] != customerID {
			continue
		}

		mv := hs.CurrentPrices[i].MulInt(hs.Quantities[i])
		cost := hs.PurchasePrices[i].MulInt(hs.Quantities[i])
		pnl := mv.Sub(cost)

		marketValue = marketValue.Add(mv)
		unrealizedPnL = unrealizedPnL.Add(pnl)
	}

	return PortfolioMetrics{
		MarketValue:   marketValue,
		UnrealizedPnL: unrealizedPnL,
	}
}

Expected improvements (illustrative, based on architectural advantages):

Latency: 150-200ms → 0.5-2ms (up to 100× improvement for cache-resident data)
Throughput: 1000 req/min → 60,000 req/min on single server (CPU-bound becomes achievable)
Memory: 25M holdings × 32 bytes/holding = 800MB (fits in RAM on modern servers)
Cost: 3 PostgreSQL read replicas ($900/month) → 1 Go server ($150/month)

Caveats: Actual improvements depend on data access patterns, cache hit rates, and whether the working set fits in CPU cache (L3 typically 8-64MB). Initial cold-start queries may show smaller speedups until data is cached.

Key optimization: Market prices updated once per second via streaming feed. All 50,000 client portfolios are recalculated in <50ms by scanning the CurrentPrices column and updating a cached aggregation map.

Advanced: Bitmap Indexes for Fast Filtering

For categorical columns with low cardinality (e.g., ticker symbols, sectors), bitmap indexes provide O(1) filtering.

type BitmapIndex struct {
	// Map from value to bitmap of matching row indices
	Index map[string]*roaring.Bitmap  // Using RoaringBitmap library
}

func (cs *ColumnStore) BuildTickerIndex() *BitmapIndex {
	index := &BitmapIndex{
		Index: make(map[string]*roaring.Bitmap),
	}

	for i := 0; i < cs.Len; i++ {
		ticker := cs.Tickers[i]
		if index.Index[ticker] == nil {
			index.Index[ticker] = roaring.New()
		}
		index.Index[ticker].Add(uint32(i))
	}

	return index
}

func (cs *ColumnStore) AggregateWithIndex(ticker string, index *BitmapIndex) AggregationResult {
	bitmap := index.Index[ticker]
	if bitmap == nil {
		return AggregationResult{}
	}

	var totalVolume Decimal
	// ... iterate only over bitmap.ToArray() indices
	for _, i := range bitmap.ToArray() {
		volume := cs.Prices[i].MulInt(cs.Quantities[i])
		totalVolume = totalVolume.Add(volume)
	}

	// ... return result
}

Performance: Filtering 1M rows for ticker = “AAPL” (20% selectivity):

Without index: 4.8ms (scan all rows)
With bitmap index: 1.2ms (scan only matching rows)

Conclusion: Aligning Data Layout with CPU Architecture

When a financial system scans millions of transactions to calculate a single SUM, the bottleneck isn’t arithmetic—modern CPUs execute billions of additions per second. The bottleneck is memory bandwidth: fetching data from RAM to CPU. Row-oriented storage loads 500MB to use 120MB, wasting 76% of precious memory bandwidth. Columnar storage loads exactly what’s needed, fits in CPU cache, and lets SIMD instructions process 4-8 values per cycle.

Building columnar storage in Go requires understanding the alignment between data structures and hardware reality. Go’s slices are already cache-friendly contiguous arrays. Fixed-point decimals eliminate floating-point precision errors. Type safety prevents mixing prices with quantities. The combination creates systems that are both correct and fast—financial calculations that complete in microseconds, not milliseconds.

Key takeaways:

Memory layout determines performance: Columnar storage achieves 100% cache utilization for analytical queries
Fixed-point arithmetic is non-negotiable: Financial calculations require exact decimal precision
SIMD is often automatic: Go’s compiler auto-vectorizes tight loops on primitive slices
Measure before optimizing: Profile to confirm columnar storage benefits your workload
Trade-offs exist: Columnar storage excels at analytics, struggles with OLTP updates

Potential outcomes:

Portfolio analytics: 100× latency reduction (200ms → 2ms)
Market data aggregation: 18× throughput increase on same hardware
Risk calculations: Process 10M positions in <50ms for real-time risk monitoring

The future of financial data systems isn’t complex distributed databases. It’s simple, cache-aligned data structures that exploit how CPUs actually work. Go’s performance and safety guarantees make this practical for production systems handling billions of dollars.