Cache warming and memory access

Recently I have become interested in the field of high-frequency trading (HFT), especially after watching this amazing presentation by Carl Cook.

I didn’t have much exposure to HFT before this video, but some quick digging turns up the following summary:

High-frequency trading (HFT) is a trading method that uses powerful computer programs to transact a large number of orders in fractions of a second. HFT uses complex algorithms to analyze multiple markets and execute orders based on market conditions. (source: Investopedia)

In order to submit orders to the exchange as soon as possible and reduce latency, developers use a number of optimization techniques.

Mr. Cook introduced some of these optimization techniques in his presentation, and in this post and the following ones I am planning to write about what I have learned.

Memory types

Computer memory comes in various types, each serving different purposes and characterized by different properties such as speed, volatility, and cost.

| Memory Type  | Speed                   | Size (Typical Capacity) | Cost (Per GB) |
|--------------|-------------------------|-------------------------|---------------|
| Registers    | Fastest (nanoseconds)   | Bytes                   | Very High     |
| Cache Memory | Very Fast (nanoseconds) | Kilobytes to Megabytes  | High          |
| L1 Cache     | Fastest (2-4 cycles)    | 16KB - 64KB             | High          |
| L2 Cache     | Faster (10-20 cycles)   | 256KB - 1MB             | High          |
| L3 Cache     | Fast (20-50 cycles)     | 4MB - 64MB              | High          |
| RAM (DRAM)   | Fast (nanoseconds)      | 4GB - 128GB             | Moderate      |
| SSD          | Fast (microseconds)     | 128GB - 4TB             | Low           |
  • Registers: Fastest, smallest capacity, highest cost. Located inside the CPU, providing the fastest access speed (in nanoseconds). They are used for immediate data processing.
  • Cache Memory: Extremely fast, intermediate capacity, high cost. Located closest to the CPU cores.
  • RAM: Fast, large capacity, moderate cost. Used as the main working memory for programs and data.
  • SSD: Fast (but slower than RAM), very large capacity, lower cost per GB than RAM.

Cache Memory

The primary purpose of cache memory is to reduce the time the CPU takes to access data from the main memory (RAM). By storing copies of frequently accessed data and instructions, the cache allows the CPU to retrieve this information quickly without having to access the slower RAM repeatedly.

Working Mechanism of Cache

Cache Hit: When the CPU looks for data or an instruction in the cache and finds it, this is called a cache hit. This results in faster data retrieval.

Cache Miss: When the CPU looks for data in the cache and doesn’t find it, this is called a cache miss. The CPU then retrieves the data from the main memory, which takes more time, and stores a copy in the cache for future access.

Cache warming

Cache warming means pre-loading data into the cache before it is actually needed by the application. This technique helps reduce the initial latency that occurs when a system starts up or when a new task begins, ensuring that the most frequently accessed data is already available in the cache.
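
To make the idea concrete, here is a minimal sketch (the struct and function names are hypothetical, not taken from Mr. Cook's talk): a latency-sensitive application can walk its hot data once at startup so that the first real request does not pay the cold-cache penalty.

#include <cstddef>
#include <vector>

// Hypothetical hot data: the price levels the strategy reads on every tick.
struct PriceLevel {
  double price;
  double quantity;
};

// Touch every element once so the data is resident in cache before the
// critical path starts handling real requests.
void WarmCache(const std::vector<PriceLevel> &book) {
  volatile double sink = 0;  // volatile keeps the reads from being optimized away
  for (std::size_t i = 0; i < book.size(); ++i) {
    sink = sink + book[i].price + book[i].quantity;
  }
}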

Let’s take the following code snippet as an example.

#include <benchmark/benchmark.h>
#include <cstdlib>
#include <vector>

constexpr int dataSize = 10000000;
std::vector<double> data(dataSize);
std::vector<int> indices(dataSize);

static void SumRandom(benchmark::State &state) {
  // Generate random indices
  for (auto &index : indices) {
    index = rand() % dataSize;
  }
  for (auto _ : state) {
    double sum = 0;
    // Access data in random order
    for (int i = 0; i < dataSize; ++i) {
      benchmark::DoNotOptimize(sum += data[indices[i]]);
    }
    benchmark::ClobberMemory();
  }
}

In this code we use the Google Benchmark library to compare the different functions. We start with a vector data of dataSize = 10000000 doubles, and another vector indices of the same size, which we fill with random index values.

The function SumRandom sums the vector’s data in a random order, which causes many cache misses.
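
Note that the snippet above only defines the benchmark. To actually run it, the function has to be registered and a main() generated; with Google Benchmark that looks roughly like this (the file name and build command are assumptions about my setup):

// Register the benchmark and generate a main() that runs it.
BENCHMARK(SumRandom)->Unit(benchmark::kMillisecond);
BENCHMARK_MAIN();

// Built with something like:
//   g++ -O2 cache_warming.cpp -lbenchmark -lpthread -o cache_warming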

Running the benchmark for this function on my PC gives the following results.

--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
SumRandom                  422 ms          415 ms            2

Now let’s try to modify this function and access the data in the vector sequentially.

static void SumUniform(benchmark::State &state) {
  // Warm the cache by accessing the data once in sequential order
  double sum_warm = 0;
  for (int i = 0; i < dataSize; ++i) {
    benchmark::DoNotOptimize(sum_warm += data[i]);
  }
  for (auto _ : state) {
    double sum = 0;
    // Access data in sequential order
    for (int i = 0; i < dataSize; ++i) {
      benchmark::DoNotOptimize(sum += data[i]);
    }
    benchmark::ClobberMemory();
  }
}
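
The new function would be registered in the same way (again a sketch using the Google Benchmark API shown above):

// Register the sequential-access benchmark as well.
BENCHMARK(SumUniform)->Unit(benchmark::kMillisecond);

In both functions, benchmark::DoNotOptimize and benchmark::ClobberMemory keep the compiler from optimizing the summation loops away, so the measured time reflects the actual memory accesses.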

Rerunning the benchmark now gives the following.

--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
SumRandom                  450 ms          438 ms            2
SumUniform                 40.3 ms         40.2 ms           18

We can see how much faster sequential memory access is: the processing time dropped by roughly 90%.

This is thanks to the CPU cache and spatial locality: when a particular memory location is accessed, nearby memory locations are likely to be accessed soon. Modern CPUs and memory systems are optimized for this pattern, loading contiguous blocks of memory (cache lines) into the cache and thus reducing cache misses.
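
To illustrate spatial locality on its own (a hypothetical example, separate from the benchmark above), consider summing an N x N matrix stored in one contiguous vector. Walking it row by row touches consecutive addresses, while walking it column by column jumps far apart in memory on every access:

#include <cstddef>
#include <vector>

constexpr std::size_t N = 4096;

// Row-major traversal: consecutive iterations touch consecutive addresses,
// so every cache line fetched from memory is fully used before moving on.
double SumRowMajor(const std::vector<double> &m) {
  double sum = 0;
  for (std::size_t row = 0; row < N; ++row)
    for (std::size_t col = 0; col < N; ++col)
      sum += m[row * N + col];
  return sum;
}

// Column-major traversal: consecutive iterations jump N * sizeof(double) bytes,
// so nearly every access misses the cache and pulls in a new cache line.
double SumColMajor(const std::vector<double> &m) {
  double sum = 0;
  for (std::size_t col = 0; col < N; ++col)
    for (std::size_t row = 0; row < N; ++row)
      sum += m[row * N + col];
  return sum;
}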

References

  1. Investopedia: High-Frequency Trading (HFT)
  2. C++ Design Patterns for Low-Latency Applications Including High-Frequency Trading
This post is licensed under CC BY 4.0 by the author.
