Cache warming and memory access
Recently I have grown interested in the field of high-frequency trading (HFT), especially after watching this amazing presentation by Carl Cook.
I didn’t have much exposure to HFT before this video, but some quick digging turns up the following summary.
High-frequency trading (HFT) is a trading method that uses powerful computer programs to transact a large number of orders in fractions of a second. HFT uses complex algorithms to analyze multiple markets and execute orders based on market conditions. link source
In order to submit exchange orders as soon as possible and reduce latency, developers use a number of optimization techniques.
Mr. Cook introduced some of these optimization techniques in his presentation, and I'm planning to cover what I have learned in this post and the following ones.
Memory types
Computer memory comes in various types, each serving different purposes and characterized by different properties such as speed, volatility, and cost.
Memory Type | Speed | Size (Typical Capacity) | Cost (Per GB) |
---|---|---|---|
Registers | Fastest (sub-nanosecond) | Bytes | Very High |
Cache Memory | Very Fast (nanoseconds) | Kilobytes to Megabytes | High |
L1 Cache | Fastest (2-4 cycles) | 16KB - 64KB | High |
L2 Cache | Faster (10-20 cycles) | 256KB - 1MB | High |
L3 Cache | Fast (20-50 cycles) | 4MB - 64MB | High |
RAM (DRAM) | Slower (tens of nanoseconds) | 4GB - 128GB | Moderate |
SSD | Slowest here (microseconds) | 128GB - 4TB | Low (cheaper than RAM) |
- Registers: Fastest, smallest capacity, highest cost. Located inside the CPU, providing the fastest access speed (sub-nanosecond). They are used for immediate data processing.
- Cache Memory: Extremely fast, intermediate capacity, high cost. Located closest to the CPU cores.
- RAM: Fast, large capacity, moderate cost. Used as the main working memory for programs and data.
- SSD: Slower than RAM, very large capacity, much cheaper per GB than RAM. Used for persistent storage.
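If you are curious what these numbers look like on your own machine, you can query the sizes the kernel reports. The sketch below assumes Linux with glibc, where the _SC_LEVEL*_CACHE_SIZE constants are available as a non-standard extension; on other platforms they may be missing or return 0.
#include <cstdio>
#include <unistd.h>

int main() {
    // Query the cache sizes glibc exposes for the current CPU (glibc extension).
    long l1   = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long l2   = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long l3   = sysconf(_SC_LEVEL3_CACHE_SIZE);
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::printf("L1d: %ld KB, L2: %ld KB, L3: %ld KB, cache line: %ld B\n",
                l1 / 1024, l2 / 1024, l3 / 1024, line);
    return 0;
}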
Cache Memory
The primary purpose of cache memory is to reduce the time the CPU takes to access data from the main memory (RAM). By storing copies of frequently accessed data and instructions, the cache allows the CPU to retrieve this information quickly without having to access the slower RAM repeatedly.
Working Mechanism of Cache
Cache Hit: When the CPU looks for data or an instruction in the cache and finds it, this is called a cache hit. This results in faster data retrieval.
Cache Miss: When the CPU looks for data in the cache and doesn’t find it, this is called a cache miss. The CPU then retrieves the data from the main memory, which takes more time, and stores a copy in the cache for future access.
Cache warming
Cache warming means pre-loading data into the cache before it is actually needed by the application. This technique helps reduce the initial latency that occurs when a system starts up or when a new task begins, ensuring that the most frequently accessed data is readily available in the cache.
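As a toy illustration of my own (not from the talk), imagine a trading process that keeps a price table it only reads when an order arrives. Touching that table while the process is otherwise idle keeps it resident in cache, so the first real access after a quiet period is less likely to miss. The PriceTable name and the sizes below are made up for the example.
#include <cstddef>
#include <vector>

// Hypothetical lookup table consulted on the critical path.
struct PriceTable {
    std::vector<double> prices = std::vector<double>(4096);
};

// Read one element per cache line so the whole table stays resident.
// 64 bytes is a typical cache-line size, i.e. 8 doubles per line.
inline void WarmCache(const PriceTable &table) {
    volatile double sink = 0;
    for (std::size_t i = 0; i < table.prices.size(); i += 8) {
        sink = sink + table.prices[i];
    }
}

// In the idle loop (pseudo-code): while (no_work_pending()) WarmCache(table);
// so that when an order finally arrives, the lookups hit the cache.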
Let’s take the following code snippet as an example.
#include <benchmark/benchmark.h>
#include <cstdlib>
#include <vector>

constexpr int dataSize = 10000000;
std::vector<double> data(dataSize);
std::vector<int> indices(dataSize);

static void SumRandom(benchmark::State &state) {
  // Generate random indices
  for (auto &index : indices) {
    index = rand() % dataSize;
  }
  for (auto _ : state) {
    double sum = 0;
    // Access data in random order
    for (int i = 0; i < dataSize; ++i) {
      benchmark::DoNotOptimize(sum += data[indices[i]]);
    }
    benchmark::ClobberMemory();
  }
}
In this code we use the Google Benchmark library to compare the different functions. We start by creating a vector data of size dataSize = 10000000, and another vector of random indices with the same size.
The function SumRandom sums the vector’s data in a random order, which causes many cache misses.
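To actually run it, the function has to be registered with the library and the program linked against it. A minimal setup, assuming Google Benchmark is installed system-wide on Linux (the file name is just a placeholder), looks roughly like this:
// Register the function and generate a main() that runs all registered benchmarks.
BENCHMARK(SumRandom);
BENCHMARK_MAIN();

// Build (link against the library and pthread):
//   g++ -O2 cache_bench.cpp -lbenchmark -lpthread -o cache_bench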
Running the benchmark for this function on my PC gives the following results:
--------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------
SumRandom 422 ms 415 ms 2
Now let’s modify this function to access the data in the vector sequentially.
static void SumUniform(benchmark::State &state) {
  // Warm the cache by accessing the data sequentially before the timed loop
  double sum_warm = 0;
  for (int i = 0; i < dataSize; ++i) {
    benchmark::DoNotOptimize(sum_warm += data[i]);
  }
  for (auto _ : state) {
    double sum = 0;
    // Access data in sequential order
    for (int i = 0; i < dataSize; ++i) {
      benchmark::DoNotOptimize(sum += data[i]);
    }
    benchmark::ClobberMemory();
  }
}
Note that the warm-up loop runs before the for (auto _ : state) loop, so it does not count toward the measured time. Now let's rerun the benchmark.
--------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------
SumRandom 450 ms 438 ms 2
SumUniform 40.3 ms 40.2 ms 18
We can see how much faster the sequential memory access is: the processing time dropped by almost 90%.
This is thanks to the CPU cache and spatial locality, which means that if a particular memory location is accessed, nearby memory locations are likely to be accessed soon. Modern CPUs and memory systems are optimized for this pattern, loading contiguous blocks of memory (cache lines) into the cache, thus reducing cache misses.
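A classic way to see spatial locality at work (a separate toy example of mine, not part of the benchmark above) is traversing a 2D array stored row by row: iterating along rows walks memory contiguously and uses every byte of each cache line, while iterating down columns jumps a full row ahead on every step and wastes most of each line it loads.
#include <vector>

constexpr int kRows = 4096;  // sizes chosen arbitrarily for the example
constexpr int kCols = 4096;

// Row-major traversal: consecutive iterations touch adjacent addresses,
// so each cache line fetched from memory is fully used (good locality).
double SumRowMajor(const std::vector<double> &m) {
    double sum = 0;
    for (int r = 0; r < kRows; ++r)
        for (int c = 0; c < kCols; ++c)
            sum += m[r * kCols + c];
    return sum;
}

// Column-major traversal of the same row-major data: each iteration jumps
// kCols doubles ahead, so a freshly loaded cache line is mostly wasted.
double SumColMajor(const std::vector<double> &m) {
    double sum = 0;
    for (int c = 0; c < kCols; ++c)
        for (int r = 0; r < kRows; ++r)
            sum += m[r * kCols + c];
    return sum;
}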