Speeding up random memory access using prefetching
This is the second post in a series about optimizations. If you haven’t read part 1 yet, check it out here: Cache warming and memory access.
In this post, we will explore a very useful compiler built-in function called __builtin_prefetch.
From the official GCC documentation:
This function is used to minimize cache-miss latency by moving data into a cache before it is accessed. You can insert calls to __builtin_prefetch into code for which you know addresses of data in memory that is likely to be accessed soon. If the target supports them, data prefetch instructions are generated. If the prefetch is done early enough before the access then the data will be in the cache by the time it is accessed.
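The first argument is the address of the data to prefetch; two optional constant arguments (covered later in this post) control read/write intent and expected temporal locality. Here is a minimal sketch of what a call looks like; SumWithHint is a made-up function and the look-ahead distance of 4 is an arbitrary value for illustration:

#include <cstddef>
#include <vector>

// Sketch of the call syntax. The two optional arguments default to
// 0 (read) and 3 (high temporal locality).
int SumWithHint(const std::vector<int> &v) {
  int sum = 0;
  for (std::size_t i = 0; i < v.size(); ++i) {
    // For sequential access like this the hardware prefetcher usually
    // already does the job; this only demonstrates the syntax.
    if (i + 4 < v.size()) {
      __builtin_prefetch(&v[i + 4]);
    }
    sum += v[i];
  }
  return sum;
}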
In the previous post, we showed two examples of memory access: one accessed memory sequentially, the other in random order.
We showed that the sequential memory access was much faster thanks to the CPU cache and spatial locality: the CPU was loading the next chunk of data into cache memory in advance.
In the random memory access example, the CPU was still loading neighbouring data into its cache, but most of the time it was loading the wrong data and suffering cache misses.
Using this built-in prefetch function, we can help the CPU by giving it a hint about the data we expect to access next.
Let’s take the example from the previous tutorial and add a new function to it, SumRandomPreFetch, which uses this built-in function.
#include <benchmark/benchmark.h>
#include <cstdlib> // for rand()
#include <vector>

constexpr int kSize = 10000000;
std::vector<int> data(kSize);
std::vector<int> indices(kSize);

static void SumRandom(benchmark::State &state) {
  // Generate random indices (reused by SumRandomPreFetch below)
  for (auto &index : indices) {
    index = rand() % kSize;
  }
  for (auto _ : state) {
    int sum = 0;
    // Access data in random order
    for (int i = 0; i < kSize; ++i) {
      benchmark::DoNotOptimize(sum += data[indices[i]]);
    }
    benchmark::ClobberMemory();
  }
}

static void SumUniform(benchmark::State &state) {
  for (auto _ : state) {
    int sum = 0;
    // Access data in sequential order
    for (int i = 0; i < kSize; ++i) {
      benchmark::DoNotOptimize(sum += data[i]);
    }
    benchmark::ClobberMemory();
  }
}

static void SumRandomPreFetch(benchmark::State &state) {
  for (auto _ : state) {
    int sum = 0;
    // Access data in random order, hinting the next address ahead of time
    for (int i = 0; i < kSize; ++i) {
      if (i + 1 < kSize) {
        __builtin_prefetch(&data[indices[i + 1]], 0, 0); // prefetch the next element in advance
      }
      benchmark::DoNotOptimize(sum += data[indices[i]]);
    }
    benchmark::ClobberMemory();
  }
}

// Register the functions as benchmarks
BENCHMARK(SumRandom)->Unit(benchmark::kMillisecond);
BENCHMARK(SumUniform)->Unit(benchmark::kMillisecond);
BENCHMARK(SumRandomPreFetch)->Unit(benchmark::kMillisecond);

// Run the benchmarks
BENCHMARK_MAIN();
Inside the for-loop, we added a call to __builtin_prefetch to prefetch the next element, given its memory address.
if (i + 1 < kSize) {
  __builtin_prefetch(&data[indices[i + 1]], 0, 0); // prefetch the next element in advance
}
The second argument specifies whether the access will be a read (0) or a write (1). Since we are only reading the data, we pass 0.
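If we were modifying the elements instead of just reading them, we could pass 1 so the cache line is fetched with the intent to write. A hypothetical variant of our loop:

// Hypothetical write variant: the second argument 1 means "prefetch for writing".
for (int i = 0; i < kSize; ++i) {
  if (i + 1 < kSize) {
    __builtin_prefetch(&data[indices[i + 1]], 1, 0);
  }
  data[indices[i]] += 1; // write to each randomly indexed element
}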
The last argument is the locality. The GCC documentation explains it as follows:
A value of zero means that the data has no temporal locality, so it need not be left in the cache after the access. A value of three means that the data has a high degree of temporal locality and should be left in all levels of cache possible. Values of one and two mean, respectively, a low or moderate degree of temporal locality. The default is three.
Since the data access in our case is totally random and no temporal locality is expected, we use a value of 0 here.
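For contrast, if the same elements were going to be read again soon, a higher locality value would ask the CPU to keep them in cache. For example:

__builtin_prefetch(&data[indices[i + 1]], 0, 0); // streamed once: no need to keep it cached
__builtin_prefetch(&data[indices[i + 1]], 0, 3); // reused soon: keep it in all cache levels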
Now let’s run our benchmark and see the performance gain we get.
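To reproduce the numbers below, the benchmark can be compiled and run roughly like this (a sketch assuming Google Benchmark is installed system-wide; the file name is a placeholder and flags may differ on your machine):

g++ -O2 prefetch_bench.cpp -lbenchmark -lpthread -o prefetch_bench
./prefetch_bench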
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
SumRandom                  358 ms          344 ms            2
SumUniform                33.4 ms         31.9 ms           29
SumRandomPreFetch          298 ms          293 ms            2
As we can see, compared to the SumRandom version the runtime of SumRandomPreFetch dropped from 358 ms to 298 ms, saving roughly 17% of the processing time with this small code tweak.
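One knob worth experimenting with is the prefetch distance: prefetching only one element ahead may not give the memory system enough time to finish the load before we need the data. A sketch of the inner loop that looks further ahead; the distance of 8 is an arbitrary starting point and needs benchmarking on your own hardware:

constexpr int kPrefetchDistance = 8; // hypothetical value, tune by benchmarking
for (int i = 0; i < kSize; ++i) {
  if (i + kPrefetchDistance < kSize) {
    __builtin_prefetch(&data[indices[i + kPrefetchDistance]], 0, 0);
  }
  benchmark::DoNotOptimize(sum += data[indices[i]]);
}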