Speeding up random memory access using prefetching
This is the second post in a series about optimizations. If you haven’t read part 1 yet, check it out here: Cache warming and memory access.
In this post, we will explore a very useful compiler built-in function called __builtin_prefetch.
From the official GCC documentation:
This function is used to minimize cache-miss latency by moving data into a cache before it is accessed. You can insert calls to __builtin_prefetch into code for which you know addresses of data in memory that is likely to be accessed soon. If the target supports them, data prefetch instructions are generated. If the prefetch is done early enough before the access then the data will be in the cache by the time it is accessed.
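The first argument is the address of the data to prefetch; two optional constant arguments (covered later in this post) control read/write intent and expected temporal locality. Here is a minimal sketch of what a call looks like; SumWithHint is a made-up function and the look-ahead distance of 4 is an arbitrary value for illustration:

#include <cstddef>
#include <vector>

// Sketch of the call syntax. The two optional arguments default to
// 0 (read) and 3 (high temporal locality).
int SumWithHint(const std::vector<int> &v) {
  int sum = 0;
  for (std::size_t i = 0; i < v.size(); ++i) {
    // For sequential access like this the hardware prefetcher usually
    // already does the job; this only demonstrates the syntax.
    if (i + 4 < v.size()) {
      __builtin_prefetch(&v[i + 4]);
    }
    sum += v[i];
  }
  return sum;
}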
In the previous post, we showed two examples of memory access: one accessed memory sequentially, the other in random order.
We showed that the sequential memory access was much faster thanks to the CPU cache and spatial locality: the CPU was loading the next chunk of data into cache memory in advance.
In the random memory access example, the CPU was still loading neighbouring data into its cache, but most of the time it was loading the wrong data and suffering cache misses.
Using this built-in prefetch function, we can help the CPU by giving it a hint about the data we expect to access next.
Let’s take the example from the previous tutorial and add a new function to it, SumRandomPreFetch, which uses this built-in function.
#include <benchmark/benchmark.h>
#include <cstdlib> // for rand()
#include <vector>

constexpr int kSize = 10000000;
std::vector<int> data(kSize);
std::vector<int> indices(kSize);

static void SumRandom(benchmark::State &state) {
  // Generate random indices (reused by SumRandomPreFetch below)
  for (auto &index : indices) {
    index = rand() % kSize;
  }
  for (auto _ : state) {
    int sum = 0;
    // Access data in random order
    for (int i = 0; i < kSize; ++i) {
      benchmark::DoNotOptimize(sum += data[indices[i]]);
    }
    benchmark::ClobberMemory();
  }
}

static void SumUniform(benchmark::State &state) {
  for (auto _ : state) {
    int sum = 0;
    // Access data in sequential order
    for (int i = 0; i < kSize; ++i) {
      benchmark::DoNotOptimize(sum += data[i]);
    }
    benchmark::ClobberMemory();
  }
}

static void SumRandomPreFetch(benchmark::State &state) {
  for (auto _ : state) {
    int sum = 0;
    // Access data in random order, hinting the next address ahead of time
    for (int i = 0; i < kSize; ++i) {
      if (i + 1 < kSize) {
        __builtin_prefetch(&data[indices[i + 1]], 0, 0); // prefetch the next element in advance
      }
      benchmark::DoNotOptimize(sum += data[indices[i]]);
    }
    benchmark::ClobberMemory();
  }
}

// Register the functions as benchmarks
BENCHMARK(SumRandom)->Unit(benchmark::kMillisecond);
BENCHMARK(SumUniform)->Unit(benchmark::kMillisecond);
BENCHMARK(SumRandomPreFetch)->Unit(benchmark::kMillisecond);

// Run the benchmarks
BENCHMARK_MAIN();
Inside the for-loop, we added a call to __builtin_prefetch to prefetch the next element, given its memory address.
if (i + 1 < kSize) {
  __builtin_prefetch(&data[indices[i + 1]], 0, 0); // prefetch the next element in advance
}
The second argument specifies whether the access will be a read (0) or a write (1). Since we are only reading the data, we pass 0.
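If we were modifying the elements instead of just reading them, we could pass 1 so the cache line is fetched with the intent to write. A hypothetical variant of our loop:

// Hypothetical write variant: the second argument 1 means "prefetch for writing".
for (int i = 0; i < kSize; ++i) {
  if (i + 1 < kSize) {
    __builtin_prefetch(&data[indices[i + 1]], 1, 0);
  }
  data[indices[i]] += 1; // write to each randomly indexed element
}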
The last argument is the locality. The GCC documentation explains it as follows:
A value of zero means that the data has no temporal locality, so it need not be left in the cache after the access. A value of three means that the data has a high degree of temporal locality and should be left in all levels of cache possible. Values of one and two mean, respectively, a low or moderate degree of temporal locality. The default is three.
Since the data access in our case is totally random and no temporal locality is expected, we use a value of 0 here.
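For contrast, if the same elements were going to be read again soon, a higher locality value would ask the CPU to keep them in cache. For example:

__builtin_prefetch(&data[indices[i + 1]], 0, 0); // streamed once: no need to keep it cached
__builtin_prefetch(&data[indices[i + 1]], 0, 3); // reused soon: keep it in all cache levels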
Now let’s run our benchmark and see the performance gain we get.
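To reproduce the numbers below, the benchmark can be compiled and run roughly like this (a sketch assuming Google Benchmark is installed system-wide; the file name is a placeholder and flags may differ on your machine):

g++ -O2 prefetch_bench.cpp -lbenchmark -lpthread -o prefetch_bench
./prefetch_bench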
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
SumRandom                  358 ms          344 ms            2
SumUniform                33.4 ms         31.9 ms           29
SumRandomPreFetch          298 ms          293 ms            2
As we can see, compared to the SumRandom version the runtime of SumRandomPreFetch dropped from 358 ms to 298 ms, saving roughly 17% of the processing time with this small code tweak.
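One knob worth experimenting with is the prefetch distance: prefetching only one element ahead may not give the memory system enough time to finish the load before we need the data. A sketch of the inner loop that looks further ahead; the distance of 8 is an arbitrary starting point and needs benchmarking on your own hardware:

constexpr int kPrefetchDistance = 8; // hypothetical value, tune by benchmarking
for (int i = 0; i < kSize; ++i) {
  if (i + kPrefetchDistance < kSize) {
    __builtin_prefetch(&data[indices[i + kPrefetchDistance]], 0, 0);
  }
  benchmark::DoNotOptimize(sum += data[indices[i]]);
}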