A simple memcpy loop with a stride of 8, 16, or 32 bytes and no L1 cache misses: what stalls backend cycles?

I am trying to understand the CPU cache performance in a single producer single consumer queue algorithm, but cannot pinpoint the cause of performance degradation in some cases. The following simplified test program runs with practically no L1 cache misses, but nevertheless it spends many cycles stalled in the CPU backend when its memory access pattern is somewhat sparser. What causes the CPU backend to stall in such cases, when there are practically no L1 misses? What should be measured to pinpoint the cause?

My CPU is an AMD Ryzen 5 PRO 4650G, which is Zen 2 “Renoir” with 192 KiB of L1d cache in total (32 KiB, 8-way, per core). The memcpy test program:

// demo_memcpy_test_speed-gap.c
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <cstring>

#define PACKET_SIZE 8 // 16 32 // <--- it is really the stride of the memcpy over the mem array
#define SIZE_TO_MEMCPY 8 // memcpy only the first 8 bytes of the "packet"

const static long long unsigned n_packets = 512; // use few packets, to fit in L2 etc
static long long unsigned repeat    = 1000*1000 * 2 * 2; // repeat many times to get enough stats in perf

const static long long unsigned n_max_data_bytes = n_packets * PACKET_SIZE;

#define CACHE_LINE_SIZE 64 // align explicitly just in case
alignas(CACHE_LINE_SIZE) uint8_t data_in  [n_max_data_bytes];
alignas(CACHE_LINE_SIZE) uint8_t data_out [n_packets][PACKET_SIZE];

int main(int argc, char* argv[]) {
printf("memcpy_test.c standard\n");

// warm up the memory,
// i.e. touch it to make sure Linux has set up the virtual memory tables
memset(data_in,  0, sizeof(data_in));
memset(data_out, 0, sizeof(data_out));

printf("\nrun memcpy\n");

long long unsigned n_bytes_copied = 0;
long long unsigned memcpy_ops     = 0;
clock_t start_setup = clock();
for (unsigned rep=0; rep<repeat; rep++) {
    uint8_t* data_in_ptr  = &data_in [0];

    for (unsigned long long i_packet=0; i_packet<n_packets; i_packet++) {
        // copy only SIZE_TO_MEMCPY bytes of the in data array to the out

        uint8_t* data_out_ptr = &(data_out [i_packet][0]);

        memcpy(data_out_ptr, data_in_ptr, SIZE_TO_MEMCPY*sizeof(uint8_t));
        n_bytes_copied += SIZE_TO_MEMCPY;
        memcpy_ops++;

        data_in_ptr += PACKET_SIZE;
    }
}
clock_t end_setup = clock();
double cpu_time_used_setup = ((double) (end_setup - start_setup)) / CLOCKS_PER_SEC;
printf("memcpy() took %f seconds to execute\n", cpu_time_used_setup);

printf("%f Mops\n",   memcpy_ops/(1000000*cpu_time_used_setup));
printf("%llu bytes\n",  n_bytes_copied);
printf("%f Mbytes/s\n", n_bytes_copied/(1000000*cpu_time_used_setup));

} // end of main

It’s built with -O1 to get a really efficient loop with memcpy:

g++ -g -O1 ./demo_memcpy_test_speed-gap.c

The instructions of the memcpy loop, as seen in perf report’s annotate view:

sudo perf record -F 999 -e stalled-cycles-backend -- ./a.out
sudo perf report
...select main

With PACKET_SIZE set to 8, the code is really efficient:

       │     for (unsigned long long i_packet=0; i_packet<n_packets; i_packet++) {
       │11b:┌─→mov      %rbx,%rax
       │    │memcpy():
       │11e:│  mov      (%rcx,%rax,8),%rdx
100.00 │    │  mov      %rdx,(%rsi,%rax,8)
       │    │main():
       │    │  add      $0x1,%rax
       │    │  cmp      $0x200,%rax
       │    │↑ jne      11e
       │    │for (unsigned rep=0; rep<repeat; rep++) {
       │    │  sub      $0x1,%edi
       │    └──jne      11b

With PACKET_SIZE set to 1024 (the code for 256 is the same, except with add $0x100 instead of $0x400):

       │       lea      _end,%rsi
       │140:┌─→mov      %rbp,%rdx
       │    │
       │    │  lea      data_in,%rax
       │    │
       │    │__fortify_function void *
       │    │__NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
       │    │size_t __len))
       │    │{
       │    │return __builtin___memcpy_chk (__dest, __src, __len,
       │14a:│  mov      (%rax),%rcx
       │    │memcpy():
 96.31 │    │  mov      %rcx,(%rdx)
       │    │
  1.81 │    │  add      $0x400,%rax
  0.20 │    │  add      $0x400,%rdx
  1.12 │    │  cmp      %rsi,%rax
  0.57 │    │↑ jne      14a
       │    │  sub      $0x1,%edi
       │    └──jne      140

I run it with PACKET_SIZE set to 8, 16, 32, and other values. The perf counts for 8 (first listing) and 32 (second listing):

sudo perf stat -e task-clock,instructions,cycles,stalled-cycles-frontend,stalled-cycles-backend \
      -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches \
      -e l2_cache_accesses_from_dc_misses,l2_cache_hits_from_dc_misses,l2_cache_misses_from_dc_misses \
      -- ./a.out

 Performance counter stats for './a.out':

            503.43 msec task-clock                       #    0.998 CPUs utilized
    10,323,618,071      instructions                     #    4.79  insn per cycle
                                                  #    0.01  stalled cycles per insn     (29.11%)
     2,154,694,815      cycles                           #    4.280 GHz                         (29.91%)
         5,148,993      stalled-cycles-frontend          #    0.24% frontend cycles idle        (30.70%)
        55,922,538      stalled-cycles-backend           #    2.60% backend cycles idle         (30.99%)
     4,091,862,625      L1-dcache-loads                  #    8.128 G/sec                       (30.99%)
            24,211      L1-dcache-load-misses            #    0.00% of all L1-dcache accesses   (30.99%)
            18,745      L1-dcache-prefetches             #   37.234 K/sec                       (30.37%)
            30,749      l2_cache_accesses_from_dc_misses #   61.079 K/sec                       (29.57%)
            21,046      l2_cache_hits_from_dc_misses     #   41.805 K/sec                       (28.78%)
             9,095      l2_cache_misses_from_dc_misses   #   18.066 K/sec                       (28.60%)

 Performance counter stats for './a.out':

            832.83 msec task-clock                       #    0.999 CPUs utilized
    12,289,501,297      instructions                     #    3.46  insn per cycle
                                                  #    0.11  stalled cycles per insn     (29.42%)
     3,549,297,932      cycles                           #    4.262 GHz                         (29.64%)
         5,552,837      stalled-cycles-frontend          #    0.16% frontend cycles idle        (30.12%)
     1,349,663,970      stalled-cycles-backend           #   38.03% backend cycles idle         (30.25%)
     4,144,875,512      L1-dcache-loads                  #    4.977 G/sec                       (30.25%)
           772,968      L1-dcache-load-misses            #    0.02% of all L1-dcache accesses   (30.24%)
           539,481      L1-dcache-prefetches             #  647.767 K/sec                       (30.25%)
           532,879      l2_cache_accesses_from_dc_misses #  639.839 K/sec                       (30.24%)
           461,131      l2_cache_hits_from_dc_misses     #  553.690 K/sec                       (30.04%)
            14,485      l2_cache_misses_from_dc_misses   #   17.392 K/sec                       (29.55%)

There is a slight increase in L1 cache misses: from 0.00% at a PACKET_SIZE of 8 up to 0.02% at 32 bytes. But can that justify the jump in backend stalls from 2.6% to 38%? If not, what else stalls the CPU backend?

I understand that a larger stride means the memcpy loop moves from one L1 cache line to the next sooner. But if those lines are already in the cache, and there are practically no L1 miss events, as reported by perf, then why would accessing a different cache line stall the backend? Is it something about how the CPU issues instructions in parallel? Maybe it cannot issue instructions that access different cache lines simultaneously?

Runs with PACKET_SIZE up to 1024 bytes are shown on the following somewhat busy plot:

(Plot: https://i.stack.imgur.com/98nRo.jpg)

The numbers on the data points show the PACKET_SIZE parameter of the run, i.e. the stride of the memcpy access pattern. The X axis is millions of operations per second (Mops), where one “operation” = one memcpy. The Y axis shows the metrics from perf: the percentage of L1 accesses that missed, and the percentage of cycles stalled in the backend and frontend.

In all these runs, the L2 accesses practically never miss, i.e. the l2_cache_misses_from_dc_misses metric is always very low. Just for completeness: according to Ian Cutress on AnandTech (“AMD Zen 2 Microarchitecture Analysis: Ryzen 3000 and EPYC Rome”, amd-zen-2-microarchitecture-analysis-ryzen-3000-and-epyc-rome/11), the Zen 2 L1 latency is 4 cycles and the L2 latency is 12 cycles.

I am not sure why the frontend gets stalled, but that’s what perf reports, and I believe it is real, because the effect of frontend stalls differs from backend stalls. Compare the runs with PACKET_SIZE 256 and 1024 on the plot: they have about the same L1 miss rate; 256 has about 77% of cycles stalled in the backend and 0% in the frontend; 1024 is the opposite, 77% stalled in the frontend and 0% in the backend. Yet 1024 is much slower, because it issues far fewer instructions per cycle: about 0.42 IPC in the 1024 run versus 1.28 in the 256 run.

So, the CPU issues fewer instructions per cycle when it is stalled in the frontend rather than in the backend. I guess that’s just how the frontend and backend work, i.e. the backend can run more in parallel. It would be appreciated if someone could confirm or correct this guess. However, a more important question: why does the frontend get stalled at all? The frontend is supposed to just fetch and decode the instructions, and the assembly does not really change between PACKET_SIZE 256 and 1024. Then what makes the frontend stall more at a stride of 1024 than at 256?

A plot of IPC versus Mops for all the PACKET_SIZE runs:

(Plot: https://i.stack.imgur.com/zy678.png)

The run with PACKET_SIZE 8 is slightly off the line towards more Mops, i.e. faster than the trend of the other values. That must be because of the more efficient inner-loop instructions shown in the disassembly above.


Off the top of my head, this seems like it could be an issue with cache associativity. Power-of-two strides can be problematic.

Your L1d cache is 32 KiB, 8-way associative, with 64-byte lines: that is 32768 / (64 × 8) = 64 sets, so every 64th cache line (every 4 KiB) maps to the same set, and a set can hold only 8 lines at a time. The larger the power-of-two stride, the fewer distinct sets your accesses touch. (Your L3 is 16-way associative and has the same property with its own, larger set count.)

So with a stride of 1024, all of your strided accesses are hitting the same small subset of cache sets. Does the performance stop degrading as you increase the stride to 2048, 4096, etc.?

If you haven’t read it yet, I would highly recommend Ulrich Drepper’s “What Every Programmer Should Know About Memory”.