I just rewatched Wendell's video and tested the performance of memcpy on a 3970X:
Source code is here:
Compile (gcc 8.4.0; newer versions give the same result):
gcc -march=native -O3 testmem_modified.c -o tm64
32 MB = 1.787191 ms
Now use the optimized SSE function:
32 MB = 1.131438 ms
That is much faster (roughly 1.58x), which shouldn't be the case.
perf record shows 99% in __memmove_avx_unaligned_erms for the slow case.
Now my question is: is this specific to my system (maybe a misoptimized glibc), or is this a general problem?
My glibc version is 2.30 with the Clear Linux patches applied.
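
For anyone who wants to try a comparison like this on their own machine, here is a minimal sketch of the idea. It is not the testmem_modified.c from above: the names copy_libc, copy_sse2_stream and time_copy_ms are made up for illustration, and the use of non-temporal (streaming) stores is only an assumption about what an "optimized SSE" copy might look like. It assumes x86-64, glibc, and a destination buffer that is 16-byte aligned.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#endif
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Common signature so both copies can be timed by the same helper. */
typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Plain libc copy; on the system above this ends up in
 * __memmove_avx_unaligned_erms. */
static void copy_libc(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Hypothetical SSE2 copy using non-temporal stores (an assumption,
 * not the function from the original test program).
 * dst must be 16-byte aligned; src may be unaligned.
 * A tail shorter than 16 bytes is copied with memcpy. */
static void copy_sse2_stream(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = n / 16;

    for (size_t i = 0; i < blocks; i++) {
        __m128i v = _mm_loadu_si128(s + i);  /* unaligned load */
        _mm_stream_si128(d + i, v);          /* store that bypasses the cache */
    }
    _mm_sfence();  /* order the streaming stores before later accesses */

    memcpy((char *)dst + blocks * 16, (const char *)src + blocks * 16, n % 16);
}

/* Average wall-clock time of one copy in milliseconds over reps runs.
 * One untimed warm-up call faults the pages in first. */
static double time_copy_ms(copy_fn fn, void *dst, const void *src,
                           size_t n, int reps)
{
    struct timespec t0, t1;

    fn(dst, src, n);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        fn(dst, src, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3
              + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    return ms / reps;
}
```

To use it, allocate two 32 MB buffers with aligned_alloc(64, 32u << 20), fill the source, and call time_copy_ms once with copy_libc and once with copy_sse2_stream. Averaging over several repetitions should make the numbers less noisy than a single run, but the absolute values will depend on the CPU, memory configuration, and glibc build.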