Default memcpy performance (AVX) on Threadripper 3 bad?

I just rewatched Wendell's video and tested the performance of memcpy on a 3970X:


Source code is here:

Compile (gcc 8.4.0; newer versions give the same result):
gcc -march=native -O3 testmem_modified.c -o tm64
32 MB = 1.787191 ms
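
For anyone who wants to reproduce this kind of measurement, here is a minimal timing sketch. It is not the original testmem_modified.c; the function name `time_memcpy_ms`, the repetition count, and the choice of `CLOCK_MONOTONIC` are my own assumptions.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not from the original test program):
   returns the best time in milliseconds over `reps` memcpy calls
   of `bytes` bytes, or -1.0 on bad input or allocation failure. */
double time_memcpy_ms(size_t bytes, int reps)
{
    if (bytes == 0 || reps <= 0)
        return -1.0;

    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    double best = -1.0;
    for (int i = 0; i < reps; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, src, bytes);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3
                  + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        if (best < 0.0 || ms < best)
            best = ms;
    }

    /* Read from the destination so the copies have an observable use. */
    volatile char sink = dst[bytes - 1];
    (void)sink;

    free(src);
    free(dst);
    return best;
}
```

Calling `time_memcpy_ms(32u * 1024 * 1024, 10)` from `main` and printing the result with `printf("32 MB = %f ms\n", ...)` gives output in the same format as above. Taking the minimum over several runs is just one way to reduce noise.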

Now use the optimized SSE function:
32 MB = 1.131438 ms
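
The optimized SSE function itself is not shown in this thread. As a rough illustration only, a hand-written SSE2 copy could look like the sketch below. It assumes the routine uses non-temporal (streaming) stores, which the thread does not confirm, and the name `copy_sse2_stream` is hypothetical. It is x86-only.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes using 16-byte SSE2 loads and
   non-temporal stores. This is an assumption about what an
   "optimized SSE" copy might do, not the code from the test program. */
void copy_sse2_stream(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Copy a short head with memcpy until dst is 16-byte aligned,
       because _mm_stream_si128 requires an aligned destination. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Main loop: unaligned load, non-temporal (cache-bypassing) store. */
    for (; n >= 16; n -= 16, d += 16, s += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
    }

    /* Order the streamed stores before any later stores. */
    _mm_sfence();

    /* Copy the remaining tail (fewer than 16 bytes). */
    memcpy(d, s, n);
}
```

Whether streaming stores explain the gap measured here is a guess on my part; `perf` would be the way to check.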

That is much faster, which shouldn’t be the case.

perf record shows 99% of the time in __memmove_avx_unaligned_erms for the slow case.
Now my question: is this specific to my system (maybe a misoptimized glibc), or is this a general problem?

My glibc version is 2.30 with Clear Linux patches applied.


OK, I found the “problem”:
Disabling the avx_unaligned_erms and avx_unaligned functions in glibc fixes the performance of normal memcpy.
I just commented out this stuff in glibc-2.30/sysdeps/x86_64/multiarch/ifunc-memmove.h:
[image: screenshot of the commented-out code]

Now I get:
32 MB = 1.123270 ms
with normal memcpy.


Can anyone test this on their system?
This is most likely also a problem on Ryzen 3000.

I recently found that the problem persists for 32-bit compiles, so it is most likely gcc-related. I reported a bug:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95435


The gcc bug is now fixed in git:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=809b4d226c7f5ded392a88ffafe8d652f911b473
