Level1 Diagnostic: Fixing our Memcpy Troubles (for Looking Glass) | Level One Techs

By the way, are you modifying the C file to switch between libc and my memcpy? I’m interested in the comparison between my copy and yours. I’m not 100% sure these results aren’t inlined memcpy vs. called memcpy. Can you clarify which one you are testing?

Edit: you answered, thanks.

I will agree the results are admirable only if my memcpy is similarly slow, hehe :)

It does look like the 8320 is doing better not inlined. My experience was similar, but with libc it was still slower on TR than on the 8700K.

Certainly :) I meant in comparison to the builtins, which a few of the test results show to be about twice as slow as libc.

The macOS results are also utter insanity. That CPU should take no more than 2 ms with the cheapest memory ever, haha.

But you have to admit more is going on here than you first thought, I think, because the results are fascinating.

memcpy_sse

diff --git a/testmem_modified.c b/testmem_modified.c
index 4b9af0f..9f1084f 100644
--- a/testmem_modified.c
+++ b/testmem_modified.c
@@ -63,9 +63,8 @@ for(int i = 0; i < s; i++) {


     uint64_t t = nanotime();
-    void *(*memcpy_ptr)(void *, const void *, size_t) = memcpy;
     for(volatile int i = 0; i < 1000; ++i)
-       memcpy_ptr(buffer1, buffer2, size );
+       memcpy_sse(buffer1, buffer2, size );
     printf("%2u MB = %f ms\n", s, ((float)(nanotime() - t) / 1000.0f) / 1000000.0f);
     printf("-Compare match (should be zero): %2u \n\n", memcmp(buffer1,buffer2,size)) ;
     free(buffer1);

Linux 4.16.9-1-ARCH

model name : AMD FX™-8320 Eight-Core Processor

GCC:

[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m64 testmem_modified.c -o tm64-sse
[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m32 testmem_modified.c -o tm32-sse
[ryan@fx8320-arch memcpy_sse]$ ./tm64-sse 32
32 MB = 4.419465 ms
-Compare match (should be zero):  0

[ryan@fx8320-arch memcpy_sse]$ ./tm32-sse 32
32 MB = 4.415379 ms
-Compare match (should be zero):  0

Clang:

[ryan@fx8320-arch memcpy_sse]$ clang -O3 -march=native -m64 testmem_modified.c -o ctm64-sse
[ryan@fx8320-arch memcpy_sse]$ clang -O3 -march=native -m32 testmem_modified.c -o ctm32-sse
[ryan@fx8320-arch memcpy_sse]$ ./ctm64-sse 32
32 MB = 9.013206 ms
-Compare match (should be zero):  0

[ryan@fx8320-arch memcpy_sse]$ ./ctm32-sse 32
32 MB = 9.017775 ms
-Compare match (should be zero):  0

Clang, what are you doing?

macOS 10.13.4

Intel® Core™ i7-6700HQ CPU @ 2.60GHz

Clang:

i7-macos ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -o tm64-sse
i7-macos ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -o tm32-sse
i7-macos ➜  memcpy_sse git:(master) ✗ ./tm64-sse 32
32 MB = 2.849157 ms
-Compare match (should be zero):  0

i7-macos ➜  memcpy_sse git:(master) ✗ ./tm32-sse 32
32 MB = 2.753566 ms
-Compare match (should be zero):  0

Well how about that!

Edit: The above result seemed too good to be true, so I reran the tests a few more times and the following result is a far more representative outcome:

i7-macos ➜  memcpy_sse git:(master) ✗ ./tm64-sse 32
32 MB = 5.779583 ms
-Compare match (should be zero):  0

i7-macos ➜  memcpy_sse git:(master) ✗ ./tm32-sse 32
32 MB = 5.549712 ms
-Compare match (should be zero):  0

I have noticed this a few times where the first run has been unusually fast and later runs take a bit longer but are very consistent.

FreeBSD 11.1-RELEASE-p10

CPU: AMD FX-8370 Eight-Core Processor (4013.68-MHz K8-class CPU)

Clang:

fx-freebsd ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -o tm64-sse
fx-freebsd ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -o tm32-sse
fx-freebsd ➜  memcpy_sse git:(master) ✗ ./tm64-sse 32
32 MB = 4.969570 ms
-Compare match (should be zero):  0

fx-freebsd ➜  memcpy_sse git:(master) ✗ ./tm32-sse 32
32 MB = 5.059393 ms
-Compare match (should be zero):  0

GCC:

fx-freebsd ➜  memcpy_sse git:(master) ✗ gcc -O3 -march=native -m64 testmem_modified.c -o gtm64-sse
fx-freebsd ➜  memcpy_sse git:(master) ✗ ./gtm64-sse 32
32 MB = 4.955077 ms
-Compare match (should be zero):  0

Not surprisingly, the SSE code is considerably faster than the non-SSE libc.

FreeBSD 11.2-BETA2

CPU: Intel® Xeon® CPU E3-1275 v3 @ 3.50GHz (3491.98-MHz K8-class CPU)

Clang:

xeon-freebsd ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -o tm64-sse
xeon-freebsd ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -o tm32-sse
xeon-freebsd ➜  memcpy_sse git:(master) ✗ ./tm64-sse 32
32 MB = 4.927730 ms
-Compare match (should be zero):  0

xeon-freebsd ➜  memcpy_sse git:(master) ✗ ./tm32-sse 32
32 MB = 4.963394 ms
-Compare match (should be zero):  0

GCC:

xeon-freebsd ➜  memcpy_sse git:(master) ✗ gcc7 -O3 -march=native -m64 testmem_modified.c -o gtm64-sse
xeon-freebsd ➜  memcpy_sse git:(master) ✗ ./gtm64-sse 32
32 MB = 3.350286 ms
-Compare match (should be zero):  0

Interestingly, the SSE code is slower than the non-SSE libc here when compiled by Clang, and on par with libc when compiled by GCC! I re-ran both versions of the tests several times to be satisfied these weren’t outliers.

Is it by any chance single-channel memory on the Mac? That’s just way too slow for a 6700. 1866 memory, maybe?

I never had the 8320, but I am almost positive the old 8350 I had was on the order of 2-3 ms for this, not 5. I wasn’t doing exactly this type of work with it way back when, though, so perhaps I’m mistaken about that.

The Xeon E3 is also much slower than I’ve experienced with the same class of CPU, except perhaps if it is in single-channel mode.

Very interesting results. And I’m very glad it’s consistent between 32- and 64-bit mode no matter how the user happens to compile it. Still, I expected it to be faster. I think something is still up with your hardware, and I should test some of my own similar hardware.

Also, just to clarify, are you saying the macOS libc memcpy is not 5/6 ms as above?

And is the lesson here:

Libc – it’s probably more busted than you think. Lol?

At this point it should be said that the method being used for timing these operations is not ideal. CLOCK_MONOTONIC_RAW is not a good indicator of how much time a program spends on CPU, because it includes all the time the OS spends doing things besides running your program (running other programs, for example).

Diffs

testmem_cputime.c
--- testmem_modified.c	2018-06-01 19:10:10.948411000 -0700
+++ testmem_cputime.c	2018-06-01 18:38:16.329042000 -0700
@@ -44,7 +44,7 @@
     {

       struct timespec time;
-      clock_gettime(CLOCK_MONOTONIC_RAW, &time);
+      clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time);
       return ((uint64_t)time.tv_sec * 1e9) + time.tv_nsec;
     }

@@ -63,9 +63,8 @@


     uint64_t t = nanotime();
-    void *(*memcpy_ptr)(void *, const void *, size_t) = memcpy;
     for(volatile int i = 0; i < 1000; ++i)
-       memcpy_ptr(buffer1, buffer2, size );
+       memcpy(buffer1, buffer2, size );
     printf("%2u MB = %f ms\n", s, ((float)(nanotime() - t) / 1000.0f) / 1000000.0f);
     printf("-Compare match (should be zero): %2u \n\n", memcmp(buffer1,buffer2,size)) ;
     free(buffer1);

testmem_cputime_sse.c
--- testmem_modified.c	2018-06-01 19:10:10.948411000 -0700
+++ testmem_cputime_sse.c	2018-06-01 18:38:16.330004000 -0700
@@ -44,7 +44,7 @@
     {

       struct timespec time;
-      clock_gettime(CLOCK_MONOTONIC_RAW, &time);
+      clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time);
       return ((uint64_t)time.tv_sec * 1e9) + time.tv_nsec;
     }

@@ -63,9 +63,8 @@


     uint64_t t = nanotime();
-    void *(*memcpy_ptr)(void *, const void *, size_t) = memcpy;
     for(volatile int i = 0; i < 1000; ++i)
-       memcpy_ptr(buffer1, buffer2, size );
+       memcpy_sse(buffer1, buffer2, size );
     printf("%2u MB = %f ms\n", s, ((float)(nanotime() - t) / 1000.0f) / 1000000.0f);
     printf("-Compare match (should be zero): %2u \n\n", memcmp(buffer1,buffer2,size)) ;
     free(buffer1);

Linux 4.16.9-1-ARCH (GCC)

model name : AMD FX™-8320 Eight-Core Processor

[ryan@fx8320-arch memcpy_sse]$ for src in testmem_cputime*.c; do
    echo "${src} (64-bit)"
    echo "---------------"
    gcc -O3 -march=native -m64 ${src} -o tm64
    for i in $(seq 10); do ./tm64 32; done
    echo "${src} (32-bit)"
    echo "---------------"
    gcc -O3 -march=native -m32 ${src} -o tm32
    for i in $(seq 10); do ./tm32 32; done
    echo "${src} (32-bit, -fno-builtin-memcpy)"
    echo "------------------------------------"
    gcc -O3 -march=native -m32 -fno-builtin-memcpy ${src} -o tm32
    for i in $(seq 10); do ./tm32 32; done
done
Results

testmem_cputime.c (64-bit)

32 MB = 5.241058 ms
-Compare match (should be zero): 0

32 MB = 5.243259 ms
-Compare match (should be zero): 0

32 MB = 5.266191 ms
-Compare match (should be zero): 0

32 MB = 5.240136 ms
-Compare match (should be zero): 0

32 MB = 5.256821 ms
-Compare match (should be zero): 0

32 MB = 5.266900 ms
-Compare match (should be zero): 0

32 MB = 5.234083 ms
-Compare match (should be zero): 0

32 MB = 5.251327 ms
-Compare match (should be zero): 0

32 MB = 5.249514 ms
-Compare match (should be zero): 0

32 MB = 5.244353 ms
-Compare match (should be zero): 0

testmem_cputime.c (32-bit)

32 MB = 8.323980 ms
-Compare match (should be zero): 0

32 MB = 8.328946 ms
-Compare match (should be zero): 0

32 MB = 8.329352 ms
-Compare match (should be zero): 0

32 MB = 8.296982 ms
-Compare match (should be zero): 0

32 MB = 8.326883 ms
-Compare match (should be zero): 0

32 MB = 8.336661 ms
-Compare match (should be zero): 0

32 MB = 8.329819 ms
-Compare match (should be zero): 0

32 MB = 8.321874 ms
-Compare match (should be zero): 0

32 MB = 8.324215 ms
-Compare match (should be zero): 0

32 MB = 8.321898 ms
-Compare match (should be zero): 0

testmem_cputime.c (32-bit, -fno-builtin-memcpy)

32 MB = 4.249210 ms
-Compare match (should be zero): 0

32 MB = 4.238623 ms
-Compare match (should be zero): 0

32 MB = 4.223117 ms
-Compare match (should be zero): 0

32 MB = 4.246448 ms
-Compare match (should be zero): 0

32 MB = 4.232831 ms
-Compare match (should be zero): 0

32 MB = 4.211716 ms
-Compare match (should be zero): 0

32 MB = 4.236586 ms
-Compare match (should be zero): 0

32 MB = 4.208664 ms
-Compare match (should be zero): 0

32 MB = 4.215388 ms
-Compare match (should be zero): 0

32 MB = 4.238823 ms
-Compare match (should be zero): 0

testmem_cputime_sse.c (64-bit)

32 MB = 4.416759 ms
-Compare match (should be zero): 0

32 MB = 4.412868 ms
-Compare match (should be zero): 0

32 MB = 4.419092 ms
-Compare match (should be zero): 0

32 MB = 4.413001 ms
-Compare match (should be zero): 0

32 MB = 4.413273 ms
-Compare match (should be zero): 0

32 MB = 4.411419 ms
-Compare match (should be zero): 0

32 MB = 4.407824 ms
-Compare match (should be zero): 0

32 MB = 4.406849 ms
-Compare match (should be zero): 0

32 MB = 4.399800 ms
-Compare match (should be zero): 0

32 MB = 4.407373 ms
-Compare match (should be zero): 0

testmem_cputime_sse.c (32-bit)

32 MB = 4.420913 ms
-Compare match (should be zero): 0

32 MB = 4.410238 ms
-Compare match (should be zero): 0

32 MB = 4.411091 ms
-Compare match (should be zero): 0

32 MB = 4.486512 ms
-Compare match (should be zero): 0

32 MB = 4.414333 ms
-Compare match (should be zero): 0

32 MB = 4.408229 ms
-Compare match (should be zero): 0

32 MB = 4.408923 ms
-Compare match (should be zero): 0

32 MB = 4.411462 ms
-Compare match (should be zero): 0

32 MB = 4.410039 ms
-Compare match (should be zero): 0

32 MB = 4.411212 ms
-Compare match (should be zero): 0

testmem_cputime_sse.c (32-bit, -fno-builtin-memcpy)

32 MB = 4.422927 ms
-Compare match (should be zero): 0

32 MB = 4.479233 ms
-Compare match (should be zero): 0

32 MB = 4.405762 ms
-Compare match (should be zero): 0

32 MB = 4.414087 ms
-Compare match (should be zero): 0

32 MB = 4.411552 ms
-Compare match (should be zero): 0

32 MB = 4.476846 ms
-Compare match (should be zero): 0

32 MB = 4.485496 ms
-Compare match (should be zero): 0

32 MB = 4.407228 ms
-Compare match (should be zero): 0

32 MB = 4.416489 ms
-Compare match (should be zero): 0

32 MB = 4.413395 ms
-Compare match (should be zero): 0

FreeBSD 11.1-RELEASE-p10 (Clang)

CPU: AMD FX-8370 Eight-Core Processor (4013.68-MHz K8-class CPU)

fx-freebsd ➜  memcpy_sse git:(master) ✗ for src in testmem_cputime*.c; do
    echo "${src} (64-bit)"
    echo "---------------"
    cc -O3 -march=native -m64 ${src} -o tm64
    for i in $(seq 10); do ./tm64 32; done
    echo "${src} (32-bit)"
    echo "---------------"
    cc -O3 -march=native -m32 ${src} -o tm32
    for i in $(seq 10); do ./tm32 32; done
done
Results

testmem_cputime.c (64-bit)

32 MB = 9.364038 ms
-Compare match (should be zero): 0

32 MB = 9.469106 ms
-Compare match (should be zero): 0

32 MB = 9.737254 ms
-Compare match (should be zero): 0

32 MB = 9.648358 ms
-Compare match (should be zero): 0

32 MB = 9.649680 ms
-Compare match (should be zero): 0

32 MB = 9.740510 ms
-Compare match (should be zero): 0

32 MB = 9.662104 ms
-Compare match (should be zero): 0

32 MB = 9.837167 ms
-Compare match (should be zero): 0

32 MB = 9.789955 ms
-Compare match (should be zero): 0

32 MB = 9.747930 ms
-Compare match (should be zero): 0

testmem_cputime.c (32-bit)

32 MB = 9.453690 ms
-Compare match (should be zero): 0

32 MB = 9.738267 ms
-Compare match (should be zero): 0

32 MB = 9.645654 ms
-Compare match (should be zero): 0

32 MB = 9.653027 ms
-Compare match (should be zero): 0

32 MB = 9.742734 ms
-Compare match (should be zero): 0

32 MB = 10.306901 ms
-Compare match (should be zero): 0

32 MB = 9.746101 ms
-Compare match (should be zero): 0

32 MB = 9.655098 ms
-Compare match (should be zero): 0

32 MB = 9.553641 ms
-Compare match (should be zero): 0

32 MB = 9.412613 ms
-Compare match (should be zero): 0

testmem_cputime_sse.c (64-bit)

32 MB = 5.203243 ms
-Compare match (should be zero): 0

32 MB = 5.205019 ms
-Compare match (should be zero): 0

32 MB = 5.373679 ms
-Compare match (should be zero): 0

32 MB = 5.184111 ms
-Compare match (should be zero): 0

32 MB = 5.114539 ms
-Compare match (should be zero): 0

32 MB = 5.110372 ms
-Compare match (should be zero): 0

32 MB = 5.109356 ms
-Compare match (should be zero): 0

32 MB = 5.109775 ms
-Compare match (should be zero): 0

32 MB = 4.921534 ms
-Compare match (should be zero): 0

32 MB = 5.119770 ms
-Compare match (should be zero): 0

testmem_cputime_sse.c (32-bit)

32 MB = 5.210353 ms
-Compare match (should be zero): 0

32 MB = 5.206656 ms
-Compare match (should be zero): 0

32 MB = 5.395012 ms
-Compare match (should be zero): 0

32 MB = 5.113557 ms
-Compare match (should be zero): 0

32 MB = 5.204370 ms
-Compare match (should be zero): 0

32 MB = 5.305252 ms
-Compare match (should be zero): 0

32 MB = 5.218168 ms
-Compare match (should be zero): 0

32 MB = 5.209056 ms
-Compare match (should be zero): 0

32 MB = 5.298002 ms
-Compare match (should be zero): 0

32 MB = 5.203724 ms
-Compare match (should be zero): 0

macOS 10.13.4 (Clang)

Intel® Core™ i7-6700HQ CPU @ 2.60GHz

i7-macos ➜  memcpy_sse git:(master) ✗ for src in testmem_cputime*.c; do
    echo "${src} (64-bit)"
    echo "---------------"
    cc -O3 -march=native -m64 ${src} -o tm64
    for i in $(seq 10); do ./tm64 32; done
    echo "${src} (32-bit)"
    echo "---------------"
    cc -O3 -march=native -m32 ${src} -o tm32
    for i in $(seq 10); do ./tm32 32; done
done
Results

testmem_cputime.c (64-bit)

32 MB = 6.048658 ms
-Compare match (should be zero): 0

32 MB = 5.885129 ms
-Compare match (should be zero): 0

32 MB = 6.949056 ms
-Compare match (should be zero): 0

32 MB = 6.519306 ms
-Compare match (should be zero): 0

32 MB = 5.916329 ms
-Compare match (should be zero): 0

32 MB = 5.919647 ms
-Compare match (should be zero): 0

32 MB = 6.008809 ms
-Compare match (should be zero): 0

32 MB = 5.992583 ms
-Compare match (should be zero): 0

32 MB = 5.890260 ms
-Compare match (should be zero): 0

32 MB = 5.900751 ms
-Compare match (should be zero): 0

testmem_cputime.c (32-bit)

32 MB = 5.162855 ms
-Compare match (should be zero): 0

32 MB = 5.062851 ms
-Compare match (should be zero): 0

32 MB = 5.122910 ms
-Compare match (should be zero): 0

32 MB = 5.227316 ms
-Compare match (should be zero): 0

32 MB = 5.102281 ms
-Compare match (should be zero): 0

32 MB = 5.117700 ms
-Compare match (should be zero): 0

32 MB = 5.322892 ms
-Compare match (should be zero): 0

32 MB = 5.150164 ms
-Compare match (should be zero): 0

32 MB = 5.280897 ms
-Compare match (should be zero): 0

32 MB = 5.287386 ms
-Compare match (should be zero): 0

testmem_cputime_sse.c (64-bit)

32 MB = 4.031928 ms
-Compare match (should be zero): 0

32 MB = 3.895115 ms
-Compare match (should be zero): 0

32 MB = 3.913997 ms
-Compare match (should be zero): 0

32 MB = 3.871658 ms
-Compare match (should be zero): 0

32 MB = 3.965120 ms
-Compare match (should be zero): 0

32 MB = 3.860322 ms
-Compare match (should be zero): 0

32 MB = 3.876988 ms
-Compare match (should be zero): 0

32 MB = 3.967155 ms
-Compare match (should be zero): 0

32 MB = 3.985467 ms
-Compare match (should be zero): 0

32 MB = 3.888782 ms
-Compare match (should be zero): 0

testmem_cputime_sse.c (32-bit)

32 MB = 4.000218 ms
-Compare match (should be zero): 0

32 MB = 3.992098 ms
-Compare match (should be zero): 0

32 MB = 3.983450 ms
-Compare match (should be zero): 0

32 MB = 3.985040 ms
-Compare match (should be zero): 0

32 MB = 3.895962 ms
-Compare match (should be zero): 0

32 MB = 4.052533 ms
-Compare match (should be zero): 0

32 MB = 4.364831 ms
-Compare match (should be zero): 0

32 MB = 4.003279 ms
-Compare match (should be zero): 0

32 MB = 4.467057 ms
-Compare match (should be zero): 0

32 MB = 5.001040 ms
-Compare match (should be zero): 0

FreeBSD 11.2-BETA2 (Clang)

CPU: Intel® Xeon® CPU E3-1275 v3 @ 3.50GHz (3491.98-MHz K8-class CPU)

xeon-freebsd ➜  memcpy_sse git:(master) ✗ for src in testmem_cputime*.c; do
    echo "${src} (64-bit)"
    echo "---------------"
    cc -O3 -march=native -m64 ${src} -o tm64
    for i in $(seq 10); do ./tm64 32; done
    echo "${src} (32-bit)"
    echo "---------------"
    cc -O3 -march=native -m32 ${src} -o tm32
    for i in $(seq 10); do ./tm32 32; done
done
Results

testmem_cputime_sse.c (64-bit)

32 MB = 5.397115 ms
-Compare match (should be zero): 0

32 MB = 5.396561 ms
-Compare match (should be zero): 0

32 MB = 5.679435 ms
-Compare match (should be zero): 0

32 MB = 5.492458 ms
-Compare match (should be zero): 0

32 MB = 5.461031 ms
-Compare match (should be zero): 0

32 MB = 5.338261 ms
-Compare match (should be zero): 0

32 MB = 5.489692 ms
-Compare match (should be zero): 0

32 MB = 5.398943 ms
-Compare match (should be zero): 0

32 MB = 5.487649 ms
-Compare match (should be zero): 0

32 MB = 5.498127 ms
-Compare match (should be zero): 0

testmem_cputime_sse.c (32-bit)

32 MB = 5.586349 ms
-Compare match (should be zero): 0

32 MB = 5.398066 ms
-Compare match (should be zero): 0

32 MB = 5.399075 ms
-Compare match (should be zero): 0

32 MB = 5.394996 ms
-Compare match (should be zero): 0

32 MB = 5.582423 ms
-Compare match (should be zero): 0

32 MB = 5.516856 ms
-Compare match (should be zero): 0

32 MB = 5.777014 ms
-Compare match (should be zero): 0

32 MB = 5.403728 ms
-Compare match (should be zero): 0

32 MB = 5.405563 ms
-Compare match (should be zero): 0

32 MB = 5.398723 ms
-Compare match (should be zero): 0

testmem_cputime.c (64-bit)

32 MB = 3.977475 ms
-Compare match (should be zero): 0

32 MB = 3.976020 ms
-Compare match (should be zero): 0

32 MB = 4.095995 ms
-Compare match (should be zero): 0

32 MB = 3.886303 ms
-Compare match (should be zero): 0

32 MB = 3.888495 ms
-Compare match (should be zero): 0

32 MB = 3.982998 ms
-Compare match (should be zero): 0

32 MB = 3.883676 ms
-Compare match (should be zero): 0

32 MB = 3.929006 ms
-Compare match (should be zero): 0

32 MB = 4.082254 ms
-Compare match (should be zero): 0

32 MB = 4.077366 ms
-Compare match (should be zero): 0

testmem_cputime.c (32-bit)

32 MB = 3.887418 ms
-Compare match (should be zero): 0

32 MB = 3.978700 ms
-Compare match (should be zero): 0

32 MB = 3.978148 ms
-Compare match (should be zero): 0

32 MB = 4.162944 ms
-Compare match (should be zero): 0

32 MB = 4.218827 ms
-Compare match (should be zero): 0

32 MB = 4.071517 ms
-Compare match (should be zero): 0

32 MB = 4.266001 ms
-Compare match (should be zero): 0

32 MB = 4.173596 ms
-Compare match (should be zero): 0

32 MB = 4.303510 ms
-Compare match (should be zero): 0

32 MB = 3.973035 ms
-Compare match (should be zero): 0

I’m not sure which of the times macOS threw me for a loop you’re referring to. In the first post I had originally observed a high outlier around 9 ms, and after further testing the results settled around 5/6 ms. In the second post I was saying that the 3 ms result with memcpy_sse was a low outlier I couldn’t reproduce, and 5/6 ms again was typical.

Here’s the i7-macos memory configuration:

system_profiler SPMemoryDataType
Memory:

    Memory Slots:

      ECC: Disabled
      Upgradeable Memory: No

        BANK 0/DIMM0:

          Size: 8 GB
          Type: LPDDR3
          Speed: 2133 MHz
          Status: OK
          Manufacturer: 0x802C
          Part Number: 0x4D5435324C31473332443450472D30393320
          Serial Number: -

        BANK 1/DIMM0:

          Size: 8 GB
          Type: LPDDR3
          Speed: 2133 MHz
          Status: OK
          Manufacturer: 0x802C
          Part Number: 0x4D5435324C31473332443450472D30393320
          Serial Number: -

I assume each bank is on a separate channel…

And the xeon-freebsd memory configuration:

sudo dmidecode -t memory
# dmidecode 3.1
# SMBIOS entry point at 0x000f04c0
Found SMBIOS entry point in EFI, reading table from /dev/mem.
SMBIOS 2.7 present.

Handle 0x004B, DMI type 16, 23 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Single-bit ECC
	Maximum Capacity: 32 GB
	Error Information Handle: Not Provided
	Number Of Devices: 4

Handle 0x004C, DMI type 17, 34 bytes
Memory Device
	Array Handle: 0x004B
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 8192 MB
	Form Factor: DIMM
	Set: None
	Locator: P1-DIMMA1
	Bank Locator: P0_Node0_Channel0_Dimm0
	Type: DDR3
	Type Detail: Synchronous
	Speed: 1600 MT/s
	Manufacturer: Kingston
	Serial Number: C9279C16
	Asset Tag: 9876543210
	Part Number: 9965525-024.A00LF
	Rank: 2
	Configured Clock Speed: 1600 MT/s

Handle 0x004D, DMI type 17, 34 bytes
Memory Device
	Array Handle: 0x004B
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 8192 MB
	Form Factor: DIMM
	Set: None
	Locator: P1-DIMMA2
	Bank Locator: P0_Node0_Channel0_Dimm1
	Type: DDR3
	Type Detail: Synchronous
	Speed: 1600 MT/s
	Manufacturer: Kingston
	Serial Number: CE278A16
	Asset Tag: 9876543210
	Part Number: 9965525-024.A00LF
	Rank: 2
	Configured Clock Speed: 1600 MT/s

Handle 0x004E, DMI type 17, 34 bytes
Memory Device
	Array Handle: 0x004B
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 8192 MB
	Form Factor: DIMM
	Set: None
	Locator: P1-DIMMB1
	Bank Locator: P0_Node0_Channel1_Dimm0
	Type: DDR3
	Type Detail: Synchronous
	Speed: 1600 MT/s
	Manufacturer: Kingston
	Serial Number: CF278C16
	Asset Tag: 9876543210
	Part Number: 9965525-024.A00LF
	Rank: 2
	Configured Clock Speed: 1600 MT/s

Handle 0x004F, DMI type 17, 34 bytes
Memory Device
	Array Handle: 0x004B
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 8192 MB
	Form Factor: DIMM
	Set: None
	Locator: P1-DIMMB2
	Bank Locator: P0_Node0_Channel1_Dimm1
	Type: DDR3
	Type Detail: Synchronous
	Speed: 1600 MT/s
	Manufacturer: Kingston
	Serial Number: C9274F16
	Asset Tag: 9876543210
	Part Number: 9965525-024.A00LF
	Rank: 2
	Configured Clock Speed: 1600 MT/s

Two channels here, without a doubt.

Dual channel in both cases. 2133. This calls for more testing when I’m back from Computex next week. That’s still disturbingly slow…

@wendell have you tried running the code in real-time mode to see if there are any differences?

I’ve also been down a memcpy() rabbit hole; but for portability not speed. I’m developing on a Linux platform which has a newer glibc than the target (2.23 vs 2.11.3).

memcpy() was updated in glibc v2.14 (PATCH: Improve 64bit memcpy/memmove for Atom, Core 2 and Core i7) and by default the linker picks the later version making it a dependency, meaning the compiled binary won’t run on the target.

The eventual solution was to add this magic to our source:

__asm__(".symver memcpy,memcpy@GLIBC_2.2.5");       // Use GLIBC 2.2.5 version of memcpy, not the later 2.14 version. Target has glibc 2.11.3 only.

See this write-up from someone who had the same problem: The memcpy vs. memmove saga. It has good details and useful further links - well worth a read.

Anyhow, if you have a better or more optimised 32-bit memcpy, would it be worth contributing it back to glibc?

i9-7960X, running at 3.5 GHz (turbo disabled) using 3000 MHz memory. On Fedora 27.

# 64 bits
$ gcc -O3 testmem_modified.c
$ a.out 32
memcpy: 32 MB = 1.738202 ms
fastcpy: 32 MB = 1.797177 ms

$ a.out 128
memcpy: 128 MB = 6.830133 ms
fastcpy: 128 MB = 8.642609 ms

# 32 bits
$ gcc -m32 -O3 testmem_modified.c 
$ a.out 32
memcpy: 32 MB = 1.767541 ms
fastcpy: 32 MB = 2.800404 ms

$ a.out 128
memcpy: 128 MB = 6.642093 ms
fastcpy: 128 MB = 11.914536 ms

Note that the 32 bit version is faster than the 64 bit (Edit: sometimes. It’s close).

The 64 bit version uses memcpy-avx with memory prefetching… bells and whistles.

The 32 bit version uses memcpy with the old “rep movsl”, no SSE or anything like that.

    uint64_t t = nanotime();
    void *(*memcpy_ptr)(void *, const void *, size_t) = memcpy;
    for(volatile int i = 0; i < 1000; ++i)
       memcpy_ptr(buffer1, buffer2, size );
    printf("memcpy: %2u MB = %f ms\n", s, ((float)(nanotime() - t) / 1000.0f) / 1000000.0f);
    t = nanotime();
    for(volatile int i = 0; i < 1000; ++i)
       fastcpy(buffer1, buffer2, size );
    printf("fastcpy: %2u MB = %f ms\n", s, ((float)(nanotime() - t) / 1000.0f) / 1000000.0f);
    printf("-Compare match (should be zero): %2u \n\n", memcmp(buffer1,buffer2,size)) ;

memcpy 32 bit

(gdb) disassem
Dump of assembler code for function memcpy:
   0xf7e94cb0 <+0>:     mov    %edi,%eax
   0xf7e94cb2 <+2>:     mov    0x4(%esp),%edi
   0xf7e94cb6 <+6>:     mov    %esi,%edx
   0xf7e94cb8 <+8>:     mov    0x8(%esp),%esi
   0xf7e94cbc <+12>:    mov    %edi,%ecx
   0xf7e94cbe <+14>:    xor    %esi,%ecx
   0xf7e94cc0 <+16>:    and    $0x3,%ecx
   0xf7e94cc3 <+19>:    mov    0xc(%esp),%ecx
   0xf7e94cc7 <+23>:    cld    
   0xf7e94cc8 <+24>:    jne    0xf7e94d06 <memcpy+86>
   0xf7e94cca <+26>:    cmp    $0x3,%ecx
   0xf7e94ccd <+29>:    jbe    0xf7e94d06 <memcpy+86>
   0xf7e94ccf <+31>:    test   $0x3,%esi
   0xf7e94cd5 <+37>:    je     0xf7e94ced <memcpy+61>
   0xf7e94cd7 <+39>:    movsb  %ds:(%esi),%es:(%edi)
   0xf7e94cd8 <+40>:    dec    %ecx
   0xf7e94cd9 <+41>:    test   $0x3,%esi
   0xf7e94cdf <+47>:    je     0xf7e94ced <memcpy+61>
   0xf7e94ce1 <+49>:    movsb  %ds:(%esi),%es:(%edi)
   0xf7e94ce2 <+50>:    dec    %ecx
   0xf7e94ce3 <+51>:    test   $0x3,%esi
   0xf7e94ce9 <+57>:    je     0xf7e94ced <memcpy+61>
   0xf7e94ceb <+59>:    movsb  %ds:(%esi),%es:(%edi)
   0xf7e94cec <+60>:    dec    %ecx
   0xf7e94ced <+61>:    push   %eax
   0xf7e94cee <+62>:    mov    %ecx,%eax
   0xf7e94cf0 <+64>:    shr    $0x2,%ecx
   0xf7e94cf3 <+67>:    and    $0x3,%eax
=> 0xf7e94cf6 <+70>:    rep movsl %ds:(%esi),%es:(%edi)
   0xf7e94cf8 <+72>:    mov    %eax,%ecx
   0xf7e94cfa <+74>:    rep movsb %ds:(%esi),%es:(%edi)
   0xf7e94cfc <+76>:    pop    %eax
   0xf7e94cfd <+77>:    mov    %eax,%edi
   0xf7e94cff <+79>:    mov    %edx,%esi
   0xf7e94d01 <+81>:    mov    0x4(%esp),%eax
   0xf7e94d05 <+85>:    ret    
   0xf7e94d06 <+86>:    shr    %ecx
   0xf7e94d08 <+88>:    jae    0xf7e94d0b <memcpy+91>
   0xf7e94d0a <+90>:    movsb  %ds:(%esi),%es:(%edi)
   0xf7e94d0b <+91>:    shr    %ecx
   0xf7e94d0d <+93>:    jae    0xf7e94d11 <memcpy+97>
   0xf7e94d0f <+95>:    movsw  %ds:(%esi),%es:(%edi)
   0xf7e94d11 <+97>:    rep movsl %ds:(%esi),%es:(%edi)
   0xf7e94d13 <+99>:    jmp    0xf7e94cfd <memcpy+77>
End of assembler dump.

One other note: ‘perf stat’ shows 0.74 instructions per cycle. That low IPC indicates the time is mostly spent waiting for memory, as you might expect. The time gained from one assembly version over the other is probably small compared to speeding up the memory subsystem.

$ perf stat ./test64 32
32 MB = 1.724581 ms
32 MB = 1.840364 ms
-Compare match (should be zero):  0 


 Performance counter stats for './test64 32':

       3569.342863      task-clock:u (msec)       #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            16,433      page-faults:u             #    0.005 M/sec                  
    12,397,367,274      cycles:u                  #    3.473 GHz                    
     9,180,075,679      instructions:u            #    0.74  insn per cycle         
       393,828,347      branches:u                #  110.336 M/sec                  
            10,053      branch-misses:u           #    0.00% of all branches        

See, I think Intel microcode is not really using eax/ecx in this case.

Did you try memcpy_sse from the test program? Is that fastcpy in your post, or something else?

No, I didn’t see it in the header. I just called fastcpy in the C file.

Here’s the SSE version:

$ gcc -march=native -O3 -m32 testmem_modified.c -o test32
$ ./test32 32
memcpy 32 MB = 1.766930 ms
fastcpy 32 MB = 4.096765 ms
memcpy_sse 32 MB = 1.737195 ms
-Compare match (should be zero):  0 

$ gcc -O3 testmem_modified.c -o test64
$ ./test64 32
memcpy 32 MB = 1.754567 ms
fastcpy 32 MB = 1.814079 ms
memcpy_sse 32 MB = 2.026398 ms
-Compare match (should be zero):  0 

Src:

testmem_modified.txt (2.6 KB)

Oh, you know, you may not be able to run multiple copies in the same program and still get a true performance picture, because of caching etc. Usually SSE performance is very consistent; 32/64-bit doesn’t matter.

It’s running the same code 1000 times. Every iteration after the first has caching effects :) It is a microbenchmark after all.

My opinion is this: it doesn’t matter what asm is used, except in the register-starved 32-bit copy versions. The CPU is smart enough to prefetch memory on its own, which is what really matters.
