FreeBSD 11.2-BETA2
CPU: Intel® Xeon® CPU E3-1275 v3 @ 3.50GHz (3491.98-MHz K8-class CPU)
Compiling with Clang:
xeon-freebsd ➜ memcpy_sse git:(master) ✗ cc -v
FreeBSD clang version 6.0.0 (tags/RELEASE_600/final 326565) (based on LLVM 6.0.0)
Target: x86_64-unknown-freebsd11.2
Thread model: posix
InstalledDir: /usr/bin
xeon-freebsd ➜ memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -S
xeon-freebsd ➜ memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
callq memcpy
xeon-freebsd ➜ memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -o tm64
xeon-freebsd ➜ memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -S
xeon-freebsd ➜ memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
calll memcpy
xeon-freebsd ➜ memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -o tm32
Test results:
xeon-freebsd ➜ memcpy_sse git:(master) ✗ ./tm64 32
32 MB = 3.540753 ms
-Compare match (should be zero): 0
xeon-freebsd ➜ memcpy_sse git:(master) ✗ ./tm32 32
32 MB = 3.440898 ms
-Compare match (should be zero): 0
Compiling with GCC:
xeon-freebsd ➜ memcpy_sse git:(master) ✗ gcc7 -v
Using built-in specs.
COLLECT_GCC=gcc7
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc7/gcc/x86_64-portbld-freebsd11.1/7.3.0/lto-wrapper
Target: x86_64-portbld-freebsd11.1
Configured with: /wrkdirs/usr/ports/lang/gcc7/work/gcc-7.3.0/configure --with-build-config=bootstrap-debug --disable-nls --enable-gnu-indirect-function --libdir=/usr/local/lib/gcc7 --libexecdir=/usr/local/libexec/gcc7 --program-suffix=7 --with-as=/usr/local/bin/as --with-gmp=/usr/local --with-gxx-include-dir=/usr/local/lib/gcc7/include/c++/ --with-ld=/usr/local/bin/ld --with-pkgversion='FreeBSD Ports Collection' --with-system-zlib --enable-languages=c,c++,objc,fortran --prefix=/usr/local --localstatedir=/var --mandir=/usr/local/man --infodir=/usr/local/info/gcc7 --build=x86_64-portbld-freebsd11.1
Thread model: posix
gcc version 7.3.0 (FreeBSD Ports Collection)
xeon-freebsd ➜ memcpy_sse git:(master) ✗ gcc7 -O3 -march=native -m64 testmem_modified.c -S
xeon-freebsd ➜ memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
call memcpy
xeon-freebsd ➜ memcpy_sse git:(master) ✗ gcc7 -O3 -march=native -m64 testmem_modified.c -o gtm64
xeon-freebsd ➜ memcpy_sse git:(master) ✗ gcc7 -O3 -march=native -m32 testmem_modified.c -S
xeon-freebsd ➜ memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
call memcpy
xeon-freebsd ➜ memcpy_sse git:(master) ✗ gcc7 -O3 -march=native -m32 testmem_modified.c -o gtm32
<complains about not finding 32-bit libraries>
Test results:
xeon-freebsd ➜ memcpy_sse git:(master) ✗ ./gtm64 32
32 MB = 3.440913 ms
-Compare match (should be zero): 0
(My GCC package does not include 32-bit libraries. The 64-bit build seemed to segfault at first, but I had just forgotten the size argument.)
I’ll look into getting the 32-bit version to build with GCC and give that a go if it’s not a great ordeal.
FreeBSD 11.1-RELEASE-p10
CPU: AMD FX-8370 Eight-Core Processor (4013.68-MHz K8-class CPU)
Compiling with Clang:
fx-freebsd ➜ memcpy_sse git:(master) ✗ cc -v
FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0)
Target: x86_64-unknown-freebsd11.1
Thread model: posix
InstalledDir: /usr/bin
fx-freebsd ➜ memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -S
fx-freebsd ➜ memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
callq memcpy
fx-freebsd ➜ memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -o tm64
fx-freebsd ➜ memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -S
fx-freebsd ➜ memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
calll memcpy
fx-freebsd ➜ memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -o tm32
Test results:
fx-freebsd ➜ memcpy_sse git:(master) ✗ ./tm64 32
32 MB = 9.419675 ms
-Compare match (should be zero): 0
fx-freebsd ➜ memcpy_sse git:(master) ✗ ./tm32 32
32 MB = 9.371661 ms
-Compare match (should be zero): 0
Compiling with GCC:
fx-freebsd ➜ memcpy_sse git:(master) ✗ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc6/gcc/x86_64-portbld-freebsd11.1/6.4.0/lto-wrapper
Target: x86_64-portbld-freebsd11.1
Configured with: /wrkdirs/usr/ports/lang/gcc6/work/gcc-6.4.0/configure --with-build-config=bootstrap-debug --disable-nls --enable-gnu-indirect-function --libdir=/usr/local/lib/gcc6 --libexecdir=/usr/local/libexec/gcc6 --program-suffix=6 --with-as=/usr/local/bin/as --with-gmp=/usr/local --with-gxx-include-dir=/usr/local/lib/gcc6/include/c++/ --with-ld=/usr/local/bin/ld --with-pkgversion='FreeBSD Ports Collection' --with-system-zlib --with-ecj-jar=/usr/local/share/java/ecj-4.5.jar --enable-languages=c,c++,objc,fortran,java --prefix=/usr/local --localstatedir=/var --mandir=/usr/local/man --infodir=/usr/local/info/gcc6 --build=x86_64-portbld-freebsd11.1
Thread model: posix
gcc version 6.4.0 (FreeBSD Ports Collection)
fx-freebsd ➜ memcpy_sse git:(master) ✗ gcc -O3 -march=native -m64 testmem_modified.c -S
fx-freebsd ➜ memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
call memcpy
fx-freebsd ➜ memcpy_sse git:(master) ✗ gcc -O3 -march=native -m64 testmem_modified.c -o gtm64
fx-freebsd ➜ memcpy_sse git:(master) ✗ gcc -O3 -march=native -m32 testmem_modified.c -S
fx-freebsd ➜ memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
fx-freebsd ➜ memcpy_sse git:(master) ✗ gcc -O3 -march=native -m32 testmem_modified.c -o gtm32
<complains about not finding 32-bit libraries>
fx-freebsd ➜ memcpy_sse git:(master) ✗ gcc -O3 -march=native -m32 -fno-builtin-memcpy testmem_modified.c -S
fx-freebsd ➜ memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
call memcpy
fx-freebsd ➜ memcpy_sse git:(master) ✗ gcc -O3 -march=native -m32 -fno-builtin-memcpy testmem_modified.c -o gtm32-libcall
<complains about not finding 32-bit libraries>
Test results:
fx-freebsd ➜ memcpy_sse git:(master) ✗ ./gtm64 32
32 MB = 9.402394 ms
-Compare match (should be zero): 0
This does not look good. FreeBSD's libc does not have optimized routines for memcpy,
but the code ran considerably faster on the Xeon (about 3.4 to 3.5 ms) than on the FX-8370 (about 9.4 ms).
Again, no 32-bit GCC libraries here.
macOS 10.13.4
Intel® Core™ i7-6700HQ CPU @ 2.60GHz
Compiling with Clang:
i7-macos ➜ memcpy_sse git:(master) ✗ cc -v
Apple LLVM version 9.1.0 (clang-902.0.39.2)
Target: x86_64-apple-darwin17.5.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
i7-macos ➜ memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -S
i7-macos ➜ memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
callq _memcpy
i7-macos ➜ memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -o tm64
i7-macos ➜ memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -S
i7-macos ➜ memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
calll _memcpy
i7-macos ➜ memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -o tm32
Test results:
i7-macos ➜ memcpy_sse git:(master) ✗ ./tm64 32
32 MB = 6.175650 ms
-Compare match (should be zero): 0
i7-macos ➜ memcpy_sse git:(master) ✗ ./tm32 32
32 MB = 5.483932 ms
-Compare match (should be zero): 0
Oh, another segfault? Maybe the test code has a bug in it… No, I forgot the size argument again!
(Edit: The original run times were unusually high. I was not able to reproduce this with further testing, so I have updated the post with more representative numbers.)
Linux 4.16.9-1-ARCH
model name : AMD FX™-8320 Eight-Core Processor
Compiling with GCC:
[ryan@fx8320-arch memcpy_sse]$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/8.1.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp
Thread model: posix
gcc version 8.1.0 (GCC)
[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m64 testmem_modified.c -S
[ryan@fx8320-arch memcpy_sse]$ grep 'call.*memcpy' testmem_modified.s
call memcpy@PLT
[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m64 testmem_modified.c -o tm64
[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m32 testmem_modified.c -S
[ryan@fx8320-arch memcpy_sse]$ grep 'call.*memcpy' testmem_modified.s
[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m32 testmem_modified.c -o tm32
[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m32 -fno-builtin-memcpy testmem_modified.c -S
[ryan@fx8320-arch memcpy_sse]$ grep 'call.*memcpy' testmem_modified.s
call memcpy@PLT
[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m32 -fno-builtin-memcpy testmem_modified.c -o tm32-libcall
Test results:
[ryan@fx8320-arch memcpy_sse]$ ./tm64 32
32 MB = 5.273611 ms
-Compare match (should be zero): 0
[ryan@fx8320-arch memcpy_sse]$ ./tm32 32
32 MB = 8.373336 ms
-Compare match (should be zero): 0
[ryan@fx8320-arch memcpy_sse]$ ./tm32-libcall 32
32 MB = 4.250947 ms
-Compare match (should be zero): 0
Compiling with Clang:
[ryan@fx8320-arch memcpy_sse]$ clang -v
clang version 6.0.0 (tags/RELEASE_600/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-pc-linux-gnu/8.1.0
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/8.1.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/8.1.0
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/8.1.0
Selected GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/8.1.0
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
[ryan@fx8320-arch memcpy_sse]$ clang -O3 -march=native -m64 testmem_modified.c -S
[ryan@fx8320-arch memcpy_sse]$ grep 'call.*memcpy' testmem_modified.s
callq memcpy@PLT
[ryan@fx8320-arch memcpy_sse]$ clang -O3 -march=native -m64 testmem_modified.c -o ctm64
[ryan@fx8320-arch memcpy_sse]$ clang -O3 -march=native -m32 testmem_modified.c -S
[ryan@fx8320-arch memcpy_sse]$ grep 'call.*memcpy' testmem_modified.s
calll memcpy@PLT
[ryan@fx8320-arch memcpy_sse]$ clang -O3 -march=native -m32 testmem_modified.c -o ctm32
Test results:
[ryan@fx8320-arch memcpy_sse]$ ./ctm64 32
32 MB = 5.252851 ms
-Compare match (should be zero): 0
[ryan@fx8320-arch memcpy_sse]$ ./ctm32 32
32 MB = 4.249620 ms
-Compare match (should be zero): 0
Hmm…