Level1 Diagnostic: Fixing our Memcpy Troubles (for Looking Glass) | Level One Techs

Find our new code at github.com/level1wendell.

Thanks for watching our videos! If you want more, check us out online at the following places:
+ Website: http://level1techs.com/
+ Forums: http://forum.level1techs.com/
+ Store: http://store.level1techs.com/
+ Patreon: https://www.patreon.com/level1
+ L1 Twitter: https://twitter.com/level1techs
+ L1 Facebook: https://www.facebook.com/level1techs
+ Wendell Twitter: https://twitter.com/tekwendell
+ Ryan Twitter: https://twitter.com/pgpryan
+ Krista Twitter: https://twitter.com/kreestuh
+ Business Inquiries/Brand Integrations: [email protected]


This is a companion discussion topic for the original entry at https://level1techs.com/video/level1-diagnostic-fixing-our-memcpy-troubles-looking-glass
4 Likes

This project is soooo hard.

I gotta figure out fcat on linux though.

1 Like

OBS LINUX PLUGIN for the RGB buffers PLEASE!!! There’s gotta be a way to receive the /dev/shm file and dump it onto an OBS canvas.

Also very thankful you didn’t go with AVX instructions, because my old-school X79 Xeon can handle SSE3.

It is very interesting that the 32-bit builds run consistently faster yet have similar generated code to their 64-bit brethren.

I did some research which I will present here. Some of this may reiterate what was said in the video.

What is memcpy?

memcpy(3) is a C language library (libc) function (defined in <string.h>), which gets provided by the OS. In the case of Linux-based operating systems, it’s whatever libc the distro chooses (usually glibc, but some others are musl and bionic). Each BSD variant has its own libc implementation, as do Darwin (the base of macOS) and Windows (CRT).

As the name implies, memcpy() is used to copy data from one region of memory to another. The full prototype is:

void *
memcpy(void *dst, const void *src, size_t len);

memcpy() copies len bytes from src to dst. The memory regions pointed to by src and dst must not overlap; the behavior is undefined if they do (use memmove(3) for overlapping regions).

Many compilers, such as Clang and GCC, include their own builtin implementations of some of the standard C library functions. In the case of memcpy, the builtin is __builtin_memcpy. The compiler determines whether to use the builtin or to call to libc based on the alignment and size of the memory region and compilation parameters such as the target architecture.

Why would the compiler bother duplicating the functionality provided by libc?

The compiler can inline the code of the builtin function, while the libc function requires a function call. For small regions of memory, the overhead of a function call may be significant compared to the amount of work to be performed copying the data. The compiler also has the opportunity to choose a builtin implementation best suited for each specific invocation.

Inlining is an optimization that avoids the overhead of a function call by splicing the code of the function directly into the calling context. Things that contribute to function call overhead include the setup for passing parameters to the function (on x86 parameters are passed on the stack), setting up and restoring the stack pointer for local variable space, adjusting the frame pointer, etc. Jumping to and back from a function in a different region of memory can also have effects on instruction pipelining, caches, etc, depending on the CPU microarchitecture (though modern processors tend to be very good about minimizing those costs).

The compiler builtins have several different implementations (often in assembly), each optimized for specific alignment, data sizes, and processor features. There are generalized implementations that aim to be widely compatible and avoid relying on special features. There are also optimized implementations that take advantage of newer hardware capabilities. The compiler is aware of the cost of performing operations on different CPUs, and can choose the builtin implementation most suited to the target being compiled for.

Inlined code might also benefit from compiler optimization passes. On the other hand, the compiler has no control over how the libc function is implemented or optimized.

Why would the compiler ever bother generating a call to libc?

While the compiler may provide builtins optimized for specific cases, it does not always know the values that will be passed to the function at run time. The compiler has to make its decisions at compile time, so if it doesn’t know in advance what case to optimize for, things get more complicated. There can also be tricks at the OS level that may be advantageous when dealing with larger memory regions. These scenarios are best left to the libc functions provided by the operating system.

Many operating systems provide a libc that includes several optimized versions of memcpy() which can be selectively linked in at runtime based on the particular machine’s hardware. This allows generic binaries to be distributed to end users and means the library can be optimized for various hardware platforms while maintaining compatibility with legacy or otherwise limited systems.

As an example, Apple’s libc implements several of the string functions (including memcpy()) as wrappers for functions provided by libplatform, which has generic C implementations based on code from BSD. However, these are only used as fallbacks for optimized variants which are selected through run time introspection. Unfortunately, Apple seem to be keeping the sources for their optimized routines hidden. Of course, it is still possible to disassemble the binaries (as has been done here, with informative comments)! Older versions of Apple’s libc did include sources for optimized routines, which they still make available here.

One of the tricks an OS can do to speed up large copies is virtual address manipulation. This technique leverages the processor’s memory management unit (MMU) to map the physical pages backing the source address into the destination address range. The virtual mapping creates a shadow copy of the memory, sharing the physical data between the two virtual addresses until something writes to the destination, at which point a physical copy of the touched page must be performed (copy-on-write). Apple’s libc, for example, has historically used this trick (via Mach’s vm_copy) for large, page-aligned copies.

Like Apple’s libplatform (or older versions of their libc), glibc also has code optimized for different machine architectures. In glibc these procedures are provided through indirect function calls (IFUNC). There are quite a few system dependent implementations, but the string functions can be found here for x86-64 and here for i686.

I’ll break this post here for now, but there is plenty more I would like to add.

6 Likes

I don’t know anything about computer science/programming/processor theory… well… in short… I don’t know a lot…

But I still enjoyed watching this video.

1 Like

The sad thing is a lot of this turned out not to be perfectly accurate in practice. For example, hard-coding the copy size at 32 MB should, I would think, be enough of a signal for GCC to say “yep, inlining is a bad idea, 32 MB is huge.” That also answers any alignment questions.

There is a commented-out line in the source on GitHub that forces indirection so as to use glibc. Even stepping into that with a debugger, the performance was still awful on my particular system, for whatever reason.

That version is hugely complex, checking cpuid and other stuff.

-march also did diddly squat here

Is this something gcc could improve upon? Could someone write a patch to gcc that looks at the copy size and decides not to inline based upon that?

And I don’t understand why inlining makes it worse. Is the inlined code not the same as the code from the library? Wouldn’t the result be just one less function call?
And why did this problem only exist on 32-bit? Why is the 64-bit inlined memcpy so good?

And why do you need 32 bit binaries anyways?

Hey wendell, do you think you could switch back to the old music for the next video? I’m not sure why but I find the background music in this video repetitive, making it crazy-hard to concentrate on what you’re saying w/o the annoying music distracting me lol

1 Like

GCC has a LOT of options. The obvious one to test is -fno-builtin-memcpy. More obscure options include

-mstringop-strategy=libcall
-minline-stringops-dynamically

Here is the output of a simple test program using various compilers and flags (the optimization and march flags are basically irrelevant when it’s calling to libc). You can test things in considerably greater detail on the Compiler Explorer site, if you wish.

GCC

gcc -O3 -march=k8-sse3 -m64
calls memcpy

gcc -O3 -march=k8-sse3 -m32
inlines builtin

gcc -O3 -march=k8-sse3 -m32 -fno-builtin-memcpy
calls memcpy

Clang

clang -O3 -march=k8-sse3 -m64
calls memcpy

clang -O3 -march=k8-sse3 -m32
calls memcpy

you should run these tests on your system and report back the results of the runs. The results may surprise you :smiley:

gcc -O3 -march=k8-sse3 -m32 -fno-builtin-memcpy

I literally demoed that one in the video, and even though it calls memcpy, the performance was still awful.

Did you check the output of the compiler? If it’s calling to libc, it’s your libc that’s awful!

You should really run the tests and look at your own libc and report back.

I even did this, which was in the github readme:
void *(*memcpy_ptr)(void *, const void *, size_t) = memcpy;

There is something happening in the microcode, too, which is maybe why no one has caught that libc apparently regressed fairly recently.

I saw -fno-builtin-inline, which will still use the builtins but won’t inline them, and I saw you scrolling through the non-inlined builtin code emitted by the compiler. I didn’t see a call to libc (which would be a single instruction in the asm output; you wouldn’t see any of the libc code). I’m having trouble finding where you used -fno-builtin-memcpy, but I will take your word on it.

Unfortunately I don’t have a Threadripper system to test things on. I’ll have to try it on an FX8320 system that has Linux installed and report back!

What’s it do on whatever system you are running on now? I had inconsistent behavior even on Intel.

This is what I meant when I said that using the debugger I could see cpuid checks, so that was from libc. The B-roll of that didn’t make it into the video, I think.

The giveaway is that the RDX-register-based code is still so fast. No libc there either. I couldn’t beat 0.7 ms even with 8x pipelined 128-bit instructions at once. In other words, I don’t think the CPU was really using RDX, since the optimized SSE3 wasn’t really any faster.

Most of my systems are FreeBSD, which does not have these sort of system dependent optimizations in libc. I’m not sure about OpenBSD, but I can also test on macOS which I know has optimized string routines. I just booted up my Linux box (have had it off to minimize the summer heat), so I’ll be reporting back shortly with my findings.

1 Like

I’d be curious to know what it does on FreeBSD, still with GCC. That’s good data, right?

FreeBSD can run 32-bit binaries, right?

Yes, if I install the 32-bit libraries. I think I have one system with those installed.

FreeBSD 11.2-BETA2

CPU: Intel® Xeon® CPU E3-1275 v3 @ 3.50GHz (3491.98-MHz K8-class CPU

Compiling with Clang:

xeon-freebsd ➜  memcpy_sse git:(master) ✗ cc -v
FreeBSD clang version 6.0.0 (tags/RELEASE_600/final 326565) (based on LLVM 6.0.0)
Target: x86_64-unknown-freebsd11.2
Thread model: posix
InstalledDir: /usr/bin
xeon-freebsd ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -S
xeon-freebsd ➜  memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
	callq	memcpy
xeon-freebsd ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -o tm64
xeon-freebsd ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -S
xeon-freebsd ➜  memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
	calll	memcpy
xeon-freebsd ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -o tm32

Test results:

xeon-freebsd ➜  memcpy_sse git:(master) ✗ ./tm64 32
32 MB = 3.540753 ms
-Compare match (should be zero):  0

xeon-freebsd ➜  memcpy_sse git:(master) ✗ ./tm32 32
32 MB = 3.440898 ms
-Compare match (should be zero):  0

Compiling with GCC:

xeon-freebsd ➜  memcpy_sse git:(master) ✗ gcc7 -v
Using built-in specs.
COLLECT_GCC=gcc7
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc7/gcc/x86_64-portbld-freebsd11.1/7.3.0/lto-wrapper
Target: x86_64-portbld-freebsd11.1
Configured with: /wrkdirs/usr/ports/lang/gcc7/work/gcc-7.3.0/configure --with-build-config=bootstrap-debug --disable-nls --enable-gnu-indirect-function --libdir=/usr/local/lib/gcc7 --libexecdir=/usr/local/libexec/gcc7 --program-suffix=7 --with-as=/usr/local/bin/as --with-gmp=/usr/local --with-gxx-include-dir=/usr/local/lib/gcc7/include/c++/ --with-ld=/usr/local/bin/ld --with-pkgversion='FreeBSD Ports Collection' --with-system-zlib --enable-languages=c,c++,objc,fortran --prefix=/usr/local --localstatedir=/var --mandir=/usr/local/man --infodir=/usr/local/info/gcc7 --build=x86_64-portbld-freebsd11.1
Thread model: posix
gcc version 7.3.0 (FreeBSD Ports Collection)
xeon-freebsd ➜  memcpy_sse git:(master) ✗ gcc7 -O3 -march=native -m64 testmem_modified.c -S
xeon-freebsd ➜  memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
	call	memcpy
xeon-freebsd ➜  memcpy_sse git:(master) ✗ gcc7 -O3 -march=native -m64 testmem_modified.c -o gtm64
xeon-freebsd ➜  memcpy_sse git:(master) ✗ gcc7 -O3 -march=native -m32 testmem_modified.c -S
xeon-freebsd ➜  memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
	call	memcpy
xeon-freebsd ➜  memcpy_sse git:(master) ✗ gcc7 -O3 -march=native -m32 testmem_modified.c -o gtm32
<complains about not finding 32-bit libraries>

Test results:

xeon-freebsd ➜  memcpy_sse git:(master) ✗ ./gtm64 32
32 MB = 3.440913 ms
-Compare match (should be zero):  0

(My GCC package does not include 32-bit libraries, and the 64-bit build segfaulted! I had forgotten the size argument.)

I’ll look into getting the 32-bit version to build with GCC and give that a go if it’s not a great ordeal.

FreeBSD 11.1-RELEASE-p10

CPU: AMD FX-8370 Eight-Core Processor (4013.68-MHz K8-class CPU)

Compiling with Clang:

fx-freebsd ➜  memcpy_sse git:(master) ✗ cc -v
FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0)
Target: x86_64-unknown-freebsd11.1
Thread model: posix
InstalledDir: /usr/bin
fx-freebsd ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -S
fx-freebsd ➜  memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
	callq	memcpy
fx-freebsd ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -o tm64
fx-freebsd ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -S
fx-freebsd ➜  memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
	calll	memcpy
fx-freebsd ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -o tm32

Test results:

fx-freebsd ➜  memcpy_sse git:(master) ✗ ./tm64 32
32 MB = 9.419675 ms
-Compare match (should be zero):  0

fx-freebsd ➜  memcpy_sse git:(master) ✗ ./tm32 32
32 MB = 9.371661 ms
-Compare match (should be zero):  0

Compiling with GCC:

fx-freebsd ➜  memcpy_sse git:(master) ✗ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc6/gcc/x86_64-portbld-freebsd11.1/6.4.0/lto-wrapper
Target: x86_64-portbld-freebsd11.1
Configured with: /wrkdirs/usr/ports/lang/gcc6/work/gcc-6.4.0/configure --with-build-config=bootstrap-debug --disable-nls --enable-gnu-indirect-function --libdir=/usr/local/lib/gcc6 --libexecdir=/usr/local/libexec/gcc6 --program-suffix=6 --with-as=/usr/local/bin/as --with-gmp=/usr/local --with-gxx-include-dir=/usr/local/lib/gcc6/include/c++/ --with-ld=/usr/local/bin/ld --with-pkgversion='FreeBSD Ports Collection' --with-system-zlib --with-ecj-jar=/usr/local/share/java/ecj-4.5.jar --enable-languages=c,c++,objc,fortran,java --prefix=/usr/local --localstatedir=/var --mandir=/usr/local/man --infodir=/usr/local/info/gcc6 --build=x86_64-portbld-freebsd11.1
Thread model: posix
gcc version 6.4.0 (FreeBSD Ports Collection)
fx-freebsd ➜  memcpy_sse git:(master) ✗ gcc -O3 -march=native -m64 testmem_modified.c -S
fx-freebsd ➜  memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
	call	memcpy
fx-freebsd ➜  memcpy_sse git:(master) ✗ gcc -O3 -march=native -m64 testmem_modified.c -o gtm64
fx-freebsd ➜  memcpy_sse git:(master) ✗ gcc -O3 -march=native -m32 testmem_modified.c -S
fx-freebsd ➜  memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
fx-freebsd ➜  memcpy_sse git:(master) ✗ gcc -O3 -march=native -m32 testmem_modified.c -o gtm32
<complains about not finding 32-bit libraries>
fx-freebsd ➜  memcpy_sse git:(master) ✗ gcc -O3 -march=native -m32 -fno-builtin-memcpy testmem_modified.c -S
fx-freebsd ➜  memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
	call	memcpy
fx-freebsd ➜  memcpy_sse git:(master) ✗ gcc -O3 -march=native -m32 -fno-builtin-memcpy testmem_modified.c -o gtm32-libcall
<complains about not finding 32-bit libraries>

Test results:

fx-freebsd ➜  memcpy_sse git:(master) ✗ ./gtm64 32
32 MB = 9.402394 ms
-Compare match (should be zero):  0

This does not look good. FreeBSD does not have optimized routines for memcpy in libc, but the code ran considerably faster on the Xeon processor.

Again, no 32-bit gcc libs here.

macOS 10.13.4

Intel® Core™ i7-6700HQ CPU @ 2.60GHz

Compiling with Clang:

i7-macos ➜  memcpy_sse git:(master) ✗ cc -v
Apple LLVM version 9.1.0 (clang-902.0.39.2)
Target: x86_64-apple-darwin17.5.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
i7-macos ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -S
i7-macos ➜  memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
	callq	_memcpy
i7-macos ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m64 testmem_modified.c -o tm64
i7-macos ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -S
i7-macos ➜  memcpy_sse git:(master) ✗ grep 'call.*memcpy' testmem_modified.s
	calll	_memcpy
i7-macos ➜  memcpy_sse git:(master) ✗ cc -O3 -march=native -m32 testmem_modified.c -o tm32

Test results:

i7-macos ➜  memcpy_sse git:(master) ✗ ./tm64 32
32 MB = 6.175650 ms
-Compare match (should be zero):  0

i7-macos ➜  memcpy_sse git:(master) ✗ ./tm32 32
32 MB = 5.483932 ms
-Compare match (should be zero):  0

Oh, another segfault? Maybe the test code has a bug in it… Forgot the size argument!

(Edit: The original run times were unusually high. I was not able to reproduce this with further testing, so I have updated the post with more representative numbers.)

Linux 4.16.9-1-ARCH

model name : AMD FX™-8320 Eight-Core Processor

Compiling with GCC:

[ryan@fx8320-arch memcpy_sse]$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/8.1.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp
Thread model: posix
gcc version 8.1.0 (GCC)
[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m64 testmem_modified.c -S
[ryan@fx8320-arch memcpy_sse]$ grep 'call.*memcpy' testmem_modified.s
	call	memcpy@PLT
[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m64 testmem_modified.c -o tm64
[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m32 testmem_modified.c -S
[ryan@fx8320-arch memcpy_sse]$ grep 'call.*memcpy' testmem_modified.s
[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m32 testmem_modified.c -o tm32
[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m32 -fno-builtin-memcpy testmem_modified.c -S
[ryan@fx8320-arch memcpy_sse]$ grep 'call.*memcpy' testmem_modified.s
	call	memcpy@PLT
[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m32 -fno-builtin-memcpy testmem_modified.c -o tm32-libcall

Test results:

[ryan@fx8320-arch memcpy_sse]$ ./tm64 32
32 MB = 5.273611 ms
-Compare match (should be zero):  0

[ryan@fx8320-arch memcpy_sse]$ ./tm32 32
32 MB = 8.373336 ms
-Compare match (should be zero):  0

[ryan@fx8320-arch memcpy_sse]$ ./tm32-libcall 32
32 MB = 4.250947 ms
-Compare match (should be zero):  0

Compiling with Clang:

[ryan@fx8320-arch memcpy_sse]$ clang -v
clang version 6.0.0 (tags/RELEASE_600/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-pc-linux-gnu/8.1.0
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/8.1.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/8.1.0
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/8.1.0
Selected GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/8.1.0
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
[ryan@fx8320-arch memcpy_sse]$ clang -O3 -march=native -m64 testmem_modified.c -S
[ryan@fx8320-arch memcpy_sse]$ grep 'call.*memcpy' testmem_modified.s
	callq	memcpy@PLT
[ryan@fx8320-arch memcpy_sse]$ clang -O3 -march=native -m64 testmem_modified.c -o ctm64
[ryan@fx8320-arch memcpy_sse]$ clang -O3 -march=native -m32 testmem_modified.c -S
[ryan@fx8320-arch memcpy_sse]$ grep 'call.*memcpy' testmem_modified.s
	calll	memcpy@PLT
[ryan@fx8320-arch memcpy_sse]$ clang -O3 -march=native -m32 testmem_modified.c -o ctm32

Test results:

[ryan@fx8320-arch memcpy_sse]$ ./ctm64 32
32 MB = 5.252851 ms
-Compare match (should be zero):  0

[ryan@fx8320-arch memcpy_sse]$ ./ctm32 32
32 MB = 4.249620 ms
-Compare match (should be zero):  0

Hmm…

1 Like

the two segfaults there are because there was no size argument, and I didn’t bother warning the user if they forget it. Your paste above is just ./tm64, not ./tm64 32 as you did previously.

if you want, replace memcpy with memcpy_sse in the test program

There is no way I should be outdoing the sage neckbeards unless your glibc is also busted. But I suspect modding the test program by changing memcpy to my memcpy_sse will show interesting results?

Edit: 8 ms on an i7! I don’t think memcpy has been that slow since the telegraph!

2 Likes

I’ve updated my previous post with more results using the unmodified testmem_modified.c and later I’ll run tests using memcpy_sse. I also would like to try a few other test programs.

It does seem that, as I predicted, adding -fno-builtin-memcpy with gcc on Linux forces the compiler to generate a call to the system libc, which performs admirably. For completeness, I also compiled using -fno-builtin-inline -fno-inline as seen in the video, and it did not call libc:

[ryan@fx8320-arch memcpy_sse]$ gcc -O3 -march=native -m32 -fno-builtin-inline -fno-inline testmem_modified.c -S
[ryan@fx8320-arch memcpy_sse]$ grep 'call.*memcpy' testmem_modified.s
1 Like