Level1 Diagnostic: Fixing our Memcpy Troubles (for Looking Glass) | Level One Techs

What about in 32-bit land on AMD, where it’s 4x slower, though?

Only the fastcpy was 4x slower, though. The memcpy_sse and the libc memcpy are comparable to the 64 bit versions.

fastcpy is the 32 byte at a time function in testmem_modified.c:13

memcpy 32 MB = 1.742444 ms
fastcpy 32 MB = 4.072645 ms
memcpy_sse 32 MB = 1.714666 ms
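
For context, a 32-bytes-per-iteration copy loop looks roughly like this (a sketch only; the actual fastcpy in testmem_modified.c may differ):

#include <stddef.h>
#include <stdint.h>

/* Rough sketch of a 32-bytes-per-iteration copy loop; not the exact
 * fastcpy from testmem_modified.c.  Assumes len is a multiple of 32 and
 * that both pointers are suitably aligned. */
static void fastcpy_sketch(void *dst, const void *src, size_t len)
{
    uint64_t *d = dst;
    const uint64_t *s = src;

    for (size_t i = 0; i < len / 8; i += 4) {
        /* four 64-bit moves = 32 bytes per iteration */
        d[i + 0] = s[i + 0];
        d[i + 1] = s[i + 1];
        d[i + 2] = s[i + 2];
        d[i + 3] = s[i + 3];
    }
}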

On Fedora 27 on Threadripper, that isn’t the case, though. The 32-bit glibc version is only slightly faster than the pretty-slow inlined version for 32-bit. (The forced indirection in the GitHub version should prevent inlining even if gcc is being weird about the flags.)

On some older (but still fairly recent) Intel systems, it’s also hit and miss whether glibc’s version is crap for 32-bit. I think recent glibc might fix this, or at least fix it somewhat, though?

The 64-bit version of fastcpy doesn’t have the 4x slowdown; gcc is converting it into AVX instructions:

memcpy 32 MB = 1.750596 ms
fastcpy 32 MB = 1.822043 ms
memcpy_sse 32 MB = 2.014954 ms

Here is a 32 byte move:

        vmovdqu64       32(%rax), %xmm0    # load the first 16 bytes of the block from the source
        addq    $32, %rdi                  # advance the destination pointer
        addq    $32, %rax                  # advance the source pointer
        vmovups %xmm0, (%rdi)              # store the first 16 bytes to the destination
        vmovdqu64       16(%rax), %xmm0    # load the second 16 bytes from the source
        vmovups %xmm0, 16(%rdi)            # store the second 16 bytes to the destination

AMD should fix that. rep movsl is a common way to do block copies.
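
For reference, a rep movsl block copy can be expressed with GCC inline asm roughly like this (a sketch; copy_rep_movsl is a made-up helper name):

#include <stddef.h>

/* Sketch of a rep movsl block copy.  rep movsl copies one 32-bit dword
 * per iteration from [esi/rsi] to [edi/rdi], with the count in ecx/rcx,
 * so the byte count must be a multiple of 4. */
static void copy_rep_movsl(void *dst, const void *src, size_t bytes)
{
    size_t dwords = bytes / 4;
    __asm__ volatile ("rep movsl"
                      : "+D" (dst), "+S" (src), "+c" (dwords)
                      :
                      : "memory");
}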

I worked on a zlib asm contrib back when I had an x86 box (2001). It is full of rep movs instructions.

It looked to me like it was a CPUID check gone wrong in the libc version on Fedora 27, but I could be wrong.

The inlined intel version using eax/ecx for data moves is faster on intel systems, which is why I thought microcode magic might be doing stuff under the hood to optimize it.

Hah!

“Avoid using the REP prefix when performing string operations, especially when copying blocks of memory.” – AMD 2005

“Use the REP prefix judiciously when performing string operations.” – AMD 2011


So that was probably intentional years ago but is now a bit of a kneecap?

I wonder if Intel has some crazy strcpy microcode patent around prefetching memory.

Here ya go… A memcpy written by AMD in 2001:

https://www.cs.virginia.edu/stream/FTP/Contrib/AMD/memcpy_amd.asm


I also did some testing.
The memcpy on a 7900X (DDR4-3200 CL13) is much slower than on a 7700K (DDR4-2400 CL15).
Both are running at 4.5 GHz with 4 modules each. Shouldn’t 4-channel DDR be much faster?

7900x

Performance counter stats for './tm64 32':

   1470.760952      task-clock (msec)         #    0.999 CPUs utilized
           186      context-switches          #    0.126 K/sec
           136      cpu-migrations            #    0.092 K/sec
         1,109      page-faults               #    0.754 K/sec
 6,615,783,860      cycles                    #    4.498 GHz
 3,026,079,136      instructions              #    0.46  insn per cycle
   132,931,784      branches                  #   90.383 M/sec
        47,021      branch-misses             #    0.04% of all branches

   1.472230188 seconds time elapsed

7700k

 Performance counter stats for './tm64 32':
        919.460337      task-clock (msec)         #    0.999 CPUs utilized
               132      context-switches          #    0.144 K/sec
                36      cpu-migrations            #    0.039 K/sec
             1,108      page-faults               #    0.001 M/sec
     4,130,136,182      cycles                    #    4.492 GHz
     4,532,561,132      instructions              #    1.10  insn per cycle
       280,813,458      branches                  #  305.411 M/sec
           118,914      branch-misses             #    0.04% of all branches

       0.920016100 seconds time elapsed

Ironically, the SSE variations of glibc were removed because of steam games:

https://bugzilla.redhat.com/show_bug.cgi?id=1471427

SSE requires 16 byte alignment and some games are built without 16 byte stack alignment.

“The most common scenario observed is code compiled with Intel’s icc compiler in conjunction with options that break the ABI. The developer perhaps doesn’t know that the option, specifically -falign-stack=4, will cause problems when entering SSE2 code.”
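
For illustration, here is the difference between the aligned and unaligned SSE load/store intrinsics (a sketch; copy16 is a made-up name, not glibc’s actual code):

#include <xmmintrin.h>  /* SSE intrinsics */

/* Illustration of why stack alignment matters.  The aligned forms fault
 * if handed a pointer that is not 16-byte aligned, which is exactly what
 * a caller built with -falign-stack=4 can pass into SSE code. */
void copy16(float *dst, const float *src)
{
    __m128 v = _mm_loadu_ps(src);   /* unaligned load: works anywhere  */
    _mm_storeu_ps(dst, v);          /* unaligned store: works anywhere */

    /* The aligned variants require (uintptr_t)src % 16 == 0:
     *   __m128 v = _mm_load_ps(src);
     *   _mm_store_ps(dst, v);
     * On a misaligned address they fault and the process crashes. */
}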


The 7900X has an unshared 1.3 MB L3 per core. The 7700K has a shared 8 MB L3. It definitely shows up as higher memory latency in some use cases.

Wow… I registered on this forum after seeing your video out of pure frustration. You are using memcpy? Optimizing it to bits? To copy a frame buffer? Other than desperately trying to avoid writing kernel code (which you should, even if just to avoid context switches and interrupts), why are you doing this?

I found only ONE person on the video stream mentioning DMA. ONE. There is not a single mention of DMA in this entire forum thread. Why? If your goal is to share memory between host and VM on a very regular basis, why not use DMA? What you are doing with memcpy is moving bits from physical memory, through lookup tables, multiple levels of CPU cache, back through more lookup tables until finally placing it where it needs to go. This is not what memcpy is designed for; memcpy is for copying irregular, small amounts of data.

DMA is designed for frequent, large-scale copying of memory. This is what everyone (including myself) used since the days of old. My old trusty 80386 40MHz (yes, pre-Pentium) was capable of performing 1024x768 backbuffer flips without tearing at 30Hz with time to spare. How? By setting up a DMA transfer and triggering it on V-blank. There were no CPU optimizations, no CPU cache involved, nothing. Just tell the MMU (memory controller) what it can do best (manage memory) so the CPU can do other things like generate amazing sprite art filling that backbuffer.

If you want to go really fast, you reach the same goal by not copying at all. If you can guarantee that the memory block the VM is dumping its framebuffer to is fixed, you could even just tell the MMU to provide a mirror to your host. This way you achieve copy-less sharing at maximum speed (because nothing is actually copied, just the same bits are available at different addresses). Of course, this can be optimized further by having the GPU do all the heavy lifting of filling that area (tell the GPU “you can use this memory block, it’s yours”). Look up mmap in the man pages; it will be worth your effort. Just be sure that once you start reading from that mirrored memory block to use volatile pointers, because otherwise the compiler optimizes out any reads (because it sees that the program never changed anything). Doing this through virtualization layers might be a little more tricky, but everything has to map to physical memory at some point.

This is exactly how high-speed peripherals (eg. network cards) achieve high speeds… their drivers do not copy memory, they just pass pointers and give instructions to the MMU.
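
A minimal sketch of the mmap-plus-volatile idea described above, assuming the frame buffer is exposed as a shared file (the path and size here are made up for illustration):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define FB_SIZE (1920 * 1080 * 4)   /* assumed framebuffer size */

int main(void)
{
    /* "/dev/shm/lg-framebuffer" is a made-up path for illustration. */
    int fd = open("/dev/shm/lg-framebuffer", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    void *map = mmap(NULL, FB_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* volatile keeps the compiler from caching or eliding reads of memory
     * that another process (or the guest) is changing behind our back. */
    volatile uint32_t *fb = (volatile uint32_t *)map;
    printf("first pixel: 0x%08x\n", (unsigned)fb[0]);

    munmap(map, FB_SIZE);
    close(fd);
    return 0;
}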


I had similar thoughts and have meant to ask why this route did not work out, assuming they considered it. It’s too easy to get distracted by small details!

Because Linux doesn’t support DMA operations in user space? I actually do mention DMA in the video.

Linux’s API for DMA doesn’t permit memory to memory transfers. It’s only for communication between devices and memory. Look in Documentation/DMA-API.txt for more details.

At the hardware level, the x86 DMA controller doesn’t allow memory-to-memory transfers. It’s been discussed here: DMA transfer RAM-to-RAM

Given that the memory bus is usually slower than the CPU, what benefit would it have to launch a kernel-driven memory copy? You’d still have to wait for the transfer to finish, and its duration would still be determined by the memory bandwidth, exactly as with a CPU-driven copy.

If your program’s performance solely depends on memory-to-memory copy performance, it can probably be significantly improved by avoiding copies as much as possible, or by implementing a smarter procedure such as copy-on-write.

The DXGI API might be doing a DMA copy on the Windows side, in which case we might already be nearly as good as we can be, but I’m still investigating there.

It appears that using some AVX-ish-looking example code from Intel on the 7900X doesn’t poison the cache as badly, and it has lower overall latency than glibc as well. Lower than the SSE2/3 copies too.

Maybe worth doing a follow-up video, since all these examples pretty much confirm that both Linux and FreeBSD could offer a lot more clever optimizations for large memcpy()s.

Just a thought… could you perhaps get better throughput by parallelizing it using multiple cores?

i.e., split the copy in half and spin off two threads (in looking glass, just keep N threads permanently active for doing 1/N of the work each - rather than constantly creating threads of course).

Surely memory copy is one of those “embarrassingly parallel” problems - if we assume that we are CPU bound rather than memory bus bound (as seems to be indicated above by the quad vs dual channel results)? If you’re memory bus bound then you’re kinda boned, one would think?

Looking glass does use multiple threads: https://github.com/gnif/LookingGlass/blob/03622f61b0ffb179edff3942e4d7b9f98f73074b/host/MultiMemcpy.cpp
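
For reference, here is a minimal sketch of the split-the-copy-across-N-threads idea (not the actual MultiMemcpy code, which keeps persistent worker threads; the names here are made up):

#include <pthread.h>
#include <string.h>

#define NTHREADS 4   /* illustrative; real code would keep workers alive */

struct chunk { void *dst; const void *src; size_t len; };

static void *copy_chunk(void *arg)
{
    struct chunk *c = arg;
    memcpy(c->dst, c->src, c->len);
    return NULL;
}

/* Split one large copy into NTHREADS pieces and copy them concurrently.
 * Spawning threads per call is done here only for brevity. */
static void parallel_memcpy(void *dst, const void *src, size_t len)
{
    pthread_t tid[NTHREADS];
    struct chunk parts[NTHREADS];
    size_t per = len / NTHREADS;

    for (int i = 0; i < NTHREADS; i++) {
        parts[i].dst = (char *)dst + (size_t)i * per;
        parts[i].src = (const char *)src + (size_t)i * per;
        parts[i].len = (i == NTHREADS - 1) ? len - (size_t)i * per : per;
        pthread_create(&tid[i], NULL, copy_chunk, &parts[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
}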


I was chatting with a friend from work when ram bandwidth came into the conversation.

Wendell benchmarked a Threadripper doing a 32 MB copy in 0.7 ms

bench: 32MB/0.7ms = 45GB/s = 45GB/s reads + 45GB/s writes

ram: 3 GHz * 64bits * 2 (for DDR) = 48GB/s / channel

@wendell , was this a quad channel setup, what if you run multiple threads on multiple cores. Do you get close to 90GB/s aggregate?

I believe the write side is actually a read + write: you need to load the cache line before you can write to it.

The AMD memcpy I linked to above uses the “movntq” instruction. Check out the comments in the code:

movntq	[edi-64], mm0	; write 64 bits, bypassing the cache
movq	mm0,[esi-40]	;    note: movntq also prevents the CPU
movntq	[edi-56], mm1	;    from READING the destination address
movq	mm1,[esi-32]	;    into the cache, only to be over-written
movntq	[edi-48], mm2	;    so that also helps performance

This interacts with the cache coherence protocols, since a write must invalidate the cache line in other cores. Note this comment in the Intel manual:

“Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTQ instructions if multiple processors might use different memory types to read/write the destination memory locations.”

I’m sure there is a trade off somewhere where mfence/sfence is more/less expensive than the builtin cache coherence of movq. The looking glass application may benefit from sfence, though, since the frame buffer does not need to invalidate each cache line.
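
For comparison, a rough SSE2 sketch of the same non-temporal store + sfence pattern (memcpy_stream is a made-up name, not Looking Glass code):

#include <emmintrin.h>  /* SSE2: _mm_stream_si128 */
#include <stddef.h>

/* Sketch of a cache-bypassing copy, the SSE2 equivalent of the movntq
 * loop above.  Assumes 16-byte-aligned pointers and a length that is a
 * multiple of 16. */
static void memcpy_stream(void *dst, const void *src, size_t len)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < len / 16; i++) {
        __m128i v = _mm_load_si128(s + i);  /* normal (cached) read        */
        _mm_stream_si128(d + i, v);         /* non-temporal write, bypasses
                                               the cache on the way out    */
    }
    /* sfence: make the weakly-ordered streaming stores globally visible
     * before anyone else reads the destination buffer */
    _mm_sfence();
}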

Edit: Also, the cache coherence protocols could load the data from another CPU cache. This may also benefit Looking Glass, since the frame buffer display threads are running on different cores. I also think the OpenGL library could interact with the DMA of the video card somewhere; it probably has memory pinned for the DMA to occur. Said another way: Looking Glass copies the frame from the VM into the OpenGL pinned buffer, which is then DMAed by the video card.
