LG Client dmabuf-test & Nvidia driver 535: DMABUF working

Hi Gnif,

Just registered on the forum to confirm that the dmabuf-test branch gets DMA buffers fully working with the Linux NVIDIA 535 driver series (525 still bailed out with BAD PARM for some reason).

Testbed:

DOM0 : Zen1 Server running Proxmox 8.0
GPUs : Quadro M6000, Quadro P4000, Quadro RTX 6000 (the latter tested both as VFIO passthrough and as vGPU)
3 x VM guests with LG Host B6 : Win10 LTSC 1809 with NVIDIA drivers (Quadro or GRID version) 536.25
1 x VM guest with LG Client (dmabuf-test branch) : Ubuntu 22 with NVIDIA driver 535.86

QEMU additional lines on VM1/2/3 (Hosts):

VM1 → args: -device ivshmem-plain,id=shmem0,memdev=looking-glass -object memory-backend-file,id=looking-glass,mem-path=/dev/shm/looking-glass,size=128M,share=yes -spice port=5901,addr=[server-ip],disable-ticketing=on,image-compression=off

VM2 → args: -device ivshmem-plain,id=shmem2,memdev=looking-glass-2 -object memory-backend-file,id=looking-glass-2,mem-path=/dev/shm/looking-glass-2,size=32M,share=yes -spice port=5902,addr=[server-ip],disable-ticketing=on,image-compression=off

VM3 → args: -device ivshmem-plain,id=shmem3,memdev=looking-glass-3 -object memory-backend-file,id=looking-glass-3,mem-path=/dev/shm/looking-glass-3,size=32M,share=yes -spice port=5903,addr=[server-ip],disable-ticketing=on,image-compression=off

Note: VM1 is rendering at 4K, thus the larger ivshmem size. VM2/3 do HD.
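
For reference, the sizes follow the usual LG guidance (assuming I'm applying it correctly): width x height x 4 bytes x 2 frames, plus roughly 10 MiB, rounded up to the next power of two. At 3840x2160 that is about 63 MiB + 10 → 128 MiB; at 1920x1080 it is about 16 MiB + 10 → 32 MiB.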

QEMU additional line on VM4 (Client):

VM4 → args: -device ivshmem-plain,id=shmem0,memdev=looking-glass -object memory-backend-file,id=looking-glass,mem-path=/dev/shm/looking-glass,size=128M,share=yes -device ivshmem-plain,id=shmem1,memdev=looking-glass-2 -object memory-backend-file,id=looking-glass-2,mem-path=/dev/shm/looking-glass-2,size=32M,share=yes -device ivshmem-plain,id=shmem3,memdev=looking-glass-3 -object memory-backend-file,id=looking-glass-3,mem-path=/dev/shm/looking-glass-3,size=32M,share=yes

Compiled and loaded the kvmfr module on VM4 (which automatically creates /dev/kvmfr0, 1 and 2).
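
For reference, the module steps were roughly (paths relative to the LG source tree, adjust as needed):

cd looking-glass/module
make
sudo insmod kvmfr.ko      # binds to each ivshmem-plain PCI device it finds
ls /dev/kvmfr*            # → /dev/kvmfr0 /dev/kvmfr1 /dev/kvmfr2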

Connected as:

VM4 → VM1 : __GL_YIELD=usleep ./looking-glass-client.dma-test -f /dev/kvmfr0 -p 5901 -c [server-ip] -m KEY_RIGHTCTRL spice:audio=no egl:vsync=on

VM4 → VM2 : __GL_YIELD=usleep ./looking-glass-client.dma-test -f /dev/kvmfr1 -p 5902 -c [server-ip] -m KEY_RIGHTCTRL spice:audio=no egl:vsync=on

VM4 → VM3 : __GL_YIELD=usleep ./looking-glass-client.dma-test -f /dev/kvmfr2 -p 5903 -c [server-ip] -m KEY_RIGHTCTRL spice:audio=no egl:vsync=on
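
As an aside, the same options can apparently also live in a client config file; if I'm reading the option list correctly (names worth double-checking against --help), something roughly like this in ~/.looking-glass-client.ini should match the /dev/kvmfr0 invocation:

[app]
shmFile=/dev/kvmfr0

[spice]
host=[server-ip]
port=5901
audio=no

[input]
escapeKey=KEY_RIGHTCTRL

[egl]
vsync=on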

Performance improvement is significant:

  • on the client, CPU usage per LG Client instance went down from 50-70% (no DMA bufs) to 5-8% (with DMA bufs), measured with top

  • on the hosts, I got better numbers with NvFBC than with DXGI (maybe because the GPUs are Quadros): LG Host CPU usage was down to 1% with NvFBC vs 5% with DXGI in the same scenario, measured with Windows Task Manager

I had all 3 VMs with LG Host pushing high-FPS rendering to 3 LG Client instances with DMA buf on VM4, all at the same time for 8 hours, and it was completely stable.

I tested the LG Client on each Quadro generation (as there were rumors DMA buf might only work on Turing and newer) and am happy to report that, with driver 535, DMA buf works flawlessly on Maxwell and Pascal too. I could not test on Kepler, since the NVIDIA driver there stops at 470, but I think Nouveau could be used to get DMA bufs on it.

Kudos to Gnif for the milestone and good job Nvidia with the driver.

Thanks,
-max

Hi Max,

Not sure why you’re using the old dmabuf-test branch; it was merged into master quite some time ago. Please use the bleeding-edge release of LG instead.

Hi Gnif,

Good point.

I tested with dmabuf-test since it’s still compatible with Host B6, while the latest master also requires an updated Windows host, and I didn’t have all the tools in place to cross-compile it.

Do you think performance would be even better with master?

Thanks,
-max

We provide host builds on the LG website; why are you building it at all?

Also note that your testing is not very useful as you are enabling vsync in the LG client. We disable this intentionally and for good reason: it impacts latency far too much, and there are better solutions to tearing, such as TearFree.

Thanks Gnif, will retest with master and pre-built Host (somehow I missed that).

The goal of my testing is not latency though, but the lowest possible CPU usage on host and client, which is why DMA buf was important.

If you think there is interest on the forum, I can report which settings seem optimal for minimizing CPU usage in this hardware configuration with the latest master (although they might not be the optimal ones for latency).

LG already aims to use minimal CPU; using vsync is not the way to achieve this, and if anything it will increase it, as the CPU has to block waiting on the GPU to flip the buffers.

You want to look at enabling jitRender instead, which waits until the GPU is about to flip the buffers and draws at the last possible moment. This gives the best of both worlds: the GPU won't buffer frames (which increases latency), and LG won't render faster than required.

Note though that we do not enable this by default, as it requires some tuning on your part for Wayland; on X11 we do our best to emulate it, as there is no way to get this notification without some hacks, so it can be a bit buggy.

Really interesting. Thanks for the pointers.

Without DMA buffers, vsync almost doubled the client's CPU usage.

The interesting part (at least on this hardware & software configuration) is that with DMA buffers, the combination of egl:vsync=on and __GL_YIELD=usleep seems to save about 2% CPU compared to the vsync=off scenario (at the cost of latency). __GL_YIELD is NVIDIA-specific; I'm not sure if there is a Mesa equivalent. vsync alone used more CPU than the default.

Will definitely take a look at jitRender and TearFree.
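
If I'm reading the current option list right, jitRender should just be a matter of something like (option name worth confirming against --help on master):

./looking-glass-client -f /dev/kvmfr0 -p 5901 -c [server-ip] -m KEY_RIGHTCTRL spice:audio=no win:jitRender=yes

TearFree, on the other hand, looks like an Xorg driver option (e.g. Option "TearFree" "true" for the amdgpu/intel drivers), so I'll have to check what the equivalent is on the proprietary NVIDIA driver in this client VM.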

Thanks a lot!