Graphics tearing in looking-glass client window output

Yeah, I’m going to go on a scavenger hunt & call some friends, see what HW I can dig up to at least test with. But that will probably have to wait until the weekend.

And hey, thanks for being so on top of this. Major kudos. :+1:


Ok, initial results look pretty good. Still having viewer update rate issues (egl_warn_slow), more on that in a sec.

  • UPS is much improved, raised from 30 to 40 in my ‘daily driver’ (slow) vm.
  • The high CPU utilization during window drags mentioned above appears to be greatly reduced.
  • I’m noticing an issue where hovering the mouse over the Windows tree-list graphical element doesn’t highlight the new element. To reproduce, open Windows Explorer & hover over Documents, then hover over Downloads. Note the highlighted element doesn’t update. I notice UPS drops to 0 quite quickly. Missing last frame update?
  • Regarding egl_warn_slow: Your last commit got me thinking about win:fpsLimit. I tried starting the client with values as low as 2 and still see the egl_warn_slow warning on output. I don’t think the viewer is running at less than 2 fps (?).
  • It might be helpful for troubleshooting to display the actual frameserver & viewer fps values when it generates that warning. Even better would be to display viewer fps when win:showFPS is enabled.

Congrats on getting to B2rc1. I’ll definitely hammer on this build and get you more feedback soon.

I can’t replicate this.

The fps limit is no longer a limit, it’s a target. If there is a frame to render, it will be rendered instantly, even if the value is set to 0. You will only get this message if your client hardware is too slow to upload the image into your host GPU and the frame has been dropped, as there is nothing else that can be done at that point. In short, your host GPU is too slow and is the cause of your issues, including the above mentioned behaviour.

It’s not helpful at all, that message is 100% proof positive that your host GPU texture upload performance is not fast enough to keep up with the incoming stream of updates from the guest.

Thanks mate

This is what I see on my machine. See attached.
Video.zip (389.6 KB)
(sorry it’s so low-resolution, I’m a bit rushed today) The monitor is in PiP mode: it shows the client output in full-screen, except for the lower-right portion, which is the video card output. Note the hover difference.

Ok if you say so, not to argue but I’d think users getting a perspective on relative frame rates would be useful :man_shrugging:

Briefly skimming texture.c, it looks like the warning is generated even if a single frame is skipped. My guess is occasional host OS process scheduler hiccups will make this warning pretty common.

Another thing. I have the shm size set to 64M to support 3440x1440 and am using integrated graphics. I plan to dive in a bit deeper this weekend, but my guess is that this 64M buffer size aggravates the IGP performance problem. I also expect a lot of users like me would like to use looking-glass in an IGP + discrete video adapter hardware scenario and not have to stuff two discrete cards in their machine to get things running smoothly.

I haven’t looked closely at how this round-up-to-power-of-two shm requirement relates to texture->pboBufferSize; it may well allow a more optimal SIMD/DMA memory block-copy. But in my case, running at 3440x1440 results in a ~37.8M buffer requirement, and since this has to be rounded up to the next power of two, I have to set the shm size to 64M. Would it be more efficient in cases like this to break the power-of-two requirement rather than eat the ~41% inefficiency? Is this just a guideline or a hard-coded requirement? If it’s hard-coded, perhaps a user-settable parameter could soften the restriction?
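For reference, a rough sketch of the arithmetic behind those numbers (the two-frame, 32bpp assumption and the helper names are mine, not anything from the LG code):

```c
#include <stdint.h>
#include <stdio.h>

/* Round a 64-bit size up to the next power of two. */
static uint64_t next_pow2(uint64_t v)
{
  v--;
  v |= v >> 1;  v |= v >> 2;  v |= v >> 4;
  v |= v >> 8;  v |= v >> 16; v |= v >> 32;
  return v + 1;
}

int main(void)
{
  /* Assumptions: 32bpp frames, double buffered. */
  const uint64_t width = 3440, height = 1440, bpp = 4, frames = 2;
  const uint64_t needed = width * height * bpp * frames;
  const uint64_t shm    = next_pow2(needed);

  printf("needed: %.1f MiB, shm (next pow2): %.1f MiB, unused: %.0f%%\n",
         needed / 1048576.0, shm / 1048576.0,
         100.0 * (shm - needed) / shm);
  return 0;
}
```

That works out to roughly 37.8 MiB needed, 64 MiB allocated, ~41% unused.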

Just trying to make suggestions to improve the igp host viewer scenario, hope it’s helpful.

Edit1: FYI, interestingly, in my ‘baseline’ vm, which is for some reason faster (?), I just noticed the egl_warn_slow warning takes a while to show up; generally not until I load the vm with some work. Does this mean it’s borderline fast enough? It’s still not exactly smooth. I was surprised to see vm load generating an issue in the host environment. Is this an OS thread scheduling issue? Could vm core pinning help?

I also noticed that when setting maxFPS (target framerate) to low levels (easily visible at 5-15) I see entire frames or several frames skipped, which at this fps results in big jumps. I expected the opposite. Could this be a hint as to what’s going on, at least on my machine?

yes, after filling a buffer of up to three frames, which is plenty of time to catch up if the scheduler jumps in.

It’s a hardware requirement, you can not map non-power of two address spaces into the guest’s memory space.

It’s a brand new warning and may need some heuristics added; you’re picking at very new code here.

Cursor movements do not trigger updates, so if you have a static image (ie the desktop) and have your FPS super low, your mouse will be super jumpy. Experimentation shows that a minimum of 60FPS is needed to get decent cursor interactivity, and allowing the cursor movements to trigger new frames is not an option either as it causes frame skips due to the high rate of cursor updates.

Not really sorry, I am already building this code to be as fast and efficient as possible (Btw, try B2-rc2 if you have not already, more EGL improvements). You have two issues here.

  1. You are using an IGP… this is always going to hamper you. IGPs share system memory, but in an isolated way, so they need to do the same transfers that a dedicated GPU does, only on far slower memory that is also being used by the entire OS. In effect, you have half the expected bandwidth when doing a texture upload as it’s a system RAM to system RAM copy, let alone whatever the OS is doing at the same time (see the rough numbers after this list).
  2. You are running 4K with an IGP. LG can do 4K, barely, on dedicated GPUs. I am running two dedicated GPUs (1080Ti, Radeon 7), both in x16 Gen 3 slots (Threadripper motherboard), and 4K is usable, but still not perfect under higher loads. In short, you are asking way too much of your IGP.
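To put some rough numbers on that bandwidth point (these assume 4K at 32bpp and 60 FPS and are purely illustrative, not measurements):

```c
#include <stdio.h>

int main(void)
{
  /* Assumed figures: 3840x2160 @ 32bpp, 60 FPS. */
  const double bytes_per_frame = 3840.0 * 2160.0 * 4.0;        /* ~31.6 MiB */
  const double stream_gib_s    = bytes_per_frame * 60.0 / (1 << 30);

  /* A RAM-to-RAM memcpy reads and writes every byte, so the memory bus sees
     roughly double the stream rate -- and that is before the guest-side
     capture copy and everything else the OS is doing with the same RAM. */
  printf("frame stream : %.2f GiB/s\n", stream_gib_s);
  printf("memcpy cost  : %.2f GiB/s of memory bus traffic\n", stream_gib_s * 2.0);
  return 0;
}
```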

My attention to your problem was not aimed at making the IGP case better, as we already know we are at the limits of the current hardware on the market. It was to fix that tearing and old-frame presentation issue, which I had suspected was occurring but never managed to prove until you captured it on video (and thanks for that btw, it was a huge help in understanding where the problem was).

Sorry I got drafted into yardwork & bbqs this weekend so this is just a skim but:

Just wondering, it might be an easy optimization win if only the used memory is copied, not all of the (power-of-two rounded-up) allocated memory. (?) If so, this would be a 41% reduction in memory copied per frame in my case, more or less in others.

Just curious, what throttles UPS to 0? Lack of frame-frame delta? By the way, I’m seeing that ‘last-frame-lost’ issue in all my vms running the b2-rc1 code. It’s a regression IMO.

Ok, not to over-argue the point, but I think looking-glass would be greatly improved if it were optimized a bit for the IGP use case. I think the discrete+IGP hw platform could be your most common user’s rig, since it wouldn’t require them to buy a 2nd graphics card. And these optimizations might benefit all cases.

Some suggestions for ways to optimize (generally better pack) data to reduce memory bandwidth requirements:

  • Granted, we don’t want to overload the CPU with pack/unpack overhead; a balance needs to be struck between bandwidth overhead & CPU overhead.
  • RLE frame encoding is probably the minimal-CPU-overhead ‘codec’ to look at. I bet the overhead would be really low if properly coded for compiler-generated or hand-coded SIMD instructions. I bet you could combine the pack & copy and also the unpack & copy operations so it isn’t even two loops (see the rough sketch after this list). Search on ‘packed memory copy’, lots of good stuff available. I’d look into this first.
  • Of course other ‘codecs’ could be applied. I’m really liking ZFS’s new compression codecs, they’re really low overhead / high-compression and might be worth a look.
  • And then there’s temporal encoding, where frame and frame-1 are diffed and only the deltas are shipped to the viewer. Of course this means frames can’t be skipped, I-frames would have to be reset every so often, etc. Kinda complex, but if done right it could be a big bandwidth win with little overhead.
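To make the RLE idea concrete, here is a minimal single-pass pack-while-copying sketch (purely illustrative; the 32-bit pixel format, the names and the naive scalar loops are my assumptions, nothing from the LG code):

```c
#include <stddef.h>
#include <stdint.h>

/* Naive run-length pack of 32-bit pixels while "copying" out of the source
 * buffer: emits (count, pixel) pairs.  Returns the number of 32-bit words
 * written to dst (a count word plus a pixel word per run); worst case is
 * 2*n words.  Sketch only -- no SIMD, no fancy format. */
static size_t rle_pack_copy(const uint32_t *src, size_t n, uint32_t *dst)
{
  size_t out = 0;
  size_t i   = 0;
  while (i < n)
  {
    uint32_t pixel = src[i];
    uint32_t run   = 1;
    while (i + run < n && src[i + run] == pixel && run < UINT32_MAX)
      ++run;
    dst[out++] = run;    /* run length  */
    dst[out++] = pixel;  /* pixel value */
    i += run;
  }
  return out;
}

/* Matching unpack-while-copying: expands (count, pixel) pairs into dst and
 * returns the number of pixels written. */
static size_t rle_unpack_copy(const uint32_t *src, size_t packed, uint32_t *dst)
{
  size_t out = 0;
  for (size_t i = 0; i + 1 < packed; i += 2)
  {
    uint32_t run   = src[i];
    uint32_t pixel = src[i + 1];
    for (uint32_t j = 0; j < run; ++j)
      dst[out++] = pixel;
  }
  return out;
}
```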

Sorry, you probably know and have considered most or all of this…

I’ll try to get your latest bits installed tonight and get you feedback ASAP. Thanks for your patience and I hope you keep the door open to suggestions. Not trying to bag on you or your work, I still think it’s great stuff! :+1:

You clearly do not understand what shared memory is, there is zero copy, both programs/apps see the same physical memory at once. We only copy/upload the actual frame data to the GPU, just because the RAM is available doesn’t mean we use it.

DXGI, if there are no frame changes it just waits.

What issue?

We have done all of this over the last year, testing what works best. Every proposal you have suggested here requires the CPU reading every byte of memory and processing it, then putting it back somewhere else. This compounds the issue; there is no way to improve this.

The only viable encoding is h264 ON the GPU, but it adds a ton of latency which goes against the entire point of this project.

You are describing one of the technologies behind MPEG, the amount of computing required to identify the deltas and apply them is enormous.

The ONLY performant option to reduce total bandwidth is to use the YUV420 color space at the sacrifice of color quality, but in practice even with the hardware acceleration the process of packing & unpacking it yields far too much overhead to make it viable.

Again, it requires compressing every single frame which means reading the entire frame into the CPU from RAM, doing math, then putting it back. Then on the host, reading it from RAM into the CPU, doing math, then putting it back, then handing it to the GPU. Not only do you add CPU overhead but you double your RAM bandwidth requirements.

Uhh, pardon me, not meaning to offend. Really.

Weird, I see the memcpy statement. Into shm on the frame server side at least. But yeah I didn’t write the code. Ok I see you posted in one side of the code. Great!
Still, my point is you could compress data as it gets copied into shared memory and decompress it as it gets copied out. But what do I know, I’m obviously a dumb-ass, right?

Hover issue I mentioned earlier on this thread. And supplied a video of…

If you say so… I’d think that since the data is in-cache for that memcpy it wouldn’t be a big hit to pack it. But you’re right, I haven’t exactly spent the time you have on it, so I bow to your expertise.

I never implied that you are dumb, just ignorant as to how these things work. It’s not as simple as it looks for several reasons.

  1. Windows may provide a buffer that is in system RAM, or it might be a DMA mapped buffer that is actually still on the GPU. We must copy this into shared memory. If it is a hardware mapped buffer it may have restrictions on how it’s accessed, such as no random or out-of-order accesses, or it either faults or performance suffers.
  2. On the host, we may be doing a simple memcpy from RAM into a GPU provided buffer for texture upload, but it’s entirely possible that the copy can be accelerated using a streaming texture upload, where the GPU does a DMA transfer itself from system RAM directly into itself. If this path is taken we can not do any decode at all as we have no control (a generic sketch of this path follows this list; see: https://github.com/gnif/LookingGlass/blob/master/client/renderers/OpenGL/opengl.c#L1283-L1291, and https://github.com/gnif/LookingGlass/blob/master/client/renderers/OpenGL/opengl.c#L1217-L1232)
  3. The overhead of this processing is far too high for a live video feed and introduces microstutters and latency, along with far higher CPU consumption. A simple memcpy can be done using SIMD extensions to parallelize the copy, getting extremely high throughput, but once you add some math to it, it gets much, much slower.
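For readers following along, point 2 refers to the standard streaming PBO upload path; a generic (non-LG) sketch of it, with illustrative names and a fixed BGRA format assumed, looks roughly like this (see the linked opengl.c for the real implementation):

```c
#include <string.h>
#include <GL/gl.h>
#include <GL/glext.h> /* assumes the buffer-object entry points are resolved via a loader such as GLEW/glad */

/* Stream one frame into 'tex' via a pixel unpack buffer (PBO).  Once the
 * data is in the PBO, glTexSubImage2D sources from the buffer object and
 * the driver/GPU can DMA it -- the CPU never touches the pixels again,
 * which is why no decode step can be inserted on this path. */
static void upload_frame_pbo(GLuint pbo, GLuint tex,
                             const void *frame, GLsizei w, GLsizei h)
{
  const GLsizeiptr size = (GLsizeiptr)w * h * 4; /* assuming 32bpp BGRA */

  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);

  /* Orphan the old storage and map fresh memory for this frame. */
  glBufferData(GL_PIXEL_UNPACK_BUFFER, size, NULL, GL_STREAM_DRAW);
  void *dst = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, size,
                               GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
  if (dst)
  {
    memcpy(dst, frame, (size_t)size);          /* the only CPU copy */
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    /* Offset 0 into the bound PBO, not a client memory pointer. */
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_BGRA, GL_UNSIGNED_BYTE, (const void *)0);
  }

  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}
```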

This is what I have been working on the last few months to try to improve the picture for everyone…

This is a hardware accelerated DMA upload from shared RAM directly into the GPU using the GPU’s DMA engine. This is the fastest and lowest latency option for every platform, including IGPs. It’s a WIP and very experimental at the moment, which is why it has not yet been seen in the public eye.

Note, I already have an experimental LG client using this, and it’s completely compatible with LGMP, which is why LGMP works very hard to keep memory allocations (if you can call it that, more like assignments) aligned so that they can be made into userspace dma buffers (udmabuf).
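For anyone curious what the udmabuf interface looks like, here is a minimal, generic sketch of creating one from a page-aligned region of a memfd (this is just the stock kernel API, not the experimental LG/LGMP code, and the function name is mine):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/udmabuf.h>

/* Create a dma-buf fd backed by [offset, offset+size) of a memfd.
 * Both offset and size must be page aligned -- which is why keeping the
 * shared memory allocations aligned matters.  Returns the dma-buf fd,
 * or -1 on error. */
static int make_udmabuf(int memfd, uint64_t offset, uint64_t size)
{
  /* The backing memfd must be sealed against shrinking. */
  if (fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK) < 0)
    return -1;

  int dev = open("/dev/udmabuf", O_RDWR);
  if (dev < 0)
    return -1;

  struct udmabuf_create create = {
    .memfd  = (uint32_t)memfd,
    .flags  = UDMABUF_FLAGS_CLOEXEC,
    .offset = offset,
    .size   = size,
  };

  int buf = ioctl(dev, UDMABUF_CREATE, &create);
  close(dev);
  return buf; /* importable by the GPU driver for DMA */
}
```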