Yeah I’m going to go on a scavenger hunt & call some friends, see what HW I can dig up to at least test with. But that will probably have to wait until the weekend.
And hey, thanks for being so on top of this. Major kudos.
Ok, initial results look pretty good. Still having viewer update rate issues (egl_warn_slow), more on that in a sec.
Congrats on getting to B2rc1. I’ll definitely hammer on this build and get you more feedback soon.
I can’t replicate this.
The fps limit is no longer a limit, it’s a target. If there is a frame to render, even if the value is set to 0, it will render the new frame instantly. You will only get this message if your client hardware is too slow to upload the image into your host GPU and the frame has been dropped, as there is nothing else that can be done at that point. In short, your host GPU is too slow and is the cause of your issues, including the above mentioned behaviour.
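If it helps to picture the limit-vs-target distinction, here is a minimal sketch of target-style pacing (illustrative names only, not the actual looking-glass code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch, not the actual looking-glass code.
 * With a hard limit, a ready frame may wait for the next tick; with a
 * target, a ready frame is rendered immediately and the target only
 * bounds how long we sleep when there is nothing new to show. */
static uint64_t next_wake(uint64_t now_us, unsigned target_fps,
                          bool frame_ready)
{
  if (frame_ready)
    return now_us;                        /* render instantly, never delay */
  if (target_fps == 0)
    return now_us + 1000;                 /* no target: just poll briefly */
  return now_us + 1000000 / target_fps;   /* wait at most one frame period */
}
```

Under this model a value of 0 never throttles rendering; it only changes how long the client idles when the guest has produced nothing new.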
It’s not helpful at all, that message is 100% proof positive that your host GPU texture upload performance is not fast enough to keep up with the incoming stream of updates from the guest.
Thanks mate
This is what I see on my machine. See attached.
Video.zip (389.6 KB)
(sorry it’s so low-resolution, am a bit rushed today) Monitor is in PiP mode. Shows display client output in full-screen mode except for lower-right portion which is video card output. Note hover difference.
Ok if you say so, not to argue but I’d think users getting a perspective on relative frame rates would be useful
Briefly skimming texture.c, it looks like the warning is generated even if a single frame is skipped. My guess is occasional host OS process scheduler hiccups will make this warning pretty common.
Another thing. I have shm size set to 64M to support 3440x1440 and am using integrated graphics. I plan to dive in a bit deeper this weekend, but my guess is this 64M buffer size aggravates the IGP performance problem. I also expect a lot of users like me would like to use looking-glass in an IGP + discrete video adapter hardware scenario and not have to stuff two discrete cards in their machine to get things running smoothly.

I haven’t looked closely at how the round-up-to-power-of-two shm requirement relates to texture->pboBufferSize; granted, it may allow a more optimal SIMD/DMA memory block-copy. But in my case, running at 3440x1440 results in a ~37.8M buffer requirement. Since this must be rounded up to the next power of two, I have to set shm size to 64M. Would it be more efficient in cases like this to break the power-of-two requirement rather than eat the ~41% inefficiency? Is this requirement just a guideline or a hard-coded requirement? If it’s hard-coded, perhaps a user-settable parameter could soften the restriction?
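For the record, here’s the arithmetic behind those numbers (assuming 32-bit frames and two buffered frames, which is my guess at why it comes to ~37.8M, not something I’ve confirmed from the code):

```c
#include <stdint.h>

/* Round a size up to the next power of two. */
static uint64_t round_up_pow2(uint64_t v)
{
  uint64_t p = 1;
  while (p < v)
    p <<= 1;
  return p;
}

/* 3440 x 1440 px * 4 bytes * 2 frames = 39,628,800 bytes (~37.8 MiB).
 * The frame layout here is an assumption, not taken from the code. */
static uint64_t shm_needed(void)
{
  return 3440ULL * 1440 * 4 * 2;
}
```

Rounding 39,628,800 up to the next power of two gives 67,108,864 (64 MiB), of which roughly 41% would sit unused.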
Just trying to make suggestions to improve the igp host viewer scenario, hope it’s helpful.
Edit1: FYI, interestingly, in my ‘baseline’ vm, which is for some reason faster (?), I just noticed the egl_warn_slow warning takes a while to show up; generally not until I load the vm with some work. Does this mean that it’s borderline fast enough? It’s still not exactly smooth. I was surprised to see vm load generating an issue in the host environment. Is this an OS thread scheduling issue? Could vm core pinning help?
I also noticed that when setting maxFPS (target framerate) to low levels (easily visible at 5-15) I see entire frames or several frames skipped, which at this fps results in big jumps. I expected the opposite. Could this be a hint as to what’s going on, at least on my machine?
yes, after filling a buffer of up to three frames, which is plenty of time to catch up if the scheduler jumps in.
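So, if I follow, the gating could be sketched roughly like this (hypothetical names and logic, only my reading of the description above; the real code lives in texture.c):

```c
#include <stdbool.h>

#define BUF_FRAMES 3  /* buffer of up to three frames, per the above */

/* Hypothetical sketch of the gating described: a scheduler hiccup just
 * fills the buffer and drains later; the slow-GPU warning only fires
 * once the buffer is already full and a new frame has to be dropped. */
static bool should_warn_slow(int framesQueued, bool newFrameArrived)
{
  return newFrameArrived && framesQueued >= BUF_FRAMES;
}
```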
It’s a hardware requirement, you can not map non-power of two address spaces into the guest’s memory space.
It’s a brand new warning and may need some heuristics added; you’re picking at very new code here.
Cursor movements do not trigger updates, so if you have a static image (ie the desktop) and have your FPS super low, your mouse will be super jumpy. Experimentation shows that a minimum of 60FPS is needed to get decent cursor interactivity, and allowing the cursor movements to trigger new frames is not an option either as it causes frame skips due to the high rate of cursor updates.
Not really sorry, I am already building this code to be as fast and efficient as possible (Btw, try B2-rc2 if you have not already, more EGL improvements). You have two issues here.
My attention to your problem was not to address making IGP better, as we already know we are at the limits of the current hardware on the market. It was to fix that tearing and old frame presentation issue, which I had suspected was occurring but never managed to prove until you captured it on video (and thanks for that btw, it was a huge help in understanding where the problem was).
Sorry I got drafted into yardwork & bbqs this weekend so this is just a skim but:
Just wondering, it might be an easy optimization win if only the used memory were copied, not all of the (power-of-two rounded-up) allocated memory. (?) If so, this would be a 41% reduction in memory copied per frame in my case, more or less in others.
Just curious, what throttles UPS to 0? Lack of frame-frame delta? By the way, I’m seeing that ‘last-frame-lost’ issue in all my vms running the b2-rc1 code. It’s a regression IMO.
Ok, not to over-argue the point, I think looking-glass would be greatly improved if optimized a bit for the IGP use case. This is because I think the discrete+IGP hw platform could be your most common user’s rig because it wouldn’t require them to buy a 2nd graphics card. And these optimizations might benefit all cases.
Some suggestions for ways to optimize (generally better pack) data to reduce memory bandwidth requirements:
Sorry, you probably know and have considered most or all of this…
I’ll try to get your latest bits installed tonight and get you feedback ASAP. Thanks for your patience, and I hope you keep the door open to suggestions. Not trying to bag on you or your work, I still think it’s great stuff!
You clearly do not understand what shared memory is, there is zero copy, both programs/apps see the same physical memory at once. We only copy/upload the actual frame data to the GPU, just because the RAM is available doesn’t mean we use it.
DXGI, if there are no frame changes it just waits.
What issue?
We have done all of this over the last year, testing which works best; every proposal you have suggested here requires the CPU reading every byte of memory and processing it, then putting it back somewhere else. This compounds the issue; there is no way to improve this.
The only viable encoding is h264 ON the GPU, but it adds a ton of latency which goes against the entire point of this project.
You are describing one of the technologies behind MPEG, the amount of computing required to identify the deltas and apply them is enormous.
The ONLY performant option to reduce total bandwidth is to use the YUV420 color space at the sacrifice of color quality, but in practice even with the hardware acceleration the process of packing & unpacking it yields far too much overhead to make it viable.
Again, it requires compressing every single frame which means reading the entire frame into the CPU from RAM, doing math, then putting it back. Then on the host, reading it from RAM into the CPU, doing math, then putting it back, then handing it to the GPU. Not only do you add CPU overhead but you double your RAM bandwidth requirements.
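For what it’s worth, the back-of-envelope bandwidth figures behind this (my numbers, assuming 3440x1440 at 60 FPS with 32-bit RGBA frames):

```c
#include <stdint.h>

/* Rough bandwidth arithmetic; resolution and rate are assumptions. */
enum { W = 3440, H = 1440, FPS = 60 };

/* 32 bpp RGBA: ~1.19 GB/s of frame data per second. */
static uint64_t rgba_bytes_per_sec(void)
{
  return (uint64_t)W * H * 4 * FPS;
}

/* YUV420 stores 1.5 bytes per pixel (full-res Y, quarter-res U and V),
 * i.e. 12 bpp, a 62.5% reduction over RGBA at the cost of chroma detail. */
static uint64_t yuv420_bytes_per_sec(void)
{
  return (uint64_t)W * H * 3 / 2 * FPS;
}

/* CPU-side compression reads the frame and writes the result back on
 * both ends, so RAM traffic roughly doubles before any wire savings. */
static uint64_t compressed_ram_traffic(void)
{
  return rgba_bytes_per_sec() * 2;
}
```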
Uhh, pardon me, not meaning to offend. Really.
Weird, I see the memcpy statement, into shm on the frame server side at least. But yeah, I didn’t write the code. Ok, I see you posted in one side of the code. Great!
Still, my point is you could compress data as it gets copied into shared memory and decompress it as it gets copied out. But what do I know, I’m obviously a dumb-ass, right?
Hover issue I mentioned earlier on this thread. And supplied a video of…
If you say so… I’d think that since the data is in-cache for that memcpy it wouldn’t be a big hit to pack it. But you’re right, I haven’t exactly spent the time you have on it, so I bow to your expertise.
I never implied that you are dumb, just ignorant as to how these things work. It’s not as simple as it looks for several reasons.
This is what I have been working on the last few months to try to improve the picture for everyone…
This is a hardware accelerated DMA upload from shared ram directly into the GPU using the GPU’s DMA engine. This is the fastest and lowest latency option for every platform, including IGP. It’s a WIP and very experimental at the moment, which is why it has not yet been seen in the public eye.
Note, I already have an experimental LG client using this, and it’s completely compatible with LGMP, which is why LGMP works very hard to keep memory allocations (if you can call it that, more like assignments) aligned so that they can be made into userspace dma buffers (udmabuf).
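For anyone curious what a userspace dma buffer actually involves, here is a minimal sketch of creating one via /dev/udmabuf (generic Linux udmabuf usage, not the actual LG client code; error handling trimmed for brevity):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

/* udmabuf requires page-aligned offset and size, which is exactly why
 * the allocator must keep its assignments aligned. */
static uint64_t page_align(uint64_t size)
{
  uint64_t page = (uint64_t)sysconf(_SC_PAGESIZE);
  return (size + page - 1) & ~(page - 1);
}

/* Returns a dma-buf fd backed by an anonymous memfd, or -1 on failure. */
static int create_udmabuf(uint64_t size)
{
  size = page_align(size);

  int memfd = memfd_create("lg-shm", MFD_ALLOW_SEALING);
  if (memfd < 0)
    return -1;

  if (ftruncate(memfd, (off_t)size) < 0 ||
      fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK) < 0) /* udmabuf requires it */
  {
    close(memfd);
    return -1;
  }

  int dev = open("/dev/udmabuf", O_RDWR);
  if (dev < 0)
  {
    close(memfd);
    return -1;
  }

  struct udmabuf_create create = {
    .memfd  = (uint32_t)memfd,
    .flags  = UDMABUF_FLAGS_CLOEXEC,
    .offset = 0,
    .size   = size
  };
  int buf = ioctl(dev, UDMABUF_CREATE, &create);

  close(dev);
  close(memfd);
  return buf;
}
```

The resulting dma-buf fd can then be imported by the GPU driver for zero-copy DMA, which matches the "assignments, not allocations" point above.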