Looking Glass - Changelog

gnif · July 20, 2018, 12:04am

https://passthroughpo.st/vfio-increments/

gnif · July 23, 2018, 3:12am

Hrmm, seems I have messed up the required packages also. A11 uses libssl-dev still. I will correct the documentation.

gnif · July 23, 2018, 5:32am

For those that wish to run the latest bleeding edge version, I have now set up AppVeyor to perform continuous builds. As usual, the latest bleeding edge version is NOT supported. If you decide to run this version you should be a developer and ready to dig into the problem to either fix it or provide precise technical details about the issue.

https://ci.appveyor.com/project/gnif/lookingglass/build/artifacts

Note that this version of the Windows host executable is built against the latest master version of the client and will NOT be compatible with older releases.

gnif · July 24, 2018, 10:28am

I am starting to look into this. At this point I think the best approach would be to write a kernel driver that creates a V4L endpoint for KVMFR which both the Looking Glass client and OBS can access.

This would be a major overhaul of the project but would make Looking Glass much more capable. The only thing preventing this at this point in time is… well, time. Support from patreon to continue working on Looking Glass has been amazing, but when it comes time to crunch the numbers, $300 a month is chicken scratching for the time involvement.

So far I estimate (very conservatively) that I have spent no less than 200 hours on the project. This includes kernel fixes (NPT, ThreadRipper), QEMU fixes (PS/2), writing the IVSHMEM driver for Windows and Linux, and Looking Glass itself. When I perform programming for my clients I usually charge $100 AUD/h for my time, which would bring the cost of the project to an estimated $20,000 AUD to date.

As such my priority must be on work that pays to feed the family, I am putting time into Looking Glass as I can, but unless the income from Patreon increases the amount of time I put into Looking Glass is rather limited.

This is not a call to arms for donations, it’s just helping people who are not developers to understand the cost in time to build such things, and asking the community to be patient as I do my best to continue this project.

gnif · July 25, 2018, 9:47am

Under Domain->Devices. If the Devices section is missing, create it.

Edit: I have updated the guide to include this information.

gnif · July 27, 2018, 1:02am

Good news! After two solid days of trying I finally figured out how to pass the captured DX11 texture through a pixel shader on the host. This has been a critical part of getting color space conversions working, needed for both NV12 and H264 encode!

I am really hoping to get colorspace conversions functional for A12 to help with high resolutions.

Edit: I hate DirectX… so complicated.

gnif · August 2, 2018, 10:34pm

Anything above 1280P doesn’t work well at this time due to GPU->RAM copy performance. I am working on adding YUV420 support as time permits to allow these higher resolutions at the cost of some color accuracy.

See: https://www.patreon.com/posts/egl-opengl-20453466

Faster RAM “may” help… but it’s unlikely as we are limited by the GPU’s copy performance.
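
To put rough numbers on the copy cost, here is a back-of-the-envelope sketch. It assumes the RGB path is 32 bits per pixel (for example BGRA) and that the YUV420 path uses NV12 at 12 bits per pixel. The resolutions and the 60 FPS target are illustrative choices, not measurements from Looking Glass.

```python
# Rough frame sizes and copy rates for a captured frame.
# Assumptions (not from the post): 32-bit BGRA for the RGB path,
# NV12 for the YUV420 path, and a 60 FPS target.

BYTES_PER_PIXEL_BGRA = 4   # 8 bits each for B, G, R and A
BITS_PER_PIXEL_NV12 = 12   # full-resolution 8-bit luma + 2x2 subsampled chroma


def frame_bytes_bgra(width: int, height: int) -> int:
    """Size in bytes of one uncompressed 32-bit frame."""
    return width * height * BYTES_PER_PIXEL_BGRA


def frame_bytes_nv12(width: int, height: int) -> int:
    """Size in bytes of one NV12 frame (even dimensions expected)."""
    return width * height * BITS_PER_PIXEL_NV12 // 8


def mib_per_second(frame_bytes: int, fps: int = 60) -> float:
    """Copy rate needed to move frame_bytes at the given frame rate."""
    return frame_bytes * fps / (1024 * 1024)


if __name__ == "__main__":
    for w, h in [(1920, 1080), (2560, 1440), (3840, 2160)]:
        rgb = frame_bytes_bgra(w, h)
        yuv = frame_bytes_nv12(w, h)
        print(f"{w}x{h}: BGRA {mib_per_second(rgb):7.1f} MiB/s, "
              f"NV12 {mib_per_second(yuv):7.1f} MiB/s "
              f"(ratio {yuv / rgb:.3f})")
```

Under these assumptions NV12 moves 3/8 of the bytes of a 32-bit frame, which is the saving being traded against chroma resolution.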

gnif · August 2, 2018, 10:57pm

At this point, where the bottleneck lies is mostly guesswork. Actually figuring this out requires some pretty expensive equipment ($10K AUD used), and even then there is no guarantee.

So trial and error is the only way.

gnif · August 7, 2018, 2:26am

I have managed to get some traction on the state of Vega reset. So far the word is that nothing has been done to address this. However, it seems there might be a way to fix the reset by means of a PCI quirk for these particular cards. If so, I will implement this ASAP and try to get the patch upstreamed.

gnif · August 7, 2018, 8:03am

I had a bit of a play and learned a bit more about the early post/init of the Vega series. There is a register to write that resets the ASIC on the chip, which works, but once done the card needs posting. That involves a fair amount of code, which I doubt would ever be allowed in as a PCI quirk.

It does seem possible, but really AMD should fix the VBIOS to properly support FLR, especially since it’s so involved to reset the card properly.

Edit: I have been put into contact with a group of developers at AMD and I have started digging deeper into the problem. After shutting down the guest and unloading vfio-pci, even the amdgpu module cannot re-init the card, offering the following errors in dmesg:

[15555.608910] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
[15555.608956] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D780 (len 279, WS 16, PS 4) @ 0xD884
[15555.608996] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing A7F0 (len 219, WS 8, PS 4) @ 0xA8BB
[15555.609034] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing 9BAC (len 381, WS 0, PS 8) @ 0x9C31

Decompiling the AtomBIOS, I have determined that this is occurring during “MemoryTraining”, which is stuck in the following loop:

  0104: 01050000585c0100  MOVE   reg[0000]  [XXXX]  <-  00015c58
  010c: 3c61010000        COMP   reg[0004]  [..X.]  <-  param[00]  [...X]
  0111: 490401            JUMP_NotEqual  0104

Hopefully AMD can comment on what exactly is going wrong and why the ASIC will not re-init.
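
The shape of that loop is a plain poll-until-equal with no exit of its own. The sketch below models it in Python purely as an illustration: `read_status` is a hypothetical stand-in for the register read, and the iteration budget stands in for the kernel interpreter's "more than 5secs" abort seen in the dmesg output. It is not the AtomBIOS interpreter and says nothing about why the hardware never reports the expected value.

```python
from typing import Callable


class StuckInLoop(Exception):
    """Raised when the polled value never matches within the budget."""


def poll_until_equal(read_status: Callable[[], int],
                     expected: int,
                     max_iterations: int = 1000) -> int:
    """Return how many reads it took until read_status() == expected.

    Mirrors the COMP / JUMP_NotEqual pair: compare, then jump back while
    the values differ. The budget is an addition so the model terminates.
    """
    for attempt in range(1, max_iterations + 1):
        if read_status() == expected:
            return attempt
    raise StuckInLoop(
        f"no match for {expected:#x} after {max_iterations} reads")


if __name__ == "__main__":
    # A device that reports "done" (1) on the third read.
    answers = iter([0, 0, 1])
    print(poll_until_equal(lambda: next(answers), expected=1))

    # A device that never reports done: give up instead of spinning forever.
    try:
        poll_until_equal(lambda: 0, expected=1, max_iterations=50)
    except StuckInLoop as exc:
        print("aborted:", exc)
```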

Edit 2: Further digging reveals a change in the ASIC_Init AtomBIOS routine on later versions. It seems a block of init code has been jumped over, rendering it completely inaccessible. Here is the intro to ASIC_Init in my VBIOS.

  0006: 370000            SET_ATI_PORT  0000  (INDIRECT_IO_MM)
  0009: 4be50004          TEST   param[00]  [X...]  <-  04
  000d: 496601            JUMP_NotEqual  0166
  0010: 4be50002          TEST   param[00]  [X...]  <-  02
  0014: 441e00            JUMP_Equal  001e
  0017: 4be50040          TEST   param[00]  [X...]  <-  40
  001b: 49da00            JUMP_NotEqual  00da
  001e: 4a65530002        TEST   reg[014c]  [..X.]  <-  02
  0023: 49c000            JUMP_NotEqual  00c0

And in later versions:

  0006: 370000            SET_ATI_PORT  0000  (INDIRECT_IO_MM)
  0009: 4be50004          TEST   param[00]  [X...]  <-  04
  000d: 495901            JUMP_NotEqual  0159
  0010: 4be50002          TEST   param[00]  [X...]  <-  02
  0014: 442100            JUMP_Equal  0021
  0017: 4be50040          TEST   param[00]  [X...]  <-  40
  001b: 49cb00            JUMP_NotEqual  00cb
  001e: 435301            JUMP   0153

Note the additional unconditional JUMP at the end! This bypasses all the early init calls, including MemoryInitialization, which in turn calls MemoryTraining. This makes me wonder how these are being set up in the first place if the calls are completely skipped. Perhaps it relies on any pre-post configuration that may have been done at power on instead.

Edit 3: Interestingly and of note, the AtomBIOS is run by an interpreter in the kernel; the card doesn’t actually execute any of this itself. It would be entirely possible to replace the AtomBIOS with a version on disk instead of executing the one in the ROM, or even re-write it directly into the kernel module.

gnif · August 7, 2018, 5:33pm

I have been informed by AMD that the Vega10 series require a PSP mode3 reset to return them to a pre-init state and there is no simple reset feature at all, if it is working for others it is just dumb luck.

I have compared several BIOS revisions now and the only differences between them are extremely minor, even across brands/models. About the only thing I can see that actually changes BIOS to BIOS are voltage or clock initialization values.

I am not quite sure what the PSP is other than perhaps “Platform Security Processor”, but it’s in the context of the GPU, so not likely. Either way, I am digging through the amdgpu sources and it looks rather complex to implement this as a PCI quirk.

I have started down the tree of calls to see if I can get a bare bones implementation of a PSP Mode3 Reset working. For those inclined the area to look at is:

amdgpu_device_init
  \
    amdgpu_device_ip_early_init
    \
      soc15_set_ip_blocks
      \
        amdgpu_device_ip_block_add(adev, &psp_v3_1_ip_block);

The actual reset method is:

static int psp_v3_1_mode1_reset(struct psp_context *psp)
{
        int ret;
        uint32_t offset;
        struct amdgpu_device *adev = psp->adev;

        offset = SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_64);

        ret = psp_wait_for(psp, offset, 0x80000000, 0x8000FFFF, false);

        if (ret) {
                DRM_INFO("psp is not working correctly before mode1 reset!\n");
                return -EINVAL;
        }

        /*send the mode 1 reset command*/
        WREG32(offset, 0x70000);

        mdelay(1000);

        offset = SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_33);

        ret = psp_wait_for(psp, offset, 0x80000000, 0x80000000, false);

        if (ret) {
                DRM_INFO("psp mode 1 reset failed!\n");
                return -EINVAL;
        }

        DRM_INFO("psp mode1 reset succeed \n");

        return 0;
}
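
For readers not used to masked register polling, the two `psp_wait_for` calls above can be read as "wait until `(value & mask) == expected`". That reading of the helper's arguments (register, expected value, mask, check-changed flag) is an assumption based on how the call sites look, not something stated in this thread, and the sample register values below are made up purely to exercise the masks.

```python
def masked_match(value: int, expected: int, mask: int) -> bool:
    """True when the bits selected by mask equal expected."""
    return (value & mask) == expected


# Arguments copied from the two psp_wait_for() calls in the function above.
PRE_RESET = dict(expected=0x80000000, mask=0x8000FFFF)    # C2PMSG_64
POST_RESET = dict(expected=0x80000000, mask=0x80000000)   # C2PMSG_33

if __name__ == "__main__":
    # Hypothetical register readings, not captured from real hardware.
    samples = [0x00000000, 0x80000000, 0x80000001, 0x80FF0000, 0xFFFFFFFF]
    for value in samples:
        pre = masked_match(value, **PRE_RESET)
        post = masked_match(value, **POST_RESET)
        print(f"{value:#010x}  pre={pre!s:5}  post={post}")
```

Read this way, the check before the reset needs bit 31 set and the low 16 bits clear while ignoring bits 16 to 30, and the check after the reset only needs bit 31 set.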

At a 10,000ft view it seems I will be needing to load the SOS microcode and initialize the PSP in order to perform this function… sigh. Why so hard AMD? I hate to say it, but nVidia FLR just works.

gnif · August 8, 2018, 4:02pm

Unfortunately it still isn’t working… back to AMD for guidance I go

Edit: I have been informed that I am poking at the correct registers, but the information on how to make this work seems to be … restricted. Since it involves the PSP I would assume that the information on this is considered extremely confidential.

I can confirm that I can make the GPU reset, but it’s still in a useless state. In the hopes of furthering this I have very politely emailed Lisa Su directly to see if something can be done. Honestly I doubt I will get a reply, but who knows.

gnif · August 12, 2018, 10:38pm

Just a follow up on this. Today I am excited to state that I did indeed get a response from Lisa!

Let me ask my team to look into it. I will have someone get back to you.

I have also had quite a few emails from Alexander Deucher of AMD providing as much information as he is allowed to. This email in particular may be of interest to the technical community

At a high level we treat each GPU as an SOC. The SOC is built from a set of IP blocks (intellectual property) that provide various functionality. The driver is designed around the idea that each SOC is a collection of IP blocks. The IP blocks are versioned so that we can write a single driver component for all SOCs that contain that IP version. So, the general list of IPs that you may see on an SOC:
DCE - Display and Compositing Engine. This is the display block.
GFX - Graphics and Compute. This is the graphics and compute (shader) block.
GMC - Graphics Memory Controller. This is the memory controller for the GPU. It provides support for VRAM and virtualized access to VRAM and system memory for GPU clients.
SDMA - System DMA. This is a general purpose DMA engine on the GPU. It’s generally used for paging of GPU memory and for things like transfer queues in user mode acceleration drivers.
UVD - Unified Video Decode. This is the video decode and encode block on the GPU. It started out as decode only and later gained support for encode of formats other than H.264 as well.
VCE - Video Codec Engine. This is the video encode engine for H.264 video.
PSP - Platform Security Processor. This sets the security policy on the GPU and handles firmware loading for the other IPs. PSP must be functional to use the other IPs on the system.
SMU - System Management Unit. This is the clock and voltage controller on the GPU.

The IPs that are on a specific SOC are enumerated in the soc files in the driver (e.g., vi.c soc15.c, etc.). There is a high level IP structure and each instance of the driver stores an array of all of the IPs on the SOC. Those IP structures have a common API and the driver enumerates the list and then for major operations like init, fini, suspend, resume, etc., the driver walks the list and calls that API for each IP on the SOC.

Most of the IPs on the GPU provide a light weight soft reset mechanism to reset that specific IP. Depending on the type of hang a soft reset may or may not be able to recover the IP. If it’s not, you have to do a full adapter reset. This resets the entire GPU. PSP and SMU do not support soft reset. They cannot be reset once started without a full adapter reset. On older asics adapter reset was done by writing a special sequence to pci config space. Internally this reset was handled by the SMU. For vega10 and newer, full adapter reset is handled by the PSP (mode1 reset). Soft reset is not currently implemented for any of the IPs on vega10, but it works similarly to older IPs. That said, you don’t need soft reset if you have adapter reset. It should also be noted that in the event of an adapter reset, the contents of vram should not be considered reliable.

We knew what some of these acronyms meant through educated guesses, but the rest were unknown, and the amdgpu driver uses them throughout.
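
The IP-block model described in the email above can be sketched roughly as follows. This is a toy Python illustration, not the amdgpu data structures: the class names, the `supports_soft_reset` flag and the `recover` helper are all hypothetical. Only the block names and the rule that PSP and SMU cannot be soft reset come from the email.

```python
class IPBlock:
    """One functional block on the SOC, exposing a common API."""

    def __init__(self, name, supports_soft_reset=True):
        self.name = name
        self.supports_soft_reset = supports_soft_reset
        self.log = []   # records which API calls were made on this block

    def init(self):
        self.log.append("init")

    def fini(self):
        self.log.append("fini")

    def soft_reset(self):
        """Return True if the block could be soft reset."""
        if not self.supports_soft_reset:
            return False
        self.log.append("soft_reset")
        return True


class SoC:
    """A GPU modelled as an ordered list of IP blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def init(self):
        for block in self.blocks:             # walk the list, same call on each
            block.init()

    def fini(self):
        for block in reversed(self.blocks):   # reverse order is our choice
            block.fini()

    def adapter_reset(self):
        for block in self.blocks:             # a full reset touches every block
            block.log.append("adapter_reset")

    def recover(self, name):
        """Soft reset one block if possible, else reset the whole adapter."""
        block = next(b for b in self.blocks if b.name == name)
        if block.soft_reset():
            return "soft"
        self.adapter_reset()
        return "adapter"


if __name__ == "__main__":
    soc = SoC([
        IPBlock("GMC"),
        IPBlock("PSP", supports_soft_reset=False),
        IPBlock("SMU", supports_soft_reset=False),
        IPBlock("GFX"),
    ])
    soc.init()
    print(soc.recover("GFX"))   # block-local soft reset
    print(soc.recover("PSP"))   # falls back to a full adapter reset
    for block in soc.blocks:
        print(block.name, block.log)
```

On vega10, where the email says soft reset is not implemented for any IP, every block in this model would be created with `supports_soft_reset=False`, so every recovery would end in a full adapter reset.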

gnif · August 16, 2018, 5:53am

For anyone that wants it, here is an updated version of the PulseAudio patch for Qemu 3.0. Note that this is not my work and it should be attributed to the original author @Spheenik.

0001-PA-fixes.patch (24.0 KB)

gnif · August 16, 2018, 6:06am

Oh, I have also found that the Intel HDA device is the cause of most graphical stalls I have been seeing in games such as BF4. For the uninitiated, audio hardware is often used as a timing source, or for synchronizing sound effects, so when the emulated device starts to mess up, applications suffer. You can see this when you try to watch a YT video after the sound has dropped out: it runs slooooow, or choppy and fast.

The solution (sort of) is to use the AC97 device, but there is no signed driver for Windows 10 for this device. For testing I personally signed the driver to make it function and found the emulated sound experience to be exceptional; all stalling artifacts are gone.

Unfortunately I cannot share this signed driver without permission from Realtek to do so, nor am I willing to without auditing the source first. As such I have asked Realtek if they would be so kind as to release the source to this driver. Any IP contained in the driver is very public today, and since the device is obsolete, hopefully they will comply with the request.

If they do so I will build a version for the Qemu HDA device and sign it for Windows 10.

gnif · September 23, 2018, 11:05am

Finally back at work on Looking Glass. Over the past few days I have implemented the core of the new modern OpenGL ES renderer (EGL). It’s still missing a few things such as mouse rendering, text, splash screen, etc… but enough is there to mess around with it.

The EGL renderer supports NV12 YUV420 decoding via OpenGL shaders, which finally makes this feature testable, however it seems the capture performance in Windows still needs some work as it’s on par with full resolution RGB still.

Hopefully over the next week I will be able to get the EGL renderer polished up and ready for release, and once done an official A12 release will be tagged.

gnif · September 24, 2018, 9:54am

Initial cursor rendering code went in today for the EGL renderer. Currently only masked colour and RGBA cursors are supported; monochrome will require a little more work.

gnif · September 25, 2018, 1:08pm

Monochrome cursors are now working with the new renderer.

gnif · September 26, 2018, 12:31pm

Updates to the DXGI capture went in today that improve latency substantially and framerates somewhat.

gnif · November 3, 2018, 9:15am

I am too lazy to retype this, and just remembered that there is a change log here, so I am copying this directly from patreon to catch this thread up.

A bit of news
I figured it’s been a while since my last post and while it seems quiet here a ton of time has been poured into Looking Glass over the past month. I apologise if people are expecting blow by blow updates here, however as I am sure you are all aware, there is an enormous amount of time involved in programming, testing, tweaking, tuning, etc. Rather than spending my free time posting about the changes as they go in, I feel that it is more important to put my time into the project that you are all so kindly donating towards. As always, you can view the GitHub commit history to see the work as it progresses.
With all that said, here are a few things of note.

OpenGL ES is nearly complete, it’s missing a few bells and whistles and once done it will become the default renderer.
A lot of effort has been spent in refactoring and profiling the DXGI capture code in the host application, and several changes have been made that improve latency. At this point it seems that we are now faster than the capture API and any graphical stalls can now be blamed on the GPU’s implementation of the capture API.
Frame drops/microstutters that were previously seen when moving the cursor have been (at least in my setup) eliminated. Cursor updates are now completely decoupled from the capture stream and updates are now correctly accumulated fixing several rare instances of odd cursor behaviour.
Unfortunately, as part of this refactoring there is currently a problem with Resolution Switching: the host application is crashing from time to time. I have yet to investigate this.

With all the above changes I have been successful in obtaining 60FPS at 5760x1200, spanning three monitors on the client. This was done by creating a custom resolution in the NVIDIA Control Panel for this mode and then launching the looking glass client with the arguments required to span my three physical monitors.

./looking-glass-client -d -n -x 0 -y 1080 -w $((1920*3)) -b 1200