Patch NPT on Ryzen for Better Performance | Level One Techs

gnif · October 31, 2017, 3:47am

Hi Weldell, I didn’t realize you were the guy I was talking to on Reddit till I saw the stream video just now. I just posted the following on the YT video but figured I would copy it here also.

Wow! I had no idea that this would go so viral when I fixed it. Honestly I just wanted good performance. Thanks for the props! I’d love to investigate the ThreadRipper or EPYC issues also but lack of hardware makes that impossible. It should also be noted that I did all this on a 1080Ti on an Asrock AB350 Pro4 with a Ryzen 1700X (with segfault bug, yet to RMA it). I did not experience any re-initialization issues at all with the 1080Ti, across the process of testing and debugging I must have restarted the VM at least 100 times without restarting the host. Running Debian 9 on a 4.12 kernel.

As for the frame buffer -> host, I am 90% complete on a solution on that front (I actually got this working before I discovered the NPT issue), needs some code tidy up to finish it off and the windows service that streams the data back to the host needs a re-write as at the moment it is a hacky proof of concept application. It works by using the ivshmem virtual device that comes with qemu that until a few weeks ago did not have a windows driver, so I wrote one (https://github.com/virtio-win/kvm-guest-drivers-windows/tree/master/ivshmem ). This shares a block of memory mapped ram between the host and the guest which is used to capture the frame buffer into. (@Level1Linux, if you would like to talk in depth about this feel free to email me directly, you have my address from the mailing list)

Using NvFBC I am getting unmeasurable latency between host and guest at 1080p full 24bit RGB uncompressed. In fact, I am writing this post with it right now :). Obviously NvFBC targets pro cards like Quadro and as such the service when rewritten will have an agnostic layer to support other APIs such as DXGI for consumer cards, and whatever else might become available in the future.

I also recommend a PCI sound card for pass through, the USB interface in KVM has timing bugs that cause device resets and re-initialization. Using my low latency client discussed above I can trigger keyboard and mouse faults which are due to the USB implementation in KVM. Passthrough PCI on the other hand is 99% done in hardware and as such avoids potential bugs in KVM and allows for native DMA performance which USB can not do.

The patch may be included sooner as there are talks of back porting it to earlier kernels.

Finally, it needs to be said that my motivation for all of this is personal, I am not getting paid for this, nor do I intend to make a profit from it (although it would be nice). When it’s ready the code will be released to the public on github.

wendell · October 31, 2017, 5:29pm

Nice work. And this is work that really needs to be done. Keep making progress and Level1 can be your (or anyone in a similar position) “PR firm” haha. The driver work is especially exciting because, like I said, this will happen long before we see sr-iov in the consumer space, I suspect.

Just like “zero-copy” TCP stacks took over almost overnight, I want to see a zero copy video stack (well, it’s not quite the same, but I’ll take what we can get) maybe with some type of bus mastering if we’re lucky. It makes sense there would be no measurable latency – I would think the arbiter will favor the cpu if the cpu needs the busy, but DMA should mean that, basically, the cpu is not involved. Do you think it is or will be the case we can get to a point where the memory space mapped to PCIe I/O can go direct to a frame buffer on the other card? The reason I ask is that I would suspect this will be even faster than system memory. I think, but not sure, crossfire uses a version of this to do its work.

I’ve been out of this space for a long time, but would love to put together a team. Hardware for testing and such is no problem, let me know what you need/pm me/etc.

P.s. The reset issue is only on threadripper. Ryzen has always been fine. For me the npt bug wasn’t a huge deal, for my workloads, and I never had issues on ryzen 7 either. Only threadripper. And x299 is flawless (but pricey).

SgtAwesomesauce · October 31, 2017, 5:45pm

Keep in mind that you’re pinning across two CCX. When the load switches between them, it can cause microstutter. Windows isn’t supposed to do this, but it’s still worth looking into, because behavior within a VM is still somewhat unknown, even with pinned threads.

You may want to do cores 0-3 and threads 8-11 as your pinning. That will pin your VM to the first CCX and only use 4 cores.

gnif · October 31, 2017, 5:46pm

Thanks

Unfortunately without driver support from the vendors I can’t see zero copy ever being possible. The NvFBC API for example allocates a buffer that frames are copied into, I suspect this is a memory mapped portion of video ram which leaves it beyond our reach to manipulate this way. At current I take this buffer and copy it into the shared memory segment, which then the host takes and copies into the texture. So in total there are two large copy operations in progress.

I believe that one of the copies could be eliminated by altering qemu to create the texture and map that into the guest directly, but this raises additional security concerns. While I believe this would be a nice feature to have I believe it should be deferred until we have a working solution first even if it isn’t as performant as it could be. This way it may draw more developers with more experience to this project accelerating development of it’s feature set.

wendell · October 31, 2017, 5:49pm

AMD’s open source strategy is finally starting to hit critical mass with their open source driver. Whether that would have something useful is another matter, but right now the Vega 64 performance with the open source driver is on par with the closed source 1080ti driver from nvidia.

That’s fine about PoC first, refine later, I tend to be pragmatic like that too.

gnif · October 31, 2017, 5:58pm

I have not looked into Vega from a hardware perspective, it was simply out of my price range at the time and didn’t perform as well as I expected for it’s cost. I was very attracted by the open source support for it though and open nature of AMD. Personally I hate NVidia, they can rot for all I care, I have several times been on the side of having information suppressed that I published regarding their hardware. If you didn’t figure it already, I was the one to discover the hack to make quardros/teslas/grids out of the 6xx series GPUs. This was out of a desire to use Mosiac under Linux, which was crippled but worked fine in Windows.

Anyway, if the Vega has a hardware capture API I would gladly add support for it to the guest application, I spent most of today re-writing my kludge of a program into something elegant and maintainable so that it can support multiple capture APIs. The only hold back on writing such features is lack of hardware to test on. Over the next few days I hope to find the time to add DXGI to the application so it can be used by those without NVidia hardware, once this is done I will be releasing the code under GPLv2.

Edit: I just tried to PM you, the forum states that I am not allowed

SgtAwesomesauce · October 31, 2017, 6:23pm

It seems like you’ve got the wherewithal to get these issues solved, but you’re lacking the funds to invest in testing hardware. Have you reached out to AMD to see if they’re willing to send you samples? They may be able to help you out, especially considering the hard work you’ve already done to solve the major problem holding me back from this platform.

I’d definitely donate to a “Get gnif a threadripper and vega system fund” and I’m sure there would be others interested in helping out as well (the /r/VFIO and /r/amd subreddits come to mind) if AMD isn’t willing to provide development hardware.

Your trust level is new user Once that moves to basic, you should be fine. that requires the following:

entering 5 topics
reading 30 posts
spending 10 minutes reading posts

More info on trust levels can be found here

gnif · October 31, 2017, 6:48pm

Thanks

No, this is the first time I have been involved to this level with something of this nature. I’d appreciate any advice available on this.

That would be great, people have already been generous with donations for the NPT fix already, it really took me by surprise how much interest there is in this. The fist time I looked into PCI passthrough was about 10 years ago on Xen and very broken, I gave up on it quickly when I found that my new motherboard had a broken IOMMU implementation.

I wont say that I am an expert in all this, most of it is self taught. I didn’t even know how KVM worked two weeks ago and spent several days working through the AMD specifications testing each part of the system searching for the problem. I still have a lot to learn about KVM’s inner workings and how things play along with IOMMU.

There are several bugs I would like to fix in Qemu also that are not CPU/Hardware related, such as a bug in the i8042 PS2 controller implementation I am yet to dig into in detail (seems like a race condition, the virtual device has no interlocking and can and does get entered by multiple threads simultaneously).

wendell · October 31, 2017, 6:50pm

Are you also the one that did the thing with “laptop gsync panels?” if so I hexedited the driver and was able to confirm that. They shut that down so fast I’m still reeling over it.

PM issue fixed.

gnif · October 31, 2017, 6:51pm

Thanks :). And no, that one wasn’t me. PM sent.

SgtAwesomesauce · October 31, 2017, 6:55pm

I’m not sure how this works, Wendell might know better.

I had the same experience when I realized my 3770k didn’t support IOMMU. Very sad.

Yeah, everyone in my department is running Linux with a windows VM and a 480 or a 580 passed through for proprietary windows stuff and games. When the NPT patch hit patchwork, one of my underlings came running into my office with a huge smile on. You’re a hero in our office.

Marf · October 31, 2017, 10:04pm

Wendel! You are my Hero again!

I compiled a patched Linux kernel just yet! Thanks alot for the guide! I never did it before (because there was no reason )

Tested yet 2 Benchmarks: Unigine Vally: DX11 (high preset, AA off) and Resident Evil 6 Bench Tool: DX9c (all High, AA off)
both on a passthroughed Sapphire RX 560 oc to 1434 MHz CPU and 2GHz VRAM.

Before the Patch:
Unigine ~52 FPS; 2170 Points (min FPS 18,5 / max FPS 103,8)
RE6 BT ~3700 Points

After the Patch:
Unigine ~ 58,4; 2443 Points (min FPS 28,0 / max FPS 107,2)
RE6 BT ~9600 Points

DirectX9c Games got a huuuuugh boost! DX11 Games where always playable but slightly choppy.

Now its veeeery smooth on my System.
Btw. I’m rocking the following System:

Ryzen 5 1600x OC to 3.925Ghz
RAM: Corsair Vengeance 3200 @ 3200MHz of course
Board: Asus Crosshair 6 Hero BIOS 17.01
Host Disk: Crucial M500 m2 SSD
Guest Disk: QCow2 Image
GPU: Sapphire RX 560 Pulse 4GB + Arctic Accelero Xtreme IV 280x Custom Cooler

Im quite happy with the results

Mit freundlichen Grüßen
Marfi

Update 2017-11-02:
I have to add: in my case a crucial package was not installed on my Ubuntu to do the Kernel compile. Please take care the package libssl-dev is installed via apt or synpatic package-manager.

Marf · November 2, 2017, 7:26am

Hi all,

I started yesterday a larger benchmark session with a wide varity of DX11, DX9c, DX12 and on Vulkan API Games. I want to compair NPT=1 vs NPT=0 vs Native Windows 10 performance and post my resultes (probably in a new Thread…). I hope some one find it usefull or “entertaining”
Here the list of Games/Benchmarks that i wanne to use (i.g. started to use):

Benchmarks

Unigine Vally (DX11)
Unigine Heaven (DX11)
Unigine Superposition (DX11)
Resident Evil 6 Bechmark Tool (DX9c)
Resident Evil 5 Benchmark 1 (DX9c)
Resident Evil 5 Benchmark 2 (DX9c)
Tomb Raider
Rise of Tomb Raider (DX12)
Steam VR Performance Test
Ashes of the Singularity: Escalation (DX12 and Vulkan)

Now I will not use my RX 560 card for the Passthrough, instead I pass my XFX R9 280x thru my VM. On earlier tests I noticed around 30% performance loss compaired to native use. My RX 560 “only” lost around 10% Performance with the buggy Nested Page Tables (what was quite intresting).

I think I can post the NPT=1 numbers this evening (German Time)…

Viele Grüße
Marf

Edit 2017-11-03:
As promised, I created just yet a new Thread where I will post my benchmark numbers. Please understand… WIP …
You will find the thread here: GPU Passthrough Performance Numbers: Ryzen NPT Patch vs Buggy NPT vs Native Windows

mrjakesk8 · November 3, 2017, 8:39pm

An update for anyone watching: I made 2 concurrent mistakes while testing the patch, causing me much confusion for an couple of hours. I both failed to apply the patch correctly (first time using a custom kernel) and failed to properly re-enable NPT. This explains why I was experiencing less-than-ideal performance. After remedying these 2 errors performance improved hugely, and I am (more or less ) officially done tinkering. This passthrough thing is now final the solution to all my woes! Benchmarks in this thread: GPU Passthrough Performance Numbers: Ryzen NPT Patch vs Buggy NPT vs Native Windows

dailan · November 4, 2017, 2:53am

Having trouble applying the patch on fedora 27 (trying kernel 4.13.11-301), and I’m out of ideas for on how to fix it. I’m able to compile the kernel on its own, or even just with a working Aur-acs patch. However adding the NTP patch on top of that causes rpmbuild to fail before I even start compiling.

Here are the commands I use to reproduce this

fedpkg clone -a kernel
git checkout -b my_kernel origin/f27
sudo dnf builddep kernel.spec
./scripts/newpatch.sh Aur-acs.patch
./scripts/newpatch.sh ryzen.patch
make release
sudo fedpkg local

pulled the patch from patchwork kernel, and I applied the aur-acs fixes (which compiled just fine on its own). I don’t think its required, but I installed my matching kernel-devel package just in case. (also util-linux)
I’m going to assume I’m missing something obvious, but if you need any more information I can provide it. Sorry if this is an inappropriate thread for this, at this point.

mrjakesk8 · November 4, 2017, 3:36am

I’m sorry I can’t be of more help - I found this stage challenging too. In the end I downloaded the kernel from kernel.org, applied the patch and built it using variations on Wendell’s commands with help from this page: https://fedoraproject.org/wiki/Building_a_custom_kernel

Perhaps it’s a missing dependency issue? I fear any advice of mine beyond here is likely to cause more harm than good. I recommend reading that Fedora wiki thoroughly though

dailan · November 4, 2017, 5:23am

OK, I think I got it working. I had to use another method to apply the patch (We can pretend I used the “cat | patch” method as described in the fedora wiki and not just gedit like the filthy casual I am). I’ll run some tests tomorrow to see if I really got it working. Maybe I’ll throw together a few benchmarks against the kernel without the ntp patch if I have the time. Thanks for your help.

gnif · November 4, 2017, 10:39pm

@dailan post the build log (use pastebin or something similar), it’s very unlikely that anyone here is going to replicate your steps to figure out the error you were getting.

dailan · November 5, 2017, 12:34am

Unless I’m missing something, the build-log doesn’t seem to have any useful information. Here it is. It does, however, point to this code. The warning about unexpanded macros seems to be harmless.
AND just in case, my current ryzen.patch file. If anyone wants any more information, I would be happy to provide it.

dailan · November 5, 2017, 4:25pm

Alright here are some quick and dirty benchmarks, the difference that this patch made was night and day.

In total war warhammer II’s campaign benchmark at 1080p
Before: Min 10, Max 25, Average 16.9 FPS
After: Min 43, Max 86, Average 61.6 FPS

In total war warhammer II’s battle benchmark at 1080p
Before: Min 6, Max 17, Average 10.8 FPS
After: Min 43, Max 61, Average 54.3 FPS

In Civ 6’s graphics benchmark at 34440x1440
Before: 99th 65.95, Average 47.66 frame times in MS
After: 99th 24.16, Average 18.68 frame times in MS

Civ 6’s AI benchmark at 3440x1440
Before: Average 33.2 second turns
After: Average 26.5 second turns

Skyrim (just eyeballing it) outside
Before: Min 1, Max 35, Average 20 FPS
After: Min 30, Max 50, Average 40 FPS

Skyrim (just eyeballing it) inside
Before: Min 30, Max 55, Average 40 FPS
After: Min 60, Max 60, Average 60 FPS