Threadripper Reset Fixes


#1

Continuing the discussion from Looking Glass - Guides, Help and Support:

I can confirm the patch works great on Fedora 27 with both nvidia 1080/1080ti, Radeon Pro WX5100, Fury and Vega.

In addition it “kind of” works for PCI bus resets of other devices such as NVMe drives. I was able to hot-remove an already-installed NVMe drive and re-add it by issuing only a PCIe bus reset to get it working again (previously, it would go offline altogether). Hot-plugging a new device will not work because that needs some hotplug support in UEFI.
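For anyone who wants to try a similar remove/re-add cycle, here is a minimal sketch using the generic sysfs `remove` and `rescan` files. It is not the reset path the patch touches and not necessarily what was done above; the address `0000:08:00.0`, the function name, and the `SYSFS_ROOT` / `RESCAN_DELAY` variables are assumptions for illustration. Run it as root, and never against the drive holding your root filesystem.

```shell
#!/bin/sh
# Sketch: detach one PCI function and re-enumerate the bus via sysfs.
# SYSFS_ROOT and RESCAN_DELAY are hypothetical overrides (handy for a dry run
# against a fake directory tree); by default the real /sys is used.
pci_remove_and_rescan() {
    bdf="$1"                                   # e.g. 0000:08:00.0
    root="${SYSFS_ROOT:-/sys}"
    dev="$root/bus/pci/devices/$bdf"
    if [ ! -e "$dev/remove" ]; then
        echo "no such PCI device: $bdf" >&2
        return 1
    fi
    echo 1 > "$dev/remove"                     # ask the kernel to drop the device
    sleep "${RESCAN_DELAY:-1}"                 # give it a moment to settle
    echo 1 > "$root/bus/pci/rescan"            # rescan so the device is found again
}
```

Usage would be something like `pci_remove_and_rescan 0000:08:00.0`, with the address taken from `lspci -D`.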

I was also able to confirm, I think, that the PCIe bus reset problem that can occur in some situations with an overheating 10-gigabit Intel X540 is also resolved.

Strictly speaking I haven’t tested this non-graphics PCIe stuff on Kernel 4.15 at all so it is possible other kernel patches and updates solve other TR-related PCIe issues, not Geoff’s patch, but for now I’m calling this a win for Geoff.

I had to modify the patch slightly for Fedora; here is a quickie mini-howto:

# sudo dnf install fedpkg fedora-packager rpmdevtools ncurses-devel pesign
# sudo dnf install rpm-build flex perl-devel perl-generators openssl-devel hmaccalc elfutils-devel

# fedpkg clone -a kernel
# cd kernel

# fedpkg switch-branch master
# # or possibly something like fedpkg switch-branch f27 -- the main thing here is you get kernel 4.15.something. At the time of this mini-how-to 4.15 is nooott quiiiitteee out but close enough for us yokels.

# vi kernel.spec
# -- in here you want to uncomment the define line for .local and I changed .local to trpatch

# then find the patches section and toward the end add Patch999: tr.patch 
# tr.patch is attached to this message :D and adapted from Geoff's patch to apply to
# this slightly different "the fedora way" 
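If you would rather script the two kernel.spec edits than do them in vi, a rough sketch follows. It assumes the commented line reads `# define buildid .local`, that the spec already has numbered `PatchNNN:` lines, and that GNU sed is in use (as on Fedora); the function name `patch_spec` is made up here. It also assumes tr.patch has been copied into the kernel directory next to kernel.spec.

```shell
#!/bin/sh
# Sketch: apply the two kernel.spec edits described above.
# Assumptions: the spec contains "# define buildid .local" and at least one
# "PatchNNN:" line; sed is GNU sed (for -i and the one-line "a" command).
patch_spec() {
    spec="$1"
    # Do nothing if Patch999 was already added on an earlier run
    if grep -q '^Patch999:' "$spec"; then
        return 0
    fi
    # Uncomment the buildid define and rename .local to .trpatch
    sed -i 's/^# *define buildid \.local/%define buildid .trpatch/' "$spec"
    # Find the line number of the last existing PatchNNN: entry
    last=$(grep -n '^Patch[0-9]*:' "$spec" | tail -n 1 | cut -d: -f1)
    if [ -z "$last" ]; then
        echo "no Patch lines found in $spec" >&2
        return 1
    fi
    # Add our patch right after it
    sed -i "${last}a Patch999: tr.patch" "$spec"
}
```

Call it as `patch_spec kernel.spec` from inside the cloned kernel directory, then check the result with `git diff` before building.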

once that’s done do a

fedpkg local

and rpm -i the appropriate rpms from arch/x86_64 … and you should be good to go. The usual steps for enabling iommu, vfio-pci, etc all still apply.
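As a quick sanity check after booting the patched kernel with the IOMMU enabled, something like the sketch below lists which PCI addresses ended up in which IOMMU group. The function name and the `IOMMU_ROOT` override are assumptions for illustration; `/sys/kernel/iommu_groups` is the standard sysfs location.

```shell
#!/bin/sh
# Sketch: print every IOMMU group and the PCI addresses inside it.
# IOMMU_ROOT is a hypothetical override so the function can be pointed
# at a test directory; by default the real sysfs path is used.
list_iommu_groups() {
    root="${IOMMU_ROOT:-/sys/kernel/iommu_groups}"
    for group in "$root"/*; do
        [ -d "$group/devices" ] || continue     # skip if no groups exist
        for dev in "$group"/devices/*; do
            [ -e "$dev" ] || continue           # skip empty groups
            echo "group ${group##*/}: ${dev##*/}"
        done
    done
}
```

If `list_iommu_groups` prints nothing, the IOMMU is most likely not enabled in firmware or on the kernel command line. Note that the groups are listed in glob order, not numeric order.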

tr.patch (3.4 KB)

and my PCIe tree for all this bridge hot plug madness:

\-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
             +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
             +-01.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
             +-01.1-[01-07]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 43ba
             |               +-00.1  Advanced Micro Devices, Inc. [AMD] Device 43b6
             |               \-00.2-[02-07]--+-00.0-[03]--
             |                               +-01.0-[04]----00.0  Intel Corporation I211 Gigabit Network Connection
             |                               +-02.0-[05]----00.0  Intel Corporation Wireless 8265 / 8275
             |                               +-03.0-[06]----00.0  Intel Corporation I211 Gigabit Network Connection
             |                               \-04.0-[07]--
             +-01.2-[08]----00.0  OCZ Technology Group, Inc. RD400/400A SSD
             +-01.3-[09-0a]--+-00.0  Intel Corporation Ethernet Controller 10-Gigabit X540-AT2
             |               \-00.1  Intel Corporation Ethernet Controller 10-Gigabit X540-AT2
             +-02.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
             +-03.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
             +-04.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
             +-07.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
             +-07.1-[0b]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 145a
             |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
             |            \-00.3  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
             +-08.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
             +-08.1-[0c]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1455
             |            +-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
             |            \-00.3  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
             +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
             +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
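A tree like this comes from `lspci -tv`. To find which bridge sits directly above a given device without reading the tree by eye, a small sketch: resolve the device's sysfs link and take the parent directory name. The function name is made up for illustration, and the example path is my reading of the tree above (the OCZ SSD on bus 08 behind 00:01.2).

```shell
#!/bin/sh
# Sketch: given the resolved sysfs path of a PCI device, print the name of
# the directory above it, which is the upstream bridge (or the root bus,
# e.g. "pci0000:00", for devices sitting directly on it).
pci_parent_bridge() {
    path="$1"
    parent="${path%/*}"        # strip the device's own component
    echo "${parent##*/}"       # keep only the last remaining component
}

# Example (on a real system):
#   pci_parent_bridge "$(readlink -f /sys/bus/pci/devices/0000:08:00.0)"
```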

#2

Once again really really good work! I’m impressed! I CANNOT WAIT!!! until I get a second GPU to go into it!

Very good work!


#3

Unfortunately the pci-dev guys don’t agree that this is a valid way to address this issue. It has been recognized as a quirk rather than an implementation bug, so I will be reworking this patch as a TR-specific quirk fixup over the next week.

Functionally it will perform the exact same thing though.


#4

My contacts at MSI said they understood the fix and would pass it along. They may get back to the folks from AMD who weigh in on the mailing list. I see what Alex is saying, but looking at the differences between the PCIe trees on Xeon and TR, it makes sense why TR is the way it is.


#5

Is this patch similar at all to the one r/AMD came up with a month or so ago? Because that’s what I’ve been using with pretty good results. https://www.reddit.com/r/Amd/comments/7gp1z7/threadripper_kvm_gpu_passthru_testers_needed/

Trying to decide if it’s worth switching to this


#6

It’s very similar, but instead of blindly rewriting the entire PCI configuration space it only re-writes the areas that should be re-written.


#7

I would be very interested if AMD interpreted the spec the same way I did here, it is not very clearly worded. Also I would like to know why the CPU’s view of the PCI configuration space seems to be cached, or is there some lower level bug going on here?


#8

@wendell a live-stream or video on this process would be awesome.


#9

Test System:
AMD Ryzen 1950X
AMD Radeon WX7100 (Host)
NVIDIA GTX 1080 FE (Guest)

Intel 900p Optane 480gb (Host storage)
Samsung 960 Pro 1tb (Guest storage)

Some benchmarks w/looking glass:

the NVMe is passed through as a PCIe device:


… and has near native performance.
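For reference, one common way to hand a PCIe device such as an NVMe drive to vfio-pci at runtime is the sysfs `driver_override` mechanism, sketched below. This is not necessarily how the setup above was configured (libvirt can do the rebinding itself); the function name and the `SYSFS_ROOT` override are assumptions, and the vfio-pci module must already be loaded.

```shell
#!/bin/sh
# Sketch: rebind one PCI function to vfio-pci using driver_override.
# SYSFS_ROOT is a hypothetical override for trying this on a fake tree.
bind_vfio() {
    bdf="$1"                                   # e.g. 0000:08:00.0
    root="${SYSFS_ROOT:-/sys}"
    dev="$root/bus/pci/devices/$bdf"
    if [ ! -d "$dev" ]; then
        echo "no such PCI device: $bdf" >&2
        return 1
    fi
    # Tell the PCI core that only vfio-pci may claim this device
    echo vfio-pci > "$dev/driver_override"
    # Release it from whatever driver currently holds it, if any
    if [ -e "$dev/driver/unbind" ]; then
        echo "$bdf" > "$dev/driver/unbind"
    fi
    # Ask the kernel to probe again; vfio-pci should now pick it up
    echo "$bdf" > "$root/bus/pci/drivers_probe"
}
```

Run as root, and only on a drive the host is not using.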


#10

Would this work on Power9?


#11

Thanks, patch works for me. I used to use “ugly patch”.

HW: Tr 1950X, X399 Taichi (BIOS 2.0), 4x8GB 3200/CL14

HostOS: Xubuntu 17.10x64(RX 550) with vanilla kernel 4.15_RC9+gnif patch - left half of 4K LCD (windowed Unigine Tropics … phoronix-test-suite) + Looking Glass from GuestOS1

GuestOS1: Windows 10x64 (GTX 1080TI) - 1600x900 - right half of 4K LCD (fullscreen Unigine Heaven)
GuestOS2: Windows XP (ATI HD5770) - right LCD (fullscreen 3Dmark2001)


#12

We are the IT directors, we are the managers and we are the programmers, the next-gen programmers who want to be able to run all the platforms simultaneously…

…“We cook your meals, we haul your trash, we connect your calls, we drive your ambulances. We guard you while you sleep. Do not… fuck with us.”


#13

Offtopic question, but how is the noise in your build (with having three GPUs in there)?

I think you already talked about this in your video, but does this NVMe drive support resetting?

It depends on if Power9 supports the features needed to isolate the GPU for the virtual machine.


#14

RPM newbie here.
Running fresh install of Fedora 27.
DNF failed with:
No match for argument: rpmbuild

My Google-fu says rpmbuild has been deprecated. Any pointers?

RockApe


#15

Dnf install (list of packages) ?


#16

$ sudo dnf install rpmbuild perl-devel perl-generators openssl-devel hmaccalc elfutils-devel
Failed to synchronize cache for repo 'timlau-yumex-dnf', disabling.
Last metadata expiration check: 1:12:01 ago on Sun 28 Jan 2018 06:00:12 PM CST.
No match for argument: rpmbuild
Package perl-devel-4:5.26.1-402.fc27.x86_64 is already installed, skipping.
Error: Unable to find a match


#17

$ sudo dnf search rpmbuild
Failed to synchronize cache for repo 'timlau-yumex-dnf', disabling.
Last metadata expiration check: 1:14:11 ago on Sun 28 Jan 2018 06:00:12 PM CST.
= Name & Summary Matched: rpmbuild ======================
drupal7-rpmbuild.noarch : Rpmbuild files for drupal7
== Name Matched: rpmbuild ===========================
copr-rpmbuild.noarch : Run COPR build tasks
drupal8-rpmbuild.noarch : RPM build files for drupal8
== Summary Matched: rpmbuild =========================
perl-macros.x86_64 : Macros for rpmbuild
auto-buildrequires.x86_64 : Work out BuildRequires for rpmbuild automatically


#18

Sorry, I meant the rest of the list of packages. You just need fedpkg and its dependencies.
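The confusion here is that the package is named `rpm-build` (as in the how-to in post #1), while the command it provides is `rpmbuild`. A small sketch for checking which of the needed commands are still missing before starting the build; the function name is made up for illustration:

```shell
#!/bin/sh
# Sketch: print the name of every command in the argument list that is
# not found on PATH. Prints nothing if all of them are available.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example:
#   missing_cmds fedpkg rpmbuild
```

Anything it prints can then be installed by package name, e.g. `sudo dnf install rpm-build` for a missing `rpmbuild`.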


#19

Just finished installing the rest of the packages. The rpmbuild command does work. So I guess I will continue with the install.

Is there an IRC channel?

RockApe


#20

“Offtopic question, but how is the noise in your build (with having three GPUs in there)?”

I used a cheap sound meter (range 30-130dB) for the same test (HostOS: RX550 Unigine-Tropics, GuestOS1: GTX1080Ti Unigine-Heaven, GuestOS2: HD5770 3Dmark2001).

Ambient noise in the room was 31dB (everything off), idle HostOS 38dB, all GuestOSs running 41dB, and all benchmarks running 48dB, measured at a distance of 1m. The HD5770 (reference card) has the worst noise character, followed probably by the ASUS RX550 2GB (the GTX 1080Ti Strix OC is OK).

Rest of build: Corsair Obsidian 750D (2x 140mm intake, 1x 140mm exhaust) + 2x Noctua exhaust, ASRock X399 Taichi, Threadripper 1950X, Noctua NH-U14S TR4, 4x8GB G.SKILL 3200/CL14, 2x 3.5" HDD, 1x 2.5" SSD, 1x NVMe.

I’m OK with the noise, but I’m considering moving the computer into a spare room (empty 99% of the time) and using <5m USB/HDMI/DP/LAN cables to connect the LCDs/KBs/MSs/HUBs.