Threadripper workstation build w/ hardware GPU passthrough - Advice needed

Thanks for the information @chaos4u, that all sounds quite alarming. I’ve been looking on PCPartPicker for the model numbers of RAM listed in the Reddit post. They don’t seem to list any sticks with identical model numbers; most end with “32GTZ” instead of “16GTZ”, for example. Sourcing the correct model numbers shouldn’t be a problem with the power of Google, so that’s not a big deal.

Do you think the Noctua TR4 models would be better than the Enermax liquid cooler?

System stability is of the utmost importance and I’d like to avoid too many hacky solutions. Are there better options with fewer caveats? Perhaps Threadripper is not the right platform for my requirements?

As of right now, there aren’t a whole lot of cooling solutions for Threadripper. Some water block options just came out, and people are saying the best method is a dual-loop custom cooler for keeping the Ripper beast cool under prolonged load, but this all gets really technical and expensive. Regardless, EKWB has a configurator if you want to pursue a custom loop cooler for your system.

https://www.ekwb.com/custom-loop-configurator/

More are coming.

As far as the Noctua TR4 coolers vs. the AIOs go, a lil review popped up showing they perform about the same.

With that being said, you can still run your Threadripper with an AIO; just be mindful of your temps. You can also boost the AIO’s cooling with Cooler Master JetFlo 120s. While they may be a lil loud at full RPM, they generally cool better, just because they’re blowing a hell of a lot of air. (There may be better fan/AIO combos that are larger and more ideal for cooling.)

So if you want to go for it, get the fans and watch your temps, then when a proper AIO comes out get that one… Right now I have the Corsair H80i v2 with the default fans, and it’s usable for normal surfing and gaming, but if I start throwing torture tests and other heavy workloads at it, the temps keep on climbing. I have yet to put Win10 back on it and start up another testing round to see if the Cooler Masters prevent the endless climb in temps.

I have, however, goofed around with them in the BIOS, and they do keep the CPU cooler by 2-3 degrees. But that’s not really a good test.

As for the memory, most of those modules can be found on Newegg using the search. You may have to trim the last 3 to 5 characters to get them to pull up, and then you can match them up from the list.

Here is my experience with memory thus far.

Kingston HyperX Predator, on the MSI QVL: 2 sticks and then 3 sticks worked, and 2 sticks were stable at 3200 with 16GB. Unfortunately the 4th stick was DOA and I had to send the entire kit back. (This was Hynix memory.)

So instead of waiting around, I purchased a set of PNY Anarchy 3200 16GB (not on any QVL; it’s Klevv, which also appears to be Hynix). It also booted and appeared to be stable in Win10 at 3200. So 2 sticks at 16GB may be a safe bet.

At least until we get better info on tested 4- and 8-stick setups that work really well. The 16GB route may not be sexy, but it saves money in the short run and lets you play with the build until we get better BIOSes and/or memory.

Looking over that Overclockers forum thread can also give you some ideas on what memory to try. I decided to risk it all and bought something probably very stupid…

If that doesn’t work out, I’ll probably go with the G.Skill Flare series unless something else comes to light.

Now, if you plan on running at just 2400MHz you may be open to more memory choices, but your benches will be a few hundred marks slower. It comes down to a decision: risk it for higher performance, or play it safe for stability.

Ultimately, that’s a choice you’ll have to make and live with when making your final memory purchase.

As far as Threadripper goes, it’s a fun lil project; whether it’s something we can max out and keep stable is yet to be seen. I’m very disappointed in the lack of real-world experiences with Threadripper builds. There are a ton of reviews just throwing benchmarks at it, which it chews through; unfortunately, information about people’s day-to-day experiences with it is lacking. Another person in another forum swears by his build, though. I have yet to get mine into a state where I’m ready to do the same.

Getting back on the topic of virtualization: using the consumer cards (GeForce or the RX 580) may not be the best option, since they require hacky solutions to make work, or have a bug or two that may cause you minor headaches.

Moving to Nvidia workstation cards (Quadro) will eliminate the driver error that occurs on GeForce cards. Unfortunately this comes at quite an expense, but it appears to be what works best when it comes to virtualization.

Also, making a choice on the virtualization platform you want to go with is a whole ’nother ball game, so be sure you have an idea which way you want to go with that and build your system according to the recommended HCLs (hardware compatibility lists)…

Because just jumping in blind here will either be a lot of fun or a whole lot of misery, depending on how much you like to tinker with settings and configs.

Thanks for sharing your findings! I’ll not lie, the idea of ending up with a system that may be buggy and non-functioning is definitely not where I want to be, especially on my first PC build.

It looks like Wendell had more success with IOMMU on the i9?

I just checked Crucial’s website for ECC memory compatible with that motherboard, and it seems they are currently out of stock. Here are the part numbers: CT10708146 and CT10708232 (both 16GB sticks).


Thanks for that @dj9.

I also got an email from Crucial saying they guarantee compatibility (with the motherboard, at least) for this one:
http://uk.crucial.com/gbr/en/x399-gaming-pro-carbon-ac/CT10708152#productDetails

Specs: DDR4 PC4-21300 • CL=19 • Dual Ranked • x8 based • Unbuffered • ECC • DDR4-2666 • 1.2V • 1024Meg x 72 •


GPU

Looking to build something similar with nvidia and this workaround for drivers:

https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#.22Error_43:_Driver_failed_to_load.22_on_Nvidia_GPUs_passed_to_Windows_VMs
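For reference, the core of that workaround in the libvirt domain XML is spoofing the Hyper-V vendor ID and hiding the KVM signature so the GeForce driver stops bailing out with Error 43. A rough sketch; it goes in the VM’s `<features>` block, and the vendor_id value is an arbitrary string (up to 12 characters):

```xml
<features>
  <hyperv>
    <!-- arbitrary non-empty string, max 12 characters -->
    <vendor_id state='on' value='whatever'/>
  </hyperv>
  <kvm>
    <!-- hide the KVM hypervisor signature from the guest -->
    <hidden state='on'/>
  </kvm>
</features>
```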

My fallback plan, if I can’t get things to work, is to do the inverse and run Linux in Hyper-V.


Cooling

Custom 360+280 loop based on EKWB parts; total cost around 800 (those fittings and ports add up).


RAM

I’ve recently stumbled onto this, which suggests to me that ECC RAM might not be worth it:

https://research.google.com/pubs/pub35162.html

For X399 you need unbuffered ECC DIMMs (UDIMMs); registered/RDIMM memory is used in huuuge servers.

64 GB is the quantity I’m after.


Motherboard

I’m looking into the ASRock X399 Taichi; no particular reason, it looks decent.


I created an account for really one purpose: are you me?

I too am looking to replace my Mac Pro 2008 desktop. It’s a wonderful box and other than bumping up the video card every so often and replacing some of the drives with SSDs, it’s been a real trooper. But it’s time to let it go. [I’ve also got an Ivy Bridge box that I’m looking to merge into this one. ] But I’m mainly using my machines for hard core development (distributed systems … yay compiles on multiple OSes!) as well as some gaming.

FWIW, here’s what I’ve been toying with.

PCPartPicker part list: https://pcpartpicker.com/list/w2s9M8
Price breakdown by merchant: https://pcpartpicker.com/list/w2s9M8/by_merchant/

CPU: AMD - Threadripper 1950X 3.4GHz 16-Core Processor
CPU Cooler: Noctua - NH-U14S TR4-SP3 140.2 CFM CPU Cooler
Motherboard: ASRock - X399 Taichi ATX TR4 Motherboard
Memory: G.Skill - Trident Z 64GB (4 x 16GB) DDR4-3200 Memory
Storage: Kingston - Predator 240GB PCI-E Solid State Drive (Purchased)
Storage: SanDisk - Ultra II 960GB 2.5" Solid State Drive (Purchased)
Storage: (5 3.5 HDDs) (Purchased)
Case: Fractal Design - Define XL R2 (Black Pearl) ATX Full Tower Case
Power Supply: Corsair - 1000W 80+ Gold Certified Fully-Modular ATX Power Supply (Purchased)
Optical Drive: Plextor - PX-B320SA SW Blu-Ray Reader, DVD/CD Writer (Purchased)
Monitor: Dell - U2711 27.0" 2560x1440 60Hz Monitor
Other: MSI Radeon R9 280x Gaming (Purchased)

You may want to look into how VLC does their stuff; they do all their cross-compiles with GCC on Linux, and it’s working for them.

AFAIK, running OS X (even if only in a VM) would require swapping out the kernel; there are download links on www.amd-osx.com.

Cross-compiles work great when your audience is mainly comprised of people looking for executables. For Apache Software Foundation projects, the end product is source code. It’s important that our bits compile natively on those platforms. I do a lot of portability work under that context. Using something like Jenkins + Xen, it’s fairly trivial to build up a small army of OS installs to do testing.

But we’re getting a bit off topic… :slight_smile:

Sorry for the OT, but how do you test Xcode/OS X with Xen? Or do you just verify clang and some Darwin libraries?

For OS X testing, I’ve got a MacBook Pro laptop I tend to carry with me. I usually do my first pass of coding on it (curled up on the couch or whatever).

I’m looking to use this Threadripper box for Ubuntu and CentOS under Docker, but NetBSD, FreeBSD, Illumos, Windows, etc. under Xen or KVM. At some point, I’ll likely look at what the folks over at InsanelyMac are doing. It looks like they’ve just about got Hackintoshes running on Ryzen.

One question I’ve got is whether or not I’ve got enough cooling. I have a feeling I should probably add another fan. I’m assuming that’s why @munt has that 80mm fan.


Your case should already include 3 fans + power supply.

The 280X adds another 15-30 W of heat when idle, and up to 150 W under load.

I think it should be fine… If you don’t get the airflow through the case, the delta-T between the inside of your case and the outside will be higher, meaning your CPU die temperature will be higher. But then again, with an IHS that huge and a cold plate that big, the cooler should be able to soak up whatever short-term extra thermal energy two relatively small ceramic-encased chips burp up, so it should theoretically be more tolerant of running hotter.

The only question is whether an extra fan would allow all the fans to run slowly enough to produce less noise overall (probably not); there’s no way of knowing without testing…

The quickest way for me to verify thermals is to fire up AIDA64 (yes, it’s for Windows, yuck) and have it run a CPU+GPU stress test for about 20 minutes; it’ll plot your thermals for you. In that toasty, pre-warmed-up environment you can then run other tests.

Just following up with my current build. Sorry for the prolonged delay; Irma preoccupied me and kept me away from my new toy :frowning: . So anyway, here goes.

Running a 1950X on the MSI X399 Pro Carbon,
32 gigs of RAM,
and a Corsair H80i v2 AIO,
running a version of Fedora 26 (viper 9 upgraded to 26).

First, I’ll talk about the cooler: the Corsair cannot handle this processor when overclocked past 3700MHz. Temps cross 80C extremely quickly, and the machine becomes unstable after prolonged running at those temps. While the Cooler Master JetFlos do help somewhat, they are not enough to chill this proc down. So it’s a water block, or wait for a better AIO.

Also concerning heat: the VRM heatsink gets very hot. I would recommend using the included fan bracket, or otherwise ensuring the heatsink gets proper airflow, because that thing gets extremely hot.

Memory: I have single-rank Samsung B-die Team Xtreem 4133MHz memory… and it’s a no-go for stability at 3200; I had to drop this memory down to 2800 to get it stable. So even using Samsung B-die memory, you’re still at the mercy of the silicon gods (or MSI BIOS engineers). I also had to place the memory in a different configuration from the manual’s recommendation: I had to use the 4 slots furthest from the processor to get the machine stable.

GPU passthrough… not working. Despite jumping through the various hoops and reading all the guides, GPU passthrough is not working on this build. I set everything up correctly as far as I know, but adding the graphics card and its audio device to the guest makes the virtual machine sluggish and unusable.

So at this time I do not recommend Threadripper on the MSI X399 Pro Carbon as a KVM virtualization platform.

If you want to use it for a performance workstation, be ready to spring for a water-cooling solution.

It does seem to run Windows 10 just fine… but honestly, who wants to run Windows 10?

Also, Windows 7 will not install on this board; some blue-screen error keeps it from installing. Even transplanting a sysprepped install does not work.

So far this build has been a major disappointment. And oh yeah, don’t expect any help from MSI technical support either; I had to figure out the memory issue all on my own. You would think their tech support team would be willing to help you troubleshoot build issues such as memory configs.


@chaos4u - sounds like you got a lot closer than I did …

How did you map your PCI card into KVM devices on the MSI X399?

Also, have you looked into disabling the HPET timer in the KVM config? I’ve read in other threads that it seems to be a problem with Windows on Threadripper.
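From what I’ve read, disabling it would look something like this in the guest’s `<clock>` element of the libvirt domain XML (a sketch; I haven’t verified it fixes anything yet):

```xml
<clock offset='localtime'>
  <!-- disable the HPET timer for the Windows guest -->
  <timer name='hpet' present='no'/>
</clock>
```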

Thanks

TL;DR: I think I can work around the problem I’m having with a configuration change…

With my MSI X399 / 1950X / Fedora 26 setup… GTX 1080 Ti in PCIe slot 1 and GTX 1050 in PCIe slot 3…
BIOS MS-7B09 1.40, 08/21/2017.

My secondary card, the GTX 1050, is in its own IOMMU group and bound to vfio-pci… but using the virt-manager GUI to add the device results in the following error during KVM start:


could not access /sys/bus/pci/devices/0000:11:00.0/config

So it looks like libvirt is building a simple path from the bus/slot/function of the device I selected in the GUI, but that path is in the wrong place under my device tree in Fedora 26…

From my virsh edit…

...
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x11' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </hostdev>

Instead, my device shows up under /sys/bus/pci/devices/0000:00:03.1/0000:0b:00.0/ … a lot deeper down. So my idea is to a) find a way to make a symlink pointing to the device (or remove and re-add the device somehow so it appears where libvirt is looking for it),
or b) find a different way to describe the device…
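One thing worth double-checking before going further: libvirt PCI addresses are hexadecimal. The virsh nodedev-dumpxml output shows the card on bus 11 decimal, which is 0x0b, while the hostdev entry says bus='0x11' (bus 17); that mismatch would explain libvirt probing the nonexistent /sys/bus/pci/devices/0000:11:00.0. A corrected source address (just a guess on my part) would look like:

```xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <!-- GTX 1050 at 0b:00.0: bus is 0x0b (decimal 11), not 0x11 -->
    <address domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
  </source>
</hostdev>
```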

The following idea of using a qemu arg almost works for me in virsh edit …

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>win10c2</name>
 ... 

<qemu:commandline>
    <qemu:arg value='-cpu'/>
    <qemu:arg value='host,hv_time,kvm=off,hv_vendor_id=null'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='vfio-pci,id=vgpu1,sysfsdev=/sys/bus/pci/devices/0000:00:03.1/0000:0b:00.0'/>
</qemu:commandline>
</domain>

But then I get stuck with qemu not being configured to access /dev/vfio… no access. It looks like I need to add /dev/vfio to the list of devices qemu is allowed to open on F26 to make this work… Stuck here for now, unless there is another way to do it.
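In case it saves someone else a search: on libvirt-based setups that allow-list usually lives in /etc/libvirt/qemu.conf under cgroup_device_acl. A sketch (keep whatever defaults your file already lists and append the vfio nodes; the "/dev/vfio/2" entry matches my 1050’s IOMMU group, so adjust for yours), followed by a libvirtd restart:

```
# /etc/libvirt/qemu.conf
cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm",
    "/dev/vfio/vfio", "/dev/vfio/2"
]
```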

FYI…

ls-iommu …
ls-iommu | grep Group\ 2

IOMMU Group 2 00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1452]
IOMMU Group 2 00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1453]
IOMMU Group 2 0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050] [10de:1c81] (rev a1)
IOMMU Group 2 0b:00.1 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
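(If anyone wants to reproduce this listing without the ls-iommu script, a rough shell equivalent reads straight from sysfs. The base path is a parameter only so the function can be exercised against a fake tree; on a real system just call it with no argument:)

```shell
# List every device in every IOMMU group, similar to the ls-iommu script.
list_iommu_groups() {
    base="${1:-/sys/kernel/iommu_groups}"
    for dev in "$base"/*/devices/*; do
        [ -e "$dev" ] || continue
        group="${dev%/devices/*}"   # strip "/devices/<bdf>" suffix
        echo "IOMMU Group ${group##*/}: ${dev##*/}"
    done
}

list_iommu_groups
```

Pipe it through `grep 'Group 2'` to get the same filtered view as above.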

lspci -tv

-+-[0000:40]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1450
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1451
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-01.3-[41]----00.0  Intel Corporation PCIe Data Center SSD
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-03.1-[42]--+-00.0  NVIDIA Corporation GP102 [GeForce GTX 1080 Ti]
 |           |            \-00.1  NVIDIA Corporation GP102 HDMI Audio Controller
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           +-07.1-[43]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 145a
 |           |            +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1456
 |           |            \-00.3  Advanced Micro Devices, Inc. [AMD] USB3 Host Controller
 |           +-08.0  Advanced Micro Devices, Inc. [AMD] Device 1452
 |           \-08.1-[44]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1455
 |                        \-00.2  Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
 \-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Device 1450
             +-00.2  Advanced Micro Devices, Inc. [AMD] Device 1451
             +-01.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-01.1-[01-09]--+-00.0  Advanced Micro Devices, Inc. [AMD] Device 43ba
             |               +-00.1  Advanced Micro Devices, Inc. [AMD] Device 43b6
             |               \-00.2-[02-09]--+-00.0-[03]----00.0  ASMedia Technology Inc. Device 2142
             |                               +-02.0-[04]--
             |                               +-03.0-[05]--
             |                               +-04.0-[06]--
             |                               +-05.0-[07]--
             |                               +-06.0-[08]----00.0  Intel Corporation I211 Gigabit Network Connection
             |                               \-07.0-[09]--
             +-01.3-[0a]----00.0  Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
             +-02.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-03.0  Advanced Micro Devices, Inc. [AMD] Device 1452
             +-03.1-[0b]--+-00.0  NVIDIA Corporation GP107 [GeForce GTX 1050]
             |            \-00.1  NVIDIA Corporation GP107GL High Definition Audio Controller

virsh nodedev-list --tree

+- pci_0000_00_03_1
|   |
|   +- pci_0000_0b_00_0
|   +- pci_0000_0b_00_1
|

virsh nodedev-dumpxml pci_0000_0b_00_0

<device>
  <name>pci_0000_0b_00_0</name>
  <path>/sys/devices/pci0000:00/0000:00:03.1/0000:0b:00.0</path>
  <parent>pci_0000_00_03_1</parent>
  <driver>
    <name>vfio-pci</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>11</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x1c81'>GP107 [GeForce GTX 1050]</product>
    <vendor id='0x10de'>NVIDIA Corporation</vendor>
    <iommuGroup number='2'>
      <address domain='0x0000' bus='0x0b' slot='0x00' function='0x0'/>
      <address domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
      <address domain='0x0000' bus='0x0b' slot='0x00' function='0x1'/>
      <address domain='0x0000' bus='0x00' slot='0x03' function='0x1'/>
    </iommuGroup>
    <numa node='0'/>
    <pci-express>
      <link validity='cap' port='0' speed='8' width='16'/>
      <link validity='sta' speed='8' width='16'/>
    </pci-express>
  </capability>
</device>

ll /sys/bus/pci/devices/0000:00:03.1/0000:0b:00.0/iommu_group/devices/

total 0
lrwxrwxrwx. 1 root root 0 Sep 10 00:16 0000:00:03.0 -> ../../../../devices/pci0000:00/0000:00:03.0
lrwxrwxrwx. 1 root root 0 Sep 10 00:16 0000:00:03.1 -> ../../../../devices/pci0000:00/0000:00:03.1
lrwxrwxrwx. 1 root root 0 Sep 10 00:16 0000:0b:00.0 -> ../../../../devices/pci0000:00/0000:00:03.1/0000:0b:00.0
lrwxrwxrwx. 1 root root 0 Sep 10 00:16 0000:0b:00.1 -> ../../../../devices/pci0000:00/0000:00:03.1/0000:0b:00.1

Looks like Wendell got this working on the Gigabyte board … Level1 Linux: Livestream (Setting up PCIe Passthrough on Fedora on X299 and Threadripper systems) | Level One Techs

And GreyBoltWolf makes it look easy … Play games in Windows on Linux! PCI passthrough quick guide

See also https://www.linux-kvm.org/page/How_to_assign_devices_with_VT-d_in_KVM
and https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Virtualization_Administration_Guide/chap-Guest_virtual_machine_device_configuration.html

So I have hope it can be done if I can just finish jumping through the hoops…

Thanks

Kinda the same impression here as well.
ASRock Fatal1ty X399, and PCI passthrough is a no-go in its current state.
RMA’d the whole thing.

The 1080 Ti won’t come out of sleep for passthrough. It works OK for Vega.

MSI are working with me on it though so that’s awesome of them :-p


Have you had a chance to try a Polaris card yet? I think you mentioned somewhere (Reddit?) that it was going to be attempted, IIRC.

It seems nobody has yet mentioned the Nested Page Tables (NPT) bug with Zen CPUs and Linux KVM. When using GPU passthrough on KVM with a recent AMD CPU, performance in 3D applications suffers greatly unless you disable NPT. Unfortunately, the overhead of emulating the page tables in software then creates a relatively large CPU bottleneck inside your VMs. I’ve run into this issue on my 1800X and wouldn’t really recommend anybody use Zen CPUs with Linux KVM for GPU passthrough until it’s resolved. I’ve heard that it works fine with VMware and Xen, though, so if you plan on using those hypervisors it’s probably fine.
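For anyone who wants to try it anyway: NPT is a kvm_amd module parameter, so the workaround is a modprobe option (the filename below is illustrative) plus a module reload, with the CPU-side slowdown described above as the trade-off:

```
# /etc/modprobe.d/kvm-amd.conf  (illustrative filename)
# Disable Nested Page Tables to work around the Zen GPU-passthrough
# performance bug, at the cost of higher CPU overhead in guests.
options kvm_amd npt=0
```

Or one-off, with all VMs shut down first: `sudo modprobe -r kvm_amd && sudo modprobe kvm_amd npt=0`.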


I thought Polaris didn’t suffer from the bus reset issue (at least some of the cards).

It’s not supposed to, but comments like this one over at the AMD forums make me think everything needs to be retested. :frowning: