Threadripper 32 core 3970x + VMWare ESXi 6.7. My journey

Hey all,

I’ve had an interesting past week trying to get the 3970x working on VMWare ESXi and I thought I’d post my thoughts here, both to ask for help from smart people, but also to provide information for others who are looking to go down this route.

My goal is to build a single system, virtualized on the ESXi platform, that will allow me to game, monitor data, develop, test: basically do everything I want. I’ve chosen the AMD 3970x platform to do this.

My specs are as follows:

  • AMD 3970x 32 Core CPU
  • 128GB Corsair RAM CMT64GX4M4K3600C18
  • Asus ROG Zenith 2 Extreme motherboard w/0702 bios
  • H740p (Dell/LSI) raid controller w/8GB NVFlash + BBU
  • 8x 2TB Crucial MX500 SSD in SFF8643 capable enclosure
  • 3x 1TB Rocket PCIe v4 NVME
  • 4x 12TB Seagate Ironwolf NAS HDD
  • GTX 2080Ti GPU#1
  • GTX 1060 GPU#2
  • 2x GT710 ancillary GPU
  • beQuiet TR4 cooler
  • Boatload of Noctua 120mm/140mm fans
  • Fractal Design XL R2 Case
  • 4x 1080p BenQ 27" IPS monitors (blue light reduction models)
  • 1x Asus ultrawide giant monitor (forgot model)
  • 2x 2K 32" Dell/Acer monitor

There are a bunch of other misc parts as well, such as multiple USB pcie cards, x16 ribbon extenders, bluetooth USB adapters, and the like.

STATUS SO FAR:

RAID: The H740p RAID controller came out of a Dell server. I love these controllers. They are configurable directly in the Asus UEFI, which is amazing.

I’ve used ESXi for years, as well as Dell servers, and am fairly comfortable with these platforms. That said, I make no claims to be an absolute guru.

I set up the 8x 2TB SSDs in a RAID6 with roughly a 30% overprovision. RAID6 was chosen for double parity as well as crazy fast read speeds. Note: the 30% overprovision (or underprovision, if you’d like) has a substantial impact on the write endurance (TBW) of the drive. I’ve purchased 2x other 2TB drives and have done over 3000 drive writes @ 30% OP and they still work fantastically. The Crucial MX500 was chosen because it has enough capacitor charge to write out its buffer in the event of a power failure. Speeds without drive cache and RAID cache are ~4GB/s, and with RAID/drive cache ~7GB/s (sustained beyond 8GB reads/writes).
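For the curious, the usable-capacity math works out like this (a quick sketch of my own arithmetic, nothing from the controller; “TB” here is decimal/marketing terabytes):

```python
# Rough usable-capacity math for a RAID6 array with overprovisioning.
# RAID6 spends two drives' worth of capacity on parity; overprovisioning
# leaves a fraction of each drive unpartitioned as spare flash.
def raid6_usable_tb(drives: int, drive_tb: float, overprovision: float = 0.0) -> float:
    assert drives >= 4, "RAID6 needs at least 4 drives"
    return (drives - 2) * drive_tb * (1.0 - overprovision)

print(round(raid6_usable_tb(8, 2.0, 0.30), 2))  # 8x 2TB @ 30% OP -> 8.4 TB usable
```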

There was an initial issue with the RAID controller being in PCIe x4 mode by default, which I corrected via the Asus UEFI.

VMware ESXi 6.7u3 was installed on one of the 1TB PCIe 4.0 NVMe drives. I put ESXi on that drive because it will double as a staging ground for vmkfstools disk conversions, so high disk IO will be nice for that.

VMWare’s datastore is on the RAID6 array.

THE ISSUE:

I’ve created 1x VM so far, a Windows 10 LTSB 1603 to test things out. 16GB RAM, 250GB HDD, 6 core, thick eager.

I’ve passed through the GTX 1060 + its HDMI audio function to the Win10 VM and it boots up directly to a monitor. Yay! The only tweak needed was the following:

hypervisor.cpuid.v0 = FALSE

Which enabled the Nvidia GPU to act as normal.
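For anyone following along, the relevant .vmx entries end up looking roughly like this (a sketch from my understanding; the pciPassthruN numbering is assigned by ESXi when you add the devices, not values specific to this build):

```
hypervisor.cpuid.v0 = "FALSE"     # hide the hypervisor so the GeForce driver behaves
pciPassthru0.present = "TRUE"     # the GPU function
pciPassthru1.present = "TRUE"     # the GPU's HDMI audio function
```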

FIRST ISSUE: The audio has severe issues. It is extremely choppy, and playing a YouTube video chops/skips/crackles. Yuck. I’m not sure how to correct this. I’m on the latest Nvidia driver, and audio goes direct via the monitor, so it is effectively GPU -> HDMI audio -> monitor.

SECOND ISSUE: No keyboard or mouse

ESXi appears to filter out all HID-class USB devices. I can’t seem to find a way around this. Does anybody know?

The workaround should be to pass through an entire USB controller to the VM and then plug whatever you want, such as a keyboard/mouse, into that controller, and it will show up. But it doesn’t.

I have enabled pass-through on all the USB controllers and tried mapping them to the VM. From what I can tell, there are two easily visible USB controllers: ASMedia + AMD.

The AMD USB controller simply won’t “enable” for pass-through. It keeps saying the ESXi host needs a reboot. So no go there.

The ASMedia one allows me to add it as a pcie-passthru to the VM, however, nothing in the VM works. No USB devices are recognized. No keyboards, no mice, no USB drives, nothing.

I then added a USB controller card to the system (StarTech USB 3.1 PCIe Card - PEXUS313AC2V) and mapped that. However, nothing shows up there either. No keyboards/mice/drives, nothing.

THIRD ISSUE: Disk Response/Active Time: 100%

For some reason, the VM has a crazy high average response time / disk IO which is slowing things down, despite pushing under 1 MB/s. The esxtop disk-device view (press u) shows the following:

11:19:53am up  1:41, 797 worlds, 1 VMs, 6 vCPUs; CPU load average: 0.04, 0.02, 0.01

DEVICE PATH/WORLD/PARTITION DQLEN WQLEN ACTV QUED %USD LOAD CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s DAVG/cmd KAVG/cmd GAVG/cmd Q
naa.6d0946608826be0025b455c6ea6eabe0 - 192 - 0 0 0 0.00 250.40 80.13 170.27 0.17 1.84 0.11 0.00 0.12
t10.ATA_____ST12000VN00072D2GS116___ - 31 - 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Notice the 250 CMDS/s. This is a crazy number of commands for doing virtually nothing. I’ve done some googling and it led me to the following:

Run this command inside the Windows guest to disable TRIM/delete notifications (it’s a Windows fsutil command, not PowerCLI): fsutil behavior set DisableDeleteNotify 1

This did not change anything. The VM host is doing nothing and all counters show massive spare capacity.
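For context, a back-of-envelope on the esxtop row above shows just how tiny the average IOs are (plain arithmetic, nothing ESXi-specific):

```python
# Average IO size = throughput / IOPS, using the esxtop counters above.
def avg_io_kb(mb_per_s: float, ops_per_s: float) -> float:
    return (mb_per_s * 1024) / ops_per_s

print(round(avg_io_kb(1.84, 170.27), 1))  # writes: ~11.1 KB each
print(round(avg_io_kb(0.17, 80.13), 1))   # reads:  ~2.2 KB each
```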

IN CLOSING:

3 issues:

  1. How do I pass through USB controllers and actually get them working? I just need a local keyboard/mouse.

  2. How do I get audio to work correctly on a VM w/pass-thru GPU?

  3. How do I resolve the crazy disk queue times?

I’m working on this around the clock. I’m firmly committed to getting it to work. I have different USB hubs coming in the mail, different USB controllers, multiple Bluetooth USB devices, even NVMe bifurcation cards + PCIe x4 breakout cables for PCIe x16 slots. I’m going to make this work.

I’m going to consolidate all my efforts into this forum thread as I work on this. I’ll probably post updates, and then counter updates, and then back-tracking updates, so you’ll see the entire thought process as it goes and likely watch me make a ton of mistakes.

If you have any ideas on how to help, or things to chime in with, PLEASE DO SO!

This is my first post on L1Techs and I’m excited to be here!

waves to Wendell


Please let me know what I can do to make this post as good as possible!

Do you guys like pictures? Specific part #s? Anything else I can do to make this post more informative?

UPDATE:

With regards to the H740p raid controller and my settings:

I set things up initially with 64KB blocks, write-through, no read-ahead, with all cache disabled. That was my initial thought, as I prefer integrity over speed. I’ve been testing with write-back and read-ahead plus controller/disk cache, and speeds have gone from ~4GB/s to ~7GB/s, but I’m not sure I’ll stick with it.

With regards to the USB not being able to pass-thru:

I’ve tested a few things on the VM.

vhv.addPassthru = “TRUE”

Did nothing useful.

Various iterations of the following:

pciHole.start = “1200”
pciHole.end = “2200”

Did nothing helpful either.

While my UEFI/BIOS has all the virtualization options enabled (IOMMU, ACS, SR-IOV, SVM, etc.), I fiddled with this in ESXi per a thread I read:

VMkernel.Boot.disableACSCheck = true
pciPassthru4.msiEnabled = FALSE (where 4 is your passthrough device number)

But that didn’t help either. Note, I didn’t have to do the passthru msi setting for my GPU to work. It just worked right away.

I’ve been tailing /var/log/usb.log as well as vmkernel.log to see if anything jumps out, but nothing really has.

Lastly, I’ve played around with different VMDirectPath settings in the following:

/etc/vmware/passthru.map

But still never got the USB controller pass-thru to recognize any devices.

So the issue #1, #2, and #3 are still going on.

This is a major reason I gave up on ESXi when I tried it a couple of years ago. I believe you are correct in that the only solution is to pass a controller through.

You may need to buy a PCIe add-in USB controller to get this to work, if the onboard ones are not playing nice.

Are message signaled interrupts enabled? That is often needed for audio issues with Linux KVM passthrough, and I think ESXi may be the same.

Looking at your post in the FLR patch thread, maybe you could try playing around with the reset settings in ESXi to see if that can help the controllers.

Go down to the “PCI Function resets” section:
https://kb.vmware.com/s/article/2142307

Make sure to reboot the host machine between tests, as the controllers might only work on the first VM boot after a host boot.

The reason is that the FLR patch in the other thread basically just disables FLR for the specified devices.

Hey there!

Yeah, I’ve definitely been playing around with the passthru.map but haven’t had any luck.

I went out and got a new USB controller PCIe card and am running into similar issues as before. Here is the error I’m getting in vmkernel.log:

2020-01-21T07:48:26.169Z cpu10:2102427)WARNING: PCI: 384: Dev @ 0000:29:00.0 did not complete its pending transactions prior to being reset; will apply the reset anyway but this may cause PCI errors
2020-01-21T07:48:32.176Z cpu10:2102427)WARNING: PCI: 471: Dev 0000:29:00.0 is unresponsive after reset

So it’s definitely something with the resets. However, I’ve modified /etc/vmware/passthru.map with every iteration I could think of.

For example:

# passthrough attributes for devices

# file format: vendor-id device-id resetMethod fptShareable
# vendor/device id: xxxx (in hex) (ffff can be used for wildchar match)
# reset methods: flr, d3d0, link, bridge, default
# fptShareable: true/default, false

# Edit for USB controller
1912 ffff d3d0 default

I modified the passthru.map to include that. Then I started up the VM. Same error:

Failed to register the device pciPassthru2 for 41:0.0 due to unavailable hardware or software support.

Whereas the vmkernel.log shows what I pasted above.
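As a sanity check that my entries at least matched the documented format, here’s a throwaway sketch that parses lines the way the file’s own header comments describe them (vendor-id device-id resetMethod fptShareable):

```python
# Validate passthru.map entries against the documented field layout.
RESET_METHODS = {"flr", "d3d0", "link", "bridge", "default"}

def parse_passthru_map(text: str):
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        vendor, device, reset, shareable = line.split()
        assert reset in RESET_METHODS, f"unknown reset method: {reset}"
        entries.append((vendor, device, reset, shareable))
    return entries

print(parse_passthru_map("# my USB controller\n1912 ffff d3d0 default"))
# -> [('1912', 'ffff', 'd3d0', 'default')]
```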

I’ve also added this to the VM to no avail:

pciPassthru2.msiEnabled = FALSE

My USB Controller is this btw:

uPD720202 USB 3.0 Host Controller
ID 0000:2a:00.0
Device ID 0x15
Vendor ID 0x1912
Function 0x0
Bus 0x2a

Any idea what to do?

Many thanks!

UPDATE:

I rebooted my ESXi host after removing my passthru.map edits. Somehow it just suddenly started working. So I have that situation resolved.

ALSO, the disk IO stuck at 100% seems to be due to a VMFS6 bug. Reverting the datastore to VMFS5 fixes that issue.

I ran an initial Unigine benchmark and got 140fps @ 1920x1080 with all-high settings on the simple GTX 1060, in an 8-core, 16GB RAM VM with full pass-thru. Very, very nice!

Now time to set up a second VM, this time with a GTX 2080 Ti on full pass-thru w/keyboard and mouse.

Stay tuned!


Update:

I have 2x VMs that are desktop replacements now: one with a GTX 1060 and one with a GTX 2080 Ti.

I have a USB hub connected to each VM’s passed-through USB controller. I have a keyboard/mouse switch shared between the systems, and each VM has its own Bluetooth USB dongle. Both connect to the same speaker for audio output and it runs seamlessly.

ONE ISSUE:

I am running into a problem: On the 2080Ti system, I can’t open the Nvidia Control panel. I get the following error:

C:<snip long path>\nvcplui.exe
CLiP license device ID does not match the device ID in the bound device license

Anybody have a solution for this? Researching it now…

UPDATE: I learned something new! Nvidia now distributes the control panel via DCH by default, which integrates with Microsoft. This allows Microsoft, for example, to track every time you open the Nvidia Control Panel.

You can opt out of this tracking by doing the non-DCH Nvidia driver install, called the “Standard” version. That no longer gives you the CLiP licensing error (aka Microsoft can’t validate you as a user to track launches of the Nvidia Control Panel).

So this problem is now SOLVED!

What’s Next?

I am going to be installing 2 more GPUs.

My plan is to utilize a PCIe bifurcation device to split one PCIe x16 slot into 4x PCIe x4 slots.

I’ve purchased a PCIe 4x NVMe carrier card, which is typically used to install 4x NVMe drives.

I’ve then purchased 4x NVMe M.2 --> PCIe x4 slot converters. This will effectively allow me to plug 4x PCIe devices into this one card… The plan is to install 2x GT710 GPUs (monitor only) as well as the aforementioned GTX 1060.

I’m working on this and will have an update!

PS: If anybody knows a way to split a PCIe x16 into 1x x8 and 2x x4, PLEASE LET ME KNOW! I’ll buy it ASAP!
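For reference, the lane arithmetic I’m working with (just a sketch of the budget; whether a given split actually works depends entirely on the motherboard’s bifurcation options):

```python
# Check whether a set of link widths fits a slot's lane budget.
def fits(total_lanes: int, split: list) -> bool:
    return sum(split) <= total_lanes and all(w in (1, 2, 4, 8, 16) for w in split)

print(fits(16, [4, 4, 4, 4]))  # the 4x NVMe riser plan -> True
print(fits(16, [8, 4, 4]))     # the x8 + 2x x4 wish -> True, lane-wise
```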


UPDATE:

I’ve received an ASUS 4x NVMe card that goes into a PCIe x16 slot. I also have 4x NVMe M.2 to PCIe x4 slot adapters.

I am hoping I can ribbon out from this card to plug 4x GPUs, each at PCIe x4, into this single PCIe x16 card. We will see! The goal is to have multiple VMs, each with its own display.

But I’ve run into an issue:

Issue with GTX 2080 Passthru:

It seems that, under load, and possibly during power-state changes, the VM completely resets itself. One moment I’ll be playing a game that is modestly GPU-intensive, and BAM, the VM bluescreens and resets.

It has also reset itself when coming out of monitor sleep mode, too.

I have a feeling it might be related to PCIe timings, or possibly powerstate changes / reset.

I’m experimenting with different passthru.map reset methods. The default is “bridge” and I’m trying “d3d0” right now.

If anybody has a solution to this, please let me know!

EDIT: I am attempting to set this:

pciPassthru0.msiEnabled = FALSE

and do so for 0/1/2/3 (the four PCIe functions the 2080 Ti exposes)

To see if that helps.

Sorry for bumping a 4-month-old thread, but do you have any updates on this? I want to do basically the same thing, including the bifurcation to get more x4 slots.

Have you ironed out the last of the issues you were dealing with? Would you do it again the same way, or do you have any tips for someone doing the same thing?

thanks

Hi StartupTim, and thanks for sharing your experiences. How do you configure and/or access the H740P RAID controller setup from the Asus UEFI? I cannot figure it out. Which menu? What steps? I have the same motherboard, newer firmware (maybe a problem), same CPU, and same RAID controller. I would really appreciate the procedure for configuring the H740P on non-Dell hardware.

Dear StartupTim,

Can you share the costs behind the purchase of your system? Excluding monitors and the NAS, just the

  • AMD 3970x 32 Core CPU
  • 128GB Corsair RAM CMT64GX4M4K3600C18
  • Asus ROG Zenith 2 Extreme motherboard w/0702 bios
  • H740p (Dell/LSI) raid controller w/8GB NVFlash + BBU
  • 8x 2TB Crucial MX500 SSD in SFF8643 capable enclosure
  • 3x 1TB Rocket PCIe v4 NVME
  • GTX 2080Ti GPU#1
  • GTX 1060 GPU#2
  • 2x GT710 ancillary GPU
  • beQuiet TR4 cooler
  • Boatload of Noctua 120mm/140mm fans
  • Fractal Design XL R2 Case

I’m trying to get a sense of current market pricing.

Thanks a lot.

Regards,

ESXi is a hypervisor: a small piece of software installed on a physical server that allows you to run multiple operating systems on a single host computer.

This is the build I would love to replace my current tower with when I grow up.

Sick build OP. This would run circles around my much more space and power consuming setup. Life goals right here.

Just wanted to note that you can do USB passthrough with ESXi, but it will require you to add a line similar to this to your VM:
usb.quirks.device0 = “0x046d:0xc539 allow”

and then edit /bootbank/boot.cfg
with a line similar to this

CONFIG./USB/quirks=0x046d:0xc07d::0xffff:UQ_KBD_IGNORE:0x04d9:0x0419::0xffff:UQ_KBD_IGNORE:0x05e3:0x0608::0xffff:UQ_KBD_IGNORE:0x0835:0x1411::0xffff:UQ_KBD_IGNORE:0x3297:0x1969::0xffff:UQ_KBD_IGNORE:0x046d:0xc539::0xffff:UQ_KBD_IGNORE:0x1532:0x0904::0xffff:UQ_KBD_IGNORE
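To make that long quirks string less opaque, here’s a throwaway sketch of how it breaks down (the field meanings are my reading of the format: vid:pid:min-rev:max-rev:QUIRK, with entries joined by colons):

```python
# Split an ESXi CONFIG./USB/quirks value into its per-device entries.
# Assumed field layout per entry: vid:pid:<min rev>:<max rev>:QUIRK.
def parse_usb_quirks(value: str):
    tokens = value.split(":")
    assert len(tokens) % 5 == 0, "expected 5 colon-separated fields per entry"
    return [(tokens[i], tokens[i + 1], tokens[i + 4])
            for i in range(0, len(tokens), 5)]

sample = "0x046d:0xc07d::0xffff:UQ_KBD_IGNORE:0x1532:0x0904::0xffff:UQ_KBD_IGNORE"
print(parse_usb_quirks(sample))
# -> [('0x046d', '0xc07d', 'UQ_KBD_IGNORE'), ('0x1532', '0x0904', 'UQ_KBD_IGNORE')]
```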