ASUS Zenith X399 Threadripper with multiple Vega 64 passthrough mixed results

At this point I’ll try anything haha. You got it, I’ll isolate to one GPU in an x8 slot and report back. I did find a 0902 beta UEFI update that I wasn’t aware existed. I’m flashing that now. Doubt I’ll be seeing any change from that but it’s worth a shot.

Groupings all looked good. GPU and audio devices were assigned to the same group and isolated to themselves. I’ll post that in just a moment when this comes back up from the UEFI update.
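For anyone following along, here is a minimal sketch of how the groupings mentioned above can be listed from sysfs. The function names and the optional root-directory argument are assumptions added for illustration (the argument only exists so the listing can be pointed at a test directory); the sysfs layout `/sys/kernel/iommu_groups/<group>/devices/<address>` and `lspci -nns` are standard.

```shell
#!/bin/sh
# Print "<group> <pci-address>" for every device, sorted by group number.
# $1 (optional, hypothetical): sysfs root to scan instead of the live one.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # nothing matched: IOMMU off or unsupported
        group="${dev#"$root"/}"        # strip the root prefix
        group="${group%%/*}"           # keep only the group number
        addr="${dev##*/}"              # PCI address, e.g. 0000:0a:00.0
        printf '%s %s\n' "$group" "$addr"
    done | sort -k1,1n -k2,2
}

# Same listing, decorated with the lspci description of each device.
describe_iommu_groups() {
    list_iommu_groups "$@" | while read -r group addr; do
        printf 'IOMMU group %s: %s\n' "$group" "$(lspci -nns "$addr")"
    done
}
```

Running `describe_iommu_groups` with no argument reads the live tree. A GPU is usually a good passthrough candidate when its VGA and audio functions share a group with nothing else.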

Might also be worth investing in some soft tubing or putting the air coolers back on if you still have them. At least while we debug.

https://pastebin.com/F18vrk2R

Looks like IOMMU groups 26 and 27, and groups 14 and 15. It’s odd that it split them into two different groups for one device, right? I’m guessing that won’t be a problem. I do have ACS enabled as well as PCIe ARI Support. Those shouldn’t have any adverse effects, should they?

Are you talking about the PCIe ACS Override patch?

If you are using ACS, that would be what’s causing the split.

It most likely won’t, but if you can get them in the right IOMMU groups without using ACS, you’ll probably want to do that.

I’ve had to cut tubes for both instances so I’m pretty covered. Just a pain to have to drain and refill, but I’ve gotten pretty quick at it on this build. I have a vega64 in 3 out of 4 slots so I could always just pull/add power to those I want to test. I’d assume that doesn’t damage the card or the board XD.

You’re definitely right, wishing I had some soft tubes to be able to use in the meantime though!

Shouldn’t be a problem as long as you don’t do it while the machine is running, to my knowledge.

ACS enabled within the BIOS is all. Ah, I’m a dummy, I was wondering if ACS was causing that split but I didn’t pay much mind to it as long as they were isolated from anything else. I’ll go ahead and disable it and test just in case. Will post the results shortly!

I think ACS needs to be enabled. I’ve never seen an option for that, so I can’t say for sure. I’ll defer to Wendell on that one. I was talking more specifically about the kernel patch that overrides the ACS configuration that the UEFI sets up.

Gotcha, well for the sake of education. Here are the groupings now without ACS:

https://pastebin.com/7AjXcM5n

Groups 2 and 9 now appear to be bound together with PCI bridges.

Update: The VM still posted and had no issue assigning the GPUs. Issued a restart just to see. Same result, as expected. No reset :(

That is a fair bit different. Interesting. That must just be an alternative method of doing the ACS groupings. I wonder why ASUS included it.

Any useful dmesg output?

Here’s what I’ve found for errors. I’ll post the full log in a pastebin as well.

Full log: https://pastebin.com/PFXDC0hE

 20:17 System encountered a non-fatal error in i2c_dw_init_master() 
 20:17 sp5100_tco: I/O address 0x0cd6 already in use kernel
 Reboot
 20:10 System encountered a non-fatal error in i2c_dw_init_master() 
 20:10 sp5100_tco: I/O address 0x0cd6 already in use kernel
 Reboot
 19:45 System encountered a non-fatal error in i2c_dw_init_master() 
 19:45 imjournal: ignoring invalid state file /var/lib/rsyslog/imjournal.state [v8.32.0] rsyslogd
 19:45 imjournal: fscanf on state file `/var/lib/rsyslog/imjournal.state' failed [v8.32.0 try http://www.rsyslog.com/e/2027 ] rsyslogd
 19:45 sp5100_tco: I/O address 0x0cd6 already in use kernel
 Reboot
 19:28 System encountered a non-fatal error in i2c_dw_init_master() 
 19:28 sp5100_tco: I/O address 0x0cd6 already in use kernel
 Reboot
 18:58 System encountered a non-fatal error in i2c_dw_init_master() 
 18:58 sp5100_tco: I/O address 0x0cd6 already in use kernel
 18:58 could not read from '/sys/module/pcc_cpufreq/initstate': No such device systemd-udevd 2 
 Reboot
 18:46 System encountered a non-fatal error in i2c_dw_init_master() 
 18:46 sp5100_tco: I/O address 0x0cd6 already in use kernel
 18:46 could not read from '/sys/module/pcc_cpufreq/initstate': No such device systemd-udevd
 Reboot
 18:08 error: process_write: write failed sftp-server 2 
 08:48 wil6210 0000:04:00.0 wlp4s0: wil_halp_vote: HALP vote timed out kernel 2
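Since the same two messages come back on every boot, it can help to collapse an excerpt like the one above into counts. This is only a sketch, assuming the excerpt has been saved to a file in the same `HH:MM message` shape with `Reboot` separator lines; the function name is made up for illustration.

```shell
#!/bin/sh
# Read a log excerpt on stdin, drop the leading HH:MM timestamps and the
# "Reboot" separator lines, then print each distinct message with its count,
# most frequent first.
summarize_errors() {
    sed -E 's/^[[:space:]]*[0-9]{2}:[0-9]{2}[[:space:]]+//' |
        grep -v '^[[:space:]]*Reboot[[:space:]]*$' |
        sort | uniq -c | sort -rn
}
```

For example, `summarize_errors < errors.txt` (with `errors.txt` being wherever the excerpt was saved) makes it easier to see which messages recur on every boot and which are one-offs. On a systemd machine, `journalctl -k -b -p warning` gives a comparable view of the current boot's kernel warnings and errors.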

Heading back to the BIOS to re-enable ACS.

Not a lot to go off of. My thoughts were that there’s some obscure feature on the board that’s messing things up, or that I did something wrong when compiling the kernel.

No luck, seeing the same issue.

It leads me to think that I either compiled the kernel wrong, or that there’s a feature of the board that may be causing an issue. Just a guess really.

I have this board too, and I’m using it successfully with 2 Nvidia 980 Tis on BIOS 0804. Perhaps try that BIOS to narrow it down?

There is a second IOMMU option buried in AMD PBS/CBS or something, one of those menus, as well as the option you’ve already found. I have both enabled. I think the buried option is disabled by default…

Hey new here, I have the same motherboard and wanted to chime in to contribute to the pool of knowledge:

I finally got my setup just about perfect. On the ROG ZENITH there may be a bug of sorts regarding UEFI/bios settings.
Counter-intuitively, I have had best success disabling compatibility mode in UEFI. Using UEFI version 902.

I have an RX 550 for the host. I did not want to waste an x16 PCIe slot on it, so I had it in the second x8 slot. Initially, even choosing it as the primary GPU in the UEFI under ‘tools’ did not allow the system to POST through the RX 550. I found a post suggesting disabling compatibility mode, and it worked! :)

Coincidentally this also allowed the ThreadRipper Reset Patch to finally work with my board as well.

Took a while and a lot of hours I should have been studying… But I finally have my setup just about right! RX 550 for the Fedora host, Vega for a ROCm Ubuntu VM, and a GTX 1050 Ti for another Ubuntu VM. Nice to be able to do deep learning in VMs. The software configs are so picky, it’s nice to be able to get a VM working and then simply back up the VM :). Great to be able to roll back to a working state with the bleeding-edge stuff.

I’m running Manjaro Linux, and to get rid of this error I had to ensure that the sp5100_tco module was loaded before the i2c module.

To do this I had to edit /etc/mkinitcpio.conf and add the module to the following section (right at the top):

# MODULES
# The following modules are loaded before any boot hooks are
# run.  Advanced users may wish to specify all system modules
# in this array.  For instance:
#     MODULES=(piix ide_disk reiserfs)
MODULES=(sp5100_tco)

Then I had to rebuild the initial ramfs (using mkinitcpio).
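As a rough sketch of that step, the helper below checks that the module really is listed in the active `MODULES` setting before the image is rebuilt. The function names and the optional config-path argument are assumptions for illustration; it accepts both the array form and the older quoted-string form of the setting.

```shell
#!/bin/sh
# Print the last uncommented MODULES= line of a mkinitcpio config file.
# $1 (optional): path to the config, defaults to /etc/mkinitcpio.conf.
modules_line() {
    grep -E '^[[:space:]]*MODULES=' "${1:-/etc/mkinitcpio.conf}" | tail -n 1
}

# Succeed if module $1 is listed in the MODULES setting of config file $2.
has_module() {
    modules_line "$2" |
        sed 's/^[^=]*=//' |      # drop the "MODULES=" prefix
        tr '()"\047' '    ' |    # turn parentheses and quotes into spaces
        tr -s ' ' '\n' |         # one module name per line
        grep -qxF "$1"
}
```

With the edit in place, `has_module sp5100_tco && sudo mkinitcpio -P` regenerates the images for all presets only if the module is actually listed. After a reboot, `dmesg | grep sp5100` should show whether the driver loaded.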

Once that’s done you should see the following in the boot log (at least, this is what shows up on my boot):

[    1.166511] sp5100_tco: SP5100/SB800 TCO WatchDog Timer Driver
[    1.166587] sp5100-tco sp5100-tco: Using 0xfed80b00 for watchdog MMIO address
[    1.166596] sp5100-tco sp5100-tco: Watchdog hardware is disabled

PS: I’m also using the ROG Zenith Extreme board.