X399 MSI MEG Creation IOMMU issues

Hey,

A few weeks ago I asked for a suitable mainboard for my workload: Mainboard recommendation without "with Overclock Slow VM under Linux" restriction

Because the ASUS ROG Zenith Extreme has a strange bug: if the CPU is overclocked, all VMs stutter extremely.

So last week I got a very good deal on an MSI MEG Creation and bought one.

Now, a week later and a lot of work to strip down my system (hardtube full watercooling with 3 GPUs and a 10 Gbit NIC, everything has a waterblock), then rebuild it and rebend all the tubes, my system is finished again and ready to rumble.

Everything works so far, too. Only IOMMU does not.

Software-wise I didn't change anything, except that I changed the PCI addresses in /sys/bus/devices/… from the old addresses to the new ones this mainboard gives.
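To be clear what I mean, here is a rough sketch of that kind of change, assuming a driver_override-style bind script that runs early from the initramfs before the normal drivers load. The addresses below are placeholders, the real ones come from lspci on the new board:

```bash
#!/bin/bash
# Placeholder addresses - replace with what lspci shows on the new board.
DEVS="0000:0b:00.0 0000:0b:00.1"

for DEV in $DEVS; do
    # tell the kernel that vfio-pci should claim this device when drivers load
    echo "vfio-pci" > "/sys/bus/pci/devices/$DEV/driver_override"
done

modprobe -i vfio-pci
```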

I can start my Windows VMs, but the display stays "black" (I see the MSI UEFI logo and under it the Fedora logo) forever. Then I hooked up the virtual display in the VM and saw that Windows boots forever (the circle spins around endlessly). After about 20 minutes the boot crashes with a bluescreen, something about a watchdog violation.

I can't even start an Ubuntu Live CD in the VM, for example. Nothing.

This is the biggest issue I have with this mainboard right now.

The other ones are small things. For example, only one NIC shows up in Linux instead of two. Or that I can't find a setting in the UEFI to change memory interleaving to channel mode (NUMA / UMA). And on the ASUS board I also had to enable SR-IOV (or whatever it's called) for everything to work properly; in the MSI UEFI there is only a single setting called IOMMU (enable / disable). That this board only has maybe 10% of the settings the ASUS one has, I won't even mention.
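For reference, this is how I check the IOMMU groups after every setting change (the usual script, nothing board-specific):

```bash
#!/bin/bash
# Print every IOMMU group with the devices inside it.
shopt -s nullglob
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU Group ${g##*/}:"
    for d in "$g"/devices/*; do
        echo -e "\t$(lspci -nns "${d##*/}")"
    done
done
```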

Can anybody help?

Maybe @wendell can shine some light on this when he sees this topic.

I was going to write something last night, but I'm basing this off a Pro Carbon X399 and it's not really going to be helpful. During my ordeals with my Pro Carbon, waiting for a good BIOS from MSI, there was a release that removed some features from the BIOS, like the memory interleave and some other settings, so that may be why you're unable to find those settings in the BIOS.

The other thing that may be helpful is a post on Reddit concerning the IOMMU groupings.

Apparently they were successful with their setup, but there is a hint about the NIC being an issue …

Another success here,

which has a link and a clue that may be helpful. Good luck :slight_smile:

Thanks for the answers. I'm starting to look at it now and will try the things out.

Meanwhile, I found some solutions and some new problems :slight_smile:

The setting for memory interleaving exists. I found it hidden very, very deep, under the CPU OC menu (I'll post the exact path later).

Then I ran some benchmarks with stock BIOS settings. The results are a little better than with the ASUS board, but not worth mentioning.

But what I found out is that my CPU temp is sky high. Under full load around 70°C; with a stable overclock to 4 GHz on all cores, around 87°C. That's with a Heatkiller Pro waterblock, 3x 360 mm radiators and 3 liters of water, at full pump and fan speed. That's not normal. With the ASUS board (EKWB monoblock) I had around 58-62°C in this scenario with a 4 GHz all-core overclock.

The thing that makes me sceptical is the fact that the radiators are ice cold under the "87°C load". The two water temperature sensors I have report temps of about 30°C, which are the same temps as with the ASUS board.

First I thought it might be the temp offset from AMD, the 27°C thing. But as far as I know only the first Ryzen had that, not the Threadrippers. With a 27°C offset the numbers would match the ASUS board again. But then again, I can't believe it, because right now (22°C ambient here) lm-sensors says my CPU is at 35°C idle. 35 - 27 would be 8°C, and we all know that sub-ambient cooling isn't possible with a normal watercooling loop.
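To rule out my own confusion, this is roughly how I'm checking what the driver itself reports (the chip name and labels depend on the kernel's k10temp version, so treat this as an example):

```bash
# What does the k10temp driver actually report? Label names vary by kernel version.
sensors | grep -A4 k10temp
# k10temp-pci-00c3
# Tdie:  +35.0°C   <- die temperature
# Tctl:  +62.0°C   <- control temperature, can include the +27°C offset
```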

I already tried two other waterblocks. Same results.

OK, now I know why the Windows VM doesn't boot. The disk is missing (passed through via /dev/disk/by-id/).

fdisk -l and gparted don't show the NVMe disk.

Started an Ubuntu Live CD. It doesn't show up there either. Only the two Samsung NVMes that are plugged in under the PCIe slots. The third one, the Crucial on the right side of the mainboard, doesn't show up.

Curious. In the BIOS that disk shows up, and I can access it from my second, natively installed Windows.
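For what it's worth, this is what I'm using to check whether the kernel sees the controller at all and which driver has claimed it (generic commands, nothing specific to my addresses):

```bash
# Is the NVMe controller visible to the kernel, and which driver has claimed it?
lspci -nnk | grep -iA3 "non-volatile"

# Which NVMe block devices are actually exposed?
ls -l /dev/disk/by-id/ | grep nvme
```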

OK, I'm stupid.

Now I have all NVMes and all NICs showing up. Stupid me forgot to rebuild the initramfs after editing my /usr/sbin/vfio.sh file.

So after a dracut --force -v, everything is as it should be.
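In case someone else trips over this, the relevant part of my dracut setup looks roughly like this (the file name is my own choice, only the idea matters):

```bash
# /etc/dracut.conf.d/vfio.conf
# make sure the vfio modules and my bind script end up in the initramfs
add_drivers+=" vfio vfio_iommu_type1 vfio_pci "
install_items+=" /usr/sbin/vfio.sh "

# then rebuild the initramfs:
dracut --force -v
```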

My two 1080 Ti's:
44:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
Subsystem: ASUSTeK Computer Inc. Device 85ea
Flags: bus master, fast devsel, latency 0, IRQ 59, NUMA node 1
Memory at ae000000 (32-bit, non-prefetchable) [size=16M]
Memory at 70000000 (64-bit, prefetchable) [size=256M]
Memory at 80000000 (64-bit, prefetchable) [size=32M]
I/O ports at 4000 [size=128]
Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities:
Kernel driver in use: vfio-pci
Kernel modules: nouveau

0b:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
Subsystem: ASUSTeK Computer Inc. Device 85ea
Flags: fast devsel, IRQ 10, NUMA node 0
Memory at e0000000 (32-bit, non-prefetchable) [disabled] [size=16M]
Memory at c0000000 (64-bit, prefetchable) [disabled] [size=256M]
Memory at d0000000 (64-bit, prefetchable) [disabled] [size=32M]
I/O ports at 3000 [disabled] [size=128]
Expansion ROM at e1000000 [disabled] [size=512K]
Capabilities:
Kernel driver in use: vfio-pci
Kernel modules: nouveau

vfio-pci is now in use. Nice.

After a test, I still had no screen output at 0b:00.0.

I looked in /proc/iomem and dmesg and found out that I have to use video=efifb:off this time, because this mainboard only POSTs on the first PCIe slot, even when no screen is connected there (the ASUS board POSTs on whichever GPU has a screen connected).
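For reference, on Fedora I add the kernel parameter with grubby (editing /etc/default/grub and regenerating the config works just as well):

```bash
# add the parameter to all installed kernels
grubby --update-kernel=ALL --args="video=efifb:off"

# verify it is on the command line after a reboot
cat /proc/cmdline
```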

After that and a reboot I tested again; the error doesn't show up in dmesg anymore, but still no screen in the VM.

The only thing that I have now in dmesg is:

[ 800.286033] vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x19@0x900

OK, I got it working! :slight_smile:

I simply deleted the whole VM and created a new XML.
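Roughly what I did, with example names (my real VM name differs):

```bash
# drop the old domain definition and its NVRAM file (it's an OVMF guest)
virsh undefine win10 --nvram

# register the freshly written domain XML
virsh define win10-fresh.xml
```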

Thanks for the moral support! :slight_smile:

OK, now I've got Windows 10 20H2 installed.

But the workaround for the GPU Error 43 on the Nvidia card doesn't seem to work anymore.

I always used the settings described in the Arch Wiki, but they removed that section there entirely. Don't know why. https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF
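The settings I mean are the usual vendor_id / hidden-KVM ones in the domain XML, roughly like this (from memory, so treat it as a sketch; the value string is arbitrary and the rest of my hyperv block stays as it is):

```xml
<features>
  <hyperv>
    <!-- existing hyperv enlightenments stay as they are -->
    <vendor_id state="on" value="whatever123"/>
  </hyperv>
  <kvm>
    <hidden state="on"/>
  </kvm>
</features>
```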

I tried https://marzukia.github.io/post/fedora-32-and-gpu-passthrough-vfio/

But that doesn't work either. Still Error 43.

VFIO in 2020 Fedora 32 - An alternate route doesn't work either.