Proxmox 9.0 + Intel B50 SR-IOV -- finally it's almost here! Early Adopter's Guide

Is anyone else having issues with suspend though? I guess on proxmox it’s not so common to suspend but I have my B50 in a workstation that suspends most of the time…

Are you using one of the resulting SR-IOV device partitions on the host itself, or are you expecting that there is a “residual device” left even after SR-IOV partitioning?

I’m using one of the VFs. The physical device is actually still there, but I presume anything demanding fails or crashes on it, since only a few MB of VRAM are left unallocated.

I’d like to be able to reserve, say, one VF’s worth of VRAM for the physical device so I can use it on the host without PRIME offloading.

Perhaps what I want is “PF mode”? I’ll see what I can do with the 6.18 rc kernel over the weekend…

From Phoronix:

  • Enabling SR-IOV support PF mode by default on supported platforms without needing to have built the Linux kernel with the CONFIG_DRM_XE_DEBUG option. This also includes enabling SR-IOV on the Xe driver when opting to use this modern driver with Tigerlake, Alder Lake, and Arctic Sound graphics rather than using the default i915 driver.

Add pcie_aspm=off to the kernel parameters on your host to calm the fan.
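
For anyone unsure where that parameter goes, here is a minimal sketch, assuming a GRUB-booted host whose command line lives in /etc/default/grub. The helper name add_kernel_param is made up for illustration, and it assumes the parameter contains no characters that are special to sed or grep.

```shell
# Hypothetical helper: append a kernel parameter to the
# GRUB_CMDLINE_LINUX_DEFAULT line of a GRUB defaults file, only once.
# Assumes the parameter has no characters special to sed or grep.
add_kernel_param() (
    param=$1
    file=$2
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*[\" ]${param}[\" ]" "$file"; then
        echo "${param} already present in ${file}"
    else
        tmp=$(mktemp) || exit 1
        # Insert the parameter just before the closing quote of the value.
        if sed "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 ${param}\"/" "$file" > "$tmp" &&
            cat "$tmp" > "$file"; then
            echo "${param} added to ${file}"
        fi
        rm -f "$tmp"
    fi
)

# Typical use on a GRUB-booted host (as root), followed by a reboot:
#   add_kernel_param pcie_aspm=off /etc/default/grub && update-grub
# Hosts booted via systemd-boot keep the command line in /etc/kernel/cmdline
# instead and apply it with: proxmox-boot-tool refresh
```

After rebooting, cat /proc/cmdline shows whether the parameter actually made it onto the running kernel's command line.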

I solved the suspend issue: it wasn’t the GPU, it was my NIC… I had moved my ConnectX-3 to a chipset-connected slot (the bottom one on my ProArt X670E). Because of that, my NFS share mounted over RDMA stopped working and was crashing suspend. It seems RDMA does not work in the chipset slot on this board…

I spent the past 4 weeks trying to debug my proxmox server and then I saw this video and remembered I started getting kernel panics when I tried and failed to get SR-IOV working on my intel 265k. I saw that this whole setup doesn’t work on kernel 6.8, which coincidentally is the only kernel I can get working in recovery mode.

I basically tried changing everything I could to match this guide. Still not working. I always used PVE 9.0 on this system, since problems :upside_down_face:. Has anyone else tried the old guide, gotten kernel panics, and now can’t get this working? I would like to run Jellyfin and a virtual desktop or two at the same time.

I have a DKMS module installed because my Intel Killer E5000 NIC isn’t supported before, I think, kernel 6.15, and it’s my management NIC, so I kind of need it. I think I remember that disabling it stopped the crashes as well, so I don’t know if it’s involved somehow.

Any suggestions other than reinstalling Proxmox? I have yet to test a backup :grimacing:

I was reading through the thread and wanted to check whether my Framework 13 with the Meteor Lake 125H supports SR-IOV, and found this while running the xe driver. I might try to set up a VM and see if I can pass the iGPU through.

nicholas : ~#  sudo lspci -s 00:02.0 -v
00:02.0 VGA compatible controller: Intel Corporation Meteor Lake-P [Intel Arc Graphics] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Framework Computer Inc. Device 0009
        Flags: bus master, fast devsel, latency 0, IRQ 153, IOMMU group 0
        Memory at 6018000000 (64-bit, prefetchable) [size=16M]
        Memory at 4000000000 (64-bit, prefetchable) [size=256M]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: [40] Vendor Specific Information: Intel Capabilities v1
                CapA: Peg60Dis- Peg12Dis- Peg11Dis- Peg10Dis- PeLWUDis- DmiWidth=x4
                      EccDis- ForceEccEn- VTdDis- DmiG2Dis- PegG2Dis- DDRMaxSize=Unlimited
                      1NDis- CDDis- DDPCDis- X2APICEn- PDCDis- IGDis- CDID=0 CRID=0
                      DDROCCAP+ OCEn- DDRWrtVrefEn+ DDR3LEn+
                CapB: ImguDis- OCbySSKUCap- OCbySSKUEn- SMTCap- CacheSzCap 0x0
                      SoftBinCap- DDR3MaxFreqWithRef100=Disabled PegG3Dis-
                      PkgTyp- AddGfxEn- AddGfxCap- PegX16Dis- DmiG3Dis- GmmDis-
                      DDR3MaxFreq=2932MHz LPDDR3En-
        Capabilities: [70] Express Root Complex Integrated Endpoint, IntMsgNum 0
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
        Capabilities: [d0] Power Management version 3
        Capabilities: [100] Null
        Capabilities: [110] Process Address Space ID (PASID)
        Capabilities: [200] Address Translation Service (ATS)
        Capabilities: [420] Physical Resizable BAR
        Capabilities: [320] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [400] Latency Tolerance Reporting
        Kernel driver in use: xe
        Kernel modules: i915, xe

I see the same on my Alder Lake (desktop) GPU, but it doesn’t seem to work… Check the kernel log with dmesg | grep xe; mine says not supported. Things are still in the pipeline, though I’m not sure about Meteor Lake specifically.

I got a B50 and followed the guide. However, I don’t have a Windows machine to do the firmware update with, so I tried using VFIO passthrough. Unfortunately that seems to have bricked my B50 :frowning: as the system no longer POSTs. Perhaps I did something wrong, but I bound the B50 to the vfio-pci driver as you normally would.

I have a second B50, but I would really like to avoid bricking that one too. What should I do?

So the whole system doesn’t POST anymore? Is that with the B50 as the only GPU, the main GPU, or as a secondary GPU? Perhaps you could still POST if another GPU is present and is the primary one… That could give you a chance to unbrick later on.

How exactly did you do the update (as a warning to others)? Any error messages or crashes? Does event viewer show anything (if you can get into the windows VM with a different GPU or the QEMU one)?

Some (including me) did the upgrade on the passed-through GPU and it worked fine, but obviously it’s the riskier option…

I’d probably try to get my hands on a Windows system to flash the other one: either quickly install Windows on a spare SSD, or see if you can borrow a system somewhere.

Well, I just passed the card through as you would any other. The driver seemed to install fine, and it worked in the Windows VM until I rebooted the Proxmox host to try to set up SR-IOV with the new firmware.
I have verified that it is the B50 that causes the system to not post anymore no matter what GPU is inserted alongside the B50. Removing it fixes the issue. I have tried every available PCI-e slot too, so that is also not the reason. I think the card genuinely bricked itself in some odd way.

I don’t really know anyone with a desktop windows machine, as they either have a laptop or aren’t using Windows as their OS. I guess I can’t install the firmware without windows then.

I wish they would allow us to just flash the firmware separately, man. I bricked one of my cards because of this crap, and I’m a little irritated by that, I have to admit.

Any tips on how I can minimise the risk when flashing it with passthrough? Perhaps I passed the card through incorrectly?
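
On the "did I pass it through incorrectly" question, one thing that can at least be verified is which kernel driver the card is bound to before the VM starts. This is only a sketch: bound_driver is a made-up helper name, and it does nothing to reduce the risk of the firmware flash itself.

```shell
# Hypothetical helper: print the kernel driver currently bound to a PCI device.
# $1 is the device directory in sysfs, e.g. /sys/bus/pci/devices/0000:0a:00.0
bound_driver() (
    dev=$1
    if [ -L "$dev/driver" ]; then
        # The "driver" symlink points at the bound driver's sysfs directory.
        basename "$(readlink "$dev/driver")"
    else
        echo "none"
    fi
)

# Example (the address is the one from the lspci output later in this thread;
# yours will likely differ):
#   bound_driver /sys/bus/pci/devices/0000:0a:00.0
```

For a card handed to a VM you would expect this to print vfio-pci while the VM is running, and xe (or none) otherwise.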

Hi guys,

After playing with the B50 for about 3-4 days, I thought I’d share my experiences.

First off, I didn’t read the guide. I just installed the newest firmware (in Windows), installed the Proxmox 6.17 opt-in kernel, set the kernel parameters, and it just worked. The whole process took about an hour.

For anyone wondering about gaming performance or what can be done with this GPU, here’s my setup. My server is running a Ryzen 5600G (a 5950X is on the way), 128GB DDR4 RAM, and a 2TB SSD.

I split the GPU into 12 VFs and created 12 VMs, each with 2 cores, 8GB RAM, and 128GB of storage. I installed Windows, Sunshine, and Steam on all of them.

I then bought several low-cost games like Bioshock Infinite, Batman: Arkham Knight, Shadow of the Tomb Raider, and Borderlands 2.

I also tried CS2 as a more modern title, but it was too heavy for the small 1.5GB VRAM / 8GB RAM VM. I mean, it started, but it was super slow, though it was playable if I dropped it to 800x600.

I think there are some issues with DX12, because in Shadow of the Tomb Raider there are weird artifacts and other texture bugs, and in DX11 mode, the performance hit is big.

I didn’t encounter any hard problems or crashes, even when running all 12 VMs simultaneously with games running on all of them. I forgot to mention, due to the limited VRAM, I set the resolution to 720p. So all in all, it’s performing really well.

If anyone else tries this, a quick tip: the Intel Driver Update utility within the VM crashes 9 out of 10 times, but it seems to install everything correctly anyway. It also takes a long time to install, so if it freezes, just wait for like 30 minutes and track the Task Manager to see if it’s still processing. If your CPU utilization drops to 0%, then you can just kill the frozen update window, no worries.

I ordered a 5950X to see if that improves performance, but right now even the 5600G is performing OK. The CPU usage casually floats around 200-900%.
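
To put the numbers from this setup side by side, here is a quick back-of-the-envelope calculation. The VM sizes and host RAM are taken from the post; the 12 hardware threads for the 5600G (6 cores with SMT) are an assumption added here, and the disk figure assumes every VM disk is fully allocated.

```shell
# Figures from the post: 12 VMs, each with 2 vCPUs, 8 GB RAM and 128 GB disk,
# on a host with 128 GB RAM and a 2 TB SSD.
# Assumption: the Ryzen 5600G exposes 12 hardware threads (6 cores, SMT).
vms=12
host_threads=12
host_ram_gb=128

vcpus_total=$((vms * 2))        # 24 vCPUs in total
ram_total_gb=$((vms * 8))       # 96 GB of guest RAM
disk_total_gb=$((vms * 128))    # 1536 GB of guest disk

echo "vCPUs: ${vcpus_total} on ${host_threads} threads ($((vcpus_total / host_threads)):1 overcommit)"
echo "RAM:   ${ram_total_gb} GB allocated, $((host_ram_gb - ram_total_gb)) GB left for the host"
echo "Disk:  ${disk_total_gb} GB allocated if every VM disk is full"
```

With 12 threads the host tops out at 1200% CPU, so the reported 200-900% leaves some headroom even at 2:1 vCPU overcommit.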

This could actually be the host motherboard needing a BIOS update, as the firmware update to the B50 changes PCIe resources and ReBAR a bit. Does your host system have a BIOS update available? What motherboard?

Hi Wendell, there is indeed a slightly newer bios. Let me install it. I’ll post again in case it still doesn’t work. I have to verify the exact model, but I believe it is an Asus board.

Hmm, I got the same as you after looking at dmesg a bit more closely

[    2.558349] xe 0000:00:02.0: [drm] Support for SR-IOV is not available

Looking more closely at the devices under /sys/ it seems to confirm no SR-IOV

[root@betelgeuse xe_0000_00_02.0]# ls -l /sys/devices/xe_0000_00_02.0/
total 0
-r--r--r-- 1 root root 4096 Nov  1 15:03 cpumask
drwxr-xr-x 2 root root    0 Nov  1 15:03 events
drwxr-xr-x 2 root root    0 Nov  1 15:03 format
-rw-r--r-- 1 root root 4096 Nov  1 15:03 perf_event_mux_interval_ms
drwxr-xr-x 2 root root    0 Nov  1 15:03 power
lrwxrwxrwx 1 root root    0 Nov  1 14:41 subsystem -> ../../bus/event_source
-r--r--r-- 1 root root 4096 Nov  1 15:03 type
-rw-r--r-- 1 root root 4096 Nov  1 14:41 uevent
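
A note on the listing above: its subsystem link points at bus/event_source, which suggests this directory is the driver's perf event source rather than the PCI device itself, so sriov_* files would not be expected there either way. The PCI device normally lives under /sys/bus/pci/devices/. Below is a small sketch for reading the SR-IOV attributes from there; sriov_status is a made-up helper name. As far as I understand, these files come from the PCI core whenever the capability exists, so the dmesg message remains the deciding check for driver support.

```shell
# Hypothetical helper: report SR-IOV limits for a PCI device from sysfs.
# $1 is the device directory, e.g. /sys/bus/pci/devices/0000:00:02.0
sriov_status() (
    dev=$1
    if [ -r "$dev/sriov_totalvfs" ]; then
        # sriov_totalvfs: maximum number of VFs; sriov_numvfs: VFs enabled now.
        echo "total=$(cat "$dev/sriov_totalvfs") enabled=$(cat "$dev/sriov_numvfs")"
    else
        echo "no sriov_* attributes under $dev"
        exit 1
    fi
)

# Example:
#   sriov_status /sys/bus/pci/devices/0000:00:02.0
```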

It was fun learning how to install KVM on cachy though :sweat_smile:

I’m going to switch drivers and see if I can get the full fat i915 driver installed on cachy.


Update

oOo

[    4.446811] intel_rapl_msr: PL4 support detected.
[    4.866007] i915: You are using the i915-sriov-dkms module, a ported version of the i915/xe module with SR-IOV support.
[    4.866009] i915: Please file any bug report at https://github.com/strongtz/i915-sriov-dkms/issues/new.
[    4.866009] i915: Module Homepage: https://github.com/strongtz/i915-sriov-dkms
[    4.866154] i915 0000:00:02.0: Running in SR-IOV PF mode
[root@betelgeuse 0000:00:02.0]# ls /sys/devices/pci0000:00/0000:00:02.0
aer                   consistent_dma_mask_bits   dma_mask_bits    graphics  i2c-6        irq             modalias     remove        resource0_wc      sriov_drivers_autoprobe  sriov_vf_total_msix
ari_enabled           consumer:pci:0000:00:1f.3  driver           i2c-10    i2c-7        link            msi_bus      rescan        resource2         sriov_numvfs             subsystem
boot_vga              current_link_speed         driver_override  i2c-11    i2c-8        local_cpulist   msi_irqs     reset         resource2_resize  sriov_offset             subsystem_device
broken_parity_status  current_link_width         drm              i2c-12    i2c-9        local_cpus      numa_node    reset_method  resource2_wc      sriov_stride             subsystem_vendor
class                 d3cold_allowed             enable           i2c-13    iommu        max_link_speed  power        resource      revision          sriov_totalvfs           uevent
config                device                     firmware_node    i2c-5     iommu_group  max_link_width  power_state  resource0     rom               sriov_vf_device          vendor

[root@betelgeuse pci0000:00]# echo 2 > sriov_numvfs
[root@betelgeuse pci0000:00]# dmesg
[  425.113842] pci 0000:00:02.1: [8086:7d55] type 00 class 0x030000 PCIe Root Complex Integrated Endpoint
[  425.113921] pci 0000:00:02.1: DMAR: Skip IOMMU disabling for graphics
[  425.114088] pci 0000:00:02.1: Adding to iommu group 20
[  425.114099] pci 0000:00:02.1: vgaarb: bridge control possible
[  425.114100] pci 0000:00:02.1: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[  425.114107] i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
[  425.114252] i915 0000:00:02.1: enabling device (0000 -> 0002)
[  425.114347] i915 0000:00:02.1: [drm] Cannot read valid display IP version; disabling display.
[  425.114349] i915 0000:00:02.1: [drm] Found meteorlake (device ID 7d55) integrated display version 0.00 stepping N/A
[  425.114360] i915 0000:00:02.1: Running in SR-IOV VF mode
[  425.114363] i915 0000:00:02.1: [drm] preliminary graphics version 12.70 media version 13.00
[  425.115219] i915 0000:00:02.1: [drm] GT1: GUC: interface version 0.1.24.4
[  425.115259] i915 0000:00:02.1: [drm] GT1: GMD_ID 0x3400008 version 13.0 step C0
[  425.116611] i915 0000:00:02.1: [drm] GT0: GUC: interface version 0.1.24.4
[  425.116654] i915 0000:00:02.1: [drm] GT0: GMD_ID 0x311c004 version 12.71 step B0
[  425.117750] i915 0000:00:02.1: [drm] GT1: GUC: interface version 0.1.24.4
[  425.117792] i915 0000:00:02.1: [drm] GT1: GMD_ID 0x3400008 version 13.0 step C0
[  425.118586] i915 0000:00:02.1: [drm] VT-d active for gfx access
[  425.118653] i915 0000:00:02.1: [drm] Using Transparent Hugepages
[  425.118905] i915 0000:00:02.1: [drm] GT0: GUC: interface version 0.1.24.4
[  425.120914] i915 0000:00:02.1: [drm] GT0: GUC: interface version 0.1.24.4
[  425.121708] i915 0000:00:02.1: GuC firmware PRELOADED version 0.0 submission:SR-IOV VF
[  425.121712] i915 0000:00:02.1: HuC firmware N/A
[  425.125754] i915 0000:00:02.1: [drm] GT1: GUC: interface version 0.1.24.4
[  425.127695] i915 0000:00:02.1: [drm] GT1: GUC: interface version 0.1.24.4
[  425.128279] i915 0000:00:02.1: GuC firmware PRELOADED version 0.0 submission:SR-IOV VF
[  425.128282] i915 0000:00:02.1: HuC firmware PRELOADED
[  425.141471] i915 0000:00:02.1: [drm] PMU not supported for this GPU.
[  425.141762] [drm] Initialized i915 1.6.0 for 0000:00:02.1 on minor 0
[  425.142337] pci 0000:00:02.2: [8086:7d55] type 00 class 0x030000 PCIe Root Complex Integrated Endpoint
[  425.142427] pci 0000:00:02.2: DMAR: Skip IOMMU disabling for graphics
[  425.142666] pci 0000:00:02.2: Adding to iommu group 21
[  425.142687] pci 0000:00:02.2: vgaarb: bridge control possible
[  425.142688] pci 0000:00:02.2: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[  425.142695] i915 0000:00:02.0: vgaarb: VGA decodes changed: olddecodes=none,decodes=none:owns=io+mem
[  425.142701] i915 0000:00:02.1: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[  425.142851] i915 0000:00:02.2: enabling device (0000 -> 0002)
[  425.142928] i915 0000:00:02.2: [drm] Cannot read valid display IP version; disabling display.
[  425.142931] i915 0000:00:02.2: [drm] Found meteorlake (device ID 7d55) integrated display version 0.00 stepping N/A
[  425.142945] i915 0000:00:02.2: Running in SR-IOV VF mode
[  425.142946] i915 0000:00:02.2: [drm] preliminary graphics version 12.70 media version 13.00
[  425.143327] i915 0000:00:02.2: [drm] GT1: GUC: interface version 0.1.24.4
[  425.143372] i915 0000:00:02.2: [drm] GT1: GMD_ID 0x3400008 version 13.0 step C0
[  425.144038] i915 0000:00:02.2: [drm] GT0: GUC: interface version 0.1.24.4
[  425.144080] i915 0000:00:02.2: [drm] GT0: GMD_ID 0x311c004 version 12.71 step B0
[  425.144844] i915 0000:00:02.2: [drm] GT1: GUC: interface version 0.1.24.4
[  425.144885] i915 0000:00:02.2: [drm] GT1: GMD_ID 0x3400008 version 13.0 step C0
[  425.145519] i915 0000:00:02.2: [drm] VT-d active for gfx access
[  425.145573] i915 0000:00:02.2: [drm] Using Transparent Hugepages
[  425.145797] i915 0000:00:02.2: [drm] GT0: GUC: interface version 0.1.24.4
[  425.147848] i915 0000:00:02.2: [drm] GT0: GUC: interface version 0.1.24.4
[  425.148428] i915 0000:00:02.2: GuC firmware PRELOADED version 0.0 submission:SR-IOV VF
[  425.148432] i915 0000:00:02.2: HuC firmware N/A
[  425.151128] i915 0000:00:02.2: [drm] GT1: GUC: interface version 0.1.24.4
[  425.152909] i915 0000:00:02.2: [drm] GT1: GUC: interface version 0.1.24.4
[  425.153478] i915 0000:00:02.2: GuC firmware PRELOADED version 0.0 submission:SR-IOV VF
[  425.153481] i915 0000:00:02.2: HuC firmware PRELOADED
[  425.156436] i915 0000:00:02.2: [drm] PMU not supported for this GPU.
[  425.156634] [drm] Initialized i915 1.6.0 for 0000:00:02.2 on minor 2
[  425.156934] i915 0000:00:02.0: Enabled 2 VFs
[root@betelgeuse pci0000:00]#
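
The echo 2 > sriov_numvfs step above can be wrapped with a couple of sanity checks. This is a sketch only: set_numvfs is a made-up helper name, and it needs root when pointed at a real device. It relies on two standard sysfs attributes, sriov_totalvfs and sriov_numvfs.

```shell
# Hypothetical helper: enable N virtual functions on a physical function.
# $1 = PF device directory in sysfs, $2 = requested number of VFs.
set_numvfs() (
    dev=$1
    want=$2
    total=$(cat "$dev/sriov_totalvfs") || exit 1
    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs, device supports at most $total" >&2
        exit 1
    fi
    # The kernel refuses to change a non-zero VF count directly,
    # so reset to 0 before writing the new value.
    echo 0 > "$dev/sriov_numvfs" &&
        echo "$want" > "$dev/sriov_numvfs"
)

# Example matching the two VFs created above:
#   set_numvfs /sys/bus/pci/devices/0000:00:02.0 2
```

The setting does not survive a reboot, so it has to be reapplied at boot by whatever mechanism you prefer.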

Thank you Wendell you rock, this did the trick. I would have never thought to check for a BIOS update, but that worked.

SR-IOV is now available and I can continue setting it up. Thank goodness the GPU wasn’t bricked.

riu@pve01 ~> sudo lspci -s 0a:00 -v
[sudo] password for riu:
0a:00.0 VGA compatible controller: Intel Corporation Battlemage G21 [Intel Graphics] (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation Device 1114
        Flags: bus master, fast devsel, latency 0, IRQ 90, IOMMU group 25
        Memory at 2200000000 (64-bit, prefetchable) [size=16M]
        Memory at 2400000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at fc200000 [disabled] [size=2M]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Endpoint, IntMsgNum 0
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
        Capabilities: [d0] Power Management version 3
        Capabilities: [100] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [110] Null
        Capabilities: [200] Address Translation Service (ATS)
        Capabilities: [420] Physical Resizable BAR
        Capabilities: [220] Virtual Resizable BAR
        Capabilities: [320] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [400] Latency Tolerance Reporting
        Kernel driver in use: xe
        Kernel modules: xe

neat, what board and model was that specifically?

It’s an Asus Crosshair Hero VIII with a 5950x. So yeah, always update your BIOS xD
I’m running a small Proxmox machine for various home services. It’s now up and running with 4 VMs using the B50 for graphics acceleration.

I read somewhere in this thread that someone wanted to make this work on Linux for Remote Sessions but failed.

Well, I spent some time testing that out too, and it works! I’m using Arch, Sway, and Sunshine. No Dummy Plug needed.

Actually, I noticed that it seems to run even better on Linux VMs, probably because my Arch VM only takes about 800MB of RAM at idle.

(Side note: CS2 didn’t start at all, but I’m looking into that right now.)

But I can confirm: Gaming on 12 Linux VMs simultaneously works too.