Kernel 6.11 has broken my IOMMU groups

I’ve been trying the 6.11 kernel since it plays better with the NVIDIA 560 drivers, but realised that it has changed my IOMMU groups.

I’ve confirmed that in 6.10.10 I have the same group layout as in the current LTS (6.6.52). I have also not found any similar report for this particular kernel update.

I went from a total of 14 groups, with my secondary GPU conveniently isolated, to only eight groups, with the secondary GPU grouped with my main NVMe storage among others.
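For anyone wanting to compare layouts between kernels, here is a minimal sketch that dumps the current grouping. It assumes the kernel exposes groups under /sys/kernel/iommu_groups (the usual sysfs location); the function names are my own and purely illustrative. Run it once per kernel and diff the two outputs.

```python
from pathlib import Path


def read_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [device addresses]} as read from sysfs.

    An empty dict means the directory is missing, which usually
    indicates the IOMMU is disabled or not exposed by this kernel.
    """
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        # Skip anything that is not a numbered group with a devices/ dir.
        if not group_dir.name.isdigit() or not devices_dir.is_dir():
            continue
        groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return groups


def format_groups(groups):
    """One line per device, sorted by group number, so the text diffs cleanly."""
    lines = []
    for number in sorted(groups):
        for address in groups[number]:
            lines.append(f"group {number:2d}  {address}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_groups(read_iommu_groups()))
```

Saving the output to a file on the good kernel and again on the bad kernel gives you something concrete to attach to a bug report.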

Do any of you know if this is the intended behaviour or should I file a bug report? Where is the proper channel to get the info and logs to the correct dev?

Thanks in advance.

There’s still a Bugzilla instance at bugzilla.kernel.org and the process that gets more than zero attention from the Linux Kernel Mailing List is:

  • Find the mainline kernel commit that brought about this change
  • Write a Bugzilla entry that specifies your hardware and the commit that caused the regression, and that makes the case that you need 6.11 for the NVIDIA 560 drivers as a reason for them to fix 6.11
  • Copy the content of the bug report to the LKML and the PCI/PCIe and ACPI sub-lists

That ‘find the commit’ step sounds daunting, but you get to use a binary chop: subdivide the search space into a good half and a bad half, then resume the search on the bad half until you find the culprit. If there are 4000 or so commits, it should take about 12 iterations of the test to complete (log2 of 4000 is just under 12), not 4000 iterations.
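To make the arithmetic concrete, here is a small sketch of the binary chop over a simplified, strictly linear list of commits. Real kernel history has merges, so git bisect does more work than this, and the culprit index below is a made-up value for illustration only.

```python
import math


def find_first_bad(commits, is_bad):
    """Binary search over commits ordered oldest to newest.

    Assumes the last commit is known bad and that every commit after
    the first bad one is also bad. Returns (first_bad, tests_run).
    """
    lo, hi = 0, len(commits) - 1
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1  # one "build, boot, check the groups" cycle
        if is_bad(commits[mid]):
            hi = mid        # culprit is mid or earlier
        else:
            lo = mid + 1    # culprit is after mid
    return commits[lo], tests


if __name__ == "__main__":
    commits = list(range(4000))   # stand-ins for commit hashes
    culprit = 2717                # hypothetical first bad commit
    found, tests = find_first_bad(commits, lambda c: c >= culprit)
    print("found:", found, "tests:", tests)
    print("upper bound:", math.ceil(math.log2(len(commits))))
```

Each halving costs one build-and-reboot cycle, so the test count stays near the logarithm of the range rather than its size.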

Step 0 – confirm that it’s not caused by patches added by your Linux distribution: clone the mainline kernel git tree, copy the config from /boot/ and build a local edition using your distribution’s step by step instructions. Install and reboot to see the IOMMU groupings.
Step 1 – find the last-known good marker, say the hash from the commit that’s tagged 6.10.10, and a bad commit hash, say the one that’s tagged 6.11 (git version tags are like symlinks to a commit hash).
Step 2 – check the git bisect documentation, then begin with git bisect start [bad hash] [good hash]
Step 3 (and loop) – compile this revision, test it, and report to git bisect either git bisect good or git bisect bad.
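The "test it" part of Step 3 can be made less error-prone with a small check run after each reboot. This is only a sketch: it assumes each PCI device has an iommu_group link under /sys/bus/pci/devices, the address 0000:01:00.0 is a placeholder for your own GPU, and treating other functions of the same slot (such as the GPU's audio function) as acceptable group mates is my own assumption about what counts as "isolated".

```python
import os
import sys


def group_members(address, pci_root="/sys/bus/pci/devices"):
    """List every device that shares an IOMMU group with `address`."""
    devices_dir = os.path.join(pci_root, address, "iommu_group", "devices")
    return sorted(os.listdir(devices_dir))


def verdict(address, members):
    """Return ("good" | "bad", strangers).

    A stranger is a group member on a different bus/slot than the
    device under test; sibling functions of the same slot are allowed.
    """
    slot = address.rsplit(".", 1)[0]
    strangers = [m for m in members if m.rsplit(".", 1)[0] != slot]
    return ("bad" if strangers else "good"), strangers


def main(argv):
    # Placeholder address; pass your GPU's address as the first argument.
    address = argv[1] if len(argv) > 1 else "0000:01:00.0"
    try:
        members = group_members(address)
    except OSError as error:
        print(f"cannot read the IOMMU group of {address}: {error}")
        return
    result, strangers = verdict(address, members)
    print(f"{address} shares its group with: {strangers or 'nothing else'}")
    print(f"suggested next command: git bisect {result}")


if __name__ == "__main__":
    main(sys.argv)
```

The script only suggests the verdict. You still type git bisect good or git bisect bad yourself, since each test needs a reboot into the freshly built kernel.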

When completed, you should have the commit that causes these IOMMU groups to fall into a different layout from the one you had before. Nothing highlighted in the summaries of the 6.11 merge window that I read (LWN first half / LWN second half) points to an ACPI or PCIe change.

Good luck,
K3n.


I’m seeing the same thing, wondering if you reported it.

My GPU is in the same IOMMU group as other devices that I cannot bind to vfio-pci (pass-through) on kernel 6.11.3 but the GPU is in its own IOMMU group on kernel 6.10.12.

I did some more testing and found that it works on 6.10.14 but breaks on 6.11.0.

More testing shows that it works on 720261c but breaks on 9330697, so it broke in one of these 38 commits:

933069701c1b507825b514317d4edd5d3fd9d417
527eff227d4321c6ea453db1083bc4fdd4d3a3e8
fbc90c042cd1dc7258ebfebe6d226017e5b5ac8c
7846b618e0a4c3e08888099d1d4512722b39ca99
33c9de2960d347c06d016c2c07ac4aa855cd75f0
8e313211f7d46d42b6aa7601b972fe89dcc4a076
2c9b3512402ed192d1f43f4531fb5da947e72bd0
c43a20e4a520b37c2ef6d4f422de989992c9129f
9fa23750c6e591a6e095057ec07c81dddec0d72c
8326f5e1a47b1a657524678cb62b264a84fbea7e
13a7871541b7f5fa6d81e76f160644d1e118b6b0
f557af081de6b45a25e27d633b4d8d2dbc2f428e
d2be38b9a5514dbc7dc0c96a2a7f619fcddce00d
3c3ff7be9729959699eb6cbc7fd7303566d74069
3f386cb8ee9f04ff4be164ca7a1d0ef3f81f7374
8e5c0abfa02d85b9cd2419567ad2d73ed8fe4b74
ef035628c326af9aa645af1b91fbb72fdfec874e
acc5965b9ff8a1889f5b51466562896d59c6e1b9
09ea8089abb5d851ce08a9b1a43706e42ef39db2
04d17331ca33744e1426fdeee7ba5e975c4b2239
aba9753c0677e860f982edff98c7fe5a2b97758c
d7e78951a8b8b53e4d52c689d927a6887e6cfadf
53a5182c8a6805d3096336709ba5790d16f8c369
33cf098770930a9b782d3983e1b0127bdc203216
9c67f9084af3f84e63abb44b82316fe0dbccd5d5
12cc3d5389f313f07222b000fefa2cd8fc98c4f8
a4f9285520584977127946a22eab2adfbc87d1bf
f4f92db4391285ef3a688cdad25d5c76db200a30
f66b07c56119833b88bffa4ecaf9f983834675de
4305ca0087dd99c3c3e0e2ac8a228b7e53a21c78
661fb4e68cf62bf52eacfcd9b3b0d93fe4260c5b
afd81d914f6fb3e74a46bf5d0dd0b028591ea22e
ebcfbf02abfbecc144440ff797419cc95cb047fe
3d51520954154a476bfdacf9427acd1d9538734c
ef7c8f2b1fb46d3fc7a46d64bb73919e288ba547
07e773db19f16f4111795b658c4748da22c927bb
c434e25b62f8efcfbb6bf1f7ce55960206c1137e
5c28424e9a348f95e3c634fe2ed6da8af29cc870


This is why I gave up on the whole thing. It’s so niche and bleeding edge that it’s not worth it if kernel updates break it. (I run Debian Testing.) Still looking for a 4K/120fps KVM switch.