Troubleshooting NVIDIA GPUs for AI on 6.x Linux Kernels (Ubuntu 22.04 LTS, 6.5.0-28-generic)

I have run into an issue trying to bring up some RTX A800s I’ve… ahem… borrowed… on Linux 6.8.5-028 (Ubuntu 22.04 LTS).

[  597.093543] pci 0000:01:00.0: enabling device (0000 -> 0002)
[  661.970513] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[  661.970521] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
[  661.971789] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR2 is 0M @ 0x0 (PCI:0000:01:00.0)
[  661.971791] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR3 is 0M @ 0x0 (PCI:0000:01:00.0)
[  661.971792] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR4 is 0M @ 0x0 (PCI:0000:01:00.0)
[  661.971794] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR5 is 0M @ 0x0 (PCI:0000:01:00.0)
[  662.016974] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.171.04  Tue Mar 19 20:30:00 UTC 2024
[  662.029245] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.171.04  Tue Mar 19 20:26:16 UTC 2024
[  662.037498] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[  662.067860] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x40:762)
[  662.068430] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  662.068543] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[  662.068680] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[  662.155850] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
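For anyone comparing their own logs against the one above, here is a minimal sketch that pulls the BAR lines out of saved dmesg text. The regular expression is only matched against the NVRM message format shown in this post; the function name and the idea of feeding it a saved log are my own assumptions, not anything the driver provides.

```python
import re

# Matches the continuation line printed by the NVIDIA driver, e.g.
#   NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
# The pattern is derived only from the log lines quoted in this thread.
BAR_LINE = re.compile(
    r"NVRM: BAR(?P<bar>\d+) is (?P<size>\d+)M @ (?P<addr>0x[0-9a-fA-F]+) "
    r"\(PCI:(?P<dev>[0-9a-fA-F:.]+)\)"
)


def reported_bars(dmesg_text):
    """Return (device, bar, size_mb, address) for each BAR line NVRM logged."""
    found = []
    for line in dmesg_text.splitlines():
        m = BAR_LINE.search(line)
        if m:
            found.append(
                (m["dev"], int(m["bar"]), int(m["size"]), int(m["addr"], 16))
            )
    return found


def unassigned_bars(dmesg_text):
    """Return only the BAR numbers reported as 0M @ 0x0, i.e. never assigned."""
    return [bar for _, bar, size, addr in reported_bars(dmesg_text)
            if size == 0 and addr == 0]
```

Passing the text of a saved `dmesg` capture to `unassigned_bars()` would list the BARs the driver is complaining about; for the log above that should be BARs 1 through 5.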

I have gone so far as to prevent nvidia.ko from loading at boot and inspect the BAR space with lspci; it doesn’t seem entirely unreasonable:

However I am not able to



because of the errors. This is what I would normally do to manually resize the BAR space to make it work. The card does not accept these changes, and you can see from the output that BAR 0 and BAR 3 are fixed sizes; at least, the kernel thinks so.
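The exact command from the post is missing, so the following is not a reconstruction of it. It is a small sketch of how the kernel's sysfs resizable-BAR interface encodes sizes, assuming a kernel that exposes `resourceN_resize` files under `/sys/bus/pci/devices/<device>/`. Reading such a file gives a hex bitmap in which bit n means a size of 2^n MB is supported, and writing expects the bit position of the size you want. The helper names are hypothetical.

```python
def decode_rebar_bitmap(hex_bitmap):
    """Decode a resourceN_resize bitmap into supported BAR sizes in MB.

    Bit n set means a BAR size of 2**n MB is supported (bit 0 = 1 MB).
    """
    bits = int(hex_bitmap.strip(), 16)
    return [1 << n for n in range(bits.bit_length()) if (bits >> n) & 1]


def resize_index(size_mb):
    """Return the bit position to write to resourceN_resize for a given size."""
    if size_mb <= 0 or size_mb & (size_mb - 1):
        raise ValueError("BAR size must be a positive power of two, in MB")
    return size_mb.bit_length() - 1


# A bitmap with bits 6..16 set corresponds to 64 MB through 64 GB, the same
# range lspci lists for BAR 1 in this thread (the bitmap value itself is an
# illustration, not something read from the author's machine).
sizes = decode_rebar_bitmap("000000000001ffc0")
print(sizes[0], sizes[-1])  # 64 65536
print(resize_index(256))    # 8
```

A bitmap with a single bit set, as you would expect for a BAR that only lists one supported size, leaves nothing to resize to, which fits the observation that BAR 0 and BAR 3 look fixed.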

Upon loading nvidia.ko:

01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:20f6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:180a]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 279
        IOMMU group: 34
        Region 0: Memory at f5000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at <ignored> (64-bit, prefetchable)
        Region 3: Memory at <ignored> (64-bit, prefetchable)
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] Null
        Capabilities: [78] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s (ok), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Via message B,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [b4] Vendor Specific Information: Len=14 <?>
        Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
                Vector table: BAR=0 offset=00b90000
                PBA: BAR=0 offset=00ba0000
[...snip...]
    Capabilities: [bb0 v1] Physical Resizable BAR
                BAR 0: current size: 16MB, supported: 16MB
                BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB 64GB
                BAR 3: current size: 32MB, supported: 32MB
        Capabilities: [bcc v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
                IOVSta: Migration-
                Initial VFs: 16, Total VFs: 16, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 4, stride: 1, Device ID: 20f6
                Supported Page Size: 00000573, System Page Size: 00000001
                Region 1: Memory at 0000000000000000 (64-bit, prefetchable)
                Region 3: Memory at 0000000000000000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
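To spot the unassigned regions in a dump like the one above without reading every line, a rough sketch along these lines may help. It assumes the text of `lspci -vv` output for a single device, and it stops at the first `Capabilities:` line so that the SR-IOV VF regions further down are not mistaken for the device's own BARs. The function and field names are made up for illustration.

```python
import re

# Matches lines such as:
#   Region 0: Memory at f5000000 (32-bit, non-prefetchable) [size=16M]
#   Region 1: Memory at <ignored> (64-bit, prefetchable)
REGION = re.compile(
    r"^\s*Region (?P<idx>\d+): Memory at (?P<addr>\S+) \((?P<attrs>[^)]*)\)"
    r"(?: \[size=(?P<size>\w+)\])?"
)


def parse_regions(lspci_text):
    """Return the top-level memory regions of one device from lspci -vv text."""
    regions = []
    for line in lspci_text.splitlines():
        # Assumption: a device's own regions are printed before its
        # capability list, so anything after this point is ignored.
        if line.lstrip().startswith("Capabilities:"):
            break
        m = REGION.match(line)
        if m:
            addr = m["addr"]
            regions.append({
                "index": int(m["idx"]),
                "address": addr,
                "attrs": m["attrs"],
                "size": m["size"],  # None when lspci prints no size
                "assigned": addr != "<ignored>" and set(addr) != {"0"},
            })
    return regions
```

For the output in this post, this would report Region 0 as assigned at 16M and Regions 1 and 3 as unassigned, which lines up with the BAR errors NVRM prints.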

I am still investigating this issue. So far I have found no way to use this GPU on a Threadripper system under Linux, though it works fine under Windows on the Sapphire Rapids platform.

This occurs with nvidia-535 and nvidia-550 at least. I also tested back to nvidia-470 and that didn’t immediately work, but I didn’t 100% confirm if it was the same issue. Theoretically the A800 is supported since … version 430… ish?

I have so far been able to reproduce this issue on Threadripper 5000 series CPUs on WRX80 motherboards, and Threadripper 7000 series CPUs on ASRock and Asus WRX90 motherboards… at least with RTX A800.

The Fix

With modern GPUs you generally have to make sure a number of BIOS options are set: enable IOMMU (Auto isn’t good enough), enable AER, enable PCIe 10-bit tag support and, with these higher-end GPUs, possibly SR-IOV if you want it.

It may also be necessary to set kernel parameters like pci=realloc. Always check cat /proc/cmdline to be sure that changes to the kernel command line “stick”; are you using GRUB, systemd-boot, or something else?
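If you want to script that check rather than eyeball it, here is a minimal sketch. It splits the command line on whitespace and looks for a parameter and, optionally, one of its comma-separated options. The function names are hypothetical, and the matching is deliberately simple: it does not handle quoted values containing spaces, and an option written as `realloc=on` would not match a search for plain `realloc`.

```python
def has_kernel_param(cmdline, name, value=None):
    """Return True if `name` is on the command line.

    If `value` is given, it must also appear among the parameter's
    comma-separated options, e.g. name="pci", value="realloc".
    """
    for token in cmdline.split():
        key, _, val = token.partition("=")
        if key != name:
            continue
        if value is None or value in val.split(","):
            return True
    return False


def running_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was actually booted with."""
    with open(path) as f:
        return f.read().strip()
```

On the affected machine, `has_kernel_param(running_cmdline(), "pci", "realloc")` returning False after a reboot would mean the bootloader configuration you edited is not the one actually in use.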

I ended up booting to rescue mode (init=/bin/bash) and stepping through a bunch of PCIe nonsense to troubleshoot what ultimately came down to needing the right combination of BIOS settings, above, which strangely didn’t matter for RTX A6000 GPUs but did matter for these.


I have seen Phoronix report on a few issues on AMD systems whose fixes improved performance, but those may not have anything to do with the PCI I/O regions.

Normally I would tell you to switch to the latest kernel, but Nvidia’s proprietary kernel-space drivers get in the way of that. I am assuming you can’t use NVK.

How does it work with Linux on the Sapphire Rapids platform? Just to rule things out.

inspecting the situation with bar space and lspci

What was the report?

However I am not able to

The text box after is blank.