I have run up against an issue trying to bring up some RTX A800s I’ve… ahem… borrowed… on Linux 6.8.5-028 (Ubuntu 22.04 LTS).
[ 597.093543] pci 0000:01:00.0: enabling device (0000 -> 0002)
[ 661.970513] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 661.970521] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
[ 661.971789] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR2 is 0M @ 0x0 (PCI:0000:01:00.0)
[ 661.971791] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR3 is 0M @ 0x0 (PCI:0000:01:00.0)
[ 661.971792] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR4 is 0M @ 0x0 (PCI:0000:01:00.0)
[ 661.971794] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR5 is 0M @ 0x0 (PCI:0000:01:00.0)
[ 662.016974] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.171.04 Tue Mar 19 20:30:00 UTC 2024
[ 662.029245] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.171.04 Tue Mar 19 20:26:16 UTC 2024
[ 662.037498] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 662.067860] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x40:762)
[ 662.068430] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 662.068543] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[ 662.068680] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[ 662.155850] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
I have gone so far as to prevent nvidia.ko from loading at boot and to inspect the BAR space situation with lspci; it doesn’t seem entirely unreasonable.
However, I am not able to apply my usual fix, manually resizing the BAR space to make it work, because of the errors. The card does not accept these changes, and you can see from the output that BAR 0 and BAR 3 are fixed sizes; at least the kernel thinks so, anyway.
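For anyone following along, here is a minimal sketch of the arithmetic behind one way of doing a manual resize: the kernel's sysfs `resourceN_resize` file. It is an assumption that this is the mechanism in play here, and the helper names are made up for illustration. As far as I know, reading the file returns a hex bitmask where bit i means a size of 2^i MB, and writing back the bit index requests that size; the write is expected to fail while a driver is bound to the device.

```python
def decode_rebar_mask(mask: int) -> list[int]:
    """Return the supported BAR sizes in MB for a resize bitmask.

    Assumes bit i of the mask means a supported size of 2**i MB.
    """
    return [1 << bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


def resize_value(size_mb: int, mask: int) -> int:
    """Return the integer to write to request size_mb, or raise if unsupported."""
    if size_mb <= 0 or size_mb & (size_mb - 1):
        raise ValueError(f"{size_mb} MB is not a power of two")
    bit = size_mb.bit_length() - 1  # log2 of the size in MB
    if not mask & (1 << bit):
        raise ValueError(f"{size_mb} MB is not advertised as supported")
    return bit


# Hypothetical usage (run as root, with nvidia.ko not bound to the device):
#   path = "/sys/bus/pci/devices/0000:01:00.0/resource1_resize"
#   mask = int(open(path).read().strip(), 16)
#   open(path, "w").write(str(resize_value(32768, mask)))
```

With the BAR 1 sizes from the lspci output below (64MB through 64GB) the mask would be 0x1FFC0, and a 32GB request maps to the value 15.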
Upon loading nvidia.ko:
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:20f6] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:180a]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 279
IOMMU group: 34
Region 0: Memory at f5000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at <ignored> (64-bit, prefetchable)
Region 3: Memory at <ignored> (64-bit, prefetchable)
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] Null
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s (ok), Width x16 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Via message B,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
Vector table: BAR=0 offset=00b90000
PBA: BAR=0 offset=00ba0000
[...snip...]
Capabilities: [bb0 v1] Physical Resizable BAR
BAR 0: current size: 16MB, supported: 16MB
BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB 64GB
BAR 3: current size: 32MB, supported: 32MB
Capabilities: [bcc v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
IOVSta: Migration-
Initial VFs: 16, Total VFs: 16, Number of VFs: 0, Function Dependency Link: 00
VF offset: 4, stride: 1, Device ID: 20f6
Supported Page Size: 00000573, System Page Size: 00000001
Region 1: Memory at 0000000000000000 (64-bit, prefetchable)
Region 3: Memory at 0000000000000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
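The telling part of the output above is the `Region` lines: BAR 1 and BAR 3 show `<ignored>` instead of an address, which lines up with the "BAR1 is 0M @ 0x0" complaints from NVRM. A rough sketch for spotting this automatically in one device's `lspci -vv` text follows; the function name is hypothetical, and it assumes the device's own Region lines come before the first `Capabilities:` line, as they do in the dump above.

```python
import re

# Matches lines such as "Region 1: Memory at <ignored> (64-bit, prefetchable)"
REGION_RE = re.compile(r"^Region (\d+): Memory at (\S+)")


def unassigned_bars(lspci_device_text: str) -> list[int]:
    """Return BAR numbers of memory regions that have no usable address.

    Expects the `lspci -vv` text for a single device.
    """
    bad = []
    for line in lspci_device_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Capabilities:"):
            # Region lines after this point (e.g. under SR-IOV) describe
            # VF BARs, which are legitimately zero while VFs are disabled.
            break
        match = REGION_RE.match(stripped)
        if not match:
            continue
        address = match.group(2)
        # "<ignored>" / "<unassigned>" or an all-zero address means no window.
        if address.startswith("<") or set(address) == {"0"}:
            bad.append(int(match.group(1)))
    return bad
```

Fed the dump above, this should report BARs 1 and 3, while the 16M region at f5000000 passes.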
I am still investigating this issue. So far I have found no way to use this GPU on a Threadripper system under Linux; the same GPU works fine under Windows on the Sapphire Rapids platform.
This occurs with nvidia-535 and nvidia-550 at least. I also tested back to nvidia-470 and that didn’t immediately work, but I didn’t 100% confirm if it was the same issue. Theoretically the A800 is supported since … version 430… ish?
I have so far been able to reproduce this issue on Threadripper 5000 series CPUs on WRX80 motherboards, and Threadripper 7000 series CPUs on ASRock and Asus WRX90 motherboards… at least with RTX A800.
The Fix
So with modern GPUs, generally, you must make sure a number of BIOS options are set: enable IOMMU (Auto isn’t good enough), enable AER, enable PCIe 10-bit tag support, and possibly enable SR-IOV if you want it with these higher-end GPUs.
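BIOS toggles are hard to verify from inside the OS, but the IOMMU one at least leaves a trace: when the kernel has an active IOMMU, `/sys/kernel/iommu_groups` is populated with one directory per group (the lspci dump above shows the card in group 34). A small sketch, with a hypothetical function name, that only checks for that trace and says nothing about AER or 10-bit tags:

```python
import os


def iommu_group_count(root: str = "/sys/kernel/iommu_groups") -> int:
    """Count IOMMU group directories; 0 suggests the IOMMU is not active."""
    try:
        return sum(1 for entry in os.scandir(root) if entry.is_dir())
    except OSError:
        # Directory missing or unreadable: treat as "no groups visible".
        return 0
```

A result of 0 after setting the option in the BIOS would suggest the setting did not take effect or that the kernel is not using it.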
It may also be necessary to set kernel parameters like pci=realloc. Always check cat /proc/cmdline to be sure that changes to the kernel command line “stick”, since how you set them depends on whether you boot with GRUB, systemd-boot, or something else.
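Checking the command line by eye is easy to get wrong when `pci=` carries several comma-separated options. Here is a minimal sketch of that check; the function names are made up for illustration, and it assumes `realloc` and `realloc=on` are the spellings that turn reallocation on.

```python
def pci_options(cmdline: str) -> set[str]:
    """Collect the comma-separated options from every pci= parameter."""
    options = set()
    for token in cmdline.split():
        if token.startswith("pci="):
            options.update(opt for opt in token[len("pci="):].split(",") if opt)
    return options


def has_pci_realloc(cmdline: str) -> bool:
    """True if the command line asks the kernel to reallocate PCI resources."""
    return bool(pci_options(cmdline) & {"realloc", "realloc=on"})
```

On a live system the input would be the contents of /proc/cmdline, for example `has_pci_realloc(open("/proc/cmdline").read())`.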
I ended up booting to rescue mode (init=/bin/bash) and stepping through a bunch of PCIe nonsense to troubleshoot what ultimately came down to needing the right combination of BIOS settings, above, which strangely didn’t matter for RTX A6000 GPUs but did matter for these.