How to make Linux obey kernel parameters?

well then, is it feasible to automatically detect this degraded state and then spam resets until it gets into a working state? to save me the trouble of having to physically observe the VM and then reboot it myself multiple times?
also: sounds like something AMD could fix with a patch to their closed-source firmware. why haven't they? seems like something QA would find out about quickly.

No. The GPU is a black box; what we do know about its internals, and how to perform even what we can so far (the BACO reset), took months of R&D.

As for QA… normally a bus reset only happens on system reboot or during GPU crash recovery.

A sane device listens for this request and actually resets its entire state as if the system had been rebooted. AMD GPUs ignore it entirely, and if you try to issue a bus reset, the PCIe bus loses track of the GPU's state: the PCIe controller now expects the device to re-negotiate its link, and because it doesn't, it “falls off the bus”.

Instead, AMD opted to try to recover from crashes, and there are hundreds (last I looked) of lines of code in amdgpu trying to figure out what went wrong and to recover the GPU from its crashed state. It's honestly ludicrous how many man-hours have been wasted when they could have just effectively power-cycled the GPU back to a known working state like NVIDIA does.

BACO stands for Bus Active, Controller Off. It's how vendor-reset tries to recover the device (and it's proprietary): by literally turning it off and on again. But as the name implies, this is not a full reset; the PCIe bus controller and the PSP (Platform Security Processor) in the GPU itself do not get reset.
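
For anyone who wants to poke at this themselves, here is a rough sketch of how vendor-reset is typically selected and triggered through sysfs. This assumes the vendor-reset module is already loaded and a kernel new enough to expose `reset_method` (roughly 5.15+); the PCI address is a placeholder.

```python
#!/usr/bin/env python3
# Rough sketch: ask the kernel to use the device-specific (vendor-reset/BACO)
# reset path for an AMD GPU and then trigger it via sysfs. Requires root,
# the vendor-reset module loaded, and a kernel that exposes reset_method.
# The PCI address below is a placeholder.

from pathlib import Path

BDF = "0000:0b:00.0"  # placeholder: your GPU's PCI address
dev = Path("/sys/bus/pci/devices") / BDF

# Prefer the device-specific quirk (vendor-reset) over FLR/bus-reset attempts.
(dev / "reset_method").write_text("device_specific\n")

# Trigger the reset; the kernel calls into vendor-reset's handler.
(dev / "reset").write_text("1\n")

print(f"requested device_specific reset of {BDF}")
```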


just tested it. i cannot recreate the problematic behavior on spice/qxl.

there ought to be some sort of standard or whitepaper mandating this for PCIe devices.


That's just it, there is. As AMD do not handle a reset correctly, technically their devices cannot claim to be PCIe compliant.


so then why hasn't anybody sued them? these cards ARE advertised as PCIe cards, right? isn't it false advertising to claim a standard your product doesn't meet?

and why doesn't their QA department at the very least test for compliance with industry standards? seems unbelievable that a company as big as AMD would be so incompetent and/or lazy on so many levels that nobody anywhere caught this before release.

See: https://www.reddit.com/r/Amd/comments/jehkey/will_big_navi_support_function_level_reset_flr/

The main reasons people do not complain/sue are:

  1. The GPU should never crash to begin with, so a bus reset should not be needed. Complaints of AMD GPUs black-screening under Windows are very common, and that is actually the GPU crashing and attempting to recover.

  2. It's niche; once a system is up and running, a bus reset is a last resort to recover a faulting GPU. We just happen to take advantage of it for VFIO passthrough, which these GPUs were never advertised to support.

  3. People often misdiagnose the issue, as you have here, and blame it on some other component of a very complex system.


well, it seems GPU replacement is in order. do any of the Navi series cards have these issues fixed?

None; in fact, the latest generation is even worse, and it's a real hit and miss whether they work or not.

As much as I hate the tyrannical anti-competitive behemoth that is NVIDIA, if you want a reliable VFIO experience, use NVIDIA.

Another option is the Intel Arc GPUs. They work, but they have their own issues, and I would honestly wait until the drivers are in better condition first.


i haven't owned an nvidia gpu in over a decade. and a decade ago i had such a difficult time that i vowed never to consider them again.

seems sticking with my current hardware is unfortunately my best option here.
i will work to develop a way for the VM to detect its degraded state and then reboot itself automatically when that is detected.
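
something along these lines is what i'm thinking, run inside the guest. the marker strings are guesses i'll have to tune against what the degraded boot actually logs:

```python
#!/usr/bin/env python3
# in-guest watchdog sketch: poll the kernel log for amdgpu hang/reset
# messages and reboot the VM when they appear. the marker strings below are
# guesses and need tuning against what the guest really logs when it comes
# up degraded. needs root for dmesg and the reboot.

import subprocess
import time

# hypothetical markers of the degraded state
MARKERS = ("ring gfx timeout", "GPU reset begin", "failed to initialize")

def looks_degraded() -> bool:
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return any(marker in line
               for line in log.splitlines() if "amdgpu" in line
               for marker in MARKERS)

while True:
    if looks_degraded():
        subprocess.run(["systemctl", "reboot"])
        break
    time.sleep(30)
```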

No worries, sorry I can’t give you a better solution.

I had also made the same vow at one point, but when faced with no other option I had to resort to being gouged for hardware again.

I did, though, manage to stick to my non-Intel vow; been running AMD since 2006 🙂


oh man. you were there for the first gen bulldozers. we likely share many of the same scars then.

it should be noted: when booting the VM to Windows 10, i cannot recreate the problematic behavior. it only occurs when the guest boots to Linux.


I did not know that failed resets could result in this behavior; glad you people figured it out! I can only second what was mentioned about NVIDIA: they are pricey, but at least they work as intended without jumping through hoops.


I worked through this problem a couple of years ago and came to the conclusion that there was no practical way to increase the effective PCIe payload size (outside of very custom hardware).

What I recall is that the Linux kernel parameter is only the start; there are other constraints baked into the hardware that pretty much lock us to 512-byte PCIe transactions.

It's possible that I missed something. I don't recall the exact details, only that this turned out to be a dead end. (In my case I was trying to increase AXI transaction sizes inside an FPGA, but was limited by the PCIe transaction size on the PCIe bus and bridge.)

With the above said, the PCIe transaction size is likely not affecting your performance.
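
If it helps to sanity-check your own setup, here is a quick sketch that reads back what MaxPayload actually got negotiated per device, rather than trusting that the kernel parameter took effect. It just parses `lspci -vv` (run as root for full capability output); the simple match works with the usual DevCap/DevCtl output but isn't guaranteed against every lspci version.

```python
#!/usr/bin/env python3
# Sketch: report the MaxPayload values lspci shows for each device, instead
# of assuming the kernel parameter took effect. Parses `lspci -vv`; run as
# root so the capability sections are visible.

import re
import subprocess

out = subprocess.run(["lspci", "-vv"], capture_output=True, text=True).stdout

device = None
for line in out.splitlines():
    if line and not line[0].isspace():
        device = line.split()[0]  # e.g. "0b:00.0" from the device header line
    match = re.search(r"MaxPayload (\d+) bytes", line)
    if match:
        print(f"{device}: {line.strip()}")
```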
