I’ve got a fleet of 12 machines all running 4 S7150x2 GPUs for SR-IOV - these machines usually work without issue but I’ve noticed on 3 of them this has happened:
They appear as W7100 GPUs instead of what they actually are - S7150x2s.
I’ve flipped through the UEFI and tried toggling through a few options as well as reinstalling the cards or swapping them with cards from other machines. So far nothing seems to work.
I’ve seen people convert W7100 GPUs into the S7150 but I’ve never heard of an actual S7150x2 being identified as a W7100 (let alone one with two GPUs on a single board).
Any idea on what this could be or how it might be possible to fix or at least narrow the possible cause?
Yes, I’m aware they are the same chipset. The issue is that some of the 12 servers in my server room identify the GPUs as W7100 cards even though they are S7150x2s.
I have installed a number of these physically identical Dell servers with the S7150X2, and all but 3 of the 12 have identified the cards in Ubuntu 20.04 as S7150X2s. Other than the base system, I have not installed anything on the 3 affected machines. On the other machines (the ones which correctly identify the installed cards as S7150X2s) I am running the AMD GPU-IOV Module (GIM).
I believe this issue could be related to a UEFI setting, or perhaps there is a defect in the motherboards of the three machines that have been affected so far. I would like to find a way to resolve this issue, or at least narrow the search for what might be causing it, as I’d like to get these machines into production when possible.
Perhaps this could be a driver issue (or a Mesa issue?), where the card is still showing up as an S7150 at the hardware level but is being misreported as a W7100 by some software in your OS.
Can you use lspci (for example, lspci -nn) to get the device ID? 1002:692b = W7100, 1002:6929 = S7150.
If it is 6929, then it is definitely an OS software issue.
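A minimal sketch of that check in Python, assuming a Linux host where sysfs lists each PCI function under /sys/bus/pci/devices. It reads the raw vendor and device IDs rather than the names lspci prints, since those names come from a lookup database and not from the card. The table holds only the two IDs quoted above; the function name and the root parameter are illustrative, not part of any existing tool.

```python
from pathlib import Path

# Device IDs quoted in this thread (vendor 0x1002 = AMD/ATI).
KNOWN_IDS = {
    0x692B: "W7100",
    0x6929: "S7150",
}
AMD_VENDOR = 0x1002


def list_amd_devices(root="/sys/bus/pci/devices"):
    """Return (pci_address, device_id, label) for every AMD PCI function."""
    base = Path(root)
    if not base.is_dir():
        return []  # no sysfs PCI tree at this path
    found = []
    for dev in sorted(base.iterdir()):
        try:
            vendor = int((dev / "vendor").read_text().strip(), 16)
            device = int((dev / "device").read_text().strip(), 16)
        except (OSError, ValueError):
            continue  # skip entries we cannot read or parse
        if vendor != AMD_VENDOR:
            continue
        label = KNOWN_IDS.get(device, "other AMD device")
        found.append((dev.name, device, label))
    return found


if __name__ == "__main__":
    for address, device, label in list_amd_devices():
        print(f"{address}  1002:{device:04x}  {label}")
```

Running it on one working and one affected server should show whether the hardware itself reports a different ID or only the displayed name differs.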
I’ve swapped in cards from another machine, where they are identified as S7150x2 GPUs. Once installed in the affected machines, they are identified as W7100s. I’ll try reflashing as well, but the issue seems to be isolated to 3 servers: any number of identical cards can be swapped in from a working production machine (which does see them correctly as S7150x2s) with no change in how the affected servers identify them. I suspect this may be something specific to 3 of the 12 otherwise identical servers, perhaps relating to the motherboard. I’m not sure exactly, but I suppose reflashing might be worthwhile, if only to eliminate the possibility that the cards are the issue.
Will take a look. Above 4G decoding is on in the BIOS/UEFI. Not sure about CSM, but I will take a look and compare setting by setting. Might need to check the UEFI image versions as well.
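For the firmware-version part of that comparison, here is a small Python sketch, assuming Linux hosts that expose DMI data under /sys/class/dmi/id. The idea is to run read_firmware_info() on a working server and on an affected one, then pass both results to diff_fields(). Both function names are hypothetical. This only covers version and model strings; individual UEFI settings are not visible here and would need the vendor's own export tooling.

```python
from pathlib import Path

# DMI fields the Linux kernel exposes as plain text files.
FIELDS = ("bios_vendor", "bios_version", "bios_date", "product_name", "board_name")


def read_firmware_info(root="/sys/class/dmi/id"):
    """Collect BIOS/UEFI identification strings for this machine."""
    info = {}
    for field in FIELDS:
        try:
            info[field] = (Path(root) / field).read_text().strip()
        except OSError:
            info[field] = None  # field missing or unreadable
    return info


def diff_fields(good, bad):
    """Return {field: (good_value, bad_value)} for every field that differs."""
    keys = sorted(set(good) | set(bad))
    return {k: (good.get(k), bad.get(k)) for k in keys if good.get(k) != bad.get(k)}


if __name__ == "__main__":
    for field, value in read_firmware_info().items():
        print(f"{field}: {value}")
```

An empty result from diff_fields() would suggest the two servers run the same UEFI image, which shifts suspicion toward settings or hardware.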
Enabled all of those on a fresh install on a clean machine. The bug came up on a fourth machine. Now I’m genuinely at a loss. If I end up figuring this one out I’ll post the solution.
That does appear to be the case. I’ve taken boards that appear as S7150x2s in a machine not affected by this bug and installed them in one of the now four affected machines, where they begin to appear as W7100s. When swapped back into their original machine, they appear as S7150x2 boards again. Given this behaviour I would be surprised if the GPU firmware were the culprit, although I suppose that may be possible… I guess? As far as I can tell, the issue is that the affected machines identify the cards incorrectly.
Hmmm. Perhaps it is firmware related. One dumped from a normal card, another dumped from a card affected by the W7100 bug. I’ll try reflashing the affected cards and see what happens.
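A rough Python sketch for comparing two such dumps, assuming each is a standard PCI option ROM image saved to a file. It prints the size and SHA-256 of each file, the vendor and device ID recorded in the image's PCI data structure (the "PCIR" block that the ROM header points to at offset 0x18), and the first offsets where the images differ. The function names and the file names in the usage note are placeholders. Whether the card reports the same ID at runtime as the one stored in its image is a separate question this does not answer.

```python
import hashlib
import struct


def rom_pci_ids(rom: bytes):
    """Return (vendor_id, device_id) from a PCI option ROM image, or None."""
    if len(rom) < 0x1A or rom[0:2] != b"\x55\xaa":
        return None  # not a PCI expansion ROM header
    (pcir_offset,) = struct.unpack_from("<H", rom, 0x18)
    if pcir_offset + 8 > len(rom) or rom[pcir_offset:pcir_offset + 4] != b"PCIR":
        return None  # PCI data structure missing or truncated
    vendor_id, device_id = struct.unpack_from("<HH", rom, pcir_offset + 4)
    return vendor_id, device_id


def differing_offsets(a: bytes, b: bytes, limit=16):
    """Return up to `limit` byte offsets where the two images differ."""
    offsets = []
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            offsets.append(i)
            if len(offsets) >= limit:
                break
    return offsets


def summarize(path_a, path_b):
    """Print size, hash, and embedded PCI IDs for two dumps, then their diffs."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    for path, rom in ((path_a, a), (path_b, b)):
        digest = hashlib.sha256(rom).hexdigest()
        print(path, len(rom), "bytes", digest, rom_pci_ids(rom))
    # Only the overlapping part is compared; the sizes above show any length gap.
    print("first differing offsets:", [hex(o) for o in differing_offsets(a, b)])
```

Usage would look like summarize("normal.rom", "affected.rom"), with those names standing in for wherever the two dumps were saved.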
This still wouldn’t explain why installing a normal card in one of the affected machines causes the card to identify as a W7100 when it is otherwise identified as an S7150x2 in regular machines. Would I sound crazy to inquire if these cards have several regions of memory where the VBIOS may be stored? EEPROM partitions?