S7150x2 appearing as W7100

I’ve got a fleet of 12 machines all running 4 S7150x2 GPUs for SR-IOV - these machines usually work without issue but I’ve noticed on 3 of them this has happened:

They appear as W7100 GPUs instead of what they actually are - S7150x2s.

I’ve flipped through the UEFI and tried toggling through a few options as well as reinstalling the cards or swapping them with cards from other machines. So far nothing seems to work.

I’ve seen people convert W7100 GPUs into the S7150 but I’ve never heard of an actual S7150x2 being identified as a W7100 (let alone one with two GPUs on a single board).

Any idea on what this could be or how it might be possible to fix or at least narrow the possible cause?

An S7150x2 is just two W7100s on one card. They are the exact same die, just 2x.

It probably only lists 5 GPUs on that page
Can you use more than the 4 GPUs in Blender or another program?

Which driver are you using?

Yes I’m aware they are the same chipset. The issue is that some of the machines I have in my server room (12 servers total) identify the GPUs as W7100 cards while they are S7150x2s.

I have installed a number of these physically identical Dell servers with the S7150X2 and all of them (except for 3 of the 12) have identified the cards in Ubuntu 20.04 as S7150X2s. Other than having installed the base system I have not installed anything else on the 3 machines affected by this issue. On the other machines (the ones which correctly identify the physically installed cards as S7150X2s) I am running AMD GPU-IOV module (GIM).

I believe this issue could be related to a UEFI setting, or perhaps there is a defect in the motherboards of the three machines that have been affected so far. I would like to find a way to resolve this issue, or at least narrow the search for what might be causing it, as I’d like to get these machines into production when possible.

Perhaps this could be a driver issue (or a Mesa issue?), where the card still shows up as an S7150 at the hardware level but is misidentified as a W7100 by some software in your OS.

Can you use lspci to get the device ID?
1002:692b = W7100
1002:6929 = S7150

If it is 6929, then it is definitely an OS software issue.
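A rough way to script that check across a fleet, as a sketch only. The helper names (`classify_lspci`, `run_lspci`) are made up for illustration, the ID table holds just the two IDs listed above, and it assumes `lspci` from pciutils is installed on the machine.

```python
import re
import subprocess

# The two device IDs quoted above; vendor 1002 is AMD/ATI.
KNOWN_IDS = {
    "1002:692b": "W7100",
    "1002:6929": "S7150",
}

# Matches the [vendor:device] pair that `lspci -nn` prints on each line.
# Class codes such as [0300] have no colon, so they are skipped.
ID_PATTERN = re.compile(r"\[([0-9a-f]{4}:[0-9a-f]{4})\]", re.IGNORECASE)


def classify_lspci(text):
    """Return (slot, vendor:device, label) for each known ID found in lspci -nn output."""
    found = []
    for line in text.splitlines():
        if not line.strip():
            continue
        slot = line.split()[0]  # e.g. "03:00.0"
        for pci_id in ID_PATTERN.findall(line):
            label = KNOWN_IDS.get(pci_id.lower())
            if label:
                found.append((slot, pci_id.lower(), label))
    return found


def run_lspci():
    """Run lspci with numeric IDs (-nn), limited to vendor 1002 (-d 1002:)."""
    result = subprocess.run(
        ["lspci", "-nn", "-d", "1002:"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `classify_lspci(run_lspci())` on each server and comparing the results should show quickly which machines report 692b and which report 6929.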

@TheCakeIsNaOH
[image: 720x291 screenshot]

Now that is weird.
I have no idea then.

Do they have identical addin cards?

Like some have 10G NICs while others use onboard?

Or some use M.2 drives?

Those particular cards were part of a lightweight VDI setup; you’ll probably need to reflash them. They have the wrong PCIe ID.

I’ve swapped in cards from another machine that identifies them as S7150x2 GPUs. When they are installed in the affected machines, they are identified as W7100s. I’ll try reflashing as well, but the issue seems to be isolated to 3 servers: any number of identical cards can be swapped in from any working production machine (which does see them correctly as S7150x2s) with no change in how the GPUs are identified. I suspect this may be something specific to 3 of the 12 otherwise identical servers, perhaps relating to the motherboard. I’m not sure exactly, but reflashing might be worthwhile if only to eliminate the possibility that the cards are the issue.

@GigaBusterEXE Ya the addin cards are all identical across the fleet of 12. All in the same slots as well in each case.

CSM vs. non-CSM in the BIOS? UEFI? Above 4G decoding? Are all those settings the same?

Will take a look. Above 4G decoding is on in BIOS/UEFI. Not sure on CSM, but I’ll compare setting by setting. Might need to check UEFI image versions as well.

Enabled all of those on a fresh install on a clean machine. The bug came up on a fourth machine. Now I’m genuinely at a loss. If I end up figuring this one out I’ll post the solution.

Dump the GPU ROM, if it’ll let you, and diff them:
https://listman.redhat.com/archives/vfio-users/2019-March/msg00005.html
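One possible way to do the dump and the diff from Python, offered as a sketch and not necessarily the method in the linked post. It assumes the kernel exposes a `rom` file for the device under sysfs and that you run it as root. The function names and the example PCI address are placeholders.

```python
from pathlib import Path


def dump_rom(pci_addr, out_path):
    """Dump a device's expansion ROM through sysfs (needs root).

    pci_addr is a full address such as "0000:03:00.0" (placeholder value).
    """
    rom = Path("/sys/bus/pci/devices") / pci_addr / "rom"
    rom.write_text("1")        # writing 1 enables reads of the ROM
    try:
        data = rom.read_bytes()
    finally:
        rom.write_text("0")    # writing 0 disables them again
    Path(out_path).write_bytes(data)
    return data


def diff_ranges(a, b):
    """Return (start, end) byte ranges, end exclusive, where two dumps differ."""
    ranges = []
    start = None
    common = min(len(a), len(b))
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i          # a differing run begins here
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(a) != len(b):
        # Anything past the shorter dump counts as one differing range.
        ranges.append((common, max(len(a), len(b))))
    return ranges
```

With a dump from a good card and one from an affected card, `diff_ranges(good, bad)` returns an empty list if the ROMs are identical, otherwise the offsets worth looking at in a hex viewer.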

To be clear, the same physical GPU behaves two different ways depending on the system it is in?

That does appear to be the case. I’ve taken boards that show up as S7150x2s in a machine not affected by this bug and swapped them into one of the now four affected machines, where they begin to appear as W7100s. When swapped back into the machine they were originally installed in, they appear as S7150x2 boards again. Given this behaviour I would be surprised if the GPU firmware were the culprit, although I suppose that may be possible… I guess? As far as I can tell, the issue is that the machines they are installed in identify them incorrectly.

Hmmm. Perhaps it is firmware related. I dumped two ROMs: one from a normal card, another from a card affected by the W7100 bug. I’ll try reflashing the affected cards and see what happens.
[image: vbindiff output]


Definitely a different ROM.

This still wouldn’t explain why installing a normal card in one of the affected machines causes the card to identify as a W7100 when it is otherwise identified as an S7150x2 in regular machines. Would I sound crazy to inquire if these cards have several regions of memory where the VBIOS may be stored? EEPROM partitions?
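On the question of several regions: a PCI expansion ROM can carry more than one image (for example a legacy x86 image and an EFI image), and each image has its own PCI data structure ("PCIR") with a vendor ID, device ID and code type. Whether that plays any part in this bug is only a guess, but listing the images in a dump is cheap. The sketch below walks the image chain using the standard PCI expansion ROM header layout; the function name and the dict keys are made up for illustration.

```python
import struct

# Code type values from the PCI data structure.
CODE_TYPES = {0: "x86 legacy", 1: "Open Firmware", 2: "PA-RISC", 3: "EFI"}


def list_rom_images(rom):
    """Walk the image chain in a PCI expansion ROM dump (bytes).

    Returns one dict per image: offset, vendor:device ID and code type.
    """
    images = []
    offset = 0
    while offset + 0x1A <= len(rom):
        if rom[offset:offset + 2] != b"\x55\xaa":   # ROM header signature
            break
        # 16-bit little-endian pointer to the PCIR structure, relative to the image.
        (pcir_ptr,) = struct.unpack_from("<H", rom, offset + 0x18)
        pcir = offset + pcir_ptr
        if pcir + 0x16 > len(rom) or rom[pcir:pcir + 4] != b"PCIR":
            break
        vendor, device = struct.unpack_from("<HH", rom, pcir + 4)
        (length_blocks,) = struct.unpack_from("<H", rom, pcir + 0x10)  # 512-byte units
        code_type = rom[pcir + 0x14]
        indicator = rom[pcir + 0x15]
        images.append({
            "offset": offset,
            "id": f"{vendor:04x}:{device:04x}",
            "code_type": CODE_TYPES.get(code_type, f"unknown ({code_type})"),
        })
        if indicator & 0x80 or length_blocks == 0:  # bit 7 marks the last image
            break
        offset += length_blocks * 512
    return images
```

Running this over the dump from a good card and the dump from an affected card would show whether either ROM holds more than one image and which device ID each image advertises.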

This has me curious now.

For anyone who reads this later, the tl;dr on the fix: I pulled a working S7150x2 VBIOS file off one of the unaffected machines and flashed it with Strunzdesign/amdvbflash (AMD vBIOS flash utility for Linux) from GitHub.
Pretty simple it turns out.

Still odd that normal unaffected cards getting swapped in still showed up as W7100s.

Any chance one of those machines is using UEFI and one is using legacy/BIOS?