RX Vega 64 Issues with Movement Between Systems

I have a V64 that I moved from a known-working Win10 config to an Ubuntu 16.04.3 (kernel 4.15.3) system. Linux worked just fine, but when I moved the GPU back to the Win10 system it had worked in before, it no longer registered as an OpenCL device. The same software I was running before refused to use it, and I now have a useless V64 sitting on my desk.

Is there a fix for this?

Win10 System Specs:

Xeon X5450
Asus p5q-em
6GB DDR2
256GB SSD
GTX 690
(Plus, initially, a V64 on a riser)

Ubuntu System Specs:

Core 2 Quad Q8300
Random Dell OEM mobo
4GB DDR3
16GB SSD

(I know this stuff is old and totally jerry-rigged; I just wanted to know if there was any hope of ever getting it working again.)

Thanks in Advance!

You could try the proprietary Linux driver to see if it re-issues the graphics card’s “gpuid”, then reinstall Win10 if that’s the OS you want.

What does GPUID do, and why does it change? I have a WX 5100 that had this same problem: worked on Windows, worked on Linux, then didn’t work on Windows. The result was a $500 brick. (On Windows)
(ノò_ó)ノ︵ ┻━┻

Also, I was using the proprietary AMDGPU-PRO driver. I tried using ROCm on top of it, but discovered that I was using outdated hardware that ROCm didn’t support. :frowning:

I meant “gpuid” as in graphics card identification, not as in an actual command or Linux term.

You could type lspci into a terminal to see whether the card is listed alongside the other identified devices, and whether it registers there at all. (Kernel 4.15+ is mandatory for Vega.)
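
For example, something like this should do it (the grep pattern is only there to narrow the output down to display adapters; run plain lspci if you want the full list):

# list PCI devices with vendor/device IDs and keep only the display adapters
lspci -nn | grep -iE "vga|display|3d"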

The issue itself could be a number of things; one that’s not uncommon in my experience is messed-up firmware, which will most likely be fixed later in the 4.15 series (now at 4.15.3) or in 4.16.

Another thing you could do is install a testing version of an upcoming kernel (last I checked there was 4.15.3.2), where the issue(s) may be fixed. It’s doubtful, but still worth a shot.
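
If you go that route on Ubuntu, the mainline kernel builds are published as plain .deb packages at https://kernel.ubuntu.com/~kernel-ppa/mainline/ and installing one is roughly this (file names below are placeholders; download the actual linux-image and linux-headers packages for whichever build you pick):

# after downloading the headers and image .deb files for the chosen build:
sudo dpkg -i linux-headers-*.deb linux-image-*.deb
sudo reboot
# once back up, confirm which kernel is running
uname -r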

So, something about the card (GPUID) is being altered by the Linux kernel, and there doesn’t seem to be a fix for it?

Also, I was referring to GPUID as an entity.

Is the VBIOS being reflashed with new IDs on linux?

That’s my guess, yes. Kernel 4.15+ was a massive step for many things, including AMD Vega support, FreeSync, and the Meltdown/Spectre mitigations. It’s all very new and bleeding edge, so it’s somewhat optimistic to expect too much from the developers at this point (4.15 is still very recent, and there is probably a good reason it’s not an LTS release).

I don’t think it actually re-flashes the card, though it may have given the card another/generic ID by mistake? (Speculation.) To fix it, you could probably install Windows and, if your BIOS supports it, get the drivers from the BIOS, or install another version of Windows 7/8/8.1 to “fix” the card with a driver update. Or RMA it, of course (it’s viable for RMA).

I’d never get it back! Retailers’d just refund me and charge me $1K for a new one.

How do I give the card its proper GPUID manually? Or is there a way to do this?

Also, what is its proper GPUID, for reference?

True, that could happen

You don’t want to even think about doing that manually; your best bet is to install the Radeon software on Windows and let it take care of it automatically. Installing a Windows version and a compatible driver version, or even new firmware (research it well if you’re taking that chance, as it may brick the card), could take care of that a lot faster than any Linux distro/kernel version.

It should be recognized as “Radeon RX Vega - VEGA10” (proprietary driver) or “VEGA 64” (open source) in the About section, or as “VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] (revision number)” in the output of the terminal command lspci.

Huh. Cool. Thanks!!

OK, what is going on in this thread???
I just find one confused person trying to help another confused person. Sorry, but I don’t mince words.

Let’s start this from the beginning.

Document what you see with your GPU and add lots of screenshots so I can get an idea of the situation.

Do a clean install of the Radeon Crimson drivers.
GPU-Z pic. Device Manager pic. Test a game. Test your OpenCL application of choice.

Also there is no magic GPUID changing or VBIOS flashing going on.
Where did this idea even come from?
In case someone is confused about what a Shadow ROM is: no, it’s not a VBIOS flash. It just means that at boot, the contents of the GPU’s VBIOS are copied into a location in system RAM so that subsequent accesses to the VBIOS data don’t hit (or wear out) the slow VBIOS flash EEPROM chip.

In short, I need details.

The more information the better, because in order to understand anyone’s problem and help, I need to know what you did, what your system state is, what you are trying to do, what you have done that didn’t work, etc.

Going along with what @catsay is mentioning, you could be running into a GPU firmware issue.

The cards get initialized in GNU/Linux with the firmware that is shipped/maintained with the kernel. It is possible that the firmware loaded on your Ubuntu system is older, newer, or missing features compared to what the MS Windows system is expecting.
https://wiki.archlinux.org/index.php/AMDGPU#Loading
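
If you want to check what the kernel actually has available and loaded for the card, something like this works on most distros (exact file names vary by card and kernel version):

# firmware files the amdgpu driver can load for Vega 10 live here
ls /lib/firmware/amdgpu/ | grep -i vega10
# see what the driver reported loading at boot
dmesg | grep -iE "amdgpu.*(firmware|ucode)"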

As @catsay mentioned, do a clean install of Crimson.

I’ll admit to being confused. Don’t mince words if you don’t have to. :smile:
First of all, my config is total ghetto. A mix of old and new.
Device Manager: [screenshot]

GPU-Z: [screenshot]
There’s one thing GPU-Z doesn’t report correctly, though: the V64 isn’t running at PCIe 3.0 x16. Quite far from it; it’s probably running at 2.0 or 1.0 x1.

I’m currently mining with it on my current system to facilitate larger upgrades in the future. (TR Anyone?) It’s literally just sitting on top of my current system plugged into a USB riser and power. Not my final use case for it, just something to help out a bit. :smile:

OpenCL App: [screenshot]

It [the OpenCL platform] most certainly does exist, as this exact config worked just fine for months on end. I just moved the GPU to a Core 2 Quad server running Ubuntu 16.04.3 LTS with an updated 4.15.3 kernel for some testing, to see if it was compatible with machine-learning stuff. The closed-source AMDGPU-PRO driver was installed before ROCm. Then I discovered that ROCm requires PCIe 3.0 [facepalm] and, as such, moved it right back. Win10 wouldn’t recognize the device, and here we are.

All pictures were from before any reinstallation of drivers, etc.

Update: all is well after the Crimson reinstall. Full steam ahead.
Now I would like to learn how and why this happened. Also, @catsay mentioned something called a Shadow ROM. When does the Linux kernel load this, and from where? Does the kernel control the GPU’s functions based on the copy kept in RAM, or the copy on the EEPROM?

Sorry for the pile of questions. I just love to learn from people who know what they’re doing. :slight_smile: I’ve been dying to know about this stuff for a while now, especially since I bricked my WX 5100 for use on Windows. (No post, but the same install circumstances as the V64.)

BIOS Shadowing

In essence, and this is something all major operating systems and machines in general do at startup, ROM BIOS shadowing is the process of copying the BIOS from slow ROM (usually a serial flash chip) into RAM and using either hardware or CPU features to remap that particular section of RAM into the address space where the BIOS normally resides.

Effectively the shadow BIOS becomes a faster overlay over the slow BIOS ROM.
This is done because reading RAM is much faster than reading ROM chips, thus BIOS-intensive operations are substantially faster.

A similar thing applies to GPUs: a lot of the code they need to run and routinely access is stored in a slow serial flash chip on the card, so at startup they copy this BIOS code into faster VRAM/RAM for quicker access later.
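
On Linux you can actually look at that shadowed ROM through sysfs; a rough sketch (the PCI address 0000:0a:00.0 is just an example, substitute your card’s address from lspci):

# enable reading of the card's ROM through sysfs, dump it to a file, then disable again
echo 1 | sudo tee /sys/bus/pci/devices/0000:0a:00.0/rom
sudo cat /sys/bus/pci/devices/0000:0a:00.0/rom > vega64-vbios.rom
echo 0 | sudo tee /sys/bus/pci/devices/0000:0a:00.0/rom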

The firmware, drivers, or the operating system program something called PCI BARs (Base Address Registers) on the PCI controller, which tell the PCI device how it fits into the system. Various aspects of the card are mapped into the system’s I/O port address space or memory-mapped address space.

This covers setting up things such as where the shadowed VBIOS is located, the video RAM address space, and the other configuration and communication channels that make up the graphics card’s functions, which finally allow the system (and you) to see and use the PCI device, in this case the graphics card.
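
You can see those BAR mappings for yourself with lspci in verbose mode (again, 0a:00.0 is just an example slot; use your card’s address):

# -v prints the memory/I-O regions (the BARs) the card has been assigned
sudo lspci -v -s 0a:00.0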

Now, looking at your GPU-Z screenshot, I noticed something peculiar: GPU-Z indicates PhysX support. This leads me to suspect that there was a conflict between the Nvidia drivers for the GTX 690 and the RX Vega driver, and that something in the AMD driver configuration was overridden, likely because some part of the Nvidia driver path loaded first.

But It’s good to see that it’s now been resolved with a reinstall. :slight_smile:

Cool! And thanks a ton!

If the VBIOS is loaded from the EEPROM chip at bootup, would it not be somewhat of an easy fix to just modify the kernel to “replace” that BIOS with another BIOS in a specified location in the filesystem, effectively “soft-modding” the BIOS?

Well, yes and no.
The GPU itself is the first to load the VBIOS into VRAM and uses it to initialize itself.
After that, any number of extra steps may occur depending on the specific hardware and operating system involved, but you generally cannot change the card’s operating state after that point.

The exception is resetting the PCI device and all of the trickery done in GPU passthrough setups, where you can supply a VBIOS file for the virtual machine to use via rombar or romfile. This only takes effect inside the VM, and only for sessions where the “romfile” argument is present.

This is a great way, for example, to test out a BIOS without risking bricking the card with a bad flash. But it carries its own risks, and overall this isn’t a very well explored or well documented area of GPU BIOS modding, so I can’t say for sure.
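
For reference, the relevant bit of a QEMU command line looks roughly like this (the PCI address and ROM path are placeholders for your own card and file):

# pass the card through via vfio-pci and make the VM see the supplied file
# as the card's ROM instead of the real on-board VBIOS
-device vfio-pci,host=0a:00.0,romfile=/path/to/vega64-vbios.rom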

The test OS would run Linux.

So, if you were to, say, intercept that BIOS traveling from the EEPROM to VRAM (if that’s even possible at that stage), you would be able to “soft-mod” the BIOS? Or you could reset the PCI device and have the new ROM take effect?

Would it be possible to reset the PCI device after the fact, just like what happens during a PCIe hot-plug (which I know needs extensive hardware and software support)?
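
I know there’s at least a sysfs interface that looks like a software-level version of that; is this the kind of thing you mean? (The address is a placeholder, and I have no idea whether a Vega actually comes back cleanly from this.)

# tear the device down from the PCI tree...
echo 1 | sudo tee /sys/bus/pci/devices/0000:0a:00.0/remove
# ...then have the kernel re-enumerate the bus and re-probe it
echo 1 | sudo tee /sys/bus/pci/rescan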