AMD Raid 1 on Windows randomly died - NVME kicked out of array [semi-SOLVED]

Hellohi,
I recently set up a RAID 1 array of two M.2 NVMe drives to boot Windows from, so that a disk failure would not leave me with an unbootable computer (and lost data).

Unfortunately this computer is now refusing to boot after around a month of use without issues.
It prompts me for my bitlocker key as it should and it accepts it. Then it shows me a splash screen with the caption “Preparing Automatic Repair” promptly followed by a windows recovery screen that first asks for a keyboard layout and then gives me the option to either troubleshoot or turn off the PC.

I find this pretty weird because as far as my understanding goes even if one of the disks were to have failed I should still be able to boot.

The RAIDXpert2 utility in the BIOS claims that the array is critical (which at this point I fully believe), it shows both disks as disabled and has one of them labelled as removed. The Controller shows both disks, one of them marked as ready the other as online. The Board Explorer claims that M.2_1 is empty and that M.2_1 has the AMD-RAID.
This entire thing has left me a bit confused.

I am not aware of any Windows update (or any other software really) being installed since the last time it has successfully booted. The M.2s are brand new WD_Black SN850Xs so I didn’t expect those to just fail either.

I think I could now assign the disk that was apparently kicked out of the array as a hot spare, which should make the array non-critical again. But since I would expect to still be able to boot from the disk that remains in the array, I’m wondering whether that remaining disk is the corrupt one. If it is, and the kicked disk is actually fine, a rebuild would overwrite a potentially good disk with garbage.

Is there a way I can inspect these disks to verify if/which one of them still has good data on it to be sure I’m not breaking anything? I would hope something like booting from a liveusb and then mounting them would work - I did a very primitive attempt of this, which failed. Do I have a chance of getting this to work if I get AMD Raid drivers and support for Bitlocker and NTFS installed onto the liveusb somehow (that sounds fun)?
Could I get any suggestions on how to approach fixing this? Or is this just doomed?
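
One way to approach the "which disk still has good data" question without writing to either drive is to compare the two mirror members block by block. Below is a minimal Python sketch, assuming both drives (or image files taken from them) can be opened read-only, for example as `/dev/nvme0n1` and `/dev/nvme1n1` from a Linux live USB. Those paths and the function name `differing_chunks` are illustrative assumptions. Since BitLocker sits above the RAID layer, in-sync RAID 1 members should hold identical ciphertext in their data areas, while the region where the RAID metadata lives may legitimately differ. A report like this only shows where the disks diverge, not which copy is the correct one.

```python
from typing import List

CHUNK_SIZE = 1024 * 1024  # compare in 1 MiB blocks


def differing_chunks(path_a: str, path_b: str,
                     chunk_size: int = CHUNK_SIZE,
                     limit: int = 10) -> List[int]:
    """Return the byte offsets of up to `limit` chunks that differ.

    Both paths are opened read-only in binary mode, so nothing is
    written to either disk or image.
    """
    diffs: List[int] = []
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while len(diffs) < limit:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if not block_a and not block_b:
                break  # reached the end of both inputs
            if block_a != block_b:
                diffs.append(offset)
            offset += chunk_size
    return diffs
```

An empty result would suggest the two members are still identical over the compared range. A handful of differing offsets near the start or end of the disks might just be metadata, whereas many differences spread across the disk would point to the members having drifted apart.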

Welcome!

Normally I’d recommend booting a live windows image and running chkdsk, it plays nice with bitlocker, but I’m unsure of how the AMD software raid will affect it… As long as the AMD software raid passes through the logical volume to the OS it should be okay.

Do you use BitLocker with a TPM module like “Microsoft dictates it” or did you change the Group Policy Setting that allows BitLocker to also be used for the boot volume with software-only AES encryption?

I’m asking since in the past I encountered BIOS + SSD Firmware bugs with BitLocker with TPM, that led me to disable the TPM shyte and switch to BitLocker software-only encryption, even if it taxes the CPU a bit. Haven’t had any issues since then on Windows 10 22H2 and 11 23H2.

Also: Are you using an AM4 or AM5 motherboard?

Thanks! Using a Windows installation to do this does seem more straightforward than what I was trying to do.
In Disk Management the two M.2s show up individually and without a drive letter - I assume this is because this installation does not have the drivers to interact with the array. I don’t even get the option to assign a drive letter (though I have no idea if that would be a good idea as I have no idea what that actually does to the drive - for now the last thing I want is to modify any of the two drives in any way that could break things even more).
I have tried to run chkdsk against the GUID of the presumed volumes as described in [cannot include links but it’s on superuser: chkdsk on partition without letter] - but I don’t think that’s doing what I want because it says something about FAT32.
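
For reference, a read-only check against a volume without a drive letter can be sketched like this. `chkdsk` accepts a volume name of the form `\\?\Volume{GUID}` (the names that `mountvol` lists), and without `/f` or `/r` it only reports and does not repair. The helper name `readonly_chkdsk_command` is hypothetical, and stripping the trailing backslash that `mountvol` prints is an assumption about what `chkdsk` prefers. A FAT32 message may simply mean that the GUID belonged to the EFI System Partition, which is FAT32 by design, rather than to the Windows volume.

```python
import re
from typing import List

# Matches \\?\Volume{xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx} with an
# optional trailing backslash, as printed by `mountvol`.
_VOLUME_RE = re.compile(
    r"^\\\\\?\\Volume\{[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\}\\?$"
)


def readonly_chkdsk_command(volume_name: str) -> List[str]:
    """Build a chkdsk command line that only reports problems.

    No /f or /r switch is added, so chkdsk runs in read-only mode.
    The command is returned as an argument list and is not executed.
    """
    if not _VOLUME_RE.match(volume_name):
        raise ValueError(f"not a volume GUID path: {volume_name!r}")
    # Assumption: drop the trailing backslash before passing it on.
    return ["chkdsk", volume_name.rstrip("\\")]
```

On a Windows system the returned list could be passed to `subprocess.run(...)` from an elevated prompt. Building the list separately makes it easy to review the exact command before anything touches the disk.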

I have disabled the “TPM shyte” via Group Policy because I didn’t want the decryption keys stored on the TPM. It was set up so that I had to enter the encryption key before every boot.
I do not recall the exact Windows 10 version, but I set it up fairly recently and used the most recent available at the time
The Motherboard is AM4: MSI MPG X570

Rather than using the Windows installation, it might be advantageous to use something like Medicat. That way you’ll have the full GUI and more tools to help (I think it comes preloaded with some pretty advanced recovery tools beyond what MS includes).

This is the correct assumption. If my memory serves me correctly, Windows will ask to initialize a disk to MBR/GPT, and if you say yes to this it will append the disk.

I’d also do intensive memory tests (for example with Passmark Memtest), memory errors can f up a situation here severely (I’m only using ECC memory).

Fair point, I hadn’t run a memory test against this set of RAM in a hot minute. But ~9h and 4 passes later I’m fairly confident when I say that the RAM looks good.
Yes, it’s not ECC but that would feel a bit overkill for just a desktop to me (and I don’t think this mobo would even support it).

Alright, I got myself a stick with medicat on it and have tried some of the tools that sounded like they are what I was looking for. But my suspicion that whatever tool I might want to use needs support for the AMD Raid shenanigans (which at the very least the ones that I have clicked through do not seem to have) has only increased. All of them showed the drives separately, healthy and “unformatted”…

This is unfortunate, I bet the AMD software raid is putting some kind of preamble on the disks making it so they aren’t as easily read when not being accessed through the software raid.
That being said, most partition recovery software should be able to detect this and recover the data.

Alright, I do have an update:
I booted off of another windows installation I had flying about and I installed the AMD RAID drivers onto it via pnputil similar to the usual setup during Windows installation onto an array.
This enabled other software to interact with the array properly and after unlocking it via the bitlocker recovery key all data was still present and seemingly intact.
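
The two steps above (installing the RAID driver, then unlocking the volume) can be sketched roughly as follows. This is an illustration rather than the exact commands used: the driver folder `D:\amd_raid` and the drive letter `E:` are assumptions, and the helper names are hypothetical. `pnputil /add-driver ... /subdirs /install` and `manage-bde -unlock <drive> -RecoveryPassword <key>` are the standard Windows command forms. The functions only build the argument lists and do not execute anything.

```python
import re
from typing import List

# A BitLocker recovery password is 8 groups of 6 digits joined by hyphens.
_RECOVERY_RE = re.compile(r"^\d{6}(-\d{6}){7}$")
_DRIVE_RE = re.compile(r"^[A-Za-z]:$")


def add_driver_command(driver_dir: str) -> List[str]:
    """Build a pnputil call that adds and installs every .inf in a folder.

    /subdirs also searches subfolders, /install installs the drivers
    on any matching devices.
    """
    pattern = driver_dir.rstrip("\\/") + "\\*.inf"
    return ["pnputil", "/add-driver", pattern, "/subdirs", "/install"]


def unlock_command(drive: str, recovery_password: str) -> List[str]:
    """Build a manage-bde call that unlocks a volume with its recovery key."""
    if not _DRIVE_RE.match(drive):
        raise ValueError(f"expected a drive letter like 'E:', got {drive!r}")
    if not _RECOVERY_RE.match(recovery_password):
        raise ValueError("recovery password must be 8 groups of 6 digits")
    return ["manage-bde", "-unlock", drive, "-RecoveryPassword", recovery_password]
```

Both commands need an elevated prompt on Windows, and the array may only show up as a single disk after a device rescan or reboot once the driver is installed.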
I then got myself another identical drive to the two in the array and cloned all partitions onto it. This drive was actually bootable then.
There were ~500MB of unallocated space left, which I assume is the capacity the AMD RAID drivers had deducted from the drives in the array, because that’s how much data they just rawdogged onto there (or reserved) to do their RAID things… (I could probably get that space back by moving the partitions a bit but I don’t think that’s worth the effort).
From all of this I deduce that it definitely was the AMD RAID shenanigans that broke this out of nowhere.
In the boot menu this drive still showed as AMD Raid - after switching the BIOS mode back to AHCI instead of RAID only the new drive showed up and with its usual name instead of AMD Raid. Trying to boot this way required me to input my recovery key instead of my password multiple times, ultimately failing to boot :confused:
Switching back to RAID mode seemingly reverted to the state before the switch without breaking it any further.

At this point I don’t see a way to gracefully transition this into a working installation, so I shall set up Windows from scratch again after transferring as much data as sensible to some kind of external storage, one that does not get gobbled up by the AMD RAID and made unusable without RAID mode enabled.

Also I contacted AMD before and I have received a response in the meantime which basically said: Tough luck, go recreate the array - which I find … impressive :’

Lesson learned: Do not touch AMD RAID, it’s a bad idea, it will break, just, no.
(I’ve heard bad things about this before - I just understood it as “lacking features, performance bad, not worth the effort” - but this kind of failure mode in this setup makes it kind of a non-starter.)

Motherboard partners also seem to do zero testing of AMD RAID on BIOS updates.

It breaks almost every time in a way similar to this. (On wrx80/90).

I have been using AM4 SATA and NVMe RAID 1 on various X470 and X570 motherboards since 2019. While it leaves much on the table performance-wise, I never had issues with BIOS updates actually breaking things. Strange, since I otherwise attract every other bug under the Sun.

PS: I’m pretty pissed that AMD doesn’t seem to bring the fixes they made for AM5 to AM4 as well, since the RAID functionality has nothing to do with the hardware, and the AM4 software RAID experience hasn’t improved much since the video you made a few years back.