Intermittently Unable to Boot into Linux

When attempting to boot into Pop!_OS recently, I’ve come across the following message during the boot process and get dropped to an initramfs prompt. The message is “… mdadm: No arrays found in config file or automatically …” I have 3 OSes on this computer, all loaded through GRUB: the first is Ubuntu on a 512GB 950 Pro, and the other two are on 1TB 970 Pros on a DIMM.2 card. I’m using the ASUS ROG Zenith Extreme (literally the one @wendell reviewed here…). The GRUB boot loader is on the 950 Pro, but I’m having issues booting into that from time to time as well. I had no issues booting into the Pop!_OS install on the DIMM.2 card until about two weeks ago, when I started having trouble booting into all of my operating systems except, ironically, the Windows 10 install on the other 970 Pro on the DIMM.2 card.
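
For reference, here’s roughly what I plan to run from the (initramfs) prompt the next time it drops me there, just to see whether the kernel even enumerates the NVMe drives and whether mdadm actually finds any RAID metadata. This is only a sketch and assumes the usual busybox tools plus mdadm are present in the initramfs (which Ubuntu/Pop!_OS normally include when mdadm is installed):

```sh
# Does the kernel see the NVMe drives at all?
ls /dev/nvme*

# Any assembled md arrays? (only meaningful if the md driver is loaded)
cat /proc/mdstat

# Ask mdadm what it can find on its own, ignoring the config file
mdadm --examine --scan

# List filesystems/UUIDs on whatever block devices are visible
blkid
```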


System config is as follows:

  1. SAMSUNG 950 Pro (512GB) in the M.2_1 (SOCKET3) slot (GRUB & Ubuntu); see page 1-2, item 15 of the manual: https://dlcdnets.asus.com/pub/ASUS/mb/socketTR4/ROG_ZENITH_EXTREME/E13369_ROG_ZENITH_EXTREME_UM_V2_WEB.pdf
  2. SAMSUNG 970 Pro (1TB) in the DIMM.2 slot (Pop!_OS)
  3. SAMSUNG 970 Pro (1TB) in the DIMM.2 slot (Windows 10)

[EDIT]

So I didn’t think to include the GPU in my parts list last time because it seemed like an NVMe problem. But on reboot of the system, I got error code 96, which reports as an NVRAM fault. It couldn’t POSSIBLY be that another RADEON VII card has failed me and is slowly corrupting my drives!? Could it!?

This looks like your software RAID array has failed, and your root filesystem was on that array. Unless you’ve prepared for array failure and have a recovery plan to execute, your data is gone.
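
If you want to double-check that before writing the data off, something like this from a live USB (or the initramfs prompt) will show whether there is actually md RAID metadata on those partitions. The device names below are just examples, substitute your own:

```sh
# Arrays the kernel has assembled, and their state
cat /proc/mdstat

# Look for an md RAID superblock on a member partition (example device name)
sudo mdadm --examine /dev/nvme1n1p2

# Details for an assembled array, if one exists (example device name)
sudo mdadm --detail /dev/md0
```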

Also: there is an ACPI failure in your kernel logs. Make sure you’ve built the kernel with all the necessary ACPI support options and firmware if you’re going to be using quirky ACPI hardware.
Specifically, the “system76_acpi” module seems to be causing issues.
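
If you just want to rule that module out, one common approach on Debian-family distros like Pop!_OS is to blacklist it and regenerate the initramfs. Treat this as a sketch, not a guaranteed fix (the .conf filename is arbitrary):

```sh
# See what the kernel is actually complaining about
dmesg | grep -iE 'acpi|system76'

# Blacklist the module and rebuild the initramfs
echo "blacklist system76_acpi" | sudo tee /etc/modprobe.d/blacklist-system76-acpi.conf
sudo update-initramfs -u
```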

All I had to do was reboot my computer and I somehow lost a software RAID setup? That freaking sucks! Is there a configuration that is not as susceptible to this type of failure? I’m fairly sure I didn’t use software RAID or any type of RAID setup for these drives. I actually thought I set them up so that each drive had its own OS (512GB = Ubuntu, 1TB = Windows, other 1TB = Pop!_OS). I wanted to set up the system so that I could remove any drive from the system and not have to worry about the rest of the computer failing. (Obviously GRUB is on the 512GB drive, but I can always boot directly to a drive using the UEFI interface.)

[EDIT]

Speaking of being able to boot directly into an OS from the UEFI boot menu: I just did exactly that for the Pop!_OS drive. I think the GRUB configuration is broken, since I can’t otherwise get into the Linux installs, but I was able to get into my Pop!_OS install through the UEFI boot menu. So maybe not all is lost. Maybe it’s just my Ubuntu install?
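
Since GRUB itself lives on the Ubuntu/950 Pro drive, my next thought is to regenerate its config from the Pop!_OS install I can boot into. Rough plan, assuming the usual chroot dance; the device names below are guesses until I check lsblk:

```sh
# From the booted Pop!_OS install; device names are placeholders, check lsblk first
sudo mount /dev/nvme0n1p2 /mnt                  # Ubuntu root on the 950 Pro
sudo mount /dev/nvme0n1p1 /mnt/boot/efi         # its EFI system partition
for d in /dev /proc /sys; do sudo mount --bind "$d" "/mnt$d"; done
sudo chroot /mnt update-grub                    # regenerate grub.cfg
sudo chroot /mnt grub-install                   # reinstall GRUB to the ESP if needed
```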

Many of the nicer hardware RAID cards have features to prevent data loss even in the event of a power loss.

The mdadm messages in those screenshots argue to the contrary.

I recommend setting up your Linux drives to mount by partition UUID instead of any /dev/sdX designation. Since /dev/sdX names can change whenever something is added to or removed from a SATA port, they can confuse bootup scripts and cause issues. Partition UUIDs only change if the partition is modified or reformatted.
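
For example, finding the UUIDs and writing an fstab entry keyed on one looks roughly like this (the UUID below is a made-up placeholder; use whatever blkid reports for your partition):

```sh
# List filesystems and their UUIDs
lsblk -o NAME,FSTYPE,UUID,MOUNTPOINT
# or: sudo blkid

# /etc/fstab entry using UUID= instead of /dev/sdX (placeholder UUID)
# UUID=0f1e2d3c-4b5a-6978-8796-a5b4c3d2e1f0  /  ext4  defaults  0  1
```

PARTUUID= also works in fstab if you would rather key on the GPT partition entry itself instead of the filesystem UUID.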