
Gigabyte AORUS NVMe Gen4 SSD BTRFS RAID1 Issues

Hi all,

I built a PC similar to @wendell's DevOps machine.
On a Gigabyte X570 Aorus Master I have a Gigabyte AORUS NVMe Gen4 SSD 1TB in each of the first and second M.2 slots.
In the third slot I have an Intel Optane 16GB SSD.

I installed Fedora 31 on the system.
I set up the file systems as follows.
Before starting the installation I created a BTRFS RAID 1 with the two Gigabyte SSDs:

sudo wipefs -a /dev/nvme0n1
sudo wipefs -a /dev/nvme1n1
sudo wipefs -a /dev/nvme2n1

sudo mkfs.btrfs -d raid1 -m raid1 -L gigabyte-nvme-raid /dev/nvme0n1 /dev/nvme1n1
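Once the filesystem is mounted, it can be worth confirming that both data and metadata really ended up with the raid1 profile. A minimal sketch, assuming the output format of `btrfs filesystem df` (lines such as `Data, RAID1: total=...`); the helper name `check_raid1_profiles` is made up for illustration:

```shell
# Sketch: report whether every Data and Metadata line uses the RAID1 profile.
# Reads `btrfs filesystem df` style output on stdin, so it also works on saved output.
check_raid1_profiles() {
    awk -F'[,:]' '
        BEGIN { seen = 0; bad = 0 }
        $1 == "Data" || $1 == "Metadata" {
            seen++
            gsub(/ /, "", $2)          # " RAID1" -> "RAID1"
            if ($2 != "RAID1") bad++
        }
        END { if (seen >= 2 && bad == 0) print "ok"; else print "not-raid1" }
    '
}

# Usage on the installed system (the mount point / is an assumption):
#   sudo btrfs filesystem df / | check_raid1_profiles
```

A result of `not-raid1` would also show up while a profile conversion is only partly done, because then two different Data lines are listed.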

I also formatted the Intel SSD with a GPT partition table, using the disk utility of Fedora's live system.
This was needed because the installer complained when it tried to delete all partitions on that disk without GPT.

The mount points / and /home are btrfs subvolumes on the btrfs RAID 1.
/boot, /boot/efi and swap are on the Intel SSD.

I updated the system. The kernel is now 5.4.17-200.fc31.x86_64.

Now comes the problem.
After the system has been running for a while, it suddenly gets stuck for around 10 seconds, then recovers.
In dmesg I can see errors saying that it cannot write to nvme0, something about APST, followed by a lot of btrfs checksum errors.
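Besides dmesg, btrfs keeps per-device error counters, which make it easier to see which of the two drives is affected. A minimal sketch, assuming the usual `btrfs device stats` output format (`[/dev/...].corruption_errs  N`); the helper name `nonzero_btrfs_errors` is made up for illustration:

```shell
# Sketch: print only the error counters that are not zero.
# Reads `btrfs device stats` style output on stdin.
nonzero_btrfs_errors() {
    awk '$2 + 0 > 0 { print $1, $2 }'
}

# Usage (the mount point / is an assumption):
#   sudo btrfs device stats / | nonzero_btrfs_errors
# The matching kernel messages can be filtered with:
#   sudo dmesg | grep -iE 'nvme|btrfs'
```

The counters persist across reboots until they are reset, so noting them after each incident shows whether the same device keeps failing.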

To test a bit, I reinstalled with nvme_core.default_ps_max_latency_us=55000 set.
I got this number from sudo nvme id-ctrl /dev/nvme0 by inspecting ps 4 and adding up enlat + exlat plus some extra time, as described at https://wiki.archlinux.org/index.php/Solid_state_drive/NVMe. Note that this is not the value 5500 given in the section "Samsung drive error".
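The calculation described above can be sketched as a small helper that sums enlat and exlat (both in microseconds) for one power state. This assumes the `ps N : ... enlat:X exlat:Y ...` line format printed by `nvme id-ctrl`; the helper name `ps_latency_sum` and the margin argument are made up for illustration, and the actual enlat/exlat values of the AORUS drive are not given in this post:

```shell
# Sketch: sum enlat + exlat for one power state from `nvme id-ctrl` output on stdin.
# $1 = power state number, $2 = optional extra margin in microseconds (default 0).
ps_latency_sum() {
    awk -v want="$1" -v margin="${2:-0}" '
        $1 == "ps" && $2 == want {
            for (i = 3; i <= NF; i++) {
                split($i, kv, ":")                 # e.g. "enlat:2000" -> kv[1], kv[2]
                if (kv[1] == "enlat") enlat = kv[2]
                if (kv[1] == "exlat") exlat = kv[2]
            }
            print enlat + exlat + margin
            exit
        }
    '
}

# Usage: sum for power state 4 plus a 1000 us margin
#   sudo nvme id-ctrl /dev/nvme0 | ps_latency_sum 4 1000
```

The kernel only enables power states whose combined latency is at or below nvme_core.default_ps_max_latency_us, so a value slightly above the sum for ps 4 still permits that state, while a value below it excludes it.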

With this kernel command line set, the Fedora installer freezes during installation, in one of the post-install steps.

I reinstalled with nvme_core.default_ps_max_latency_us=0, which disables APST for all of the drives.
Currently the system is stable.
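To double-check that the parameter actually reached the running kernel, the value can be read back from sysfs. A minimal sketch, assuming the standard module-parameter path for nvme_core; the helper name `apst_limit` is made up for illustration:

```shell
# Sketch: show the APST latency limit the running kernel is using.
apst_param=/sys/module/nvme_core/parameters/default_ps_max_latency_us

apst_limit() {
    if [ -r "$apst_param" ]; then
        cat "$apst_param"
    else
        echo "unavailable"     # nvme_core not present or parameter not readable
    fi
}

echo "default_ps_max_latency_us: $(apst_limit)"

# To make the setting persistent on Fedora (assumes grubby manages the boot entries):
#   sudo grubby --update-kernel=ALL --args="nvme_core.default_ps_max_latency_us=0"
# To inspect the APST feature (feature ID 0x0c) on one controller:
#   sudo nvme get-feature /dev/nvme0 -f 0x0c -H
```

A misspelled parameter name on the kernel command line is silently ignored, so reading the value back is a cheap way to rule that out.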

Has anyone observed the same issues with these drives and btrfs RAID 1?
Does it harm the drives to disable APST and keep them always running in the full operational state?
Could it be a hardware issue instead?
Is there some great fix coming in kernel 5.5?
Can firmware updates be installed without Windows 10? I tried to install Windows, but it did not see any of the drives.

By the way, the issue with fstrim -v / as described in "Devops Workstation: Fixing NVMe Trim on Linux" is not present.

Thanks a lot,
André

Power management is the bane of AMD, in general, across a wide variety of devices. It is very hard to get it right. I think you might be able to enable NVMe polling like I did for Linus, or the hybrid approach, and have even better performance. For a desktop system, the power management aspect is not worth worrying about for these devices.

I am glad it is stable now, but keep an eye on it and test it hard.

Thanks, I will look into your post "Fixing Slow NVMe Raid Performance on Epyc".

I investigated a bit more and activated APST again. Here is the journalctl output. The interesting part starts at Feb 14 19:00:40.
journal_error.txt (1.2 MB)

It turned out that nvme_core.default_ps_max_latency_us=0 does not reliably avoid the issue.

Maybe it was a hardware issue. I observed that the same disk failed twice in a row, even after swapping the disks between the M.2 slots. Unfortunately, on the earlier occasions I did not note down which of the two disks it was. The disk has been exchanged for a new one. It looks good so far.