Linux NVMe RAID0 best practices (on Threadripper)

I’ve seen a lot of mixed messages: most people say to forget about HW RAID and go with software RAID (mdadm). But I’m pretty sure they’re comparing against traditional SATA/SAS HW RAID controllers, not CPU/NVMe RAID like Intel VROC or AMD’s Threadripper NVMe RAID. And I’ve definitely seen others say Linux software RAID doesn’t perform very well and that you want the hardware assistance.

Empirically, I tried software RAID first under a few configurations, and on iozone benchmarks I was getting only about an 80% sustained-write speedup on an 8-way striped array (from roughly 1.2 GiB/s single-drive to 2.2 GiB/s on the array). That’s nowhere near what AMD claims for their CPU RAID; I should be getting more like 6-8 GiB/s. I’m trying to set up the CPU RAID now, which is super frustrating and poorly documented (see e.g. thread /t/129510).
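For concreteness, the software-RAID attempt was along these lines; a minimal sketch, assuming the drives show up as /dev/nvme0n1 through /dev/nvme7n1, with the chunk size and mount point as placeholders rather than my exact settings:

```bash
# 8-way RAID0 stripe with mdadm (device names and chunk size are examples).
sudo mdadm --create /dev/md0 --level=0 --raid-devices=8 --chunk=512K \
    /dev/nvme[0-7]n1

# ext4 on top, which is what I benchmarked.
sudo mkfs.ext4 /dev/md0
sudo mkdir -p /mnt/scratch
sudo mount /dev/md0 /mnt/scratch
sudo chown "$USER" /mnt/scratch

# iozone sequential write/read test: 16 GiB file, 1 MiB records,
# -I uses O_DIRECT so the page cache doesn't inflate the numbers.
iozone -i 0 -i 1 -s 16g -r 1m -I -f /mnt/scratch/iozone.tmp
```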

And yes, I’ve already seen dozens of messages saying to forget about HW RAID, but I’m specifically looking for people with experience with CPU-NVMe RAID. :slight_smile:

This is pretty open-ended, but I’m looking for people who have successfully set up high-throughput NVMe RAID arrays and can advise on either getting that throughput with Linux SW RAID or getting the CPU RAID configured at all.

Thanks!!

(Context: the CPU is a Threadripper 2950X on an ASRock X399 Taichi, with 2x Asus Hyper M.2 cards, each holding 4x NVMe drives, in slots bifurcated x4/x4/x4/x4. The RAID array will be used as temp/scratch space, so reliability isn’t super important.)


What filesystem are you using, and what OS? Is there a reason you want a RAID array instead of something like a ZFS mirror?

Software RAID has become king in big data. A pissant little controller chip can’t compare to a CPU and its I/O.

As to mixed messages, I would agree. Wendell likes ZFS, the billion-dollar FS. I’m a BTRFS man myself because it’s flexible, and my needs are modest. Every new kernel BTRFS is being patched; I don’t see any ZFS patches land there, but ZFS is solid regardless.

Even Fedora is embracing BTRFS, while Red Hat has committed to Stratis, an undead XFS zombie.

If the 2020s have challenges, storage is up there. :slight_smile:

I feel your pain. I spent extra to get a pair of Crucial x4 NVMe drives for RAID1 on my 1920X + ASRock Fatal1ty board, as I like resilience on my boot drive… It didn’t go well.

ASRock’s “hardware RAID” driver only works with Windows. The BIOS settings are obtuse at best and downright misleading otherwise. Bottom line: I could not get it working after days of research.

In the end I created a mirror in software when setting up the mountpoints for Manjaro, then used mdadm to create a separate scratch space (and game drive) for stuff and things. Total of 4 drives: 2 in software RAID1 and 2 in RAID0.
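In mdadm terms it’s roughly the following, though the mirror was actually set up by the Manjaro installer and the device names here are just placeholders:

```bash
# Two drives mirrored (resilient data), two striped for scratch + games.
sudo mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
sudo mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/nvme2n1 /dev/nvme3n1

# Record the arrays so they assemble on boot (the file is /etc/mdadm.conf on
# Arch/Manjaro, /etc/mdadm/mdadm.conf on Debian/Ubuntu-family distros).
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf
```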

Performance is “fine”, but I’m not going for record-breaking numbers. The bottlenecks are elsewhere in the system, mainly network speed and the spinning rust in my NAS.

For your needs you may want more peak read speed, but frankly it’s random IOPS that win the day for me, and for that use case software RAID is good enough.

If I were starting with a clean slate I’d use ZFS for the boot mirror, but still mdadm for the big drive.

Sorry that’s probably not what you wanted to hear!

I’ve got ext4 on it, but I doubt the filesystem is going to account for a 2 GiB/s vs. 6 GiB/s difference.

I’m aiming for maximum throughput on sustained reads and writes of large files (e.g., 10-100 GiB). I don’t want mirroring; I want striping.
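To be concrete about the workload, something like this fio run (or the iozone equivalent) is what I’m measuring; the mountpoint, sizes, and job counts are placeholders:

```bash
# Large sequential writes and reads against the striped array, with O_DIRECT
# so the page cache doesn't hide the real device throughput. A single
# low-queue-depth writer often can't saturate 8 NVMe drives, hence the
# numjobs/iodepth settings.
fio --name=seqwrite --directory=/mnt/scratch --rw=write --bs=1M --size=32G \
    --numjobs=4 --iodepth=32 --ioengine=libaio --direct=1 --group_reporting

fio --name=seqread --directory=/mnt/scratch --rw=read --bs=1M --size=32G \
    --numjobs=4 --iodepth=32 --ioengine=libaio --direct=1 --group_reporting
```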

[added: sorry, OS is Ubuntu 20.04]

Marten – like I said, I’m not talking about “HW RAID” as in some dinky controller chip; I understand that’s gone by the wayside. I’m talking about the NVMe RAID support built into the Threadripper CPU itself, specifically for CPU PCIe I/O and NVMe controllers.

Personally, I tend to avoid RAID0; simply put, if something messes up, everything is toast. Perhaps a JBOD solution? Errm, JBOF? One drive for the OS, and another for /home.

Like I said, it’s a temp/scratch drive; all it holds are temp files. If they die, I wipe it and restart the system.

Anyway, I never got the AMD NVMe RAID working, so I’m back on mdadm. It works; I don’t think I’m getting the throughput I could, which is a little disappointing, but it does work. I also switched to XFS.
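For the record, the XFS-on-mdadm layout is roughly this (the array name and geometry values are illustrative; mkfs.xfs normally picks up the md stripe geometry on its own):

```bash
# su = md chunk size, sw = number of data disks in the stripe.
sudo mkfs.xfs -d su=512k,sw=8 /dev/md0
sudo mount -o noatime /dev/md0 /mnt/scratch
```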

You mention this being a scratch disk; I’m assuming it’s a Premiere or Resolve workload?

Also, you mention each drive is capable of ~1.2 GiB/s. Have you tried running something from the command line to test whether you can hit all of them simultaneously (as independent drives) and still reach that speed? And for what you’ve tried so far, have you noticed whether it’s hitting only one core/thread?
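Something along these lines would do it; a rough sketch that reads every drive raw in parallel (reads are non-destructive, but double-check the device names), while you watch per-core load in htop:

```bash
# Read each NVMe device directly and in parallel for 30 seconds; compare the
# per-device bandwidth against the ~1.2 GiB/s single-drive figure.
for dev in /dev/nvme{0..7}n1; do
  sudo fio --name="$(basename "$dev")" --filename="$dev" --rw=read --bs=1M \
      --iodepth=32 --ioengine=libaio --direct=1 --runtime=30 --time_based \
      --output="/tmp/$(basename "$dev").log" &
done
wait

# Summarise the per-drive read bandwidth.
grep -H 'READ:' /tmp/nvme*.log
```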

Otherwise, I’m assuming that the PCIe lanes are all CPU lanes, ideally connected to the same CCX?

Running lstopo to make sure everything is correctly connected could be beneficial. Some of the guys/gals/non-binary pals here might be able to chime in with more Threadripper-specific information, but I believe you want to avoid crossing the Infinity Fabric as much as possible for this sort of thing.
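For example (the hwloc package provides lstopo; lstopo-no-graphics avoids popping up a window):

```bash
# Text dump of the machine topology: check which package/NUMA node the
# nvme devices and the Hyper M.2 slots hang off.
lstopo-no-graphics -v | less

# PCIe tree view as a cross-check.
lspci -tv | less
```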

Also, it’s worth mentioning that 8 × 1.2 GiB/s ≈ 9.6 GiB/s before overhead is a lot to deal with, but it should be doable.

EDIT: also, have you considered using one of the 8 drives for boot and running the rest under HW RAID? And have you tried testing UMA vs. NUMA memory modes?
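For the UMA vs. NUMA part, a quick sketch of the checks I’d run (assumes numactl and fio are installed; paths and node numbers are examples):

```bash
# How many NUMA nodes the 2950X exposes in the current memory mode.
numactl --hardware

# Which NUMA node each NVMe controller is attached to.
cat /sys/class/nvme/nvme*/device/numa_node

# Re-run a benchmark pinned to the node that owns the drives, e.g. node 0.
numactl --cpunodebind=0 --membind=0 \
    fio --name=pinned --directory=/mnt/scratch --rw=read --bs=1M --size=8G \
        --iodepth=32 --ioengine=libaio --direct=1
```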
