Fast as Possible NVMe NAS?

Happy New Year LevelOne forum!

I need to build an as-fast-as-possible rack-mounted NAS for a 5 gigabit network that will serve an NFS share. I have very low storage density requirements at just 2 TB; essentially I am looking for a network-attached ~2 TB NVMe drive. I am considering RAID0 with 2 NVMe drives, since it is my understanding you can gain some performance that way, and the data that will live on this machine is either ephemeral output that doesn’t need to be kept for any amount of time or code that is already backed up to git.

My budget isn’t really limited, but I also want to do this as reasonably and cheaply as possible. I am trying to figure out the bare minimum CPU and RAM I’d need to keep up with the drives and the network card. The server will only be used by one person (me), so I don’t think I’ll need a ton of cores. However, at times there could be 2-3 applications/programs accessing the storage simultaneously.

Any tips for how to reason about this or things to look out for?

I’m using 2x 1TB 980 Pro NVMe drives in a Threadripper Pro machine over my network right now, and they are pretty fast in RAID0 at around 12 Gbps read and 13 Gbps write. NAS isn’t really something I’m that knowledgeable about, but I thought I’d leave my thoughts here, as I’d also like to set up a NAS in the future and would like to hear others’ thoughts on this.

I found this report from '15 outlining NVMe’s impact on CPU. It seems like the impact is fairly minimal.

Some things to be aware of:

  • A 5 Gbps network is going to be able to move just under 625 MB/s when it comes to bulk throughput (see the quick throughput sketch after this list). Even ZFS with 3 HDD-based vdevs can get close to saturating that under the most ideal conditions (disks are nearly empty). Just about any semi-recent PCIe 3.0 NVMe drive is going to be able to saturate that on its own when doing just reads, or just writes. If that’s all you need, then it becomes significantly easier to build a system. Otherwise you may need to step up to a 10, 25, 40 or 50 Gbps network. 10 Gbps SFP+ hardware from eBay was the cheapest option last I looked, but things may be more favorable to 40 Gbps by now.

  • Nearly all NVMe drives have a bathtub curve of performance when it comes to load profile. They look incredible when either reading or writing, but mixed loads cause them to suffer badly. Only Optane drives, in particular the 2nd gen (P5800X), can handle mixed loads perfectly. 2nd-gen Optane is full duplex and can read and write at the same time. The problem is that 2nd-gen Optane will turn even “no need for budget” into “Good god man, I have limits!”

  • Consumer SSDs have a cache that lets them be as fast as they are rated, for a limited amount of time. Once that cache is exhausted they slow down significantly (the sustained-write probe after this list shows the drop-off). Additionally, once SSDs get full enough (depends on the drive, but typically around 75%) performance drops quickly. Nearly all the benchmarking you’ll see stays within the cache and is done on mostly empty drives, unless stated otherwise. I (and most others too) was mistaken about the purpose of the DRAM cache; it actually only holds the flash translation layer (FTL) data. See: https://forum.level1techs.com/t/the-dram-cache-on-flash-ssds-wasnt-what-i-thought-it-was-it-only-caches-the-ftl/

  • NFS by default turns every write into a SYNC write. I’m fuzzy on the reasoning, but the last time I looked at what others had to say about why it does this, it made me turn away from the idea of trying to change that. SYNC writes are one of the worst performance profiles, because the application (in this case, NFS itself) requires the filesystem (and the drives under it) to report that the write is safely committed (rather than just sitting in the controller cache) before the app can move on; the fsync comparison after this list shows what that costs per write. Because of this, everything is limited by the latency of the drive responding. ZFS mitigates this with a feature called the SLOG (Separate intent LOG), which is a (preferably) mirrored pair of very low latency drives (though even same-speed drives help!). I don’t know if it’s different for other filesystems, but on ZFS, when SYNC writes are being made, the data is streamed to an on-disk ZIL (ZFS Intent Log) without being able to be optimized for location. This is really important on HDDs, because without an SLOG a drive holding a database eventually becomes very fragmented and performance goes to hell, even if it was good in the beginning. Fragmentation isn’t a concern for SSDs, but an SLOG will still help with drive contention.

  • If you use EXT4 or XFS, you may want to look into setting up an external journal. In a similar manner, ZFS has the previously mentioned SLOG, as well as a “special” vdev for metadata. These options can reduce drive contention when you’re really punishing the drives. As always, look into the caveats of these sorts of special setups.

  • What does your failure resilience look like? If some data becomes silently corrupt or a drive dies, how badly is that going to ruin your day and your data? Silent corruption happens for many reasons, but the ones I’ve personally seen are a bad cable, a bad connection, or a bad/overheating controller. You need ZFS or BTRFS to handle data integrity and catch these issues, but those have significant performance impacts. ZFS has a lot of work going into various performance aspects right now that will seriously improve things, but ultimately that is a ways off. Backups can take care of a dead drive, but remember: if it isn’t automated, incremental (meaning you can revert to an earlier point in time), and tested to be working, then it isn’t a backup.

  • Another thing to consider is power loss protection. I’ve recently been using an old 960 Evo NVMe in a USB enclosure, and it suddenly stopped being visible. It turns out that SSDs really hate suddenly losing power and have numerous ways to become fucked up. The enclosure functioned perfectly with another drive, and when I connected the messed-up drive directly to a motherboard it was visible, but clearly doing weird things. Fortunately I was able to bring it fully back by power cycling it. So you should definitely either have a decent (sine wave) UPS set up to tell your NAS to shut down on low power, or NVMe drives with integrated capacitors that let them shut down cleanly. Optane, which writes directly to its special memory, also goes a long way, and the enterprise ones still have capacitors. Note that there are some “regular” SSDs that mention some kind of power loss protection (with weasel terms like “at rest”), but if it doesn’t have visible capacitors then that’s complete bullshit.

  • I’d personally avoid “onboard RAID”. Too many issues with weird performance, no drivers for Linux, and inconsistent/absent TRIM support, among other things.
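
To put numbers on the 5 Gbps point above, here’s a quick back-of-the-envelope sketch in Python. The 6% protocol overhead figure is just an assumed placeholder, not a measurement; tune it to whatever your stack actually shows.

```python
# Back-of-the-envelope: line rate in Gbit/s -> usable MB/s, minus a guessed
# protocol overhead (the 6% default is an assumption, not a measurement).
def usable_mb_per_s(link_gbps: float, overhead_fraction: float = 0.06) -> float:
    raw_mb_s = link_gbps * 1000 / 8     # 5 Gbit/s -> 625 MB/s on the wire
    return raw_mb_s * (1 - overhead_fraction)

for gbps in (1, 5, 10, 25, 40):
    print(f"{gbps:>2} Gbit/s ~ {usable_mb_per_s(gbps):.0f} MB/s usable")
```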
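
If you want to see where a consumer drive’s write cache runs out, a crude probe like the one below will show it. This is only a sketch: TEST_PATH and TOTAL_GIB are placeholders you’d point at the drive under test (it needs that much free space and deletes the file afterwards), zeros may be compressible on some controllers, and fsync-per-slice is only an approximation of steady-state behaviour.

```python
# Crude probe for write-cache exhaustion: stream data to a test file in
# 1 GiB slices, fsync after each slice, and print per-slice throughput.
# Throughput usually falls off a cliff once the fast cache runs out.
import os, time

TEST_PATH = "/mnt/testdrive/write_probe.bin"  # placeholder: point at the drive under test
SLICE = 1024 * 1024 * 1024                    # 1 GiB measured per slice
TOTAL_GIB = 64                                # write enough to blow past the cache
CHUNK = 8 * 1024 * 1024                       # 8 MiB per write() call
buf = b"\0" * CHUNK                           # zeros; use os.urandom(CHUNK) if your controller compresses

with open(TEST_PATH, "wb", buffering=0) as f:
    for i in range(TOTAL_GIB):
        start = time.perf_counter()
        written = 0
        while written < SLICE:
            written += f.write(buf)
        os.fsync(f.fileno())                  # force the slice out of the page cache
        elapsed = time.perf_counter() - start
        print(f"GiB {i + 1:>3}: {SLICE / elapsed / 1e6:,.0f} MB/s")

os.remove(TEST_PATH)
```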
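
And to see what the SYNC write penalty actually looks like at the application level, here’s a minimal comparison of a plain write() against write() + fsync(), which is roughly the guarantee NFS demands on every write by default. The path is a placeholder, and the numbers will vary wildly between drives (this is exactly where an Optane SLOG or a power-loss-protected cache shines).

```python
# Minimal illustration of the SYNC write penalty: the same 4 KiB write,
# once left to the page cache and once followed by fsync().
import os, time

PATH = "/mnt/testdrive/sync_probe.bin"  # placeholder: point at the pool/drive under test
N = 1000
payload = os.urandom(4096)

def avg_write_us(do_fsync: bool) -> float:
    with open(PATH, "wb", buffering=0) as f:
        start = time.perf_counter()
        for _ in range(N):
            f.write(payload)
            if do_fsync:
                os.fsync(f.fileno())    # block until the drive reports it is on stable storage
        return (time.perf_counter() - start) / N * 1e6  # microseconds per write

print(f"buffered: {avg_write_us(False):8.1f} us/write")
print(f"fsync'd : {avg_write_us(True):8.1f} us/write")
os.remove(PATH)
```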

Most of what I outlined is about helping HDDs not be awful, but you did say as fast as possible…

For capacity to $$$ I would personally look for enterprise drives on eBay. Do make sure to double-check whether the drive is known for problems. Be aware that “support”, like warranty and updated firmware, tends to be hard to come by with enterprise models, especially Samsung.

TL;DR
Your real bottlenecks here are the network you’re working over (and how it’s tuned), and dealing with SYNC writes safely and quickly.

If you use ZFS and merely dislike money instead of hating it, get a pair of Optane P4801X M.2 drives (note they are 110 mm long, not the typical consumer 80 mm!) from eBay.


Great write-up.

You can ignore NFS and pretty much approach this as 5x1Gbit drives.

Since you are working with code, latency and seek times are the priorities (so any SSD will work fine). If you are going budget, make sure you have plenty of airflow or cooling, as cheap NVMe drives can easily overheat and slow down or straight up stop responding over PCI Express.

I would consider having RAM = 2 x AVG(ProjectSizes).
CPU should have a 2 GHz core per SATA drive, or 2x 2 GHz cores per x4 NVMe drive (the toy calculator below just encodes these rules).
As for technology: iSCSI might be faster; I would try both, as capacity is unlikely to be your problem.
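
For what it’s worth, here’s a toy calculator that just encodes those rules of thumb; the heuristics are rough guidelines, and the example project sizes and drive counts are made up for illustration.

```python
# Toy calculator for the RAM/CPU rules of thumb above (rough heuristics only).
def suggest_ram_gib(project_sizes_gib: list[float]) -> float:
    # RAM = 2 x average project/working-set size
    return 2 * sum(project_sizes_gib) / len(project_sizes_gib)

def suggest_cpu_ghz(sata_drives: int = 0, nvme_x4_drives: int = 0) -> float:
    # ~2 GHz of core per SATA drive, ~2 x 2 GHz per x4 NVMe drive
    return 2.0 * sata_drives + 2 * 2.0 * nvme_x4_drives

# Example: three working sets of 20/60/100 GiB and two x4 NVMe drives in RAID0.
print(f"RAM: ~{suggest_ram_gib([20, 60, 100]):.0f} GiB")
print(f"CPU: ~{suggest_cpu_ghz(nvme_x4_drives=2):.0f} GHz aggregate (e.g. 4 cores @ 2 GHz)")
```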

As for the drives: all SSDs suck at predicting 4K access patterns, so you need to look at benchmarks for non-cached random 4 kB reads. SATA or NVMe does not really matter.

Skylake and Kaby Lake have reached the secondhand/refurb market, so they would be the cheap options.

The main confusing part is that this kind of setup is designed to ease network traffic for many users on one LAN. For just one user, you might be better off getting a better CPU for your own computer.

