Storage server build

I am trying to build out a high-speed NVMe storage server using 4x Intel Xeon 8480, 12x KCMYDRUG3T84 (Kioxia CM7-R), 32x 32GB DDR5 4800 MT/s ECC, 2x GIGABYTE MS73-HB1, and 2x Mellanox MCX653106A-EFAT. What HBA/RAID card should I use, and what file system?

Before you embark on this journey, have you ever done high speed networking before?

You may know all of this already, and if so I apologize. Just know that it isn't easy to get high-performance file transfers over Ethernet, even with beefy hardware, and you should know what to expect before you spend all of this money on NVMe storage and 100Gbit networking, so you don't wind up disappointed.

The reason: it usually doesn't work out the way people expect.

In my experience, networked storage usually just doesn’t scale well above 10Gbit, no matter how fast your NIC, or how fast the storage server is.

At least not to a single or a small number of clients.

You can saturate these beefy 100gig interfaces with large numbers of clients hitting them at the same time, but with less threaded workloads (like a home environment with one or two clients accessing the storage server at a time) you are likely going to be disappointed with the effective transfer speeds you get.

I have an Epyc 7543 in a Supermicro H12SSL-NT running an all in one (Virtualization AND storage) server.

As a NIC I have an Intel XL710 dual-port 40Gbit NIC, with one of the two ports direct-linked (no switch in between) to the identical NIC in my workstation.

The server has multiple storage pools, some NVMe only, some hard drive backed with NVMe caching.

I rarely see transfer speeds above 16Gbit (~2GB/s). Even when I have eliminated all other bottlenecks by copying directly to and from a ramdisk on either side (or running iperf), I have never seen low-threaded transfers exceed 18-21Gbit.

And even when it does hit good speeds, it only does so in bursts here and there, nothing consistent at all. I rarely average over 1 to 1.2 GB/s; most of the time it is down at 600-800MB/s.

Oh, I can hit the full 40Gbit/s, but only by doing many simultaneous iperf transfers, never when just doing file transfers over either NFS or SMB.
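If you want to reproduce that yourself, here is a rough sketch of how I would script the simultaneous iperf runs. It assumes iperf3 and several server instances already listening on consecutive ports on the far end; the host and port numbers are placeholders:

```python
#!/usr/bin/env python3
"""Sketch: run several iperf3 clients in parallel and sum the results.

Assumes iperf3 servers are already listening on the target host on
consecutive ports (e.g. `iperf3 -s -p 5201` ... `iperf3 -s -p 5204`).
Host, ports and duration are placeholders -- adjust for your setup.
"""
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

HOST = "192.168.10.2"       # storage server (placeholder)
PORTS = range(5201, 5205)   # one iperf3 server instance per port
DURATION = 10               # seconds per run


def run_client(port: int) -> float:
    """Run one iperf3 client and return its receive rate in Gbit/s."""
    out = subprocess.run(
        ["iperf3", "-c", HOST, "-p", str(port), "-t", str(DURATION), "-J"],
        capture_output=True, text=True, check=True,
    )
    result = json.loads(out.stdout)
    return result["end"]["sum_received"]["bits_per_second"] / 1e9


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=len(PORTS)) as pool:
        rates = list(pool.map(run_client, PORTS))
    for port, rate in zip(PORTS, rates):
        print(f"port {port}: {rate:.1f} Gbit/s")
    print(f"aggregate: {sum(rates):.1f} Gbit/s")
```

The aggregate climbs toward line rate as you add streams, which is exactly what a single-stream SMB/NFS copy never does for me.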

It’s like you start hitting seriously diminishing returns as you go above 10Gbit.

If you are very seasoned at this, you may know some file sharing and networking optimization tricks that will help, but I’ve tried almost everything I can find, and never been able to bend it to my will.

There are some things one can do to manually optimize for the latency you see (which in my case is very low, as the machines are directly connected), but I haven’t actually gotten around to trying this yet. It may work. Not sure.

Just don’t expect to plug and play and be able to max out your NVMe drives remotely over those sweet 100Gbit NICs, because that is highly unlikely to work.

As for the cause? I don’t know. I can’t seem to find any obvious bottlenecks on either side of the network. The NICs should be able to handle it. The drives should be able to handle it. The CPUs don’t show excessive load, and there is no way PCIe bandwidth is the limit.

I used to have the same problem just over a decade ago when I first got into 10gig networking. Everything over gigabit speeds just scaled terribly.

Based on that, the best I can come up with is that the software in our operating systems is optimized for the network bandwidth commonly available to consumers, and above 10Gbit just isn’t particularly common yet, so no one has optimized the software for it.

This would explain why 10gig was so hard for me to achieve in 2014, and why it is so easy now, and why above 10gig is so hard to achieve now. I don’t know for sure though.

Anyway, just figured I’d drop this here before you start spending big bucks only to be disappointed.

I am trying to go for 120gb/s because I am working with very large simulation data sets, so I think this will be the optimal solution, and I already have the CPUs, RAM, motherboards, and network cards.
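To put rough numbers on that target (the per-port and per-drive figures below are assumptions, roughly dual-port 100Gb per ConnectX-6 card and ~14 GB/s sequential reads per Gen5 drive; "120gb/s" could mean gigabits or gigabytes, so both are shown):

```python
# Back-of-envelope check on the "120gb/s" target. Per-port and per-drive
# numbers are assumptions, not measurements.
GBIT_IN_BYTES = 1e9 / 8

print(f"120 Gbit/s = {120 * GBIT_IN_BYTES / 1e9:.0f} GB/s")
print(f"120 GB/s   = {120 * 1e9 / GBIT_IN_BYTES:.0f} Gbit/s")

nic_line_rate_gbit = 2 * 2 * 100        # 2 cards x 2 ports x 100 Gbit (assumed)
drive_read_gbyte   = 12 * 14            # 12 drives x ~14 GB/s reads (assumed)

print(f"NIC line rate     : {nic_line_rate_gbit} Gbit/s "
      f"(~{nic_line_rate_gbit * GBIT_IN_BYTES / 1e9:.0f} GB/s)")
print(f"Drives, aggregate : ~{drive_read_gbyte} GB/s "
      f"(~{drive_read_gbyte * 1e9 / GBIT_IN_BYTES:.0f} Gbit/s)")
```

So 120 Gbit/s fits inside the NICs' combined line rate, while 120 GB/s would not; either way, on paper the drives are not the limiting factor.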

I agree with o1.

Short answer
If you want to keep things simple and extract high performance out of NVMe devices, avoid traditional hardware RAID controllers—especially for NVMe. Instead, you typically want:

  1. An HBA or PCIe switch solution in pass-through/IT mode (no RAID features, just direct access).
  2. A software-based file system or volume manager that can handle high performance and data integrity (e.g., ZFS, mdraid + XFS, or a parallel file system like BeeGFS or Lustre if you’re in an HPC environment).

Why not hardware RAID for NVMe?

Most “RAID cards” on the market are traditionally built around SAS or SATA drives. Even if there are hardware RAID solutions claiming NVMe support, they can introduce bottlenecks or complicate your configuration. NVMe drives are designed for high IOPS/throughput and low latency, and passing them through directly to the OS (via an HBA or PCIe switch) typically yields better performance and more tuning flexibility.

In other words:

  1. Use a simple HBA or a PCIe switching card that exposes each NVMe directly.
  2. Let the OS manage software RAID or a next-gen file system like ZFS.
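Before building anything on top, it is worth confirming the OS actually sees each drive as a native NVMe device negotiated at the expected PCIe link. A minimal Linux sketch, assuming the usual sysfs layout (paths can differ between kernels and distros):

```python
#!/usr/bin/env python3
"""Sketch: list NVMe controllers and their negotiated PCIe link (Linux sysfs).

The paths below follow the common sysfs layout; treat this as illustrative
rather than guaranteed portable.
"""
from pathlib import Path

for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
    model = (ctrl / "model").read_text().strip()
    pci_dev = ctrl / "device"                      # symlink to the PCI device
    speed = (pci_dev / "current_link_speed").read_text().strip()
    width = (pci_dev / "current_link_width").read_text().strip()
    print(f"{ctrl.name}: {model} -- {speed}, x{width}")
```

If a drive shows up at a lower speed or width than expected, fix that (slot choice, cabling, bifurcation settings) before blaming the file system.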

File system recommendations

1. ZFS

  • Pros: Strong data integrity features (checksums, copy-on-write), snapshots, flexible “RAID” (ZFS calls them vdevs—RAIDZ, mirrors, etc.).
  • Cons: Uses more RAM, best performance if it sees raw disks rather than partitions.

2. mdraid + XFS

  • Pros: Well-supported on Linux, simpler if you’re already used to classic Linux RAID. XFS is quite performant for large sequential I/O.
  • Cons: Fewer built-in integrity checks vs. ZFS (you rely on hardware ECC and RAID redundancies).

3. Parallel File Systems (BeeGFS, Lustre, etc.)

  • Pros: Great for HPC or multi-node scenarios with huge data sets. Scales linearly with more servers.
  • Cons: More complex to set up and maintain. Often overkill for smaller single-node solutions.
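To make the capacity trade-offs concrete, here is rough arithmetic for a few common layouts across twelve 3.84TB drives (the drive size is taken from the CM7-R model in the original post; the layout names are illustrative, not a recommendation):

```python
# Rough usable-capacity math for 12 x 3.84 TB NVMe drives under common layouts.
# Throughput is deliberately left out -- with drives this fast the network and
# protocol are usually the ceiling, not the pool layout.
DRIVES = 12
SIZE_TB = 3.84

layouts = {
    "stripe (RAID0 / single-drive vdevs)": DRIVES * SIZE_TB,         # no redundancy
    "striped mirrors (6 x 2-way)":         (DRIVES // 2) * SIZE_TB,  # lose half
    "RAIDZ2 (one 12-wide vdev)":           (DRIVES - 2) * SIZE_TB,   # two parity drives
    "2 x RAIDZ2 (6-wide each)":            (DRIVES - 4) * SIZE_TB,   # four parity drives
}

for name, usable in layouts.items():
    print(f"{name:40s} ~{usable:5.1f} TB usable")
```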

Performance notes

  • Saturating 100G or 120G of network bandwidth from a single client workload can be very difficult. On paper, you’ve got enough CPU and NVMe horsepower (4× Xeon 8480, 12× enterprise NVMe, tons of DDR5), but there are still real-world protocol and threading constraints, especially with SMB/NFS.
  • High throughput often requires parallelism—many simultaneous streams or concurrent I/O requests—to keep the network and the NVMe devices busy. One large single-thread file copy rarely hits maximum theoretical speeds.
  • For HPC workflows or large simulation data sets, you may already be set up to generate or consume data in parallel. In that case, you have a better shot at seeing closer to 100–120Gb/s.
  • Tuning MTU, RSS (Receive Side Scaling), queue depths, and kernel networking parameters can help. Also consider direct connections (no switch) or a high-end 100G+ switch to reduce latency.
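One concrete piece of that tuning: a single TCP stream can only fill the pipe if its window covers the link's bandwidth-delay product. A quick calculator (the RTT values are placeholders; measure yours with ping):

```python
# Bandwidth-delay product: the minimum TCP window needed to keep one stream
# at line rate. Below this, a single NFS/SMB/TCP stream simply cannot fill
# the pipe no matter how fast the disks are. RTTs here are placeholders.
def bdp_bytes(link_gbit: float, rtt_ms: float) -> float:
    return (link_gbit * 1e9 / 8) * (rtt_ms / 1e3)

for link in (10, 40, 100):
    for rtt in (0.05, 0.2, 1.0):          # direct link ... through a switch or two
        mib = bdp_bytes(link, rtt) / 2**20
        print(f"{link:3d} Gbit/s @ {rtt:4.2f} ms RTT -> ~{mib:6.2f} MiB window needed")
```

Many Linux kernels ship with a default tcp_rmem maximum of around 6 MiB, so at 100Gbit even modest latency can require raising net.core.rmem_max/wmem_max and net.ipv4.tcp_rmem/tcp_wmem before a single stream has any chance at line rate.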

Summary

Given that you already have the big Intel Xeon setup, enterprise NVMe drives, and 100G+ Mellanox NICs, your best bet is:

  1. Skip hardware RAID controllers for NVMe.
  2. Use an HBA/PCIe switch that offers direct pass-through.
  3. Pick a file system that matches your needs for performance and integrity—ZFS is a common choice for many people building high-speed servers.

This approach keeps your configuration flexible, performance high, and complexity lower than purchasing specialized (and often finicky) hardware RAID for NVMe.

what HBA/PCIe cards should I use, preferably more budget-friendly than not?

Might be worth also looking at FreeBSD as that’s what Netflix uses for pushing out data at these rates.
https://openconnect.netflix.com/en/appliances/

Well, the Kioxia CM7-R SSDs are, as you know, NVMe drives. So, you’ll either need to connect them directly to your PCIe lanes somehow, or get an HBA that will interface with them. But then you have to consider your throughput: if you really need 100Gbps or more, you’re going to want to use as many lanes as possible. You can get something like a Broadcom 9620-16i Tri-Mode HBA, and it will connect to all of your drives, but it’s not going to give you anywhere close to the throughput you’d get with every SSD connected directly to the CPU, since the card itself only has a PCIe 4.0 x8 connection.
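Rough math behind that point, using ballpark per-lane figures and an assumed ~14 GB/s sequential read per Gen5 drive:

```python
# Why a single x8 tri-mode HBA becomes the choke point for 12 fast NVMe drives.
# Per-lane figures are approximate usable rates; the per-drive read speed is
# an assumption for a Gen5 enterprise SSD.
GBS_PER_LANE = {"gen3": 1.0, "gen4": 2.0, "gen5": 4.0}   # ~GB/s per lane

hba_uplink_gbs = 8 * GBS_PER_LANE["gen4"]    # PCIe 4.0 x8 card -> ~16 GB/s
drives_gbs     = 12 * 14                     # 12 drives x ~14 GB/s assumed

print(f"HBA uplink ceiling : ~{hba_uplink_gbs:.0f} GB/s")
print(f"Drives, aggregate  : ~{drives_gbs:.0f} GB/s")
```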

So, after looking at what you spec’d in your OP, you may want to look at getting something LIKE three of these cards from LR-LINK…

But, you need to do your research because cards and adapters like this are a dime-a-dozen coming out of China and some of them have issues, or maybe only work at PCIe 3.0. So, reviews and forums are your friend. You absolutely cannot completely rely on what their website or store listings say.

You have to know, just as @mattlach said in his reply to you, it’s not as easy as creating a volume and a share to get that kind of throughput. And getting that done is not the kind of thing I can just tell you in steps; it’s going to have to be set up specifically for your use-case. You can learn most, if not all, of what you need in this forum, most likely. But I would expect this to be a gigantic undertaking if you’re not really experienced in the storage world.

Also, it’s amazing what you can learn by just asking an LLM questions about what you’re wanting to do and giving it examples to work with. Have a full on conversation about why you need what you need and see what it tells you. It’s a really powerful tool.

what about using an MCIO PCIe Gen5 Host Adapter x16 -RETIMER-?

Netflix is the very definition of “highly parallelized” though. We don’t know OP’s application, but most people asking in forums like these are trying to build a NAS for high speeds to and from small numbers of clients, which is why I cautioned that it may not work as intended.

Remote NVMe speeds to a single client on a LAN using beefy enterprise-grade NICs should be possible in theory, but in practice it just doesn’t work.

I’m assuming you mean a PCIe x16 adapter card with MCIO outputs?

You could do it that way and just get MCIO to U.2/U.3 cables, but it’s going to be hard to get that working with PCIe 5.0.

With the number of lanes you have available and your throughput needs, it doesn’t actually make sense to use an HBA for this. You’re going to want the lanes going straight to the drives.

I would use one (or more) passive x16-to-four-x4 adapter boards. Which one to select depends on the drives (M.2, U.2, U.3, etc.).

Passive boards that rely on bifurcation cost the least, perform the best, and use the least power, but they require bifurcation support from the motherboard (which is not always the case; I don’t know about yours, but most even semi-recent server and workstation boards do).
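The lane budget that makes the bifurcation route attractive here is simple enough to sketch (slot counts are illustrative; check the MS73-HB1 manual for which slots actually bifurcate):

```python
# Lane budget for attaching all 12 drives directly via bifurcated x16 slots.
DRIVES = 12
LANES_PER_DRIVE = 4                         # each U.2/U.3 NVMe drive is x4

lanes_needed = DRIVES * LANES_PER_DRIVE     # 48 lanes total
x16_slots    = lanes_needed // 16           # 3 slots, each split x4/x4/x4/x4

print(f"lanes needed: {lanes_needed}")
print(f"x16 slots   : {x16_slots}, each set to x4/x4/x4/x4 bifurcation")
```

With two 8480s per board you should have lanes to spare; the practical constraint is which slots the board will actually bifurcate.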

If you don’t have bifurcation support (or if you lack sufficient PCIe lanes to connect all of the drives directly), you may need to use some sort of switched solution. A PCIe switch (also known as a PLX switch, as PLX - now owned by Broadcom, like everything else - was the first to do this at any real scale) adds some latency and power use, and depending on how many upstream PCIe lanes you give it, may restrict bandwidth.

Another option is to use so-called “tri-mode” HBAs, like LSI’s 9400, 9500 and 9600 series.

They are called “tri-mode” because they support SATA, SAS and NVMe.

I’d avoid the 9400 series for this purpose. They perform terribly in NVMe mode. 9500 is a little better, but IMHO, the only really acceptable series is the 9600 series.

In all honesty though, I would avoid using “tri-mode” HBAs. They do some weird translation of the NVMe protocol and present the drives to the OS as if they were SCSI drives. Given how much software-defined storage pools prefer direct access to drives, this makes me nervous.
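One quick way to see whether that translation is happening: native NVMe namespaces show up as /dev/nvme* devices, while drives behind a tri-mode HBA typically appear as /dev/sd* through the SAS driver. A rough Linux check (sysfs layout assumed):

```python
#!/usr/bin/env python3
"""Sketch: classify block devices as native NVMe vs SCSI-presented (Linux)."""
from pathlib import Path

for dev in sorted(Path("/sys/block").iterdir()):
    name = dev.name
    if name.startswith("nvme"):
        kind = "native NVMe namespace"
    elif name.startswith("sd"):
        kind = "SCSI-presented (SATA/SAS, or NVMe behind a tri-mode HBA)"
    else:
        continue                        # skip loop/dm/md/etc.
    print(f"/dev/{name}: {kind}")
```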

In some cases, though, such as if you want to use enterprise-style NVMe-compatible backplanes for easy hot swapping, you may not have a choice.

As far as I am concerned, NVMe over a “tri-mode” HBA is more of a convenience solution than an effectiveness or performance solution. It depends on what you are going for.
