Does anybody here have experience setting up NFSoRDMA in Ubuntu 18.04 LTS with the "inbox" driver?

Well, I actually have the Ethernet-only version of these cards, the Mellanox MCX314A-BCCT. I think these cards can somehow be cross-flashed, but I am too chicken to try it. Also, I am not sure what benefits I could expect.

The switch can act as IB only, Ethernet only, or VPI, where IB/EN can be specified on the port level.

Yep, that’s what actually got me to buy all of this gear. I started by buying just two Aquantia 10Gb NICs and found they worked great. But 10G switches were/are relatively expensive ($100+/port).
When I found out that for about the same amount of $$$ I could get “5x” the performance, I did not hesitate :slight_smile:

Right now I’ve sort of finished the “let’s figure out the underlying technology” phase and am getting ready to dive into some HPC-type applications I have in mind.

Gotcha.

That must be nice.

They didn’t offer that for the 100 Gbps layer. Grrr…

Same. I got into 100 Gbps IB because of a Linus Tech Tips video that talked about how you can now buy those NICs off eBay for half of the retail cost.

And whilst $350 might still be a lot of money to some, in terms of $/Gbps(/port) it’s actually VERY cost effective, even if you might not be able to make full use of it (yet).

If you deployed 10 GbE and then found out that you were managing to saturate that bandwidth, then you’d have to pay for it all over again to move to something even faster.

But by deploying 100 Gbps (in my case) right off the bat, I can probably run it until newer generations of PCIe are no longer backwards compatible with PCIe 3.0 x16.

For storage services, I DOUBT that I will ever max out the bandwidth capacity.

But for HPC applications, I’ve gotten pretty darn close to it, which was why I deployed it in the first place.

(Sidebar: it’s fun to scare IT directors and the like when you tell them that the fastest networking layer you have deployed in your home/in your basement is 100 Gbps IB. They have to scrape their jaws off the floor most of the time. That will never cease to amuse me.)

So…depending on which specific application you might be trying to dive into, I have found that it is actually easier (or at least can be easier) to get RDMA working for HPC applications than for storage services.

Whilst I can’t disclose which applications I am using (as that’s under NDA), I can say that at least for one of the CFD applications, when I use it to test the network bandwidth (at least on my 100 Gbps IB), it can routinely get about 80 Gbps out of it (10 GB/s).

So that’s pretty darn good.

On the FEA side of things, the measurement has not been as simple nor as straightforward to benchmark.

(The chiplet nature of the AMD Ryzen 9 5950X adds additional layers of complexity in terms of what I should be able to achieve: what’s possible in pure computer science theory vs. what’s theoretically possible given the physical limitations of the hardware/architecture.)

The closest thing that I’ve been able to cook up in terms of a comparison between my old Xeon-based cluster and my new Ryzen-based cluster is that the Ryzen is, on average, about 11% faster in terms of time per FEA solution iteration. (My model has non-linear geometry, non-linear materials, and a mix of linear and non-linear contacts, so it’s actually quite a difficult problem for FEA solvers to contend with.)

I would’ve expected a larger performance difference, but that does not appear to be the case in the actual benchmark tests that I am running with an internal case.


I did zero stack tuning, and most of the time I totally max out my PCIe lanes as the limit. Maybe try Samba Direct and see what you can get with that? Try with and without RDMA to see what it does with just multiple threads.

Over NFS I am still getting about 3.2 gigabytes per second writing to an NVMe array from a machine with a CX-3 NIC in an x4 slot. That’s really close to as fast as the best PCIe 3 SSDs can do in an x4 slot locally on the machine.

These are sequential read/write tests done using either CrystalDiskMark or KDiskMark on Linux. I usually use a 2 GB file size for the test. Benchmarks using something like dd perform significantly worse, though.
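For what it’s worth, here’s a rough sketch in Python of the kind of single-stream sequential write test dd does (the mount path is just a hypothetical placeholder). One reason dd tends to look worse is block size: dd defaults to 512-byte blocks unless you pass bs=, whereas CrystalDiskMark/KDiskMark push large sequential blocks at higher queue depths.

```python
# Crude single-threaded sequential-write check -- a sketch, not a substitute
# for CrystalDiskMark/KDiskMark. Path and sizes are arbitrary assumptions.
import os, time

PATH = "/mnt/nfs/throughput_test.bin"   # hypothetical NFSoRDMA mount point
BLOCK = 1024 * 1024                     # 1 MiB blocks (dd's default is only 512 bytes)
TOTAL = 2 * 1024**3                     # 2 GiB, same test file size I use

buf = os.urandom(BLOCK)
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
start = time.time()
written = 0
while written < TOTAL:
    written += os.write(fd, buf)
os.fsync(fd)                            # flush so we aren't just timing the page cache
os.close(fd)
elapsed = time.time() - start
print(f"{written / elapsed / 1e9:.2f} GB/s sequential write")
os.unlink(PATH)
```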

On that subject, any QSFP+ cable will work (for QDR anyway). If you go for the ones explicitly marketed as IB, they’ll be expensive, because…reasons. However, if you can find ones sold for other purposes, you can get them fairly cheap.

That’s interesting. We are using EDR, though. When we do use QDR, those cables are plenty cheap.

Just from randomly searching: fs.com used to not support InfiniBand, but they seem to now. Haven’t tried them with IB gear, but they’ve made us custom lengths before, which was really nice.
https://www.fs.com/sg/products/154831.html

Jode,
Slightly off-topic, but if you are still looking for cheap 10G switching, check out the Quanta LB6M. A lot of ex-Amazon units were dumped on the market about ten years ago, so there are heaps around and they go cheap. I got one with a scratched top case for $250. I didn’t care about the scratch, as I hole-sawed the top and mounted studs to run standard 120 mm PWM fans, then repainted afterwards. Some people try to run low-flow fans for the noise, but I think that unit makes a lot of heat and needs more airflow if heavily used.

With the LB6M, the fan connections are standard stuff, which is a lot easier than the InfiniBand switches where you need to make adapter connectors or pull out the soldering gun. If you do not have a separate server room for noise control and only need 10 gig, it does work well. You do have to reflash it to TurboIron firmware to get RDMA, and you lose some features like flashing activity lights in the process. More info here: https://brokeaid.com/


I think that Wendall, Lawrence from Lawrence Systems, LTT, and maybe even Patrick from STH have all said something to the effect that creating an array of NVMe drives is really only useful for capacity, but not necessarily for throughput - or at least not in the same way that an array used to improve throughput for HDDs.

The NVMe drives are usually plenty fast by themselves, so putting two NVMe drives in RAID0, for example, won’t necessarily double the bandwidth like it used to for HDDs.

For me, the storage benefit was just an added bonus. My system interconnect wasn’t originally built with that specific purpose in mind.

Having it for my HPC applications meant that I was also able to use the physical hardware for storage as well and allowed me to skip over 10 GbE entirely, wherever I am able to deploy the 100 Gbps IB.

That’s interesting; I wonder what kind of benchmarks they are looking at for that conclusion, as I found the exact opposite using Highpoint cards with NVMe RAID. I could never saturate EDR InfiniBand on a single SSD, but I am doing it with anything from file copies to file save operations to backup clones. Tests like CrystalDiskMark hold up and scale linearly for me.

Consumer-type SSDs don’t saturate EDR InfiniBand unless you’re using contrived benchmarks. Enterprise-class SSDs get closer, especially the PCIe Gen4 kind.

It doesn’t seem to be a problem for me with them in RAID 0 configs on Highpoint hardware RAID, using a full x16 PCIe slot. Not sure what else to tell you to convince you, but these are just normal Samsung EVO NVMe x4 units. I don’t think copying a 100 GB 3D scan data file from memory or moving 1 TB of random files are contrived benchmarks. Choose to believe me or not…


Linus (and by extension, Wendall) wasn’t running any specific type of benchmark that I am aware of.

cf. here, here, and here.

It was when they deployed the server for production use that they found this kind of issue/error.

Do the transfer rates scale with the number of NVMe 4.0 x4 SSDs that are in your RAID0 array?

I tried googling which version of the Samsung EVO NVMe 4.0 x4 SSDs you are using, but the only results that come up are, for example, the Samsung 980 Pro series (which is NVMe 4.0 x4).

To that end though, assuming that you’re able to hit the peak of 5 GB/s STR write speeds all the time, that works out to be 40 Gbps.

Therefore, if there is linear scalability in the write bandwidth as you add drives (I picked the Samsung 980 Pro 1 TB at random so that I could look up the write speeds), then in theory three of those NVMe 4.0 x4 drives in RAID0 would give 15 GB/s, or 120 Gbps, which would exceed the 100 Gbps of 4x EDR IB.
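To make that back-of-the-envelope math explicit, here’s a quick sketch (the 5 GB/s per-drive figure is just the 980 Pro’s spec-sheet sequential write, and the perfectly linear scaling is exactly the assumption in question):

```python
# RAID0 write-bandwidth projection, assuming perfectly linear scaling
# (which is the assumption being debated here).
PER_DRIVE_GBPS = 5.0        # Samsung 980 Pro spec-sheet sequential write, in GB/s
EDR_LINK_GBPS = 100 / 8     # 4x EDR IB = 100 Gbps = 12.5 GB/s

for n_drives in (1, 2, 3):
    aggregate = n_drives * PER_DRIVE_GBPS
    verdict = "exceeds" if aggregate > EDR_LINK_GBPS else "stays under"
    print(f"{n_drives} drive(s): {aggregate:.1f} GB/s "
          f"({aggregate * 8:.0f} Gbps) -> {verdict} 4x EDR")
```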

I would be surprised if you get that kind of linear scalability in practice (or if you are using “your real world data” to benchmark the RAID0 array).

I know that for my Samsung 860 EVO 1 TB SATA 6 Gbps SSDs, the spec sheet (Source: https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-860-evo-2-5--sata-iii-1tb-mz-76e1t0b-am/#specs) says they should be capable of 520 MB/s writes (4.16 Gbps).

With four of those in a RAID0 (as in my micro HPC cluster headnode), the best that it has been able to muster in “real world usage” (i.e. not a benchmark like CrystalDiskMark) is anywhere between 800 MB/s and about 1200 MB/s (vs. the 2080 MB/s that it should theoretically be capable of).

(Sidebar: I did find the Samsung 860 EVO SSD, but that’s apparently an M.2 SATA SSD, not an NVMe SSD. (Source: SSD 860 EVO M.2 SATA 1TB Memory & Storage - MZ-N6E1T0BW | Samsung US))

If you’re able to achieve the scalability, that would be awesome.

I’ve yet to accomplish that. (And my four SATA 6 Gbps SSDs are attached to an LSI/Broadcom/Avago MegaRAID 9341-8i SAS 12 Gbps HW RAID HBA, so the RAID HBA shouldn’t be a problem here.)

(Sidebar #2: I’ve moved away from burning through SSDs like that. Whilst they are super fast, I have found that the faster the SSD is, the faster I wear out its write endurance limit, because I want to use it all the time and thus very quickly consume the finite number of program/erase cycles of the NAND flash chips/modules on said SSDs.)


Yes, on every single benchmark we have, the transfer rates scaled (basically) linearly on the RAID0 with NVMe, and that’s on both the PCIe 4.0 and 3.0 versions of the Highpoint card range. We use both, since our first version of this was built before we had any PCIe 4.0 machines. We did not use ZFS for the RAID; we used hardware RAID based on Highpoint cards. That is probably the first big difference between our setup and his. Locally they are perfect, and it only took time to get the network side up to it with RDMA.

Based on everything I got from the links (thanks for those, that gives me a lot of context): firstly, his performance wasn’t just non-linear, it was a mess.

In the early testing of this, we did have inconsistency with SMB Direct for the Windows clients, but it went away with RHEL serving up both SMB Direct and NFSoRDMA.

CrystalDiskMark or KDiskMark (depending on Windows or Linux) are giving us solid numbers, as is moving a massive ZIP file full of various software. Automated local disk clones all run at whatever the peak performance of the local disk is.

I am bummed out to hear the SATA array on the LSI SAS card didn’t work out. I wanted to try using a bunch of older SATA disks as an L2ARC cache, but it sounds like that’s a no-go. That would be slower than what we get out of the ZFS platter array now with no cache.

I can say one thing for sure: as we scaled up our ZFS platter array, it wasn’t like we would get 6x the performance with 6x the disks. But at some point we were on 10GbE for the storage network, and later we were on EDR InfiniBand. That happened while the storage array was scaling up in the background out of sheer need for more space. Then I was pleasantly surprised how fast it was once we had a faster network. I don’t know after how many SAS cards or how many drives it did what, but at some point we had platters getting way higher transfer speeds than you are getting from those SSDs in RAID 0.

It sounded like Linus had to work out some bottlenecks in system performance as well as configuration. Perhaps we simply started with the right build spec for the job, but to be fair, our job is never 3 guys editing 4K++ videos at the same time. We move big files in one shot and then again later. Our use case is too small, and the network too fast, to wind up with multiple mega jobs happening simultaneously. Maybe our version of the server or kernel had already sorted those issues out; not too sure, to be honest. They don’t mention what OS exactly, what kernel, etc. We started from a low-latency kernel and compiled our own with some tweaks, which felt right for the job of the machine.

Yep, that’s the general experience, which should explain the surprised reactions to your posts.
Thank you for sharing your positive experience; it’s very welcome.

I wonder if you could add some specifics, e.g. which exact Highpoint cards you’re using (they have quite a list of products)?

Would you be able to share specifics (which benchmark, results)?

It’s not that I don’t believe your statement, but there are so many use cases that knowing the type of benchmark will help us better understand your setup.

Thanks much!

Didn’t Linus get it sorted in the end with Wendall’s help?
Surely this works at the enterprise level; it would be surprising if not.

We used the Highpoint SSD7505 and SSD7101. I have only ever used them with Samsung EVO units, either 970 or 980, 1 TB or larger each. I do quick web searches in Japanese and find other people testing those units and seeing the benchmarks scale, with no complex explanations of how they made it work. Stable SSD scaling was the easiest part of all this for me.

The hard part was SMB Direct and NFSoRDMA on a single Linux server. It took too much to write about here, but if anyone gets that far and needs help, I’m happy to help if you drop me a line.

I wish I had more to offer, but it worked on too many different OSes and machines for the SSD scaling to have been a lucky config. It just worked.


No problem. You’re welcome.

Yeah, I can’t tell if it is the OS (I’m using CentOS, which is, of course, a RHEL derivative) or if it is the LSI/Broadcom/Avago HW RAID HBA.

Yeah, that’s always the case with the confluence of factors.

I would think that if I were to add more hardware, then I would hope that the performance would scale accordingly.

(I’m also using XFS not ZFS.)

Yeah, I didn’t spend any time tuning the entire stack between the hardware and the software.

For me, in terms of how I am using the EDR as a storage service, I have both really tiny files and really large files, which makes trying to tune the stack virtually impossible (it will either work a lot better for small files or for large files, or both will be equally sub-optimal).

From my perspective, being that I am using SATA spinning rust, I expected that I would never be able to achieve the full EDR speeds. It is what it is.

With SATA SSDs (even running over EDR, in RAID0, on CentOS, on an LSI HW RAID HBA, formatted as XFS, on an Intel Core i7-4930K in a P9X79-E WS board), I don’t think I’m going to really maximise the capabilities of said EDR.

It didn’t sound like it. I’m not sure/I forget what they ended up doing with that server.

For my systems, I actually wrote the deployment procedure into a OneNote file.

So, it might not be a bad idea for you to document that, if for no other reason than that, if you or someone else wants to do it again, you can just send that file (or copy-and-paste it into a word processor and send that out); that way, some of the research into the deployment has already been done. (Of course, that’s optional, but I’ve sent out a copy-and-pasted version of my deployment notes to other people before.)

That’s good to know, though.

I’ve historically stayed away from Highpoint HW RAID HBAs, but seeing that it worked for you, it might be worth revisiting that.

Thanks.

We found that we needed a certain amount of performance per core, and then the number of cores was more about supporting a larger number of clients. So there can be some tradeoff. That CPU probably performs a lot worse per core than what we are using. How is your CPU utilization while attempting high-speed transfers?

Highpoint had a page on benchmark and performance optimization which seems to have vanished. Hopefully, this 404 error is just temporary, and they didn’t take it down. I had bookmarked it some years ago. https://www.highpoint-tech.com/configuring-and-opitimizing

By chance, I was trying to get a Highpoint 7101A NVMe PCIe 3.0 RAID controller working in a VM under Linux. It doesn’t work; support also confirmed VM passthrough will not work, as the bridge cannot be passed through.

However, in the process, I could pass through individual NVMe disks, which meant I was not able to use the Highpoint hardware RAID. I created a simple software RAID 0 in Disk Management and benchmarked using CrystalDiskMark, 3 iterations on a 2 GB file size, with 4x Samsung 970 Plus 1 TB drives: 9700 MB/s sequential read and write. I haven’t tried over the network, as I tried a standard 100 Gb AOC cable (not specific to InfiniBand) and the Mellanox 7790 switch does not detect it. I ordered some more Mellanox-brand cables and will report on the speed over the network. This is on Windows Server 2016, sharing via SMB Direct to Windows clients.

This is about 25 percent slower than running directly on the hardware RAID, though, tested on two different machines. Maybe this is where people aren’t seeing good scaling?

So…that’s a little bit more difficult to precisely ascertain because it depends a LOT on what I am doing/transferring the data for.

i.e. I don’t necessarily “copy” the data over the network all that often. Usually, if I am reading and/or accessing the data over IB, it’s because I am trying to act upon it. (For example, right now, I am running integrity checks on xz archive files, so the compute nodes in the cluster are reading the data directly off my micro HPC cluster headnode over IB.)

As such, NFS itself, I think, by default spawns somewhere between 8 and 20 nfsd processes, and each of those processes will report 100% CPU utilisation to top, and yet the load average would only be around 0.85. So, when it comes to measuring that, I don’t really have particularly great/accurate ways to measure CPU utilisation for my deployment.

I haven’t tested the headnode CPU utilisation when an application is using said NFSoRDMA though. The headnode hasn’t noticeably slowed down during transfers such that it would’ve caused me to even look at it in the first place.

(CPU utilisation testing when I was setting up my NFSoRDMA wasn’t a part of the testing suite because it isn’t like I am in a multi-user environment, where I am hammering the cluster headnode with multiple I/O requests from multiple users. It’s just me, so unless it slows down significantly, I wouldn’t even think nor bother to check what the CPU utilisation is.)
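(If I ever do want a rough number, something like this quick sketch is probably as far as I’d go. It assumes a standard knfsd setup where the thread count is exposed at /proc/fs/nfsd/threads and the kernel threads show up to ps under the name nfsd; note that the %CPU figure ps reports is an average since thread start, not an instantaneous reading.)

```python
# Rough snapshot of server-side knfsd load: thread count, summed %CPU of
# the nfsd kernel threads, and the 1-minute load average for comparison.
import subprocess

with open("/proc/fs/nfsd/threads") as f:
    nthreads = int(f.read().strip())

ps = subprocess.run(["ps", "-C", "nfsd", "-o", "%cpu="],
                    capture_output=True, text=True, check=True)
total_cpu = sum(float(line) for line in ps.stdout.split())

with open("/proc/loadavg") as f:
    load1 = f.read().split()[0]

print(f"nfsd threads: {nthreads}, summed %CPU: {total_cpu:.0f}, 1-min load: {load1}")
```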

The Highpoint link, sadly, is also a 404 for me.

The lack of scaling that I am seeing is more that, as I add more spindles to the RAID array, I am not getting a linear increase in read/write performance, regardless of whether that’s sequential and/or random performance.

But as I mentioned though, I am using spinning rust, so that might be expected.

(I don’t use NVMe SSDs in my cluster headnode because, yes, they are super fast, but since I can push hundreds of TB of data through them, being that fast also means I will burn through the finite write endurance limit that much faster. This is why I have “un”-deployed pretty much ALL SSDs, including SATA SSDs (I will just burn through the write endurance limit in hardly any time at all), and why I am (still) using more spinning rust than SSDs of any interface.)

My “ideal” solution would likely have been Intel DC PMem, but that division just got sold off to SK Hynix and Intel killed it, and GlusterFS no longer supports creating distributed volumes of RAM drives running over the network (i.e. a distributed Gluster volume of tmpfs mounts). It USED to be able to do that, but it can’t anymore.

So, I don’t really have any good solution for high-speed reads/writes that DOESN’T come with a finite number of program/erase cycles and is available at a decent/reasonable cost as well. (The solutions that do exist are WAYYYY too rich for my blood.)

(I’m running eight HGST 10 TB SAS 12 Gbps 7200 rpm HDDs because they were cheap.)
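(To put rough numbers on the endurance concern, here’s a quick sketch. The 600 TBW rating is typical of a 1 TB consumer NVMe drive’s spec sheet, and the monthly write volume is purely an illustrative assumption, not a measurement of my workload.)

```python
# Back-of-the-envelope endurance math behind sticking with spinning rust.
DRIVE_TBW = 600            # assumed rated terabytes-written for a ~1 TB consumer NVMe SSD
WRITES_TB_PER_MONTH = 100  # assumed write volume; illustrative only

months_to_exhaust = DRIVE_TBW / WRITES_TB_PER_MONTH
print(f"Rated endurance would be exhausted in about {months_to_exhaust:.0f} months "
      f"at {WRITES_TB_PER_MONTH} TB written per month")
```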

Screwed up the last post, let’s try again.
I tried using striped SATA SSD arrays and found the performance to be dreadful, despite trying hardware RAID, software RAID, and ZFS. However, when using ZFS in a two-disk configuration on platters with compression enabled, the performance was exceptional (700 MB/sec), and it will be my go-to for local backups. I am currently achieving speeds of around 3.5 GB/sec over the network to a 20-disk spinning rust array (Seagate 5400 RPM). Interestingly, I found that putting each disk on SATA controllers rather than SAS provided better performance. I’ll be moving the big server setup to a rackmount case soon for ease of maintenance.

Another update: I tried Windows software RAID on the Win2016 server. I still get 10 GB/sec or so over the network. Notably slower, but not slow.