Build Log - Ceph Cluster - Homelab redesign

The AIO cooler in my homeserver died two weeks ago, and I decided to move the existing machine from my Silverstone CS 381 into a 2U chassis (Silverstone RM23-502).

I was planning to build a Ceph cluster eventually, but this was pretty much the kickoff to buy new and fancy stuff.
For the time being, this will serve as a single-node cluster, mainly because I still lack the hardware and networking for the other nodes.

The full cluster will be 3x 2U nodes, or 2x 2U + 1x 4U (2x Ryzen, 1x EPYC), with 25GbE. Storage per node will be 2x HDDs I keep on using plus 2-4 enterprise SSDs of 2-8TB each.

Existing server is:
Ryzen 5900x
128GB ECC
AsRock Rack D4U-2L2T board
broken AIO cooler
6x 16T Toshiba HDDs, 2x 1TB M.2 consumer drives, 2x SATA SSD

Bought an IcyDock cage for 4x U.3 alongside 4x Micron 7400 (prices are outta control, got new 7400s for 65€/TB) and an MCIO PCIe card + cables.
The PSU gets replaced with a Be Quiet! Titanium one, and I got a low-profile Noctua cooler with 65mm height so it fits into 2U without the need for vacuum-cleaner-style 8k RPM cooling.

Here are some pics from the upgrades:

Will post progress. Can’t wait to get my server back up and running. And see if those expensive MCIO cables were worth it.

6 Likes
  • Stripped the old case of everything, about to transplant the board into the new chassis.

  • Installed the PSU in the front bracket (yeah, most 2U cases can only take an ATX PSU if it is mounted in the front). Not great engineering on Silverstone’s side for how the PSU fixes to the bracket, it basically encourages child labor (tiny hands required).

  • Installed HDDs and SSDs into their respective front and internal brackets. The internal (non-hot-swap) HDD mounting is the biggest downside of the case in terms of features. The front PSU blocks two potential 5.25" bays which could have held a 3x HDD hot-swap cage. 2U is a story of compromises. It’s fine to me.

  • Noticed that I didn’t order any 80mm case fans, so testing and closing the lid will have to wait until they get delivered (ordered 2x Noctua 40x20mm fans too, to replace the IcyDock stock fans).



First test run was surprisingly smooth… I expected some things not to work. But even the 4x new NVMe drives connected via 2x MCIO 8i cables on an x16 card were automatically detected. Temps are fine, everything is running…

Now installing Proxmox, burning in and running fio tests. But I’ll have to lower the fan speeds again after setting everything to high just in case.

2 Likes

Hardcore SYNC 4k blocksize with a queue depth of 1 on my new Micron 7400 drives.

fio --ioengine=libaio --filename=/dev/nvme2n1 --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio

IOPS=80.2k, BW=313MiB/s (328MB/s)(18.4GiB/60001msec); 0 zone resets

Sequential reads clock in at 6718MiB/s, pretty much standard PCIe 4.0 bandwidth. Sequential writes are 2.1GiB/s. Everything is pretty much in line with the official Micron specs.
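For completeness, the sequential read number came from a plain fio bandwidth run, something along these lines (exact block size and queue depth may differ from what I actually used):

  fio --name=seqread --ioengine=libaio --filename=/dev/nvme2n1 --direct=1 --rw=read --bs=1M --numjobs=1 --iodepth=32 --runtime=60 --time_based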

They may be last-gen SSDs, but with 80k IOPS on that sync-torture test, they pack a punch. All the consumer PCIe 4.0 drives I used for testing previously managed more like 1.5-2.5k IOPS on the same test. This cluster will never ever bottleneck on storage hardware.

System power draw changed very favorably: switched from a Gold to a Titanium PSU, 4 fewer HDDs but 4 more enterprise SSDs. Went from ~90W to ~70W idle. Didn’t check full-load figures yet, and the 3x 80mm fans are still missing. I didn’t expect that much in savings tbh, but I’ll take it.

1 Like

Hey Exard3k,

this is a great idea. I didn’t know server mainboards for desktop CPUs existed in this form. Am I correct in assuming that with this board you can put in that 5900X and have the system boot up without the need for a dedicated graphics card? That would be “ill”, at least from a technical standpoint :smiley:

1 Like

Welcome!

Thanks. Yeah, it just needs power and a network cable and you’re ready to go. You don’t even need a USB flash drive to install the OS: just upload the ISO via the WebUI and tell the BIOS to boot from “CD-ROM”. IPMI (basically on-board remote management) is a nice thing indeed. Asrock Rack and Supermicro both have boards for Core and Ryzen. Prices have gone up since I bought mine, not cheap these days. With iGPUs now in most CPUs, running a headless server without IPMI is much easier than it used to be. A computer technically just needs a CPU and memory; everything else is optional.
All I need is a browser to open the WebUI of the IPMI, Proxmox + other stuff running, and some means to connect via SSH, be it PuTTY or a Linux terminal. And without a GPU, you suddenly find yourself with 20W less idle power and 16 freed-up ultra-fast PCIe lanes that are just waiting for big networking or storage alternatives.
I will eventually add a GPU to the cluster for compute purposes and specific VM needs, but the HDMI/DP ports will share the same fate as the rear IO USB → nothing will ever be plugged into them.

Update on some stuff I tested yesterday:

  • Power:
    I ran some full-load stuff with y-cruncher (tough AVX2 code and memory-heavy), keeping all drives busy and running full bandwidth on both 2x 1GbE and 1x 10GbE. 145-150W at the wall. Idle bounces between 67-72W. The CPU is running at 65W TDP via ECO Mode, otherwise no BIOS or OS tweaking of power settings.
    The old ASPEED AST2500 BMC pulls 8-10W from the wall alone. That’s what I see when the system is powered down. There may be other MB/PSU things at work too, but the majority is certainly the IPMI/BMC.

  • Drives:
    U.3 drives get HOT. I switched off the fans to check smartctl temps for a single U.3. Even at idle, temps were slowly creeping past 60°C. The factory high-temp threshold is 70°C. You NEED airflow for these things. The IcyDock cage is built very sturdily and, being all-metal, spreads heat across the entire cage and the holding bracket very well. I love the engineering. (The commands I use for checking drive temps are after this list.)
    I’m gonna replace the stock fans with Noctua 40x20mm ones on Monday and will try to connect them to the board fan headers, so I can control fan speeds via IPMI. The stock fans on LOW are “noisy” and on HIGH it’s datacenter mode.

  • CPU Cooler: The Noctua NH-L9x65 has the lowest rating on Noctua’s scale, and the compatibility note for the 5900X/5900 basically stated “yeah, can’t really handle it”. But testing so far shows that the cooler can handle a y-cruncher workload at non-audible RPMs with the CPU at 65W TDP. It’s fine and I’m really happy about it.

  • Will wait with further testing and installs until Monday, when I install the row of 3x 80mm case fans. The on-board Intel X550 NIC gets really hot without airflow in the current open setup. The board relies on typical server airflow and uses passive heat sinks; both the X550 NIC and the X570 chipset need that airflow.
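For anyone wanting to keep an eye on the U.3 temps the same way, this is roughly what I run (the device name is just an example, adjust to your drive):

  smartctl -a /dev/nvme0 | grep -i temperature     # SMART temps as reported by smartmontools
  nvme smart-log /dev/nvme0 | grep -i temperature  # same info via nvme-cli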

Nice to hear it seems to work as it should! From the video it looks like the cables are IcyDock MB302L-B? And the MCIO adapter looks like it might be this Shinreal one?

Do you have AER enabled to monitor bus errors?

1 Like

Indeed. With all the talk about signal integrity and bad cables I’ve seen in my research (here and elsewhere), I went with MCIO (designed with Gen5 NVMe in mind, not repurposed 12-24G SAS cables and controllers) and validated IcyDock cables. I like their approach of including fitting, validated cables, albeit at a hefty premium; but buying a bad cable and then having to buy another is even more expensive.

The photo confirms it is the same MCIO card, same Shinreal label. I see we are men with good taste in hardware! :slight_smile:

I bought it from a French company that has a bunch of U.2/U.3 stuff in stock. Couldn’t find any other fitting card in the EU. MCIO is still very new, especially in retail channels, and availability and product range aren’t at (mini)SAS/OCuLink level yet.

No, I didn’t go that deep yet. So far I was mostly interested in getting stuff to run, checking whether performance numbers match expectations, letting it run for hours to check temps and stability, stress-testing the drives, etc.

I did notice that smartd reports some error entries in the Proxmox syslog after each boot. I haven’t gone down the nvme-cli and hexadecimal error-code rabbit hole yet. It doesn’t seem to be interrupting or hampering operations at first sight, and there are no LBA errors or anything else obvious in the SMART readings.
I will dig deeper into BIOS settings and error readings in the coming days. So far I don’t see anything of critical importance; it might well be some outdated firmware on the drives or that “quirky” board of mine.
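When I do dig in, the starting point will be roughly this (using /dev/nvme0 as an example device):

  nvme smart-log /dev/nvme0        # media errors, error log entry count, temps
  nvme error-log /dev/nvme0 -e 16  # dump the last 16 entries from the drive's error log
  journalctl -b | grep -i smartd   # what smartd actually complained about this boot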

In the end, all the drives are backed by Ceph, with CoW, periodic scrubbing and the whole paranoia mode of data integrity that doesn’t trust any hardware or firmware to begin with. And Ceph elevates that to the node/host level, not to individual drives or controllers.

I made sure my stuff and operations are cared for. Doesn’t mean I ignore obvious problems in the process :wink:

1 Like

Are you planning to run workloads on the cluster nodes themselves, or use them purely for storage? I’m interested to see the performance over the 25G connection, please post some tests when you get around to that!

@zibbp Welcome to L1T!

I’m going for an HCI approach with both compute (VMs, Kubernetes, legacy LXC) and storage (Ceph OSDs, monitors and MDS) running on each node. With Proxmox and SDN, I may pick up on the more network-oriented side of HCI, or FRR and a switchless cluster network… we’ll see. That’s the part I’m the least familiar with. But the entire thing is also meant for trying new things, learning, and evolving as a project.

I also have to account for VMs that need passthrough and a (probably virtualized, because hobby and $$$) replication cluster for async RBD mirroring, so the nodes will be heterogeneous in config.

The second node I will order and build in the coming weeks, while the 3rd node and 25GbE networking will happen “in 2025”. Until then I’ll use the 10GBase-T home network I currently have. And proper HA isn’t really happening until then, because running 2 nodes on a single machine isn’t HA :wink:

So don’t expect network and Ceph latency tuning from this thread in the near term. That said, my usual workload is IOPS- and latency-sensitive, and I will certainly try to get that 4k QD=1 write latency as low as possible.

edit: Shooting for min_size=2 with two-way replication on the NVMe and probably 2+1 or 4+2 EC on the HDDs.
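Roughly what that could look like on the CLI (pool names, PG counts and the exact EC profile here are placeholders, not final choices):

  ceph osd pool create nvme_rbd 128 128 replicated
  ceph osd pool set nvme_rbd size 2
  ceph osd pool set nvme_rbd min_size 2                 # blocks I/O as soon as one replica is gone
  ceph osd erasure-code-profile set hdd_2p1 k=2 m=1 crush-failure-domain=host crush-device-class=hdd
  ceph osd pool create hdd_ec 64 64 erasure hdd_2p1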

1 Like

Some updates on my progress:

  • Fans: Have arrived. The Noctua 40mm are solid. At 30% RPM (on board headers, controllable via IPMI), the NVMe drives run at 35-40°C under intense load. Can’t hear much from them; the HDDs next to them are way louder. I highly recommend the Noctua swap for IcyDock products. 40mm Noctuas aren’t cheap, but you get what you paid for.
    The cheap 80mm Be Quiet! fans… do their job. Nothing to write home about.

The only problem I’m facing is that there isn’t sufficient airflow reaching the on-board Intel 10Gbit NIC, and it overheats. No fan headers left, and the NIC heatsink is sandwiched between the hot air exhaust from the CPU and the IO shield (basically behind the CPU if you look from front to back). 100% (2300 RPM) on the 80mm fans won’t solve that, and I doubt 3k RPM fans would be an improvement. Bad board layout. Had to disable the NIC in the BIOS for the time being :frowning:
A 40mm fan on top of the heatsink could solve it, but there’s no fan header left. A self-made shroud to direct airflow around the CPU to the NIC heatsink could do it… or ripping out the IO shield and hoping the exhaust redirects more towards that heatsink.
The 5900X runs at 70-80°C now. Not great, but stable even under sustained all-core Ceph + y-cruncher load. Occasional spikes to 82-83°C on 1-2 core full boost… “all within spec”, as AMD likes to say.

  • Proxmox and Ceph, drives and Power:
    Got software and VMs up and running, and also made a virtual Rocky Linux test cluster with one NVMe passed through per VM (sketched right after this list item). Substantial improvement in performance: I’ve seen as much as 6GB/s sequential reads and 40k write IOPS with all defaults and no tweaks on anything.
    Did some power tweaking and I’m now at 57W-115W depending on load.
    On a full sequential write workload, Ceph uses 2-4 full threads per NVMe (load average was up to 4.6 with 8 assigned cores). Insanity :smiley:
    I have block storage (RBD) up and running on my workstation, but that 1Gbit network is a huge bottleneck atm. Still, it doesn’t feel less snappy than my SMB share from TrueNAS (using Dolphin). Time to get 25GbE earlier than intended.
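The per-VM NVMe passthrough mentioned above is just standard Proxmox PCI passthrough; a minimal sketch, assuming VM ID 101 and a PCI address picked from lspci (both placeholders), with IOMMU enabled in BIOS and kernel:

  lspci -nn | grep -i 'non-volatile'   # find the NVMe controllers and their PCI addresses
  qm set 101 --hostpci0 0000:41:00.0   # hand one controller to VM 101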

It’s all just some VMs atm, so no real latency between nodes, but also less CPU horsepower and fewer drives. Ceph scales with the number of nodes and OSDs (drives) after all. It will be interesting to see how the numbers change with more bare metal but also more network latency. I think I can break the 10GB/s barrier (at the client) easily with 3 nodes; IOPS will be a different beast.

The real difference is the PLP (power-loss protection) the U.3 drives have. It allows for fast sync writes (Ceph syncs everything, always). I made a test cluster before with passthrough of consumer NVMe drives… I got 100-300 IOPS. Now I get 40k. At least that’s more than ZFS gave me, because the drop of bitterness with Ceph is… you can’t beat ZFS on single-user random IO performance with a cluster, no matter what.
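To compare apples to apples once more nodes arrive, I’ll probably also benchmark cluster-side to take the client network out of the equation; a sketch, assuming a throwaway pool named testpool (hypothetical):

  rados -p testpool bench 60 write -b 4096 -t 1 --no-cleanup   # 4k writes, one op in flight
  rados -p testpool bench 60 seq -t 16                         # read the same objects back
  rados -p testpool cleanup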

  • Future plans: Haven’t decided on a board and CPU yet, but it will be a 2U pure I/O machine (4x U.3 + 2x HDD). The EPYC 4005 series just got released… found a nice, cheap AsRock Rack board (without overheating on-board NICs :wink: ). Will decide within 2 weeks what exactly is best for the stuff I want.
    And the 3rd node will be 4U because I need some space and slots for AI, a GPU, more HDD bays and stuff… probably a low-spec EPYC Siena. I want smart home, inferencing and AI tinkering at some point, and maybe a gaming VM, but not now. So this stuff is optional atm. Priority is getting things working for everyday use.

edit: Oh and @homeserver78 I enabled AER in the BIOS, and Proxmox throws an internal error on NVMe device passthrough along with kernel messages in the syslog. The VMs won’t start at all. Disabling AER makes that go away, but it obviously doesn’t fix the underlying errors. Not sure what to do. Runs smoothly with AER disabled. Ignorance is strength?
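What I’ll probably look at next is what the kernel actually logs with AER on, something like this (the PCI address is a placeholder):

  journalctl -k -b | grep -iE 'aer|pcieport'                        # AER/root-port messages from this boot
  lspci -vvv -s 0000:41:00.0 | grep -A4 'Advanced Error Reporting'  # AER capability/status of the device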

1 Like

I guess you can either troubleshoot it or ignore it? :slight_smile: Sorry, I don’t know what else to say! lol

I actually bought the wrong fans! I ordered the A4x20 FLX instead of the PWM variant. So I reconnected the 40mm fans to the IcyDock backplane with the low-noise adapter; if you can’t control the fan speed anyway, it really doesn’t matter where you plug them in :slight_smile:
That also freed up a header, so I put a spare 40mm fan on it to cool the on-board NIC. I’m back to 10Gbit networking again.

I set up VMs and containers for everyday use, and my desktop and other machines now have access to CephFS and block devices.

  • Benchmarks:
    Did some Dolphin and fio testing over 10G networking, measured at my desktop mounting an ext4-formatted RBD (Ceph block device, comparable to an iSCSI LUN); the mount itself is sketched after this list:
    Sequential writes: a solid 1.1GiB/s
    Sequential reads: 850MiB/s (not sure why yet); folders with 50k files drop down to ~350MiB/s
    Random 4k reads, QD=1 (fio): 1.8k IOPS
    Random 4k writes, QD=1 (fio): 800 IOPS
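The RBD setup on the desktop side was roughly this (pool/image names, size and mount point are placeholders):

  rbd create rbd/desktop-scratch --size 500G   # create the image
  rbd map rbd/desktop-scratch                  # maps to /dev/rbd0 (or similar)
  mkfs.ext4 /dev/rbd0
  mount /dev/rbd0 /mnt/rbd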

Doesn’t look too shabby by Ceph/cluster standards. Using all defaults, so there are still many best-practice parameters I haven’t touched yet.

Oh, and be careful with intense random writes, Ceph will totally take all your CPU cycles. The fio flags were 16 jobs and QD=32, which results in this kind of VM utilization at 6k IOPS (12 cores, 2x NVMe to feed):
[screenshot: ceph12core (VM CPU utilization during the run)]
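That torture run was something along these lines (block size and target file from memory, so approximate; the job count and queue depth are the ones mentioned above):

  fio --name=randwrite --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --numjobs=16 --iodepth=32 --size=10G --runtime=60 --time_based --filename=/mnt/rbd/fio.test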

I’ve narrowed down my shopping list now.

  • Build and Parts for 2nd Node:
    • CPU: Ryzen 7900 (probably the king of threads/€, and it meets the 65W TDP requirement)
    • Board: AsRock Rack EPYC4000D4U (made a thread about it, looks like it’s the best fit)
    • 2x Micron 32GB ECC UDIMM
    • Noctua NH-L9x65 CPU cooler
    • Silverstone RM23-502
    • Be Quiet! Dark Power 13 750W Titanium PSU
    • IcyDock 4x U.3 bay + 2x MCIO 8i → OCuLink cables
    • ConnectX-4 25Gbit NIC

Along with 2x HDDs I already have (I really have to fully migrate my ZFS pool before doing that, too lazy), and I’ll probably order two more U.3 drives plus the fans needed.

Feel free to comment on the parts and the build, it’s always good to hear other angles/perspectives. The intention is high-performance NVMe Ceph plus headroom for virtualization via Proxmox, while keeping costs and the power bill low for this kind of platform.

Still not full HA and I will still have to run a Rocky VM for Ceph.

Transitioning to SFP28 and 25Gbit is another matter I have to face (and pay for) at some point. DACs and connecting the nodes are the least of the problems there; a (MikroTik) switch and 2x >3m optical cables might be required.

I’ll ignore it for the time being, particularly because the drives won’t be passed through long-term and will remain bare metal. Scrubs so far were all fine… so there is no rampant data corruption from janky cabling going on. But I’ll keep monitoring this.

1 Like