Need advice on NVMe storage configuration for Proxmox

Hi, I’m upgrading the server infrastructure at our company. We are a small company: some employees work from home via Windows RDP, and some of our products run in k8s. Databases run on separate VMs, with replication handled at the database-server level. Right now we have four small standalone servers: two with Proxmox, one with VMware ESXi, and one with Windows Server.

I’m planning to replace this with a Proxmox cluster of three servers. I will reuse the newest and most powerful of our old servers, a Dell R720XD; its only role is to be the third node in the cluster.

Each server will have:

  1. Two SATA SSDs in ZFS RAID1, for the OS.
  2. 6×6TB HDDs in ZFS RAIDZ2, for slow data (rough pool sketch after this list).
  3. 2×1TB NVMe SSDs (Seagate FireCuda 510) in LVM RAID0 for Ceph, and maybe L2ARC for ZFS.
  4. A separate network for cluster communication over SFP+ (Intel X520 daughter card + MikroTik CRS309-1G-8S+IN).
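
For item 2, this is roughly the layout I have in mind. A minimal sketch, with hypothetical /dev/sdX device names (in practice I’d use /dev/disk/by-id/ paths); the OS mirror in item 1 is created by the Proxmox installer, so only the HDD pool is shown:

```python
#!/usr/bin/env python3
"""Sketch of the planned RAIDZ2 slow-data pool (item 2 above)."""
import subprocess

# Hypothetical device names; verify with `lsblk` and prefer /dev/disk/by-id/ paths.
HDDS = [f"/dev/sd{letter}" for letter in "cdefgh"]

# One RAIDZ2 vdev over all six 6TB disks: any two drives can fail without data loss.
subprocess.run(
    ["zpool", "create", "-o", "ashift=12", "tank", "raidz2", *HDDS],
    check=True,
)
```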

I’m aware that the Seagate FireCuda 510 is not built for enterprise or server loads. My current statistics (a rough wear projection follows the list):

  1. 100 TB written on one SSD over 6-7 years.
  2. 70 TB written on another SSD over 3 years.
  3. We consume about 100 GB of RAM across all servers.
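
Back-of-the-envelope, here is that write rate against the drive’s rated endurance over my five-year target; the ~1300 TBW figure for the 1TB FireCuda 510 is from memory, so treat it as an assumption and check the datasheet:

```python
# Rough wear projection based on the statistics above.
observed_tb_written = 100      # TB written on the busier drive
observed_years = 6             # over roughly 6-7 years
target_years = 5
rated_tbw = 1300               # assumed endurance rating for the 1TB FireCuda 510 (verify)

write_rate = observed_tb_written / observed_years      # ~17 TB/year
projected = write_rate * target_years                  # ~83 TB over 5 years

# Ceph/ZFS write amplification will multiply the real figure, but even several
# times this projection stays well under the assumed rating.
print(f"{write_rate:.0f} TB/year -> {projected:.0f} TB in {target_years} years "
      f"({projected / rated_tbw:.0%} of {rated_tbw} TBW)")
```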

I don’t want to use ZFS on the NVMe drives because they are not built for enterprise loads and could wear out too quickly. My goal is to run the new cluster for five years; that is why I want to try Ceph on the NVMe drives. I will use this storage to hold data like databases, or critical OS disks if the HDD storage turns out to be too slow.

I’m aware that Ceph on top of ZFS is not the best solution; if I’m wrong, please correct me.

I have about three months to set up this cluster and run some tests before switching the load over to it.

I will use a PCIe-to-M.2 adapter and add protection against data loss on power failure. We have a 3 kVA APC UPS and plan to load it to no more than 25-33%. I already added a script that checks the battery status; if it gets too low, it tells the servers to shut down properly (a rough sketch is below).
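
A minimal sketch of that check, assuming apcupsd’s `apcaccess` is available; the hostnames, threshold, and SSH-based shutdown are placeholders (apcupsd/NUT can also trigger shutdowns natively, which is usually the cleaner route):

```python
#!/usr/bin/env python3
"""Shut the cluster down cleanly when the UPS battery gets low."""
import subprocess

CHARGE_THRESHOLD = 40.0  # percent; placeholder, tune to the real battery runtime
HOSTS = ["pve1.example.lan", "pve2.example.lan", "pve3.example.lan"]  # hypothetical names


def battery_charge() -> float:
    """Return the battery charge percentage reported by `apcaccess status`."""
    out = subprocess.run(
        ["apcaccess", "status"], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        if line.startswith("BCHARGE"):
            # Line looks like: "BCHARGE  : 100.0 Percent"
            return float(line.split(":", 1)[1].split()[0])
    raise RuntimeError("BCHARGE not found in apcaccess output")


def shutdown_hosts() -> None:
    """Ask each node to power off cleanly (assumes key-based root SSH)."""
    for host in HOSTS:
        subprocess.run(["ssh", f"root@{host}", "shutdown", "-h", "now"], check=False)


if __name__ == "__main__":
    if battery_charge() < CHARGE_THRESHOLD:
        shutdown_hosts()
```

In practice this would run from cron or a systemd timer on a box outside the cluster.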

Configuration in short:

Two main servers, Dell R730XD (2× E5-2680 v4, 256GB RAM, 2× 256GB SATA SSD, 6× 6TB HDD, Intel X520 + i350 network card, dual 1100W PSU)

Third server, Dell R720XD (2× E5-2680 v2, 512GB RAM, 2× 256GB SATA SSD, 6× 6TB HDD, Intel X520 + i350 network card, dual 1100W PSU)

My questions are:

How bad is it to mix ZFS and Ceph in one cluster?

Could you share best practices, or suggest a different solution for my setup?

Never use RAID0; it’s a sure way to lose data at some point. Either put them in RAID1 or keep their cache functions separate on dedicated drives. LVM for cache is probably a waste of time; a regular journaling FS (like ext4, JFS, XFS, or Btrfs) suffices.

That is why I want to use Ceph on top of the NVMe drives, so I can use RAID0 without serious consequences. ZFS L2ARC is optional; for now I’m not planning to set it up. We don’t generate much IO and don’t expect huge growth in the future.
I will not use the LVM cache function.

I understand where you’re coming from, but it’s another layer to troubleshoot when things go belly up. When, not if! But the call is yours, of course!

I can’t see any advantage to NVMe in RAID0. The only reason to use RAID0 is speed, but even a single NVMe Gen 3 drive is fast enough to saturate a 10 Gb Ethernet interface.
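
For reference, the rough numbers behind that claim, assuming about 3.5 GB/s sequential throughput for a single Gen 3 x4 drive (check the drive’s datasheet):

```python
# 10 GbE line rate versus a single PCIe Gen 3 x4 NVMe drive (assumed ~3.5 GB/s sequential).
nic_gbps = 10
nic_gbytes_per_s = nic_gbps / 8            # 1.25 GB/s on the wire, before protocol overhead
nvme_gbytes_per_s = 3.5                    # assumed sequential throughput of one drive

print(f"NIC: {nic_gbytes_per_s:.2f} GB/s, single NVMe: {nvme_gbytes_per_s:.1f} GB/s "
      f"-> one drive is ~{nvme_gbytes_per_s / nic_gbytes_per_s:.1f}x the network")
```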

Do JBOD instead, it’s so much more reliable. Or move up to a single 2 TB drive. Both are better solutions in this case.

I see, thank you for your advice. I appreciate it.

I chose the 1TB Seagate drive because it’s rated for higher random read IO than the 2TB model (link); I don’t know why. So your suggestion is to use just JBOD instead of RAID0? Could I instead use a policy of one Ceph OSD per NVMe drive (rough sketch below), or is that bad practice?
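
Something like this is what I mean by one OSD per drive: a minimal sketch, assuming hypothetical device names and that Ceph and the monitors are already set up on each Proxmox node (`pveceph osd create` is the Proxmox VE CLI for adding an OSD):

```python
#!/usr/bin/env python3
"""One Ceph OSD per whole NVMe drive on a Proxmox node (no local RAID underneath)."""
import subprocess

# Hypothetical device names; verify with `lsblk` before running.
NVME_DEVICES = ["/dev/nvme0n1", "/dev/nvme1n1"]

for dev in NVME_DEVICES:
    # Each whole drive becomes its own OSD; redundancy comes from Ceph
    # replication across the three nodes, not from a local RAID layer.
    subprocess.run(["pveceph", "osd", "create", dev], check=True)
```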
