I was tasked with setting up a server with 24 NVMe 1.92TB drives from Samsung. As I am fairly new to NVMe servers and their RAID configurations (until now I have run all servers on VMware with hardware RAID cards), I decided to kindly ask you for help with getting the most out of those drives.
Most of the files will be small (like websites), so sequential performance is not the top priority.
The server will run proxmox, and as we’re still waiting for GRAID pricing, it was decided that those disks will be run in ZFS RAID10 in proxmox.
The system itself is installed on RAID1 of Samsung SSD 980 PRO 250GB.
Here is the NVME list of those drives (I got confused why they are randomly formatted to different sizes, should I do something with it and why?):
So can you please help me with preparing those disks for the best performance and creating the ZFS pool correctly? I learned about ashift as it’s the only option I see in proxmox that can be changed when creating the pool. Or should I even use ZFS for this?
If there is any other information I should provide, please let me know.
I’m not sure about the proxmox angle, but ashift sets the sector size as a power of two - 12 is 4K (what you want). I think the default is 9 (512 bytes), so “format” could be that.
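To make the ashift numbers concrete, here is a minimal sketch of the arithmetic. The helper name ashift_to_bytes is just something made up for illustration; the only fact it relies on is that ashift is the base-2 logarithm of the sector size ZFS assumes.

```shell
# ashift is the base-2 logarithm of the sector size ZFS assumes for a vdev.
# ashift_to_bytes is a hypothetical helper, not part of any ZFS tool.
ashift_to_bytes() {
    echo $(( 1 << $1 ))
}

# Print the sector size for the values usually discussed.
for a in 9 12 13; do
    echo "ashift=$a -> $(ashift_to_bytes "$a") bytes"
done
```

So 9 means 512-byte sectors, 12 means 4 KiB, and 13 means 8 KiB.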
The structure of the pools depends on what you mean by performance - and don’t forget you may only need a certain amount before being limited by overall NVME bandwidth, network bandwidth etc.
In order to know whether ZFS can fulfill your needs, you need to know what kind of performance your workload actually requires, and balance that with things like your acceptable level of write amplification as well as features (snapshots, data safety with checksums, etc).
ZFS continues to have issues with getting appropriate performance from NVMe drives because of its historical hard-coded assumptions from a time when flash drives were just experimental tiny things costing tens of thousands of dollars. This is actively, but very slowly, being worked on. If the performance as-is is acceptable, then great! But with the expense of a flash array, there’s probably a minimum level of performance involved. You’ll have to dig through GitHub comments, Reddit posts, and ycombinator for some arcane things to try with NVMe-specific tuning, if out-of-the-box performance isn’t quite enough.
Which leads us to the next problem: have you figured out how to replicate your workload with benchmarks so that you can tell if you are hurting or helping things? There are a large number of gotchas that can give you false results. Be warned that “popular” articles on this subject can also be flawed, or completely inapplicable to your own situation. Also note that benchmarks on new and fresh arrays do not tell you how the array will perform when it is very mature and mostly full.
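As a rough starting point for approximating a small-file, random-IO workload, something like the fio invocation below may be closer than a sequential copy test. This is only a sketch: the 70/30 read/write mix, job count, queue depth, file size and runtime are all assumptions to be replaced with numbers from your real workload, and fio_randrw_cmd is a hypothetical helper that only prints the command so you can review it before running it.

```shell
# Print (do not run) a fio command for a mixed random read/write test.
# Usage: fio_randrw_cmd TEST_DIR [BLOCK_SIZE]
# All workload numbers below are illustrative assumptions.
fio_randrw_cmd() {
    dir=$1
    bs=${2:-4k}
    echo "fio --name=smallfile --directory=$dir --rw=randrw --rwmixread=70 --bs=$bs" \
         "--size=4G --numjobs=8 --iodepth=16 --ioengine=libaio" \
         "--runtime=60 --time_based --group_reporting"
}

# Example: a test directory on the pool, with 16 KiB blocks.
fio_randrw_cmd /nvme/fiotest 16k
```

Run it against an empty test directory on the pool, repeat it a few times, and compare results from the host and from inside a VM using the same parameters.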
The people that can really tell you what and how to do things as best as possible, tend to hold their cards to their chest because that’s how they make money. So much of what you see suggested on the internet tends to be cargo-cultish by hobbyists who use ZFS for their home NAS with HDDs, which is exactly what I am.
Most of the files will be small (like websites), so sequential performance is not the top priority.
The server will run proxmox, and as we’re still waiting for GRAID pricing, it was decided that those disks will be run in ZFS RAID10 in proxmox.
If you mean you’ll have multiple mirror vdevs, this is probably a good choice for small files and random IO. When files get down in the 8-16KiB range, RAIDZ loses its space saving advantages as the data is literally just mirrored anyways.
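For illustration, a “RAID10” pool in ZFS terms is just a pool made of several two-way mirror vdevs. The sketch below only prints the zpool create command for review; build_mirror_pool_cmd is a hypothetical helper, and the pool name tank and the device names are placeholders (in practice /dev/disk/by-id paths are usually preferred).

```shell
# Print (do not run) a zpool create command that pairs devices into 2-way mirrors.
# Usage: build_mirror_pool_cmd POOL ASHIFT DEV1 DEV2 [DEV3 DEV4 ...]
build_mirror_pool_cmd() {
    pool=$1
    ashift=$2
    shift 2
    # Mirrors are built from consecutive pairs, so the count must be even.
    if [ $# -lt 2 ] || [ $(( $# % 2 )) -ne 0 ]; then
        echo "need an even number of devices" >&2
        return 1
    fi
    cmd="zpool create -o ashift=$ashift $pool"
    while [ $# -gt 0 ]; do
        cmd="$cmd mirror $1 $2"
        shift 2
    done
    echo "$cmd"
}

# Example with four placeholder devices (two mirror vdevs).
build_mirror_pool_cmd tank 12 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
```

With 24 drives this would give 12 mirror vdevs striped together, at 50% usable capacity.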
You’ll want to manually set ashift to equal 12 or 13 most likely. There’s no way to tell which is preferable without benchmarking and looking at write amplification stats. 12 should be fine though as a “just pick one”, as NVMe drives tend to be designed for 4K sectors. I suppose that’s another thing, make sure your drives themselves are set to use 4K and not 512B sectors on their (likely single) namespace.
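To check what a namespace is currently formatted as, nvme-cli can list the supported LBA formats and mark the one in use. The sketch below assumes the human-readable output of nvme id-ns -H contains lines of the form “LBA Format N : Metadata Size: ... - Data Size: ... bytes ... (in use)”; in_use_lba_size is a hypothetical helper that extracts the data size from that text.

```shell
# Read `nvme id-ns -H` output on stdin and print the data size (in bytes)
# of the LBA format marked "(in use)".
in_use_lba_size() {
    awk '/\(in use\)/ {
        for (i = 2; i < NF; i++)
            if ($(i-1) == "Data" && $i == "Size:") { print $(i+1); exit }
    }'
}

# Usage on real hardware (needs root; device name is a placeholder):
#   nvme id-ns -H /dev/nvme0n1 | in_use_lba_size
#
# Switching formats erases the namespace. The index passed to --lbaf differs
# between drive models, so 1 below is only an assumption; check the listing first.
#   nvme format /dev/nvme0n1 --lbaf=1
```

If this prints 512, the namespace is still using 512-byte sectors and would need reformatting before the pool is created.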
You may want to over-provision those SSDs, so that you don’t need to trim them later on. (ZFS autotrim is just not mature yet.)
The usual practice is to use only about 90% of the available space. You can run “blkdiscard” on the full drive and then create a partition covering about 90% of it. Then, create ZFS on top of those partitions.
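Roughly, the steps above could look like the sketch below. It only prints the commands rather than running them, because blkdiscard wipes the whole drive. overprovision_cmds is a hypothetical helper, the device name is a placeholder, and using parted with a GPT label is my assumption; any partitioning tool would do.

```shell
# Print (do not run) the commands to discard a drive and then create one
# partition covering the first PCT percent of it (default 90).
# Usage: overprovision_cmds DEVICE [PCT]
overprovision_cmds() {
    dev=$1
    pct=${2:-90}
    echo "blkdiscard $dev"
    echo "parted -s $dev mklabel gpt mkpart primary 1MiB ${pct}%"
}

# Example for a single placeholder drive.
overprovision_cmds /dev/nvme0n1 90
```

The pool would then be created on the resulting partitions instead of the whole disks.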
Out of curiosity, do you have access to firmware updates for the PM1733 SSDs? If I remember correctly there have been some fixes for systems where SR-IOV is enabled, but given it’s Samsung and the PM1733 were their first PCIe Gen4 SSDs I would want to run the latest firmware versions from the get-go.
Hello and thank you all for your answers. I forgot to enable notifications and then forgot to reply, I am sorry. I will do my best to reply to everybody.
So first of all, the firmware is an issue. My colleague tried to “get the hardware” before I joined the company, so even though we have those PM1733 SSDs, some of them are labeled NETAPX: made by Samsung, but for another company. We did get the latest firmware onto the PM1733 drives, but although our barebone supplier tried, we were unable to get any firmware or information for the NETAPX ones, as they were part of a private deal.
I planned to run it in a single RAID10 ZFS pool made by proxmox, as I need to give the majority of the storage to the webserver. Is there any strong reason to split that into multiple pools and assign sections to clients (create VMs on different pools)?
To be honest, I do not have a plan for how to benchmark specific options. My idea when writing this post was to get some best practices and insights, as I can only agree that the articles out there say seemingly random and often contradictory things.
I can also confirm I set all those drives to 4K sectors.
And last, I am sorry if I look absolutely stupid and am not answering your questions; I am just really new to ZFS and NVMe and trying hard to get it set up right. Now that I have found out there was an issue with the barebone and it was sent back for repair, I at least have more time to learn and make it work.
Unfortunately I don’t have any access to PM1733 firmware updates, but I am also very interested in getting some, since the PM1733 7.68 TB models I’ve been using (directly from Samsung, not third-party branded in any way) still have their initial manufacturing firmware from 2020.
hey I posted the updates for the pm1733 in another forum. forums dot servethehome dot com slash index.php?threads/firmware-package-for-samsung-sm883-mz7kh3t8hals.37154/page-3#post-373568
Do you know which firmware is the latest one for the PM1733 7.68 TB models (purchased 2020) with the model name MZWLJ7T6HALA-00007? Is it “General_PM1733_EVT0_EPK9CB5Q.bin” (am I reading these Excel files correctly?)?
Is this “Samsung Magician Software for Enterprise SSD” the proper tool for this job?
Make sure you do a low-level format before you create your pool. Most NVMEs will pretend to have 512 byte sectors, so as to be compatible with MS-DOS, but you get better performance if you tell them to report their actual block size (probably 4k) instead.
Just dropping in to provide an update on our project and to seek further advice. Apologies for the delay in responding, I’ve been busy with finals at school, and we also had to send the server back for a warranty claim due to an issue with one of the NVMe bays.
Despite these setbacks, I’ve been actively exploring different configurations and learning as much as I can. I’ve formatted the drives to 4K blocks and used ashift=12 as suggested in previous discussions.
I’ve also been running some performance tests. Using the command dd if=/dev/zero of=/nvme/test1.img bs=5G count=1 oflag=dsync, I was able to achieve around 1.7GB/s directly on the Proxmox host over SSH, but the performance dropped to around 833MB/s when running the same test in a Linux VM.
For context, our current setup includes a R272-Z34 server, equipped with an AMD EPYC 7H12 processor, 512GB RAM, and 24 SAMSUNG MZWLJ1T9HBJR-00007 P2 drives. Despite this, I’m uncertain about whether the 1.7GB/s speed is even up to par. To be honest, I’m not entirely sure what kind of performance I should be expecting from our setup, so if anyone could provide some insight into this, I’d greatly appreciate it.
Sadly, the GRAID we’ve been waiting for is facing further delays, so we’re now more seriously considering software RAID options. I’d be grateful for any guidance or tips on optimizing a software RAID setup for our NVMe drives within Proxmox and improving the performance discrepancy between Proxmox SSH and the VM.
Thank you for your patience and all the help provided so far. Looking forward to your insights.