Hi! I just got a few servers with PM983 NVMe disks. I’m going to run ZFS on them, but I’m not sure what setting I should use for ashift.
ZFS uses ashift=9 by default as that is what the disks report, but when using smartctl -a this is what I get:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-42-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZ1LB1T9HALS-00007
Serial Number: SXXXXXXXXXXXX
Firmware Version: EDA7602Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1,920,383,410,176 [1.92 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,920,383,410,176 [1.92 TB]
Namespace 1 Utilization: 225,070,653,440 [225 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Sat Oct 17 18:17:29 2020 UTC
Firmware Updates (0x17): 3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x000f): Security Format Frmw_DL NS_Mngmt
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 86 Celsius
Critical Comp. Temp. Threshold: 87 Celsius
Namespace 1 Features (0x02): NA_Fields
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 8.00W - - 0 0 0 0 0 0
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
1 - 4096 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 27 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 779,217 [398 GB]
Data Units Written: 1,317,842 [674 GB]
Host Read Commands: 1,217,818
Host Write Commands: 15,005,794
Controller Busy Time: 13
Power Cycles: 40
Power On Hours: 584
Unsafe Shutdowns: 33
Media and Data Integrity Errors: 0
Error Information Log Entries: 5
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 27 Celsius
Temperature Sensor 2: 34 Celsius
Temperature Sensor 3: 47 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
Does the drive work with both 512B and 4096B internally, or what does this mean? What setting should I use for ashift? Most writes will likely be from MariaDB/MySQL (16 KB pages).
Btw, if you’re doing InnoDB, make sure you enable compression – it actually means more data can be kept in buffers in RAM.
Yes. You could format the namespace with either size. It doesn’t really matter much, because the flash uses some other, bigger size internally anyway; the flash translation layer handles the mapping.
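If you do want to switch the namespace to the 4096-byte format (LBA format 1 in the smartctl output above), nvme-cli can do it. A sketch only – the device path is an example, and a format wipes the namespace, so it’s shown commented out:

```shell
# List the supported LBA formats and which one is currently in use
# (same info smartctl printed; /dev/nvme0n1 is an example path)
nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"

# DESTRUCTIVE: reformat the namespace to LBA format 1 (4096-byte sectors).
# Back up first, then uncomment:
# nvme format /dev/nvme0n1 --lbaf=1
```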
I’ve been using the sysbench oltp_read_write benchmark to try to find the difference between ashift=9 and ashift=12 on these drives, but the CPU seems to be the bottleneck: I’m only seeing a couple of percent difference. I’m getting ~5.5K transactions and ~110K queries/sec in that benchmark.
Any suggestions for a better database-related benchmark that doesn’t stress the CPU as much?
I also used the fileio tests included with sysbench, without seeing much difference. Does 512B vs 4096B really not matter that much on SSDs, or should I expect to see a clear difference?
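For reference, the fileio runs looked roughly like this (a sketch; the file size, runtime, and block size are just what I picked):

```shell
# Prepare test files, run random read/write with 16K blocks and O_DIRECT
# (to match the InnoDB page size and bypass the page cache), then clean up
sysbench fileio --file-total-size=32G prepare
sysbench fileio --file-total-size=32G --file-test-mode=rndrw \
  --file-block-size=16384 --file-extra-flags=direct --time=120 run
sysbench fileio --file-total-size=32G cleanup
```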
You should read the article I linked because it covers that.
You should read the article Dynamic linked, and maybe consider dropping the recordsize a little and enabling compression (if the data is read back a lot).
From a very old article
No, it doesn’t.
It doesn’t mention anything about a good way to benchmark this, nor does it have any numbers on the difference between right and wrong ashift values. It does say that a too low value will “cripple” performance. Since I’m seeing basically no difference, is ashift=9 likely the way to go?
Strange, the top of this thread says, “It’s been a while since we’ve seen Nightly — their last post was 2 years ago.” Meanwhile, you just replied…
I did read it, and it doesn’t answer any of the questions. I know what ashift does, and both the database and the dataset have the correct settings.
The question is about ashift and what the best setting for these drives is. I would not expect to see basically the same performance with ashift=9 and ashift=12. What would be the best way to find the right ashift value? Or does it (9 vs 12) really not matter on these drives?
Fifth (and final) level: ashift
Ashift is the property which tells ZFS what the underlying hardware’s actual sector size is. The individual blocksize within each record will be determined by ashift; unlike recordsize, however, ashift is set as a number of bits rather than an actual number. For example, ashift=13 specifies 8K sectors, ashift=12 specifies 4K sectors, and ashift=9 specifies 512B sectors.
Ashift is per vdev, not per pool – and it’s immutable once set, so be careful not to screw it up! In theory, ZFS will automatically set ashift to the proper value for your hardware; in practice, storage manufacturers very, very frequently lie about the underlying hardware sector size in order to keep older operating systems from getting confused, so you should do your homework and set it manually. Remember, once you add a vdev to your pool, you can’t get rid of it; so if you accidentally add a vdev with an improper ashift value to your pool, you’ve permanently screwed up the entire pool!
Ashift too high is, for the most part, harmless – you’ll increase the amount of slack space on your storage, but unless you have a very specialized workload this is unlikely to have any significant impact. Setting ashift too low, on the other hand, is a horrorshow. If you end up with an ashift=9 vdev on a device with 8K sectors (thus, properly ashift=13), you’ll suffer from massive write amplification penalties as ZFS needs to write, read, rewrite again over and over on the same actual hardware sector. I have personally seen improperly set ashift cause a pool of Samsung 840 Pro SSDs perform slower than a pool of WD Black rust disks!
Even if you’ve done your homework and are absolutely certain that your disks use 512B hardware sectors, I strongly advise considering setting ashift=12 or even ashift=13 – because, remember, it’s immutable per vdev, and vdevs cannot be removed from pools. If you ever need to replace a 512B sector disk in a vdev with a 4K or 8K sector disk, you’ll be screwed if that vdev is
Just set it to 12 and forget about it.
If you set it to 9, proceed at your own peril.
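For reference, ashift is set per vdev at pool creation time. A minimal sketch – the pool name and device paths are placeholders:

```shell
# Create a mirror pool with 4K sectors forced (placeholder names)
zpool create -o ashift=12 tank mirror /dev/nvme0n1 /dev/nvme1n1

# Verify what was actually used, per vdev
zdb -C tank | grep ashift
# or, on recent OpenZFS:
zpool get ashift tank
```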
Those drives can be formatted for 512B or 4K (the controller can work with either size), and that one is currently set to 512B.
You can change it to 4K, but I would suggest not reformatting it, for later flexibility if needed.
I would also suggest ashift=12, because even though that drive is set to 512, a later replacement might not be, and neither might an addition to the vdev (making a mirror deeper, splitting a mirror, etc.). And as you already proposed, it is easier for systems to use a 4K drive as 4K than a 512 drive as 4K.
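To sanity-check what sector sizes the kernel currently sees for that namespace (standard Linux sysfs paths; the device name is an example):

```shell
# Logical vs physical sector size as reported to the kernel
cat /sys/block/nvme0n1/queue/logical_block_size
cat /sys/block/nvme0n1/queue/physical_block_size
```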
I apologise for mentioning the recordsize, which is more to do with the workload; you mentioned the database’s 16K records and were looking to optimise performance, which is why I brought it up.
There have been reports of disappointing performance with some of the U.2 drives on certain controllers, so it would be really helpful for posterity if you could share your results, though this is outside the scope of your request. (Unless it’s an M.2 drive; they seem okay.)
“If you end up with an ashift=9 vdev on a device with 8K sectors (thus, properly ashift=13), you’ll suffer from massive write amplification penalties as ZFS needs to write, read, rewrite again over and over on the same actual hardware sector.”
I wanted to ask whether this changes at all with how SSDs buffer writes in DRAM cache. I don’t know much about this, so this is a question and not a statement. But if the SSD’s internal sector size is, for example, 16K and ashift is set to 12 (4K), can the SSD’s DRAM cache hold 4 × 4K writes and then write the full 16K to the NAND flash at once? Can buffering remove the write amplification?
That’s not a bad question, even for drives with real/fake SLC cache.
Be aware that at least a bit of this info is outdated. For example, L2ARC is not necessarily ephemeral anymore.