ZFS and ashift with Samsung NVMe SSDs

Hi! I just got a few servers with PM983 NVMe disks. I’m going to run ZFS on them, but I’m not sure what setting I should use for ashift.

ZFS uses ashift=9 by default as that is what the disks report, but when using smartctl -a this is what I get:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-42-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZ1LB1T9HALS-00007
Serial Number:                      SXXXXXXXXXXXX
Firmware Version:                   EDA7602Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,920,383,410,176 [1.92 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,920,383,410,176 [1.92 TB]
Namespace 1 Utilization:            225,070,653,440 [225 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Sat Oct 17 18:17:29 2020 UTC
Firmware Updates (0x17):            3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x000f):   Security Format Frmw_DL NS_Mngmt
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     86 Celsius
Critical Comp. Temp. Threshold:     87 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.00W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        27 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    779,217 [398 GB]
Data Units Written:                 1,317,842 [674 GB]
Host Read Commands:                 1,217,818
Host Write Commands:                15,005,794
Controller Busy Time:               13
Power Cycles:                       40
Power On Hours:                     584
Unsafe Shutdowns:                   33
Media and Data Integrity Errors:    0
Error Information Log Entries:      5
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               27 Celsius
Temperature Sensor 2:               34 Celsius
Temperature Sensor 3:               47 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

Does the drive work with both 512B and 4096B internally, or what does this mean? What setting should I use for ashift? Most writes will likely come from MariaDB/MySQL (16 KB).

1 Like

You should likely set ashift to 12.

https://jrs-s.net/2018/08/17/zfs-tuning-cheat-sheet/
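
For reference, ashift is set per vdev at pool creation and can’t be changed afterwards. A minimal sketch of forcing it, assuming a hypothetical pool name and device path:

# force ashift=12 (4K sectors) at creation; "tank" and /dev/nvme0n1 are just examples
zpool create -o ashift=12 tank /dev/nvme0n1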

2 Likes

Btw, if you’re doing InnoDB, make sure you enable compression; it actually means more data can be kept in the buffers in RAM.
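
Assuming that means InnoDB’s own table compression (rather than ZFS-level compression), a minimal sketch with a made-up schema and table name:

# enable compressed row format on an InnoDB table; "mydb.mytable" is a placeholder
mysql -e "ALTER TABLE mydb.mytable ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;"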

Yes. You could format the namespace with either size. It doesn’t really matter, because the flash uses some other, bigger size internally anyway; the translation layer handles it.
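
If you ever did want to switch the namespace to 4K LBAs, nvme-cli can do it; note that a format wipes the namespace, so treat this as a sketch only (device path is an example):

# show the supported LBA formats (matches the smartctl table above)
nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"

# DESTRUCTIVE: reformat the namespace to LBA format 1 (4096-byte data size)
nvme format /dev/nvme0n1 --lbaf=1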

2 Likes

I’ve been using the sysbench oltp_read_write benchmark in an attempt to find the difference between ashift=9 and ashift=12 on these drives, but the CPU seems to be the bottleneck, so I’m only seeing a couple of per cent difference between them. I’m getting ~5.5K transactions and ~110K queries/sec in that benchmark.

Any suggestions for a better database-related benchmark that doesn’t stress the CPU as much?

I also used the fileio tests included with sysbench, without seeing much difference. Does 512B vs 4096B really not matter that much on SSDs, or should I expect to see a clear difference?
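
One way to take MySQL and the CPU mostly out of the picture is to hit the dataset directly with fio at the database’s write size; something along these lines, where the directory, file size and runtime are placeholders:

# 16K random writes with an fsync after each write, against a file on the dataset under test
fio --name=db16k --directory=/tank/mysql --size=8G \
    --rw=randwrite --bs=16k --ioengine=psync --fsync=1 \
    --numjobs=4 --runtime=120 --time_based --group_reporting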

You should read the article I linked because it covers that.

1 Like

You should read the article Dynamic linked, and maybe consider dropping the recordsize a little and enabling compression (if the data is read back a lot).
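
For a MySQL/InnoDB dataset that might look something like this (pool/dataset names are placeholders, and lz4 is just the commonly suggested default):

# match recordsize to the 16K InnoDB page size and enable cheap compression
zfs create -o recordsize=16k -o compression=lz4 tank/mysql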

From a very old article:
https://www.usenix.org/system/files/login/articles/login_winter16_09_jude.pdf

Edit: the last section mentions write amplification harming performance and drive life.

No, it doesn’t.

It doesn’t mention anything about a good way to benchmark this, nor does it have any numbers on the difference between right and wrong ashift values. It does say that too low a value will “cripple” performance. Since I’m seeing basically no difference, is ashift=9 likely the way to go?

Strange, the top of this thread says, “It’s been a while since we’ve seen Nightly — their last post was 2 years ago.” Meanwhile, you just replied…

I did read it, and it doesn’t answer any of the questions. I know what ashift does, and both the database and the dataset have the correct settings.

The question is about ashift and what the best setting for these drives is. I would not expect to see basically the same performance with ashift=9 and ashift=12. What would be the best way to find the right ashift value? Or does it (9 vs 12) really not matter on these drives?

1 Like

Last entry on page 1 of https://jrs-s.net/category/open-source/zfs/

Fifth (and final) level: ashift

Ashift is the property which tells ZFS what the underlying hardware’s actual sector size is. The individual blocksize within each record will be determined by ashift; unlike recordsize, however, ashift is set as a number of bits rather than an actual number. For example, ashift=13 specifies 8K sectors, ashift=12 specifies 4K sectors, and ashift=9 specifies 512B sectors.

Ashift is per vdev, not per pool – and it’s immutable once set, so be careful not to screw it up! In theory, ZFS will automatically set ashift to the proper value for your hardware; in practice, storage manufacturers very, very frequently lie about the underlying hardware sector size in order to keep older operating systems from getting confused, so you should do your homework and set it manually. Remember, once you add a vdev to your pool, you can’t get rid of it; so if you accidentally add a vdev with improper ashift value to your pool, you’ve permanently screwed up the entire pool!

Setting ashift too high is, for the most part, harmless – you’ll increase the amount of slack space on your storage, but unless you have a very specialized workload this is unlikely to have any significant impact. Setting ashift too low, on the other hand, is a horrorshow. If you end up with an ashift=9 vdev on a device with 8K sectors (thus, properly ashift=13), you’ll suffer from massive write amplification penalties as ZFS needs to write, read, rewrite again over and over on the same actual hardware sector. I have personally seen improperly set ashift cause a pool of Samsung 840 Pro SSDs to perform slower than a pool of WD Black rust disks!

Even if you’ve done your homework and are absolutely certain that your disks use 512B hardware sectors, I strongly advise considering setting ashift=12 or even ashift=13 – because, remember, it’s immutable per vdev, and vdevs cannot be removed from pools. If you ever need to replace a 512B sector disk in a vdev with a 4K or 8K sector disk, you’ll be screwed if that vdev is ashift=9.

Just set it to 12 and forget about it.

If you set it to 9, proceed at your own peril.
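
If you want to double-check what the drive reports versus what ZFS actually chose, something like this works (device and pool names are examples):

# logical/physical sector sizes the device reports to the kernel
cat /sys/block/nvme0n1/queue/logical_block_size
cat /sys/block/nvme0n1/queue/physical_block_size

# the ashift an existing pool's vdevs actually ended up with
zdb -C tank | grep ashift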

1 Like

Thanks.

Those drives can be formatted for 512B or 4K (the controller can work with either size),

and that one is currently set to 512B.

You can change it to 4K, but I would suggest not reformatting it, for later flexibility if needed.

I would also suggest ashift=12 because, even though that drive is set to 512B, a later replacement might not be, nor might a drive added to the vdev (making a mirror deeper, splitting a mirror, etc.). And, as you already proposed, it is easier for systems to use a 512B drive at 4K than a 4K drive at 512B.

I apologise for mentioning the recordsize, which has more to do with the workload; you mentioned the database’s 16K records and that you were looking to optimise performance, which is why I brought it up.

There have been reports of disappointing performance with some of the U.2 drives on certain controllers, so it would be really helpful for posterity if you could share your results, but this is outside the scope of your request. (Unless it’s an M.2 drive; those seem okay?)

1 Like

“If you end up with an ashift=9 vdev on a device with 8K sectors (thus, properly ashift=13), you’ll suffer from massive write amplification penalties as ZFS needs to write, read, rewrite again over and over on the same actual hardware sector.”

I wanted to ask if this is changing at all with how SSDs buffer writes in DRAM cache? I don’t know much about this, so this is a question and not a statement. But if the SSD sector size is, for example, 16K and ashift is set to 12 for 4K, can the SSD’s DRAM cache buffer hold 4 x 4K writes and then write all 16K to the NAND flash at once? Can buffering remove the write amplification?

That’s not a bad question, even for drives with real/fake SLC cache.

Be aware that at least a bit of this info is outdated. For example, L2ARC is not necessarily ephemeral anymore.