Devops Workstation: Fixing NVMe Trim on Linux

wendell · October 4, 2019, 7:19pm

NVMe AMD-Vi IO_PAGE_FAULT

Are you getting NVMe AMD-Vi IO_PAGE_FAULT errors when you enable IOMMU, or with your Phison-based SSD? It has something to do with IOMMU and the Linux kernel, but there is a workaround.

Checking the error

As you probably already know, a NAND flash drive must be told by the operating system which flash cells can be prepared for fresh data to be written to them. While the read and write cycles on NAND flash can be quite fast, the ERASE cycle is glacial in comparison. When deleted information is really removed from NAND flash devices, such as Phison-based NVMe drives (Aorus Extreme PCIe 4.0 1TB/2TB SSD, Corsair MP600, Sabrent Rocket 4.0, etc.), the OS must issue a TRIM or DISCARD command to the device with the specific block(s) (or a range of blocks) to be marked for erasure when the drive is idle.

If the device fails to trim, then when the OS goes to rewrite a block that is already written, the firmware must pause the write, issue an erase, and then actually write to the target block. That takes forever compared to just writing to the drive.

Issue a command such as

fstrim -v / or fstrim /home

if your root or home partition is on NVMe storage. Check kernel messages with

dmesg

and look for errors. If there are none, you’re good and your IOMMU setup is fine.
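A minimal sketch of that check, assuming a POSIX-style shell. The function name count_iommu_faults is made up for illustration; it just counts kernel log lines that mention IO_PAGE_FAULT, reading the log on standard input so it works with whatever tool you use to dump kernel messages.

```shell
# Hypothetical helper: count kernel log lines that mention IO_PAGE_FAULT.
# Reads log text on stdin, so it works with dmesg or journalctl -k output.
count_iommu_faults() {
    # grep -c prints the match count; "|| true" keeps a zero count from
    # being treated as a failure.
    grep -c 'IO_PAGE_FAULT' || true
}

# Typical use (fstrim needs root, and dmesg may need it too):
#   sudo fstrim -v /
#   sudo dmesg | count_iommu_faults
```

A result of 0 right after a trim suggests the IOMMU setup is fine; anything higher means it is worth reading the matching dmesg lines in full.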

If you see errors, there are some steps you can take.

IOMMU=PT, the low hanging fruit

I would recommend setting iommu=pt avic=1 on your kernel boot line for Ryzen and Threadripper-based systems.
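After rebooting, it is worth confirming that the parameter actually reached the kernel. The sketch below is one way to do that; has_kernel_param is a hypothetical helper name, and it simply looks for a whole-word match in /proc/cmdline. On GRUB-based distributions the parameters usually go into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub before regenerating the GRUB config, but that is an assumption about your setup, so check your distro's documentation.

```shell
# Hypothetical helper: succeed if a kernel command line contains a parameter.
# $1 = parameter (e.g. iommu=pt)
# $2 = command line text (optional; defaults to the running kernel's /proc/cmdline)
has_kernel_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    # Pad with spaces so only whole parameters match, not substrings.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use:
#   has_kernel_param iommu=pt && echo "iommu=pt is active"
```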

For this video, the following systems were tested with the latest AGESA (ABBA in the case of Ryzen 3000 series desktop CPUs):

Epyc 7402P
Epyc 7742
Threadripper 2950X
Threadripper 2990WX
Threadripper 1950X
Ryzen 3900X
Ryzen 3600X
Ryzen 3800X

**You should also change IOMMU from Auto to Enable in the BIOS.**

As stated in the video, Auto is often subtly different from Enable, due to windowsland requirements. Be sure IOMMU is set to Enable and not Auto, even if it appears to be “on” when set to Auto.

If iommu=pt only partially fixes the issue, or only decreases its frequency, there may be another factor in play. Some devices will only TRIM/DISCARD an even number of flash cells at once. If the OS issues a command for a misaligned range, because the underlying file system and the NAND block device have a slight misalignment of cells to filesystem blocks, then the drive may reject the operation.

This could actually be a problem in the kernel. The -v option to fstrim will tell you how many bytes were actually trimmed, which may be useful for further troubleshooting.
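To see what discard unit the drive advertises to the kernel, the block queue attributes in sysfs can be read directly. This is only a sketch: show_discard_limits is a hypothetical helper, and nvme0n1 is an example device name that may differ on your system. A value of 0 in either file means the kernel does not consider the device discard-capable.

```shell
# Hypothetical helper: print the discard limits the kernel reports for a block device.
# $1 = device name (e.g. nvme0n1)
# $2 = sysfs root (optional; defaults to /sys, overridable for testing)
show_discard_limits() {
    dev=$1
    queue="${2:-/sys}/block/$dev/queue"
    printf 'granularity=%s max_bytes=%s\n' \
        "$(cat "$queue/discard_granularity")" \
        "$(cat "$queue/discard_max_bytes")"
}

# Typical use:
#   show_discard_limits nvme0n1
#   lsblk -D /dev/nvme0n1    # the same limits in table form
```

Comparing the granularity with the filesystem block size and partition offsets can hint at whether misalignment is plausible on a given drive.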

diff -Nau1r a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
--- a/drivers/nvme/host/core.c 2019-09-14 11:27:34.986373747 +0200
+++ b/drivers/nvme/host/core.c 2019-09-13 16:14:16.937812531 +0200
@@ -564,3 +564,3 @@
 
-       range = kmalloc_array(segments, sizeof(*range),
+       range = kmalloc_array(256, sizeof(*range),
                                GFP_ATOMIC | __GFP_NOWARN);

This patch to the kernel might solve the alignment issue.

A more complete patch is coming, and will be upstreamed.

Relaxed Ordering

Relaxed command ordering may also be at fault. Run lspci -vvv and look for RlxdOrd+ or RlxdOrd- in the output. If you see RlxdOrd+, use setpci to disable it.
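With many devices the lspci -vvv output gets long, so a small filter helps with the lookup step. The sketch below only reports which devices show RlxdOrd+; it does not change anything. list_relaxed_ordering is a hypothetical helper name, and it assumes the usual lspci layout where each device header starts in the first column and the capability details are indented beneath it.

```shell
# Hypothetical helper: list PCI devices whose lspci -vvv output shows RlxdOrd+.
# Reads lspci -vvv text on stdin and prints each matching device header once.
list_relaxed_ordering() {
    awk '
        # Device header lines start in column 0 with a hex bus/domain number.
        /^[0-9a-f]/ { dev = $0 }
        # Indented detail lines carry the flag; report the owning device once.
        /RlxdOrd[+]/ && dev != seen { print dev; seen = dev }
    '
}

# Typical use (root is needed to see the capability details):
#   sudo lspci -vvv | list_relaxed_ordering
```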

TODO: Actual command goes here.

nx2l · October 4, 2019, 7:34pm

So if I am using an Optane SSD… trim would work… or does it use a Phison controller too?

wendell · October 4, 2019, 7:48pm

optane is micron/intel and shoullddddd be all good but if you fstrim and see errors then… try the steps? actually post if you see errors because… that’d be interesting.

Aenra · October 4, 2019, 8:24pm

Thank you for this Wendell…

Still new to all this, often enough (ahem, make that almost always), i read “instructions” that i then also have to ‘decode’ internally back to English. But i’ll get there.

Given your avatar, may i humbly suggest that future contwibushions be made in Deutch, ja?
Zis vay at least ve’ll know vat it’s for.

Nikolay_Mihaylov · November 28, 2019, 10:35am

In fact, the 3D XPoint technology doesn’t need to erase blocks before writing something. It can write directly, regardless of what’s already there. As a result there’s no write amplification and no TRIM support.

In all fairness, the 16GB and 32GB Optane drives do advertise TRIM support (if lsblk -D is to be trusted) but they probably use it to write zeroes, because that’s the other side effect of the TRIM command.