NVMe AMD-Vi IO_PAGE_FAULT
Are you getting NVMe AMD-Vi IO_PAGE_FAULT errors when you enable IOMMU, or with your Phison-based SSD? It has to do with the IOMMU and the Linux kernel, but there is a workaround.
Checking the error
As you probably already know, a NAND flash drive must be told by the operating system which flash cells can be prepared for fresh data to be written to them. While the read and write cycles on NAND flash can be quite fast, the erase cycle is glacial in comparison. When deleted information is really removed from NAND flash devices, such as Phison-based NVMe drives (Aorus Extreme PCIe 4.0 1/2 TB SSD, Corsair MP600, Sabrent Rocket 4.0, etc.), the OS must issue a TRIM or DISCARD command to the device with the specific block(s) (or a range of blocks) to be marked for erasure when the drive is idle.
If the device fails to trim, then when the OS goes to rewrite a block that is already written, the firmware must pause the write, issue an erase, and only then write to the target block. That takes forever in comparison to just writing to the drive.
To test this, issue a command such as
fstrim -v /
or fstrim /home
if your root or home partition is on NVMe storage. Check kernel messages with
dmesg
and look for errors. If there are none, you’re good and your IOMMU setup is fine.
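As a rough sketch of that check, the helper below counts kernel log lines that mention IO_PAGE_FAULT. The function name count_iommu_faults is made up for this example, and matching on the bare string IO_PAGE_FAULT is an assumption based on the error name in the title; the exact log wording may differ between kernel versions.

```shell
# Hypothetical helper: count log lines mentioning IO_PAGE_FAULT.
# Reads kernel log text on stdin and prints the number of matching lines.
count_iommu_faults() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function from reporting a failure.
    grep -c 'IO_PAGE_FAULT' || true
}

# Typical use (needs root, so it is left commented out here):
#   sudo fstrim -v /
#   sudo dmesg | count_iommu_faults
```

A result of 0 right after a trim suggests the setup is fine; anything else means the faults are being logged.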
If you see errors, there are some steps you can take.
IOMMU=PT, the low hanging fruit
I would recommend setting iommu=pt avic=1 on your kernel boot line for Ryzen and Threadripper-based systems.
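After rebooting, it can be worth confirming that the parameter actually reached the kernel. The sketch below checks a kernel command line for an exact parameter; has_boot_param is a name invented for this example, and it falls back to reading /proc/cmdline when no text is passed in. How you add the parameter in the first place depends on your boot loader and distribution, so that part is not shown.

```shell
# Hypothetical helper: succeed if a parameter appears as a whole word
# in a kernel command line.
#   $1 = parameter to look for, e.g. iommu=pt
#   $2 = command line text (optional; defaults to the running kernel's)
has_boot_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    # Rely on word splitting: kernel parameters are space separated.
    for word in $cmdline; do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

# Typical use on the running system:
#   has_boot_param iommu=pt && echo "iommu=pt is active"
```

Matching whole words avoids a false positive from a longer parameter that merely contains the same text.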
For this video the following systems were tested with the latest AGESA (ABBA in the case of Ryzen 3000 series desktop CPUs):
- Epyc 7402P
- Epyc 7742
- Threadripper 2950X
- Threadripper 2990WX
- Threadripper 1950X
- Ryzen 3900X
- Ryzen 3600X
- Ryzen 3800X
** You should also change IOMMU from Auto to Enable in the BIOS **
As stated in the video, Auto is often subtly different from Enable because of Windows-land requirements. Be sure IOMMU is set to Enable and not Auto, even if it appears to be “on” when set to Auto.
If iommu=pt only partially fixes the issue, or only decreases its frequency, there may be another factor in play. Some devices will only TRIM/DISCARD an even number of flash cells at once. If the OS issues the command while the underlying file system and the NAND block device have a slight misalignment of cells to filesystem blocks, the drive may reject the operation.
This could actually be a problem in the kernel, and the -v option to fstrim will tell you how much was actually trimmed (it reports the figure in bytes), which may be useful for further troubleshooting.
diff -Nau1r a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
--- a/drivers/nvme/host/core.c 2019-09-14 11:27:34.986373747 +0200
+++ b/drivers/nvme/host/core.c 2019-09-13 16:14:16.937812531 +0200
@@ -564,3 +564,3 @@
- range = kmalloc_array(segments, sizeof(*range),
+ range = kmalloc_array(256, sizeof(*range),
GFP_ATOMIC | __GFP_NOWARN);
This patch to the kernel might solve the alignment issue.
A more complete patch is coming, and will be upstreamed.
Relaxed Ordering
Relaxed command ordering may also be at fault. Run
lspci -vvv
and look for RlxdOrd+ or RlxdOrd- in the output. If you see RlxdOrd+, use setpci to disable it.
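The lspci -vvv output is long, so here is a small sketch that pulls out only the devices reporting RlxdOrd+. The function name list_relaxed_ordering is invented for this example, and it assumes the usual lspci layout, where each device starts on an unindented line beginning with its slot address and the detail lines below it are indented. It only finds the devices; it does not change any setting.

```shell
# Hypothetical helper: print the slot of every PCI device whose
# lspci -vvv block contains RlxdOrd+. Reads lspci -vvv output on stdin.
list_relaxed_ordering() {
    awk '
        # An unindented, non-empty line starts a new device block;
        # its first field is the slot address (for example 01:00.0).
        /^[^ \t]/ { slot = $1 }
        # Indented detail line with relaxed ordering enabled.
        index($0, "RlxdOrd+") { print slot }
    '
}

# Typical use (root is usually needed to see the capability details):
#   sudo lspci -vvv | list_relaxed_ordering
```

The slots it prints are the ones to pass to setpci.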
TODO: Actual command goes here.