RND4K Q32T1/Q1T1 slow reads and writes using Optane!

Hello Level1Techs community,

I am seeing RND4K Q32T1/Q1T1 read and write speeds on my Intel Optane drives that are just as slow as on my Samsung NAND drives. Here are the setup details:

Hardware:

- Motherboard: Supermicro H11SSL-i
- CPU: AMD Epyc 7502
- NAND Drives (RAID10): 4x Samsung 970 EVO PLUS 500GB
- Optane Drives (RAID10): 6x Intel Optane p1600x 118GB

Software:

- OS: Proxmox 8.2
- Filesystem: ZFS

Despite the superior performance characteristics of the Optane drives, their RND4K Q32T1/Q1T1 read and write speeds are nearly identical to those of the NAND drives, which are slow as well. Given the specifications of these drives and Wendell’s recommendations, I expected significantly higher performance for random I/O operations.

I would greatly appreciate any insights or recommendations on how to diagnose and resolve these performance issues. :wink:

Best regards,
PapaGigas

Creating any RAID of Optane will destroy random read performance. I experienced a similar issue while using Intel VROC to stripe four p1600x drives.

They’re “identical” because your RAID solution is the slowest link in the chain, bringing all drives down to the lowest common denominator.

If you want the larger capacity, use a single 960GB 905P drive.

3 Likes

I don’t need the larger capacity; I need redundancy and the benefits of a 3-mirror RAID10 layout. However, I’m only getting around 36 MB/s on reads and writes at Q1T1. Could this be an issue with ZFS? :confused:

1 Like

@Wendell, could you or anyone else shed some light on this issue, please?

1 Like

Here are the results obtained from running CDM on a fresh Windows 11 VM:

1 Like

The CPU cores are boosting to 3.35GHz as expected in Proxmox VE, but I’m unsure if the bottleneck is due to the CPU or the RAM speed. I found a post by @Wendell discussing NVMe RAID performance on EPYC, but it doesn’t detail the performance differences between RAM speeds, even though it mentions there is a significant difference:

https://forum.level1techs.com/t/fixing-slow-nvme-raid-performance-on-epyc/151909#whats-the-difference-between-2666-and-2933-memory-2

Can anyone provide insight into whether CPU or RAM speed could be the bottleneck here, or suggest other potential issues?

1 Like

I’ve noticed a bug in the H11SSL-i BIOS (running the latest version, 2.9) where the RAM speed has to be left on AUTO because MT/s and MHz are mixed up; if you set it manually, you end up with the wrong MT/s.

Additionally, MemTest86+ v7 shows very low speeds for the main memory.

[MemTest86+ screenshot]
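The speed the DIMMs are actually training to can also be read from the Proxmox host itself; dmidecode reports both the rated and the currently configured speed (dmidecode may need to be installed first):

dmidecode -t memory | grep -E "Speed"   # rated vs. configured speed for each DIMM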

1 Like

However, when running Sysbench in Proxmox VE, the speeds aren’t as bad, but still not up to what I expect.

So my questions are:

  1. Are these normal speeds for this CPU?

  2. If the BIOS is the problem, how can I solve this issue?

PS - I had to divide this into two different posts due to the limit of two screenshots per post. Sorry for the inconvenience! :innocent:

1 Like

Seeing as we know almost nothing about the VM configuration I will just give some general pointers.

  • Any RAID 0 (and RAID 10 is a stripe of RAID 1 mirrors) is going to destroy Q1T1 performance on NAND or Optane. Adding a RAID layer between the system and the drives significantly increases latency and overhead. That is why, in the forum thread you linked about NVMe RAID, Wendell came to the conclusion that “even if everything is working right, ZFS is “slow” compared to NVMe.”

  • Since you mentioned ZFS, it seems unlikely you are using PCIe passthrough to get your storage to your VM. QCOW2 and other VM disk image formats add a significant amount of latency and overhead to guest disk access. This is a compounding effect, on top of the slow ZFS RAID, further reducing performance.

  • The default disk cache mode for QCOW2 images is “none,” unless you have already changed this default in your hypervisor. This means guest disk IO is bypassing the host page cache; that is to say, you are getting none of the benefits of caching to DRAM, in terms of latency or bandwidth. Since you seem to value redundancy, you may not want to use caching, but using “writeback” may increase performance (see the sketch after this list).

  • I don’t know if Proxmox automatically sets up VM thread pinning, but this will also contribute to better guest IO performance. Your 32 core Epyc processor likely contains four CCDs; that is to say, four clusters of 8 CPUs. These clusters each have two “local” memory controllers, even though DRAM access filters through the IO die. In your BIOS, enable NPS=4 to expose these NUMA domains to your OS. You can read more about Epyc NUMA topology on ServeTheHome.

  • To my knowledge, Proxmox does not automatically set up hugepage backing for VMs. IO operations and PCIe passthrough-type peripheral communications benefit greatly from allocating the VM a contiguous memory area by using hugepages. These hugepages should be in the same NUMA domain as the cores you pin for best performance, because there is overhead associated with “hopping” across NUMA boundaries to access non-local memory. For example, if you allocate eight cores on CCD2 (essentially the whole CCD), make sure to create your hugepage backing on NUMA node 2 (or whatever it may be), and remind KVM to only allocate hugepages to this VM from NUMA node 2.
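A rough Proxmox-side sketch of those settings (the VM ID 100 and the volume name are placeholders; check the exact qm syntax against the Proxmox documentation for your version, and note that the affinity option needs a reasonably recent Proxmox VE):

qm set 100 --scsi0 local-zfs:vm-100-disk-0,cache=writeback   # switch the existing disk to writeback caching
qm set 100 --numa 1                                          # expose NUMA topology to the guest
qm set 100 --affinity 16-23                                  # keep the VM's vCPU threads on host cores 16-23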

There is a lot of setup involved with extracting performance from KVM, which is generally what Proxmox uses. Above, I have identified half a dozen factors which may be contributing to your low IO performance.

Setting any of this up in Proxmox will require some Googling on your part. But since it seems to use KVM, I’ll explain some of the setup.

Here’s an example of properly set up thread pinning and hugepage backing in a VM. The KVM XML file definition is shown below.
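A trimmed-down sketch of the relevant sections is below; the vcpupin cpuset values and the node number are illustrative and have to match your own topology (taken from virsh capabilities, as described further down).

<memoryBacking>
  <hugepages/>                          <!-- back guest RAM with the pre-allocated hugepages -->
</memoryBacking>
<numatune>
  <memory mode="strict" nodeset="1"/>   <!-- only allocate guest memory from NUMA node 1 -->
</numatune>
<cputune>
  <vcpupin vcpu="0" cpuset="16"/>       <!-- pin each vCPU to a host CPU on node 1 -->
  <vcpupin vcpu="1" cpuset="48"/>       <!-- 48 is core 16's SMT sibling -->
  <vcpupin vcpu="2" cpuset="17"/>
  <vcpupin vcpu="3" cpuset="49"/>
</cputune>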

The <memoryBacking> element tells KVM to use the <hugepages/> I have allocated. The <numatune> element tells KVM to allocate that memory on a specific NUMA node: <memory mode="strict" nodeset="1"/>.

The numbering in the cpuset attributes may look nonsensical, but this is how virsh identifies CPUs on my two-socket, four-NUMA-node system.

Running sudo virsh capabilities from a terminal will list the cores and their thread siblings for each NUMA node.
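The part to look for is the <topology> section of that output; it has roughly this shape (the IDs here are illustrative, not my actual values):

<topology>
  <cells num='4'>
    <cell id='1'>
      <cpus num='16'>
        <cpu id='16' socket_id='0' core_id='16' siblings='16,48'/>
        <cpu id='17' socket_id='0' core_id='17' siblings='17,49'/>
        ...
      </cpus>
    </cell>
  </cells>
</topology>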

It is from that output that I took the core IDs (and their siblings) for NUMA node 1 on my system.

I allocate the hugepages to NUMA node 1 on each boot with the following command.

echo 16384 | sudo tee /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages && sudo mkdir -p /dev/hugepages2M && sudo mount -t hugetlbfs -o pagesize=2M none /dev/hugepages2M && sudo systemctl restart libvirtd

This allocates 16384 two megabyte hugepages, to give 32GB of memory to my VM. You may also choose to use 1GB hugepages. The command then mounts them with hugetlbfs so they can be identified by KVM. Finally, the command restarts libvirtd so that the newly allocated pages are identified.

You must first enable hugepage support on the kernel command line (in your GRUB configuration, for instance) before they can be allocated. Google how to do this with Proxmox.
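On a GRUB-booted host that typically means adding the hugepage parameters to the kernel command line and regenerating the boot config; one possible form (hugepagesz selects the page size, and the hugepages count can be pre-reserved here or left out and allocated per node at runtime as in the command above):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet default_hugepagesz=2M hugepagesz=2M hugepages=16384"

Then run update-grub and reboot. On hosts that boot via systemd-boot, the same parameters go into /etc/kernel/cmdline and are applied with proxmox-boot-tool refresh.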

Which node you choose to allocate hugepages to depends on where you want to allocate cores, and which NUMA node other hardware for your VM is local to. You can install the lstopo or hwloc package on most Linux distributions to get more information about this. Run lstopo to get the output shown below.

In my case, the GPU passed to my VM is on NUMA node 1, which is why I allocate cores and hugepages there.

Hugepages and thread pinning will increase the IO performance of any KVM guest. However, you are still running into the fundamental slowness of ZFS and virtual disk media (QCOW2), so your Q1T1 gains may be minimal.

2 Likes

You are benchmarking a single thread. These speeds are likely normal.

1 Like

I only have one NUMA node; here’s my topology:

PS - Some components, such as both GPUs or the 16 drives used in the TrueNAS VM, don’t show up due to passthrough. :wink:

1 Like

Hence what I said about enabling NPS in your BIOS.

1 Like

Regarding the disk image format, I use only ‘raw’ with ‘no cache’ and ‘VirtIO SCSI single’ for the controller.

1 Like

I use NPS1. Are you saying it’s better to use NPS4 with the 7502?

1 Like

I don’t know how to make that much clearer.

RAW has even more overhead than QCOW2, so that is another place you can look. Right now my intention is to remedy all the other performance issues with your VM and system configuration, so you can get a clear picture of your ZFS array’s performance in a best case scenario.

1 Like

I upgraded from a 7501 to a 7502 specifically because of NUMA. I changed it from NPS1 to NPS4 as you suggested but haven’t seen any performance gains. Should I keep this configuration?

Re-read (read?) my post. I have answered your question. Enabling NPS is the first step in facilitating huge page allocation, pinning cores, etc. It is not meant to change your performance.

Yes, I’ve read your post. I was asking because of the PCIe devices, like… won’t hopping between NUMA nodes generate more latency overall? :confused:

By the way, I can only use the RAW disk image format with Proxmox VE when using local-zfs, but I experience the same performance with PCIe passthrough.

Whether or not you have NPS enabled, the cores of your Epyc processor do not have uniform memory access. The difference between NPS=1 and NPS=4 is just whether or not this is exposed to the OS. The OS can make more intelligent decisions with it enabled.

There are four groups of eight cores in your processor whether NPS=1 or NPS=4. In both cases, all PCIe traffic is funneled through the central IO die.

The point of enabling NPS and allocating cores and memory on the correct NUMA node is to keep data moving between cores, memory, and IO devices that are local to one another.
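Once NPS=4 is set, you can confirm that the host actually sees four nodes with either of these (numactl may need to be installed first):

lscpu | grep -i numa    # shows "NUMA node(s): 4" and the CPU ranges per node
numactl --hardware      # lists each node's CPUs, memory size and the node distance table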

If you have enabled hugepages, configured hugepage backing and thread pinning for your VM (unclear if you have tried either of these; both improve performance of all virtual machines in all circumstances), and cannot use QCOW2, then you can safely assume your ZFS array is the issue.

You can try benchmarking your array on the Linux host with KDiskMark.
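Since the Proxmox host has no desktop, you can also run the fio workload that tools like KDiskMark wrap directly. A rough equivalent of the RND4K Q1T1 read test (the test file path and size are placeholders; how much --direct=1 actually bypasses depends on your OpenZFS version):

fio --name=rnd4k-q1t1 --filename=/tank/fio.test --size=4G \
    --rw=randread --bs=4k --iodepth=1 --numjobs=1 \
    --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting

Swap --rw=randread for --rw=randwrite to get the write side.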

If the performance is still terrible after applying the optimizations I’ve suggested, both on the host and for guests, and even after applying all of Wendell’s optimizations mentioned in the thread you linked, then it is clear that ZFS is not the right tool for the goal you are trying to accomplish (i.e. maximum Q1T1 throughput).