Debian servers + NVMe = slow read/write speeds even on multi-threaded database transactions

I’m hoping someone can give me some pointers on this, I’m totally at a loss.

I’ve got a few servers on which I’m trying to troubleshoot what I perceive to be disk performance issues.

Xeon v4 server with loads of RAM and Samsung F320s in RAID-0 (Debian Bookworm)
Dual Epyc with 1TB RAM and a mix of Samsung PM9A3/Optane (Debian Bookworm)
Single Epyc with 256GB RAM and Optane (Ubuntu 22.04)

Debian kernel version: 6.9.7-1~bpo12+1
Ubuntu kernel version: 6.8.0-40-generic

They all exhibit this behavior. A single-file dd copy with bs=1M will max out at ~1.5-2GB/s even on my PCIe Gen 4 Epyc server with 1TB RAM.

**TO BE CLEAR** - fio benchmarks indicate that all the NVMe drives perform within spec, but in real-world throughput testing they are scarcely better than 12G SAS drives (which makes me question fio as a benchmarking tool - if you can’t achieve the stated spec in a simple file copy, what good is the benchmark other than to make the manufacturers look good?)

To test whether multi-threading is required to achieve max throughput on the dual 64-core Epyc server (1TB RAM), I have MSSQL installed in a Docker container.

The dual 64-core NVMe setup is as follows:
PM9A3 (ext4) on U.2-to-PCIe adapters housing the MDF files
Optane P4800X (ext4), also on a U.2-to-PCIe adapter, housing the LDF files

To test multi-threaded throughput I did a SELECT INTO of all the tables from one DB into the other concurrently, using as many threads as there are tables (I think ~20 tables). Both source and destination databases are set to simple logging. The source database is 110GB in size.

The entire 110GB source database takes about 2-2.5 minutes to copy to the new database (on the same drive) which indicates about ~1GB/s total throughput. This is abysmal. Even running as a single transaction yields similar results.

Breaking the destination database into multiple files yields the same results. This does not appear to be a multi-threading issue.

I’ve done numerous tests on other servers using cp, dd, rsync etc. No real change in throughput.

What makes things interesting is that I took two Samsung PM863 SATA drives and put them in a striped ZFS pool and achieved proper scaling on a per-drive basis for file copies… it behaves as I would expect (i.e. two drive stripe = 1GB+/s throughput on copies).

I’m starting to think that this is a Linux issue generally and that Linux is severely under-optimized for NVMe performance.

Does anyone else have experiences like this? Please tell me I’m missing something - I really don’t want to have to switch out my drives :frowning:

I can’t help I’m afraid, but if you have a disused/abandoned server rack and you’d perform the right rituals & incantations, you may summon @wendell to take a look into your problem :man_mage:

(yes, the meme is still alive! :wink: )

I can slather them with cooking lard with a topping of whipped cream, chocolate syrup and sprinkles before shipment. Would this suffice?

Use fio; you won’t do any reasonable testing with just dd, rsync or cp.

There is zero implicit guarantee that your server workload is actually parallelizing block access well enough to leverage your storage subsystem fully. And you need that parallelism to see the numbers you expect.

Use fio with proper test scenarios (QD1 vs QD32+ minimum) as a starting sanity test, then troubleshoot the higher levels of the software stack.
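To make that concrete, here is a hypothetical fio job file contrasting the two scenarios; `filename` and `size` are placeholders I made up, point them at the filesystem or device under test (and note write tests against a raw device are destructive, so this sketch sticks to reads):

```ini
; Sanity check: same target, QD1 vs deep queue.
[global]
filename=/mnt/nvme/fio.test
size=10g
direct=1
ioengine=libaio
runtime=60
time_based=1

[qd1-4k-randread]
; latency-bound worst case, roughly what a single-threaded tool sees
rw=randread
bs=4k
iodepth=1
stonewall

[qd32-1m-seqread]
; deep queue, large requests -- this is where spec-sheet numbers live
rw=read
bs=1m
iodepth=32
stonewall
```

Run it with `fio <jobfile>` and compare the bandwidth lines of the two jobs; the gap between them is the parallelism your workload has to supply.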

Good mental starting point:

This is only a reference on how to monitor block device queue status in real time, worth trying out:
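As a quick-and-dirty alternative (my own sketch, not from the link above; Linux only), you can sample queue occupancy straight from /proc/diskstats - the 9th stats field after the device name is "I/Os currently in progress":

```python
# Rough sketch: parse /proc/diskstats and report per-device in-flight I/O.
# Field layout per the kernel's iostats documentation: after major, minor
# and the device name, the 9th stat field is the number of I/Os currently
# in progress, i.e. the live queue occupancy.

def parse_diskstats(text):
    """Return {device_name: ios_in_flight} from /proc/diskstats content."""
    inflight = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:  # malformed/short line, skip it
            continue
        inflight[fields[2]] = int(fields[11])  # 9th field after the name
    return inflight

# One-shot sample (requires /proc, so Linux only):
# with open("/proc/diskstats") as f:
#     print({d: n for d, n in parse_diskstats(f.read()).items() if n})
```

Watch that number while a copy runs: if it sits at 1, the copy tool is the bottleneck, not the drive.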

Should splitting a database across multiple database files on the same drive help so far as parallelism is concerned?

What performance do you expect and why?

Did you try this with your nvme drives, too? What’s the result?

After researching this I really think that this may stem from the 4K block limit on ext4 as well as what I perceive to be a hideous lack of multi-threaded optimization in basic CLI tools like cp and dd.

Doing a single dd with bs=1M on my PM9A3 is actually SLOWER than using MSCP to 127.0.0.1.

If someone else wants to try this, I’m curious about the results:
Install MSCP and send a large (like 20-30GB) file to 127.0.0.1, adjusting the -u and -m parameters based on how many cores/threads your box has (I tried -u64 and -m64 in my case)

compare that to using dd if=X of=Y bs=1M
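For context on why the multi-threaded copiers win: they keep many requests in flight instead of one. Here’s a minimal illustrative sketch of the chunked pread/pwrite approach (my own toy in Python, not mscp’s actual code; no O_DIRECT, no error handling):

```python
# Copy a file with N threads, each handling disjoint byte ranges via
# pread/pwrite so multiple I/Os are in flight at once.
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_copy(src, dst, workers=8, chunk=8 * 1024 * 1024):
    size = os.path.getsize(src)
    fd_in = os.open(src, os.O_RDONLY)
    fd_out = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.ftruncate(fd_out, size)  # pre-size so offset writes are valid

        def copy_range(offset):
            end = min(offset + chunk, size)
            while offset < end:
                buf = os.pread(fd_in, end - offset, offset)
                if not buf:  # unexpected EOF
                    break
                os.pwrite(fd_out, buf, offset)
                offset += len(buf)

        with ThreadPoolExecutor(max_workers=workers) as pool:
            # one task per chunk; the pool keeps `workers` requests in flight
            list(pool.map(copy_range, range(0, size, chunk)))
    finally:
        os.close(fd_in)
        os.close(fd_out)
```

Python releases the GIL during the pread/pwrite syscalls, so the I/Os genuinely overlap; cp and dd by contrast issue one synchronous request at a time, which never fills an NVMe queue.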

The trouble with parallel writes is that they require the use of write locks. Databases will only ever write single-threaded to database tables.

The trick to increase parallelism is to use partitioned tables. Or for testing purposes write in parallel to different tables.

Thanks for the reply. My expectation of the drive(s) is speed that comports with Samsung’s posted specs using the out-of-the-box tools (i.e. cp/dd/rsync etc). This should be at least 3-4GB/s write speeds and quite a bit higher on read.

I tried a striped ZFS pool with a few PM9A3’s and they max out at 2GB/s on file copies (I’m new to ZFS so I’m still learning about it). I haven’t tried the MSCP tool in this configuration though.

Yeah I knew locking would happen for writes, but that’s to a single table. I was writing 30+ different tables in separate threads all at once and figured it would perform better than ~1GB/s on a simple-logged database (where the LDF resides on an Optane p4800x).

I’m going to try testing with multiple single-drive ZFS pools with the db files spread over them and see how that impacts the same test (30+ tables dumped into a db at once).

Also, the self-MSCP trick I did previously was tried on ext4 as well as XFS partitions, and neither of them appeared to perform as well as ZFS. If I had to guess, ZFS has probably done actual work on parallelism at the file system level (among other things).

Out of curiosity, have you tried to see if FreeBSD behaves similarly?
You might also want to check the temps of the drives.

These are reasonable expectations based on marketing provided information.
Unfortunately, marketing (not just Samsung but for all nvme storage products) has again found a way to skew reality. Advertised speeds can only be reached at largely unrealistic queue depths and request sizes.

To get a realistic expectation for your product I recommend ATTO benchmark. A single run shows read/write performance by request size, you can adjust queue size as a parameter, which allows assessing performance across increasing queue sizes for a complete picture.
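For anyone on Linux without ATTO, a rough equivalent of that request-size sweep can be expressed as a fio job file; the `filename` and `size` values below are placeholders of mine, not from this thread:

```ini
; Hypothetical ATTO-style request-size sweep, smallest to largest.
[global]
filename=/mnt/nvme/fio.test
size=4g
rw=read
iodepth=32
direct=1
ioengine=libaio
runtime=15
time_based=1

[bs-4k]
bs=4k
stonewall

[bs-64k]
bs=64k
stonewall

[bs-1m]
bs=1m
stonewall

[bs-4m]
bs=4m
stonewall
```

The per-job bandwidth numbers then show how throughput scales with request size, which is the same picture ATTO draws; add a second sweep with `rw=write` for the write side.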

Now, the hardest part is to understand the storage access pattern for your application work load and to find out what leeway (if any) you have to tune it towards the capabilities of your storage products.

Well, current gen nvme-based storage products have not advanced quite as well across PCIe 3/4/5 generations as marketing suggests. Yes, they are better, but by no means doubling performance for each generation.

Understanding and simulating relational database workloads is quite some tricky business. Typical workloads have synchronous writes (to not lose any data), low request sizes (depending on db 8k-32k) and highly variable queue depths driven by workload.

I experiment with different storage setups for postgresql (for purely academic reasons) and find that storage performance increases easiest with number of tablespaces. This means that partitioning tables and distributing db items across a relatively large number of tablespaces yields optimal performance.
The postgresql storage engine works quite differently to sql server’s. So, this information is not transferable as written. But it’s relevant information for this thread nonetheless.
Here is an article that discusses tuning sql server parallelism FYI.

Actually, for asynchronous writes zfs exploits the rather long wait times when accessing HDDs to combine many small record sizes into large ones, which reduces parallel access (which is critical to reach peak performance on HDDs), and improves single threaded performance.
It also allows converting synchronous writes into asynchronous writes with the help of slog devices.
Unfortunately, nvme devices are so much quicker (very low latency) that it doesn’t leave zfs lots of time and process cycles to do its trick efficiently.
I think in your tests you experienced mostly RAM performance in asynchronous write patterns.

Thanks for the reply.

I’m going to check a few other things, though, and I’m going to try ZFS to see how it performs on database workloads.

The other post about drive temperatures may well be valid.

All of my PM9A3’s sit further away from the case fans than my Optanes so I may try to add additional cooling to see if this improves things and check temps.

You’re going to find a couple of tuning guides.

Let me point out the key things to pay attention to:

  • sync writes. Databases rely on synchronous writes to ensure no data is lost. Sync writes are much slower than async writes. That performance loss is acceptable to database applications because of the additional safety provided. ZFS (as explained above) allows speeding up sync writes with the help of SLOG devices. You can only benefit from this if you disable sync writes for the database app. If you don’t, you pay the cost of sync writes twice (causing worse performance on ZFS compared to other file systems)
  • database apps typically prefer a single record size to write data. As also mentioned above that is typically a small (less than optimal) value from a performance perspective. ZFS improves performance for async writes by combining multiple small records into a single large record. So, unless you use SLOGs in ZFS you cannot benefit from this optimization and your performance will suck (not reach the optimal performance of the storage device). This optimization is a compromise for database workloads because OLTP apps depend on quickest response time (latency) for performance, and OLAP apps benefit more from higher bandwidth for performance.
  • a single vdev (e.g. a SLOG) may only experience a small queue depth in low load condition. To improve performance (bandwidth) in low load (low parallelism) conditions you can partition your device (e.g. your Optane drive) and add each partition into a separate vdev. This trick will help running your nvme or optane drives at top speed. In my testing 3-4 partitions work best. There may be zfs tuning parameters that accomplish the same, but I have not found them, yet.

Edited for clarity.
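The sync-write cost in the first bullet is easy to demonstrate. A tiny sketch of mine (buffered writes vs fsync after every chunk, the latter approximating what a database log writer does):

```python
# Write the same data twice: once buffered (may land only in the page
# cache) and once with fsync() per chunk, forcing it to stable storage.
import os
import time

def timed_write(path, chunks, chunk_size, sync_each=False):
    """Write chunks*chunk_size bytes to path; return elapsed seconds."""
    buf = b"\0" * chunk_size
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for _ in range(chunks):
            os.write(fd, buf)
            if sync_each:
                os.fsync(fd)  # block until the data is on stable storage
    finally:
        os.close(fd)
    return time.perf_counter() - start
```

On real hardware the `sync_each=True` run is typically far slower, and that latency gap is exactly what a low-latency SLOG device (or keeping the LDF on Optane) shrinks.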

Can you explain more of your methodology? As said above, fio can have queue depths from 1-32 and access your test storage in blocks of arbitrary size, where it’s worth checking 4K (native to contemporary storage) and 1M (tuned recommendations for spinning rust, ext4 and Linux LVM). rw=randrw will yield the worst results, and rw=read or rw=write will yield optimal linear access patterns.

I’ve got a chokepoint in the layout of PCIe lanes on a system I’m testing now. I’m using LVM with a spread of PVs at 4K, 64K, 256K, 1M, 4M, 16M and 64M allocation units, ext4/XFS told about the stripe geometry and Debian GNU/Linux Trixie. Speeds in fio’s testing ranged from a few hundred MB/s to 8 GB/s across the drives.

We need more info to help figure out what’s constraining this setup.

K3n.

So I’ve been fiddling with this a lot over the last few days.

I think my issue comes down to a dumpy riser - one of those quad U.2-to-PCIe risers. I think there are major issues with them, so I’m going to try something else. I’ll have to use cables of some kind; I don’t particularly want to, but unfortunately there’s no other option.

We’ll see how it goes. I was really hoping these quad u.2 risers wouldn’t suck but it appears that they may well suck very badly…

I don’t remember my fio test settings but the indication was that they were all within spec so far as speed.

IMO, now that I recall, in the past when I’ve run into this sort of thing it’s always been cable/riser/connectivity related and it always takes me forever to figure that out because I assume it’s something else.

Good luck – may you have cables that work reliably all the time!

K3n.

Ugh, cp is really showing its age more and more. I needed to copy a 1TB template, and after letting it run for a bit I noticed only a couple hundred GB had copied. Checked the IO and it was only doing a measly ~1GB/s. Being lazy and not wanting to write something myself (because surely this is a common thing, right?), I did a little searching.

Came across GitHub - pkolano/mutil (multi-threaded cp and md5sum based on GNU coreutils), which is a patch set to add threading to cp. Seemed reasonable, but unfortunately it didn’t compile cleanly and, again, lazy. Hit up the GitHub issues page to see if someone had dumped in a fix or something and found a link to this project, Shift.

Now shift is seriously overkill for what I needed to do, but it supported local transfers and the setup is non-existent, just the way I like it. Compiled the one binary, added the 3 files to my path and kicked it off. The transfer went from 1.5ish GB/s (cp) to ~20GB/s write (shift) on my little consumer NVMe drives.

Hey, I’m not the only one having an issue! Awesome!

There is similar functionality in a tool called MSCP:

I actually use MSCP to copy files locally. What is weird is that when I copy them to a regular ext4 volume, they copy faster than with regular cp but nowhere near the speed of the drive.

However, when I copy to a ZFS formatted drive, even with ARC disabled, it’s getting up near the drive speed spec.

I still believe that there is a significant lack of optimization that I’m dealing with here.

It’s interesting you mention ext4 - @wendell mentioned an issue to me a couple weeks ago. Could be another data point for him.
