ZFS high I/O issue [txg_sync] 100% and [z_trim_int] 60% to 70%

risk · October 31, 2020, 6:58am

Wouldn’t it allow more writes to the “main pool” to happen out of order, thus increasing overall throughput to the drives?

So if your “main pool” is spinning rust you can burst write at optane speeds, and only eventually have to catch up.

(I’m ignoring read latency in this setup and assuming pagecache and/or arc would handle it).

Trooper_ish · October 31, 2020, 5:04pm

Dynamic’s link to J Salter’s blog spells out how it’s not really used. Unfortunately.

Ghan · October 31, 2020, 5:19pm

The SLOG device is never read from unless there is a failure (it is only replayed after a crash or power loss). So the actual data path is from the memory buffer to the pool, regardless of whether you have an SLOG device or not. The SLOG device only helps you with sync writes, so that the I/O can be acknowledged as written to persistent storage when it hits the SLOG device, but it doesn’t act as a traditional write cache.

HaaStyleCat · October 31, 2020, 5:29pm

Did I open a can of worms? The RAM will be here today, and I have a 2TB WD Purple from an NVR project that I’m going to try to use to cut down latency. It is a 512e 4K block drive. Should I still use ashift 12?

Log · October 31, 2020, 5:37pm

Yes, use ashift 12.

At this point in time, all new HDDs should be manually set to 12. SSDs should be fine with 12 or 13, unless you have testing showing something else is better.
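For reference, ashift is just the base-2 logarithm of the sector size ZFS will assume, so 12 means 4096-byte sectors and 13 means 8192. A 512e drive reports 512-byte logical sectors on top of 4096-byte physical ones, which is why it still wants 12. A minimal sketch of that arithmetic (the function name is made up for illustration):

```python
def ashift_for_sector_size(sector_bytes: int) -> int:
    """Return log2 of a power-of-two sector size, i.e. the matching ashift."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_bytes.bit_length() - 1


# 512 B (legacy), 4096 B (4K / 512e physical), 8192 B (some SSDs)
for size in (512, 4096, 8192):
    print(f"{size:>5} B sectors -> ashift={ashift_for_sector_size(size)}")
```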

Log · October 31, 2020, 5:49pm

The problems that SLOG and L2ARC actually solve are very commonly misunderstood. The JRS blog post linked above does a good job explaining it, but you have to read it carefully.

Think of SLOG this way: it’s not a write cache, it’s a safety device that allows normal performance.

HaaStyleCat · October 31, 2020, 6:34pm

Great, thank you. Looking forward to having a good read here soon. Love to know how things work. I did intend it as a safety, due to having no UPS atm. With the DVR drive it’s not an issue, but with my media vault I’d like to keep that intact.

risk · October 31, 2020, 7:43pm

The blog post/article is really only scratching the surface. It doesn’t mention transaction size tuning, ordered writes, what kind of data structures are used to hold ZFS metadata, whether they’re written synchronously or asynchronously in log form, or whether directing these writes to an SLOG device with an increased transaction size can increase the throughput of a spindle drive (i.e., achieve no writes blocking on the spindle).

HaaStyleCat · October 31, 2020, 8:36pm

Yikes, a lot to learn. But, I have the time.

Log · October 31, 2020, 8:53pm

Here’s a quick spoiler: none of that stuff matters for your non-enterprise use case. He’s going to great lengths to overcomplicate his persistent misunderstanding.

Set atime=off, figure out if you need a larger recordsize than the default (128KiB), make sure Samba or NFS are doing what they should be doing, and call it a day.

There are some other dataset settings you can play with if you don’t care about BSD compatibility, but honestly those don’t affect things much performance-wise: NAS + Homelab Virtualization build... Suggestions?

HaaStyleCat · November 1, 2020, 1:52am

Well, drives are in, now I just need to wait a bit for memory to load back up. Wish I had known I only needed 16 gigs or so… lol. Well, more RAM for when I expand. Only have 12 gigs for the media vault in RAID 10, so 32GB probably would have been fine. I know with backups I don’t need the redundancy, but hey. I have learned a lot, and now to update the build log. I still enjoy learning the enterprise side, as that was the point of this great adventure lol. But… I’ll need more hardware for that. Lol

risk · November 1, 2020, 7:15am

It’s because I’ve worked with databases of different kinds before, tuning how often they sync and how they order their buffer evacuation.

Apologies for the confusion - but from my prism, a filesystem is just a small/simple database with a limited query set - in the case of ZFS it aims to be POSIX compliant.

Theoretically, you don’t sync unless you use NFS or software like databases that syncs often, but the reality is that even home software likes to keep track of stuff in databases, and we use the files over the network even at home.

example of sync vs no sync and what-would-zfs-do question if you happen to know

This “you don’t need it for home use” is kind of BS IMO. If you have a photo library or a music player or Plex, or a web browser keeping track of visited URLs to know which ones to color blue or purple, or keeping track of its browser cache, then it’s more likely than not that this software is storing some data in some database (SQLite or LevelDB). And it’s likely that it issues syncs.

Additionally, there are data and metadata races. E.g., in the case of POSIX and a web browser, let’s say we download a favicon, create a small file, write to it, and close the file, then do a random write to another file/journal holding the browser cache metadata, thus changing its size. POSIX doesn’t specify anything about whether both of these writes will work, or both will fail, or if the file can end up zero-sized with metadata pointing at it, or if the file can end up with random contents inside. Or, e.g., what will happen with free space accounting.

Many filesystems would simply implicitly sync after closing a file, and many apps would use sync as a way to indicate all future filesystem operations rely on all previous operations, and many apps have come to rely on this behavior.

So, if ZFS writes some data directly and doesn’t really use its ZIL, then when a sync happens, does it scramble to quickly flush out buffers to spindles? Or does it scramble to quickly write to the ZIL or to a dedicated log device?

Wouldn’t it be better if all the data was already committed to stable storage (i.e., the log) when a sync happens sooner or later, and you gain back your 5ms of spindle time?

Log · November 1, 2020, 8:31am

While I appreciate the effort you are going to, I’m honestly not gonna touch that rabbit hole.

SLOGs do not add a performance benefit to async writes. There was originally forum lore floating around a decade ago suggesting that an SLOG plus forced sync might be really cool, but that eventually went extinct after people who actually run ZFS for a living realized that async writes are buffered in RAM, so waiting for them to also be shoved into an SSD cannot ever provide superior performance. All an SLOG can do is reduce the latency of responding to a sync() call (and this ignores the additional latency of sending this response back over the network). The behavior of ZFS transaction groups does not change whether you have sync writes or not. You need to adjust that stuff manually if you want to fit into some special niche.

One interesting note: I’ve seen mention of an optimization for large sync writes when an SLOG is not involved, where ZFS may just turn the pointer to what was written to the ZIL into a normal write, avoiding having to write the data twice. Again, this brings us back to (at best) normal performance, and this mechanism can’t/wouldn’t be used when an SLOG is involved.

I don’t know what to tell you at this point besides try out some performance testing in a test pool. It’ll become clear real quick.
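As a starting point for that kind of test, here is a rough sketch that times the same small writes twice: once buffered (async) and once with an fsync after every write (sync). It is not a real benchmark, just an illustration of where the sync latency shows up. The function names are made up, and the target directory is an assumption: point it at a dataset on your test pool, then rerun with and without an SLOG attached.

```python
import os
import tempfile
import time


def time_writes(path: str, count: int, block: bytes, sync_each: bool) -> float:
    """Write `count` blocks to `path`, optionally fsync-ing after each one.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    with open(path, "wb", buffering=0) as f:
        for _ in range(count):
            f.write(block)
            if sync_each:
                os.fsync(f.fileno())  # force the write to stable storage
    return time.perf_counter() - start


def compare(directory: str, count: int = 200, block_size: int = 4096) -> dict:
    """Time async vs sync writes of identical data inside `directory`."""
    block = os.urandom(block_size)
    results = {}
    for label, sync_each in (("async", False), ("sync", True)):
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        try:
            results[label] = time_writes(path, count, block, sync_each)
        finally:
            os.remove(path)
    return results


if __name__ == "__main__":
    # Assumption: replace this with a directory on the pool under test.
    target = tempfile.gettempdir()
    for label, seconds in compare(target).items():
        print(f"{label:>5}: {seconds:.4f} s")
```

If the claims above hold, the async number should look the same with or without an SLOG, and only the sync number should move.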

HaaStyleCat · November 2, 2020, 4:38pm

OK, so the new WD drive is peaking at roughly 0.3 to 1.3 I/O wait. I believe we have found the culprit behind my I/O delay!!! The Seagate FireCuda hybrid drive must not play nice with ZFS in some way. I wonder if there’s a way to code it to ignore its onboard cache, or have ZFS utilize it as a separate drive for cache? Devember https://www.youtube.com/watch?v=Daf5C52s124&feature=emb_title got me thinking… LOL, not that I COULD DO IT… lol, I wouldn’t know where to start. I will still add my cache like I had before with the ZFS DVR drive, just to make sure it’s the drive and not something I did when setting the drive up.

HaaStyleCat · November 3, 2020, 3:25pm

What are some of the best tools to use to test a pool for speed and latency? Should I run them from the CLI or can I run them on another machine? Just wondering so I can start some benchmarks as I mess around with things.

Log · November 3, 2020, 4:28pm

I have never found a comprehensive performance testing tutorial that I can really recommend. It’s basically all a dark art you’ll have to research yourself, looking through Google and the ZFS subreddit for the arcane incantations that others have decided to use.

I have a post here going over some of the complications that are involved with performance testing with FIO and similar tools: Poor ZFS Write Performance

I’ve not done anything in regards to latency, as my use case (bulk data storage and transfers) is not dependent on that.

iostat is another common tool: https://jrs-s.net/2019/06/04/continuously-updated-iostat/

HaaStyleCat · November 3, 2020, 5:22pm

Nice, great start, thanks. watch -n 1 iostat -xy --human 1 1 is handy, if it’s doing all it should. I don’t see any stats for the CPU. I’m also not loading it up either, but the Proxmox display is showing at least some CPU activity. I know it’s a bit difficult, but the start is much appreciated.
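For the CPU side, the I/O wait share can also be worked out by hand from the aggregate "cpu" line of /proc/stat on Linux. A minimal sketch, assuming the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names are made up for illustration:

```python
import os
import time


def cpu_times(stat_line: str):
    """Return (iowait, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # user nice system idle iowait irq softirq steal
    values = [int(v) for v in fields[1:9]]
    return values[4], sum(values)


def iowait_percent(before: str, after: str) -> float:
    """Percentage of CPU time spent in iowait between two samples."""
    io0, total0 = cpu_times(before)
    io1, total1 = cpu_times(after)
    elapsed = total1 - total0
    return 100.0 * (io1 - io0) / elapsed if elapsed else 0.0


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat, which is the all-CPU summary."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    first = read_cpu_line()
    time.sleep(1)
    second = read_cpu_line()
    print(f"iowait over the last second: {iowait_percent(first, second):.1f}%")
```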

HaaStyleCat · November 8, 2020, 5:44pm

SO, the issue eventually came back… it was once the ARC was full. I made it full because I had a 480GB L2ARC attached to my RAID 10 12TB ZFS drive. After doing some reading, it appears that thinking of L2ARC as a cache is completely false… in a way. It is an extension of ARC, but it is also kept track of by ARC (every L2ARC block needs a header held in ARC), so having too big an L2ARC fills up the ARC, makes it less effective, and eats up processing time (most likely why my [txg_sync] was so high).

I removed that cache from the drive and I now have no caches, no SLOG, and the system I/O wait has dropped to under 1 for the last day, even with the DVR recording to a spinner, while the OS is installed on an SSD separate from the Proxmox install SSD.

Do these conclusions sound correct? If I wanted a usefully sized L2ARC for a 12TB media library and a 2TB DVR drive, with 20GB of RAM dedicated to the ZFS ARC, what sizes should I use? I also have that small 32GB Intel Optane Memory; can I still use it as an SLOG for data safety until I get a UPS? Or should I just leave it alone? I appreciate any input.
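A back-of-the-envelope sketch of how much ARC the L2ARC bookkeeping could take. The per-block header size is an assumption: about 70 bytes is a commonly quoted figure for recent OpenZFS, older releases were reported to be higher, so check your own version. The point is that the cost scales with the number of cached blocks, so small blocks hurt far more than 128KiB records.

```python
def l2arc_header_bytes(l2arc_bytes: int, block_bytes: int, header_bytes: int = 70) -> int:
    """Estimate ARC memory used to index a completely full L2ARC.

    header_bytes is an ASSUMED per-block header size, not a measured value.
    """
    blocks = -(-l2arc_bytes // block_bytes)  # ceiling division
    return blocks * header_bytes


L2ARC_SIZE = 480 * 10**9  # the 480GB device from the post

for label, block in (("128 KiB records", 128 * 1024), ("8 KiB blocks", 8 * 1024)):
    overhead_mib = l2arc_header_bytes(L2ARC_SIZE, block) / 1024**2
    print(f"{label:>16}: ~{overhead_mib:,.0f} MiB of ARC used for headers")
```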

Log · November 8, 2020, 6:55pm

L2ARC is useful when you have a workload with random reads that doesn’t fit in your RAM, but will fit on an SSD.

The most ideal use case is seeding torrents, oddly enough. Database workloads can also benefit.

L2ARC resists caching sequential workloads on purpose, so streaming media is generally not helped at all by it.

L2ARC also has a slow fill/replacement rate, 8MiB/s I believe. Setting it faster will quickly chew through disk life. The math is not kind to consumer drives.
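To put rough numbers on that, here is a small sketch that takes the 8MiB/s figure above at face value (it is quoted from memory, so treat it as an assumption) and works out how long a 480GB device would take to fill and how much gets written per day if the feed ran flat out. Real fill rates will be lower whenever there is nothing eligible to cache, so these are upper bounds.

```python
MIB = 1024 ** 2
SECONDS_PER_DAY = 86400


def l2arc_fill_hours(device_bytes: int, feed_mib_per_s: float = 8.0) -> float:
    """Hours to fill the device at a constant feed rate (best case)."""
    return device_bytes / (feed_mib_per_s * MIB) / 3600


def daily_write_gib(feed_mib_per_s: float = 8.0) -> float:
    """GiB written to the cache device per day at a constant feed rate."""
    return feed_mib_per_s * SECONDS_PER_DAY / 1024


device = 480 * 10**9  # the 480GB L2ARC discussed earlier in the thread
for rate in (8.0, 64.0):  # assumed default vs a hypothetical raised setting
    print(
        f"{rate:>5.0f} MiB/s: fill in ~{l2arc_fill_hours(device, rate):.1f} h, "
        f"up to ~{daily_write_gib(rate):.0f} GiB written per day"
    )
```

Comparing the per-day figure against the endurance rating of a consumer SSD shows why raising the feed rate gets expensive quickly.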

Generally it’s better to take the disk used for L2ARC, and straight up just put your “I want these things fast” files on it.

I would personally leave it alone at this point. Adding an SLOG shouldn’t hurt though. It may not help your workload, but it should never set you back.

HaaStyleCat · November 8, 2020, 7:34pm

Perfect. Thanks so much for the help. Now time to start messing with mail notifications to get them to work for Proxmox and ZFS. Then also set up a cron job file for my scrubs and anything else I need to automate, just to keep the library in good shape. Then any security updates I should do. I need to look into removing root logs and making sure I’m logged out, or set up working sudo accounts on a few VMs that don’t do it automatically, or install something to log me out automatically lol. A lot to do.