ZFS + Minecraft = Fragmentation?

Howdy folks,

I recently moved the Minecraft server (a rather large world that’s been running continuously for 10 years now) from its spinning-rust ZFS mirror to an NVMe-backed ZFS mirror.

Quite alarmingly, the FRAG statistic has rapidly taken on a life of its own.

This is how it started two weeks ago…

NAME                                              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pool                                              880G   102G   778G        -         -     0%    11%  1.00x    ONLINE  -     
  mirror-0                                        880G   102G   778G        -         -     0%  11.6%      -    ONLINE       
    nvme-WD_Red_SN700_1000GB_XXXXXXXXXXXX-part1   882G      -      -        -         -      -      -      -    ONLINE      
    nvme-WD_Red_SN700_1000GB_XXXXXXXXXXXX-part1   882G      -      -        -         -      -      -      -    ONLINE

After finishing copying over the data and running the server off the new NVMe mirror for a week…

NAME                                              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pool                                              880G   197G   683G        -         -     9%    22%  1.00x    ONLINE  -
  mirror-0                                        880G   197G   683G        -         -     9%  22.4%      -    ONLINE
    nvme-WD_Red_SN700_1000GB_XXXXXXXXXXXX-part1   882G      -      -        -         -      -      -      -    ONLINE
    nvme-WD_Red_SN700_1000GB_XXXXXXXXXXXX-part1   882G      -      -        -         -      -      -      -    ONLINE

then a few days later…

NAME                                              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pool                                              880G   200G   680G        -         -    13%    22%  1.00x    ONLINE  -
  mirror-0                                        880G   200G   680G        -         -    13%  22.7%      -    ONLINE
    nvme-WD_Red_SN700_1000GB_XXXXXXXXXXXX-part1   882G      -      -        -         -      -      -      -    ONLINE
    nvme-WD_Red_SN700_1000GB_XXXXXXXXXXXX-part1   882G      -      -        -         -      -      -      -    ONLINE

a few more days…

NAME                                              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pool                                              880G   200G   680G        -         -    14%    22%  1.00x    ONLINE  -
  mirror-0                                        880G   200G   680G        -         -    14%  22.7%      -    ONLINE
    nvme-WD_Red_SN700_1000GB_XXXXXXXXXXXX-part1   882G      -      -        -         -      -      -      -    ONLINE
    nvme-WD_Red_SN700_1000GB_XXXXXXXXXXXX-part1   882G      -      -        -         -      -      -      -    ONLINE

a few more days later…

NAME                                              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pool                                              880G   202G   678G        -         -    15%    22%  1.00x    ONLINE  -
  mirror-0                                        880G   202G   678G        -         -    15%  23.0%      -    ONLINE
    nvme-WD_Red_SN700_1000GB_XXXXXXXXXXXX-part1   882G      -      -        -         -      -      -      -    ONLINE
    nvme-WD_Red_SN700_1000GB_XXXXXXXXXXXX-part1   882G      -      -        -         -      -      -      -    ONLINE

In a month it’ll be what, 100% fragmented?
Is this thing going to implode on me?
I realize ZFS fragmentation is “just” a measure of the availability of contiguous free block regions. With a COW-based filesystem this is going to happen, and at moderate levels it’s not a big deal, or so I have read.

Questions:
When does fragmentation get to be a big deal?
What is to be done so that it doesn’t get to be a big deal?
Perhaps there is some Minecraft-specific ZFS tuning?

Thanks as always for your feedback and insights.

It will only ever become a real problem when space is closing in on max capacity. If you’re at 80% capacity and rewriting blocks all the time, expect serious fragmentation and performance decrease.

Just avoid filling the drives… CoW needs free space to breathe. And your pool is at 22% CAP, so I wouldn’t worry about that FRAG figure.
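If you want a quick way to keep an eye on just those two numbers, something like this works (property names taken from zpool list; double-check them on your OpenZFS version):

zpool list -o name,size,allocated,free,fragmentation,capacity pool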

The problem is that if you take regular snapshots and everything is changing all the time, a lot of old, still-referenced blocks end up scattered all over the disk. You could relax the snapshot policy (or delete some snapshots) for those Minecraft “scratch” directories, and FRAG should eventually fade over time or even mostly disappear.
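A rough sketch of how you could see where snapshot-held space sits before pruning anything; dataset and snapshot names below are placeholders, adjust to your layout:

# list snapshots under the Minecraft dataset, sorted by the space each holds exclusively
zfs list -t snapshot -o name,used,creation -s used -r pool/minecraft

# then prune selectively, e.g.
zfs destroy pool/minecraft@some-old-autosnap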

Happens a lot with temp data or if you snapshot the swap on VMs. Basically all the (usually small) stuff that is in flux all the time. Bad for CoW.

It will trigger a nuclear meltdown. Good thing we got you watching that timer :slight_smile:

Don’t worry. 15-30% FRAG is fine and “normal” and shouldn’t impact stuff really. Eventually we all rack up FRAG over time, some more, some less. Unless we only have large files and never delete/modify any of them. It’s the normal lifecycle of a CoW filesystem. It’s the same with BTRFS.

TLDR: Programs that write to files and modify them all the time are bad for CoW. Frequent snapshots make this even worse. FRAG is normal and only a problem once CAP is dangerously high (the 80% rule, though I’d recommend less for pools that change frequently or hyper-frequently).

5 Likes

An earlier post of mine

Note that FRAG is a very poorly named and rather arbitrary calculation of free space fragmentation, not file fragmentation. If you look into how the value is generated, calling it a percentage is overly generous and misleading to pretty much everyone.

How the percentages relate to the average segment size of free space goes roughly like this. Based on the current Illumos kernel code, if all free space were in segments of the given size, the reported fragmentation would be:

  • 512 B and 1 KB segments are 100% fragmented
  • 2 KB segments are 98% fragmented; 4 KB segments are 95% fragmented.
  • 8 KB to 1 MB segments start out at 90% fragmented and drop 10% for every power of two (e.g. 16 KB is 80% fragmented and 1 MB is 20%). 128 KB segments are 50% fragmented.
  • 2 MB, 4 MB, and 8 MB segments are 15%, 10%, and 5% fragmented respectively.
  • 16 MB and larger segments are 0% fragmented.
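If you want to see what the allocator actually sees, you can peek at the per-metaslab free space with zdb. It’s read-only, but the output format and level of detail vary by OpenZFS version, so treat this as a sketch:

# show offset, spacemap and free space for each metaslab in the pool
zdb -m pool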

See also https://www.bsdcan.org/2016/schedule/attachments/366_ZFS%20Allocation%20Performance.pdf

Basically it’s not something to worry about or try to reduce unless you are very low on disk space and trying to write out blocks, at which point ZFS may start having trouble quickly finding free space to write into and may have to resort to using gang blocks (essentially a pointer to multiple smaller blocks) when writing large blocks.

If pool data consists mostly of large blocks, FRAG will trend very low (my large-block (16M) storage pools are currently at 0-1%).

Active pools at the default recordsize of 128K, or zvols with a much smaller volblocksize, will inherently have a much larger FRAG value, because you are leaving lots of 128K-or-smaller bits of free space behind whenever your COW filesystem frees up old blocks after writing new ones.

Again, this is not a problem to try to optimize away; it’s just (understandably) a misunderstanding of what the system is actually trying to tell you. Frankly it should be renamed to FFREE or FFRAG or something.

For reference:

It’s not uncommon to see people with a FRAG of 50-70% doing nothing special, and encountering no actual issues.

https://forums.freebsd.org/threads/fragmentation.58307/
Just for the sake of illustration: I created two brand-new pools on two new disks one week ago today. The two pools report 31% and 52% fragmentation, respectively. The pool with 52% fragmentation is about 2.7TB in size, and is 68% full. In the week since it was created, the pool has seen perhaps 60 GB of writes. Meanwhile, the pool with 31% fragmentation is only 20GB in size and is 60% full, has seen probably 40GB-50GB of writes—at least twice its total capacity.

So the smaller, more active pool is less fragmented, and both pools are only one week old. It seems the FRAG value cannot represent the same sort of file fragmentation that degrades I/O performance on other filesystems.

And a fantastic analogy in this thread

https://old.reddit.com/r/zfs/comments/squjbj/should_i_start_worry_about_things_with_frag_62/
Generally the opposite, actually. Let’s say you have a blank notebook, and you want to take notes for three different classes throughout the semester. If you start your notes on page 1, and continue page to page from that point, your notes will be a mix of all three subjects. Your data will be highly fragmented, and your free space will be unfragmented since all of the blank pages are always together at the back of the book.

But if you start with English notes on page 1, history notes on page 30, and science notes on page 60, you have enough room in each section to expand the notes throughout the semester. Your data is unfragmented, but your free space is spread throughout the notebook so it’s moderately fragmented.

This breaks down as disks get close to full, and is one reason why performance drops off suddenly when drives get close to full—the system is spending more and more time looking for contiguous blocks of free space, which does indeed lead to higher data fragmentation. But having free space spread throughout the disk is generally a good thing.

This guy with FRAG 60% does have a problem, the problem being that he only has 191M of free space left and even less available for his /boot

https://askubuntu.com/questions/1382986/zfs-bpool-is-almost-full-how-can-i-free-up-space-so-i-can-keep-updating-my-syst

5 Likes

“Is this thing going to implode on me?”
No, since you use SSDs.

“When does fragmentation get to be a big deal?”
When you write new data on HDDs.

“What is to be done so that it doesn’t get to be a big deal?”
Nothing.

“If you’re at 80% capacity and rewriting blocks all the time, expect serious fragmentation and performance decrease.”
That is a very old rule from back in the HDD days, and it wasn’t very precise to begin with. For block storage, you will already see a performance penalty at 50%, and at 80% it will get really bad.

“Perhaps there is some Minecraft-specific ZFS tuning?”
Maybe. Is Minecraft in a zvol? What volblocksize does the zvol use? What file sizes does a Minecraft server write?

2 Likes

Alternatively, if it’s not on a zvol, check the recordsize property of the dataset.

You may think of Minecraft world files as database-like in this respect. I’m not sure what the write patterns are, though, so perhaps an strace will be needed to profile them.
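A quick-and-dirty sketch of such profiling; the PID is a placeholder, the syscall list is just a guess at what matters here, and note strace does slow the target noticeably:

# attach to the running server and tally write-family syscalls (-c prints a summary on Ctrl-C)
strace -f -c -e trace=write,pwrite64,fsync,fdatasync -p <minecraft_server_pid>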

1 Like

FRAG is not file fragmentation, but free-space fragmentation. Because ZFS is copy-on-write, and Minecraft updates chunk files whenever something changes, this is not surprising.

I don’t think this is a big concern, to be honest. It’s more or less expected and you should be okay.

1 Like

Updated my prior post in this thread on why you should sleep soundly regarding the FRAG value.

If one still needs something to stress about, then I’d recommend just looking at your write load and estimating how long it would take to burn through the endurance of the drives.
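A rough back-of-the-envelope sketch of that estimate: smartctl reports lifetime NVMe writes, and zpool iostat gives the current write bandwidth; the TBW rating to compare against comes from the drive’s datasheet.

# lifetime writes and wear so far
smartctl -a /dev/nvme0 | grep -iE 'data units written|percentage used'

# sample pool write bandwidth over a minute
zpool iostat pool 60 2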

I remember someone seeing the same behaviour with torrents. Some applications, not just full-blown databases (I worked with an enterprise product that wore down SSDs just by logging), just like to flush every damn changed bit to disk at very short intervals. Whether this is necessary is a more philosophical discussion.

But they are part of everyday computing. I don’t run Minecraft servers or torrents, but I have a few dozen zvols and I’m constantly zapping them, installing new OSes, or generally altering them. Man, I had 82% CAP with 94% FRAG and it was almost HDD-level random IO.

send/recv and deleting 2-year-old snapshots (took a while :smiley: ) solved a lot and got CAP and FRAG under control. Ever since, I’ve kept a closer watch on CAP and relaxed some snapshot policies I’d made (there were like 22k snapshots at peak; now I’m down to 4k and my pool says thank you).
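For anyone curious how many snapshots they’re sitting on, a quick tally:

zfs list -H -t snapshot -o name | wc -l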

It can get really messy if you ignore what your filesystem is and what it needs to shine. That is important for any filesystem out there, but with CoW, things can get very uncomfortable if you like to explore the limits. I know Ceph will shut down the cluster and refuse any requests if CAP is >80%, just to make “hardcoded sure” you don’t run into these troubles (and other troubles specific to the cluster failure domain).

Good stuff, as always.

Thank you all for the sound advice.

I’ll gather your questions in one post, and add more of my own. =)

I did indeed take those two 1TB WD SN700 SSDs and partition them down, leaving ~20% unprovisioned space for bad-block remapping by the controller. That means I am not writing to the whole drive, which I read is best for ZFS, but I assume it really shouldn’t matter that I’m using a partition rather than the whole disk, since each drive is dedicated to only this one pool.

Here are the stats:

# zfs get all pool | grep -e acl -e attr -e comp -e atime -e record -e sync -e special -e metadata
pool  compressratio         1.18x                  -
pool  recordsize            128K                   default
pool  compression           lz4                    local
pool  atime                 off                    local
pool  aclmode               discard                default
pool  aclinherit            passthrough            local
pool  xattr                 sa                     local
pool  sync                  standard               default
pool  refcompressratio      1.00x                  -  
pool  acltype               posix                  local
pool  relatime              off                    default
pool  redundant_metadata    all                    default
pool  special_small_blocks  0                      default
# find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'

  1k: 2307431
  2k: 1626436
  4k: 1331945
  8k: 1095750
 16k: 1128945
 32k: 259367
 64k: 129210
128k:  42171
256k:  11571
512k:   5784
  1M:   3760
  2M:   4716
  4M:   2755
  8M:   1545
 16M:    662
 32M:    310
 64M:     82
128M:     48
256M:     16
512M:     16
  1G:      4
  2G:      2
  8G:      1
 32G:      1
 64G:      1

Block Size Histogram

block   psize                   lsize                     asize
size    Count   Size    Cum.    Count   Size    Cum.    Count   Size    Cum.
512     680361  348344832       348344832       680361  348344832       348344832       0       0       0
1024    1157530 1448215040      1796559872      1157530 1448215040      1796559872      0       0       0
2048    1855795 5166855680      6963415552      1855795 5166855680      6963415552      0       0       0
4096    2000210 10142238720     17105654272     1494989 8345898496      15309314048     4182811 17132793856     17132793856
8192    1278280 14459683328     31565337600     1136651 13234021376     28543335424     2384019 22132899840     39265693696
16384   1222863 27419568128     58984905728     1563518 33005192704     61548528128     1520788 32581959680     71847653376
32768   363056  15897610240     74882515968     274654  11611489280     73160017408     465824  19439398912     91287052288
65536   194237  17930324480     92812840448     135123  12365927424     85525944832     196816  18085134336     109372186624
131072  826694  108356435968    201169276416    1280405 167825244160    253351188992    828557  108614729728    217986916352
262144  0       0       201169276416    0       0       253351188992    211     60321792        218047238144
524288  0       0       201169276416    0       0       253351188992    0       0       218047238144
1048576 0       0       201169276416    0       0       253351188992    0       0       218047238144
2097152 0       0       201169276416    0       0       253351188992    0       0       218047238144
4194304 0       0       201169276416    0       0       253351188992    0       0       218047238144
8388608 0       0       201169276416    0       0       253351188992    0       0       218047238144
16777216        0       0       201169276416    0       0       253351188992    0       0       218047238144

                            capacity   operations   bandwidth  ---- errors ----
description                used avail  read write  read write  read write cksum
pool                       203G  677G 2.23K     0 9.21M     0     0     0     0
  mirror                   203G  677G 2.23K     0 9.21M     0     0     0     0
    /dev/disk/by-id/nvme-WD_Red_SN700_1000GB_XXXXXXXXXXXX-part1                           1.12K     0 4.60M     0     0     0     0
    /dev/disk/by-id/nvme-WD_Red_SN700_1000GB_XXXXXXXXXXXX-part1                           1.12K     0 4.61M     0     0     0     0

Minecraft is on a dataset alongside other datasets on that same common mirrored zpool. Yes, that means “fragmentation” is a combined result of the Minecraft data together with the other data written to that pool… but it was only after Minecraft was moved to the zpool that the FRAG % started going up.

As for Minecraft: the server is essentially writing permanently to hundreds of thousands of files. A huge amount of change happens not just in the MC “DB” itself (a DB of files… a real database, if Mojang ever moved to one instead of a million files, would be something!) but also, for example, in the non-stop generation of Dynmap mapping files:

[14:32:43 INFO]: [dynmap] Radius render of map 'surface' of 'world' in progress - 177700 tiles rendered (4184.03 msec/tile, 3826.55 msec per render)
[14:41:54 INFO]: [dynmap] Radius render of map 'surface' of 'world' in progress - 177800 tiles rendered (4184.66 msec/tile, 3827.25 msec per render)
[14:51:02 INFO]: [dynmap] Radius render of map 'surface' of 'world' in progress - 177900 tiles rendered (4185.24 msec/tile, 3828.00 msec per render)

So… dare I ask… do I have some glaring mistake on this zpool?

Which I call shoddy coding. Imagine if all processes ran like that. That’s why we have that stuff called memory (or databases, as you already pointed out, to keep things more structurally sound and orderly).

If you want to avoid it, just set the working directory to a non-CoW FS on the unpartitioned space. But that’s just extra work and not why you run ZFS.

Otherwise: if it isn’t impacting performance, don’t change it. Let it run and accept the FRAG value. There is nothing you can change (as @Log also pointed out) other than running apps that aren’t lazily coded in ways that mess up and slow down storage. It will never spiral out of control unless your CAP is high; FRAG will settle at some value and just sit there without impacting much.

No.
I’m only surprised by that low compressratio given there are a million files around. I see much higher values on my datasets with that kind of structure. Maybe Minecraft has its own compression running (obviously redundant and unnecessary with ZFS).

I’d reduce recordsize to 4-8k on the relevant dataset (assuming you run with 128k for everything). That is pretty much a database workload: lots of small stuff with (as we’ve been told) lots of file modifies and rewrites. If the data were sitting on a zvol, you couldn’t change the volblocksize after creation, and the 32k default should be OK.
It won’t help with FRAG, but it should give better performance in general.
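For what it’s worth, recordsize is a per-dataset property and only applies to data written after the change, so existing files need a rewrite (send/recv or copy) to pick it up. The dataset name below is a placeholder:

zfs set recordsize=8K pool/minecraft
zfs get recordsize pool/minecraft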

Is that a Bedrock or Java edition server? I don’t recall Java having so many files, albeit that was (checks notes) some 8 years ago.

This is Java edition, 1.21.1, running on RHEL.

Along with the MC server itself, there are mapping plugins that read and calculate isometric topographical map tiles every time someone changes anything… which happens all the time. And they write their own files along with the world files.

Reduce recordsize… OK, that’s work, but doable. I can do this because that spinning-rust ZFS pool is still there (now just serving as a backup target for snapshots), so I can destroy the dataset and copy it all back. (Yes, writing to the HDDs is dog slow, but with ZFS send/receive it’s manageable over several days.)

You can create a new dataset on the same pool and send/recv between them instead. As long as there’s free space, it might also help reduce fragmentation.
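Roughly like this, as a sketch; dataset and snapshot names are made up, and you’d want a final incremental pass with the server stopped before swapping things over:

zfs snapshot pool/minecraft@migrate
zfs send pool/minecraft@migrate | zfs receive pool/minecraft_new

# stop the server, send the last delta, then swap the datasets
zfs snapshot pool/minecraft@migrate2
zfs send -i @migrate pool/minecraft@migrate2 | zfs receive -F pool/minecraft_new
zfs rename pool/minecraft pool/minecraft_old
zfs rename pool/minecraft_new pool/minecraft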

1 Like

Why? Seriously asking. Recordsize is just a max value. Many of his 4k writes will just use 4k records instead of 128k. And the bigger 128k writes can use 128k.

If it were on a zvol (which it is not, by his own admission), I would set it to 4k with that many small files being written. With 32k, the r/w amplification and fragmentation would be through the roof.

I think, all in all, you are perfectly fine because you use mirrors and not RAIDZ. This is a prime example of why you should not use RAIDZ for VMs, only for large files.

The only thing I can think of that you could tune would be to use a SLOG, if Minecraft does a lot of sync writes. But I have no idea whether it does. That way the ZIL does not end up on your mirror pool, which could help with fragmentation (and of course sync write speed). I don’t see the need to mirror the SLOG in homelabs, by the way, since the SLOG is only read from after a system halt. To lose data, you would need a failing SLOG and a system halt at the same time.
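If you want to check whether sync writes even matter here before buying anything, one crude approach: newer OpenZFS can split the request histogram into sync vs. async (the -r flag), and adding a log vdev later is a one-liner (the device path is a placeholder):

# request-size histogram, split into sync/async read/write
zpool iostat -r pool 10 3

# only if sync writes turn out to be significant
zpool add pool log /dev/disk/by-id/nvme-SOME_SMALL_SSD-part1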

That’s one part I don’t agree with. A failing SLOG is likely to cause a system halt, and a violent system halt may take some components with it.

The writes will only be 4k if the files are under 4k. A 5-byte write to a 128KiB file with recordsize=128KiB is a 128KiB write, not a 4KiB write, AFAICT.

1 Like

As far as I understand it, the SLOG is never actually read from (except after a system halt). Do you think a failing SLOG will crash the system because a write to it fails?

I am pretty sure that a 5-byte file will be a 4k write and a 5k file will be an 8k write (with a 128k recordsize), since allocations round up to the next multiple of the 4k (ashift 12) block size.
I am under the impression that recordsize, unlike volblocksize, is not a fixed value but a max value. But maybe this is wrong and ZFS actually does use 128k for a 5k file and just compresses away the zeros, which then basically also only uses slightly over 5k.

Update: I just created a test dataset to check this.
Compression off, recordsize set to 1MB.
The dataset, even empty, uses 96k.

Writing 100x 10k files to it results in 2.58MiB used.
If every file wrote a full 1MB record, this should be ~100MB.
So I guess it is not only the compression?
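Roughly, the test amounts to something like this (pool name and mountpoint are placeholders, compression explicitly off):

zfs create -o compression=off -o recordsize=1M pool/rstest
for i in $(seq 1 100); do
  dd if=/dev/urandom of=/pool/rstest/file$i bs=10k count=1 status=none
done
zfs list -o name,used,recordsize pool/rstest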

I have seen a failing drive cause a kernel panic, albeit a few years back. Kernel stalls are definitely still possible.

A 5-byte file will indeed be a 4k record. But a 128KiB file with a 5-byte write inside it will be a 128KiB record, and a modification will still be a 128KiB record. RE: FRODOLF in the other thread.

Would that kernel panic not happen if the SLOG is a mirror?

Yeah, I agree. That is why I think you can set the recordsize to something even bigger, like 1MiB.
My theory: even if Minecraft mostly writes small 16k files, there is no advantage to a 32k vs. a 1MiB recordsize, since that 16k file will still be stored as a 16k record whether the dataset is set to 32k or 1MiB. And for any small change to that 16k file, the whole 16k file has to be read and rewritten.

My point is in the other direction: if it has 1MiB files with 4KiB writes, a smaller recordsize would be beneficial. But we won’t know unless we check the workload.

It would happen, but you’d still have a valid second mirror of your ZIL. Whether missing that second half would be catastrophic for your pool, I can’t tell yet, but from what I understand it can be.

1 Like

Ahh, I see. Agreed. But that would only be beneficial in terms of r/w amplification (so in a performance sense) and not fragmentation, right?
ZFS should just update the whole record without any further fragmentation if the write is async?