VM on ZFS


Hi

I have a ZFS pool on my server and would like to run a VM off it.

Is there a way to have the disk image behave like a directory or something similar, so that individual files inside that .img can be cached by ARC etc.?

The blocks in a zvol block device, or in a raw/qcow2 file on a dataset, will be cached in ARC.
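If you want a rough idea of how much of the VM's I/O the ARC is actually absorbing, the arcstat and arc_summary tools that ship with OpenZFS give a quick view (the interval below is just illustrative):

```
# print ARC size, hit rate and misses every 5 seconds while the VM is busy
arcstat 5

# one-shot report of ARC size, hit ratios and related tunables
arc_summary
```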

Usually the recommendation is to set ARC to cache metadata only and let the VM use its own cache: the guest can make slightly better decisions about what to keep, and you aren't doubling the amount of RAM used for what is mostly the same data.

I use the ZFS pool for quite a few other things, for which I'd probably want to keep the ARC in normal mode. Presumably, to achieve this, I'd have to create a separate dataset for the VM?

The VM is a Windows 10 desktop… will it automatically cache frequently used files/folders?

You can set the ARC caching behaviour per dataset and per zvol, so you can leave the rest of the pool as is if you go that route.
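For reference, primarycache is an ordinary property, so per-dataset tuning looks something like this (the dataset names are made up for the example):

```
# cache only metadata for the dataset or zvol backing the VM
zfs set primarycache=metadata tank/vms/win10

# everything else on the pool keeps the default (all)
zfs get -r primarycache tank
```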

As far as what each OS will cache, I couldn’t tell you.

You could export a directory via NFS and have your VM mount it.
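A minimal sketch of that approach, assuming a hypothetical tank/share dataset and a Linux guest on a typical libvirt NAT network (a Windows guest would need the "Client for NFS" optional feature, or an SMB share instead):

```
# on the host: create the dataset and share it to the VM network only
zfs create tank/share
zfs set sharenfs="rw=@192.168.122.0/24" tank/share

# inside the guest: mount the export
mount -t nfs 192.168.122.1:/tank/share /mnt/share
```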

Is it? I've never heard this advice before. I think your argument is sound, but it seems to depend a lot on the use case: the guest OS can make better decisions, but on the other hand the host often has more memory to use for cache (whether ARC or Linux's pagecache). You are right, of course, that it implies an extra round of shuffling data back and forth in RAM; I never actually thought much about that aspect. Either way, I always follow the advice to leave qemu's own caching off, and I think everyone should.
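For completeness, "leaving qemu's own caching off" usually means cache=none on the drive, which bypasses the host page cache; a bare-bones example invocation (memory size, disk path and bus are placeholders):

```
qemu-system-x86_64 -enable-kvm -cpu host -m 8G \
  -drive file=/tank/vms/win10.qcow2,format=qcow2,if=virtio,cache=none
```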

On a more technical note:
One thing I noticed with VMs and primarycache=metadata is that unless block sizes are perfectly aligned, sync writes lead to read-modify-write amplification (test method: running CrystalDiskMark random 4k write tests on a 4k-cluster NTFS in a Win10 guest, with 64k recordsize/volblocksize on the host, and monitoring zpool iostat -vv to observe both reads and writes during writing). With primarycache=all, I see no RMW amplification.
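For anyone who wants to reproduce this, observing the effect only needs the property toggle plus pool iostat (the dataset name is a placeholder):

```
# cache only metadata for the VM's dataset/zvol
zfs set primarycache=metadata tank/vms

# while the benchmark runs: per-vdev reads show up alongside the writes
zpool iostat -v tank 1

# revert to full caching: the reads disappear from iostat
zfs set primarycache=all tank/vms
```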

The above suggests to me that if only metadata is cached in ARC, then writes are not cached either (at least not sync writes). But you'd really want this caching, to take advantage of coalesced writes. (Of course for sync writes the ZIL will add to the write load regardless, but my assumption is that those extra writes are more efficient than RMW cycles.)

Unless I am missing/misunderstanding some part of the underlying mechanisms (which happens a lot :slight_smile: ).

(My detour above is probably quite far from @burny02's original question, which I believe is already answered in the thread.)


Huh, that’s an interesting observation, thanks for bringing it up. Definitely makes me reconsider.


Thinking again, I realized there is another potential explanation for my observations. Maybe there is always RMW amplification in the test case I describe, but with primarycache=all the data to be read already sits in ARC, so the relevant reads are never logged by zpool iostat -vv.

Actually, this might make more sense. In my understanding, CrystalDiskMark preallocates a 1 GiB test file on NTFS and does all its IO within that file. The file is organized in 4k blocks (i.e. NTFS allocation units), which are in turn stored on a virtual device with (in my test case) 64k blocks.
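If you want to check where your own stack sits, each layer's block size can be read out directly; a rough sketch with hypothetical names:

```
# host: recordsize (file-backed) or volblocksize (zvol-backed)
zfs get recordsize tank/vms
zfs get volblocksize tank/vms/win10-disk

# host: qcow2 cluster size (reported as cluster_size)
qemu-img info /tank/vms/win10.qcow2

# Windows guest: NTFS allocation unit ("Bytes Per Cluster")
fsutil fsinfo ntfsinfo C:
```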

My original assumption (which led to the conclusions in my last post) was that the size mismatch (64k vs 4k) would be taken care of because ZFS would have plenty of chances to fill a 64k range with 4k writes before the next txg commit. (NB: CrystalDiskMark is spamming 4k writes as fast as it can!) But this assumption is likely wrong, because ZFS never sees the blocksize mismatch! Rather, the RMW amplification must be happening between the VM and the storage layer immediately below it (qcow2 clusters or zvol blocks); ZFS just receives read and write requests of 64k size.

My revised theory implies that RMW amplification happens regardless of the primarycache setting. If this revised theory is correct, then the reason I see only writes and no reads in zpool iostat when primarycache=all is not a lack of RMW, but simply that all the reads are served from ARC and never hit storage.

It may be relevant that my tests were run with compression=off. The logic was that compression shouldn't matter for random data, but I realize that once mismatched IO chunk sizes come into play it might matter a lot! All this calls for further testing :slight_smile:
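If anyone repeats the test, toggling compression between runs and checking the resulting ratio is straightforward (dataset name again a placeholder):

```
# one pass with compression off, one with lz4, then compare benchmark results
zfs set compression=off tank/vms
zfs set compression=lz4 tank/vms
zfs get compression,compressratio tank/vms
```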


To recap

I made the following observation: with primarycache=metadata, random 4k sync writes from the guest caused read-modify-write amplification visible in zpool iostat -vv, while with primarycache=all I saw only writes and no reads.

My first interpretation was that "turning off" ARC (primarycache=metadata) also turns off the collection of sync writes into transaction groups, i.e. the standard write caching that ZFS does for sync and async writes alike. But then I realized that zpool iostat -vv probably doesn't log reads served from ARC, so a better explanation might be that there is always RMW in my test scenario (regardless of the primarycache setting) due to the mismatch in allocation unit size, and ARC simply helps with the read part of it.

If I'm right, the lesson is simply to avoid mismatches between NTFS / qcow2 / ZFS block sizes when hosting VM storage.
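A sketch of what an "aligned" setup could look like, assuming 64k units end to end (sizes, names and the 64k choice itself are illustrative, not a recommendation):

```
# qcow2-on-dataset route: keep recordsize and qcow2 cluster_size in step
zfs create -o recordsize=64k tank/vms
qemu-img create -f qcow2 -o cluster_size=64k /tank/vms/win10.qcow2 64G

# zvol route: volblocksize matching the guest's allocation unit
zfs create -V 64G -o volblocksize=64k tank/vms/win10-disk

# inside the Windows guest, format NTFS with the matching allocation unit:
#   format D: /FS:NTFS /A:64K /Q
```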


:+1: That's exactly what I was considering as a possibility as well, and it's now on my ever-growing list of things to look into.