Proxmox ZFS Mirror - Very low performance in VMs

A few months ago I migrated from a single-drive Proxmox install to a mirrored array using a Supermicro PCIe dual-slot NVMe card (my motherboard doesn’t have native physical NVMe support), and ever since, I have noticed very slow VM performance. This is a homelab, so it isn’t “really” an issue, but even just apt upgrade on simple Ubuntu Server VMs takes a while, and if I kick a bunch off at once, things really slow down. This was not an issue at all when I was using a single NVMe drive without ZFS.

I do notice really high I/O delay, and being almost entirely a noob, I assume this is related? I store my VMs on the same array as the boot drives, and have manually reduced ZFS’s RAM usage since it was eating up all my system’s RAM for no real reason (again, this is a homelab, I really don’t need ARC acceleration on a pair of NVMe drives that do almost no reading and writing anyway…). I believe I set the ARC to either 4 or 8 GB, which should be plenty…

The screenshot below was taken while I was running apt update on 3 Ubuntu Server VMs. Any ideas why, or what is causing this? It’s certainly not the end of the world, but it is a bit annoying, as this machine should not be slow at anything I throw at it.

It’s not eating your RAM, it is taking free RAM until something else needs it. If ZFS doesn’t, your page cache certainly will. And just because the page cache isn’t listed as “RAM used” doesn’t mean Linux isn’t using those bits. There is a good reason to use all available RAM: unused RAM is wasted RAM. Windows does the same, but hides it from the user. I have never heard of a Windows user disabling the cache because it’s eating all the RAM.

This has nothing to do with your problem though. An 8G ARC is totally fine unless you have some very extravagant tuning and/or dedup going on.
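If you want to see what the ARC is actually capped at and using right now, the zfs tools can show you (a quick sketch; arc_summary ships with the ZFS utilities):

cat /sys/module/zfs/parameters/zfs_arc_max   # configured cap in bytes (0 = default)
awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
arc_summary | head -n 25                     # human-readable overview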

I don’t think 25% I/O delay is a critical issue…

Are you using zvols or datasets? Monolithic files for a VM image (like QCOW2) are always problematic, especially when modifying stuff and when the recordsize is just too large to guarantee optimal performance.

I recommend using zvols or reducing the recordsize on your datasets to match the cluster size of the underlying filesystem (the one the VM is using on the vdisk). You can’t retroactively change the recordsize of existing data, but future writes should see an improvement.

Also check that autotrim is enabled and do a manual TRIM just in case. Is pool capacity above 70-80%, or does the pool show high fragmentation? All of these can and will slow things down quite a bit. Usually this isn’t the case, but I always check these things first.
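Both are quick to check (rpool being your pool):

zpool list -o name,size,alloc,cap,frag,health rpool
zpool get autotrim rpool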

An apt upgrade is a lot of random I/O, and if some 1k file gets updated while the record on your pool is 128k, that’s a lot of work for updating a line in a man file or whatever. Extrapolate this to a ~100MB sync of small files from the Ubuntu repos and it certainly adds up.

A ZFS mirror should perform as well as it gets, and having the same pool for both Proxmox and VM storage isn’t a problem at all.

ZFS will always be slower than ext4/xfs. It’s doing a lot of extra stuff. CoW is great, but it’s certainly not free.


While I mostly agree, from the light research I did it appeared as though the ARC was a bit sticky with RAM on Proxmox, and didn’t de-allocate for VMs’ needs as quickly as it should. This may have been false information, but either way, I am using ~80 GB of data for my VMs, so I figured hard-limiting the ARC size to 4 or 8 GB would be more than fine (I think I picked 8, because I do have the RAM to spare…)

I am not using zvols or datasets; I simply set up the ZFS array as a mirror and did a restore of my VMs from an external drive. I am somewhat familiar with zvols and datasets via my use of TrueNAS as my NAS, but I don’t understand the inherent difference between them and why one would be preferred. Could you explain why this would be preferred? I would rather actually learn something than just blindly change some settings hoping it fixes the perceived issue I have :slight_smile: I know enough to know enough, but ideally I would like to know enough to actually understand and be able to teach others in the future.

Again, I guess this is my ignorance showing, but how would I go about this? I assume TRIM is on by default, but where do I even check, and how would I run it manually? Pool capacity is 500 GB and I am using not even 100 GB, so that shouldn’t be an issue in terms of ZFS or the SSDs’ NAND, but I do agree TRIM and garbage collection need to be running regardless; I am just not sure if they are. I vaguely remember seeing this somewhere in the settings but am not seeing it now…

Jumping through reddit posts, I did see this:

If you are running zfs, enable it on the zpool:

zpool set autotrim=on rpool

And:

On consumer grade SSDs like your 860 EVO, I’ve seen catastrophic performance problems caused by autotrim on ZFS.

Docs say: "Be aware that automatic trimming of recently freed data blocks can put significant stress on the underlying storage devices. This will vary depending of how well the specific device handles these commands. For lower end devices it is often possible to achieve most of the benefits of automatic trimming by running an on-demand (manual) TRIM periodically using the zpool trim command."

Would it make most sense to disable autotrim to be on the safe side? (I am not sure if it is on or off; zpool status doesn’t show this, and I am not sure what command would tell me.)

Looks like it’s already set to run a TRIM on a cron schedule:

# TRIM the first Sunday of every month.
24 0 1-7 * * root if [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/trim ]; then /usr/lib/zfs-linux/trim; fi

This leads me to believe it is likely on right now. I am running Samsung 980 SSDs, certainly not enterprise grade, but also not bottom of the barrel. Also, I was running a single 980 for a year and didn’t have this issue, so I suspect it is something to do with ZFS, or the way TRIM is working, or I suppose the fact that ZFS is CoW as you suggested… but even being CoW, something is certainly up; this should still be extremely performant.

Any ideas?

If you are using ZFS, your data is stored somewhere. Proxmox creates datasets by default and otherwise uses default settings for everything I’ve checked so far.

My guess is still the 128k default recordsize applied to heavy random I/O sitting in some monolithic QCOW2 VM disk image. Create a dataset with 8k records and check if a VM stored there is just as slow. There are good reasons for having a lower recordsize on random-I/O-heavy datasets. But it’s a tradeoff, and I wouldn’t change it just for some (usually) automatic updates. More metadata and less throughput are the usual suspects that suffer.
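Something like this as a test (the dataset name is just an example):

zfs create -o recordsize=8K rpool/data/test8k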

About the TRIM: there is automatic background TRIM (very convenient), but you can always do a manual TRIM with zpool trim poolname. With the cronjob in place in Proxmox, there’s probably no need for it.
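To check whether autotrim is on, and to kick off and watch a manual run:

zpool get autotrim rpool
zpool trim rpool
zpool status -t rpool   # -t adds per-device TRIM status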

With zfs get recordsize pool/dataset you can check this, and with set instead of get you can change it. “get all” gets you a nice list of all the things you can tweak on a per-dataset basis.
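For example (same hypothetical dataset as above):

zfs get recordsize rpool/data/test8k
zfs set recordsize=8K rpool/data/test8k   # only newly written data uses the new size
zfs get all rpool/data/test8k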

A zvol is a kind of storage in ZFS that provides you with a block device, meaning it’s a ZFS-native method of giving your VM a standard Linux block device as storage. I always found huge files stored in a filesystem a bit awkward to use and manage, and I prefer to use zvols for all my VMs and containers, but YMMV. Internet wars have been fought over this topic, after all :wink:

TRIM as cronjob, CAP and FRAG fine, mirror vdev…seems ok so far.

I don’t know what to expect from a 980 Pro in a ZFS mirror but it might just be what it is. I’d check and test 4k-16k recordsize and/or 4k-16k volblocksize on a zvol just in case. Because for your average VM disk that is not some 4TB disk storing media, low values are king.

Proxmox defaults to a QCOW2 file during VM creation and zfs-local is probably the storage location, so I assume that is what you are using for VM storage.

Looks like it is using 128k records:

root@pve:~# zfs get recordsize rpool
NAME   PROPERTY    VALUE    SOURCE
rpool  recordsize  128K     default

So, what you do is create a zvol for each VM and pass it that zvol? Instead of just letting it use the monolithic “entire drive”?

Looking at my VM options, I do have these settings checked (“SSD emulation” and “Discard”), which, if I understood correctly, enables TRIM to be passed to the guest OS as an option it can run:
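With discard enabled on the virtual disk, I can apparently also trim manually from inside the guest:

fstrim -av   # run inside the VM; trims all mounted filesystems that support it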

This is correct, as far as I know. I know it is using zfs-local for VM storage, but I am unsure where to set or check QCOW2. What would I use instead of QCOW2…?

Also, if I wanted to do a test, I would create a new zvol with 8k records, and then could I just migrate a VM to that zvol (I may just restore it from a backup to that new location, sorta the same thing, but may be “easier” for me to do) and see if it’s any different?

This ZFS array is only storing VMs and containers, so I would prefer to target VM OS performance, which I imagine is mostly smaller files, thus an 8k record size? All “actual data” is stored on TrueNAS (which… is a VM on this box, with its own ZFS array of spinning rust; those drives are passed through via an HBA directly to TrueNAS, so Proxmox settings are not applicable to my “actual data”), and all VMs have NFS targets to TrueNAS for their “actual data”. And again, this isn’t really “an issue” since everything seems happy enough; nothing I do (Plex, Home Assistant, other simple homelab stuff) seems to have any performance issues, but I would like to make things a little faster, more in line with how it was prior to going ZFS.

How do I go about creating a new zvol (I can probably google this…), and then how would I set the record size?

Googling shows the following for creating a zvol; is this correct?

mkdir -p /data                               # optional; zfs creates the mountpoint itself
zfs create -o mountpoint=/data tank/data     # creates a dataset (a filesystem)

zfs create -s -V 4G tank/vol                 # creates a sparse 4G zvol (a block device)
mkfs.ext4 /dev/zvol/tank/vol
mount /dev/zvol/tank/vol /mnt

I haven’t tweaked this. TRIM on the ZFS pool is plenty.

I usually create batches of 5-10 zvols on the command line. And yes, my VM gets the zvol as storage: usually 8k and with the sparse flag, so it only uses the space the VM has actually used. And if the VM just gets a (virtual) block device anyway, I may as well use block devices created by ZFS itself and keep complexity low.
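A sketch of what one of those looks like (name and size are just examples):

zfs create -s -V 32G -o volblocksize=8K tank/vms/vm-test-disk0   # -s = sparse; volblocksize is fixed at creation time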

That’s correct for creating zvols. You don’t need to create a filesystem or mount it, because the VM gets this as a raw disk and the OS install formats it anyway.

I’m actually not sure anymore how Proxmox handles local ZFS. I’m using TrueNAS for storage and Proxmox just gets fed with storage as needed.

Do a simple

zfs list

and it lists all pools, datasets, and zvols present on the system. “vm-600-disk-0” should be listed there somewhere. If it’s a zvol/volume, it’s as good as it gets and you can forget about recordsize or volblocksize; the default volblocksize is 16k, a good “one size fits all” value.
If you can’t see it, it’s probably a file sitting in one of the datasets, usually mounted in some directory.

I can’t think of another major factor… other than the shoddy 980 Pro firmware I’ve heard about several times, or just overly high expectations regarding IOPS with that kind of software stack.

Makes sense.

I get this:

root@pve:~# zfs list
NAME                            USED  AVAIL     REFER  MOUNTPOINT
rpool                          83.2G   366G      104K  /rpool
rpool/ROOT                     3.39G   366G       96K  /rpool/ROOT
rpool/ROOT/pve-1               3.39G   366G     3.39G  /
rpool/data                     79.7G   366G      104K  /rpool/data
rpool/data/base-900-disk-0     1.14G   366G     1.14G  -
rpool/data/basevol-500-disk-0   566M  7.45G      566M  /rpool/data/basevol-500-disk-0
rpool/data/subvol-400-disk-1   1.92G  6.08G     1.92G  /rpool/data/subvol-400-disk-1
rpool/data/vm-600-disk-0       7.28G   366G     7.28G  -
rpool/data/vm-600-disk-1         56K   366G       56K  -
rpool/data/vm-600-disk-2         88K   366G       88K  -
rpool/data/vm-601-disk-0       8.01G   366G     8.01G  -
rpool/data/vm-670-disk-0       9.02G   366G     9.02G  -
rpool/data/vm-700-disk-0       13.3G   366G     13.3G  -
rpool/data/vm-701-disk-0       12.1G   366G     12.1G  -
rpool/data/vm-702-disk-0        116K   366G      116K  -
rpool/data/vm-702-disk-1       7.95G   366G     7.95G  -
rpool/data/vm-800-disk-0       5.62G   366G     5.62G  -
rpool/data/vm-880-cloudinit      76K   366G       76K  -
rpool/data/vm-880-disk-0       3.17G   366G     3.17G  -
rpool/data/vm-890-disk-0       3.70G   366G     3.70G  -
rpool/data/vm-899-disk-0       5.97G   366G     5.97G  -
rpool/data/vm-900-cloudinit      76K   366G       76K  -

Does this suggest it’s a dataset, and not a zvol, and thus it’s using the default 128k?

To be fair, I am using 980s, not Pros (they were significantly cheaper at time of purchase, and again, for my use… they should be WAY fast enough). I have also heard of shoddy 980 Pro firmware, but I think the 980s are a bit better (at least, they didn’t have the same issues the Pros did; for all I know they have issues with ZFS, who knows).

They are awesome: dirt cheap and low power, with excellent P-states. Got them in my laptop and two as boot drives for my devices. They tend to get very slow after 50GB of writes because the cache is kinda shit, but that’s plenty for the stuff I do. Boot drives don’t write 50GB of logs per minute.

zfs list -t volume for zvols

and

zfs list -t filesystem 

for datasets/filesystems

zfs get all pool/dataset also works, and there you can see whether it has a recordsize or a volblocksize.
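Or get it all in one listing:

zfs list -t volume -o name,volblocksize
zfs list -t filesystem -o name,recordsize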

Values usually get inherited by the “subdirectories” unless you set values manually. Keeps things easy and uniform.

Exactly why I figured they would be good in this case. I think I got the first 980 for 90 bucks (this was 1.5 years ago), and the second was 40 bucks 2 months ago, couldn’t pass that up.

Based on this, it looks like my VMs are zvols, and my containers are datasets.

Seeing as the vdev is set to 128k (is that the right terminology here?), the zvol which houses the VMs would inherit that 128k record size, correct? Or, no…

This suggests my VMs are in zvols, which means they are 16k? How can I confirm this?

Also, thanks a bunch for the detailed help here; it’s greatly appreciated.

No. It only applies to filesystems/datasets. zvols have a volblocksize, and that should default to 8-16k, which is totally fine. I’m happy to see Proxmox stores VMs in zvols by default on local ZFS storage. I wasn’t really sure, and I remember it being different, but I only used local storage for a couple of weeks, years ago.
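You can confirm it directly on one of the zvols from your list:

zfs get volblocksize rpool/data/vm-600-disk-0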

Yeah, so your problems aren’t related to a potential mismatch or an unfortunate record-/volblocksize. It was worth a shot, and I can’t think of anything else having a major impact on random I/O. The usual suspects for dragging down performance are all fine on your pool.

I’ll pass this problem to the other forum members. Sure, you can dig into fio and explore the deep rabbit hole that is ZFS benchmarking and analyze all kinds of things…
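If you do go down that rabbit hole, a minimal fio starting point could look like this (path and sizes are just examples; rpool/data is mounted at /rpool/data on your system):

fio --name=randrw-test --filename=/rpool/data/fio-test --size=2G \
    --rw=randrw --bs=4k --iodepth=32 --ioengine=libaio \
    --runtime=60 --time_based --group_reporting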

NVMe on ZFS has always been some degree of a disappointment for most people… I never personally experienced it being slow or faced any lockups. My 980s have certainly served as storage for a lot of VMs, especially on my laptop when testing stuff.


Interesting. I wonder if it has to do with PCIe bifurcation on this specific motherboard. I had to really dig up some not-so-well-documented data, as this mobo seems to “support it”, but only just barely, and only on the last BIOS ever released for it. I wonder if somehow it ended up bifurcating at a really low (x1?) width. Even then, I would be surprised if that was the issue; PCIe Gen 3 x1 is still ~1 GB/s, which is far and away plenty.
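If I wanted to rule that out, I suppose something like this would show the negotiated link width (the device address is an example; I’d look mine up first):

lspci | grep -i nvme                              # find the NVMe device addresses
lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'    # capable vs. negotiated speed/width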

Who knows, I guess unless someone else comes along and offers some suggestions, it’ll just be what it is.

Thanks again for the help and especially for the detailed answers, certainly cleared up a lot of information for me.