Possible bug: KVM virtualization, cache=none, and Linux 6.x [bug verification help needed]

Symptom: after updating Pop!_OS 22.04 to kernel 6.0.2 (via the regular rolling updates), QEMU-based virtual disks with libvirt’s cache mode set to either none or directsync can no longer be properly read from, written to, or booted by virtual machines. Changing the cache mode to anything else makes them work again.

Request to the community: can anyone else who uses KVM virtualization on a Linux >= 6.0 host try either of the following:

  • change the cache mode on a (virtual) boot drive to either none or directsync and see if it boots?
  • or alternatively, add a new virtual disk to an existing machine and try to create a filesystem on it, while having the cache mode set to either none or directsync? (See the sketch after this list.)
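
For the second test, a minimal sketch of what I mean (the guest name testvm is a placeholder for your own machine, and I’m assuming the target vdb is free):

# create a small scratch image and attach it with the suspect cache mode
qemu-img create -f qcow2 /var/lib/libvirt/images/cachetest.qcow2 2G
virsh attach-disk testvm /var/lib/libvirt/images/cachetest.qcow2 vdb \
  --driver qemu --subdriver qcow2 --cache none --persistent

# then, inside the guest, try to put a filesystem on the new disk:
mkfs.ext4 /dev/vdb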

Both tasks fail for me with kernel 6.0.2, but work as expected with 5.19.

Before filing a bug report somewhere I’d like to see whether it affects others, including distros other than Pop!_OS. If it does, it could potentially have a large impact on production systems that migrate to kernel 6+.

Working config

<disk type="file" device="disk">
  <driver name="qemu" type="qcow2" discard="unmap"/>
  <source file="/var/lib/libvirt/images/debian11-bugtest.qcow2"/>
  <target dev="vda" bus="virtio"/>
  <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
</disk>

Result: guest system boots normally from the virtual drive.
Note: all selectable cache modes except none and directsync seem to work.
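
For switching modes quickly from the command line, roughly this works for me, using virt-xml from the virt-manager tools (the guest name testvm and target vda are placeholders):

# with the guest shut off, set the suspect cache mode and try to boot
virt-xml testvm --edit target=vda --disk cache=none
virsh start testvm    # fails to boot for me on 6.0.2

# switching to e.g. writeback makes the same guest boot again
virt-xml testvm --edit target=vda --disk cache=writeback
virsh start testvm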

Not-working config

<disk type="file" device="disk">
  <driver name="qemu" type="qcow2" cache="none" discard="unmap"/>
  <source file="/var/lib/libvirt/images/debian11-bugtest.qcow2"/>
  <target dev="vda" bus="virtio"/>
  <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
</disk>

Result: the guest fails to boot from the virtual drive.

More details

I first brought this up in the “Small linux problem” thread, but after a closer look it seems significant enough to warrant a thread of its own (if nothing else, for future searchability).

Software versions:

uname -a:  6.0.2-76060002-generic #202210150739~1666289067~22.04~fe0ce53 SMP PREEMPT_DYNAMIC Thu O x86_64 x86_64 x86_64 GNU/Linux
libvirt-daemon: 8.0.0-1ubuntu7.3
qemu-system-x86: 1:6.2+dfsg-2ubuntu6.5

The problem is the same regardless of disk model (SATA, virtio, virtio-scsi), guest firmware (UEFI vs. BIOS), and guest OS (Debian vs. Windows 10). I did not try any backing other than qcow2 files (raw, zvol, device, …).


(bump) Has no one else tried the software and use-case combination I describe above (Linux 6.x, KVM virtualization, cache=none), with either success or failure?

I remember reading a breakdown of the exact implications of the different cache modes mentioned above, but I can’t find it now. However, judging from Red Hat’s documentation, my guess is that what the modes none and directsync have in common (and what separates them from the others) is that they bypass Linux’s page cache. I wonder if that’s the root of the issue somehow.
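
If that guess is right, the failure might be reproducible on the host without any VM at all, since cache=none and directsync (as far as I understand) make QEMU open the image with O_DIRECT. A sketch of that idea, with the path being a placeholder for wherever your images live:

# dd can exercise O_DIRECT on the same filesystem that holds the images
dd if=/dev/zero of=/var/lib/libvirt/images/direct-test bs=4096 count=256 oflag=direct
dd if=/var/lib/libvirt/images/direct-test of=/dev/null bs=4096 iflag=direct
rm /var/lib/libvirt/images/direct-test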

For now I’m mainly trying to pinpoint the scope of the bug, so that I can report it in the most relevant place. I guess I’ll start by reporting it to Pop!_OS unless we know it affects other distros.

(post deleted by author)


I did as @anon7950785 suggested and filed a bug with Pop!_OS.

Here: Some cache models for KVM virtual machines stopped working after upgrade to kernel 6.0.2 · Issue #2670 · pop-os/pop · GitHub

In the meantime I found out that the problem disappears if I store the virtual disk image on a ZFS dataset instead of an ext4 partition. This adds to my suspicion that the problem is related to bypassing the page cache, as ZFS does not use the page cache.
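
If anyone wants to try the same workaround, roughly what I did (pool, dataset, and guest names are placeholders; the guest must be shut off while the image moves):

zfs create tank/libvirt-images
mv /var/lib/libvirt/images/debian11-bugtest.qcow2 /tank/libvirt-images/
virt-xml debian11-bugtest --edit target=vda \
  --disk path=/tank/libvirt-images/debian11-bugtest.qcow2
virsh start debian11-bugtest

# note: images outside /var/lib/libvirt/images may need AppArmor/SELinux
# adjustments, depending on the distro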

Details in the bug report if anyone is interested!


Update: the bug has now been confirmed by a Debian user too. I think it is safe to assume that the problem lies further upstream (unless anyone can show an unaffected system with the combination of parameters described above).
