Normally guests benefit from the virtio memballoon device, which dynamically resizes their memory depending on workload. However, this stops working when vfio devices are added.
Does anyone know why vfio requires all memory to be “pinned” even if it isn’t used? Is there any way to control how much memory it pins?
It’s because PCI-E devices have direct memory access, and if you are using ballooning memory the PCI-E device can write to memory that may already be in use by something else.
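To make that concrete, here is a rough sketch of how a userspace driver like QEMU hands guest RAM to the IOMMU through vfio. The single VFIO_IOMMU_MAP_DMA call both programs the IOVA-to-host translation and pins the backing pages, which is where the “pin everything up front” behaviour comes from. The group number and sizes below are placeholders and error handling is trimmed:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open("/dev/vfio/26", O_RDWR);           /* placeholder group */

    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container); /* attach group */
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU); /* type1 = x86 IOMMU */

    /* Pretend this is the guest's RAM: 1 GiB of anonymous memory. */
    size_t ram_size = 1UL << 30;
    void *ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (__u64)(uintptr_t)ram,  /* host virtual address of guest RAM */
        .iova  = 0,                      /* guest-physical address the device sees */
        .size  = ram_size,
    };

    /* This one call pins every page in the range and installs the
     * IOVA->physical translation in the IOMMU. Nothing in the range can be
     * swapped, moved or ballooned away until it is unmapped. */
    if (ioctl(container, VFIO_IOMMU_MAP_DMA, &map) < 0)
        perror("VFIO_IOMMU_MAP_DMA");

    return 0;
}
```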
Thank you for explaining, that was my intuition as well, but I couldn’t find a definitive answer anywhere.
What I don’t understand, however, is why there is no way to limit the memory used by the device. You can list which processes use that device, so at the very least you should be able to release all the memory except what those specific processes use?
Is it wrong to say that vfio took a very conservative approach by pinning ALL available memory rather than filtering it in some way? Surely this could be optimized further?
You can limit the memory used by the VM. There may be a way to limit how much memory a device uses, but it can still use any block of memory that has been assigned to that VM, which is why ballooning doesn’t work.
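As a side note, the amount of memory a process is allowed to pin this way is bounded by its locked-memory limit (RLIMIT_MEMLOCK); I believe libvirt raises that limit to cover the whole guest RAM when a vfio hostdev is assigned. A quick way to see the limit a given process starts with is something like:

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* Pinned DMA pages are accounted against this limit, so a VM with a
     * vfio device needs it to be at least as large as the guest's RAM. */
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    printf("RLIMIT_MEMLOCK soft: %llu bytes, hard: %llu bytes\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}
```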
That’s a shame. I thought the kernel was supposed to arbitrate direct memory access. Are you saying that this has to be all or nothing? Is there no way to tell the device, via the chipset in between, to limit requests to a certain range of system memory?
Do you know if this is something that SR-IOV could solve? Does SR-IOV still use vfio or some other mechanism? Maybe that way we could avoid this RAM ballooning issue?
I haven’t used it but I would assume that SR-IOV would work fine with ballooning memory because the hardware is controlled by the hypervisor not a VM.
You can limit the range of system memory it uses by changing how much memory you give the VM.
Interesting, that would explain why Nvidia can charge so much for GRID (SR-IOV). If you are running any GPU application at scale, I can’t imagine doing it without a memballoon.
If anyone has used SR-IOV with memballoon, please let me know.
There is a vGPU unlock project for consumer cards, but the last time I checked it required a custom kernel and nvidia’s license server constantly running. Between a rock and a hard place…
Or maybe one day someone will find the missing piece needed to unlock memballoon for vfio? Is it worth holding your breath for that day?
No, it just isn’t possible. I mean you can use ballooning memory with VFIO if you really want to but it will be unstable. I’ve done it by accident in proxmox before. Ultimately though it might just be simpler to get more memory.
Thanks, that’s fine, but (and sorry for being repetitive) I still don’t understand what happened in the history of computer design that nobody foresaw the need to limit DMA devices’ access to system memory. PC designers surely didn’t just look at all these third-party devices having full access to RAM and say “that’s fine”?
Surely there must have been some attempts to let the kernel control this access, and maybe they were removed for performance or complexity reasons?
The VFIO device is limited to accessing the memory given to the VM; it doesn’t know anything about the host system. The problem is that when using ballooning memory, the virtual memory addresses used by the VM don’t necessarily match the physical addresses, because they’re mapped dynamically. A PCI device using DMA will try to access a block of memory and end up accessing the wrong block, because there’s nothing to remap the address as it’s accessing the memory directly.
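If I understand it right, for ballooning to be safe the ballooned-out guest-physical range would have to be removed from the IOMMU before the host reuses those pages. The ioctl for that exists, but as far as I know the balloon device doesn’t coordinate with vfio like this (newer QEMU apparently just inhibits the balloon when a vfio device is present). Purely as a hypothetical sketch, with `container` assumed to be an already configured vfio container fd and the range made up:

```c
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int unmap_ballooned_range(int container, unsigned long long iova,
                                 unsigned long long size)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .flags = 0,
        .iova  = iova,   /* guest-physical start of the returned range */
        .size  = size,   /* length of the returned range */
    };

    /* After this succeeds the device can no longer DMA into the range and
     * the backing pages are unpinned, so the host could reuse them safely. */
    return ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
}
```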
Sorry, but isn’t that what the IOMMU was supposed to provide, so that virtual memory in the VM is coordinated with both the host and the device? Does that mean the limitation lies in the chipset? If that’s correct, would it make sense to ask chipset manufacturers to add vfio support?
All I know is that if you use VFIO and ballooning memory, you get errors.
Sorry to keep drilling into this topic, but I saw a good lecture discussing vfio for both x86 and POWER7/8 architectures, and at 16:25 he mentions that on POWER7/8 you CAN have dynamic DMA windows (you don’t need to pin the entire memory). He also mentions that this had some issues.
That means this is a decision at the x86 architecture level, driven mostly by PCI’s DMA design. I still don’t understand, though, what the considerations were when making this decision - clearly it must have been possible to create an IOMMU without pinning the entire memory for DMA.
Actually, Alex Williamson already explained this issue in his talk (I just needed to rewatch it very carefully this time). If I understood correctly, the IOMMU lacks…
page faulting
Which makes intuitive sense to me - I take it to mean that the IOMMU can’t handle a fault for a page that isn’t present, so every page a device might DMA into has to stay resident (pinned)?
Hope that’s correct? He also mentions that it may be added in the future (and that it is already part of the PCI spec).
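For reference, I think the PCI spec feature he is referring to is ATS/PRI (Address Translation Services with the Page Request Interface, usually together with PASID), which would let a device request a page instead of requiring everything to be pinned. Here is a rough sketch for checking whether a device even advertises those capabilities; the BDF is a placeholder and reading the full config space generally needs root:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define PCI_EXT_CAP_START 0x100
#define CAP_ATS   0x000F  /* Address Translation Services */
#define CAP_PRI   0x0013  /* Page Request Interface */
#define CAP_PASID 0x001B  /* Process Address Space ID */

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:01:00.0/config"; /* placeholder BDF */
    uint8_t cfg[4096] = {0};

    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); return 1; }
    size_t n = fread(cfg, 1, sizeof(cfg), f);
    fclose(f);

    if (n <= PCI_EXT_CAP_START) {
        printf("extended config space not readable (need root?)\n");
        return 1;
    }

    /* Walk the PCIe extended capability list: each 32-bit header holds the
     * capability ID in bits 15:0 and the next offset in bits 31:20. */
    unsigned off = PCI_EXT_CAP_START;
    while (off && off + 4 <= n) {
        uint32_t hdr;
        memcpy(&hdr, cfg + off, 4);
        uint16_t id = hdr & 0xFFFF;

        if (id == CAP_ATS)   printf("ATS   at 0x%03x\n", off);
        if (id == CAP_PRI)   printf("PRI   at 0x%03x\n", off);
        if (id == CAP_PASID) printf("PASID at 0x%03x\n", off);

        if (hdr == 0xFFFFFFFF) break;  /* no device / unreadable */
        off = hdr >> 20;
    }
    return 0;
}
```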