I don’t think these parameters have anything to do with your problem. I could find no reports online of people needing them for performance reasons. I really do think your problem lies elsewhere.
Well, I AM attempting to use them to fix performance anomalies, and the lines screwing with the payload sizes seem to be the only thing different in the logs.
If only I could upload my full dmesg logs here. Somebody else might see something obvious I’m missing.
@moderators Can someone please give @mathew2214 a boost for his trust level?
As I said, I think this message appearing is a symptom rather than the cause.
Attach it as a file; this error is from trying to inline the log.
Seconded; this will completely tank your VM’s performance even with the hardware acceleration available to it. People usually only enable this to avoid VM detection by certain anti-cheat systems.
I also do not agree at all that the PCI bus tuning is the cause of the issue you’re having. And if something is screwing with it, it could very well be the guest VM itself, as you are literally passing through hardware that it can do whatever it wants with.
and runs terribly.
How are you measuring this metric?
How are you measuring this metric?
Visual observation. There seems to be no systematic way to detect this state other than the dmesg messages mentioning the payload sizes.
What is your CPU?
Threadripper 1950x
You’re in luck; I am intimately familiar with this CPU.
You are running a CPU that has absolutely terrible UMA support. If you have not already, configure your BIOS to switch to a NUMA configuration (Non-uniform memory access - Wikipedia).
You have assigned 12 cores to your guest, but 8 live on one die and 6 live on the other. The guest performance will be terrible under certain circumstances, as the guest will improperly schedule things across the dies, causing huge memory latency spikes the guest is not expecting.
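Not commands from this thread, just a sketch of how the die/core layout can be checked from the host once NUMA mode is enabled (assumes util-linux is present; `numactl` may need installing):

```shell
# List each logical CPU with its NUMA node, socket and physical core,
# so SMT siblings (same CORE id) and die boundaries (NODE) are visible
lscpu -e=CPU,NODE,SOCKET,CORE

# Summarise the node layout and per-node memory, if numactl is available
command -v numactl >/dev/null && numactl --hardware || true
```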
Here are the dmesg logs I previously recorded, before adding any parameters to try to address the issue.
unacceptable state
fkt1.txt (82.2 KB)
working perfectly after several VM reboots
avmrb.txt (81.8 KB)
and here are the logs WITH the parameter added:
unacceptable
fkt2.txt (81.3 KB)
working perfectly
wrkngprfct2.txt (77.6 KB)
Yes, I am already NUMA tuned and everything.
I am passing 12 threads to the VM, only 6 cores though. I have been careful to make sure these are within the same NUMA node, and that every VM thread is associated with the correctly paired core. lscpu -e was very useful for setting that up.
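For other readers, pairing vCPUs to SMT siblings in libvirt looks roughly like this; the host CPU numbers below are hypothetical placeholders, and the real pairs are whatever `lscpu -e` reports as sharing a CORE id:

```xml
<vcpu placement='static'>12</vcpu>
<cputune>
  <!-- Hypothetical layout: vCPUs 0/1 pinned to host core 0's two SMT
       threads (host CPUs 0 and 16 on this imaginary machine), and so on.
       Substitute the CPU/CORE pairs shown by `lscpu -e`. -->
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='16'/>
  <vcpupin vcpu='2' cpuset='1'/>
  <vcpupin vcpu='3' cpuset='17'/>
  <!-- ...remaining vCPU pairs follow the same pattern... -->
</cputune>
```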
Excellent, this is something many people miss with the original TR CPUs.
You have provided dmesg of the guest, we need to see the host.
Here’s the host’s dmesg. This is after rebooting the VM several times to bring it to a usable state. I doubt you’ll find anything of use here… I couldn’t. I am under the impression the issue lies entirely within the VM.
host's dmesg
hstdmesg.txt (118.3 KB)
As you’re using an AMD GPU (Navi10), and I can see you obviously needed vendor-reset, you must be aware that the reset is not always 100% and can result in sporadic issues such as substandard performance in the guest on restarts.
I must ask again sorry, how are you noting the performance loss? General usage? 3d performance? Gaming?
Also you’re assigning a kernel argument that is useless…
amdgpu.gpu_recovery
will only be used if you actually load the amdgpu
kernel module, which you are not (and obviously should not)
[ 0.000000] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
Have you disabled this and assigned huge pages to the right die for the VM?
And is the VM running on the die that the GPU is actually attached to?
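For anyone following along, disabling the automatic balancing and reserving 2 MiB hugepages on one specific node can be sketched like this (run as root; node0 and the page count are assumptions, use the node the VM is pinned to and a count matching its memory):

```shell
# Turn off automatic NUMA balancing (the dmesg line quoted above shows it enabled)
sysctl -w kernel.numa_balancing=0

# Reserve 8192 x 2 MiB hugepages (16 GiB) on node 0 only; adjust both
# the count and the node number to match your VM's memory and pinning
echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
```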
As you’re using an AMD GPU (Navi10), and I can see you obviously needed vendor-reset, you must be aware that the reset is not always 100% and can result in sporadic issues such as substandard performance in the guest on restarts.
Seeing as the degraded performance is the VM’s state UNTIL multiple reboots, I don’t believe resetting has any impact here. If resetting were to blame, the first VM bootup would consistently perform as expected, and problems would only manifest after reboot. In reality, it’s the other way around. If anything, vendor-reset is fixing whatever causes it. However, I believe vendor-reset has no impact on this problem at all.
I must ask again sorry, how are you noting the performance loss? General usage? 3d performance? Gaming?
Observation. I can see that the system is performing horribly. Takes several seconds just to print text to half the screen. Even my Zilog Z80 rig can write text to the entire display faster than the VM in its degraded state.
Have you disabled this and assigned huge pages to the right die for the VM?
The VM does not use hugepages at all, only a numatune block as shown in my XML above. Attempting to use hugepages causes the host to be unable to boot.
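For readers without the XML in front of them, a numatune block of the general shape being described looks like this (node 0 is a placeholder; the real nodeset is whichever die the VM is pinned to):

```xml
<numatune>
  <!-- Keep all guest memory allocations on one die/node -->
  <memory mode='strict' nodeset='0'/>
</numatune>
```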
And is the VM running on the die that the GPU is actually attached to?
Yes, I had to physically move the GPU to a different slot to achieve this; it was a giant pain.
Also you’re assigning a kernel argument that is useless…
amdgpu.gpu_recovery
will only be used if you actually load the amdgpu
kernel module, which you are not (and obviously should not)
Noted. I will remove that parameter once this issue is fixed.
However, I believe vendor-reset has no impact on this problem at all.
Sorry, but this is simply not possible to assert. I can tell you with absolute certainty that the AMDGPU reset is not complete, and the guest drivers can push the GPU into a better state on restart for a future boot when it’s not working right. (I am the vendor-reset author.)
Attempting to use hugepages causes the host to be unable to boot.
This is very concerning and indicates a rather major fault with your system and/or configuration. There is no circumstance where using huge pages should be able to prevent system boot.
takes several seconds just to print text to half the screen.
I have seen Navi10 cause this intermittent behaviour again, due to inconsistent state/reset.
This is very concerning and indicates a rather major fault with your system and/or configuration. There is no circumstance where using huge pages should be able to prevent system boot.
Yes. I have learned my lesson about buying ASUS products, but it will be some time before I can afford a new board.
Sorry, but this is simply not possible to assert. I can tell you with absolute certainty that the AMDGPU reset is not complete, and the guest drivers can push the GPU into a better state on restart for a future boot when it’s not working right. (I am the vendor-reset author.)
I have seen Navi10 cause this intermittent behaviour again, due to inconsistent state/reset.
This is new information to me. Is there any way we could objectively rule out vendor-reset as the culprit?
Yes. I have learned my lesson about buying ASUS products, but it will be some time before I can afford a new board.
It’s got nothing to do with the motherboard vendor; they all use the AGESA provided by AMD. Usually the worst issues you get are corrupted/invalid ACPI table entries and/or power configuration. There is literally nothing the motherboard vendor can do that can break huge pages without breaking the entire system even without huge pages.
This is new information to me. Is there any way we could objectively rule out vendor-reset as the culprit?
I did not state vendor-reset
is the culprit; I stated that the Navi10 is. Swap the GPU out for any NVIDIA card, even old crap, and then try to reproduce the issue.
I did not state
vendor-reset
is the culprit; I stated that the Navi10 is. Swap the GPU out for any NVIDIA card, even old crap, and then try to reproduce the issue.
I have no other graphics cards available to test with, but since this is a VM, I could simply test it without the pass-through, using spice/qxl.
Now that I suspect the GPU to be at fault, is there any hope for a solution aside from hardware replacement?
I could simply test it without the pass-through, using spice/qxl.
Yes, this is an option also. If this doesn’t resolve it, keep going through a process of elimination, removing each pass-through device, USB devices, etc. Keep simplifying the VM config until the problem is resolved.
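For reference, the spice/qxl fallback being discussed is roughly this in the libvirt domain XML, with the GPU's hostdev entry removed (the ram/vram sizes here are just common defaults, not anything from this thread):

```xml
<graphics type='spice' autoport='yes'/>
<video>
  <model type='qxl' ram='65536' vram='65536' heads='1'/>
</video>
```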
Now that I suspect the GPU to be at fault, is there any hope for a solution aside from hardware replacement?
None; it’s a fundamental design flaw in the AMD GPU silicon that to this day has not been fixed, even though AMD have acknowledged the issue.
vendor-reset
attempts to work around the fact that AMD are incapable of implementing the PCIe standard bus reset mechanism properly and refuse to provide function level reset (FLR) to us. It is honestly disgusting that they have not fixed this.
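A quick way to see what the silicon actually advertises (the PCI address below is a made-up example; substitute your GPU's address from `lspci`):

```shell
# The DevCap "FLReset" flag in the extended capabilities shows whether the
# function offers Function Level Reset; "FLReset-" means not supported.
# Run as root so lspci can read the capability registers.
lspci -vvs 0a:00.0 | grep -i 'FLReset'
```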
Edit: Note I have no love for ASUS either; their after-sales support is horrendous, and they often do not provide updated BIOSes with critical fixes for their systems once a new product hits the shelves.