The test is like this: pass all cores into your VM. For example, on a Ryzen 5 3600, pass all 6 cores to a single VM; on a Ryzen 9 5900X, pass all 12 cores to a single VM. Then run Prime95 128K FFT fully loaded inside the VM for, say, 20 minutes or longer.
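Something like this is what I mean, as a minimal sketch with plain QEMU/KVM (the guest image name guest.qcow2 is a placeholder, and the prime.txt option names are the ones I recall from Prime95's undoc.txt, so double-check them against your version):

    # On the host: hand every core/thread of a 6c/12t Ryzen 5 3600 to one guest.
    qemu-system-x86_64 \
      -enable-kvm \
      -machine q35 \
      -cpu host \
      -smp 12,sockets=1,cores=6,threads=2 \
      -m 16G \
      -drive file=guest.qcow2,if=virtio \
      -nographic

    # Inside the guest: pin the torture test to 128K in-place FFTs and start it.
    printf '%s\n' 'MinTortureFFT=128' 'MaxTortureFFT=128' 'TortureMem=0' >> prime.txt
    ./mprime -t    # -t runs the torture test on all workers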
I have only tried this on a few CPUs and, coincidentally, none of them can pass the test in a VM, while all of them pass on bare metal. Stock settings, no overclocking. I tried different guest OSes and different Prime95 versions; the results are the same.
This issue has been puzzling me for a long time. I am starting to wonder if it's related to silicon quality, or specific to the full range or a subset of the Ryzen line of processors. I can't imagine the EPYC line would have such an issue.
I can understand the rationale of leaving some cores to the host. Some precautionary measures are necessary when setting things up to pass all cores. It has worked out "well for me" so far.
Now that you mention it, you bring up a valid point that explains why I couldn't find any mention of my issue online. In that sense, it's enlightening.
Care to elaborate a bit?
While Prime95 fails small FFTs such as 128K fully loaded in a VM, it has no problem passing large FFTs such as 7M to 8M fully loaded. Perhaps this anecdotal evidence is what inclined me to believe the issue is a Ryzen processor thing.
For the fun of it, and for the first time, I tried the test in Windows 10 on bare metal with Hyper-V enabled. Surprise! The test passed a 25-minute run without failure.
The interesting part of it is that I was once told that when Hyper-V is enabled in Windows 10, Windows actually turns itself into a type 1 hypervisor and runs "Windows 10" as a guest VM on top of that hypervisor.
Such a setup mimics the test I originally intended for KVM and QEMU. I would think AMD at least validates their Ryzen processors on Windows with Hyper-V enabled. More food for thought.
A little cheer amid the recent hustle and bustle of life.
The issue was beautifully resolved in recent days by an accidental discovery. Now my system can pass 128K FFT runs all day long in a VM!
In a nutshell, as I suspected all along (for close to three years): if your system cannot pass the test as stated in the OP, your system (and/or overclock) IS NOT stable. The instability is probably a "non-issue" to most users, but a bugging one to people like me.
I look forward to discussing more in a month or two, after I settle down.
It'd be helpful if you could be clear about what the actual issue was in your case. I'm guessing it's undervolt/OC related, but it's just that, a guess.
The whole point of the test is to find bad hardware
Running inside a VM is perhaps not more stressful in terms of CPU load, but it does exercise operations which don't normally happen when running on bare metal, like VM exit handling, IOMMU/nested TLB handling, etc., which could certainly be affected if the CPU is running under extreme load and is not stable.
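If you want to see those extra paths actually being exercised while the torture test runs, a rough sketch (assuming perf with KVM event support on the host, and a single qemu-system-x86_64 process) would be:

    # Record KVM exit events for the running QEMU process for ~30 seconds,
    # then summarise which exit reasons dominate under load.
    perf kvm stat record -p "$(pgrep -f qemu-system-x86_64)" sleep 30
    perf kvm stat report --event=vmexit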
It's nearly always something to do with: PSU, CPU, CPU voltage settings, overclocking, memory timings, not enough cooling, dirty CPU/RAM socket contacts, or firmware bugs (IIRC some Apple devices had a Wi-Fi firmware that'd write to RAM in places it shouldn't while in SMM mode ("ring -2"), so even with ECC you'd get random corruption…).
Passing all cores to a VM should not affect the test result for good hardware; it just means scheduling hypervisor IO/VM exits has a little higher latency. Prime95 does practically zero IO apart from the log messages, so it is not affected much by that.
Additional data point: I ran the stress test inside a Linux VM (QEMU+KVM) with all cores passed through on my EPYC. No errors, no ECC errors, and it is still going after an hour.
I still couldn't pinpoint the exact issue. The Ryzen 3000 series has been a mess from AMD, in my opinion. The AGESA must be spaghetti code. It's a joke, and luckily it became quite stable in the later stages of the product cycle.
I recall that when my system was brand new, it could pass the test as in the OP in a Windows 10 VM. Obviously it could also pass in Windows 10 and Linux on bare metal. Some time later I found it could no longer pass in the Windows 10 VM, and yet some more time later (in recent months, after the OP was created), it could no longer pass on either Windows or Linux bare metal. It consistently fails on one core, so the new problem is pretty obvious.
I theorised:
the silicon could have degraded after three years of heavy use
I might have significantly degraded the silicon quality in some of my recent experiments
AMD messed up the latest AGESA again (which IMO isn't friendly to early production batches from around the Ryzen 3000 series launch day)
As previously said, I always suspected the test failure in the VM was a combination of an AGESA bug (specifically in its boost algorithm) and marginal silicon quality (of my particular sample). Alas, it's damn hard to find another user in the giant cyberspace to corroborate the issue.
Anyway, your guess is spot on about my accidental workaround. So it's an undervolt. It cures everything, with no loss of performance. What a surprise.
Exactly. The test in the OP can sniff out "bad" hardware. I was surprised nobody was interested in the experiment in the months after the post.
Prime95 small FFT runs out of the L3 cache. It's damn fast, efficient and hot. VM entry/exit might slow it down a little, but not by much, judging by the heat it produces. I would assert it is probably as close as it gets to bare metal.
Overall, running stress tests in a VM actually exerts greater stress on the processor, I/O and memory subsystems.
Congrats!
I should probably have RMA'ed my sample; then my three-year journey would have been quite different. Though then I would have gained less insight into AMD as a company… and their technologies.
Lol, I could understand testing maybe a few cores to make sure the VM you plan on setting up would be configured correctly for your use case, but an all-core Prime95 test in a VM seems kinda daft.
If the use case is to run a VM that utilizes all cores, then what is the VM being used for in the first place?
Speaking from the perspective of personal computing, the rise of VFIO/multiple VMs with lots of dedicated hardware is the most daft creation IMHO.
You as an individual only have one pair of hands, one pair of eyes and two halves of a brain (of which mostly only one half is functioning at any given moment). What are the sound and genuine use cases for multiple VMs and multiple display outputs to begin with?
If you think along that line, then one day you'll realise that, for example, actually passing all cores to all VMs is a brilliant idea… for personal computing.
I'm willing to bet that the quality of the VM implementation is significant too.
How do you pass all processors to a VM? I've never thought to do this, and I'm surprised the OS allows it. The kernel is going to continue running the host regardless.
Which OSes have you tried this with? Linux distros? The same as or different from the host? Windows? Which virtualization software?
Judging by the frequency of code updates, VFIO in comparison seems "much simpler" and has been stable for a long time. KVM, on the other hand, is buggy and receives code updates in almost every single point release. Though most of those are edge cases that personal computing seldom runs into, I would think.
I ran into the phenomenon described in the OP very early on after setting up my new system. Back then I suspected some failures were contributed to by a floating-point bug in KVM around the Linux 5.2 timeframe (I can't remember exactly).
Just like passing in two or three cores; instead, you pass all of them. After all, the cores are utilised on the good old time-sharing basis.
I'm running three operating systems concurrently on the same machine: Linux as the host, one macOS VM and one Windows VM. It covers all my needs whenever I want to spend some time to check, experiment, confirm or do whatever in any of those environments.
When you "pass" a CPU to a VM, the hypervisor doesn't prevent that CPU being used by the hypervisor/host OS; it can still schedule processes to run on it, unless you do core isolation as well. Just like you can run 100 processes at the "same time" and they'll share 4 cores, you can run a VM with "100" cores on a host with 4 cores. It's not optimal, just like running 100 processes on 4 cores isn't, but it'll work.
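To make that concrete, here is a rough sketch (the 4-core host, the vCPU count and the image name are just examples) of a deliberately overcommitted guest, and of checking that its threads are free to run on any host CPU by default:

    # 16 vCPUs on a 4-core host: each vCPU is just a host thread,
    # so the guest boots and runs, only slower under full load.
    qemu-system-x86_64 -enable-kvm -cpu host -smp 16 -m 4G \
      -drive file=guest.qcow2,if=virtio -nographic

    # By default the QEMU process (and its vCPU threads) may run on any host CPU:
    taskset -cp "$(pgrep -f qemu-system-x86_64 | head -n1)"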
Right. Like I said, I've never thought to try it before. It does make a bit of a mockery of the OP's idea, since you're not allocating all the CPU resources to the VM that one might think.
Conceptually, how VMs utilise/share CPU cores on the host is very simple. Each virtual core (inside the VM) is simply a running thread on the host. So two virtual cores will be two threads (ideally always) running on the same physical core on the host, for a microarchitecture supporting SMT.
CPU isolation means certain physical cores will be used by a single VM and nothing else. In a sense, it's equivalent to passing dedicated HW resources to a VM, hence the pros and cons of doing so.
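A minimal sketch of what that looks like in practice, assuming a libvirt domain named win10 and example host core numbers (the name and numbers are placeholders only):

    # With libvirt: pin each vCPU to its own host core and keep the
    # emulator/IO threads elsewhere, so those cores serve only the guest.
    virsh vcpupin win10 0 2
    virsh vcpupin win10 1 3
    virsh emulatorpin win10 0-1

    # With plain QEMU: the vCPU threads are visible (and pinnable) like any
    # other host threads; QEMU names them "CPU n/KVM".
    ps -T -p "$(pgrep -f qemu-system-x86_64 | head -n1)" -o spid,comm | grep CPU
    taskset -cp 2 "$VCPU_TID"   # VCPU_TID = a thread id taken from the ps output above

Full isolation also means keeping host tasks off those cores (for example via the isolcpus= kernel parameter or cpusets), otherwise the host scheduler can still use them, as mentioned above.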
From a personal computing perspective, passing dedicated HW resources (including CPU isolation) to a single VM is an expensive and not so efficient way of utilizing the hardware. There will be situations where you need dedicated HW; most of the time, though, you don't have such a need.
The main role of QEMU these days is exactly this: sharing host HW resources among multiple VMs.
With that said, while it is conceptually simple, preserving CPU state for multiple VMs is a very complicated and expensive task. That is one of the reasons why VM performance usually takes a roughly 5% loss compared to bare metal. Nevertheless, the benefit of VMs outweighs the loss, even for personal computing, IMO.