Can your Ryzen processor pass Prime95 small FFT inside QEMU/KVM?

The test is this: pass all cores to your VM. For example, on a Ryzen 5 3600, pass all 6 cores to a single VM; on a Ryzen 9 5900X, pass all 12. Run Prime95 128K FFT fully loaded inside the VM for, say, 20 minutes or longer.
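For concreteness, here is a rough sketch of the setup I mean. The QEMU flags, disk image name, and the mprime `prime.txt` settings below are illustrative assumptions, not an exact recipe; adjust the core count and memory for your CPU:

```shell
# Sketch only: a 6-core guest for a Ryzen 5 3600 (paths/flags are examples).
qemu-system-x86_64 \
    -enable-kvm \
    -cpu host \
    -smp 6,sockets=1,cores=6,threads=1 \
    -m 8G \
    -drive file=guest.qcow2,if=virtio

# Inside a Linux guest: lock mprime's torture test to 128K FFTs
# (MinTortureFFT/MaxTortureFFT are in K), then let it run 20+ minutes.
printf 'MinTortureFFT=128\nMaxTortureFFT=128\n' >> prime.txt
./mprime -t
```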

I've only tried a few CPUs; coincidentally, none of them can pass the test in a VM. All of them pass on bare metal. Stock settings, no overclocking. I've tried different guest OSes and different Prime95 versions; the results are the same.

This issue has been puzzling me for a long time. I'm starting to wonder if it's related to silicon quality, or specific to the full range or a subset of the Ryzen line. I can't imagine the EPYC line having such an issue.

Please share your test result and thoughts.


Well, that's not a good idea!

I can't say I'm surprised it doesn't pass.


I can understand the rationale of leaving some cores to the host. Some precautionary measures are necessary in the setup when passing all cores. It has worked out 'well for me' so far.

Now that you mention it, you've brought up a valid point that explains why I couldn't find any mention of my issue online. In that sense, it's enlightening :slight_smile:

Care to elaborate a bit?

While Prime95 fails small FFTs such as 128K fully loaded in a VM, it has no problem passing large FFTs such as 7M to 8M fully loaded. Perhaps this anecdotal evidence is what inclined me to believe the issue is a Ryzen processor thing.

For the fun of it, and for the first time, I tried the test in Windows 10 on bare metal with Hyper-V enabled. Surprise! The test passed a 25-minute run without failure.

The interesting part is that I was told that once Hyper-V is enabled in Windows 10, Windows actually turns itself into a type 1 hypervisor and runs "Windows 10" as a guest VM on top of that hypervisor.

Such a setup mimics my test originally intended for KVM and QEMU. I would think AMD at least validates their Ryzen processors on Windows with Hyper-V enabled. More food for thought.


A little cheer amid the recent hustle and bustle of life.

The issue was beautifully resolved in recent days by an accidental discovery. Now my system can pass 128K FFT runs in a VM all day long!

In a nutshell, as I suspected all along (for close to three years): if your system cannot pass the test as stated in the OP, your system (and/or overclock) IS NOT stable. The instability is probably a 'non-issue' to most users, but a nagging one for people like me.

Looking forward to discussing more in a month or two, after I settle down.


It'd be helpful if you could be clear about what the actual issue was in your case. I'm guessing it's undervolt/OC related, but it's just that, a guess. :wink:


The whole point of the test is to find bad hardware :slight_smile:

Running inside a VM is perhaps not more stressful in terms of CPU load, but it does exercise operations which don't normally happen on bare metal, like VM exit handling, IOMMU/nested TLB handling, etc., which could certainly be affected if the CPU is running under extreme load and is not stable.
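You can actually quantify that extra VM-exit traffic from the host side. A diagnostic sketch, assuming a Linux host with perf's KVM support and a single running QEMU guest (the pgrep pattern is an assumption about your process name):

```shell
# Record KVM events for the guest's QEMU process over a 30-second window
# while the stress test runs inside the guest.
pid=$(pgrep -o -f qemu-system-x86_64)
perf kvm stat record -p "$pid" sleep 30

# Summarise which exit reasons (HLT, MSR access, EPT violations, ...)
# dominated during the run.
perf kvm stat report --event=vmexit
```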

It's nearly always something to do with: PSU, CPU, CPU voltage settings, overclocking, memory timings, insufficient cooling, dirty CPU/RAM socket contacts, or firmware bugs (IIRC some Apple devices had a Wi-Fi firmware that would write to RAM in places it shouldn't while in SMM mode ("ring -2"), so even with ECC you'd get random corruption…).

Passing all cores to a VM would not affect the test result on good hardware; it just means scheduling hypervisor I/O and VM exits has a little higher latency. Prime95 does practically zero I/O apart from the log messages, so it is not much affected by that.

Additional data point: I ran the stress test inside a Linux VM (QEMU+KVM) with all cores passed through on my EPYC: no errors, no ECC errors, and still going after an hour.

I still couldn't pinpoint the exact issue. The Ryzen 3000 series has been a mess from AMD, in my opinion. The AGESA must be spaghetti code. It's a joke, though luckily it became quite stable in the later stage of the product cycle.

I recall when my system was brand new, I could pass the test as stated in the OP in a Windows 10 VM. Obviously it could also pass in Windows 10 and Linux on bare metal. Some time later, I found it could no longer pass in the Windows 10 VM. And yet more time later (in recent months, after the OP was created), it could no longer pass in either Windows or Linux on bare metal. It consistently fails on one core, so the new problem is pretty obvious.

I theorised:

  1. the silicon could have degraded after three years of heavy use
  2. I might have significantly degraded the silicon quality in some of my recent experiments
  3. AMD messed up in the latest AGESA again (which IMO isn't friendly to early production batches from around the Ryzen 3000 series launch)

As previously said, I always suspected the test failure in the VM was a combination of an AGESA bug (specifically in its boost algorithm) and marginal silicon quality (of my particular sample). Alas, it's damn hard to find another user in the vastness of cyberspace to corroborate the issue.

Anyway, your guess is spot on about my accidental workaround. It's an undervolt. It cures everything, with no loss of performance. What a surprise.

Exactly. The test in the OP can sniff out 'bad' hardware. I was surprised nobody was interested in the experiment in the months after the post.

Prime95 small FFT runs out of the L3 cache. It's damn fast, efficient, and hot. VM entry/exit might slow it down a little, but not by much, judging by the heat it produces. I would say it's probably about as close as you can get to bare metal.

Overall, running stress tests in a VM actually exerts greater stress on the processor, I/O, and memory subsystems.

Congrats!

I should probably have RMA'ed my sample; then my three-year journey would have been quite different. Though then I would have gained less insight into AMD as a company… and their technologies.

Lol, I could understand testing with maybe a few cores to make sure the VM you plan on setting up would be configured correctly for your use case, but an all-core Prime95 test in a VM seems kind of daft.

If the use case is to run a VM that utilizes all cores, then what's the VM being used for in the first place?

Not OP, but there are very few downsides if you want any of the features that hypervisors can deliver:

  • Ability to live-migrate to other physical hardware for uptime.
  • Ease of deployment / provisioning / backup / snapshots / rollbacks / restores.
  • Consistency of remote management.

Read carefully

Speaking from the perspective of personal computing, the rise of VFIO/multiple VMs with lots of dedicated hardware is the daftest creation, IMHO.

You as an individual only have one pair of hands, one pair of eyes, and two halves of a brain (of which usually only one half is functioning at any given moment). What are the sound and genuine use cases for multiple VMs and multiple display outputs to begin with?

If you think along that line, then one day you'll realise that, e.g., actually passing all cores to all VMs is a brilliant idea… for personal computing.

I'm willing to bet that the quality of the VM implementation is significant too.

How do you pass all processors to a VM? I've never thought to do this, and I'm surprised the OS allows it. The kernel is going to continue running the host regardless.

Which OSes have you tried this with? Linux distros? The same as the host, or different? Windows? Which virtualization software?

Judging by the frequency of code updates, VFIO seems 'much simpler' in comparison and has been stable for a long time. KVM, on the other hand, is buggy and receives code updates in almost every single point release. Though most are edge cases that personal computing seldom runs into, I would think.

I ran into the phenomenon described in the OP very early on after setting up my new system. Back then, I suspected some of the failures were contributed by a floating-point bug in KVM around the Linux 5.2 timeframe (can't remember exactly).

Just like passing in two or three cores; instead, you pass all of them. After all, the cores are utilised on the good old time-sharing basis.

I'm running three operating systems concurrently on the same machine: Linux as the host, one macOS VM, and one Windows VM. It covers all my needs whenever I want to spend some time checking, experimenting, or confirming something in any of those environments.

When you "pass" a CPU to a VM, the hypervisor doesn't prevent that CPU from being used by the hypervisor/host OS; it can still schedule processes on it, unless you do core isolation as well. Just like you can run 100 processes at the "same time" and they'll share 4 cores, you can run a VM with "100" cores on a host with 4 cores. It's not optimal, just like running 100 processes on 4 cores isn't, but it'll work.
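The time-sharing point can be demonstrated with nothing but the shell. A minimal sketch: start twice as many pure CPU burners as the machine has cores, and they all coexist fine, each simply getting a smaller time slice:

```shell
#!/bin/sh
# Oversubscribe the CPU: twice as many busy loops as physical cores.
cores=$(nproc)
wanted=$((cores * 2))

i=0
while [ "$i" -lt "$wanted" ]; do
    ( while :; do :; done ) &     # pure CPU burner, no I/O
    i=$((i + 1))
done

sleep 1
started=$(jobs -p | wc -l)
echo "cores=$cores burners=$started"   # all $wanted jobs run, time-sliced

kill $(jobs -p)                        # clean up the burners
wait 2>/dev/null
```

Excess vCPU threads behave the same way as these burners from the host scheduler's point of view.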

Right. Like I said, I've never thought to try it before. It does make a bit of a mockery of the OP's idea, since you're not allocating all the CPU resources to the VM that one might think.

Conceptually, how VMs utilise and share the host's CPU cores is very simple. Each virtual core (inside the VM) is simply a running thread on the host. So two virtual cores will be two threads (ideally always) running on the same physical core on the host, for a microarchitecture supporting SMT.

CPU isolation means certain physical cores will be used by a single VM and nothing else. In a sense, it's equivalent to passing dedicated HW resources to a VM, with the pros and cons of doing so.
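Under the hood, the "nothing else runs there" part is plain CPU affinity. A minimal sketch with `taskset` shows the mechanism (libvirt's `<vcpupin>` element and the kernel's `isolcpus=` boot parameter are the VM-scale versions of the same idea):

```shell
# Pin a command to physical core 0 only; the scheduler can no longer
# migrate it to any other core.
taskset -c 0 grep Cpus_allowed_list /proc/self/status
# output: Cpus_allowed_list:<tab>0
```

Isolation for a VM is this applied per vCPU thread, plus keeping host tasks off those cores, which is where the cost comes from: the dedicated cores sit idle whenever the VM does.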

From a personal computing perspective, passing dedicated HW resources (including via CPU isolation) to a single VM is an expensive and not-so-efficient way of utilizing the hardware. There will be situations where you need dedicated HW; most of the time, though, you don't.

The main role of QEMU these days is exactly this: sharing host HW resources among multiple VMs.

With that said, while it's conceptually simple, preserving CPU state for multiple VMs is a very complicated and expensive task; it's one of the reasons why VM performance usually takes about a 5% loss compared to bare metal. Nevertheless, the benefits of VMs outweigh the loss, even for personal computing, IMO.