The test is like this: pass all cores into your VM. For example, on a Ryzen 5 3600, pass all 6 cores to a single VM; on a Ryzen 9 5900X, pass all 12 cores to a single VM. Then run Prime95 128K FFT fully loaded inside the VM for, say, 20 minutes or longer.
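Something like this is what I mean, as a minimal sketch with plain QEMU/KVM (the guest image name guest.qcow2 is a placeholder, and the prime.txt option names are the ones I recall from Prime95's undoc.txt, so double-check them against your version):

    # On the host: hand every core/thread of a 6c/12t Ryzen 5 3600 to one guest.
    qemu-system-x86_64 \
      -enable-kvm \
      -machine q35 \
      -cpu host \
      -smp 12,sockets=1,cores=6,threads=2 \
      -m 16G \
      -drive file=guest.qcow2,if=virtio \
      -nographic

    # Inside the guest: pin the torture test to 128K in-place FFTs and start it.
    printf '%s\n' 'MinTortureFFT=128' 'MaxTortureFFT=128' 'TortureMem=0' >> prime.txt
    ./mprime -t    # -t runs the torture test on all workers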
I have only tried this on a few CPUs and, coincidentally, none of them can pass the test in a VM, while all of them pass on bare metal. Stock settings, no overclocking. I tried different guest OSes and different Prime95 versions; the results are the same.
This issue has been puzzling me for a long time. I am starting to wonder if it's related to silicon quality, or specific to the full range or a subset of the Ryzen line of processors. I can't imagine the EPYC line would have such an issue.
I can understand the rationale of leaving some cores to the host. Some precautionary measures are necessary when setting things up to pass all cores. It has worked out "well for me" so far.
Now that you mention it, you bring up a valid point that explains why I couldn't find any mention of my issue online. In that sense, it's enlightening.
Care to elaborate a bit?
While Prime95 fails small FFTs such as 128K fully loaded in a VM, it has no problem passing large FFTs such as 7M to 8M fully loaded. Perhaps this anecdotal evidence is what inclined me to believe the issue is a Ryzen processor thing.
For the fun of it, and for the first time, I tried the test in Windows 10 on bare metal with Hyper-V enabled. Surprise! The test passed a 25-minute run without failure.
The interesting part of it is that I was once told that when Hyper-V is enabled in Windows 10, Windows actually turns itself into a type 1 hypervisor and runs "Windows 10" as a guest VM on top of that hypervisor.
Such a setup mimics the test I originally intended for KVM and QEMU. I would think AMD at least validates their Ryzen processors on Windows with Hyper-V enabled. More food for thought.
A little cheer amid the recent hustle and bustle of life.
The issue was beautifully resolved in recent days by an accidental discovery. Now my system can pass 128K FFT runs all day long in a VM!
In a nutshell, as I suspected all along (for close to three years): if your system cannot pass the test as stated in the OP, your system (and/or overclock) IS NOT stable. The instability is probably a "non-issue" to most users, but a bugging one to people like me.
I look forward to discussing more in a month or two, after I settle down.
It'd be helpful if you could be clear about what the actual issue was in your case. I'm guessing it's undervolt/OC related, but it's just that, a guess.
The whole point of the test is to find bad hardware
Running inside a VM is perhaps not more stressful in terms of CPU load, but it does exercise operations which don't normally happen when running on bare metal, like VM exit handling, IOMMU/nested TLB handling, etc., which could certainly be affected if the CPU is running under extreme load and is not stable.
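If you want to see those extra paths actually being exercised while the torture test runs, a rough sketch (assuming perf with KVM event support on the host, and a single qemu-system-x86_64 process) would be:

    # Record KVM exit events for the running QEMU process for ~30 seconds,
    # then summarise which exit reasons dominate under load.
    perf kvm stat record -p "$(pgrep -f qemu-system-x86_64)" sleep 30
    perf kvm stat report --event=vmexit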
It's nearly always something to do with: PSU, CPU, CPU voltage settings, overclocking, memory timings, not enough cooling, dirty CPU/RAM socket contacts, or firmware bugs (IIRC some Apple devices had a Wi-Fi firmware that'd write to RAM in places it shouldn't while in SMM mode ("ring -2"), so even with ECC you'd get random corruption…).
Passing all cores to a VM should not affect the test result for good hardware; it just means scheduling hypervisor IO/VM exits has a little higher latency. Prime95 does practically zero IO apart from the log messages, so it is not affected much by that.
Additional data point: I ran the stress test inside a Linux VM (QEMU+KVM) with all cores passed through on my EPYC. No errors, no ECC errors, and it is still going after an hour.
I still couldn't pinpoint the exact issue. The Ryzen 3000 series has been a mess from AMD, in my opinion. The AGESA must be spaghetti code. It's a joke, and luckily it became quite stable in the later stages of the product cycle.
I recall that when my system was brand new, it could pass the test as in the OP in a Windows 10 VM. Obviously it could also pass in Windows 10 and Linux on bare metal. Some time later I found it could no longer pass in the Windows 10 VM, and yet some more time later (in recent months, after the OP was created), it could no longer pass on either Windows or Linux bare metal. It consistently fails on one core, so the new problem is pretty obvious.
I theorised:
the silicon could have degraded after three years of heavy use
I might have significantly degraded the silicon quality in some of my recent experiments
AMD messed up the latest AGESA again (which IMO isn't friendly to early production batches from around the Ryzen 3000 series launch day)
As previously said, I always suspected the test failure in the VM was a combination of an AGESA bug (specifically in its boost algorithm) and marginal silicon quality (of my particular sample). Alas, it's damn hard to find another user in the giant cyberspace to corroborate the issue.
Anyway, your guess is spot on about my accidental workaround. So it's an undervolt. It cures everything, with no loss of performance. What a surprise.
Exactly. The test in the OP can sniff out "bad" hardware. I was surprised nobody was interested in the experiment in the months after the post.
Prime95 small FFT runs out of the L3 cache. It's damn fast, efficient and hot. VM entry/exit might slow it down a little, but not by much, judging by the heat it produces. I would assert it is probably as close as it gets to bare metal.
Overall, running stress tests in a VM actually exerts greater stress on the processor, I/O and memory subsystems.
Congrats!
I should probably have RMA'ed my sample; then my three-year journey would have been quite different. Though then I would have gained less insight into AMD as a company… and their technologies.
Lol, I could understand testing maybe a few cores to make sure the VM you plan on setting up would be configured correctly for your use case, but an all-core Prime95 test in a VM seems kinda daft.
If the use case is to run a VM that utilizes all cores, then what is the VM being used for in the first place?
Speaking from the perspective of personal computing, the rise of VFIO/multiple VMs with lots of dedicated hardware is the most daft creation IMHO.
You as an individual only have one pair of hands, one pair of eyes and two halves of a brain (of which mostly only one half is functioning at any given moment). What are the sound and genuine use cases for multiple VMs and multiple display outputs to begin with?
If you think along that line, then one day you'll realise that, for example, actually passing all cores to all VMs is a brilliant idea… for personal computing.
I'm willing to bet that the quality of the VM implementation is significant too.
How do you pass all processors to a VM? I've never thought to do this, and I'm surprised the OS allows it. The kernel is going to continue running the host regardless.
Which OSes have you tried this with? Linux distros? The same as or different from the host? Windows? Which virtualization software?
Judging by the frequency of code updates, VFIO in comparison seems "much simpler" and has been stable for a long time. KVM, on the other hand, is buggy and receives code updates in almost every single point release. Though most of those are edge cases that personal computing seldom runs into, I would think.
I ran into the phenomenon described in the OP very early on after setting up my new system. Back then I suspected some failures were contributed to by a floating-point bug in KVM around the Linux 5.2 timeframe (I can't remember exactly).
Just like passing in two or three cores; instead, you pass all of them. After all, the cores are utilised on the good old time-sharing basis.
I'm running three operating systems concurrently on the same machine: Linux as the host, one macOS VM and one Windows VM. It covers all my needs whenever I want to spend some time to check, experiment, confirm or do whatever in any of those environments.
When you "pass" a CPU to a VM, the hypervisor doesn't prevent that CPU being used by the hypervisor/host OS; it can still schedule processes to run on it, unless you do core isolation as well. Just like you can run 100 processes at the "same time" and they'll share 4 cores, you can run a VM with "100" cores on a host with 4 cores. It's not optimal, just like running 100 processes on 4 cores isn't, but it'll work.
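To make that concrete, here is a rough sketch (the 4-core host, the vCPU count and the image name are just examples) of a deliberately overcommitted guest, and of checking that its threads are free to run on any host CPU by default:

    # 16 vCPUs on a 4-core host: each vCPU is just a host thread,
    # so the guest boots and runs, only slower under full load.
    qemu-system-x86_64 -enable-kvm -cpu host -smp 16 -m 4G \
      -drive file=guest.qcow2,if=virtio -nographic

    # By default the QEMU process (and its vCPU threads) may run on any host CPU:
    taskset -cp "$(pgrep -f qemu-system-x86_64 | head -n1)"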
Right. Like I said, I've never thought to try it before. It does make a bit of a mockery of the OP's idea, since you're not allocating all the CPU resources to the VM that one might think.
Conceptually, how VMs utilise/share CPU cores on the host is very simple. Each virtual core (inside the VM) is simply a running thread on the host. So two virtual cores will be two threads (ideally always) running on the same physical core on the host, for a microarchitecture supporting SMT.
CPU isolation means certain physical cores will be used by a single VM and nothing else. In a sense, it's equivalent to passing dedicated HW resources to a VM, hence the pros and cons of doing so.
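A minimal sketch of what that looks like in practice, assuming a libvirt domain named win10 and example host core numbers (the name and numbers are placeholders only):

    # With libvirt: pin each vCPU to its own host core and keep the
    # emulator/IO threads elsewhere, so those cores serve only the guest.
    virsh vcpupin win10 0 2
    virsh vcpupin win10 1 3
    virsh emulatorpin win10 0-1

    # With plain QEMU: the vCPU threads are visible (and pinnable) like any
    # other host threads; QEMU names them "CPU n/KVM".
    ps -T -p "$(pgrep -f qemu-system-x86_64 | head -n1)" -o spid,comm | grep CPU
    taskset -cp 2 "$VCPU_TID"   # VCPU_TID = a thread id taken from the ps output above

Full isolation also means keeping host tasks off those cores (for example via the isolcpus= kernel parameter or cpusets), otherwise the host scheduler can still use them, as mentioned above.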
From a personal computing perspective, passing dedicated HW resources (including CPU isolation) to a single VM is an expensive and not so efficient way of utilizing the hardware. There will be situations where you need dedicated HW; most of the time, though, you don't have such a need.
The main role of QEMU these days is exactly this: sharing host HW resources among multiple VMs.
With that said, while it is conceptually simple, preserving CPU state for multiple VMs is a very complicated and expensive task. That is one of the reasons why VM performance usually takes a roughly 5% loss compared to bare metal. Nevertheless, the benefit of VMs outweighs the loss, even for personal computing, IMO.