Help identifying problems on Socket 1 (2011v4 Supermicro platform)

Hi folks

So, for the love of all things, I decided to scrape together some old workstation hardware, buy some missing parts, and build myself a two-headed gaming rig for two simultaneously happy people.

And many things I did.
And a funky solution have I.
And a very riddleculous problem, I have never seen before…

I ask for your… intuition,
whatever the reason could be.
I have no clue anymore.
Thank you very much indeed!

Abstract:
Dual Socket. Socket 2 runs splendidly, but Socket 1 behaves like a donkey in constant protest.

Problem Description:
While anything run on Socket 2 performs well, Socket 1 shows extremely heavy load for even the simplest operations, far higher than the same workload on Socket 2. Heavy stutter. Gaming is impossible. The Windows GUI is jumpy and unresponsive. Socket 2 is good, Socket 1 is bad. But I find no errors.

  • Example Socket 1: maxing out a 200 Mbit/s download from Steam takes 12 saturated threads (at around 3 GHz).
  • Example Socket 2: does the same at 900+ Mbit/s using about 6 threads.

Concept:

  • dual Socket Xeon platform
  • KVM: libvirt/Qemu, on Ubuntu
  • 128G RAM on host
  • two Win10 guests for gaming, each equipped with a physical GPU, 24G RAM each
  • using vfio-pci (vfio-pci driver pre-loaded in early boot, plus vendor-reset)
  • using strict NUMA separation (one VM bound to each socket, no RAM bleeding across nodes; a quick verification sketch follows after this list)
  • keep enough CPU and RAM resources for the host
  • keep ZFS's memory appetite low
  • run the guests and their games on a separate SSD pool (ZFS RAID0 equivalent) via qcow2, and hand the game storage out to each VM as a Samba share
  • Connect using Parsec, mad king of the hill :laughing:.
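
For reference, a minimal sketch of how I verify the separation from the host. The guest name win10-a is just a placeholder, and "node 0 = Socket 1" is an assumption to confirm with numactl --hardware first:

```
lscpu -e=CPU,NODE,SOCKET,CORE    # which logical CPUs belong to which socket/NUMA node
virsh vcpupin win10-a            # current vCPU-to-host-CPU pinning of the guest
virsh emulatorpin win10-a        # where QEMU's emulator threads may run
virsh numatune win10-a           # memory mode and nodeset the guest RAM is restricted to
```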

Things I tried:

  • swap GPUs (because of NUMA topology, though it shouldn't matter)
  • swap CPUs
  • swap VMs (test both VMs on both sockets, keeping the VM definition and changing only the NUMA location)
  • run the same tests on qcow2 images only (without Samba)
  • no error in syslog, no error in dmesg (both clean, IMHO; an extra ECC/MCE counter check is sketched after this list)
  • test different RAM
  • update Bios, reset Bios
  • disabled all CVE mitigations
  • strict CPU thread pinning, lower core count for VM, more resources for host
  • numastat says fine (RAM on same node)
  • deactivated numad.service
  • run the Supermicro Self-Diagnosis (via IPMI). Checks for each component. All PASS.
  • Run the “Intel Processor Diagnostic Tool” inside the VM, all tests and flags PASS.
  • have a glance at the pins of the socket (will do again)
  • try a 10G fiber adapter via PCIe (SR-IOV, to rule out the Ethernet controller)
  • as for the BIOS settings: I must admit there are so many, but the basics should be fine, and I should have ruled out the major ones via the other tests…
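
The ECC/MCE double-check mentioned above, as a rough sketch (it assumes the EDAC modules are loaded, which they usually are on this kind of platform):

```
journalctl -k | grep -iE 'mce|machine check|edac'              # kernel log, beyond a quick dmesg glance
grep . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null    # correctable ECC errors per memory controller
grep . /sys/devices/system/edac/mc/mc*/ue_count 2>/dev/null    # uncorrectable ECC errors
```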

Suspicion
Something physical on the board. But I lack the experience to pinpoint internals versus behaviour.
I will eventually recheck the pins on the socket; at first check I didn't see anything irregular, and anyway I would expect the Intel CPU test (in the VM) to fail… (?). I will also run the test on the host itself. I thought it was more telling in the VM, but will do.
Corrosion on the CPU pins? This board came from eBay and looked never used.

Hardware Setup:
Board: Supermicro X10DRI
CPU: 2x Intel(R) Xeon(R) CPU E5-2697A v4
Bios: latest (tried several)
RAM: 4x 32GB M386A4G40DM0-CPB (LRDIMM, ECC) (equivalent recommended by Supermicro)
OS disk: an Intel SSD. Ubuntu 22.04 LTS HWE (kernel 6.2)
GPUs: 2x AMD WX9100 plus working vendor-reset module
Storage including VMs: 2x Crucial P3 NVMe. The ZFS pool has ARC caching set to metadata only.
Resizable BAR: forced, using the new vfio method in kernel 6.2 (also tested without)
Network: SR-IOV using the built-in 1G controller
QEMU: pc-q35-7.2, OVMF/UEFI, libvirt 9.0.0 / QEMU 7.2.0
Guests: Windows 10 Pro
Gaming storage shared via Samba on an internal network (virtio net)

So…
This seems very esoteric to me, so I ask:
does anybody have a clue regarding the odd degradation?

Thank you very much indeed.

Is it actually hitting those speeds? Or is that what it should run at and it's throttling, maybe VRM related throttling, and the load is just jumping around between threads so more cores seem active?
Engineering sample CPU mismatch? "Limit CPUID" or whatever the exact setting is called. That one is a shot-in-the-dark guess.
All drives (unless NVME or add-in HBA) will be connected to 1st CPU, might need to account for that in some way.
That's all I've got for now. Good luck.

Is it actually hitting those speeds? Or is that what it should run at and it’s throttling

Typically the CPU is at 1400% load but delivers maybe 100% worth of work.
It is as if the CPU has either degraded to not using some CPU extensions, or in between those 1400% there are a lot of stalls (cache, RAM?) that make it "restart" the current task, causing a peak (just my gut feeling).

maybe VRM related throttling

Do you know a handle for this? Some keywords to look out for in the BIOS, or even some endpoint in Linux itself to measure?
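
In the meantime, a small sketch of what I plan to read on the Linux side. It isn't VRM telemetry directly, but thermal/power throttling leaves traces here (turbostat comes with linux-tools):

```
grep . /sys/devices/system/cpu/cpu*/thermal_throttle/*_count   # throttle event counters per core/package
sudo turbostat --interval 2                                     # live per-core clocks, C-states and temperatures
```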

Engineering sample CPU mismatch

No ES. These seem fully retail to me. At least there's no reason to expect otherwise.

All drives (unless NVME or add-in HBA) will be connected to 1st CPU, might need to account for that in some way.

The work is done in the VMs, which reside fully on NVMe.
The OS SSD is SATA, but I don't think it sees much I/O once the system is booted.
I will check swap, but given the huge amount of RAM headroom, I'd guess not.
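
To be thorough I can also check which NUMA node the NVMe drives (and the GPUs) actually hang off. A quick sketch, where the PCI address is only an example to be replaced with the real ones from lspci:

```
for d in /sys/class/nvme/nvme*/device; do echo "$d -> node $(cat "$d"/numa_node)"; done
lspci -D | grep -iE 'nvme|vga|ethernet'           # list PCI addresses of the interesting devices
cat /sys/bus/pci/devices/0000:03:00.0/numa_node   # example address; -1 means no node affinity reported
```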

Do you know a tool to benchmark the CPU but limited to one socket?
I want to compare the two on the host, but haven't found a tool yet that can do core/socket selection.
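
What I might try in the meantime: wrap a synthetic benchmark in numactl, one node at a time. A sketch, assuming node 0 is Socket 1 (to be confirmed with numactl --hardware first):

```
sudo apt install numactl stress-ng
numactl --cpunodebind=0 --membind=0 stress-ng --cpu 16 --metrics-brief --timeout 60s
numactl --cpunodebind=1 --membind=1 stress-ng --cpu 16 --metrics-brief --timeout 60s
# compare the bogo-ops/s figures between the two runs
```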

Thanks, Regards

Which RAM slots do you have the RAM in? If you had the 4 sticks on only 1 CPU it could explain the problems.

if you had the 4 sticks on only 1 CPU it could explain the problems.

Good point, thanks.
I went by the naming convention in the manual, since there was no explicit population matrix as in other cases. They are certainly 2+2 per socket, and numastat showed two groups and no misses, but…
I will double-check!
Sometimes the order matters.
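
For the double-check, roughly what I intend to look at (the slot names come straight from the SMBIOS tables, so they should match the manual's naming):

```
numactl --hardware                                   # each node should report ~64G if the split really is 2+2
sudo dmidecode -t memory | grep -E 'Locator|Size'    # which physical slots are populated, and with what size
```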

You deal with that with more airflow. Well, you could also disable turbo and such.

I can’t remember a name for anything NUMA aware.
There might be a NUMA setting in the BIOS; sounds like you'd want to make sure that's enabled. If you mentioned that being set, I forgot.

VRM, if I understand correctly, has to do with voltage input. IPMI said green, but I will check whether I can get more info.
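
A rough sketch of what I'll try to pull from the BMC (needs ipmitool; the exact sensor names vary by board, so the grep terms are guesses):

```
sudo ipmitool sdr type Voltage                     # all voltage sensors the BMC exposes
sudo ipmitool sensor | grep -iE '12v|vcore|vdimm'  # quick filter for the rails of interest
```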

I can’t remember a name for anything NUMA aware

It could also be core-aware. I can decipher cores by socket from lscpu | grep NUMA.
Yes, NUMA is activated.
Will research more for a tool. stress-ng has a ton of options, but I need a score to compare.

One thing I didn't cross-check is that I left all CPU power management in "Performance Mode" (BIOS). It could interfere with the OS doing the handling as well.
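
Before touching the BIOS, a quick sketch of what the OS side can already tell me (cpupower is part of linux-tools):

```
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c   # governor per CPU, summarised
cpupower frequency-info                                                       # driver, limits, current policy
watch -n1 "grep 'cpu MHz' /proc/cpuinfo"                                      # actual clocks while Socket 1 is loaded
```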