Hi folks
So, for the love of all things, I decided to scrape together some old workstation hardware, buy the missing parts, and build myself a two-headed gaming rig for two simultaneously happy people.
And many things I did.
And a funky solution have I.
And a very riddleculous problem, I have never seen before…
I ask for your… intuition,
whatever the reason could be.
I have no clue anymore.
Thank you very much indeed!
Abstract:
Dual Socket. Socket 2 runs splendidly, but Socket 1 behaves like a donkey in constant protest.
Problem Description:
While anything run on Socket 2 performs well, Socket 1 shows extremely heavy load for even the simplest of operations, far heavier than the same work done on Socket 2. Heavy stutter. Gaming is impossible. The Windows GUI is jumpy and non-responsive. Socket 2 is good, Socket 1 is bad. But I find no errors.
- Example, Socket 1: maxing out a 200 Mbit download from Steam takes 12 saturated threads (at around 3 GHz).
- Example, Socket 2: does the same at 900 Mbit+ using about 6 threads.
Concept:
- dual-socket Xeon platform
- KVM: libvirt/Qemu, on Ubuntu
- 128 GB RAM on the host
- two Win10 guests for gaming, each equipped with a physical GPU and 24 GB RAM
- using vfio-pci (vfio-pci driver pre-loaded in early boot, plus vendor-reset)
- strict NUMA separation (one VM bound to each socket, no RAM bleeding across nodes)
- keep enough CPU and RAM reserved for the host
- keep ZFS's memory hunger low
- run the guests and their games on a separate SSD (a ZFS RAID0 equivalent) via qcow2, but export the game storage to each VM as a Samba share
- connect using Parsec, mad king of the hill.
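The strict NUMA separation is the part I keep re-verifying. A minimal sketch of how I check that a guest's memory really stayed on one node — the domain name and the `numastat -p` figures below are made-up stand-ins for a live run (on the real host you'd feed in `numastat -p "$(pgrep -f <your-domain-name>)"`):

```shell
# Hypothetical `numastat -p <qemu-pid>` output, used as a stand-in here.
sample='Per-node process memory usage (in MBs)
                 Node 0          Node 1           Total
Huge               0.00        24576.00        24576.00
Heap               0.00          312.55          312.55
Stack              0.00            1.21            1.21
Private            4.02          210.80          214.82
Total              4.02        25100.56        25104.58'

# If the binding holds, the "wrong" node's Total column stays near zero.
echo "$sample" | awk '/^Total/ { printf "node0=%.2f MB  node1=%.2f MB\n", $2, $3 }'
```

For the guest bound to node 1, seeing only a few MB on node 0 tells me the `numatune`/`cputune` pinning actually held.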
Things I tried:
- switch GPUs (because of NUMA topology, though it shouldn't matter)
- switch CPUs
- swap VMs (test both VMs on both sockets; keep the VM definition, change only the NUMA location)
- run the same tests on qemu images alone (without Samba)
- no errors in syslog, no errors in dmesg (as far as I can tell)
- test different RAM
- update BIOS, reset BIOS
- disable all CVE mitigations
- strict CPU thread pinning, lower core count for the VM, more resources for the host
- numastat says fine (RAM on the same node)
- deactivate numad.service
- run the Supermicro self-diagnosis (via IPMI); it checks each component, all PASS
- run the "Intel Processor Diagnostic Tool" inside the VM; all tests and flags PASS
- glance at the pins of the socket (will do again)
- try a 10G fiber adapter via PCIe (SR-IOV, to rule out the Ethernet controller)
- as for the BIOS settings: I must admit there are so many, but the basics should be fine, and I should have ruled out the major ones via the other tests…
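One more host-side check that might help narrow things down (a sketch, not gospel): run the identical memory-streaming workload pinned to each socket with numactl and compare the throughput dd reports. Node numbers follow `numactl --hardware`; the script skips gracefully if numactl isn't installed.

```shell
# Pin the same dd stream to each node (CPU and memory) and print the rate.
bench() {
    numactl --cpunodebind="$1" --membind="$1" \
        dd if=/dev/zero of=/dev/null bs=1M count=4096 2>&1 |
        awk 'END { if (NF >= 2) print $(NF-1), $NF }'  # last two fields = rate + unit
}

if command -v numactl >/dev/null 2>&1; then
    for node in 0 1; do
        echo "node $node: $(bench "$node")"
    done
else
    echo "numactl not installed"
fi
```

On a healthy box the two numbers should land within a few percent of each other; a socket that needs 12 threads for a 200 Mbit download ought to show up dramatically here, with the whole VM stack taken out of the picture.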
Suspicion
Something physical on the board. But I lack the experience to map internals to behaviour.
I will eventually recheck the pins on the socket, but at first check I didn't see anything irregular, and anyway I would expect the Intel CPU test (in the VM) to fail then… (?). I will also run the test on the host itself; I thought it was more telling in the VM, but will do.
Corrosion on the CPU pins? The board came from eBay, but it looked never used.
Hardware Setup:
Board: Supermicro X10DRI
CPU: 2x Intel(R) Xeon(R) CPU E5-2697A v4
BIOS: latest (tried several)
RAM: 4x 32GB M386A4G40DM0-CPB (LRDIMM, ECC) (equivalent recommended by Supermicro)
OS disk: an Intel SSD. Ubuntu 22.04 LTS HWE (kernel 6.2)
GPUs: 2x AMD WX9100, plus the working vendor-reset module
Storage including VMs: 2x Crucial P3 NVMe. The ZFS pool has all ARC caching deactivated (metadata only).
Resizable BAR: forced, using the new vfio method in kernel 6.2 (also tested without)
Network: SR-IOV using the built-in 1G controller
Qemu: pc-q35-7.2, OVMF/UEFI, libvirt 9.0.0/Qemu 7.2.0
Guests: Windows 10 Pro
Gaming storage shared via Samba on an internal network (virtio-net)
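In case it helps anyone reproduce: checking which node each passthrough device actually hangs off can be done straight from sysfs. The PCI addresses below are placeholders, not my real ones — substitute whatever your own `lspci` shows for the GPUs.

```shell
# numa_node reads -1 (or the file is missing) when the platform reports no affinity;
# the helper falls back to "?" in that case.
dev_node() {
    cat "/sys/bus/pci/devices/$1/numa_node" 2>/dev/null || echo '?'
}

# Placeholder addresses for the two passed-through GPUs:
for dev in 0000:03:00.0 0000:81:00.0; do
    echo "$dev -> NUMA node $(dev_node "$dev")"
done
```

A GPU reporting the opposite node from its VM's pinning would mean every frame crosses the QPI link, which is exactly the kind of asymmetric drag described above.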
So…
This seems very esoteric to me, so I ask:
does anybody have a clue about this odd degradation?
Thank you very much indeed.