I have a Gigabyte R282-Z92 server with 2x EPYC 7742 running ESXi 6.7u3. I built this server to test it as candidate hardware to replace our existing ESXi hosts. I also wanted to see how it would perform for our specific workload, which depends heavily on single-threaded performance. The application currently runs on 2x Xeon E5-2667 v4. Due to some operating system limitations, I’m stuck with running CentOS 5 (yes, I know it’s EoL). Because it is so old, I basically had to virtualize the OS in order to run it at all on both platforms. In the EPYC’s case, I set EVC mode to AMD Opteron Gen 3 (no 3DNow!). The problem I’m seeing is that performance on the EPYC platform is lower than on the Xeon. I had hoped that, given the architectural improvements in CPU, memory and storage and AMD’s boost capabilities, it would match or outperform the Xeon. However, in synthetic tests it’s about 15-20% slower. My suspicion is that the turbo is not really kicking in, and/or the BIOS configuration is not optimal. I’m looking for ideas on how I can tune it for more performance.
Try setting NPS4 in the memory options in the BIOS.
Is the BIOS fully up to date? Configure cTDP for 240 W and set the determinism slider to Performance.
Even though it’s an old CentOS, can you run a newer kernel? Is VMware Tools installed?
How fast is the memory?
Yes, the BIOS is up to date. I did see the AMD docs (https://developer.amd.com/wp-content/resources/56779_1.0.pdf) on recommended BIOS settings and I’m waiting to reboot the host to make those changes, so I will give your recommendations a try. It’s currently running kernel 2.6.18-419.el5. I do believe there is a newer official kernel build available, but running a third-party kernel may not be an option. (I can look into that though if all else fails.) As a test, I changed EVC to the Zen generation and it seems to actually work with the current kernel, so maybe I don’t need to enable EVC at all. I am not sure how much of a penalty it’d incur one way or the other. VMware Tools is installed and running but it is out of date. I will also try updating it. Memory is 3200 MHz.
I also made the advanced host configuration changes for NUMA-related settings per the AMD doc. The particular VM is running with 128 vCPUs and 300 GB of memory.
Should be blazing fast. You might try something like 112 vCPUs if the 300 GB is on one node.
How did you come to that vCPU number? I have honestly not put much consideration in NUMA configuration previously, but it seems to actually make quite a bit of difference, especially for AMD.
I, too, have a lot of machines in the field. So experience.
What happens is that the CPU spends loads of time context switching in and out, and with that many threads it is probably suboptimal, especially across NUMA domains. If your legacy app can run on one socket, then that’s way better.
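To make the NUMA point a bit more concrete, here is a rough Python sketch of the arithmetic, not a model of what the ESXi scheduler actually does. It assumes 2 sockets of 64 physical cores each (the EPYC 7742 has 64 cores), counts physical cores only (no SMT), and assumes the BIOS NPS setting splits each socket evenly into 1, 2 or 4 NUMA nodes. The function name `numa_span` and its parameters are made up for illustration.

```python
def numa_span(vcpus, sockets=2, cores_per_socket=64, nps=1):
    """Return (cores_per_node, nodes_spanned) for a VM with `vcpus` vCPUs.

    Assumptions (illustrative only): physical cores only, NPS splits each
    socket evenly, and the VM is packed onto as few NUMA nodes as possible.
    """
    if nps not in (1, 2, 4):
        raise ValueError("this sketch only handles NPS1, NPS2 and NPS4")
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")

    total_nodes = sockets * nps
    cores_per_node = cores_per_socket // nps

    # Ceiling division: how many nodes are needed to hold all vCPUs.
    nodes_spanned = -(-vcpus // cores_per_node)
    if nodes_spanned > total_nodes:
        raise ValueError("more vCPUs than physical cores on the host")
    return cores_per_node, nodes_spanned


if __name__ == "__main__":
    for vcpus in (60, 128):
        for nps in (1, 4):
            per_node, spanned = numa_span(vcpus, nps=nps)
            print(f"{vcpus} vCPUs, NPS{nps}: {per_node} cores/node, "
                  f"spans {spanned} node(s)")
```

Under these assumptions a 128 vCPU VM always crosses both sockets, while something around 60 vCPUs can stay on a single socket.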
Without profiling your app I’m not sure what else to suggest, but if it’s truly single-threaded, why does the VM need so many threads? Lots of individual single-threaded clients? The 1T performance is indeed very high on that CPU compared with the Xeon. In fact, you may be able to “lie” to the VM with that version of VMware and tell it it is a v2 Xeon and see even higher performance, since most instructions are available. VMware won’t let you get into trouble here, generally.
If you can run a newer kernel on old centos, it would be worth it, but I’d custom compile it if I were you.
The application is multi-threaded, but depends largely on single thread performance as it handles individual transactions per customer request and the goal was to scale both vertically and horizontally.
How busy is the system with that many threads? Fewer threads may yet be more. Are the threads exposed to the system as one CPU or more?
Not particularly busy. We tried to do some stress testing last night, which yielded the 15-20% decrease. Each request is spun up as its own process so the OS will use as many CPUs as it’s able before it starts needing to context switch.
There is quite a lot of overhead from VMware switching in and out, so if you start with, say, 60 cores and move up from there, you might find the sweet spot for your app.
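If you end up running the same stress test at several vCPU counts, something like this Python sketch could help pick the sweet spot from your own numbers. It is only one possible rule of thumb: take the smallest vCPU count whose throughput is within a tolerance of the best run. The function name `pick_sweet_spot`, the 5% default tolerance, and the idea of measuring "throughput" as a single number are all assumptions for illustration.

```python
def pick_sweet_spot(results, tolerance=0.05):
    """Pick a vCPU count from measured results.

    `results` maps vCPU count -> measured throughput (higher is better),
    filled in from your own test runs. Returns the smallest vCPU count
    whose throughput is within `tolerance` (a fraction) of the best one.
    """
    if not results:
        raise ValueError("results must not be empty")
    if not 0.0 <= tolerance < 1.0:
        raise ValueError("tolerance must be in [0, 1)")

    best = max(results.values())
    threshold = best * (1.0 - tolerance)

    # Prefer fewer vCPUs when the throughput is close enough to the best.
    return min(n for n, tput in results.items() if tput >= threshold)
```

Call it with a dict like `{vcpu_count: throughput}` built from your runs. Preferring the smallest count is a design choice: fewer vCPUs means less scheduling pressure on the host for roughly the same result.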
Depending on your app, it may also make sense to map more than one virtual CPU socket (or not) to the VM. Even 4 sockets might make sense, depending on the app.
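As a quick way to see which virtual socket layouts are even possible for a given vCPU count, here is a small Python sketch. It just lists the (sockets, cores per socket) pairs that divide evenly, which is what you would enter as "Cores per Socket" in the VM's CPU settings. The function name `socket_layouts` and the default cap of 4 sockets are assumptions for illustration; it says nothing about which layout performs best.

```python
def socket_layouts(vcpus, max_sockets=4):
    """List (virtual_sockets, cores_per_socket) pairs for `vcpus` vCPUs.

    Only layouts where the vCPUs divide evenly across sockets are returned.
    """
    if vcpus < 1 or max_sockets < 1:
        raise ValueError("vcpus and max_sockets must be at least 1")

    layouts = []
    for sockets in range(1, max_sockets + 1):
        if vcpus % sockets == 0:
            layouts.append((sockets, vcpus // sockets))
    return layouts


if __name__ == "__main__":
    for vcpus in (60, 128):
        for sockets, cores in socket_layouts(vcpus):
            print(f"{vcpus} vCPUs: {sockets} socket(s) x {cores} cores")
```

Each candidate layout would still need to be tested with the real workload.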
Got it. I will give this approach a try.