How to Get the Most out of your new GIGABYTE Epyc Server

Intro

In this guide I’m going to walk you through setting up the Proxmox virtualization environment, but a lot of the advice here also applies if you are going to use VMware or Windows Server 2019 as your host operating system.

Our system is configured for 256 threads – dual 64-core Epyc CPUs – so it’s a beast for pretty much everything. With this many cores and threads, the highest and best use of this type of hardware is generally going to be a virtualization server. It would also work well for things like Kubernetes on bare metal.

Note: For Windows as the host OS I strongly recommend Windows Server 2019 or later. Windows may also benefit from disabling SMT because, in my testing, Windows doesn’t handle 128 threads per socket well at all. Your mileage may vary, of course, and for virtualization-type workloads it may matter less than, say, SQL Server or IIS running on the host.

Our Config

The chassis is the GIGABYTE R282-Z93, configured with:

2x EPYC 7742 CPUs
NVIDIA Tesla V100 32GB (for machine learning/CUDA tests)
512GB RAM (16x 32GB)
4x 8TB WD Red
256GB SATA SSD
2x 1TB NVMe

Memory note: I used a few different memory configs for the video, which tests the impact of 2666/2933/3200 registered ECC memory with Epyc. To summarize: use the fastest memory that you possibly can with your Epyc system.

This chassis is configured for 12x 3.5" bulk storage, with the fast NVMe storage on riser cards like the Liqid HHL. Chassis configurations with 24x NVMe are also available.

In general, I would recommend moving any type of mechanical/spinning rust type storage to an external disk shelf.

Inside the BIOS/UEFI

More POWERRRRRR [ AMD CBS > NB Common Options > SMU Options ]

The first thing you should absolutely do is set the cTDP on your Epyc CPUs to 240W. By default they’re configured for 225W; the extra bump in wattage makes them somewhat less power efficient, but they will be measurably faster. If you’re using this as a rough guide with a system not from GIGABYTE, it’s possible this option is locked out or unsupported for you.

I am not sure why, but on an older UEFI I had to set 480W on a 2-socket system to get 2x 240W. I don’t think that’s the case anymore, though.

The Determinism Control is also useful if you prefer performance over energy efficiency.

Prefer I/O?

If you are going to run a “full NVMe” config, there seems to be a wall around 28 gigabytes/sec when using 2933MHz memory. I believe, but am not sure, that this has something to do with the memory prefetcher constantly prefetching and using Infinity Fabric bandwidth. There is a UEFI option to configure an “I/O Priority” which makes the prefetcher a bit less aggressive, and with this option set I can get another 10 gigabytes/sec of read performance from the NVMe array.

For my testing I used 12x Intel DC P4500 4.0TB SSDs. This array can easily clear 35 gigabytes/sec raw read throughput.

Overclocking!?

Not really. Well, kinda? You can overclock the Infinity Fabric, which might help if you have really slow memory. You have to do some trickery with the SMU to do true overclocking, but almost all Epyc motherboards top out around 350W per socket, so you don’t have much headroom anyway (for comparison, 4.4GHz all-core on a Threadripper 3970X will consume about ~750W). Mostly you don’t need to worry about this, but I mention it because of two options: Gear Down Mode and Cmd2T. Disabling Gear Down Mode and making sure Cmd2T is disabled can improve memory performance significantly. Cmd2T is required for some memory kits – in general, steer clear of those. To enjoy the best performance on Epyc, you must pair it with the best memory.
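
If you want a quick before/after sanity check when toggling these memory options, something like sysbench’s memory test works – this is just an example workload, not the benchmark behind the numbers above:

# rough host-side memory bandwidth check; run before and after changing Gear Down Mode / Cmd2T
sysbench memory --memory-block-size=1M --memory-total-size=64G --threads=64 run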

PCIe Options? [AMD CBS > NBIO Common Options]

I have some older/wonkier peripherals, and PCIe ARI support has to be disabled for them. ARI seems to shuffle around which bits are used to address downstream devices, and it is enabled by default on some BIOSes. I think AMD has moved to disabling it by default in more recent BIOSes, but if you are experiencing issues (such as a multi-function device not showing up properly), try changing PCIe ARI Support, 10-bit tag support and ACS.

ACS and IOMMU go together – you may have to enable both of them for the best possible IOMMU layouts. I generally disable ARI support explicitly.
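
A quick way to see the effect of toggling ARI/ACS is to dump the IOMMU groups and check whether your devices are split up sensibly (plain sysfs, nothing Proxmox-specific):

# list every IOMMU group and the PCI devices inside it
for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group ${g##*/}:"
  for d in "$g"/devices/*; do
    lspci -nns "${d##*/}"
  done
done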

I would love more documentation about how these work, and what they do…

Chipset [ Chipset ]

There are some more PCIe options here. “Compliance Mode: Off” sounds fishy, especially for older PCIe cards that are notoriously grumpy. Be aware of these options if you’re using older PCIe devices.

OS Ready

We are now ready for the OS installation.

https://www.proxmox.com/en/downloads/category/iso-images-pve

The installation is straightforward. You can use the IPMI to mount the ISO, or make a USB stick. Installing via IPMI is quite a bit slower than from a USB stick, but I tested both without issues or any caveats to note.

ZFS & Proxmox = Panics?

ZFS is a great, resilient filesystem. In general I highly recommend it.

I set it up as part of our “get the most out of GIGABYTE/Epyc” server video, but there are sometimes issues.

I was helping LTT and was disturbed to find kernel panics. Okay, well, not full panics – warnings – but ZFS performance absolutely tanks when this happens, so it is not acceptable. It’s a known issue, and I would not consider it harmless. So the [SOLVED] there is not really solved, imho.

As of December 27, it is fixed upstream, but hasn’t trickled down to Proxmox Updates (yet) at the time of this writing.

How do you know if you are having this issue? Run dmesg from the Proxmox console while doing heavy file/IO operations. If you see warnings and debug information, you’re affected. The bad news is that if you have this issue, your I/O performance to the ZFS pool is going to be utter crap (at least, in the two configs I’ve tested with Proxmox VE 6.1). Your options are to install an older kernel, apply the patch from GitHub, or wait for the GitHub patch to be mainstreamed into Proxmox updates. (I am expecting the patch to be available via apt update no later than January 20, 2020…)
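
If you want to reproduce it deliberately, watch the kernel log while you hammer the pool. This assumes fio is installed and your pool is mounted at /tank – adjust to taste:

# follow the kernel log with timestamps in one terminal...
dmesg -wT
# ...and generate heavy I/O against the pool in another
fio --name=seqwrite --directory=/tank/test --rw=write --bs=1M --size=10G --numjobs=8 --group_reporting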

Tuning Proxmox for Performance

TODO

Remember, this system is a MONSTER. On VMware, I was able to consolidate twenty 1P and 2P Xeon 26xx-series systems into a single box. It is also new, bleeding-edge hardware, and there are some tweaks you can do to improve the responsiveness of the system when switching between heavily loaded VMs.

This resource should also be bookmarked and inspected periodically.
https://pve.proxmox.com/wiki/Performance_Tweaks

Kernel Parameters

sysctl-proxmox-tune.conf

After doing some testing on the GIGABYTE Chassis, with 256 threads, here’s what I’d recommend for Kernel parameters:

# https://tweaked.io/guide/kernel/
# Don't migrate processes between CPU cores too often
kernel.sched_migration_cost_ns = 5000000
# Kernel >= 2.6.38 (ie Proxmox 4+)
kernel.sched_autogroup_enabled = 0

# Don't slow network - save congestion window after idle
# https://github.com/ton31337/tools/wiki/tcp_slow_start_after_idle---tcp_no_metrics_save-performance
net.ipv4.tcp_slow_start_after_idle = 0

# try not to swap 
vm.swappiness = 1

# bigger listen backlog (accept queue) for busy services
net.core.somaxconn = 512000

# disable IPv6 (skip this if you actually use IPv6)
net.ipv6.conf.all.disable_ipv6 = 1


# https://www.serveradminblog.com/2011/02/neighbour-table-overflow-sysctl-conf-tunning/
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096

# time out FIN-WAIT connections faster
net.ipv4.tcp_fin_timeout = 10
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 15

# more ephemeral ports
net.ipv4.ip_local_port_range = 10240    61000

# https://major.io/2008/12/03/reducing-inode-and-dentry-caches-to-keep-oom-killer-at-bay/
vm.vfs_cache_pressure = 10000
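
To apply these without a reboot, drop them into a file under /etc/sysctl.d/ (the filename below is just an example) and reload:

# e.g. save the settings above as /etc/sysctl.d/99-proxmox-tune.conf, then:
sysctl --system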


Scheduler

# https://tweaked.io/guide/kernel/
# Don't migrate processes between CPU cores too often
kernel.sched_migration_cost_ns = 5000000
# Kernel >= 2.6.38 (ie Proxmox 4+)
kernel.sched_autogroup_enabled = 0

You can do some experiments, but with Epyc the default config is one NUMA node per socket. There is a hidden cost for moving processes around cores even within one socket, however. You CAN expose 4 “near” nodes per socket with a UEFI setting (NPS4), but generally it isn’t necessary. The better fix, IMHO, is to just tell Proxmox not to move processes around unless it really needs to.
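
You can confirm what the host actually exposes with numactl/lscpu – with the default one-node-per-socket setting you should see two NUMA nodes on a 2-socket box:

# apt install numactl if it's missing
numactl --hardware | grep available
lscpu | grep -i numa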

ZFS Cache

Adjust the ZFS cache so that the ARC consumes 1/4 to 1/3 of your available RAM, at most. This may be something you monitor and adjust over time.
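
On Proxmox the simplest way to do that is to cap the ARC with the zfs_arc_max module parameter. A minimal sketch, assuming 512GB of RAM and a 128GiB (roughly 1/4) cap – adjust the byte count for your system:

# /etc/modprobe.d/zfs.conf – cap ARC at 128GiB (137438953472 bytes)
options zfs zfs_arc_max=137438953472
# apply immediately without a reboot:
#   echo 137438953472 > /sys/module/zfs/parameters/zfs_arc_max
# and rebuild the initramfs so it survives reboots:
#   update-initramfs -u -k all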

Networking

The GIGABYTE chassis has the option for OCP 3.0 mezzanine cards. These work with everything from dual 10-gig adapters up through Intel’s new 400-gig standard. For Intel-based NICs, generally no tuning is required. If you have Mellanox adapters that you plan to use, some tuning is recommended.

Here are the settings we used for the Level1 Video Storage Server:

TODO

Samba

If you plan to use Samba with your new server, enable SMB Multichannel (see our other vid on that).
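
A minimal sketch of what that looks like in smb.conf (Samba 4.7+; the aio lines are optional but commonly paired with it):

# /etc/samba/smb.conf
[global]
    server multi channel support = yes
    # do async I/O on all reads/writes
    aio read size = 1
    aio write size = 1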

Migrated VMs

For our migration, we first did full-machine migrations. There is an important distinction between full virtualization and containers, however – containers are much lighter. If you have a VM running a simple job such as a web server or database server, consider whether that job could be performed by a container of some type. Proxmox supports LXC containers natively (and Docker can run inside a VM or an LXC container). Because containers have a fraction of the overhead of a full virtual machine, you can efficiently run a lot more of them on a single host.

For our migration & consolidation, we did an evaluation of what the virtual machine was doing. Could it be combined with another VM? Could it be containerized to lower the maintenance requirement? We had a few Python and PHP applications that fit the bill – so we converted them to containers. This may be an option for you.

Placeholder for updates.

What do you mean by this? I run a Proxmox machine right now and I run Docker inside a container – is there a better way of running this?

Not sure if this is a Proxmox-specific problem, but I have been wondering how caching works for OpenZFS with regards to KVM guests. I have had issues running KVM guests on older Virtuozzo (based on RHEL7) systems because ZFS didn’t support O_DIRECT.

That led me down the rabbit hole of various open(2) flags, but the explanations were hard for me to visualize. Is there an old thread on the forum, or elsewhere, that describes what those cache parameters like writeback, etc. mean?

This also got me thinking about how caching would work for OpenZFS-backed KVM guests. Is there a way to use an L2ARC to store the data a guest would typically keep in its memory buffers (as shown in its free -h)? It seems impossibly complex, but I wanted to think it out loud as an idea anyway.

I’d love to see more proxmox videos.

I’m planning (when I have more time) on running it on my “new” home server (Supermicro CSE-846):

  • 2x 1TB SSD, encrypted ZFS mirror for boot and VMs
  • 6x 8TB natively encrypted ZFS RAIDZ2
  • connected to a UPS, auto shutdown
  • scripted to back up to temporary hot-swap HDDs, 1TB–4TB

All my servers are running CentOS; I have no experience with Proxmox, Debian or ZFS. I’m looking forward to exploring Proxmox and ZFS.

Proxmox is just Debian so you can install it directly on the host system.

This is truly a beast of a machine, and an awesome effort on your part. Just curious – how much would building an Epyc or Threadripper system cost for a private US civilian? My estimate is several thousand dollars.

Probably around $5k for a good 3960X workstation

Thank you @wendell for taking the time to put together this very detailed technical guide!

Don’t thank me, thank our patrons :wink:

Thanks @wendell (and the Patrons :slight_smile: ) for all this very useful info!

I’ve recently built a hyperconverged Proxmox test cluster out of 3 trash-tier desktops, with Ceph running over 100Mbit/sec USB 2.0 adapters. It actually works quite nicely if you don’t mind 0.4 fsyncs/sec and 20MByte/sec of storage speed hahaha :smiley:

Still waiting for the real hardware to arrive. The three 7702P CPUs we received long ago are starting to gather dust whilst we wait for the Gigabyte R272-Z31 to arrive…

Thanks @wendell for this guide. We actually went and bought the same server (but with 4TB of RAM) because of you :slight_smile:

I’d like to point out that if anyone gets a similar machine and sees only 255 instead of 256 threads online, you might need to enable x2APIC in the BIOS. The default is “Auto”, which only brings 255 threads online.
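
A quick way to verify after flipping that option is plain lscpu:

# should report 256 CPU(s) and an on-line list of 0-255
lscpu | grep -E '^CPU\(s\)|On-line'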

Currently we are having trouble setting up a VM with more than 2TB of RAM. We’d like to use it for scientific computing, and the programs we run actually consume that much memory. Ideally we want to allocate 3 to 3.5TB to a single VM, but it becomes unstable if 1.5TB is allocated. When allocating 2TB or more it becomes non-bootable: Proxmox shows Status: internal-error and the VM freezes.

I suspect this has something to do with hugepages, and I couldn’t find a good guide to configure them properly. I tried adding hugepages: 2 and hugepages: 1024 in the VM’s config file, but no luck. I’m not sure what is wrong :frowning:

Any advice is welcome.

EDIT:
It turns out that hugepages: 2 does somehow work with 3TB of RAM!! It just takes a looong time to initialize. It was us being impatient and thinking it wouldn’t boot.

We’ll test if it is stable.

Okay it’s not stable… Proxmox still shows Status: internal-error sometimes.

Somewhere on the forum here is a good guide to huge pages. I’m on mobile atm, but it’s a great guide and would probably be a good reference for getting it configured. There are likely to be some edge cases with a RAM config this large because it’s not well tested yet. An older or bleeding-edge kernel may be in order.
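
Roughly, the moving parts are reserving 1GiB hugepages on the host at boot and pointing the VM config at them – a sketch only, the sizes here are examples:

# /etc/default/grub – reserve 3072 x 1GiB hugepages on the host (size is an example)
GRUB_CMDLINE_LINUX_DEFAULT="quiet default_hugepagesz=1G hugepagesz=1G hugepages=3072"
# then run update-grub and reboot (the cmdline lives elsewhere on ZFS/UEFI boot setups)

# /etc/pve/qemu-server/<vmid>.conf – make the guest use 1GiB pages
hugepages: 1024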

Great blog post! It was helpful to read about the different BIOS configurations which might affect the ability to do PCI passthrough.

We ran into a problem on our Gigabyte R282-Z91 server, while we were trying to set up Infiniband (IB) virtualization via SR-IOV. The goal was to have both the host, and multiple virtual machines communicate over Infiniband. Mellanox has instructions for this, which I’ll link at the bottom.

The problem was that when trying to virtualize the Infiniband adapters, they would not get their own IOMMU groups. Instead, both the physical adapter and all of the virtual adapters were put into a single IOMMU group. This rendered all of the virtual adapters unusable, as you might know. Virsh indicated the IOMMU problem by giving us the following error:

error: internal error: qemu unexpectedly closed the monitor: 2021-12-29T14:32:58.611630Z qemu-kvm: -device vfio-pci,host=01:00.1,id=hostdev0,bus=pci.0,addr=0x6: vfio error: 0000:01:00.1: group 28 is not viable
Please ensure all devices within the iommu_group are bound to their vfio bus driver.

The root cause turned out to be PCIe AER (Advanced Error Reporting) being disabled by default in the BIOS. The option “Enable AER Cap” is visible in Wendell’s screenshot, and it’s set to “Auto” there, but we apparently needed to enable it explicitly.

After we found that trick in the Nvidia support article linked at the bottom, everything started working as expected, with each virtual IB adapter in its own IOMMU group.
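
For anyone replicating this: once the BIOS side is sorted, the VF creation itself is just sysfs – a minimal sketch (the PF name ib0 and the VF count are examples, check ip link for yours; the Mellanox guide below covers the firmware side):

# create 4 virtual functions on the ConnectX-6 physical function
echo 4 > /sys/class/net/ib0/device/sriov_numvfs
# confirm each VF landed in its own IOMMU group
find /sys/kernel/iommu_groups/ -type l | sort -V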

For completeness, here are some specs from our particular server:

  • Chassis: Gigabyte R282-Z91
  • BIOS version: R23
  • BIOS revision: 5.14
  • Base board: MZ92-FS0-00
  • CPU: 2 x AMD EPYC 7402
  • Infiniband controller: ConnectX-6, MT28908
  • Relevant kernel cmdline options: amd_iommu=on iommu=pt

Hopefully someone finds this useful :slight_smile:

Best regards,

Simon

Links (couldn’t include real links in the post as a new user, apparently):

- Mellanox SR-IOV guide:
  - https://docs.nvidia.com/networking/pages/viewpage.action?pageId=52011161
- Nvidia PCI AER support article:
  - https://enterprise-support.nvidia.com/s/article/PCIe-AER-Advanced-Error-Reporting-and-ACS-Access-Control-Services-BIOS-Settings-for-vGPUs-that-Support-SR-IOV

Great info! Does anyone know if it is possible to raise the cTDP on the MZ73-LM0 for Epyc 9654/9754, or is it locked now?

Did you ever manage to get proxmox working with that much RAM?

I wanted to say thanks for this post, as it finally helped me with my Epyc system and PCI passthrough as well. On my Gigabyte board the PCI AER option was disabled, which was why the PCIe ACS options were hidden; by enabling both, the IOMMU groups finally got nicely separated.

I’ve had this system in my home lab for a couple of years already and couldn’t make sense of it. It was also my first “modern” server hardware, and a BIOS full of nothing but three-letter abbreviations didn’t make it easy to figure out.