AMD Epyc Milan Workstation Questions

I don’t know where to find official numbers from AMD, but this PDF from Lenovo should be good enough; see page 41 for a list. These are also a good reference for tuning EPYC Rome.

My 7302P defaults to 155 W, but it can be bumped to 180 W.

If you set the number above the chip’s max, it just automatically clamps to the chip’s max.

Looks like the 7252 can be bumped to 150 W.

1 Like

The supplier for the 747 got in touch this morning, and I’ve now ordered the current version of that chassis with 2.0kW PSUs :slight_smile:

ETA is approx end of May, so a bit more of a wait, but worth it IMO. I really hope I won’t actually have 2kW of load - this room is too small for that much heat output!

1 Like

Thanks for the link! Going back to AMD’s product pages, I realize they seem to list no cTDP for any of the Rome chips, but they do list it for the Milan chips (example, scroll down). That’s what fooled me into thinking that my 7252 could not be configured.

AMD is really missing something like Intel’s ARK database.

I’ll definitely experiment with this later.

2 Likes

If you’re on Linux, make sure you have turbostat and cpupower (for cpupower frequency-set) installed so that you can see what speed all the cores are running at.
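
For example, something like this gives a quick snapshot (the exact columns depend on your turbostat version, and PkgWatt only appears if the platform exposes RAPL):

# One 5-second summary of what the cores are actually doing
sudo turbostat --quiet --interval 5 --num_iterations 1

# Or keep watching just the interesting columns
sudo turbostat --quiet --show Core,CPU,Busy%,Bzy_MHz,PkgWatt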

On my EPYC box I have this in my start-up script:

# Set CPU governor to powersave, ondemand or performance modes
cpupower frequency-set -g ondemand
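
To check that the governor actually took effect, something like this works (cpupower comes from the kernel’s linux-tools / cpupower package):

# Report the currently active policy and governor for CPU 0
cpupower frequency-info --policy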

Oh, and for fan control:

# Set fans to optimal
ipmitool raw 0x30 0x45 0x01 0x02

# Set CPU fans to 33%
ipmitool raw 0x30 0x70 0x66 0x01 0x00 33
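
If you want to check which mode the BMC is currently in, the matching read command on these Supermicro BMCs is reportedly:

# Read the current fan mode: 0x00 = standard, 0x01 = full, 0x02 = optimal, 0x04 = heavy IO
ipmitool raw 0x30 0x45 0x00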

I’ll see if I can find the link to the article that showed me how to do that. Not sure if the zones will be the same on the H12SSL yet.

1 Like

Thanks, turbostat was kind of new to me! It shows the actual core frequencies, in addition to the P-state-defined ones. That was something I always had trouble extracting in the past (though on different hardware).

I also have to figure out how to set the lower RPM limits for the fans. The BMC currently thinks my case fans are running too slow to be healthy.

Edit: Btw, when you set the CPU fans to 33%, is that simply their default idle value, or does it mean scaling their entire RPM range by a factor of 0.33?

The obligatory IOMMU group review (Supermicro H12SSL-I)

Here’s the list of PCI(e) devices sorted by IOMMU group:

IOMMU Group 0:
	c0:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 1:
	c0:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 2:
	c0:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 3:
	c0:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 4:
	c0:05.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 5:
	c0:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 6:
	c0:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 7:
	c0:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 8:
	c0:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 9:
	c1:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function [1022:148a]
IOMMU Group 10:
	c1:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA [1022:1498]
IOMMU Group 11:
	c2:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
IOMMU Group 12:
	c2:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA [1022:1498]
IOMMU Group 13:
	80:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
	80:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
	81:00.0 VGA compatible controller [0300]: NVIDIA Corporation GT218 [GeForce 210] [10de:0a65] (rev a2)
	81:00.1 Audio device [0403]: NVIDIA Corporation High Definition Audio Controller [10de:0be3] (rev a1)
IOMMU Group 14:
	80:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 15:
	80:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 16:
	80:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 17:
	80:05.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 18:
	80:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 19:
	80:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 20:
	80:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 21:
	80:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 22:
	80:08.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 23:
	80:08.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 24:
	82:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function [1022:148a]
IOMMU Group 25:
	82:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA [1022:1498]
IOMMU Group 26:
	83:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
IOMMU Group 27:
	83:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA [1022:1498]
IOMMU Group 28:
	84:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
IOMMU Group 29:
	85:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
IOMMU Group 30:
	00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 31:
	00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 32:
	00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
	00:03.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
	01:00.0 Non-Volatile memory controller [0108]: Phison Electronics Corporation Device [1987:5018] (rev 01)
IOMMU Group 33:
	00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 34:
	00:05.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 35:
	00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 36:
	00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 37:
	00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 38:
	00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 39:
	00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 61)
	00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
IOMMU Group 40:
	00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship Device 24; Function 0 [1022:1490]
	00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship Device 24; Function 1 [1022:1491]
	00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship Device 24; Function 2 [1022:1492]
	00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship Device 24; Function 3 [1022:1493]
	00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship Device 24; Function 4 [1022:1494]
	00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship Device 24; Function 5 [1022:1495]
	00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship Device 24; Function 6 [1022:1496]
	00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship Device 24; Function 7 [1022:1497]
IOMMU Group 41:
	02:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function [1022:148a]
IOMMU Group 42:
	02:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA [1022:1498]
IOMMU Group 43:
	03:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
IOMMU Group 44:
	03:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA [1022:1498]
IOMMU Group 45:
	03:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Starship USB 3.0 Host Controller [1022:148c]
IOMMU Group 46:
	40:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 47:
	40:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 48:
	40:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
	40:03.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
	40:03.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
	40:03.5 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
	40:03.6 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
	41:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller [1b21:1142]
	42:00.0 PCI bridge [0604]: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge [1a03:1150] (rev 04)
	43:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 41)
	44:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller [1b21:1142]
	45:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [14e4:165f]
	45:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [14e4:165f]
IOMMU Group 49:
	40:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 50:
	40:05.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 51:
	40:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 52:
	40:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 53:
	40:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU Group 54:
	40:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 55:
	40:08.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 56:
	40:08.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU Group 57:
	46:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function [1022:148a]
IOMMU Group 58:
	46:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA [1022:1498]
IOMMU Group 59:
	47:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
IOMMU Group 60:
	47:00.1 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP [1022:1486]
IOMMU Group 61:
	47:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA [1022:1498]
IOMMU Group 62:
	47:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Starship USB 3.0 Host Controller [1022:148c]
IOMMU Group 63:
	48:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
IOMMU Group 64:
	49:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
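
(For anyone who wants to generate the same kind of listing on their own board, the usual sysfs loop looks roughly like this - a minimal sketch, assuming lspci is installed:)

#!/bin/bash
# Print every PCI device under its IOMMU group, as enumerated by the kernel in sysfs
shopt -s nullglob
for group in /sys/kernel/iommu_groups/*; do
    echo "IOMMU Group ${group##*/}:"
    for dev in "$group"/devices/*; do
        echo -e "\t$(lspci -nns "${dev##*/}")"
    done
done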

Looks ok. This group (no. 48) is less than optimal though:

IOMMU Group 48:
	40:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
	40:03.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
	40:03.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
	40:03.5 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
	40:03.6 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
	41:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller [1b21:1142]
	42:00.0 PCI bridge [0604]: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge [1a03:1150] (rev 04)
	43:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 41)
	44:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM1042A USB 3.0 Host Controller [1b21:1142]
	45:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [14e4:165f]
	45:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe [14e4:165f]

The Broadcom NIC (both ports), seemingly the BMC controller (ASPEED […]), and 4 of the 6 USB ports are grouped together. This, and the fact that the Broadcom chip is one controller with 2 ports, obviously complicates passing a NIC through to a VM. I don’t know how commonly that is requested nowadays.

There are still the two USB ports that connect directly to the CPU; they are in groups of their own. I made an overview of the USB options:

Physically, the two CPU USB ports sit on the backplate, together with two ASMedia ports (the other two sit on the MB connector).

The grouping of USB ports is of course relevant for running a VM with dedicated peripherals (e.g. for gaming). Here, that would only be practical using one or both of the CPU USB ports (and enjoy a full PCIe 4.0 x1 lane for a keyboard and a mouse, or even one for each :stuck_out_tongue: ).
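
For completeness, a minimal sketch of how one of those CPU USB controllers (e.g. 03:00.3 from the listing above) could be handed to vfio-pci for a VM - assuming the vfio-pci module is loaded and, for the first variant, that driverctl is installed:

# Persistent override via driverctl
sudo driverctl set-override 0000:03:00.3 vfio-pci

# Or a one-off rebind through sysfs
echo vfio-pci     | sudo tee /sys/bus/pci/devices/0000:03:00.3/driver_override
echo 0000:03:00.3 | sudo tee /sys/bus/pci/devices/0000:03:00.3/driver/unbind
echo 0000:03:00.3 | sudo tee /sys/bus/pci/drivers_probe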

Add-in USB controllers are cheap though :slight_smile:

The add-in GPU (in slot #7) has more company in its group than I’m used to:

IOMMU Group 13:
	80:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
	80:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
	81:00.0 VGA compatible controller [0300]: NVIDIA Corporation GT218 [GeForce 210] [10de:0a65] (rev a2)
	81:00.1 Audio device [0403]: NVIDIA Corporation High Definition Audio Controller [10de:0be3] (rev a1)

The 80:* entries look like they belong there, so I’m not concerned. I assume they will simply come along with the GPU.

(I have not tested with 2 GPUs in the system yet; I simply don’t expect any strange behaviour from that.)

3 Likes

Great information! You might also consider mapping out where the NUMA nodes physically align, as that can affect device layout planning if you want to essentially isolate a node or two for a VM.

Come to think of it, I should post the same for my S8030. I’ll post about how to do that late tonight.

1 Like

IIRC the 33% fan setting is their minimum idle speed - they do ramp up if the CPU warms up, they just don’t drop below it to whatever ‘optimal’ would otherwise default to.

As for keeping the BMC happy with slow Noctuas:

# Set Noctua friendly lower fan thresholds
ipmitool sensor thresh FAN1 lower 50 100 150
ipmitool sensor thresh FAN2 lower 50 100 150

The source of all this magic is: Reference Material - Supermicro X9/X10/X11 Fan Speed Control | ServeTheHome Forums

You might have to work out which fan/sensor is which for your motherboard, but it should give you a head start.

1 Like

Follow-up on finding physical NUMA node associations: to figure out the locations, install hwloc on your distro, then plug hardware into the various ports and use lstopo to see which NUMA node that hardware shows up under. You can investigate further using the “1234:5678”-style ID and lspci -nnv (using grep to quickly narrow things down).
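
For example (the 14e4:165f ID here is just the BCM5720 from the H12SSL listing above; numa_node reads -1 when the board is configured as a single node):

# Show the whole topology, with PCI devices placed under their NUMA nodes
lstopo --of console

# Or look up one device: find its bus address by vendor:device ID, then ask sysfs which node it sits on
ADDR=$(lspci -D -d 14e4:165f | head -n1 | cut -d' ' -f1)   # e.g. 0000:45:00.0
cat "/sys/bus/pci/devices/$ADDR/numa_node"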

Here is a tentative example of a Tyan S8030. I want to double check it before I make an official post, but don’t have time right now.

The IPMI fan control tool on the ROMED8-2T allows fan speed settings between 20 and 100%. I have verified that 20% is indeed the default setting. If you’re looking for a really quiet EPYC, combine it with Noctuas. My UPS actually makes more noise than the workstation.

@oegat I’m a bit surprised by your temperatures, because mine are lower despite using very slow fans in a big “gamer case”. Can you tell us the ambient temperature where your machine is running?

On a side note, I’m loving the ambiance around here. Most useful & helpful forum I’ve joined in a very long time.

1 Like

Sorry - I didn’t mean the BIOS wouldn’t let me go below 33%; what I actually wanted was to prevent it from going below 33%. I’d rather have a bit too much airflow at idle than have the fans constantly ramping up and down - I find that less distracting.

And yes, Noctuas are super quiet, but IMO that’s because they just aren’t spinning very fast or moving much air. But since most of us don’t have anything like the power density of a 2U server packed with a pair of 280 W CPUs and 4 dual-slot GPUs, they’ll probably do just fine, unless you need to cool some high-powered but fanless server cards.

1 Like

Great question! I just now plugged in a USB temp sensor to get the ambient right. It’s hanging about 20 cm above floor level in front of the case, roughly level with the CPU cooler.

Here are some readings, with 4x 16 GB DIMMs, the 7252, a Gigabyte 7000s 1 TB NVMe drive (rated 6.5 W) and a Radeon 7790 in the box. However, most of the heat from the latter two will likely go upwards (away from the CPU section, since my case is inverted) and exit through the top vent. The fans are the stock case fans, two in the front and one behind the CPU. The CPU cooler is an SNK-P0064AP4.

@jtredux you’ll see I haven’t changed the thresholds yet; I’ll look into that next. Thanks for the info.

Temps at idle:

# Ambient (measured in front of case, 20cm above floor):
Sensor C: 23.50

# output of "ipmitool sensor":
CPU Temp         | 27.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 95.000    | 95.000    
System Temp      | 33.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
Peripheral Temp  | 35.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
M2NVMeSSD Temp1  | na         |            | na    | na        | na        | na        | na        | na        | na        
M2NVMeSSD Temp2  | na         |            | na    | na        | na        | na        | na        | na        | na        
VRMCpu Temp      | 39.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
VRMSoc Temp      | 38.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
VRMABCD Temp     | 37.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
VRMEFGH Temp     | 39.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
P1_DIMMA~D Temp  | 34.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
P1_DIMME~H Temp  | 32.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
FAN1             | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN2             | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN3             | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN4             | 420.000    | RPM        | cr    | 280.000   | 420.000   | na        | na        | 35560.000 | 35700.000 
FAN5             | 1260.000   | RPM        | ok    | 280.000   | 420.000   | na        | na        | 35560.000 | 35700.000 
FANA             | na         |            | na    | na        | na        | na        | na        | na        | na        
FANB             | na         |            | na    | na        | na        | na        | na        | na        | na    

NB: FAN4 is actually 3 fans - currently all three pre-installed case fans (two in the front and one behind the CPU) are controlled through the case’s fan controller, which is set to act as a PWM repeater (though I haven’t verified its function). FAN5 is the CPU fan.

I find it remarkable that the CPU at idle is only about 4 degrees above ambient. This is with the pre-applied stock thermal paste. The other temps, however, are comparatively a lot higher.

After running mprime on all cores until temps stabilize:

# Ambient (measured in front of case, 20cm above floor):
Sensor C: 23.50
# output of "ipmitool sensor":
CPU Temp         | 49.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 95.000    | 95.000    
System Temp      | 37.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
Peripheral Temp  | 37.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
M2NVMeSSD Temp1  | na         |            | na    | na        | na        | na        | na        | na        | na        
M2NVMeSSD Temp2  | na         |            | na    | na        | na        | na        | na        | na        | na        
VRMCpu Temp      | 47.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
VRMSoc Temp      | 46.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
VRMABCD Temp     | 43.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
VRMEFGH Temp     | 47.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
P1_DIMMA~D Temp  | 41.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
P1_DIMME~H Temp  | 41.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
FAN1             | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN2             | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN3             | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN4             | 560.000    | RPM        | nc    | 280.000   | 420.000   | na        | na        | 35560.000 | 35700.000 
FAN5             | 1820.000   | RPM        | ok    | 280.000   | 420.000   | na        | na        | 35560.000 | 35700.000 
FANA             | na         |            | na    | na        | na        | na        | na        | na        | na        
FANB             | na         |            | na    | na        | na        | na        | na        | na        | na       

At this stage all cores oscillate between 3.0 and 3.19 GHz (the rated range is 3.1-3.2 GHz). cTDP is untouched, so I believe it is effectively at 125 W.

Fan policy is set to “optimal” in the BMC. The case fans (FAN4) max out at 1000 rpm, and the CPU fan (FAN5) at 3800 rpm, suggesting they are currently running at 56% and 48%, respectively. So there is definitely headroom. I’ve ordered two more case fans; when they arrive I plan to experiment with cooling zones.
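
For the zone experiments, the raw duty-cycle commands from earlier should come in handy; on Supermicro BMCs the convention is reportedly zone 0x00 for the CPU/system fan headers and 0x01 for the peripheral headers (I still need to verify which zone is which on the H12SSL):

ipmitool raw 0x30 0x70 0x66 0x00 0x01      # read the current duty cycle of zone 1
ipmitool raw 0x30 0x70 0x66 0x01 0x01 50   # set zone 1 to 50%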

So what to make of these temps? How do they compare to yours @Nefastor? At least at idle our systems should be sort of comparable.

Yes, it’s very comparable. I think I’ve mentioned before that all my readings were at 18 °C ambient, so it makes sense they were lower. Spring is coming, however, so my temperatures are now pretty much the same as yours. All is well with the world! :grin:

1 Like

Good point, I will look into this and report back, especially since I plan to virtualize. Currently the system is configured as one node only, and I haven’t tried to change it. I guess the topology options are also limited by the number of CCDs? I.e., I can’t make 4 nodes active with a 2-CCD CPU and/or with only 4 memory channels populated?

A related question is whether PCIe slot affinity also matters when running the socket as a single node. I am thinking that slots might sit closer to or further from a given CCD, and that it would therefore make sense to put things like GPUs closer to the CCD that interacts with them, regardless of NUMA. Or is slot affinity only relevant from the perspective of DMA and the like?

I thought that Rome and Milan were only 1 NUMA node - isn’t that the great improvement over Gen-1 Naples?

My understanding is that if you want the lowest latency, you absolutely want the cores, RAM and the PCIe slots involved to be as close as possible. For us, this means having the NUMA node and GPU aligned for “gaming” VMs, and calling it good enough (the Linux kernel supposedly takes care of RAM affinity automatically, as long as you’re not using more RAM than is connected to that NUMA node).

As far as CCDs go, I’m unsure if there’s a difference practical enough for us to care. If someone had an ultra-sensitive workload they’d have to care all the way down to the CCXs on Rome because of the split cache. At that point I’d just get a separate gaming computer lol.

Found this for reference; your 16 cores are definitely different from my 16 cores. Looks like you only have 2 NUMA nodes to work with.
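
A quick way to check what the firmware is currently exposing (assuming numactl is installed):

numactl --hardware     # lists the nodes, their CPUs and per-node memory
lscpu | grep -i numa   # shorter summary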

1 Like

Just test-fired up my 7262 in my H12SSL-NT. While I only really bought it as a cheap socket-filler (used, and under £20/core), 8 cores at 3.4 GHz with 128 MB of cache might actually be quite good for my single-threaded workloads. Will definitely have to put it through its paces once I have some 3200 MHz RAM and a case to build it all in.

1 Like

Right, this makes sense. Though I still don’t have a good feel for how much or how little NUMA matters on Rome/Milan compared to previous-gen NUMA CPUs (relating to @jtredux’s question).

From what you describe, it sounds like the best scenario for a gaming VM is to split the system into the maximum number of nodes and give only one node to the VM (provided the memory and cores of one node suffice for the workload). Though I wonder how much latency I lose if I, say, configure a max-4-node CPU into 2 nodes, giving one full node (half the CPU) to the VM. Currently I have only two possible nodes, but I plan to get a 4-node Milan chip at some point.

The reason I’m concerned with this is that I made kind of a mistake when speccing my last machine of this kind - at the time (late 2011) I went with dual 8-core Socket G34 Opteron 61xx CPUs (then cheap on eBay), thinking that they would make 2 nodes with 8 cores each. However, I had failed to realize that each 61xx CPU was really an MCM with two nodes already, so I ended up with 4 nodes. With only 32 GB of RAM in that system, a single node (4 cores, 8 GB) for a gaming VM soon became underwhelming :slight_smile:

However, at that time NUMA was quite important, and I get the impression that current EPYC generations depend much less on it. My current plan of getting a 7313P is kind of based on making a Windows VM out of half of it, but that would correspond to 2 nodes.

Another way to mitigate any problems along these lines seems to exist - Windows 10 supports 2 sockets, which means it is NUMA-aware. So it should be possible to give a Windows VM 2 NUMA nodes and expose the topology, so that Windows knows about it. It still doesn’t help if the workload cannot be adapted, though (e.g. games).
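
As a rough sketch of what that could look like with plain QEMU/KVM (hypothetical core counts and memory sizes; libvirt can express the same thing with NUMA cells), exposing two guest nodes of 8 cores / 32 GB each:

# Only the guest NUMA topology is shown; disks, network and passthrough devices omitted
qemu-system-x86_64 \
  -machine q35,accel=kvm -cpu host \
  -smp 16,sockets=1,cores=8,threads=2 \
  -m 64G \
  -object memory-backend-ram,id=m0,size=32G \
  -object memory-backend-ram,id=m1,size=32G \
  -numa node,nodeid=0,cpus=0-7,memdev=m0 \
  -numa node,nodeid=1,cpus=8-15,memdev=m1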

The 7252 I have now is only 8 cores on two CCDs, which I assume maps to 2 NUMA nodes. When I start experimenting with VMs soon, I’ll simply split it in half. This is also one of the “4-channel-optimized” parts.

1 Like

Nice! I’m looking forward to your reports down the line, and I’m especially interested in the thermals of the Broadcom 10 GbE chip - that will essentially tell me whether I made the right choice in not getting the NT.

Btw, I tried messing around with the fan thresholds now. Apparently “ipmitool sensor”, as well as the BMC web interface, lists only lnr and lcr - lnc can be set but is not shown, though I’m sure it exists, as the fan had been in an “nc” state before the change. Also, it only sets speeds in steps of 140 rpm (rounded from whatever is typed in).

$ ipmitool sensor thresh FAN4 lower 0 140 280
Locating sensor record 'FAN4'...
Setting sensor "FAN4" Lower Non-Recoverable threshold to 0.000
Setting sensor "FAN4" Lower Critical threshold to 140.000
Setting sensor "FAN4" Lower Non-Critical threshold to 280.000

$ ipmitool sensor
...
FAN4             | 420.000    | RPM        | ok    | 0.000     | 140.000   | na        | na        | 35560.000 | 35700.000 
...

I set the lowest one to 0 to make it a reasonable range.

NUMA stands for Non-Uniform Memory Access, but on Rome/Milan the compute dies are all connected to a single I/O die. They all have the same path to the DDR and PCIe interfaces.

On Naples, you might have to talk to another compute die to get to the DDR or PCIe connected to it, but on Milan/Rome you seem to get similar latency to each, as it’s a single hop…

BTW - just found this HPC / NUMA link which might help… https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1280442391/AMD+2nd+Gen+EPYC+CPU+Tuning+Guide+for+InfiniBand+HPC