AMD Epyc Milan Workstation Questions

Great information! You might also consider mapping out where the NUMA nodes physically align, since that can affect device layout planning if you want to essentially isolate a node or two for a VM.

Come to think of it, I should post the same for my S8030. I'll write up how to do that late tonight.


IIRC the 33% fan setting is their minimum idle speed - they do ramp up if the CPU warms up, they just don't drop below that floor to whatever 'optimal' defaults to.

As for keeping the BMC happy with slow Noctuas:

  • Set Noctua-friendly lower fan thresholds (the three values are the lower non-recoverable, lower critical and lower non-critical limits, in RPM)
    ipmitool sensor thresh FAN1 lower 50 100 150
    ipmitool sensor thresh FAN2 lower 50 100 150

The source of all this magic is: Reference Material - Supermicro X9/X10/X11 Fan Speed Control | ServeTheHome Forums

You might have to work out which fan/sensor is which for your motherboard, but it should give you a head start.
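A quick sketch of how you might identify them, assuming a stock ipmitool install (add -H/-U/-P if you're talking to a remote BMC):

# list every sensor the BMC exposes, keeping only the fan entries
$ ipmitool sensor list | grep -i fan

# or ask the SDR for fan records specifically
$ ipmitool sdr type Fan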


Follow-up on finding physical NUMA node associations: to figure out the locations, install hwloc on your distro, plug hardware into the various ports, and use lstopo to see which NUMA node that hardware shows up under. You can investigate further using the "1234:5678"-style vendor:device ID and lspci -nnv (using grep to quickly narrow things down).
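As a rough example of that workflow (the VGA grep and the PCI address below are just placeholders - substitute whatever your own card reports):

# text-mode topology dump; PCI devices appear under the NUMA node they hang off
$ lstopo-no-graphics

# find the device and its [vendor:device] ID
$ lspci -nnv | grep -i vga

# or ask the kernel directly which node a given device belongs to (placeholder address)
$ cat /sys/bus/pci/devices/0000:41:00.0/numa_node

Note that numa_node prints -1 if the platform doesn't report an affinity for that device.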

Here is a tentative example from a Tyan S8030. I want to double-check it before I make an official post, but don't have time right now.

The IPMI fan control tool on the ROMED8-2T allows fan speed settings between 20% and 100%. I have verified that 20% is indeed the default setting. If you're looking for a really quiet EPYC, combine it with Noctuas. My UPS actually makes more noise than the workstation.

@oegat I'm a bit surprised by your temperatures, because mine are lower despite using very slow fans in a big "gamer case". Can you tell us the ambient temperature where your machine is running?

On a side note, I’m loving the ambiance around here. Most useful & helpful forum I’ve joined in a very long time.


Sorry - I didn't mean the BIOS wouldn't let me go below 33%; what I actually wanted was to prevent the BIOS from going below 33%. I'd rather have a bit too much airflow at idle than have the fans constantly ramping up and down - I find that less distracting.

And yes, Noctuas are super quiet, but IMO that's because they just aren't spinning very fast or moving much air. But as most of us don't have anything like the energy density of a 2U server packed with a pair of 280W CPUs and four dual-slot GPUs, they'll probably do just fine, unless you need to cool some high-powered but fanless server cards.


Great question! I just now plugged in a USB temp sensor to get the ambient right. It's hanging about 20cm above floor level in front of the case, about level with the CPU cooler.

Here are some readings, with 4x 16GB DIMMs, the 7252, a Gigabyte 7000s 1TB NVMe drive (rated 6.5 watts) and a Radeon 7790 in the box. However, most of the heat from the latter two will likely go upwards (away from the CPU section, since my case is inverted) and exit through the top vent. Fans are the stock case fans, two in the front and one behind the CPU. The CPU cooler is an SNK-P0064AP4.

@jtredux you'll see I haven't changed the thresholds yet; I'll look into that next. Thanks for the info.

Temps at idle:

# Ambient (measured in front of case, 20cm above floor):
Sensor C: 23.50

# output of "ipmitool sensor":
CPU Temp         | 27.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 95.000    | 95.000    
System Temp      | 33.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
Peripheral Temp  | 35.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
M2NVMeSSD Temp1  | na         |            | na    | na        | na        | na        | na        | na        | na        
M2NVMeSSD Temp2  | na         |            | na    | na        | na        | na        | na        | na        | na        
VRMCpu Temp      | 39.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
VRMSoc Temp      | 38.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
VRMABCD Temp     | 37.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
VRMEFGH Temp     | 39.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
P1_DIMMA~D Temp  | 34.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
P1_DIMME~H Temp  | 32.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
FAN1             | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN2             | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN3             | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN4             | 420.000    | RPM        | cr    | 280.000   | 420.000   | na        | na        | 35560.000 | 35700.000 
FAN5             | 1260.000   | RPM        | ok    | 280.000   | 420.000   | na        | na        | 35560.000 | 35700.000 
FANA             | na         |            | na    | na        | na        | na        | na        | na        | na        
FANB             | na         |            | na    | na        | na        | na        | na        | na        | na    

NB: FAN4 is actually three fans - currently all three pre-installed case fans, two in the front and one behind the CPU, are controlled through the case's fan controller (which is set to act as a PWM repeater, though I haven't verified its function). FAN5 is the CPU fan.

I find it remarkable that the CPU at idle is only about 4 degrees above ambient. This is with the pre-applied stock thermal paste. The other temps are comparatively a lot higher, though.

After running mprime on all cores until temps stabilize:

# Ambient (measured in front of case, 20cm above floor):
Sensor C: 23.50
# output of "ipmitool sensor":
CPU Temp         | 49.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 95.000    | 95.000    
System Temp      | 37.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
Peripheral Temp  | 37.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
M2NVMeSSD Temp1  | na         |            | na    | na        | na        | na        | na        | na        | na        
M2NVMeSSD Temp2  | na         |            | na    | na        | na        | na        | na        | na        | na        
VRMCpu Temp      | 47.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
VRMSoc Temp      | 46.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
VRMABCD Temp     | 43.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
VRMEFGH Temp     | 47.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 100.000   | 105.000   
P1_DIMMA~D Temp  | 41.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
P1_DIMME~H Temp  | 41.000     | degrees C  | ok    | 5.000     | 5.000     | na        | na        | 85.000    | 90.000    
FAN1             | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN2             | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN3             | na         |            | na    | na        | na        | na        | na        | na        | na        
FAN4             | 560.000    | RPM        | nc    | 280.000   | 420.000   | na        | na        | 35560.000 | 35700.000 
FAN5             | 1820.000   | RPM        | ok    | 280.000   | 420.000   | na        | na        | 35560.000 | 35700.000 
FANA             | na         |            | na    | na        | na        | na        | na        | na        | na        
FANB             | na         |            | na    | na        | na        | na        | na        | na        | na       

At this stage all cores oscillate between 3.0 and 3.19 GHz (the rated range is 3.1-3.2 GHz). cTDP is untouched, so I believe it is effectively at 125 W.

Fan policy is set to "optimal" in the BMC. The case fans (FAN4) max out at 1000 rpm and the CPU fan (FAN5) at 3800 rpm, suggesting they run at 56% and 48%, respectively, so there is definitely headroom. I've ordered two more case fans; when they arrive I plan to experiment with cooling zones.
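For reference, the ServeTheHome thread linked earlier documents raw fan-zone commands for Supermicro boards; I haven't verified them on the H12 generation yet, so treat this as a sketch:

# read the current fan mode (reportedly 0=standard, 1=full, 2=optimal, 4=heavy IO)
$ ipmitool raw 0x30 0x45 0x00

# set zone 0 to a 50% duty cycle (0x32 hex = 50 decimal); zone 1 is the peripheral zone
$ ipmitool raw 0x30 0x70 0x66 0x01 0x00 0x32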

So what to make of these temps? How do they compare to yours @Nefastor? At least at idle our systems should be sort of comparable.

Yes, it's very comparable. I think I've mentioned before that all my readings were taken at 18 °C ambient, so it makes sense they were lower. Spring is coming, however, so my temperatures are now pretty much the same as yours. All is well with the world! :grin:


Good point - I'll look into this and report back, especially since I plan to virtualize. Currently the system is configured as a single node, and I haven't tried to change that. I guess the topology options are also limited by the number of CCDs? I.e., I cannot make 4 nodes active with a 2-CCD CPU and/or only 4 channels populated?
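For anyone wanting to check how their own board is currently carved up, a quick sketch (assuming numactl and util-linux are installed):

# how the kernel sees the NUMA layout, including per-node memory and distances
$ numactl --hardware

# quick summary of node count and which CPUs belong to each
$ lscpu | grep -i numa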

A related question is whether PCIe slot affinity also matters when running the socket as a single node. I am thinking that perhaps slots sit more or less close to a particular CCD, and that it therefore makes sense to put things like GPUs closer to the CCD that interacts with them, regardless of NUMA. Or is slot affinity only relevant from the perspective of DMA and such?

I thought that Rome and Milan were only one NUMA node - isn't that the great improvement over Gen 1 Naples?

My understanding is that if you want the lowest latency, you absolutely want the cores, RAM and PCIe slots involved to be as close as possible. For us, this means having the NUMA node and GPU aligned for "gaming" VMs and calling it good enough (the Linux kernel supposedly takes care of RAM affinity automatically, as long as you're not using more RAM than is attached to that NUMA node).
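To illustrate what that affinity looks like when forced by hand (a sketch - node 0 and ./my_workload are placeholders; for a VM you'd normally let the hypervisor's pinning options do this):

# run a process with its threads and memory allocations confined to NUMA node 0
$ numactl --cpunodebind=0 --membind=0 ./my_workload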

As far as CCDs go, I'm unsure if there's a practical enough difference for us to care. If someone had an ultra-sensitive workload they'd have to care all the way down to the CCXs on Rome because of the split cache. At that point I'd just get a separate gaming computer lol.

Found this for reference - your 16 cores are definitely different from my 16 cores. Looks like you only have 2 NUMA nodes to work with.


Just test-fired my 7262 in my H12-SSL-NT. While I only really bought it as a cheap socket-filler (used and <£20/core), 8 cores at 3.4GHz with 128MB of cache might actually be quite good for my single-threaded workloads. Will definitely have to put it through its paces once I have some 3200MHz RAM and a case to build it all in.


Right, this makes sense. Though I still don't have a good feel for how much or how little NUMA matters on Rome/Milan compared to previous-gen NUMA CPUs (relating to @jtredux's question).

From what you describe, it sounds like the best scenario for a gaming VM is to split the system into the maximum number of nodes and give only one node to the VM (provided the memory and cores of one node suffice for the workload). Though I wonder how much latency I lose if I, say, configure a max-4-node CPU into 2 nodes, giving one full node (half the CPU) to the VM. Currently I have only two possible nodes, but I plan to get a 4-node Milan chip at some point.

The reason I'm concerned with this is that I made kind of a mistake when speccing my last machine of this kind - back in late 2011 I went with dual 8-core socket G34 Opteron 61xx CPUs (cheap on eBay at the time), thinking that they would make 2 nodes with 8 cores each. However, I had failed to realize that each 61xx CPU was really an MCM with two nodes already, so I ended up with 4 nodes. With only 32GB RAM in that system, a single node (4c, 8GB) for a gaming VM soon became underwhelming :slight_smile:

However, at that time NUMA was quite important, and I get the impression that current EPYC generations depend much less on it. My current plan of getting a 7313P is kind of based on making a Windows VM out of half of it, but that would correspond to 2 nodes.

Another way to mitigate any problems along this line seems to exist - Windows 10 supports 2 sockets, which means it is NUMA-aware. So it should be possible to give a Windows VM 2 NUMA nodes and expose the topology, so that Windows knows about it. It still doesn't help if the workload cannot be adapted, though (e.g. games).
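For example, with plain QEMU/KVM the guest-visible topology can be described with something like the following (a sketch with made-up sizes and CPU ranges that I haven't tested against Windows; disk and display options omitted). Libvirt can express the guest topology with its <cpu><numa> element, and host-side placement with <numatune>.

# expose two guest NUMA nodes, each with 8 vCPUs and its own 16G memory backend
$ qemu-system-x86_64 -machine q35,accel=kvm -m 32G \
    -smp 16,sockets=1,cores=16,threads=1 \
    -object memory-backend-ram,id=ram0,size=16G \
    -object memory-backend-ram,id=ram1,size=16G \
    -numa node,nodeid=0,cpus=0-7,memdev=ram0 \
    -numa node,nodeid=1,cpus=8-15,memdev=ram1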

The 7252 I have now is only 8 cores on two CCDs, which I assume maps to 2 NUMA nodes. When I start experimenting with VMs soon, I'll simply split it in half. This is also one of the "4-channel-optimized" parts.


Nice! I'm looking forward to your reports down the line, and I'm especially interested in the thermals of the Broadcom 10GbE chip - that will essentially tell me whether I made the right choice in not getting the NT.

Btw I tried messing around with the fan thresholds now. Apparently "ipmitool sensor", as well as the BMC web interface, lists only lnr and lcr - lnc can be set but isn't shown, though I'm sure it exists, since the fan was in an "nc" state before the change. Also, speeds can only be set in steps of 140 rpm (whatever you type in gets rounded).

$ ipmitool sensor thresh FAN4 lower 0 140 280
Locating sensor record 'FAN4'...
Setting sensor "FAN4" Lower Non-Recoverable threshold to 0.000
Setting sensor "FAN4" Lower Critical threshold to 140.000
Setting sensor "FAN4" Lower Non-Critical threshold to 280.000

$ ipmitool sensor
...
FAN4             | 420.000    | RPM        | ok    | 0.000     | 140.000   | na        | na        | 35560.000 | 35700.000 
...

I set the lowest threshold (lnr) to 0 to keep the range reasonable.
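If you want to see the hidden lnc value, something along these lines should print the full threshold set for a single sensor (assuming the BMC fills it in):

# full record for one sensor, including all six thresholds
$ ipmitool sensor get FAN4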

NUMA stands for Non-Uniform Memory Access, but on Rome/Milan the compute dies are all connected to a single I/O die. They all have the same path to the DDR and PCIe interfaces.

On Naples, you might have to talk to another compute die to reach the DDR or PCIe connected to it, but on Rome/Milan you seem to have similar latency to each, as it's a single hop…

BTW - just found this HPC / NUMA link which might help… https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1280442391/AMD+2nd+Gen+EPYC+CPU+Tuning+Guide+for+InfiniBand+HPC

I know, this is why I've assumed that NUMA will not be important on the newer chips. However, there are apparently still some irregularities, depending on distances within the complex. I have a memory of seeing an article explaining at least parts of it, but I can't find it now…

What I haven't sorted out yet is whether those still need attention on a practical level. It is apparently possible to configure several nodes within a socket (the NPS BIOS option). The link you posted discusses performance based on it, but doesn't explain it. It is clearly not as important as it used to be, thanks to the unified memory controller.

Sorry - didn't mean to come across as patronising! I'm no expert on this myself, but to my limited understanding, things are much simpler on Rome in general.

You are right though, you can't escape the speed of light - super long paths in the silicon will be split across multiple pipeline stages, and there may well be a hierarchy of arbitration between the various IF/PCIe/DDR controller buses. I'm not sure there are diagrams of the internal details - I'd say that's in the realm of AMD's secret sauce, but I'd imagine there is some marketing abstraction they've shared somewhere.

I guess for the VM case you might want more NUMA nodes so that you can keep a VM on a single compute chiplet (I can't remember whether they're CCDs or CCXs and don't want to use the wrong term). But my understanding was that this was mostly so you can ensure it stays on the same physical cores, so there is less cache-thrashing. And in KVM at least, I think you can just pin to a given core by core number, so I don't really see the difference between pinning to cores 0-7 out of one NUMA node, and having two nodes - one with cores 0-7 and the other with 8-15 - and then pinning to NUMA node #0 to get cores 0-7. I'm sure someone who really understands this will come set me straight soon enough!
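For what it's worth, the by-core pinning I mean looks something like this ('win10' and ./some_program are placeholder names):

# pin guest vCPU 0 to host core 0, vCPU 1 to host core 1, and so on
$ virsh vcpupin win10 0 0
$ virsh vcpupin win10 1 1

# or, for a plain process, confine it to cores 0-7
$ taskset -c 0-7 ./some_program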

Cheers.

Of course, no offense taken! Sorry for cutting it short, I just wanted to establish common ground ASAP :slight_smile:

I feel I lack a lot of knowledge on exactly why we get NUMA-like effects on Rome and Milan. Perhaps @Log knows some details? It is also an open question to what extent it matters; I suppose that's up for testing with my specific workloads. Though in reality, I did not see much practical difference on my former Opteron rig that I mentioned, which was heavily NUMA-constrained. You are right about the pinning - getting the right cores used is no problem. It's memory, PCIe slots etc. that have raised some questions for me, and also how best to inform the VM in case it gets more than one compute chiplet.


I ordered a 7443P from WiredZone a week ago when it showed in stock.

A day later I got an apology email saying the stock level was not accurate and I won't get it till August.


That PCIe slot is at x8 speed because the M.2 and SATA are enabled. This motherboard uses jumpers, for some reason, to configure the PCIe switching. The manual should have the jumper configurations, and the ServeTheHome review also has them on the last page.

It's actually the second PCIe slot that is shared with an M.2 and some SATA, and by default the motherboard is set up so that this slot runs at x16. As you can see from the board's schematic, the first slot has 16 lanes wired directly to it.

In case you haven't read this entire (rather long) thread: back then I was using an experimental Milan BIOS. I don't know for sure if it's related, but since I reverted to the Rome 1.3 BIOS this particular issue has disappeared and slot 1 works at x16. One more reason why beta BIOSes should be avoided if possible.
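If anyone wants to sanity-check this from Linux, something like the following should show the negotiated link width (the bus address is a placeholder for wherever your GPU sits):

# LnkCap is what the slot/device can do, LnkSta is what was actually negotiated
$ sudo lspci -s 21:00.0 -vv | grep -iE "lnkcap|lnksta"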

Side note: in a "gaming" use case like Cyberpunk 2077 there is absolutely no impact from running at x8. I wouldn't have noticed if I hadn't used GPU-Z to check something else. But then again, I'm limited to 4K / 60 FPS.