AMD Epyc Milan Workstation Questions

jtredux · May 13, 2021, 10:11pm

Sorry - didn’t mean to come across as patronising! I’m no expert on this myself, but by my limited understanding, things are much simpler on Rome in general.

You are right though, you can’t escape the speed of light - super long paths in the silicon will be split across multiple pipeline stages, and there may well be a hierarchy of arbitration between the various IF/PCIe/DDR controller busses. Not sure if there are diagrams of the internal details - I’d say that would be in the realm of AMD’s secret-sauce, but I’d imagine there is some marketing abstraction they’ve shared somewhere.

I guess for the VM case you might want to have more NUMA nodes so that you can keep a VM on a single compute chiplet (can’t remember whether they’re CCDs or CCXs and don’t want to use the wrong term). But my understanding was that this was mostly so that you can ensure it stays on the same physical cores, so that there is less cache-thrashing. And I think in KVM at least, you can just pin to a given core by core number, so I don’t really see what the difference would be between pinning to cores 0-7 out of 1 NUMA node, or having 2 nodes (one with cores 0-7 and the other with 8-15) and then pinning to NUMA node #0 to get cores 0-7. I’m sure someone who does really understand this will come set me straight soon enough!
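To make the "pin by core number" idea concrete, here is a minimal sketch that generates the `<cputune>` / `<vcpupin>` fragment libvirt uses in a domain XML for KVM guests. The helper name `cputune_xml` is made up for illustration, and the assumption that host cores 0-7 all sit on the chiplet you care about has to be checked on the actual machine (e.g. with `lscpu`).

```python
from xml.etree import ElementTree as ET


def cputune_xml(host_cpus):
    """Build a libvirt <cputune> fragment pinning vCPU i to host_cpus[i].

    Hypothetical helper for illustration; the element and attribute names
    (cputune, vcpupin, vcpu, cpuset) are the ones libvirt uses.
    """
    cputune = ET.Element("cputune")
    for vcpu, host_cpu in enumerate(host_cpus):
        ET.SubElement(cputune, "vcpupin", vcpu=str(vcpu), cpuset=str(host_cpu))
    return ET.tostring(cputune, encoding="unicode")


# Pin an 8-vCPU guest to host CPUs 0-7. The same fragment applies whether
# the host exposes one NUMA node or several; only the CPU numbers matter.
xml = cputune_xml(range(8))
print(xml)
```

Note that this only pins the vCPUs. Where the guest's memory is allocated is a separate setting, which is where the number of NUMA nodes exposed by the host could still make a difference.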

Cheers.

oegat · May 13, 2021, 10:37pm

Of course, no offense taken! Sorry for cutting it short, I just went for ensuring common ground asap

I feel I lack a lot of knowledge on exactly why we get NUMA-like effects on Rome and Milan. Perhaps @Log knows some details? It is also an open question to what extent it matters. I suppose it is up for testing with my specific workloads. Though in reality, I did not see much practical difference on the former Opteron rig that I mentioned, which was heavily NUMA-constrained. You are right about the pinning, getting the right cores to be used is no problem. It’s memory, PCIe slots etc. that have raised some questions for me, and also how to best inform the VM in case it gets more than one compute chiplet.

KeithMyers · May 14, 2021, 1:10am

I ordered a 7443P from WiredZone a week ago when it showed in stock.

A day later got an apology email that the stocking level was not accurate and I won’t get it till August.

1s44c · May 14, 2021, 5:50pm

That PCIE slot is at 8X speed because the M.2 and Sata are enabled. This motherboard uses jumpers for some reason to configure pcie switching. The manual should have the jumper configurations, and the ServeTheHome review also has them on the last page

Nefastor · May 15, 2021, 6:29am

It’s actually the second PCIe slot that is shared with an M.2 and some SATA, and by default the motherboard is setup so that this slot is x16. As you can see from the board’s schematic, the first slot has 16 lanes directly wired to it.

In case you haven’t read this entire (rather long) thread, back then I was using an experimental Milan BIOS. I don’t know for sure if it’s related, but since I reverted to the Rome 1.3 BIOS this particular issue has disappeared and slot 1 works as x16. One more reason why beta BIOS should be avoided if possible.

Side note : in a “gaming” use case like Cyberpunk 2077 there is absolutely no impact to running in x8. I wouldn’t have noticed if I hadn’t used GPU-Z to check on something else. But then again, I’m limited to 4K / 60 FPS.

MasterMace · May 21, 2021, 4:03pm

Finally received my EPYC 7313P CPU yesterday and finished my build. Always a moment of anxiety whether or not it will POST, but that all went well. Already updated the BIOS and it posted beautifully, temps all OK, all memory recognized…Bob is your uncle!

While the PC was in the BIOS, I started trying to get the newest BMC firmware loaded, so I would be able to adjust the fan speeds, but unfortunately that didn’t go as planned. Neither the IPMI web interface (in different browsers) nor the ‘SocFlash’ tool worked, so no dice on updating. I even got a newer version of SocFlash, but that keeps saying it ‘Can’t Find the Device’ after invoking the update commands…

Anyone here who was able to update the IPMI and how?

The other (more worrying) thing I see is that, out of the blue, the server reboots spontaneously. It happened just idling in the BIOS, during the Windows install, in Windows and also while running MemTest86.

All temps are OK, voltages seem fine and it really happens sporadically (the first few times it was about an hour in between), and it has now been up and running MemTest for over 6 hours. MemTest doesn’t find any memory errors, so I’m a bit puzzled here. Any advice?

Nefastor · May 22, 2021, 7:23am

If you recall, I was not able to flash the Milan BMC firmware using a Rome CPU. Here’s what you told me a couple of weeks ago :

I’m assuming you followed that exact procedure and it didn’t work. That’s very annoying.

Regarding spontaneous reboots, I’ve had only one. But that was while using the Milan BIOS with my Rome CPU so that might explain it. If it happens to you, there may be other issues. Of course the board itself might be bad, but that seems unlikely. If I had to bet, I’d say your processor might not be installed as perfectly as it could. Did you use a torque wrench to close the socket ? No contaminant got into the socket during installation ? I’d remove and reinstall the processor just to be certain, if I were you.

Blunt talk, now : if you can’t install the proper BIOS for your CPU and/or you still get spurious reboots, don’t wait : return the motherboard for a refund. I know it’s a sexy beast but it has to perform.

I believe at least one of us got a Supermicro board : how is it working out for you ?

MasterMace · May 22, 2021, 9:08pm

Did exactly as I posted earlier, indeed, @Nefastor . Updated the Bios without a problem and without the CPU in the socket (is a Milan 7313p). IPMI is no dice, even with a newer SOCFLASH tool from Asrock Support (who are really fast in replying BTW).

Used the Threadripper Torque wrench (sent by the online retailer to me, to be sure the screws were correctly tightened), stopped after the first click…As for the bad seating of the chip: Wouldn’t that have a constant consequence and not at random intervals? I shielded the faceplate with a cover until just before putting the CPU in place, never had any issue with seating CPUs, but…once could be the first time. Laid down the case, so the board is now horizontal, that should solve the ‘not tight enough’-option. (Just trying, can’t hurt)

I did investigate further and found log entries regarding the Bios: System Firmware Progress, at the exact time the restarts were…

Next to that, I read about some issues with this motherboard (and other Asrock Rack boards), which don’t seem to like Seasonic PSUs. Will align on that with Asrock.

Last thing I noticed in Windows, HWI reported a ‘CPU die’ temp of 97C, while all cores were around 30C. Seems a strange high temp, maybe incorrect sensor, but I’ll keep an eye on that too!

And thanks for the advice on returning when no solution is found: I always let the retailer I want to buy from confirm the choice of hardware and the compatibility of the parts with each other. Even did that this time with Asrock and Kingston, to confirm compatibility of board, memory and NVMe.

To be continued…

KeithMyers · May 23, 2021, 2:09am

Curious why use this SoCFlash tool? Why not update the BMC firmware via the IPMI menu?

MasterMace · May 23, 2021, 6:41am

Because that update fails. Tried it with Chrome, IE and Opera, but with all of them the installation seems to ‘hang’ on verifying the uploaded file.

If you’ve got tips on how to make it work with IPMI, please share!

oegat · May 23, 2021, 8:41pm

Congrats! You seem to be the first in the thread who actually got hold of a Milan chip (well, except wendell…). Where in the world are you based? I haven’t tried ordering one yet, since I have the Rome chip to play with for now.

Btw, I’m very curious what kind of boost speeds you will see on 8 or 16 cores’ load. Esp. at max cTDP. Please let us know when you get the chance to test! (prioritize the reboot issue though…)

Mine (H12SSL-I) came with the latest BIOS and BMC firmware, so I haven’t tried updating. The present firmwares seem mature and working.

When it comes to unexpected events, I too have a small source of worry - I have never experienced spurious reboots, as @MasterMace did, however I have seen two spurious shutdowns:

1. During the first few hours of uptime, with Xubuntu 20.04.2 running and the screensaver having blanked the screen, it simply powered off. Turned it on, checked the logs, everything looked like a sudden power loss (e.g. nothing in the OS logs).
2. After more than a week of daily operation, a few days ago: I found it off when I had left it on. Pressed power: nothing happened. Logged into BMC, got in, tried starting from there - got an uninformative error message. Soft reset BMC - no change. Unplugged power for a while and re-plugged it - it started as normal. No traces in the logs.

For both (1) and (2) there was nothing relevant in the BMC events log either. Now the machine is not behind a UPS currently, and while another computer in the same room was on at the time, perhaps this system is more sensitive to tiny power glitches. That would explain (1), but not as easily (2).

Also while (1) is recoverable in a Server scenario, (2) is likely not. If BMC cannot start it, then I assume that the “resume last state after power loss” setting cannot either. So it is a bit worrying.

PSU is Corsair AX850, most components are modest power-wise. Anyone having a guess what’s going on?

MasterMace · May 23, 2021, 9:21pm

Thanks, @oegat ! I’m in the Netherlands and really have been hammering my local supplier, the wholesaler/importer in the Netherlands and even AMD (didn’t get any wiser from the last one…). They were so eager to deliver the CPU to me that they directly forwarded the AMD package when it arrived at their warehouse…Resulting in:

Yes, that’s right, they sent me 4…

Will of course be returning them…

And some other good news too: The server has been online for just over 24 hours, running W10 Pro for now, so fingers crossed! (Will be a Proxmox server after I’ve done all stability testing).

@oegat Let me know what kind of tests you want to see, more than happy to run them! Seen speeds as low as 1375 MHz and as high as 3725 MHz. Now idling at 1500 MHz, cores around 25C and pulling around 50 Watts of power!

oegat · May 24, 2021, 12:09pm

Wow, perhaps they mixed up the listing of a “tray” CPU with “a tray of CPUs”

Thanks for being willing to test! I’m curious about what clock speeds to expect given a certain number of cores loaded. If on Windows, you could try this Sysinternals tool: CpuStres - Windows Sysinternals | Microsoft Docs

There you can start a bunch of processes, and set “Activity level = Maximum” in the menu. It is also possible to pin processes to specific cores. (did not work when I tried)

You could test e.g. to run 4, 8, 12 or 16 processes in parallel, and monitor the CPU frequency in task manager - I assume the number reported there will be the max frequency among all running cores.

What I’m not sure about is how SMT will be used in this scenario - if one thread per core is loaded first, then the above strategy will work - however if the scheduler opts to use both threads on a core before loading the next one, then we would need to double the numbers. I’ll investigate this on my Rome chip and get back… [the first scenario, one thread per core loaded first, is the correct one, see results below]

(I think for a “gaming VM” scenario it would make sense to load all cores on a CCD at the time, and test with 1, 2, 3, or 4 CCDs loaded. As a VM would probably run on 1-2 CCDs in order to share cache. It is possible to pin processes to cores within CpuStres, however it looks a bit cumbersome for 32 threads as it can seemingly only be done from the menu, one process at a time. It might be easier on Proxmox in the end.)
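A rough sketch of how the "1, 2, 3, or 4 CCDs loaded" plan could be laid out before pinning anything. It assumes physical cores are numbered contiguously, CCD by CCD, and the 4 CCDs x 4 cores figures are only example values; both assumptions should be verified against Coreinfo or `lscpu` on the real chip. The function names are made up for illustration.

```python
def cores_for_ccds(n_ccds_loaded, cores_per_ccd):
    """Physical core indices covering the first n CCDs.

    Assumes cores are numbered contiguously CCD by CCD, which is an
    assumption to check with Coreinfo / lscpu, not a given.
    """
    return list(range(n_ccds_loaded * cores_per_ccd))


def ccd_test_plan(total_ccds, cores_per_ccd):
    """Map 'number of CCDs loaded' -> list of cores to load for that step."""
    return {
        n: cores_for_ccds(n, cores_per_ccd)
        for n in range(1, total_ccds + 1)
    }


# Example values only: a hypothetical 16-core part with 4 CCDs of 4 cores.
plan = ccd_test_plan(total_ccds=4, cores_per_ccd=4)
for n_ccds, cores in plan.items():
    print(f"{n_ccds} CCD(s): load cores {cores[0]}-{cores[-1]}")
```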

EDIT:

It looks like Win10 schedules one thread per core first. Here is a screenshot from my 7252 (8c16t) when running CpuStres:

Every other logical CPU gets loaded, which corresponds to one thread per core (I checked the order with Coreinfo). Generally, it looks like Win10 spreads threads widely, utilizing all CCX when possible.
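The "every other logical CPU" pattern can be written down as an affinity bitmask, which is the form Windows tools such as `start /affinity` expect (as a hex number). This is a minimal sketch assuming SMT siblings are numbered adjacently (logical CPUs 0 and 1 on core 0, 2 and 3 on core 1, and so on), which matches what Coreinfo showed here but is not guaranteed on every OS; Linux often numbers siblings differently. The function names are illustrative only.

```python
def one_thread_per_core(n_logical, threads_per_core=2):
    """First logical CPU of each core, assuming adjacent sibling numbering."""
    return list(range(0, n_logical, threads_per_core))


def affinity_mask(cpus):
    """Bitmask with bit i set for every logical CPU i in cpus."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


# 7252: 8 cores / 16 threads
cpus = one_thread_per_core(16)
mask = affinity_mask(cpus)
print(cpus, hex(mask))
```

For the 8c16t case this selects logical CPUs 0, 2, 4, ..., 14, i.e. the mask 0x5555.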

The “Affinity” and “Ideal CPU” columns seem buggy, I could not use them to control thread placement.

Forgot to include the clock reading in the picture above:

I believe that neither Rome nor Milan would leave one core at a higher clock while clocking down the rest under load, which is why the single number from Task Manager (on the left) should suffice. If unsure, one can check with CPUID HWMonitor (right).

Interestingly, my 7252 never goes below its max speed (3.2 GHz) under this type of load. With Prime95, however, it went down to base when fully loaded (probably due to AVX). And this is on the lowest cTDP setting.

Testing of clock dynamics on Win10 (tentative procedure)

@MasterMace (provided you are still running Windows; or anyone else with a Milan chip running Windows): it would be interesting if you could try using CpuStres to start “max activity” threads 4 at a time, up to 32, and note the max clock (as given by Task Manager) at each step. E.g.:

Threads  clock
4t       ?
8t       ?
...
32t      ?

Preferably at max cTDP setting. What is interesting to see is how quickly max clock drops when more and more cores are loaded. NB: this result will not speak directly to the scenario when a CCD-pinned VM is running under load - for that we need to control core affinity too, which adds complexity.

Optional second test

Repeat the same procedure with Prime95 instead of CpuStres. Prime95 uses AVX and is therefore more power-hungry - I believe it will model more of a worst-case scenario.

Final remarks

The above could surely be scripted, but my savviness with cmd/PowerShell is limited… - I’ll try to come up with a procedure for *nix too, using a bit more automation.
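For the *nix side, something along these lines might work as a starting point. It is only a sketch: it assumes an x86 Linux host where `/proc/cpuinfo` exposes a `cpu MHz` line per logical CPU, it uses a plain integer busy-loop (so it is closer to CpuStres than to Prime95's AVX load), and it does not control core affinity. The names `max_mhz`, `burn` and `sweep` are made up for this example.

```python
import multiprocessing as mp
import re
import time


def max_mhz(cpuinfo_text):
    """Highest 'cpu MHz' value found in /proc/cpuinfo-style text, or None."""
    values = [
        float(v)
        for v in re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, re.M)
    ]
    return max(values) if values else None


def burn(stop):
    """Busy-loop until the stop event is set (integer work only, no AVX)."""
    x = 0
    while not stop.is_set():
        for _ in range(100_000):
            x += 1


def sweep(thread_counts, settle_s=5.0):
    """For each worker count, load the CPU, then record the max clock."""
    results = {}
    for n in thread_counts:
        stop = mp.Event()
        workers = [mp.Process(target=burn, args=(stop,)) for _ in range(n)]
        for w in workers:
            w.start()
        time.sleep(settle_s)  # let the clocks settle under load
        with open("/proc/cpuinfo") as f:
            results[n] = max_mhz(f.read())
        stop.set()
        for w in workers:
            w.join()
    return results


# Parser check on a small hand-written sample (not real measurements).
sample = "cpu MHz\t\t: 1500.000\ncpu MHz\t\t: 3725.125\n"
print(max_mhz(sample))
```

Calling `sweep(range(4, 33, 4))` from a script would then fill in the 4t, 8t, ..., 32t rows of the table above, one max-clock reading per step.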

Finally, I think the most interesting result would be at max cTDP setting. According to @wendell in another thread, Milan can boost quite a lot over rated base clock even under full load. I’d also be curious if Milan ever gets significantly above rated max boost.

oegat · May 24, 2021, 8:02pm

Windows 10 on H12SSL-I

Supermicro H12/H11 boards do not officially support Windows 10, but I tried it just to know how well or unwell it would work (and also to get a bare-metal baseline for my planned VM benchmarks). Here is my short report.

Bare-metal installation failed

I could not install Windows 10 (latest ISO downloaded from MS days ago) from USB stick on bare metal. The installation stopped at the stage where disks are selected, claiming something along the lines of “missing drivers”. The window that usually shows available disks was empty, despite both NVMe and SATA devices existing in the system. Possibly there are drivers that could be loaded via a second USB stick, that would make installation possible - I did not investigate this.

Bare-metal boot of VM-installed image succeeded

Then I installed Windows virtually, to a passed-through NVMe drive (Gigabyte AORUS 7000s) using libvirt under Linux. After that I could select the drive as boot drive in BIOS, and Win10 booted. After a short automatic “configuring devices” it was operational. I’m writing this from Win10 on bare metal. I might add that when installing I used PCIe passthrough of the NVMe, not device passthrough - the latter did not work for me. I don’t know why.

Device list after boot

Here are all the devices according to device manager, when running Windows 10. No drivers were added manually.

All the things that are unavailable look non-critical to me. I took note of their PCIe addresses. I’ll check their ids when back in Linux, and add to this post.

wendell · May 24, 2021, 8:30pm

max boost 1t? only about 50mhz or so. All core? With ctdp up, base is higher in my testing in most scenarios, if no thermal limit applies.

MasterMace · May 25, 2021, 8:49pm

Wow…that’s quite a list @oegat , have to take some time to digest, analyse and see how I can test that. Have some patience, please…

First now trying to get the VGA output working, currently using KVM to work in the Windows10 install, but the resolution and screen size is awful and not to be adjusted it seems

Of course no VGA connector on my monitor, so I bought a VGA2HDMI converter, but I don’t get any signal to the monitor. Adjusted the output to VGA in the BIOS, but still no dice…Any tips?

oegat · May 26, 2021, 6:04pm

Sorry, I’m notorious for thinking up complicated experiments for others to do…

Though I probably overcomplicated the procedure by writing a lot of how I tested it out - it should really suffice to start a bunch of worker threads in CpuStres and monitor the frequency. In all events - only if you find the time!

Anyway, for your VGA problem: Are you not getting a signal through the VGA2HDMI (1) even at post? Or (2) only from within Windows?

One thing you could check in case (1) is whether the OPROM to load is set to EFI or Legacy? It might need to be set to the same as the mode you are booting in, for the display to initialize. (now I’m assuming that your and my board have similar bios options)

I have a VGA2HDMI adapter from “Goobay” (who chose that brand name and why?), and it shows the POST display for me, as well as Linux (tried only text mode). In Windows the onboard VGA is displayed as a “Microsoft Basic Display Adapter” with a yellow “!” and Code 31, however this might have to do with me booting from another display, rather than incompatibility.

(when you say KVM I assume you are running Windows on bare metal and access the machine through iKVM, not that you run Windows virtually. Correct me if I’m wrong).

Nefastor · May 27, 2021, 12:33pm

It’s all about signal integrity. Any electrical connection can have three states : we all know about “open” and “closed” but there’s a dreaded third one called “clopen” (I just made that name up). When an electrical connection is clopen, for example due to oxidation on a pin, you get increased electrical resistance but not necessarily to the point where signal can’t go through. Clopen is a grey area : if it’s “copen”, a.k.a. “good enough for the ladies I go out with”, you might have no issue at all. If it’s “clospen” you might have a lot of glitches.

Connector manufacturers do make sure that the rated mating force on their products is safely in excess of what’s required for the product to perform as advertised. However when you’re pushing the envelope, i.e. with hundreds of multi-gigabit signals closely packed, tolerances shrink to almost nothing.

Nobody wants a clopen circuit, it’s the most annoying thing to identify and trace back. It can also kill you. Time for a horror story straight from my own life.

Around 20 years ago, I had an Nvidia nForce-based PC I built myself. It worked well for a couple of years, I was still living at my parents’ and working out of my bedroom. I started dry-coughing a lot. Also, whenever I drank water there was a 50% chance it would go down the wrong way, causing more coughing. After a week I went to a doctor and he had no idea what was wrong with me. No virus, no infection, no nothing. I kept on coughing and swallowing with my lungs instead of my stomach.

A couple of days after that, my mom walked into my room and noted that there was a weird acid smell. I didn’t smell anything so I asked her to describe it.

Luckily, I already had my engineering degree. I recognized she was describing the smell of ozone.

I immediately aired the room and went to the ER (I’m French, this cost me nothing) and they confirmed I had low-level ozone poisoning. Side effects include loss of sense of smell and, indeed, problems with swallowing. The main effect is lung damage, but at low levels this is reversible and indeed I recovered quickly.

Now, you’ve probably guessed the problem was a clopen circuit somewhere. I had to look for it very hard.

Eventually I found it in my PC, in the motherboard’s ATX power connector. One of the pins was completely charred and the white insulator around it had turned black. This was a 12 V pin. Evidently, it wasn’t making proper contact and kept making electrical arcs which released ozone… which the fans blew into my room.

Because the ATX connector has several 12 V pins, however, this clopen one did not prevent the computer from working.

If my room had had better ventilation, chances are I would never have noticed until the arcs kept burning away the insulation on the motherboard and started an electrical fire.

And that, ladies and gentlemen, is why regulations are good for you. And proper quality assurance testing. And constant vigilance.

Not trying to scare you, or anything… I’d like to add that at the time I only had the money for store-brand power supplies. The motherboard was also a fairly cheap MSI and I understand they have a bad reputation. That being said, I’ve heard of self-incinerating PCIe risers and exploding power supplies as recently as this month so… be careful

Nefastor · May 27, 2021, 1:14pm

So. I’ve taken a few days off. Being your own boss means you tend to overwork yourself. I relaxed by playing through Subnautica. Very relaxing game, it reminded me of those old Windows “aquarium” screen savers. If you’re right-leaning politically you might not like the environmentally-friendly vibe, but that’s on you.

Anyway, back on topic : I’ve now lived with an EPYC workstation for a whole month. While it’s not a Milan (it’s a Milan-capable Rome) I thought I’d share personal conclusions that apply to any EPYC generation.

First and foremost, it’s a cool machine. As in, you won’t need liquid cooling with a single GPU in it, and it won’t sound like a jet engine. It also uses very little electricity. If this is going to be your workhorse, those are definitely desirable traits. TR Pro clocks faster but the difference may not be enough to justify the additional expense. Believe it when @wendell waxes lyrical about EPYC : it is indeed glorious and a sight to behold, as I believe he put it.

However this is a server part. This is most noticeable in two areas : lack of mainstream connectivity options on the motherboard and lack of higher ACPI support, which I’ve discussed at length on this thread. I’ve tried it for a month and I thought I might get used to it, or even leave my PC on all the time, but really, the inability to hibernate is something I really, really regret.

(As for the connectivity option, I actually prefer being able to choose which I/O my system has, and it’s not like I need 675 USB ports)

A more… “subjective” thing, perhaps : software compatibility can be weird. I mentioned Cyberpunk 2077. Having both Intel and AMD machines under my desk I can confirm that there are more bugs when running on AMD than on Intel. A lot more bugs. I had never seen a T-pose until I started running the game on EPYC. So there’s clearly a minuscule difference somewhere between Intel and AMD. It’s likely that CDPR developed CP2077 on Intel machines, but AMD is very popular with gamers today, and that may explain the huge amount of bugs this game is criticized for. Again, none of which I’ve experienced while running on Intel.

In other words, if any gamer is reading this : maybe EPYC isn’t for you. Or AMD in general. But I’m no gamer so I probably don’t know what I’m talking about. I’m just “telling it like it is”.

I built this EPYC workstation for evaluation purposes and I do have a NAS to upgrade from an antique Core i7-975, so this EPYC beast will likely end up in my rack. I still need a new workstation but I think it won’t be an EPYC after all. And yes, it’s just because of that “hibernation thing”.

I may well use a TR Pro instead. The Gigabyte WRX80-SU8-IPMI is starting to become available in Europe and that board should fit nicely in my tower case. I really need only 16 cores so I may get away with using the same Noctua heatsink.

I really like the ROMED8-2T, though. It’s good value for money. If you’re going to build an EPYC machine, it has my “seal of recommendation”

I’m not moving to TR Pro for a while. I’ll still be haunting this thread, hoping for news of a hibernation-capable Asrock BIOS update against all reason

PrincessAsu · May 27, 2021, 1:29pm

What are the boot times like into Windows (non-virtual) for a workstation using EPYC? I currently shut off my machine properly whenever I don’t use it instead of hibernating, but I also enjoy Fastboot etc., so I was wondering just how much of a crutch this is, or if it’s because you can’t just keep your apps open and keep working when you come back?