AMD Epyc Milan Workstation Questions

I’m not sure, and I don’t fancy experimenting on my own H/W TBH!

The BMC’s main job is to make sure all your H/W is healthy. It does have temp sensors on the VRMs etc., and it does control the power, so I’d expect at least an alarm, if not a full shutdown, if temps get too toasty.

One thing I did notice though is that the 10G NIC is active before the system boots, and in that scenario, the system fans aren’t blowing, so there’s no airflow and it gets quite warm. Again, I’d imagine this is well within normal operating parameters, but it’s still something I wasn’t expecting.

Thanks, I agree. A few days ago I moved my current UPS over to keep the EPYC powered, and ordered a new one for the NAS. No crashes for 3 days now; however, the mean time between these crashes has been more like 5-7 days over the month the machine has been running. So it’s too early to tell.

It was always the plan to have the EPYC UPS-secured, these strange poweroffs merely sped up the process to get a UPS in place :slight_smile:
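
To put a rough number on why 3 crash-free days prove little: a minimal sketch, assuming the poweroffs arrive at random like a Poisson process with the 5-7 day mean mentioned above. That model is an assumption for illustration, not something measured, and the function names are made up for this example.

```python
import math

def p_no_crash(days: float, mtbf_days: float) -> float:
    """Chance of seeing zero crashes in `days` if nothing was fixed.

    Assumes crashes are independent events with a constant rate of
    1 / mtbf_days (exponential waiting times). This is an assumption.
    """
    return math.exp(-days / mtbf_days)

def days_needed(mtbf_days: float, alpha: float = 0.05) -> float:
    """Crash-free days required before chance alone explains the
    quiet streak with probability below `alpha`."""
    return -mtbf_days * math.log(alpha)

for mtbf in (5, 7):
    print(f"MTBF {mtbf} d: P(3 quiet days by luck) = {p_no_crash(3, mtbf):.2f}, "
          f"need about {days_needed(mtbf):.0f} quiet days for 95% confidence")
```

Under these assumptions, 3 quiet days would happen by luck more than half the time, and it takes roughly 15 to 21 crash-free days before the UPS can reasonably take the credit.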

As @Nefastor mentioned, we did discuss “around” the question a bit, about VRM cooling and fanwall vs. consumer chassis. The temps in this post are from a beQuiet Silent Base 802 with stock fans: 2 intake and 1 exhaust. That could give you another hint of what to expect, though everything in my box is currently low-power stuff (including the NIC, since I have the H12SSL-I). I have actually added two fans and a second GPU since that post, so I intend to make an updated report soon.

I’ve tested this now, by cutting power with the switch on the back of the PSU and watching the BMC heartbeat LED. I did it under two conditions:

  1. System turned off
  2. System on and posted, showing BIOS setup screen

In both cases the heartbeat goes on for about 8 seconds (8 blinks of the heartbeat) before dying. Logging in to the BMC over the network still worked 4 seconds after power was cut. So it definitely has something to keep it alive long enough to log a power loss.

There were no traces in the BMC logs of power supply being briefly cut. However, my Corsair AX850 PSU lacks an I2C cable, which is presumably the way the PSU would tell BMC about power events. So the lack of logging is not surprising in my case!

Conclusions so far (Supermicro H12SSL):

  • BMC heartbeat lives for up to 8 seconds without power
  • BMC is reachable by network at least up to 4 seconds
  • No logging of power events based on BMC sensing system power / voltages etc. (though probably by I2C bus if existing)
    → several seconds of BMC functionality after power loss
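
For anyone wanting to repeat the "how long is the BMC reachable" measurement with a timestamp instead of a stopwatch, here is a minimal sketch. It only checks whether a TCP connection to the BMC web interface can be opened. The port (443) and the address in the usage comment are assumptions, and the function names are hypothetical; adjust them to your own setup.

```python
import socket
import time

def bmc_reachable(host: str, port: int = 443, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def watch(host: str, port: int = 443, duration: float = 15.0,
          interval: float = 0.5, timeout: float = 0.5):
    """Poll the BMC for `duration` seconds.

    Returns a list of (elapsed_seconds, reachable) samples.
    """
    start = time.monotonic()
    samples = []
    while True:
        elapsed = time.monotonic() - start
        if elapsed >= duration:
            break
        samples.append((round(elapsed, 1), bmc_reachable(host, port, timeout)))
        time.sleep(interval)
    return samples

def last_reachable(samples):
    """Elapsed time of the last successful probe, or None if none succeeded."""
    times = [t for t, ok in samples if ok]
    return times[-1] if times else None

# Usage (hypothetical BMC address): start the script, then flip the PSU switch.
# samples = watch("192.168.1.50")
# print("BMC last answered after", last_reachable(samples), "seconds")
```

A successful TCP connect is a weaker check than an actual login, so treat the result as a lower bound on "alive", not proof that the web interface is fully usable.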

Can it explain my poweroff problem?

Recall: the system suddenly switches off, while the BMC stays alive and logs nothing. Nearby computers (on an adjacent wall outlet) are not affected, only my H12SSL system. The system won’t start until I completely remove power and restore it.

The problem I’m experiencing is consistent with the above: a tiny power glitch could kill the system without BMC noticing (or dying). Now it is 4 days of uptime since I put the UPS in place, and counting.

However, I did a final test to see if I could reproduce this behaviour:

“System won’t start until I completely remove power and restore it.”

So I tried this:

  1. Switching off at the back of the PSU when the computer was running (so it blacked out)
  2. Switching power back on after 4 seconds (so the BMC stayed alive)
  3. Pressing the soft-power button at the front of the case

Expected behaviour if my sudden shutoffs are due to power loss: Step 3 does nothing.
Actual behaviour: Step 3 started the system normally.

This argues against the hypothesis that my problem is a power delivery problem. However, I did not try removing power for less than 4 seconds. Maybe a tiny glitch (<< 1s) leaves the PSU or MB in an undefined state. I’m not testing that for now, as I suspect it is not healthy for the system. Time and the UPS will have to tell.

I’ve just remembered, the BIOS has settings regarding the behavior of the system after power loss. What is yours set to?

It is set to remain powered off. I have not tried setting it to “power on” or “last state”, but I suspect it won’t help since the soft-power function completely stops working until I remove and return power by the PSU switch :confused:

If you run a UPS in front of a host, you should set the power resumption to “Last state”.

I never had any power resumption issues after power glitches or lost power conditions with this setting. Allows the UPS to do its job of keeping the host alive as long as possible or until told to shut down gracefully when battery reserves fall too low.

To me this sounds like the BMC or its firmware is the issue here, as removing the mains supply is the only way to (hard-)reset the BMC…

@jtredux I’m not sure I understand, I can restart the BMC from its web interface, even in the “problem” state. But when doing so the system can still not start. Power-cycling everything including BMC allows start of the system. Or do you mean that power-cycling BMC usually implies a harder reset (of BMC) than rebooting it from the interface?

Could it not as well be that the BMC is OK, but something in the mainboard’s power delivery (for the system, not BMC) gets stuck in some undefined state, that is cleared when removing power completely?

BMC is on the suspect list, of course. Next time I encounter the problem, I’ll try removing line power for 4 seconds, so that the BMC does not reboot but the rest of the board gets some time to clear itself up.

This is what I typically do, however now when configuring/troubleshooting I’ve had it at staying off to increase control. I am testing with “last state” currently, on the odd chance that it will allow the machine to start after the error somehow. (I don’t expect it to, since no soft-on had worked previously when in that failed state, but if it does I’d like to know it).

Btw I contacted my reseller today. They suggested reseating components and clearing CMOS as a precaution, which I did; however, they too suspected this will be an RMA in the end.

Does the web-interface actually allow you to reboot the BMC firmware, or just the main-CPU? Wouldn’t rebooting the BMC be cutting off the branch you’re sitting on?

It is possible that something on the board, e.g. a CPLD or FPGA, isn’t soft-resetting correctly, and only Power-On-Reset does the trick, but I’m not sure how you could actually diagnose this without probing the PCB and checking that the BMC is wiggling the correct wires at the correct times, and you’d need the code/docs for that.

I’ve not used this supplier, but was considering them for a GPU. However, they claim to be able to get Milan CPUs:

It does, and it does :slight_smile: Here are the options:

“Unit reset” reboots the BMC, and it works even if the main system is running. I get thrown out of the web interface, and back to the login screen after a timer delay.

The power LED starts blinking as always when BMC boots, and then fans speed up for a little while (as if powering on the system after BMC is booted). The running system is afaik untouched.

The green button to the right in the image indicates that the system is on, pressing it lets me power on/off/reboot the main system.

My Milan 7443P is arriving tomorrow from Germany. I got it from CTT Computertechnik AG for 1,233 EUR VAT free, so yes, they are available now. Now I’m looking for a ROMED8-2T; they will get it by August, so I have to wait :frowning:

What country did you ship to?

Well, I’ve just about got this machine up and running to my liking with this Tyan S8030 and a UPS man has just appeared with my replacement H12SSL! So time to take it all apart again I guess!

For anyone it may help in the future, for some reason setting PCIe compliance to ‘Enable’ on the Tyan S8030 actually caused all my PCIe cards to disappear! I’d keep the default of ‘Disable’.

Well, it took me about 50 mins, but my motherboard swap is complete, just booted back into Proxmox, this time on the H12SSL-NT, with the GPU running :slight_smile:

Nice! Was this a cross-ship, or did they want to check your MB before sending the new one? How long were you without an H12SSL in total? Good to know if I end up needing an RMA.

What will you do with the Tyan board now? Keep it as a spare part?

I requested the RMA back on June 3rd, and finally got permission to send it back on the 15th. I sent it that day, and tracking indicates it arrived on the 17th. On the 22nd I was contacted to confirm I’d returned it, which caused me a little alarm, and I got a replacement board today, the 23rd, so approx 3 weeks turn-around.

I don’t know what’s been found with the board I returned, just that this is another one.

Does anyone with an H12 board know if you can disable the all-fans-to-max on power-on? I just measured 76 dB on my sound meter. This SC747 case has some very serious fans for passively cooled GPUs!

Austria