There’s one log where I did shut down the OS to remove some SSDs (see below).
Story on the SSDs that might be causing this issue:
I have too many SSDs in this NAS for the +5V rail, but they’re spread out between two PSUs and have been working fine up to this point. If there was a PSU failure, I’d assume it would shut off the system rather than randomly restart it.
Also, the peak power draw numbers are nowhere near where they’ve been when I perform a scrub on this zpool.
I took out 48 SSDs tonight and moved them to an externally powered SAS expander to put me back even lower than I was 2 months ago (power-wise and in terms of internal drive count). That didn’t fix anything.
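For a rough sense of scale on the +5V concern (back-of-the-envelope, assuming ~1.2A peak per SATA SSD, which is a typical spec-sheet figure): 48 drives × 1.2A is roughly 58A, i.e. close to 290W on the 5V rail alone, which is far more than a single PSU's 5V rail is usually rated for.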
Not that I’ve changed this anytime recently, but I did plug my NAS’s redundant PSUs into 2 separate UPSs.
Since both PSUs are separate entities, it should be fine, but they’re also splitting the load.
I’ve had my NAS in this configuration for a few days now and don’t remember seeing any restarts.
Could this be the issue?
upsmon
I also have an HDD zpool in a separate chassis on a 3rd UPS by itself.
All 3 of those UPSs have USB cables plugged into the back of my NAS.
It's possible upsmon is forcing the system to restart 5-10 minutes after booting. But wouldn't that be a normal kernel-initiated shutdown and show up in journalctl?
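If upsmon were triggering it, the previous boot's journal should show it, so a quick check would be something like this (assuming persistent journaling; the UPS name is a placeholder for whatever is in ups.conf):

journalctl --list-boots
journalctl -b -1 | grep -iE 'upsmon|nut|shutdown'
upsc <upsname>    # what each UPS is reporting right now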
SSDs still
It could still be the SSDs. It seems like any operation after the system boots that messes with the SSDs causes it to die.
The IPMI tool from Supermicro says the +5V rail is at 4.99V, which should be fine, right?
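For reference, that's just the BMC sensor reading; the same value plus its critical thresholds can be pulled from the CLI (exact sensor names vary by board):

ipmitool sensor | grep -iE '5v|volt'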
It also runs a bunch of snapshot backup tasks as soon as it boots, and those run fine, but if I trigger one manually, it's not long before the system shuts down.
This could be because of the upsmon issue, or it could be because reading the SSDs causes this to happen.
Looks like something in the middleware is scheduling reboots considering the reboot cron job line only appears when you do a shutdown or reboot through the GUI, at least on my own system.
EDIT: Since you said you’re using upsmon which is part of the middleware, it seems likely to be related.
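If you want to rule that out quickly, check whether a scheduled reboot entry actually exists anywhere (paths and the unit name here are my guess for SCALE):

crontab -l | grep -i reboot
grep -ri reboot /etc/cron.d/ 2>/dev/null
journalctl -u middlewared -b -1 | grep -i reboot   # assuming the middleware runs as 'middlewared'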
The issue appears to be my NIC overheating. I added a second SFP28 adapter to it 5 days ago, and that’s probably why. I dunno why those are green checkmarks though. Very misleading considering it says “non-recoverable”.
I removed the other SFP28 adapter, but now I can’t access it over LAN… These ConnectX-6 cards get really hot apparently. Even with my fans loudly blowing, I think it needs its own fan.
Not quite sure what to do to fix this issue though.
If I've looked it up correctly, it's got a fin-stack heatsink but it's purely passive, like it's supposed to go in a 4U with cross-flow or something like that. Maybe just brute-force it and strap an 80mm fan on top? Unless making more direct cross-flow is easier in your system.
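Before strapping a fan on, it might be worth confirming the actual ASIC temperature under load. Both of these are assumptions about your setup: mget_temp comes from the NVIDIA/Mellanox MFT package, and the lm-sensors route only works if your mlx5 driver exposes a hwmon sensor.

sensors | grep -iA3 mlx5           # only if the driver exposes a hwmon sensor
mget_temp -d /dev/mst/<device>     # from MFT; device name is a placeholder, 'mst status' lists them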
I was able to run two SFP28 adapters for a few hours no problem after putting the stock fans back on the case.
But then the restart happened again, so I removed the other SFP28 adapter again figuring that was the issue.
After I started a zpool scrub, it happened again.
And now it’s in a restart-loop because it’s gonna keep doing that “zpool scrub” over and over again for each restart.
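The way out of the loop should be to stop (or pause) the scrub right after the pool imports on boot; these are standard zpool flags (pool name is a placeholder):

zpool scrub -s mypool    # cancel the in-progress scrub
zpool scrub -p mypool    # or pause it so it can be resumed later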
Last month, I did a “zpool scrub” on the same pool multiple times without fail. Unlike last time, I no longer take any +5V power from the PSU. It's only +12V (converted to +5V separately), and I'm at a little over a third of its rated +12V capacity.
The restarts happened even without this configuration, and last night, I removed 48 SSDs and still had the reboot issue.
2 metrics you really need to monitor, maybe with graphs…
interface resets
crc errors
same number of interface resets among devices = problem with upstream device, or power supply.
if your data drives are on different power rails, you can see if it is just one power rail by looking at which set of devices have a high interface reset.
high interface resets on one device = problematic downstream device
high crc errors = bad data cable
truenas scale has the lsi drivers which give that information
on truenas core the lsi drivers did not have that information as of 2016; I don't know about now.
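on scale you can pull those counters straight out of sysfs (sas transport layer) and smartctl, rough sketch:

# per-phy error/reset counters from the sas transport layer
grep -H . /sys/class/sas_phy/*/phy_reset_problem_count
grep -H . /sys/class/sas_phy/*/invalid_dword_count
grep -H . /sys/class/sas_phy/*/running_disparity_error_count

# udma crc error count per sata drive (attribute 199)
for d in $(lsblk -dno NAME); do echo "$d"; smartctl -A "/dev/$d" | grep -i crc; done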
Watchdog is a subsystem that expects to receive regular pings from an agent running under the operating system. When the pings stop for a certain period, the Watchdog reboots the system.
It appears that instead of running the agent, TrueNAS tries to stop the Watchdog, but for some reason cannot (maybe it is already not running).
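A quick way to check whether the timer is actually armed, from the running OS (ipmitool talks to the BMC watchdog directly):

ipmitool mc watchdog get    # shows whether the timer is running and its countdown
ipmitool mc watchdog off    # stops/disarms it if it is running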
Worth going into the BIOS options and making sure Watchdog is off. HTH
I think the LSI stuff is exactly where I need to look.
Since this NAS is always on, stuff is always writing or reading from it.
I got it down to no operations and simply ran a scrub (it should be read-only). That works fine.
Then I tried a write operation, and that killed it. The server immediately restarts.
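For the write test I'm keeping it minimal and following kernel messages at the same time, to see if anything at all gets logged before it dies (the dataset path is just an example):

dmesg -wT &    # follow kernel messages with human-readable timestamps
dd if=/dev/urandom of=/mnt/pool/writetest bs=1M count=256 conv=fsync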
A few things to note:
I swapped the PSUs with ones from another server, so it’s not those.
The IPMI web view stays open, so it can't be power loss, or I'd lose my session.
When I write a file, it’s as if something connects the reset pins on the motherboard, but nothing’s connected to them. This case only has a power button and an indicator light. There’s nothing special going on here.
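The BMC also keeps its own event log independent of the OS, so a watchdog, thermal, or voltage event might show up there even when journalctl has nothing:

ipmitool sel elist    # or 'ipmitool sel list' for the raw entries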
Pulled a couple LSI cards and replaced them with SAS expanders.
That “fixed” it, but then the issue started happening again. I'm not sure which SAS cards are bad, what's actually causing it, or how to even check.
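A rough way to narrow it down is to map each disk to the SCSI host (i.e. controller) it sits behind and then watch which host is logging resets:

ls -l /sys/block/sd*                          # the symlink target shows which hostN each disk hangs off
cat /sys/class/scsi_host/host*/proc_name      # driver per host (mpt3sas = LSI SAS HBA)
dmesg -T | grep -iE 'mpt3sas|task abort|reset'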