[SOLVED] TrueNAS keeps restarting randomly every 6-10 minutes

All of a sudden, my NAS started restarting randomly tonight.

I’m running TrueNAS SCALE 23.

At the time this started, I was copying around some files on my SSD zpool, and then it happened.

What might be causing these restarts?


It looks like it’s happened quite a few times tonight, and only the startup is logged. That makes me think it’s a hardware issue:

There’s one log where I did shut down the OS to remove some SSDs (see below).
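In case it helps anyone else, something like this should show whether the OS ever logged a clean shutdown (assuming journald keeps persistent logs across boots):

    journalctl --list-boots    # one line per boot, with boot IDs
    journalctl -b -1 -n 50     # last 50 messages from the previous boot

If the previous boot just cuts off with no shutdown messages, that points to a hard reset rather than an OS-initiated reboot.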


Story on the SSDs that might be causing this issue:

I have too many SSDs in this NAS for the +5V rail, but they’re spread out between two PSUs and have been working fine up to this point. If there was a PSU failure, I’d assume it would shut off the system rather than randomly restart it.

Also, the peak power draw numbers are nowhere near where they’ve been when I perform a scrub on this zpool.
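For what it’s worth, the draw numbers can be pulled live from the BMC; something like this works on DCMI-capable boards (hedging here, support varies by board):

    ipmitool dcmi power reading    # instantaneous / min / max / average draw

The BMC only samples periodically though, so a millisecond-scale sag on a rail can slip between readings.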

I took out 48 SSDs tonight and moved them to an externally powered SAS expander to put me back even lower than I was 2 months ago (power-wise and in terms of internal drive count). That didn’t fix anything.

I’m now wondering if this is UPS-related.

2 UPSs

Not that I’ve changed this anytime recently, but I did plug my NAS’s redundant PSUs into 2 separate UPSs.

Since both PSUs are separate entities, it should be fine, but they’re also splitting the load.

I’ve had my NAS in this configuration for a few days now and don’t remember seeing any restarts.

Could this be the issue?

upsmon

I also have an HDD zpool in a separate chassis on a 3rd UPS by itself.

All 3 of those UPSs have USB cables plugged into the back of my NAS.

It’s possible upsmon is forcing the system to restart 5-10 minutes after booting. But wouldn’t it call a kernel shutdown and be reported in journalctl?
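One way to rule upsmon in or out, assuming a stock NUT setup (the UPS name below is hypothetical, and unit names vary by build):

    upsc -l                        # list the configured UPSes
    upsc myups@localhost           # dump status; look for OB/LB flags
    journalctl | grep -i upsmon    # upsmon logs its shutdown calls via syslog

If upsmon had ordered a shutdown, I’d expect a forced-shutdown message in the journal rather than silence.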

SSDs still

It could still be the SSDs. It seems like any operation after the system boots that messes with the SSDs causes it to die.

The IPMI tool from Supermicro says the +5V rail is at 4.99 V, which should be fine, right?
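For anyone following along, the voltage sensors can be dumped like this (assuming ipmitool is installed and talking to the local BMC):

    ipmitool sdr type Voltage      # all voltage sensors with thresholds
    ipmitool sensor | grep -i 5v   # just the 5V-related lines

The caveat is that these are slow polled snapshots, so a brief sag under write load could still read as a healthy 4.99 V.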

It also runs a bunch of snapshot backup tasks as soon as it boots, and those run fine, but if I trigger one manually, it’s not long before the system shuts down.

This could be because of the upsmon issue, or it could be because reading the SSDs causes this to happen.

Still happens after going back on 1 UPS and unplugging all the USB cables from those UPSs.

Unless anyone has other ideas, I think the +5V rail is botched. Maybe it’s overheating?

Looks like something in the middleware is scheduling reboots, considering the reboot cron-job line only appears when you do a shutdown or reboot through the GUI, at least on my own system.

EDIT: Since you said you’re using upsmon which is part of the middleware, it seems likely to be related.
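A way to check, assuming your build still writes middleware logs to /var/log/middlewared.log:

    grep -iE 'reboot|shutdown' /var/log/middlewared.log
    journalctl -u middlewared | grep -iE 'reboot|shutdown'

If the middleware ordered the reboot, it should leave a trace there; a hard hardware reset won’t.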

A watchdog timer without a proper watchdog driver, maybe? Sorry, I don’t know if your motherboard has a watchdog.

Not sure what watchdog is, but it’s there when I manually reboot:

Tonight, it worked fine for 2 hours with only the four boot and TrueNAS Apps SSDs plugged in.

I started plugging SSDs back in and got to 13 before I noticed it rebooted again, but this time, I finally got errors:

The issue appears to be my NIC overheating. I added a second SFP28 adapter to it 5 days ago, and that’s probably why. I dunno why those are green checkmarks though. Very misleading considering it says “non-recoverable”.

I removed the other SFP28 adapter, but now I can’t access it over LAN… These ConnectX-6 cards get really hot apparently. Even with my fans loudly blowing, I think it needs its own fan.

Not quite sure what to do to fix this issue though.
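If it helps, recent kernels expose the ConnectX temperature through hwmon, so something like this should let you watch it while loading the link (assuming lm-sensors is installed):

    watch -n 2 'sensors | grep -A 3 mlx5'

That at least tells you whether a strapped-on fan is actually keeping it in range.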


If I’ve looked it up correctly, it’s got a fin-stack heat sink but is purely passive, like it’s supposed to go in a 4U chassis with cross-flow. Maybe just brute-force it and strap an 80mm fan on top? Unless making a more direct cross-flow is easier for your system.


I was able to run two SFP28 adapters for a few hours no problem after putting the stock fans back on the case.

But then the restart happened again, so I removed the other SFP28 adapter again figuring that was the issue.

After I started a zpool scrub, it happened again.

And now it’s in a restart loop, because the scrub resumes on every boot and kills it all over again.
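In case anyone needs to break the same loop: an interrupted scrub resumes when the pool imports at boot, but it can be cancelled right after login (pool name hypothetical):

    zpool scrub -s tank    # stop the in-progress scrub
    zpool status tank      # should now report the scrub as canceled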

Last month, I did a “zpool scrub” on the same pool multiple times without fail. Unlike last time, I no longer take any +5V power from the PSU. It’s only +12V (converted to +5V separately), and I’m a little over a third of its rated +12V capacity.

The restarts happened even without this configuration, and last night, I removed 48 SSDs and still had the reboot issue.

2 metrics you really need to monitor, maybe with graphs:
  - interface resets
  - CRC errors

Same number of interface resets across devices = a problem with an upstream device or the power supply.
If your data drives are on different power rails, you can tell whether it’s just one rail by checking which set of devices has high interface resets.

High interface resets on one device = a problematic downstream device.

High CRC errors = a bad data cable.

TrueNAS SCALE has the LSI drivers that report this information.
On TrueNAS CORE, the LSI drivers didn’t report it as of 2016; I don’t know about now.

I’m on SCALE. How do I check for those values on my LSI cards?

It was available as a tool on the LSI website, next to the driver download.
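If you’d rather not dig through Broadcom’s site, smartmontools can pull the SAS phy error counters straight from each drive, which covers the reset/CRC side of this (device name hypothetical; works for SAS devices, while SATA SSDs report the rough equivalent in SMART attribute 199 via smartctl -A):

    smartctl -l sasphy /dev/sdX    # invalid DWORDs, disparity errors, phy resets
    sg_logs -p 0x18 /dev/sdX       # same counters via the protocol-specific log page

Comparing those counts across drives on the same HBA should show whether one rail, one cable, or one card stands out.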

Watchdog is a subsystem that expects to receive regular pings from an agent running under the operating system. When the pings stop for a certain period, the watchdog reboots the system.

It appears that instead of running the agent, TrueNAS tries to stop the Watchdog, but for some reason cannot (maybe it is already not running).

Worth going into the BIOS options and making sure the watchdog is off. HTH
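A quick way to check from the OS side too, assuming a watchdog driver is loaded (wdctl just errors out harmlessly if none is):

    wdctl                              # identity, timeout, and state of /dev/watchdog
    journalctl -k | grep -i watchdog   # kernel messages about the watchdog

Also worth a glance at RuntimeWatchdogSec in /etc/systemd/system.conf, since systemd can arm the hardware watchdog on its own.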


That makes sense. Pretty sure watchdog is off, but that’s another avenue to check.

Do you know the name of this tool? I have some LSI stuff installed in this system already as part of TrueNAS.

I don’t know how to navigate the Broadcom website to find anything else to download though.

NIC overheating… That’s either some hardware built wrong or a lack of server airflow.

“Watch Dog” is off:
[BIOS screenshot]


Which SAS card do you have?

LSI 9305-24i or -16e depending on which card. I have 6 of them in this machine.

I think the LSI stuff is exactly where I need to look.

Since this NAS is always on, stuff is always writing to or reading from it.

I got it down to no operations and simply ran a scrub (which should be read-only). That worked fine.

Then I tried a write operation, and that killed it. The server immediately restarts.
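For a controlled repro, a direct-I/O write like this is the kind of thing I’d use, so the page cache doesn’t delay the moment the load hits the drives (path hypothetical):

    dd if=/dev/urandom of=/mnt/tank/writetest bs=1M count=1024 oflag=direct status=progress

If the box resets partway through that every time, it’s a clean write-path trigger.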

A few things to note:

  1. I swapped the PSUs with ones from another server, so it’s not those.
  2. The IPMI web view stays open, so it can’t be power loss, or I’d lose my session.

When I write a file, it’s as if something connects the reset pins on the motherboard, but nothing’s connected to them. This case only has a power button and an indicator light. There’s nothing special going on here.


Pulled a couple LSI cards and replaced them with SAS expanders.

That “fixed” it, but then the issue started happening again. I’m not sure which SAS cards are bad, what’s causing it, or how to even check.

It’s only happening when I write to the zpool.