FreeNAS "unscheduled system reboot"

I have FreeNAS running on a Dell R710, and I finally realized it’s been restarting on me almost since the beginning. The only symptom I ever really saw was my mapped drive to it occasionally disconnecting.

Finally, one day Plex (running in a jail on it) wouldn’t respond. I logged into iDRAC and realized it was rebooting. It was a software-only reboot, though: no extra noise from the hardware spinning back up like you get from a cold boot.

Everything I see looks fine, except that in the debug I can find “unscheduled system reboot” and “The operating system successfully came back online”. I have been through /var/log/messages and can’t find anything. Same in iDRAC. It all checks out as far as I can see.

So my questions are:
What part of the debug should I attach, or should I just attach the whole thing? (I haven’t been able to find anything other than what I stated above, but this log style is new and confusing to me.)
Any other suggestions on where to look for logs?
And any suggestions on troubleshooting?

You can have it send you an email before it reboots or when it comes back up from a reboot. That way you would know exactly when it happened and which part of the logs to check.

The first thing I’d do is a memory test and/or other hardware tests.

If you have a second system, you may also want to consider directing its syslog to that, so that if the problem is storage related, you still get the log over the network.
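
For the receiving side, a minimal sketch of a UDP syslog listener in Python could look like the following. It assumes the sender forwards classic `<PRI>message` datagrams over UDP; the log file name `remote-syslog.log` and the helper names are made up for illustration, and a real syslog daemon on the second box is the more robust choice.

```python
import re
import socketserver

# Hypothetical output file on the receiving machine.
LOG_PATH = "remote-syslog.log"

# Classic syslog messages start with "<PRI>", where PRI = facility * 8 + severity.
PRI_RE = re.compile(r"^<(\d{1,3})>")


def split_priority(line):
    """Return (facility, severity, rest); the first two are None without a <PRI> prefix."""
    match = PRI_RE.match(line)
    if not match:
        return None, None, line
    pri = int(match.group(1))
    return pri // 8, pri % 8, line[match.end():]


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data = self.request[0]
        text = data.decode("utf-8", errors="replace").rstrip()
        facility, severity, message = split_priority(text)
        with open(LOG_PATH, "a", encoding="utf-8") as log:
            log.write(
                f"{self.client_address[0]} fac={facility} sev={severity} {message}\n"
            )


def serve(host="0.0.0.0", port=514):
    """Listen for syslog datagrams until interrupted (port 514 usually needs root)."""
    with socketserver.UDPServer((host, port), SyslogHandler) as server:
        server.serve_forever()
```

Calling `serve()` on the second machine starts the listener. Because each datagram is written out as soon as it arrives, the last lines before a crash survive even if the NAS never gets to flush them to its own disks.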

I’d definitely run hardware diags, though. In my experience FreeNAS is pretty solid; I’ve only seen reliability issues from hardware.

OK, I’ll see about getting the logs directed away from that machine. That’s going to suck if it’s hardware.

I do have a debug, and going through it is where I found the unscheduled restart. I also want to say I saw something panic-related, but whatever followed it was gibberish to me.
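
One way to narrow a large debug dump down to the interesting lines is to filter it for crash-related keywords. The sketch below is only an illustration: the keyword list is a guess at what a kernel panic tends to leave behind, and `find_crash_lines` and `scan_file` are made-up helper names.

```python
import re

# Assumed keyword list; extend it with whatever shows up in your own logs.
KEYWORDS = re.compile(r"panic|MCA:|machine check|fatal trap", re.IGNORECASE)


def find_crash_lines(lines):
    """Return (line_number, text) pairs for lines that look crash-related."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if KEYWORDS.search(line)
    ]


def scan_file(path):
    """Print every matching line of the file at `path` with its line number."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, text in find_crash_lines(handle):
            print(f"{number}: {text}")
```

For example, `scan_file("/var/log/messages")` prints only the matching lines, which makes it easier to spot the block of text that follows a panic.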

I used to get this when it did a scrub. It basically got too hot (2U chassis and crap fans). Try reapplying thermal paste and check your fans, as it may be thermal.

But yes, most likely hardware. FreeNAS doesn’t “just crash”.

As above, it could just be something like thermal paste, RAM not quite seated properly, etc. Not necessarily broken, but could simply be dislodged, overheating, etc.

One of the things that makes me hesitate on hardware is the lack of events in iDRAC. I figure if FreeNAS was getting hot enough to restart, iDRAC would have events too.

Either way, I will set up getting the logs and try to do some hardware testing.

Not suggesting this is your issue, but I have actually had a server with major problems, including failing to boot, due to a faulty Dell DRAC module (before they were called iDRAC).

We had to pull the DRAC out to get it to boot. :smiley:

It could also be that you’re hitting some obscure hardware/driver bug. It would be worth googling for issues with your hardware chipset(s), particularly the SCSI controller, for both FreeNAS and FreeBSD.

So I found this after the last reboot.

<118>Sat Mar 14 18:04:24 PDT 2020
bridge0: Ethernet address: 02:86:b2:da:b4:00
epair0a: Ethernet address: 02:66:10:00:07:0a
epair0b: Ethernet address: 02:66:10:00:08:0b
<5>epair0a: link state changed to UP
<5>epair0b: link state changed to UP
<6>epair0a: changing name to 'vnet0:1'
<6>bce0: promiscuous mode enabled
<5>bridge0: link state changed to UP
<6>vnet0:1: promiscuous mode enabled
<6>arp: 172.16.123.51 moved from f0:9f:c2:c3:e4:12 to 00:0c:29:fa:c6:72 on bce0
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x106a5, APIC ID 16
MCA: CPU 0 UNCOR PCC internal timer error
MCA: Address 0x806577869
MCA: Misc 0x0
panic: Unrecoverable machine check exception
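
The `Status` word of bank 5 can be picked apart by hand. The sketch below decodes only the architectural fields of an Intel machine-check status register (the flag bits at the top and the 16-bit MCA error code at the bottom); the function name is made up, and the model-specific code in bits 16 to 31 is left undecoded because its meaning depends on the CPU model.

```python
# Architectural flag bits of an Intel IA32_MCi_STATUS register.
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_mc_status(status):
    """Split a 64-bit machine-check status word into flags and error codes."""
    flags = [
        name
        for bit, name in sorted(FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 0-15
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 16-31
    }


decoded = decode_mc_status(0xBE00000000800400)
print(decoded["flags"])                # ['VAL', 'UC', 'EN', 'MISCV', 'ADDRV', 'PCC']
print(hex(decoded["mca_error_code"]))  # 0x400
```

The `UC` and `PCC` flags line up with the `UNCOR PCC` text in the log, and MCA error code `0x0400` is the architectural "internal timer error" code, so the kernel’s one-line summary is a direct reading of this register.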

It’s either something with the CPU or the socket. In my experience it’s usually the socket. Please reseat the CPU.

That, or try googling the error. Now that I think about it, that’s probably the best bet.

Most say it’s a hardware problem, but I did find an instance where someone said a microcode update packaged with the BIOS update fixed the problem. So I downloaded the BIOS update from Dell and applied it.

If that doesn’t work, I’ll see about pulling it out and reseating the CPU. It’s not easy to get it from its current physical location to a point where I can work on it.

And if that doesn’t work, I’ll probably pull that CPU and just run with one, as it is a dual-CPU system: relocate the second one to the primary socket and move the RAM.

More specs? What sort of memory DIMMs are you running? Is all memory being detected?

JMO: pull that second CPU. Not only will it save power, but QPI with CPUs of that era seems to really hurt performance, which is mostly apparent if you have a 10+Gb network link. Keep it only if the extra DIMM slots or cores really benefit your setup.

Keep in mind that when you do this, you can no longer access half of your PCI lanes. So plan accordingly.

All the RAM is being detected. It is only running FreeNAS and a Plex add-on. Actually, pulling the second CPU would increase the RAM speed. Not sure why, but when I was looking up how to populate the DIMM slots, I noticed that having two CPUs lowers the RAM speed. I just never pulled one because they were both in there when I purchased it, so I left it as is.

I’ve considered replacing them with more powerful or more efficient ones, but it never seemed worth the trouble until recently, when I realized what was going on.

I highly recommend setting up email alerts in FreeNAS. It will send you any dmesg output, which will often give you advance notice of failing hardware.

I do have that set up now. I just have to actually check my email now.

Populating more channels does lower RAM speed… At this point, unless you’re actually experiencing a CPU bottleneck, there are probably way better things to spend money on. One huge regret I have is buying two X5680 CPUs right at the get-go. They were completely unnecessary for what I needed and caused a cascade of chasing performance while not knowing anything about dual-CPU systems of that era. It would have been better to spend more on RAM or higher-density storage. The backup box I have now is an R510 with a single Westmere-era Xeon (X5650) and 64 GB of RAM. As long as the file fits in ARC, I can easily get 800ish MB/s transfers on a 10Gb network connection. No tricks and no jumbo frames. Just using a direct copper connection.

Good to know. I thought I had it fixed, as it ran fine for several days. I did get another unscheduled restart, but I haven’t had the chance to go through the debug. Seems like I should just remove CPU 0 and move the second one to the 0 spot, then move the RAM and just use it as a single-CPU system.

I’m going to go ahead and close this. I’m assuming it’s the CPU in socket 0. I’m going to remove it, place the one from socket 1 in socket 0, and reposition the RAM.
The servers are in such a place that physical maintenance isn’t easy, and I haven’t pulled them down for it yet.

Good luck!