I have FreeNAS running on a Dell R710, and I finally realized it’s been restarting on me almost since the beginning. Occasionally my mapped drive to it would disconnect, which was the only symptom I ever really saw.
Finally, one day Plex (running in a jail on it) wouldn’t respond. I logged into iDRAC and saw that it was rebooting. It was a software-only reboot though: no extra noise from the hardware spinning back up like you get from a cold boot.
Everything I see looks fine, except that in the debug I can find “unscheduled system reboot” and “The operating system successfully came back online”. I have been through the messages file at /var/log/messages and can’t find anything. Same in iDRAC. It all checks out as far as I can see.
So my questions are:
What part of the debug should I attach, or should I just attach the whole thing? (I haven’t been able to find anything other than what I stated above, but this log style is new and confusing to me.)
Any other suggestions on where to look for logs?
And any suggestions on troubleshooting?
Not suggesting this is your issue, but I have actually had a server with major problems, including failing to boot, due to a faulty Dell DRAC module (before they were called iDRAC).
We had to pull the DRAC out to get it to boot.
It could also be that you’re hitting some obscure hardware/driver bug. It would be worth googling for issues with your hardware chipset(s), particularly the SCSI controller, for both FreeNAS and FreeBSD.
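If the debug is large, one way to narrow things down is to filter the kernel messages for panic and machine-check lines before reading anything else. This is only a minimal sketch: the function name `find_suspect_lines` and the two patterns are illustrative assumptions, not part of FreeNAS or any of its tooling, and it assumes the log text has already been read into a list of lines.

```python
import re

# Assumed patterns: kernel panic lines and machine-check (MCA) lines,
# optionally preceded by a syslog priority tag such as <5> or <118>.
SUSPECT = re.compile(r"^(?:<\d+>)?\s*(?:panic:|MCA:)")


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines that look like panics or MCA reports."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SUSPECT.match(line)
    ]


if __name__ == "__main__":
    sample = [
        "<5>bridge0: link state changed to UP\n",
        "MCA: CPU 0 UNCOR PCC internal timer error\n",
        "panic: Unrecoverable machine check exception\n",
    ]
    for number, text in find_suspect_lines(sample):
        print(number, text)
```

To run it over a real file, pass an open file object as `lines`; which file to open depends on where the debug was extracted.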
<118>Sat Mar 14 18:04:24 PDT 2020
bridge0: Ethernet address: 02:86:b2:da:b4:00
epair0a: Ethernet address: 02:66:10:00:07:0a
epair0b: Ethernet address: 02:66:10:00:08:0b
<5>epair0a: link state changed to UP
<5>epair0b: link state changed to UP
<6>epair0a: changing name to 'vnet0:1'
<6>bce0: promiscuous mode enabled
<5>bridge0: link state changed to UP
<6>vnet0:1: promiscuous mode enabled
<6>arp: 172.16.123.51 moved from f0:9f:c2:c3:e4:12 to 00:0c:29:fa:c6:72 on bce0
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x106a5, APIC ID 16
MCA: CPU 0 UNCOR PCC internal timer error
MCA: Address 0x806577869
MCA: Misc 0x0
panic: Unrecoverable machine check exception
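For anyone wondering how the "UNCOR PCC internal timer error" text relates to the raw Status value, here is a rough sketch that splits the Bank 5 status word into its flag bits and error codes. The bit positions follow the IA32_MCi_STATUS layout from the Intel Software Developer's Manual; the function name `decode_mci_status` and the dictionary keys are made up for illustration, and this is not how FreeBSD itself does the decoding.

```python
# Flag bits of IA32_MCi_STATUS (bit position -> short name), per the Intel SDM.
FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit machine-check status word into flags and error codes."""
    flags = [
        name
        for bit, name in sorted(FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 0-15
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 16-31
    }


if __name__ == "__main__":
    decoded = decode_mci_status(0xBE00000000800400)
    print(decoded["flags"])
    print(hex(decoded["mca_error_code"]))
```

For the value in this log, the UC and PCC bits are set and the low 16 bits are 0x0400, which is the encoding the SDM lists for an internal timer error. That matches the kernel's own one-line summary.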
Most say it’s a hardware problem, but I did find an instance where someone said a microcode update packaged with the BIOS update fixed the problem. So I downloaded the BIOS update from Dell and applied it.
If that doesn’t work, I’ll see about pulling the server out and reseating the CPU. It’s not easy to get it from its current physical location to a point where I can work on it.
And if that doesn’t work, I’ll probably pull that CPU and run with just one, since it is a dual-CPU system. I’d relocate the second one to the primary socket and move the RAM.
More specs? What sort of memory DIMMs are you running? Is all memory being detected?
JMO: pull that second CPU. Not only will it save power, but QPI with CPUs of that era seems to really hurt performance. It’s mostly apparent if you have a 10+Gb network link. The exception is if the extra DIMM slots or cores really benefit your setup.
All the RAM is being detected. It is only running FreeNAS and a Plex add-on. Actually, pulling the second CPU would increase the RAM speed. Not sure why, but when I was looking up how to populate the DIMM slots I noticed that having two CPUs lowers the RAM speed. I just never pulled one because they were both in there when I purchased it, so I left it as is.
I’ve considered replacing them with more powerful or more efficient ones, but it never seemed worth the trouble until recently, when I realized what was going on.
Populating more channels does lower RAM speed… At this point, unless you’re actually experiencing a CPU bottleneck, there are probably way better things to spend money on. One huge regret I have is buying two X5680 CPUs right at the get-go. They were completely unnecessary for what I needed and caused a cascade of chasing performance while not knowing anything about dual-CPU systems of that era. It would have been better to spend more on RAM or higher-density storage. The backup box I have now is an R510 with a single Westmere-era Xeon (X5650) and 64GB of RAM. As long as the file fits in ARC, I can easily get 800ish MB/s transfers on a 10Gb network connection. No tricks and no jumbo frames, just a direct copper connection.
Good to know. I thought I had it fixed, as it ran fine for several days. I did get another unscheduled restart, but I haven’t had the chance to go through the debug. Seems like I should just remove CPU 0 and move the second one to the 0 socket, then move the RAM and just use it as a single-CPU system.
I’m going to go ahead and close this. I’m assuming it’s the CPU in socket 0. I’m going to remove it, place the one from socket 1 in socket 0, and reposition the RAM.
The servers are in a spot where physical maintenance isn’t easy, and I haven’t pulled them down to do it yet.