Recently I acquired a faulty Dell DR6000 for cheap with the intent of using it for parts. The system was no slouch, with dual Xeons, 192GB of RAM, 12x Seagate Constellation ES3 4TB hard drives, and an additional two 15K 300GB Toshiba SAS disks for the OS.
Before tearing it down I figured that if it could be recovered, I'd rather keep it whole and use it to upgrade my home NAS. It had a note attached by the prior admin: "randomly powered off and won't power on again".
When I first powered it up, my monitor reported the signal was out of range, which was odd. Putting an oscilloscope on it showed that there was indeed a VGA signal there, but the timing was all messed up. I figured it would be best to try the easiest options first, such as clearing the BIOS and iDRAC configuration (there are jumpers for both). I didn't expect this to yield any results as it looked like a hardware fault, but I was wrong.
The VGA timing was still all messed up, but it had changed, and it would change several more times if I waited. Eventually, after about 5 minutes, it settled down and I finally got an image on the screen. This presented the next clue: the dreaded "iDRAC can't start" error, indicating a faulty eMMC. I have seen this before, as have quite a few others; unfortunately it usually means either replacing the BGA eMMC with a new part and re-flashing it, or replacing the entire motherboard.
Still, I held out hope that it could be recovered. While I am no novice with a soldering iron and hot air workstation, BGA rework is still out of my league. That left me with only one option: re-flashing the iDRAC via its debug UART port. Thankfully the community has done a lot of work figuring out how to recover systems with faulty iDRACs, so there was quite a bit of information available.
The first step was to connect to the UART. As can be seen in the image below, it is easily accessible as an unpopulated four-pin header.
I was not keen on pulling the motherboard out, or even soldering to it if I could help it. I found that if I just sat a pin header in the holes, connected my USB TTL UART adapter to it, and rested some light weight on it so it canted over in the holes, I could get a reliable connection without needing to solder the header in.
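To sanity-check a makeshift connection like this, it helps to just open the port and log whatever comes out while the iDRAC boots. Below is a minimal sketch using Python with pyserial; the device path and the 115200 baud rate are my assumptions here, so adjust both to whatever your adapter enumerates as and whatever the debug console actually runs at.

```python
import serial  # pyserial (pip install pyserial)

# Assumed device path and baud rate for the USB TTL adapter; adjust both
# to match your adapter and whatever the iDRAC debug console actually uses.
PORT = "/dev/ttyUSB0"
BAUD = 115200

with serial.Serial(PORT, BAUD, timeout=1) as uart, open("idrac-uart.log", "ab") as log:
    print(f"Listening on {PORT} at {BAUD} baud, Ctrl-C to stop")
    try:
        while True:
            data = uart.read(256)  # returns b"" once the 1 s timeout expires
            if data:
                log.write(data)
                log.flush()
                # Echo live output, tolerating garbled bytes from a bad contact
                print(data.decode("ascii", errors="replace"), end="", flush=True)
    except KeyboardInterrupt:
        pass
```

If the output is clean text rather than garbage, the improvised header contact and the baud rate are both good.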
The blue markings on the other socket in that photo are the iDRAC reset and U-Boot interrupt enable pins on the debug header. Simply pulling either of these pins to ground is enough to trigger the reset or enable the U-Boot interrupt.
By pulling the reset line low (grounding it), then holding the U-Boot interrupt pin low while holding down a key on the keyboard, I was able to get into the U-Boot command prompt. From there, using the Dell tool included in the U-Boot build, I re-flashed the eMMC from an image on an SD card.
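Holding a key down while also juggling the reset and interrupt pins is awkward, so a small script that spams a character at the port until the U-Boot prompt appears can stand in for the keyboard. This is only a sketch: the port, baud rate, and prompt string are assumptions to check against your own console output, and the Dell-specific eMMC flash command in the iDRAC U-Boot build is deliberately not reproduced here.

```python
import time
import serial  # pyserial

# Assumptions: same adapter path and baud rate as before, and that the
# iDRAC's U-Boot prompt contains "=>". Check your own console and adjust.
PORT = "/dev/ttyUSB0"
BAUD = 115200
PROMPT = b"=>"
TIMEOUT_S = 120

with serial.Serial(PORT, BAUD, timeout=0.2) as uart:
    buf = b""
    deadline = time.time() + TIMEOUT_S
    while PROMPT not in buf:
        if time.time() > deadline:
            raise SystemExit("Never saw the U-Boot prompt; check wiring and baud rate")
        uart.write(b"\n")  # stands in for holding a key on the keyboard
        buf += uart.read(512)
    print("Dropped into the U-Boot prompt; interrupt succeeded")
    # The actual recovery (the Dell flash tool in the U-Boot build, pointed at
    # the eMMC image on the SD card) is then driven by hand from a terminal.
```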
The good news: the flash succeeded with no errors. The bad news: the system still reported that the iDRAC was not responding during POST. This, however, turned out to be easily solved by powering off the system and discharging the "flea" power (hold the power button while the system is unplugged), forcing a completely cold boot.
So now that it was working, should I trust this system? Did the eMMC actually fail, and might it fail again? Or was the failure caused by some external factor? Thankfully, as this is a server with non-volatile event logging, it was possible to determine the cause of the failure.
These are the last log entries from before I powered the system back on for the first time. What is of note here is that the prior admin had performed a firmware upgrade, very likely using the Dell update ISO, which updates everything that is out of date in one pass. This is the smoking gun: the iDRAC firmware upgrade clearly failed during this process, bricking the system. So much for the note that it "randomly powered off"… the admin had directly triggered the fault.
Anyway, this left me with a fully functional system ready to upgrade my home NAS with. That upgrade was also fraught with other performance-related issues, but I will create a new thread for those.