General NAS advice (now blog)

I am super confused. Something is wrong, and I have no idea what it is or how to figure it out. I’m having about 50/50 success booting the NAS on the new hardware. Only a couple of times have I gotten far enough to log in to the UI from my PC, and none of those got me as far as actually accessing the files. Once I got all the way to unlocking the encryption, but still couldn’t access it on the network, and shortly thereafter it kicked me out of the UI and never reconnected, like it was powered off. It has done random reboots before, as I think I already said, but this time it didn’t reboot or power off. It just… stopped. On the failed boots, it basically never leaves this state. A couple of times it froze on the POST screen, prompting me to enter BIOS. One time I was able to, one time not. A couple of times it literally never showed anything but a black screen. Once or twice it booted into TrueNAS, but just continually spat a bunch of nonsense down the screen, and I couldn’t connect from the PC. I tried resetting the BIOS to defaults, but based on what it told me, it didn’t actually change much of anything back. Mostly just inconsequential stuff like wifi and the stupid RGB on the board. Maybe it had already reset in one of those failed boots? Except it never lost the compatibility boot setting.

And it’s starting to look like one of the drives is indeed cooked. The last two times it booted far enough to tell, one of the HDDs was not detected. I guess to be on the safe side I should do nothing else until that is replaced. But in this state, I don’t even know how I would recover the pool back to full health.

What kind of danger am I in if I just try to start fresh? The actual data disks: is there anything about those tied to THIS particular TrueNAS install? They have password encryption. I don’t know if that means I can’t just yoink them and put them in a fresh install. Or if you can even do that if they AREN’T encrypted.

Well, it’s finally back working on the new hardware, mostly. I pulled everything apart, swapped out the SATA cables, and cleaned off the contacts on the drives with some alcohol and Q-tips. There was a small bit of black crap on most of them. I thought it was deep scratches in the gold contacts at first, because it wouldn’t scrape off with my fingernail, but scrubbing with the Q-tips and alcohol eventually cleared it off. The SATA cables were twisted pretty gnarly from the odd angles forced on them by the cable management, so I undid pretty much all of that and they’re all just loose in the front now. Bonus: that lets me get the back side panel on a lot easier. I also started to install the 5.25" bay drive sled I got for future disk expansion, but it turns out it has a physical design flaw and will need some plastic dremeled away to fit. With all that done, plus some minor tweaking in the BIOS that shouldn’t have anything to do with the problems (disabling unneeded stuff, turning the stupid RGB off, setting eco mode, etc.), all drives are detected, it boots into TrueNAS fine with no weird warnings or disks being unavailable, and it just generally works like it always did on the old hardware. As long as it doesn’t do the self-reset/restart thing anymore, it should be good. It’s already been up longer than it had been taking to reset.

The “mostly” part of the new hardware is that I also swapped the ECC RAM out for the old non-ECC sticks from the gaming build, in case the ECC was leading to instability. If all is well as is for a couple of days, I’ll swap it back and make sure, as carefully as I can, that the BIOS is configured right for it, but the BIOS menu on this board is SUPER stupid and confusing. There doesn’t seem to be anywhere to just set it to run at the JEDEC or XMP (I forget what the AMD version is called) profiles stored on the sticks. And of course it never really tells you what “auto” is doing, and these days it’s as likely as not that “auto” tries to do some overclocking or otherwise non-baseline settings so that their motherboard comes out ahead in any performance testing/reviews.

It stayed up and accessible just fine for about 2 days, during which I ran extended SMART tests on everything and scrubbed the pool. To the extent TrueNAS tells me, which as far as I’m aware is not much more than nothing, there were no errors anywhere along the way.

Although one odd thing was that the scrub progress percentage reset and started again after reaching 100. I thought at first maybe this was a per-disk thing. Only about 3% of the way through that, my power blinked, and for some damn reason my UPS didn’t keep me up, so I had a moment of panic, not knowing if hard losing power during an active scrub might fuck something up. When I turned everything back on though, and I checked the notifications on the right in the truenas UI, it reported as scrub completed on the pool, corresponding to about the time it would have hit 100%. So I don’t really know why it started again. It didn’t say anything about errors in notifications, and under the disks tab, each one has a “success” entry for the long test, but doesn’t tell me anything else but a “lifetime” number that seems way too high to be simple power-on hours at almost 2000, since I don’t keep the machine on when not in use.
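
As a sanity check on that “lifetime” number, here is a quick bit of arithmetic in Python. It assumes the figure really is cumulative power-on hours at the time the test ran, which is my guess and not something the UI confirms. The ownership periods in the loop are made-up placeholders, not the actual age of these disks.

```python
HOURS_PER_DAY = 24


def hours_to_days(hours: float) -> float:
    """Convert a power-on-hours figure into days of continuous uptime."""
    return hours / HOURS_PER_DAY


def implied_daily_usage(hours: float, days_owned: float) -> float:
    """Average hours per day the disk would have been powered on."""
    return hours / days_owned


lifetime_hours = 2000  # the "almost 2000" shown under the disks tab

print(f"{hours_to_days(lifetime_hours):.1f} days of continuous uptime")

# Hypothetical ownership periods (assumption, for illustration only).
for days_owned in (180, 365, 730):
    per_day = implied_daily_usage(lifetime_hours, days_owned)
    print(f"owned {days_owned} days -> about {per_day:.1f} h/day powered on")
```

So 2000 hours is a bit under three months of being switched on around the clock. Whether that is “way too high” depends entirely on how old the disks are and how many hours a day the machine actually runs.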

So with everything looking all hunky-dory, I swapped the ECC memory back in and faffed about in the BIOS, finally finding the setting to enable ECC. It was already set to “auto”, and based on the tooltip, that should enable it when ECC memory is physically present, but I went ahead and set it to “enable” anyway. I did not, however, manage to find the DOCP (as it was called on this AM4 board) setting that I eventually did locate with the previous memory installed. It was under the ‘A.I. overclocking tweaker’ menu, which I didn’t mess with because I didn’t want to use any proprietary, non-AMD-approved nonsense (that has a history of killing hardware from both AMD and Intel, after all, and with ASUS specifically), which it sure sounded like anything under that menu would be. Perhaps there aren’t such profiles on these sticks, and that option only presents itself if they are found? With everything on “auto” except disabling a few “performance enhancement” functions, it says it is running at the speed the sticks are rated for according to the sticker, 2933, anyway. If memory serves, that is just JEDEC for DDR4.
Booted up fine, loaded the UI fine, and now I have a nice (ECC) showing on the home page memory… widget.

I’m hoping, but not really expecting, that the extra capacity (64 vs 32 GB) might make it load a whole video file into memory when I watch one, so I won’t get the minorly annoying delay when skipping/scrubbing around that I do now. But then again, maybe that’s from network latency and not disk. Either way, based on what reading I have done on ZFS, it doesn’t seem like pre-loading a file it thinks I might use is something it does, and it wouldn’t be until after I’ve watched the whole video that it would reside in memory. This is all just me guessing though. I guess I’ll find out over time.
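
To make that guess concrete, here is a toy read cache in Python. It is a plain least-recently-used cache, which is NOT how ZFS’s ARC actually works (ARC also weighs how often blocks are used, and this sketch ignores any prefetching), and the block counts are invented. It only illustrates the general idea that a block lands in memory after it has been read once, and that the cache has to be big enough to hold the whole file for a rewatch to benefit.

```python
from collections import OrderedDict


class ReadCache:
    """Toy LRU read cache; a stand-in for illustration, not ZFS ARC."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block id -> placeholder, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, block_id: int) -> None:
        if block_id in self.blocks:
            # Already in memory: mark it as most recently used.
            self.blocks.move_to_end(block_id)
            self.hits += 1
            return
        # Not in memory: "go to disk", then keep a copy.
        self.misses += 1
        self.blocks[block_id] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used


def play(cache: ReadCache, n_blocks: int) -> tuple:
    """Read a file front to back; return (hits, misses) for this pass."""
    hits_before, misses_before = cache.hits, cache.misses
    for block_id in range(n_blocks):
        cache.read(block_id)
    return cache.hits - hits_before, cache.misses - misses_before


VIDEO_BLOCKS = 100  # hypothetical file size in blocks

roomy = ReadCache(capacity_blocks=200)
print("roomy, first viewing: ", play(roomy, VIDEO_BLOCKS))   # (0, 100)
print("roomy, second viewing:", play(roomy, VIDEO_BLOCKS))   # (100, 0)

tight = ReadCache(capacity_blocks=50)
play(tight, VIDEO_BLOCKS)
print("tight, second viewing:", play(tight, VIDEO_BLOCKS))   # (0, 100)
```

In this toy model the first viewing always comes from disk, and only a cache larger than the file helps the second time around. The real thing is smarter than this, so treat it as a picture of the reasoning and not a prediction.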

So next up needs to be making sure I know how to back up and “recover” the install. Not the data itself, i.e. just another NAS, but getting access to the data back should something like this happen again, where I have (at least all but one of) the data disks but something happens to the install, either software or hardware failure. Then once that’s sorted, I’ll probably migrate to SCALE, because I’m pretty sure I’m going to need that add-a-disk-to-the-pool feature in the future. Although I did just see something about a “community edition” that seems separate from both CORE and SCALE; not sure what that’s about.