The story of how I destroyed my TrueNAS Scale apps. Or did I? Find out in the next episode!

TLDR: Fucked around, found out. Migrated the Apps pool, then reinstalled and now my apps are gone.

Actual spoiler for the end of the story

And sometimes turning it off and on again is all it takes.

If you’re here for the helpdesk tag and don’t care too much about the story, scroll down to after the horizontal line.


OK so… I’m not entirely sure how this happened. But before I get into it, let me just say if there isn’t any way to recover this it is not a massive deal, but it would certainly be nice.

Anyway, bit of backstory:
I built a new NAS based on TrueNAS Scale in October-ish 2023. Put some apps on it from the TrueNAS catalogue and a couple “Custom” Apps via Docker containers. That’s all been working great so far. The install was still on 24.04 (Dragonfish).
But what always bothered me is that the 250GB SSD in this thing basically went to waste, because TrueNAS takes the entire disk even though the install size is relatively tiny.
My Apps and all their data were stored on my data pool (mirrored 16TB HDDs) so far. But since they don’t take much space and would benefit from the SSD’s IOPS, I wanted to migrate the apps and their data to the SSD. TrueNAS itself supports this migration with a simple checkbox.

Before I set this thing up I found this thread:

Basically, the trick is to limit the boot-pool partition to a set size during the install so that a second partition with the rest of the drive can be created later, which I did. Before the reinstall I updated to the latest version of Dragonfish and also downloaded the Dragonfish ISO (in case something went wrong, I figured the best chance of recovery is when the config backup is from the same version).
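To illustrate what the installer edit actually does (this is just a demonstration of the substitution on a stand-in line; the real line in `/usr/sbin/truenas-install` may look different):

```shell
# The substitution caps the boot-pool partition (partition 3) at 16 GiB
# instead of letting it grow to fill the disk ("-n3:0:0" means "partition 3,
# default start, end at end of disk"). Demonstrated on a sample line:
echo 'sgdisk -n3:0:0 -t3:BF01 "$DISK"' | sed 's/sgdisk -n3:0:0/sgdisk -n3:0:+16G/'
# → sgdisk -n3:0:+16G -t3:BF01 "$DISK"
```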

I know that this method is explicitly unsupported, but I wouldn’t be in the L1 forum if I wasn’t someone who fucks around with shit (and sometimes finds out, as the case may be).

Anyway, here’s what happened:

  1. Stopped all my apps, then created a backup of the TrueNAS config (including “secret seed”)
  2. I followed the post above, i.e. replaced the line in the installer and went through the install. Then I went back into the shell and created another partition (yes, I checked the man pages beforehand to see what these parameters actually do):
    $ sed -i 's/sgdisk -n3:0:0/sgdisk -n3:0:+16G/' /usr/sbin/truenas-install
    $ truenas-install
    # checking the drives to find the SSD
    $ lsblk
    # BF01 is what the truenas-installer uses as well, so I figured it can't be too wrong
    # truenas-installer creates 4 partitions: BIOS Boot, EFI Partition, boot-pool, and swap
    $ sgdisk -n5:0:0 -t5:BF01 /dev/sdb
    
  3. At that point I checked lsblk again and noticed that my newly created partition was supposedly already formatted as ZFS and even had the name boot-pool. Finding this odd, I researched a bit what these partition types actually mean. During my research I stumbled on a post by the libparted author. TLDR: they don’t really mean anything, and Linux doesn’t really use them. With this knowledge in mind I figured maybe ZFS is doing something when both typecodes are the same (BF01 is Solaris /usr & Mac ZFS), so I decided to remove the partition again and recreate it with a different typecode:
    # confirming the SSD is still the same /dev
    $ lsblk
    $ sgdisk --delete=5 /dev/sdb
    # BF00 is Solaris root
    $ sgdisk -n5:0:0 -t5:BF00 /dev/sdb
    
    After this, lsblk was still showing the partition as boot-pool so I decided to say fk it and created a new pool anyway:
    $ zpool create apps-pool sdb5
    
    At this point I got an error message (I don’t remember the exact wording, unfortunately) that this operation would destroy a “potentially used” pool (or something along those lines) and that this could be overridden with -f. Having just reinstalled, with everything nuked anyway, I said fk it again and forced it, and that worked.
  4. Booted up the new TrueNAS install for the first time. Went straight to importing my backup, and after the reboot everything was back how it used to be. So, yeah SUCCESS.
  5. Went to Storage > Import Pool, the new apps-pool showed up there just fine, so I did.
  6. Went to Apps > Settings > Choose pool > chose the new apps-pool, checked the “Migrate existing Apps” checkbox and OK’d it. It went about replicating the current apps from my data pool to the new apps-pool. It was about 134 GB (dunno how, the directories are only 8-ish gigs, so I guess snapshots somewhere), which took about an hour.
  7. After everything was done I started up the first couple of apps, which worked fine. After those were running for a bit I started the rest. At that point some of the apps didn’t come up: they were stuck in a “deploying” state, and in the WebUI’s task manager I was getting errors about some Kubernetes config in /etc being group-readable, which would be unsafe, and another file (which unfortunately I don’t remember, because it was 2 days ago) being corrupted, which resulted in the deployment failing.
  8. At this point it gets a bit hazy but I think I went to restore my backup config again to get back to a known working state. I don’t think I switched back the App pool because I figured it’s stored in the config anyway so switching it back is pointless.
  9. After the automatic reboot, I tried starting my apps again, and they all behaved normally. At this point I assumed the restore worked and I was back using the Apps on my data-pool (the migration doesn’t “move” them, it does a ZFS replication to the new pools, so the old stuff remains).
  10. This is where I guess I started to fuck up (and from this point on I also didn’t think to copy any errors anymore, which… uh yeah, lesson learned). I got curious how TrueNAS actually creates/destroys pools, because zpool was an unknown command in the shell. Turns out the zpool executable is stored in /usr/sbin/, which is not in the default PATH (which makes sense, because you’re not really intended to run it manually).
    Anyway, I was probably being stupid in thinking I could just recreate the apps-pool using sudo /usr/sbin/zpool create -f apps-pool /dev/sdb5, instead of doing the not-dumb thing of rebooting into the ISO and doing it from the installer’s shell like I did earlier. Doing this resulted in some error that it couldn’t mount because of a read-only filesystem. I just assumed it meant it couldn’t mount the (newly recreated) apps-pool and I had to import it again.
  11. At that point TrueNAS started acting a little weird: For one, the Storage Screen didn’t show the apps-pool anymore, but that was expected because I just recreated it… right? Except, under Import Pool the (new) apps-pool didn’t show up. So it seemed like TrueNAS was thinking the pool was already imported, but it also wouldn’t let me use it. Under Datasets the pool showed up, but it threw me a bunch of errors that it couldn’t read permissions or some such.
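For what it’s worth, my best guess on the read-only error in step 10: it’s probably about the mountpoint, not the pool itself. By default `zpool create` mounts a new pool at `/<poolname>`, and the root filesystem on SCALE is read-only, so creating `/apps-pool` fails. A hedged sketch of what would likely have avoided that particular error (assuming /dev/sdb5 was still the right partition; TrueNAS keeps its own pools under /mnt):

```shell
# zpool create defaults the mountpoint to /apps-pool, which cannot be
# created on SCALE's read-only root filesystem. -m places it under /mnt,
# where TrueNAS mounts its pools; -f overrides the stale-label complaint.
sudo /usr/sbin/zpool create -f -m /mnt/apps-pool apps-pool /dev/sdb5
```

That said, creating pools behind the middleware’s back is exactly the kind of thing that got me here, so the UI’s pool creation/import is still the saner route.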

At this point I just let it sit because my Apps were running, and because I had other things to do. Now, 2 days later, I wanted to clean this up.
Since TrueNAS was acting “weird” and I thought I was back on the old config anyway, i.e. the backup restore still worked as expected, my plan was to just do the whole thing all over again: boot into the ISO, change the install script, do the install from scratch, create the partition. Just like in step 3, the new partition was supposedly already formatted, but this time it was named apps-pool. Having seen that before, I didn’t think much of it, just zpool create -f’d over it, and finished.
Booted the (new new) TrueNAS install for the first time, then went to upload the config backup again. The automatic reboot went smooth, I checked the Storage Screen, and I was indeed able to select the (new) apps-pool for import (but didn’t, because I wanna figure out why it went wrong in the first place). But then… when checking the Apps Screen, it told me no apps were installed, even though after the first reinstall they showed up immediately.

Me IRL:
a close up of a doll with the words oh hell no

Not wanting to risk anything, I did the smartest thing I have probably done in 3 days: I shut the NAS down and stopped touching it.

So… this is where we get to 2 hours ago. This is where I started typing all this up, trying to remember what I did, finding a fitting meme to post.
I don’t even know what exactly was riding me (while typing this), but I ended up deciding to turn it on again; I think I wanted to check which pool was selected in the Apps Screen.

After booting it back up, and going into the Apps Screen, all my Apps were there. I can even start them, everything is as it was before the first reinstall.

Lesson learned: Turns out Roy was right (again)
Roy from The IT Crowd saying his catch phrase: Have you Tried turning it off and on again?


OK If you read all this, first of all thank you.

Secondly, I am curious:
As noted in step 3, the newly created partition was showing as type “ZFS member” with the label boot-pool (and in step 11, apps-pool), and zpool create refused to proceed without forcing it. But… why? I don’t really understand. sgdisk doesn’t create filesystems as far as I know, and even if it did (by setting the partition type), it would be a blank new filesystem with nothing on it. So why the hell would it be part of the boot-pool? Or even better, where the hell was the name apps-pool coming from in step 11? boot-pool I can see coming from the pool that already existed on the drive, but I nuked the entire drive with the reinstall, so where is apps-pool coming from?
That doesn’t make sense to me. Does anyone know?
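My own best guess after the fact (hedged, happy to be corrected): the pool name isn’t coming from the partition table at all. ZFS stamps four copies of its own label onto every member device, two at the start and two at the end, and blkid/lsblk identify a “ZFS member” and its pool name from those labels, not from the GPT typecode. sgdisk only rewrites the partition table; deleting and recreating a partition never touches the bytes inside it. Since the new partition 5 covered roughly the same tail region of the disk as before, the old labels were still sitting there, which would explain both the boot-pool and the apps-pool sightings, and why zpool create demanded -f. A sketch of how one could verify and clean this up (assuming /dev/sdb5; needs root):

```shell
# Hedged sketch: inspect a partition for leftover ZFS labels before
# creating a pool on it (/dev/sdb5 is an assumption -- confirm with lsblk).
zdb -l /dev/sdb5
# If stale labels show up, clear them so the partition reads as blank
# again and a later `zpool create` no longer needs -f:
zpool labelclear -f /dev/sdb5
```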


Helpdesk request:

Third, I still want to do the App migration to the SSD. The Apps take what feels like an eternity to come up, especially after a reboot when nothing is cached in RAM yet. They should come up a lot faster from the SSD. Besides, database-heavy Apps would love an SSD anyway.
But, even though I know (sort of) that the config backup works, I would not want to go through all of this again (and write yet another post). So, any tips on how I could avoid this if it goes south again?
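One hedged idea for a cheap safety net: take a recursive snapshot of the apps dataset before the migration, so there is a rollback point that doesn’t depend on the config backup. The dataset path below is an assumption (on Dragonfish the k3s apps normally live in an ix-applications dataset); verify the real name first:

```shell
# Hedged sketch: recursive snapshot of the apps dataset before migrating.
# "data-pool/ix-applications" is an assumption -- check the actual path:
zfs list -o name | grep ix-applications
zfs snapshot -r data-pool/ix-applications@pre-migration
# If the migration goes south, affected datasets can be rolled back
# individually (zfs rollback operates on one dataset at a time):
# zfs rollback data-pool/ix-applications@pre-migration
```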

Better yet, does anyone have an idea what might have gone wrong during the App Migration in the first place? The partitions don’t seem to be the issue, and I wasn’t really doing anything outside of TrueNAS functionality (well, except for the 2 partitions). As far as TrueNAS is concerned these are separate pools, and it a) lets me actively import said pool from the UI, and b) lets me select said imported pool as the app-storage pool. It even does the migration on its own.
So I wonder, has anyone else run into issues migrating the Apps to a different pool and them not starting afterwards? I can’t say that anything remarkable happened either… During the replication the WebUI froze up but I figured it just hit a session timeout (although my timeout is set to an hour or so). After reloading the page the task was still active in the task manager, and after the replication was done it restarted the Apps service on its own. So from an observer’s perspective it seems everything went as intended. And yet, I’m getting errors about group-readable files in /etc? That doesn’t really make sense to me… I mean files in /etc aren’t even in the apps-pool, right? So if they had been a problem they should have been a problem before the migration too… any ideas? :confused:

Anyway, thanks for coming to my TED Talk. Now go and destroy some Apps.
