LOG device on an encrypted pool = single point of failure? Can't unlock pool anymore after loosing SLOG

Mastakilla · February 4, 2021, 10:59am

Hi,

I am running TrueNAS 12 on my server and had some issues with data corruption recently which caused me to temporarily remove my Intel Optane from my server. After finding out the issues were caused by a broken CPU, I replaced the CPU and wanted to re-include my Intel Optane.

But something went wrong and now I can’t unlock my pool anymore… Here’s what happened:

I actually followed the same steps that I successfully performed on FreeNAS 11, when initially installing my Intel Optane about 6 months ago.

Create a GPT partition table
- gpart create -s gpt nvd0
Create 3 partitions (I didn’t do anything yet with the swap)
- gpart add -i 1 -b 128 -t freebsd-swap -s 16g nvd0
- gpart add -i 2 -t freebsd-zfs -s 20g nvd0
- gpart add -i 3 -t freebsd-zfs nvd0
Add the 2nd partition as SLOG to my existing password encrypted RAIDZ2 pool
- zpool add hgstpool log /dev/gptid/cf3dbeb8-c5dc-11ea-8b09-d05099d3fdfe
Create a non-redundant test pool from partition 3
- zpool create -m /mnt/optanepool optanepool /dev/gptid/d23e29f0-c5dc-11ea-8b09-d05099d3fdfe
- zpool export optanepool
- Import optanepool in the GUI

Next I wanted to enable lz4 compression on optanepool in the GUI, but I got an error:

Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 137, in call_method
    result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self,
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1195, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/local/lib/python3.8/site-packages/middlewared/service.py", line 463, in update
    rv = await self.middleware._call(
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1195, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 972, in nf
    args, kwargs = clean_and_validate_args(args, kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 930, in clean_and_validate_args
    value = attr.clean(args[args_index + i])
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 607, in clean
    data[key] = attr.clean(value)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 167, in clean
    value = super(Str, self).clean(value)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 53, in clean
    raise Error(self.name, f'Invalid choice: {value}')
middlewared.schema.Error: [aclmode] Invalid choice: DISCARD

This step did work in FreeNAS 11, but for some reason it doesn’t work anymore in TrueNAS 12.

I thought: Ok, no problem. I’ll just re-create optanepool and enable compression with the “zpool create” command itself. So I exported the pool the pool in the TrueNAS GUI, but then I couldn’t create a new zpool over the old one, as the old one was still there, but not visible, as it was exported.

Then I made the mistake to think that it might be better to destroy optanepool from the GUI, so I re-imported optanepool and then exported it again in the GUI, while selecting the destroy checkbox. At that time my main RAIDZ2 pool was actually still locked after a reboot (I even thought it would be safer to keep it locked).

But, here it comes, when destroying optanepool from the TrueNAS GUI, it didn’t just destroy the optanepool partition only. No it destroyed ALL partitions on my Intel Optane, including the partition that I was using as SLOG for my password encrypted RAIDZ2 pool

When I now try to unlock my RAIDZ2 pool, it fails with the following error:

Error: concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/worker.py", line 91, in main_worker
    res = MIDDLEWARE._run(*call_args)
  File "/usr/local/lib/python3.8/site-packages/middlewared/worker.py", line 45, in _run
    return self._call(name, serviceobj, methodobj, args, job=job)
  File "/usr/local/lib/python3.8/site-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.8/site-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 977, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/zfs.py", line 371, in import_pool
    self.logger.error(
  File "libzfs.pyx", line 391, in libzfs.ZFS.__exit__
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/zfs.py", line 365, in import_pool
    zfs.import_pool(found, new_name or found.name, options, any_host=any_host)
  File "libzfs.pyx", line 1095, in libzfs.ZFS.import_pool
  File "libzfs.pyx", line 1123, in libzfs.ZFS.__import_pool
libzfs.ZFSException: one or more devices is currently unavailable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/middlewared/job.py", line 361, in run
    await self.future
  File "/usr/local/lib/python3.8/site-packages/middlewared/job.py", line 397, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 973, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/pool_/encryption_freebsd.py", line 290, in unlock
    raise e
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/pool_/encryption_freebsd.py", line 272, in unlock
    await self.middleware.call('zfs.pool.import_pool', pool['guid'], {
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1238, in call
    return await self._call(
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1203, in _call
    return await self._call_worker(name, *prepared_call.args)
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1209, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1136, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1110, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
libzfs.ZFSException: ('one or more devices is currently unavailable',)

In the last line, you can clearly read “one or more devices is currently unavailable”. In the console, you can also see that it successfully unlocks each HDD, one by one, and after unlocking all 8 HDDs, it then destroys them again:

Feb  4 10:35:29 data GEOM_ELI: Device gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:29 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:29 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:31 data GEOM_ELI: Device gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:31 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:31 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:32 data GEOM_ELI: Device gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:32 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:32 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:34 data GEOM_ELI: Device gptid/f00c5450-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:34 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:34 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:36 data GEOM_ELI: Device gptid/efd8967a-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:36 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:36 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:38 data GEOM_ELI: Device gptid/f01fd8e3-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:38 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:38 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:39 data GEOM_ELI: Device gptid/f03634f0-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:39 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:39 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:41 data GEOM_ELI: Device gptid/f04a1e2c-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:41 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:41 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:42 data GEOM_ELI: Device gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/f00c5450-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/efd8967a-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/f01fd8e3-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.

So apparently my accidently destroyed SLOG device, is causing the pool to become impossible to unlock.

I can’t add a new / replacement SLOG, because I need an unlocked pool for that
I can’t “remove” the destroyed SLOG from my pool, because I need an unlocked pool for that
I tried physically removing the Intel Optane, so that TrueNAS might realize the SLOG is no longer there, but I still can’t unlock my pool

I have some ideas to go forward, but my ZFS knowledge is too limited to know if these are safe / good ideas. I hope someone here can give some feedback…

I was thinking unlock my HDDs one by one using the command line. Perhaps then, even though I can’t unlock the destroyed SLOG anymore, I’ll be able to “see” and modify my unlocked RAIDZ2 pool and perhaps remove the SLOG, so that TrueNAS no longer expects a SLOG while unlocking. I found following commands to do this:

geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key dev/gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe
geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key dev/gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe

Is this a good idea?

Log · February 4, 2021, 6:36pm

While I’ve not personally tested such a scenario, a pool should be able to have the slog come and go at any time.

So I don’t think this is a ZFS problem, but a GUI problem. To be honest, I’m fairly confused about most of what you’ve done here and why.

I’m not familiar with bsd/freenas or ZFS encryption or the combination of both. I’d recommend Staying away from flailing about with ‘attach’ commands for now lest you dig yourself deeper hole.

Instead you should be looking to import the pool. In the command line type ‘zpool import’ to get a list of what pools are available. Hopefully there’s someone here who’s familiar with how encryption is handled.

You can also head over to r/zfs for advice.

Mastakilla · February 4, 2021, 7:40pm

Thanks for the advice!

I know the “zpool import” command. It even has an option specifically for “my problem”:

  -m      Allows a pool to import when there is a missing log device.  Recent transactions can be lost because the log device will be discarded.

However, because the pool is encrypted, it remains “invisible” to the OS until the geli devices are loaded. That is why I suspect I need to get the “geli attach” commands to work first. So that the pool becomes “visible” etc

Mastakilla · February 5, 2021, 8:59am

Here is what made my data available again (not sure if TrueNAS is still 100% healthy, but that’s a worry for later)

Below steps made my pool visible / available again in CLI (unlock each HDD, one by one)
- geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe
- geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe
- geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f03634f0-cdd4-11ea-a82a-d05099d3fdfe
- geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/efd8967a-cdd4-11ea-a82a-d05099d3fdfe
- geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f00c5450-cdd4-11ea-a82a-d05099d3fdfe
- geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f04a1e2c-cdd4-11ea-a82a-d05099d3fdfe
- geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe
- geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f01fd8e3-cdd4-11ea-a82a-d05099d3fdfe
The next step made my pool fully visible in GUI (as degraded because of the missing SLOG)
- zpool import -m hgstpool
This removed the log device from the pool and made the pool also healthy again in the GUI
- zpool remove hgstpool gptid/f8725bbb-64c5-11eb-bbcd-d05099d3fdfe

After a reboot, I could unlock my pool again in the normal way via the GUI… I’ll now start updating my offline backups first