Thank you. So to clarify: as long as I use two devices as a mirror, and later decide that I don’t want to use a metadata vdev anymore, will I be able to remove the entire metadata config from my pool?
Oh, that however is something else. From what has been said here, you can replace (remove) sides of a mirror, and add mirrors, and mirrors of mirrors, but unless I missed it in the discussion, you can’t remove the last VDEV of a special device mirror.
You can send your pool to another pool without special VDEVs, and then the metadata gets put right back into the normal VDEVs along with the data and all the other types of metadata. You can also rsync the data, or use another method to move it somewhere that doesn’t have special VDEVs.
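If it helps, a minimal sketch of that migration, assuming a hypothetical source pool named tank (with the special vdev) and a destination pool named backup (without one):
$ zfs snapshot -r tank@migrate
$ zfs send -R tank@migrate | zfs recv -F backup/tank
On receive, the blocks are laid out according to the destination pool’s vdevs, so with no special vdev there the metadata lands back on the regular data vdevs.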
Correct. You can remove it as long as your pool doesn’t have RAIDZ vdevs. There is always metadata; the special vdev just lets you store that metadata on fast NVMe.
You did. See: zpool-remove.8 — OpenZFS documentation
The entire pool must be mirrors; it doesn’t work with RAIDZ. The term “top-level vdev” refers to vdevs where actual data is stored, like the HDD vdevs for data and the special vdev where metadata is stored.
edit: I removed the special vdev from my pool half a year ago and was using both SSDs as L2ARC, but lately decided to revert because my workload changed. Mirrors are great and flexible.
Indeed I did miss it.
Mirrors are GUD.
Removes the specified device from the pool. This command supports removing hot spare, cache, log, and both mirrored and non-redundant primary top-level vdevs, including dedup and special vdevs.
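As a rough sketch of what that looks like in practice, assuming a hypothetical pool named tank whose special vdev is the mirror named mirror-3 and whose data vdevs are all mirrors (no RAIDZ):
$ zpool remove tank mirror-3
$ zpool status tank    # evacuation progress shows up under the "remove:" line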
Thank you. That explains why I can’t remove the metadata vdev from my pool. I am using RAIDZ2 on my data tier, so even though the metadata special vdev is mirrored, I can’t remove it from my setup and would need to rebuild my entire pool.
Looking at your suggestions, I will probably stay away from a metadata vdev unless my workload changes.
Just got my 118GB P1600Xs today.
I have “demoted” my Samsung 9A1 500GB drives (previously Tier 1 VM storage) to be the special devices on my main pool. They are in a striped 4-way mirror (8 drives total, capacity of 2).
The P1600Xs are serving as L2ARC. Maybe some day they will give me a flag that allows me to not cache metadata in L2ARC.
I am finally encroaching into big boy SAN land.
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
pool 137T 65.3T 72.0T - - 2% 47% 1.20x ONLINE /mnt
raidz1-0 45.5T 20.9T 24.6T - - 5% 45.9% - ONLINE
24f5b474-f68c-11ec-9fa9-246e9642ff38 9.09T - - - - - - - ONLINE
24bd6bc9-f68c-11ec-9fa9-246e9642ff38 9.09T - - - - - - - ONLINE
2511846b-f68c-11ec-9fa9-246e9642ff38 9.09T - - - - - - - ONLINE
24d7465b-f68c-11ec-9fa9-246e9642ff38 9.09T - - - - - - - ONLINE
24f69fb7-f68c-11ec-9fa9-246e9642ff38 9.09T - - - - - - - ONLINE
raidz1-1 45.5T 22.6T 22.9T - - 3% 49.7% - ONLINE
24aadaec-f68c-11ec-9fa9-246e9642ff38 9.09T - - - - - - - ONLINE
spare-1 - - - - - - - - ONLINE
82e1befb-c6f5-47cc-9b7a-a81351d10437 9.09T - - - - - - - ONLINE
6919a517-7b86-49c8-a2f2-0b65d03c57d3 9.09T - - - - - - - ONLINE
255a65de-f68c-11ec-9fa9-246e9642ff38 9.09T - - - - - - - ONLINE
24f77b18-f68c-11ec-9fa9-246e9642ff38 9.09T - - - - - - - ONLINE
24bf6bcb-f68c-11ec-9fa9-246e9642ff38 9.09T - - - - - - - ONLINE
raidz1-2 45.5T 21.5T 24.0T - - 0% 47.3% - ONLINE
b90af04b-6aba-4c1a-bd53-0fde02098d2b 9.09T - - - - - - - ONLINE
451f468f-73b1-4eb8-aec5-780594ce367a 9.09T - - - - - - - ONLINE
e9dd89c9-24b3-49f8-b319-32a6b26fcf69 9.09T - - - - - - - ONLINE
29083760-6de5-48a5-bdf9-3281c02c79dd 9.09T - - - - - - - ONLINE
fd472396-91b8-48c1-8240-efe931defb4a 9.09T - - - - - - - ONLINE
special - - - - - - - - -
mirror-3 476G 191G 285G - - 10% 40.2% - ONLINE
529b8e0e-89d0-44b9-ab02-ac1327e738b6 477G - - - - - - - ONLINE
5de80780-19d2-4a41-a90f-97a4ec0e88fe 477G - - - - - - - ONLINE
4a877232-bc85-40b3-aedf-99ee2c481d0c 477G - - - - - - - ONLINE
04ffc4c8-7a7f-46e1-a1b1-b49daebda47f 477G - - - - - - - ONLINE
mirror-4 476G 192G 284G - - 11% 40.3% - ONLINE
06af2010-9f76-41c7-ab3f-e08fd09629f1 477G - - - - - - - ONLINE
f84cecd7-6e7d-4ba8-9c51-379b664ddf5f 477G - - - - - - - ONLINE
8e61280b-d2d9-4442-9651-c72f8d532f29 477G - - - - - - - ONLINE
d260dbe8-a426-4b72-b7c4-d2f70a27e04c 477G - - - - - - - ONLINE
cache - - - - - - - - -
1960bed3-d6c2-48b8-8c83-f4bb63bd4082 110G 2.93G 107G - - 0% 2.65% - ONLINE
6ef0339d-c3c2-4387-9d4f-f08414775810 110G 2.86G 107G - - 0% 2.59% - ONLINE
spare - - - - - - - - -
6919a517-7b86-49c8-a2f2-0b65d03c57d3 9.09T - - - - - - - INUSE
3388494b-9f79-43cb-8da0-b73d197b9032 9.09T - - - - - - - AVAIL
11014265-e5ed-490c-a68c-6f914764fe0a 9.09T - - - - - - - AVAIL
50309bdd-3ece-45fc-a305-33bfd43c9a9e 9.09T - - - - - - - AVAIL
The 9A1s were replaced by a pair of 905P 960GB drives, mirrored.
root@prod[~]# zpool list -v optane_vm
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
optane_vm 888G 622G 266G - - 42% 70% 1.08x ONLINE /mnt
mirror-0 888G 622G 266G - - 42% 70.0% - ONLINE
0a3ec38c-61bc-4c92-989e-bb886b22c241 892G - - - - - - - ONLINE
e33c70aa-009f-4032-8ae6-0465156ad9ca 892G - - - - - - - ONLINE
root@prod[~]#
So now I have a really fast Tier 1 storage pool also.
If electricity rates weren’t so high I’d be building a middle-ground storage tier with a bunch of flash disks… I even have a NetApp disk shelf waiting in the wings… My rates were $.09, are doubling to $.16, and would have almost tripled to $.27 if we hadn’t changed suppliers…
TL;DR, my wife hates my hobby.
In any case, Thanks, Intel, for killing off one of your best products so I could get a deal. lol
Took about a month to get the money together, do the upgrades, and have fun. Well worth it. Thanks @wendell for the tip in your video recently.
Add this to your kernel command line/module options:
zfs.l2arc_exclude_special=1
Added in 2.1.6 IIRC. Have fun!
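If ZFS is loaded as a module rather than built into the kernel, the persistent equivalent is a module option, e.g. (the file name is just a convention):
# /etc/modprobe.d/zfs.conf
options zfs l2arc_exclude_special=1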
Great, I didn’t know they added that.
Added it as a POSTINIT script:
echo 1 > /sys/module/zfs/parameters/l2arc_exclude_special
Intel still seems to have stock left. Today only, Newegg has an additional $60 off the $399 Optane 905P (960GB).
Intel Optane 905P Series 960GB, 2.5" x 15mm, U.2, PCIe 3.0 x4, 3D XPoint Solid State Drive (SSD) SSDPE21D960GAM3 - Newegg.com
This is your chance if you haven’t got your fill, yet.
(no, I’m not affiliated with Newegg or Intel, but excited that this technology is finally being priced accessibly for prosumers)
Not that Newegg sells here in Europe, but that is a nice price for Optane.
However, $700 for 2x1TB Optanes for special VDEVs is… still quite a hefty price.
An alternative that I took is the WD Red SN700 SLC NAND NVMe M.2. The 1TB models have a write endurance of 2000 TBW, and at 100 EUR a piece, that is a no-brainer.
Not that we have a choice here, since that nice Newegg price isn’t available to us anyway.
I don’t share the hype for an Optane special vdev either. A standard consumer SATA or NVMe SSD is totally fine, and you certainly won’t wear out a drive because of metadata and 8k files; same with L2ARC, which is even more read-heavy. For SLOG, Optane has a (3DX-)Point.
I’d rather have a 1-2TB special vdev on SATA SSDs than one with 110G of Optane.
My zpool list -v shows 360G out of 1T used (incl. 64k small blocks pool-wide), and traffic is rather low all things considered; the PCIe 3.0 NVMe certainly doesn’t break a sweat. Endurance will last until my retirement.
I finally got around to doing this…
Block Size Histogram
block psize lsize asize
size Count Size Cum. Count Size Cum. Count Size Cum.
512: 32.1K 16.0M 16.0M 30.5K 15.3M 15.3M 0 0 0
1K: 31.9K 38.3M 54.4M 29.0K 35.0M 50.3M 0 0 0
2K: 42.7K 113M 167M 36.9K 97.3M 148M 0 0 0
4K: 1.24M 4.97G 5.13G 31.6K 166M 314M 0 0 0
8K: 541K 5.87G 11.0G 22.2K 247M 561M 242K 2.84G 2.84G
16K: 182K 3.85G 14.9G 650K 10.3G 10.8G 1.24M 29.8G 32.6G
32K: 381K 17.4G 32.3G 775K 24.5G 35.3G 739K 34.7G 67.3G
64K: 1.85M 177G 209G 20.4K 1.80G 37.1G 1.08M 111G 179G
128K: 111M 13.8T 14.0T 113M 14.2T 14.2T 112M 18.3T 18.5T
256K: 0 0 14.0T 0 0 14.2T 1.17K 439M 18.5T
512K: 0 0 14.0T 0 0 14.2T 0 0 18.5T
1M: 0 0 14.0T 0 0 14.2T 0 0 18.5T
2M: 0 0 14.0T 0 0 14.2T 0 0 18.5T
4M: 0 0 14.0T 0 0 14.2T 0 0 18.5T
8M: 0 0 14.0T 0 0 14.2T 0 0 18.5T
16M: 0 0 14.0T 0 0 14.2T 0 0 18.5T
I am trying to figure out if a special metadata vdev would still make much sense for my use case… My biggest gripe with my setup is the slow write speeds, even over gigabit networking… I can’t seem to get more than ~350 Mbps writes to my 10x4 TB Z2 array.
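For what it’s worth, the usual way to act on a histogram like that is to pick a special_small_blocks cutoff so the special vdev also absorbs the small records. A sketch, assuming a hypothetical pool named tank and a 64K cutoff (it only affects newly written blocks):
$ zfs set special_small_blocks=64K tank
$ zfs get special_small_blocks tank    # verify; per-dataset overrides are possible too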
A couple of days ago I finally got the hardware to expand my pool with a mirrored special device. I went for low-latency consumer-grade SSDs. Now, a couple of days in, I’m starting to massively regret that decision, since my pool is quickly accumulating data errors.
$ sudo zpool status main -x
shows:
pool: main
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 85K in 00:08:48 with 13 errors on Sat Dec 31 15:30:47 2022
config:
NAME STATE READ WRITE CKSUM
main ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sdb ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg ONLINE 0 0 0
sda ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
sdi ONLINE 0 0 2.23M
sdj ONLINE 0 0 2.23M
cache
nvme1n1 ONLINE 0 0 0
spares
sdh AVAIL
sde AVAIL
errors: 9 data errors, use '-v' for a list
I don’t know how to confirm whether the special device is the cause of these errors, but given that I had no errors before and they started not long after adding the special vdev, I’m betting it’s the cause.
Since my pool is not entirely backed up (please don’t judge, I’m not in a position to get there), I would rather not rebuild the whole pool just to get rid of the special devices.
I know there’s no real undo button here. I’ve looked around to see if there are ways to migrate the metadata back to the data vdevs, but I am either too unfamiliar with ZFS or can’t find the right resources to guide me.
I’m also unclear if and how this might be related, but since the errors started showing up, a zed process has continuously been consuming huge amounts of resources. This persists even after a reboot of the system, which should rule out suspicions of software updates mismatching versions with loaded software / drivers / modules / what have you.
I’m really hoping someone would be able to help me out and guide me on any way of removing the special vdev without loss of data.
Thanks for reading this and for your time and consideration.
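edit: regarding the zed load, one guess (not a diagnosis): with millions of checksum errors the ZFS event backlog can get huge, and zed has to chew through it. Checking and clearing it is non-destructive:
$ zpool events | wc -l     # rough size of the event backlog
$ zpool events -c          # clear the backlog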
Bummer
Probably won’t matter, but for reference, what models of SSDs did you get and how are they connected? It’s known that a particular Samsung EVO drive will sometimes generate a few errors every so often, but nothing like this, and certainly not on both drives at the same time.
What’s strange is that it’s both drives, which points more to something like a SATA controller being bad or overheating. Bad cables (the most common error-generating problem) will cause errors only for the drive they’re connected to.
However it’s also very possible you hit some kind of software bug and ZFS is getting confused.
I’d search the GitHub issues, as well as take this to the ZFS mailing list and the subreddit since you’re more likely to find someone who can help figure out what’s going on in regards to the software internals.
Beyond that, how I’d personally approach this is basically to clear it all, make a fresh pool from scratch, and then transfer the data from the backup pool; if that doesn’t work, possibly even use rsync (with byte-for-byte verification) instead of sending/receiving a snapshot, to make sure everything is completely rewritten.
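Something along these lines, with the paths as placeholders:
$ rsync -aHAX --checksum /mnt/backup_pool/ /mnt/fresh_pool/    # archive mode incl. hard links/ACLs/xattrs; --checksum compares file contents rather than mtime+size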
I am looking at adding a metadata special device to my pool while I am rebuilding my NAS. Does zdb -Lbbbs *pool name* only provide a block size histogram in the output on certain platforms? My NAS is currently on Ubuntu 20.04, and the output of that command does not contain a block size histogram.
EDIT: Never mind, it looks like this feature was added in a later version of ZFS according to the merge request, and was not backported to the 0.8 version that 20.04 is still on. I will need to wait until I have my new hardware installed with a newer OS to run these commands.
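For anyone else wanting to check which ZFS they are on before relying on newer zdb output, the installed userland and kernel module versions are easy to confirm (this subcommand exists on 0.8 and later):
$ zfs version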
I found this, which might explain why you strangely have no read errors: Topicbox
Christian Quest
Mar 25 (9 months ago)
I got something very similar with one of my pools a few months ago: millions of checksum errors, no read errors. It was caused by a cache SSD that was silently dying. ZFS was not reporting errors from the cache SSD, but from the vdev HDDs. I removed the cache SSD from the pool, did a scrub, and all checksum problems disappeared.
Your pool doesn’t use cache, but has a special vdev. I don’t know if we are in a more or less similar situation, and I have no clue how to get you away from your problem, because data on the special vdev is not duplicated anywhere else, if I’m not wrong.
It’s possible your nvme1n1 is dying.
Additionally, check to make sure that cat /sys/module/zfs/parameters/zfs_no_scrub_io returns 0, as this is also something I found.
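For the record, pulling an L2ARC device for a test (and putting it back afterwards) is non-destructive; using the names from your zpool status, roughly:
$ zpool remove main nvme1n1        # detach the cache device
$ zpool add main cache nvme1n1     # re-add it later if it turns out to be innocent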
The SSDs used for the special vdev are two of model WDC WDS120G2G0A-.
They are both connected to the last two SATA ports on the motherboard, an X570 Phantom Gaming 4. The only “off” thing about how I connected the drives is their power: they are fed from a 4-pin Molex connector through an adapter that splits into two SATA power connectors.
I see you added an additional reply. Thanks a lot for helping out so much!
I can look into detaching the cache device from the pool to see if that helps at all. It’s not my expectation that the drive is dying, because it has barely any write cycles on it, but who knows.
The result of catting zfs_no_scrub_io is indeed 0, as expected.
Before detaching the cache drive to see if that solves the problem, I’ll first have to try to reproduce the error rates more reliably, as they seem to have stabilized since writing my initial post. I’m not sure whether that is due to low activity on the pool or to not having run proper scrubs since, but I am currently looking for a way to properly test whether detaching the cache device solves my problem.
Given the state of things, I will give it some time for now. Thanks a lot for all the help, you really gave me the courage and motivation to continue. I’ll post back with any updates, good or bad.
The errors seem to increase still. For a bit more insight, I’m sharing the files that are marked as an error:
errors: Permanent errors have been detected in the following files:
/media/anime/Pokémon/Season 3/Pokémon - S03E06 - Flower Power SDTV.mkv
/var/lib/mysql/nextcloud/oc_activity_mq.ibd
/var/lib/mysql/nextcloud/oc_twofactor_providers.ibd
/var/lib/mysql/undo_001
main/applications/docker-compose:<0x0>
/srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzlogs/bzbui/bzbui31.log
/srv/docker-compose/bazarr/config/log/bazarr.log.2022-12-29
/srv/docker-compose/plex/config/Library/Application Support/Plex Media Server/Logs/Plex Crash Uploader.log
/srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzlogs/bzbui/bzbui02.log
/srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzlogs/bzbui/bzbui30.log
/srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzlogs/bzbui/bzbui29.log
/srv/docker-compose/backblaze/wine/drive_c/ProgramData/Backblaze/bzdata/bzfilelists/bigfilelist.dat
These mostly seem to be highly active files.
- Log files
- Database files
- A recently transcoded file
One file has remained in the list of errors without a proper name; I had manually deleted it when it showed up in the list earlier because I deemed it unimportant.
I will try removing the cache drive for a couple of days and see if that changes anything.
$ sudo zpool status main
pool: main
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 45K in 00:08:33 with 19 errors on Mon Jan 2 23:10:42 2023
config:
NAME STATE READ WRITE CKSUM
main ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sdb ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg ONLINE 0 0 0
sda ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
sdi ONLINE 0 0 355
sdj ONLINE 0 0 355
spares
sdh AVAIL
sde AVAIL
errors: 12 data errors, use '-v' for a list
Do the drives have the latest firmware? A 4-pin Molex cable can handle two SSDs just fine; the only potential issue is that the molded ones are known to sometimes short out or catch fire.
I’m still leaning toward a software bug being the problem here, given that both drives have identical numbers of checksum errors. If they had differing amounts of errors, then that’d point more to a pure hardware issue.
These mostly seem to be highly active files.
That’s good to know.
I basically can’t realistically give any further actual help, but I’m definitely interested in what you find. I would strongly suggest making a GitHub issue and/or a mailing list post, as (1) you’re more likely to talk to someone with more in-depth knowledge, and (2) it’ll put your issue on the radar if it is a software bug, so it can get fixed.