Recently, every scrub on my RaidZ2 has been stalling a little below 85% completion.
The scan rate drops from about 150M/s to about 64K/s, and the issuing rate goes to nil.
Dedup and compression are active, ashift is 12.
The RaidZ2 comprises ten 4TB WD Red (CMR) drives connected via an HPE 240 HBA, plus a 500GB WD Black SSD as L2ARC; the system has 64GB of RAM and runs Ubuntu 20.04.
I initialized the RaidZ2 a few weeks ago and transferred a lot of very important research data for my PhD onto it (this data benefits greatly from dedup). After that transfer, a scrub ran swiftly and completed OK.
Next I transferred my personal photo library to the RaidZ2; that is when the problems started, and since then every scrub has stalled at about 85%.
I checked the SMART reports from all drives and found no issues. I even checked the HBA but found no problems either.
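For reference, those per-drive checks boil down to something like this (sdX is a placeholder for each member drive; smartctl comes from smartmontools):

sudo smartctl -H /dev/sdX    #quick pass/fail health summary per drive
sudo smartctl -a /dev/sdX    #full SMART attributes and self-test log
zpool status -v    #per-device read/write/checksum error counters from ZFS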
The storage pool is reported as online with no errors.
I ran the scrub several times and it always stalled at the same point.
The SMART tests did not reveal any errors, yet I noticed a problem with the UID in ZFS: one of the drives had an unusual one (differing from all the others in zpool status -v … only numbers, not alphanumeric).
I exported the zpool and reimported it by /dev/by-uid; then all disks showed up with the same UID and ZFS performance was trashed (though the pool was still online!).
I then imported it by /dev/by-id and the whole situation resolved itself. The scrubs now complete successfully and without any problems.
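For reference, the export/reimport sequence amounts to roughly this (the pool name tank is a placeholder):

sudo zpool export tank    #stop all use of the pool and export it
sudo zpool import -d /dev/disk/by-id tank    #reimport, scanning only the stable by-id links
zpool status -v tank    #the members should now be listed by their by-id names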
Well, the good thing is that the RaidZ2 is safe, but can anyone explain what happened?
You’re not supposed to set up pools via drive letters, as those are volatile. Something like plugging in a flash drive could mess it all up if it gets picked up as the first drive letter or ends up somewhere you’re not expecting.
Well, I did set up the RaidZ2 initially via drive letters, but then exported it and reimported it by UID, so I assumed I was on the safe side.
But curiously there seems to be a UID mix-up with my HBA, the HPE 240 (it’s in HBA mode, I’ve quadruple-checked).
The UIDs of my drives are NOT unique, only the IDs are … it’s very strange.
The pool might show the same UID for every member of each vdev, or for the pool as a whole, depending on which tool you use.
/dev/disk/by-id/ should* be unique.
[Edit: meant blkid] blkid should show all members of an array as having the same reference.
The by-id name combines the reported model and serial number. Some drives report odd values as a serial number, so entire models can appear identical. In that case you can use the wwn- entry from within the same /by-id/ folder, but it is much less intuitive.
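A quick way to see what the system actually exposes for each member, both the by-id links and what blkid reports (a minimal sketch, no pool-specific names assumed):

ls -l /dev/disk/by-id/    #shows the ata-<model>_<serial> and wwn-<id> links and which sdX each points to
sudo blkid | grep zfs_member    #ZFS member partitions; the pool-level UUID should match across all members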
The last digit of the wwn can sometimes differ from what is printed on the drive's label. The correct wwn can be viewed with other tools like sgparm, iirc; will check on that.
I still use /by-id/, as the wwn does not always make sense or match the drive.
Only use the wwn if the main serial number does not get listed.
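If a wwn- link looks suspicious, it can be cross-checked against what the drive itself reports (sdX is a placeholder; assumes smartmontools is installed):

sudo smartctl -i /dev/sdX | grep -i wwn    #the drive's identity block lists its LU WWN Device Id
ls -l /dev/disk/by-id/ | grep sdX    #compare with the wwn- symlink that points at the same drive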
But you do you, boo; I don't have to switch your drives out of the machine.
You can label your drives with gpart or gdisk or something, and then make the array from those labels; I did that recently with a batch that did not want to play nice…
#use gdisk to label the drives
sudo gdisk /dev/sdX    #replace sdX with the drive you want to label
o    #create a new GPT table if the drive is blank or still MBR
d    #delete any existing partitions you don't want to keep
n    #create a partition: start at sector 2048, end at 3907012607 for 2T drives
     #or end at 7814017024 for 4T drives
t    #set the partition type to "bf01" ("Solaris /usr & Mac ZFS")
c    #name the partition so it appears under /dev/disk/by-partlabel/
w    #write the changes to disk (or q to quit without saving)
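Once the partitions are labelled, the pool can be built from those labels instead of raw device names; a minimal sketch, assuming a pool called tank and labels tank-disk01 through tank-disk10 (all placeholders, only the first few listed):

sudo zpool create -o ashift=12 tank raidz2 \
  /dev/disk/by-partlabel/tank-disk01 \
  /dev/disk/by-partlabel/tank-disk02 \
  /dev/disk/by-partlabel/tank-disk03 \
  /dev/disk/by-partlabel/tank-disk04    #...continue listing all labelled drives
zpool status -v tank    #members now show up by label rather than by sdX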
Because the ZFS Mastery series and Lucas's storage books talk about using GEOM to label disks.
And I cry in Linux…
But just the serial number from /by-id/ does me well enough, and ZFS auto-formats drives well enough that I don't really have to worry about them.