Dude, I conceded that one failed VDEV will destroy the pool.
However, that is quite a departure from your original argument that a single pool of multiple single-drive VDEVs is not possible. You have yet to acknowledge you were wrong.
Now, whether to proceed with ZFS without parity disks is up for debate, and that is a debate I was not, and am not, interested in at all.
The “critical target error” seems to be btrfs-specific? I only get a single hit doing a web search. Not much to go on, unfortunately.
Could the blake checksum option I selected be the cause? For the time being I’m going to attach both drives to my more powerful local machine, directly to its SATA ports. I’ll run the same rsync command and see if I’m still facing the same issue. The good thing is that, at least for now, the older drive still seems to be able to read and write. I have a feeling it’s mostly the checksum issue, but I could be wrong.
I tried a new USB Type-C cable for the Mediasonic enclosure but faced the same issue again. I just hope both drives are healthy. I’m liking btrfs for the time being… being able to check the condition of the drive and its data is quite handy.
The old drive, on Fedora, is showing an error like the earlier 14TB drive did: “DISK IS LIKELY TO FAIL SOON”.
I ran the smartctl command and I’m guessing it’s because these two properties have a non-zero count. Do I need to start the return process for this drive too? Is there a way for me to check for bad sectors in Linux? (Ignore that, I found a command called badblocks which I think will do it for me.)
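From what I gather, the basic read-only check would be something along these lines (the device name is just a placeholder, and I think you need a larger block size on a drive this big since badblocks apparently tops out at 2^32 blocks; it also takes a very long time):

    # default mode is a non-destructive read-only scan; -s shows progress, -v prints any bad blocks found
    sudo badblocks -sv -b 8192 /dev/sdX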
I know almost nothing about btrfs so I can’t comment on that, really. I don’t see why a checksum option would cause that error though. I would certainly suspect any USB controllers first, that’s for sure. So it’s great that you’re trying with a direct SATA connection!
When it comes to ZFS, I can mostly agree with the points in the post you linked to. ZFS was made for the data center, and it shows. Many of the points are not very important to me, though (I don’t need the absolute latest kernel, for instance; in fact I like to use a “longterm” kernel on my server). Some points are “mostly” wrong (“Can’t recompress or uncompress”, “Can’t move data between datasets”), and at least one point has changed since that was written (“No reflink copies”, although there were some nasty bugs related to that one when it was first released; I have no real use for reflinks so I haven’t tried them yet).
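To give an example of what I mean by “mostly” wrong (dataset names here are made up): you can change the compression property whenever you like, but it only applies to newly written data, so to actually recompress existing data you have to rewrite it, e.g. via send/receive:

    zfs set compression=zstd tank/data                             # only new writes use zstd from here on
    zfs snapshot tank/data@recompress
    zfs send tank/data@recompress | zfs receive tank/data-new      # rewriting the data picks up the new setting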
But yeah, ZFS is kinda inflexible and I’m not impressed with the code quality: I’ve run into bugs relating to the parsing of command line options (including the --dry-run option doing nothing, not even parsing the options!), and I don’t like ZFS’ habit of panicking the kernel if something’s wrong rather than aborting the mount (or whatever the operation was) and emitting a kernel Oops.
ZFS’s features are amazing though and worth the tradeoff to me.
Just remember that the typical advice given if something is wrong with your pool is to recreate it and restore from backups. Think about that beforehand and keep it in mind while planning your pool(s) and backup strategy! (You do need backups anyway and ZFS makes backups relatively easy!)
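A minimal sketch of what I mean by “relatively easy” (pool, dataset, and host names are just placeholders):

    zfs snapshot tank/data@2025-06-01
    zfs send tank/data@2025-06-01 | ssh backuphost zfs receive backup/data
    # later, incremental sends only transfer what changed between the two snapshots:
    zfs send -i tank/data@2025-06-01 tank/data@2025-07-01 | ssh backuphost zfs receive backup/data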
I personally like multiple smaller pools rather than having all drives in a single pool, partly because I don’t fully trust ZFS and partly because of its inflexibility: if something goes wrong it’s quicker/easier to restore just part of the data, and if I want to redo a pool layout it’s very nice to have other pools that I can migrate the data to temporarily. So multiple pools are, to me, a way to relieve ZFS’s inflexibility.
So the format of the SMART data is a bit weird. VALUE starts off at 100 (for most attributes) and ticks down when something’s not good. If it reaches the THRESH value, it triggers a SMART fail.
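If you want a quick way to see whether anything has actually hit its threshold, something like this should work (with /dev/sdX swapped for your device; smartctl -H reports essentially the same check as done by the drive itself):

    # print any attribute whose normalized VALUE has dropped to or below its (non-zero) THRESH
    sudo smartctl -A /dev/sdX | awk '$1 ~ /^[0-9]+$/ && $6+0 > 0 && $4+0 <= $6+0'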
I don’t see anything wrong with that drive, although I’d be a little bit vigilant of the Current_Pending_Sector and Offline_Uncorrectable RAW values. I’d like them to be 0 – but it’s possible it’s a remnant from the factory testing and that everything is perfectly fine. I don’t see a reason to return the drive.
Also, rather than badblocks I’d run a long offline SMART test: smartctl -t long <dev>, then wait many hours and check the “SMART Self-test log” with smartctl (the section where your output above shows a “Short offline” test).
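Roughly (device name is a placeholder):

    sudo smartctl -t long /dev/sdX        # starts the test; the drive stays usable in the meantime
    sudo smartctl -l selftest /dev/sdX    # check here once it's done; the long test will be listed next to the short one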
Edit: I have no idea why Fedora would think the drive is about to fail. There’s nothing in the smart data that would entail such a warning, unless I’m missing something. Maybe it’s reacting to controller issues?
One thing you might want to look at once everything else is working fine is the Load_Cycle_Count. Its VALUE is lower than the Power_On_Hours VALUE, which means you risk wearing out the head ramps before the drive reaches the end of its usable lifetime. Nothing urgent, but worth looking at down the line, perhaps.
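You can keep an eye on the two attributes like this; if the load cycles climb too quickly, turning off aggressive head parking via APM sometimes helps, though I’m not sure every drive honors it (device name is a placeholder):

    sudo smartctl -A /dev/sdX | grep -E 'Power_On_Hours|Load_Cycle_Count'
    sudo hdparm -B 254 /dev/sdX    # highest APM level = least aggressive power saving (only if the drive supports APM)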
Perhaps it’s the one pending sector (which AFAIK is a detected bad sector that is waiting to be remapped). I had a drive with a single pending sector too, but that one was obviously failing (I was getting IO error 5s while getting the data off it, etc.). I also suspect it’s not Fedora.
Is this the empty btrfs drive? If not, my first step would be to copy all the data off the drive if you can, perhaps using ddrescue. After that you can reevaluate whether it’s still usable. I say this because running badblocks or a long test on a failing drive might just push it over the edge (especially a large one).
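A typical ddrescue run looks something like this (paths are placeholders; the mapfile lets you stop and resume):

    sudo ddrescue -n /dev/sdX /mnt/backup/x18.img /mnt/backup/x18.map    # first pass, skip the slow scraping of bad areas
    sudo ddrescue -r3 /dev/sdX /mnt/backup/x18.img /mnt/backup/x18.map   # then go back and retry the bad areas a few times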
If it’s the empty drive, you could do the long SMART test and hope the drive remaps the bad sector and is usable again. I would not be comfortable using this drive without redundancy or backups again, though.
BTW, the IO error 5 is from the kernel, indicating an error at the device level; it has nothing to do with btrfs. The same could happen if you dd’d from/to the device at the block level (I’ve seen that happen before).
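For example, just reading the raw device with dd will trip the same error if it hits an unreadable sector (read-only; device name is a placeholder):

    sudo dd if=/dev/sdX of=/dev/null bs=1M status=progress    # aborts with "Input/output error" (errno 5 / EIO) on a bad read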
@quilt: Any idea how Seagate weighs the Current_Pending_Sector VALUE? I find it weird that the value would be unaffected by a current pending sector. But I can’t say I know the details here. Or that I trust Seagate to do the right thing.
I believe many places in the kernel can issue an IO error (-5). (I’ve never worked with filesystem code, but I did work on other kernel code for several years, so I say this from that experience.) So this error might not mean the same as the one you got (unless it was in the exact same context). That said, I fully agree that the first step should be ensuring working backups of any data OP wants to keep.
This is the older 18TB x18 drive. That had ext4 on it.
The newer 18TB x20 drive is the one that has btrfs on it.
I’m trying to copy all the data from the ext4 drive onto the btrfs one, which is why I’m running the rsync command. All these failures seem to happen during the copy, and most of them seem to be on write. I had run a full scan on the newer x20 drive before I attached it and there were no errors on it. I have a feeling it could be the Mediasonic USB-to-SATA enclosure. I’ve now connected it to my personal machine and am trying it out on Fedora. First I’m running the btrfs scrub command again just to confirm all the data is intact. After that I will run the rsync command again.
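Roughly what I’m running (the mount points are placeholders for wherever the drives are mounted, and the exact rsync flags vary, but something along these lines):

    sudo btrfs scrub start -B /mnt/x20                      # -B stays in the foreground and prints the stats when done
    sudo rsync -aHAX --info=progress2 /mnt/x18/ /mnt/x20/   # copy everything, preserving hard links, ACLs, and xattrs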
@homeserver78 That’s exactly what I’m doing first. Copying everything onto the btrfs x20 drive. After that I will run the scans etc on the older x18 drive
I’ll run the command you mentioned above once the copy is done: smartctl -t long <dev>
So I did some more reading on this, and apparently Seagate’s firmware keeps the VALUE at 100 even with large RAW counts on Current_Pending_Sector, Offline_Uncorrectable, and Reallocated_Sector_Ct, which prevents the “SMART overall-health self-assessment” from failing as it should.
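(For reference, that self-assessment is what smartctl -H reports; device path is a placeholder:)

    sudo smartctl -H /dev/sdX    # prints "SMART overall-health self-assessment test result: PASSED" (or FAILED)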
I also found someone with the same “1” raw count on Current_Pending_Sector and Offline_Uncorrectable, also on a Seagate drive. A very old thread, but still. An interesting read, but no real conclusion there except that on that drive the values do not behave as expected according to the spec.
I would keep a close eye on those values. As long as they don’t increase during use/scans I would probably keep using the drive, making sure to run periodic scrubs and to keep working, up-to-date backups (as should be done with any drive).
BTW, I try to always run smartctl -a <dev> > smartctl-a-<driveid>-<date>.txt on new (to me) drives so I have something to compare later checks to.
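Something like this, for example, pulling the serial number from the drive itself (the device path is a placeholder):

    dev=/dev/sdX
    serial=$(sudo smartctl -i "$dev" | awk '/Serial Number/{print $3}')
    sudo smartctl -a "$dev" > "smartctl-a-${serial}-$(date +%F).txt"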
FWIW I had an 18 TB IronWolf Pro fail SMART a couple years ago with Offline_Uncorrectable = 1. Checked with Seagate and support said RMA the drive, which went without fuss.
I bought this drive from serverpartdeals. I’ll reach out to their customer support in the meantime and check whether it’s worth sending it back.
Since I’m a Windows person, I usually run the surface scan in MiniTool Partition Wizard.
The good thing for now is that, after putting the two drives in the local machine and running the rsync command, I haven’t faced any issues so far. So it looks like the Mediasonic enclosure could be the issue. Might have to build a separate low-powered rig now.
Just wondering… if this x18 drive also ends up having bad sectors, could it be due to high temperatures?
After running the rsync on the mini PC, when I went to take out the drives, both of them were really hot to the touch.
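I guess I can at least keep an eye on the temperatures the drives report over SMART next time, something like (device name is a placeholder):

    sudo smartctl -A /dev/sdX | grep -i temperature    # shows the SMART temperature attribute(s) in °C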