WD20EFRX (WD Red 2TB CMR, NASware 3.0): power-off under write load consistently creates dead blocks?

I have two almost-new WD20EFRX drives (WD Red NASware 3.0, CMR). I was playing around with what various drives do when you pull the power in the middle of an IOP. Do you get old data? New data? A mix? If it’s a mix, where is the boundary? What’s really strange is that after plugging the drive back in after the shutdown, I see the following problems:

[12620.956669] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[12621.090251] ata1.00: ATA-9: WDC WD20EFRX-68EUZN0, 82.00A82, max UDMA/133
[12621.090565] ata1.00: ATA Identify Device Log not supported
[12621.090569] ata1.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 32), AA
[12621.091492] ata1.00: ATA Identify Device Log not supported
[12621.091508] ata1.00: configured for UDMA/133
[12621.091634] scsi 0:0:0:0: Direct-Access     ATA      WDC WD20EFRX-68E 0A82 PQ: 0 ANSI: 5
[12621.091902] sd 0:0:0:0: [sda] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[12621.091907] sd 0:0:0:0: [sda] 4096-byte physical blocks
[12621.091915] sd 0:0:0:0: Attached scsi generic sg0 type 0
[12621.091919] sd 0:0:0:0: [sda] Write Protect is off
[12621.091923] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[12621.091941] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[12621.100913]  sda: sda1
[12621.101190] sd 0:0:0:0: [sda] Attached SCSI disk
[12639.232247] EXT4-fs (sda1): recovery complete
[12639.259366] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: disabled.
[12722.363242] ata1.00: exception Emask 0x0 SAct 0x80e08001 SErr 0x0 action 0x0
[12722.363260] ata1.00: irq_stat 0x40000008
[12722.363267] ata1.00: cmd 60/00:a8:00:70:09/02:00:00:00:00/40 tag 21 ncq dma 262144 in
                        res 41/40:00:60:71:09/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[12722.364041] ata1.00: ATA Identify Device Log not supported
[12722.364832] ata1.00: ATA Identify Device Log not supported
[12722.364843] ata1.00: configured for UDMA/133
[12722.364892] sd 0:0:0:0: [sda] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
[12722.364902] sd 0:0:0:0: [sda] tag#21 Sense Key : Medium Error [current] 
[12722.364908] sd 0:0:0:0: [sda] tag#21 Add. Sense: Unrecovered read error - auto reallocate failed
[12722.364915] sd 0:0:0:0: [sda] tag#21 CDB: Read(10) 28 00 00 09 70 00 00 02 00 00
[12722.364919] blk_update_request: I/O error, dev sda, sector 618848 op 0x0:(READ) flags 0x80700 phys_seg 10 prio class 0
[12722.364957] ata1: EH complete
[12725.963238] ata1.00: exception Emask 0x0 SAct 0xf02000 SErr 0x0 action 0x0
[12725.963244] ata1.00: irq_stat 0x40000008
[12725.963245] ata1.00: cmd 60/08:68:60:71:09/00:00:00:00:00/40 tag 13 ncq dma 4096 in
                        res 41/40:00:60:71:09/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[12725.963976] ata1.00: ATA Identify Device Log not supported
[12725.964689] ata1.00: ATA Identify Device Log not supported
[12725.964691] ata1.00: configured for UDMA/133
[12725.964703] sd 0:0:0:0: [sda] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
[12725.964705] sd 0:0:0:0: [sda] tag#13 Sense Key : Medium Error [current] 
[12725.964707] sd 0:0:0:0: [sda] tag#13 Add. Sense: Unrecovered read error - auto reallocate failed
[12725.964708] sd 0:0:0:0: [sda] tag#13 CDB: Read(10) 28 00 00 09 71 60 00 00 08 00
[12725.964709] blk_update_request: I/O error, dev sda, sector 618848 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[12725.964736] ata1: EH complete
[12729.463163] ata1.00: exception Emask 0x0 SAct 0x3c5000 SErr 0x0 action 0x0
[12729.463182] ata1.00: irq_stat 0x40000008
[12729.463189] ata1.00: cmd 60/08:70:60:71:09/00:00:00:00:00/40 tag 14 ncq dma 4096 in
                        res 41/40:00:60:71:09/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[12729.464412] ata1.00: ATA Identify Device Log not supported
[12729.465405] ata1.00: ATA Identify Device Log not supported
[12729.465416] ata1.00: configured for UDMA/133
[12729.465462] sd 0:0:0:0: [sda] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
[12729.465471] sd 0:0:0:0: [sda] tag#14 Sense Key : Medium Error [current] 
[12729.465478] sd 0:0:0:0: [sda] tag#14 Add. Sense: Unrecovered read error - auto reallocate failed
[12729.465484] sd 0:0:0:0: [sda] tag#14 CDB: Read(10) 28 00 00 09 71 60 00 00 08 00
[12729.465488] blk_update_request: I/O error, dev sda, sector 618848 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[12729.465520] ata1: EH complete

Those are physical read failures. The data is GONE. It’s not giving me old or corrupted data; it’s refusing the reads outright. Consumer SSDs (a Samsung EVO, for example) return a mix of old and new data in 4k-aligned 4k blocks, but never a read failure or a smartctl error. I assumed it was a failing drive and tried the exact same experiment on an identical drive. Now both WD drives have nonzero Current_Pending_Sector counts, potentially reducing the lifetime of the drives. This seems so bizarre to me. I mean, all I did was unplug it. With all the marketing around how physically resilient these drives are, I wouldn’t expect permanent drive degradation from a power outage.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       5
  3 Spin_Up_Time            0x0027   172   170   021    Pre-fail  Always       -       4400
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       70
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   094   093   000    Old_age   Always       -       4830
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       12
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       637
194 Temperature_Celsius     0x0022   116   113   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

Do WD drives have a “feature” where they refuse to return data from blocks they have detected as potentially half-written, for fear of sending “corrupt” data to the OS the way the consumer drive does? If so, how can I get my block back? You can’t have it both ways: you can’t claim the drives are resilient to power failures while they permanently degrade every time there is a power outage under load. I can think of a few ways I could be wrong about what’s going on here:

  • The drive isn’t actually permanently degraded and there is some kind of reset I don’t know about
  • This is actually a really nice feature, and large companies don’t care about power outages because they almost never happen and drives are cheap to them; the data is more important.
  • The filesystem somehow plays a role in all this that I’m not seeing. As far as I can tell, that dmesg output is a physical read failure, not a filesystem error. The only IO on the disk was overwriting a pre-allocated file, so it’s not as if the filesystem could be getting confused; only user data could potentially be corrupted. dentry info was always correct after the outage.
  • Pulling the drive by hand causes enough vibration to physically crash the head into the platter. I’d be surprised, given the supposed build quality of these drives.

I could be convinced to repeat the experiment with block-level IO to rule out the filesystem completely, but I would need a pretty convincing argument. It would take me a couple of days to write the software to do that correctly. I need tools to track expected state and more tools to verify data after power losses.

It probably doesn’t matter, but the IO load in question was random 1M writes within a 1G O_DIRECT | O_SYNC fallocate()’d file on ext4 with default journaling. That is: no fast_commit, metadata is journaled, but user payload isn’t.
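
For the curious, the shape of that workload is roughly the following (a minimal sketch, not the actual test harness; the path, sizes, and fill pattern are illustrative):

```cpp
// Minimal sketch of the workload: random 1 MiB O_DIRECT | O_SYNC writes
// into a preallocated 1 GiB file. Path and sizes are illustrative.
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>
#include <random>

int main() {
    constexpr size_t kFileSize  = 1ull << 30;  // 1 GiB
    constexpr size_t kWriteSize = 1ull << 20;  // 1 MiB

    int fd = open("/tmp/wdred/testing", O_RDWR | O_CREAT | O_DIRECT | O_SYNC, 0644);
    if (fd < 0) return 1;
    if (fallocate(fd, 0, 0, kFileSize) != 0) return 1;  // preallocate all extents

    // O_DIRECT needs an aligned buffer; 4096 covers 512e and 4Kn drives alike.
    void* buf = nullptr;
    if (posix_memalign(&buf, 4096, kWriteSize) != 0) return 1;
    memset(buf, 0xAB, kWriteSize);

    std::mt19937_64 rng{42};
    std::uniform_int_distribution<size_t> pick(0, kFileSize / kWriteSize - 1);

    for (;;) {  // run until the power gets pulled
        off_t off = static_cast<off_t>(pick(rng) * kWriteSize);
        if (pwrite(fd, buf, kWriteSize, off) != static_cast<ssize_t>(kWriteSize))
            return 1;
    }
}
```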

I hate to sound arrogant, but in the interest of keeping the discussion on track: if the previous paragraph didn’t make sense to you, please just DM me your comments or questions first. I’d be happy to discuss, educate, or be educated offline. I tried discussing this elsewhere first but kept getting derailed by “pick a better filesystem” or “caching is bad” comments.

I’m hoping to find some documentation on exactly what it is about a power failure that can cause block read failures.

I’d be willing to share the code I used to run this test if you have a drive you’re willing to risk and want to compare results. It requires Linux 5.10+ and C++17 or higher.

Well that sucks, boo.

Do you remember, or can you check, whether this happens with “advanced format” 4k “sector size”?

If yes, that would make LUKS people unhappy.


Is it refusing writes (to overwrite the bad data) at those offsets as well, or just reads? Is that a permanent bad spot on the disk now?


My understanding is that you send writes, the disk says “ok”, and that’s only an indication that the write requests were enqueued into a buffer to be written at some point. You need to “sync” and wait for “sync” to return to guarantee the blocks are actually on disk.
O_DIRECT just means the kernel won’t cache reads and writes; it makes no claims about the drive’s own caching.

I.e., all writes are always async but ordered within their command queue, and the filesystem (or whatever is using the block device) needs to handle read errors gracefully despite having poor insight into the causes of those errors.
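
In code, the distinction looks roughly like this (a sketch; the helper name is mine):

```cpp
// Sketch: durability comes from fsync() returning, not from write() returning.
#include <unistd.h>

bool write_durably(int fd, const void* data, size_t len, off_t off) {
    // pwrite() succeeding only means the kernel accepted the data.
    if (pwrite(fd, data, len, off) != static_cast<ssize_t>(len)) return false;
    // fsync() waits for the data to reach the device; on Linux it also issues
    // a cache-flush command so the drive's own write cache gets drained.
    return fsync(fd) == 0;
}
```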


Re: enterprise use, there are different sizes of enterprise; hyperscalers get to choose the behavior they want from the firmware.

I don’t know about smaller enterprises.

Do you remember, or can you check, whether this happens with “advanced format” 4k “sector size”?

Can’t check. The Samsung 840 EVO only supports 512. The WD Red has 4k physical sectors but presents 512-byte logical sectors (512e), which is how it ran during my tests. mkfs.ext4 creates a 4k filesystem by default anyway; I don’t know how it decides.
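
If you want to see what the kernel reports, the block-size ioctls will tell you (a quick sketch; the device path is whatever your drive is):

```cpp
// Sketch: ask the kernel for the logical vs. physical sector size it sees
// (the same numbers mkfs.ext4 consults). Device path is illustrative.
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   // BLKSSZGET, BLKPBSZGET
#include <cstdio>

int main(int argc, char** argv) {
    int fd = open(argc > 1 ? argv[1] : "/dev/sda", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    int logical = 0;
    unsigned int physical = 0;
    ioctl(fd, BLKSSZGET, &logical);    // 512 on a 512e drive like this WD Red
    ioctl(fd, BLKPBSZGET, &physical);  // 4096 physical, per the dmesg above
    printf("logical=%d physical=%u\n", logical, physical);
    close(fd);
    return 0;
}
```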

Is it refusing writes (to overwrite the bad data) at those offsets as well, or just reads? Is that a permanent bad spot on the disk now?

I hadn’t thought of that! I repeated the test. After the power outage I tried the following:

$ cat /tmp/wdred/testing > /dev/null
I/O Error
$ dd if=/dev/urandom bs=512 count=204800 oflag=direct of=/tmp/wdred/testing
No error
$ cat /tmp/wdred/testing > /dev/null
No Error # <------ !!!!

So overwriting the blocks either reset them, or the firmware remapped them and gave me new physical ones. smartctl seems to imply it reset them:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       17
  3 Spin_Up_Time            0x0027   173   172   021    Pre-fail  Always       -       4316
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       76
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   094   093   000    Old_age   Always       -       4764
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       40
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       21
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       638
194 Temperature_Celsius     0x0022   122   111   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

Reallocated_Sector_Ct has always been zero. I’m not sure what to make of Current_Pending_Sector (it went from 1 to 4).

Hah, so it’s a feature. If they were really bad, they’d turn into reallocated on a subsequent write.


With flash storage, the FTL in the controller will do a bunch of stuff, including read-modify-write of much larger pages when you have small writes.

So, if you end up with a partial write, at worst the FTL may be tracking some half-written page somewhere, as a page that may need to be trimmed.

This was not the case with hard drives, historically. They come from an era of much weaker microcontrollers.


With HDDs there’s a checksum at the end of each sector that the drive firmware computes. E.g., when you do a 512-byte write, magnetically about 528 bytes get recorded onto the platter, plus some preambles and trailers between sectors for alignment and safety.

With 4k drives it’s the same idea; you get more capacity on account of wasting less space on preambles and trailers between sectors. I don’t know offhand whether they use the same checksumming or something different.

Based on the behavior you’ve been seeing, it sounds like the drive is detecting an internal checksum mismatch on a partially written sector, because simply overwriting the sector makes it good again.

IMHO that sounds like a useful feature overall.


From an ext4 perspective: if these blocks (and the underlying sectors) are unallocated, they’re unreachable, and you’re good.

Anything that’s journaled, e.g. a database with a transaction log or something, is likely to replay committed transactions, roll back uncommitted ones, and overwrite these bad blocks on next startup.

ext4 metadata itself is journaled, so I guess the only problem is if you run into a bad block while trying to read the data, e.g. while creating a backup of the whole system with rsync.

Reading the entire disk just to find them makes no sense.

I don’t know whether this is something that fsck can correct on next boot: how would it know which files were recently written to? Maybe from the metadata journal?

I don’t know whether SMART keeps the LBA offsets of the pending sectors, but if it does, you could mount ext4 read-only, feed those offsets into dd, and just zap the affected blocks with zeros.
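
That is, something along these lines, which is what the dd incantation boils down to (a sketch; zap_lba is a made-up name, and you’d want to be very sure the block isn’t in use first):

```cpp
// Sketch: zero one 4 KiB physical sector at a known-bad LBA, straight to the
// raw device. Be certain the block is unallocated before doing this.
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

bool zap_lba(const char* dev, off_t lba) {  // lba in 512-byte units
    int fd = open(dev, O_WRONLY | O_DIRECT);
    if (fd < 0) return false;
    void* buf = nullptr;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return false; }
    memset(buf, 0, 4096);
    // One 4 KiB physical sector = eight 512-byte logical sectors.
    bool ok = pwrite(fd, buf, 4096, lba * 512) == 4096;
    free(buf);
    close(fd);
    return ok;
}
```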

I do know that zfs and btrfs shouldn’t run into this issue, because they do this kind of accounting, albeit with a performance penalty.


I’m thinking from the perspective of doing your own fsck kind of thing: is the write pattern predictable in any way?

How many of these 1G files are you writing to at any given time?

Is there anything within these 1M blocks that might help recovery? E.g., MySQL InnoDB has 16KiB pages spanning four 4k filesystem blocks on ext4; there’s a page id and a page checksum in the first and last block of a 16k page, which helps detect corruption.
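
Roughly, that check looks like this (a simplified sketch, not InnoDB’s actual page format; the field offsets are illustrative):

```cpp
// Simplified sketch of a torn-page check: a 16 KiB page carries a checksum
// near its start and a copy near its end; a mismatch after a crash means the
// page's 4k blocks came from different writes.
#include <cstdint>
#include <cstring>

constexpr size_t kPageSize = 16 * 1024;

bool page_is_torn(const uint8_t* page) {
    uint32_t head = 0, tail = 0;
    memcpy(&head, page, sizeof head);                           // in first 4k block
    memcpy(&tail, page + kPageSize - sizeof tail, sizeof tail); // in last 4k block
    return head != tail;
}
```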

Hah, so it’s a feature. If they were really bad, they’d turn into reallocated on a subsequent write.

Yeah, it’s looking like that’s the case.

I’m thinking from the perspective of doing your own fsck kind of thing: is the write pattern predictable in any way?

This is from a journaled-database perspective. I guess if I were reading a journal and got an IO error from this type of drive, I could just stop replay there and assume the job is done.
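
Something like this is what I have in mind (a sketch assuming fixed-size records; apply_record is a placeholder):

```cpp
// Sketch: replay fixed-size journal records until a read error, treating
// EIO as the torn tail of the log.
#include <unistd.h>
#include <cerrno>
#include <cstdint>

bool apply_record(const uint8_t* rec, size_t len);  // real replay logic elsewhere

void replay(int fd, uint8_t* buf, size_t rec_len) {
    off_t off = 0;
    for (;;) {
        ssize_t n = pread(fd, buf, rec_len, off);
        if (n < 0 && errno == EIO) break;              // pending sector: assume torn tail
        if (n < static_cast<ssize_t>(rec_len)) break;  // clean end of journal
        if (!apply_record(buf, rec_len)) break;        // invalid record: stop replay
        off += static_cast<off_t>(rec_len);
    }
}
```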

Thanks for your insight. Re-writing the block and retrying the read was key.

Is there anything within these 1M blocks that might help recovery?

This is purely test data for exploration purposes. There is nothing to recover.

I’m willing to bet there’s no POSIX filesystem standard on this but…

The typical open(…) write(…) close(…) workload, where you just append to a file, shouldn’t normally cause this problem: ext4 should prevent you from reading past end of file, i.e. you should be protected from being able to reach half-written sectors just by virtue of the file size being journaled.

If you preallocate space, you’re waiving that protection.

You could alternate between a pair of journal files: write to one, and when it reaches some size (and also on every startup), switch to the other one and delete the previous one.
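
Roughly this scheme (a sketch; the file names and the 64 MiB cap are illustrative):

```cpp
// Sketch: append to one journal file until it hits a size cap, then switch
// to the other and delete the old one.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <string>

class AlternatingJournal {
    std::string names_[2] = {"journal.a", "journal.b"};
    int   active_ = 0;
    int   fd_     = -1;
    off_t size_   = 0;
    static constexpr off_t kCap = off_t{64} << 20;  // 64 MiB per file

public:
    bool open_fresh() {  // call once on startup, and on every rotation
        fd_ = open(names_[active_].c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
        size_ = 0;
        return fd_ >= 0;
    }
    bool append(const void* rec, size_t len) {
        if (size_ + static_cast<off_t>(len) > kCap) {
            close(fd_);
            active_ ^= 1;                          // switch to the other file
            if (!open_fresh()) return false;
            remove(names_[active_ ^ 1].c_str());   // drop the previous journal
        }
        if (write(fd_, rec, len) != static_cast<ssize_t>(len)) return false;
        size_ += static_cast<off_t>(len);
        return fsync(fd_) == 0;  // make the record durable before returning
    }
};
```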

One more interesting development. Both drives are now failing their Extended Offline tests in almost exactly the same way:

Drive 1:
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      4818         2748248
# 2  Short offline       Completed without error       00%      5142         -

Drive 2:
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      4911         2220184
# 2  Short offline       Completed without error       00%      5142         -

Cool… I wonder if the block is in use: see the smartmontools BadBlockHowto.
