My Optane System Drive Finally Failed and I Need to Assess Recoverability

My Intel Optane P4800X has finally failed to the point where certain operations cause the controller to reset. I need to remove it from the system completely before booting—even off a recovery USB thumb drive.

I’ve found that some sectors can still be read. The system can read enough to enumerate the partition table and show each partition’s file system type and label. I have a 512 MiB EFI partition, a 1 GiB boot partition, and then a LUKS partition occupying the remainder (sectors 393222–366284640). dd if=… conv=sync,noerror bs=4096 skip=393222 count=4096 successfully reads the first 16 MiB of the LUKS partition, which covers the entire LUKS header, and cryptsetup luksDump … confirms the header is intact.
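
To spell that out (a sketch; the whole-disk device path is a stand-in for mine), the header check amounts to saving the readable head of the partition and dumping it offline:

dd if=/dev/nvme0n1 of=luks-head.img conv=sync,noerror bs=4096 skip=393222 count=4096
cryptsetup luksDump luks-head.img    # cryptsetup attaches the image file via a loop device itself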

It would be easy if I could simply clone the LUKS partition in whole or in part, but I’m using OPAL, which means I cannot even read the bytes until the hardware has unlocked the region. Which region? Everything following the LUKS header, up to the end of the drive (excluding the last 5 sectors). cryptsetup open … system-crypt doesn’t fail as a command, but the drive controller resets immediately afterward. dmesg has some clues:

  • AER: Multiple Uncorrectable (Fatal) error message received from …
    • AER: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
  • frozen state error detected, reset controller
  • ctrl state 6 is not RESETTING
  • Disabling device after reset failure: -19
  • AER: device recovery successful
  • I/O Cmd(0x2) @ LBA 366284624, 1 blocks, I/O Error (sct 0x2 / sc 0x86) DNR
    • critical medium error, dev …, sector 2930276992 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0

Whatever cryptsetup or GNOME Disks is doing seems to reset the drive, leaving it in a locked state. Mounting any partition with GNOME Disks causes an I/O error, while mount -o ro … works just fine. I have several ideas in mind:

  1. Use the SED utility with the admin key to unlock the region. (This should be sneaky enough to avoid drawing the attention of udev.)
  2. Assuming the low-level unlock doesn’t also cause a controller reset, attempt to copy the bytes as an image file to a working storage device. I wish I could simply mount a file system, but within the encrypted region is an LVM PV with several LVs and I don’t know the “geometry.”
  3. Recover using the image and a spare drive. (A sketch of the whole flow follows below.)
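
Roughly what I have in mind, assuming sedutil-cli as the SED utility and GNU ddrescue for the imaging (device paths, the locking-range index, the admin password, and the volume-group name are all placeholders):

# 1. Unlock the locking range at the OPAL level, without tipping off udev:
sedutil-cli --listLockingRanges AdminPassword /dev/nvme0n1
sedutil-cli --setLockingRange 1 RW AdminPassword /dev/nvme0n1

# 2. Image the LUKS partition to a healthy disk, keeping a map of bad spots for retries:
ddrescue --idirect /dev/nvme0n1p3 /mnt/scratch/luks.img /mnt/scratch/luks.map

# 3. Recover from the image: open LUKS over a loop device and let LVM do the rest.
losetup --find --show /mnt/scratch/luks.img    # prints e.g. /dev/loop0
cryptsetup open /dev/loop0 system-crypt
vgchange --activate y                          # activates the PV/LVs inside; no “geometry” needed
mount -o ro /dev/mapper/vg-home /mnt/recover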

My last backup was just over six months ago. Let that be a lesson learned… :sweat_smile:

Ah weird. It works now. (I’m on my seventh reboot.) Successfully unlocked, just imaged the home partition, and rebooted back into the system as if the drive never had a problem.

That gave me a good scare.

1 Like

I hope you already pulled everything into some form of backup, so that if the drive fails again (or for good) you end up with a slightly more up-to-date backup :slight_smile:

2 Likes

Compressing and encrypting the image in the background right now. :slightly_smiling_face:

zstd --compress --ultra --long -22 --single-thread --zstd=strategy=9,hashLog=30,chainLog=30 --stdout --verbose 'Home (2025-06-29).img' | gpg --symmetric --cipher-algo AES256 --digest-algo SHA512 --compress-algo None --output 'Home (2025-06-29).img.zst.gpg'
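
(And for future me: the matching restore pipeline is just the mirror image; zstd needs --long again on the way back so the decompression window is large enough:)

gpg --decrypt 'Home (2025-06-29).img.zst.gpg' | zstd --decompress --long --stdout > 'Home (2025-06-29).img'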

1 Like

Perfect! Just making sure, because I’ve been sidetracked by the joy of a drive showing up again before, and it was a mistake :laughing:

1 Like

This whole ordeal has me rethinking my assumption that enterprise hardware is less likely to fail… :disappointed:

It’s incredibly ironic because I had just posted this last week:

I want bcachefs to succeed where btrfs could not. It’s what I’ve been waiting for, for a very long time:

  • Hot reads and writes to a striped (RAID-0) Optane array
  • Cold data to a Zstandard-compressed, RAID-6 NAND array
  • Automatic snapshots sent off-site on a schedule

It’s a clumsy setup to approximate with mdadm + bcache + ext4, and closing the write hole with mdadm journalling exacts a performance and write-amplification penalty, though not as much as emulating integrity metadata with dm-integrity.
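
What I’m hoping for is that all of the above collapses into a single format call, roughly like this sketch (device paths are placeholders; the RAID-6 piece would be bcachefs’s erasure coding, which is still experimental, so I’ve left it out):

bcachefs format \
  --label=hot.optane0 /dev/nvme0n1 \
  --label=hot.optane1 /dev/nvme1n1 \
  --label=cold.nand0 /dev/sda \
  --label=cold.nand1 /dev/sdb \
  --foreground_target=hot \
  --promote_target=hot \
  --background_target=cold \
  --background_compression=zstd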

2 Likes

I tested a bcache-based installation on my parents’ system for about six months, and it failed yesterday (July 4th, 2025): the Intel Optane Memory Series SSD (16 GB) started failing writes. With two SSDs choking within two weeks, it looks like these Optane drives aren’t much more invincible than regular NAND SSDs after all.

It took about six hours to fix because some shoddy USB-C-to-USB-A adapters I had left at their house (from AliExpress) kept making the USB flash drive holding the disk-image backups drop out mid-restore. In the meantime, the OS lives on the Samsung SSD without an Intel Optane cache layer.

These failures have me slightly worried, as I’m about to include some Optane drives in a “20-year server” deployment.

Could you tell what was wrong with the M10s? Had they just gone beyond their write endurance?

I tried looking into the drive’s error log, but nothing turned up. dmesg | grep -i error showed I/O errors writing to nvme0n1 (the device it was exposed as). Oddly, GNOME Disks’ benchmark tool was able to read and write across the entire drive without choking after I nuked the bcache superblock from the device.
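
(By “nuked the bcache superblock” I mean roughly the following; the device path is as it enumerated for me, and the superblock of a whole-device bcache member sits 4 KiB in:)

wipefs --all /dev/nvme0n1    # removes all known signatures, bcache’s included
# …or, surgically, zero only the 4 KiB superblock region:
dd if=/dev/zero of=/dev/nvme0n1 bs=4096 seek=1 count=1 oflag=direct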

I intend to finish diagnostics later; I don’t have physical or network access to the system right now to check.

I doubt they went past their endurance as they were new when I installed them six months ago, and the system had only been used for web browsing and some occasional gaming.

1 Like

When you say they were “new” - where were they sourced from?

eBay (model number MEMPEI1J016GAL)

I checked the usage on them before deploying. I bought them in bulk so I have another to check right now:

nvme smart-log /dev/nvme…n1 output on first insertion:

Smart Log for NVME device:nvme…n1 namespace-id:ffffffff
critical_warning			: 0
temperature				: 30 °C (303 K)
available_spare				: 100%
available_spare_threshold		: 0%
percentage_used				: 0%
endurance group critical warning summary: 0
Data Units Read				: 24 (12.29 MB)
Data Units Written			: 0 (0.00 B)
host_read_commands			: 2705
host_write_commands			: 0
controller_busy_time			: 0
power_cycles				: 5
power_on_hours				: 0
unsafe_shutdowns			: 0
media_errors				: 0
num_err_log_entries			: 0
Warning Temperature Time		: 0
Critical Composite Temperature Time	: 0
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 0
Thermal Management T1 Total Time	: 0
Thermal Management T2 Total Time	: 0
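
(For anyone decoding the figures: NVMe counts a “data unit” as 1,000 × 512-byte blocks, so 24 units is 24 × 512,000 B ≈ 12.29 MB, matching nvme-cli’s annotation.)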

1 Like

Thanks for checking.

I also recently bought a handful (P5800X, P4800X, P5801X) from eBay and haven’t had any issues. But seeing so many of them, all from the same place (literally all from China these days), with a pretty wide spread of prices for the same model despite identical packaging and “new” condition… it makes me wonder.

So I’m always asking others their experiences.

Hopefully just my paranoid imagination :slight_smile:

One should always be paranoid. :slightly_smiling_face:


Out of curiosity, I decided to double-check the Intel Optane P4800X again.

nvme smart-log /dev/nvme0n1:

Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning			: 0
temperature				: 40 °C (313 K)
available_spare				: 100%
available_spare_threshold		: 0%
percentage_used				: 0%
endurance group critical warning summary: 0
Data Units Read				: 60128316 (30.79 TB)
Data Units Written			: 36409553 (18.64 TB)
host_read_commands			: 1945525416
host_write_commands			: 450523694
controller_busy_time			: 429
power_cycles				: 403
power_on_hours				: 12806
unsafe_shutdowns			: 77
media_errors				: 0
num_err_log_entries			: 1738
Warning Temperature Time		: 0
Critical Composite Temperature Time	: 0
Thermal Management T1 Trans Count	: 0
Thermal Management T2 Trans Count	: 0
Thermal Management T1 Total Time	: 0
Thermal Management T2 Total Time	: 0

Zero media errors? :face_with_raised_eyebrow:

I can only guess… either the controller is failing, or some bad shutdown corrupted the T10 protection (integrity) metadata. I’m consistently getting read errors from very specific sectors on boot now. I’ll probably get to the bottom of it this weekend.
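
If T10 protection information is the culprit, nvme-cli can at least show whether it’s enabled on the namespace, and what the controller has been logging (those 1738 error-log entries ought to name the failing LBAs):

nvme id-ns /dev/nvme0n1 --human-readable    # the "dps" field shows the active protection type
nvme error-log /dev/nvme0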

For troubleshooting the drive, you could use software that checks the read speed of each sector. On every NAND SSD I’ve watched fail, sectors’ read speeds tended to slow down before they actually went bad.
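
If you don’t have a tool handy, a quick loop gets most of the way there (a sketch; DEV is a placeholder, and it samples 1 GiB chunks rather than individual sectors):

DEV=/dev/nvme0n1
SIZE=$(blockdev --getsize64 "$DEV")
CHUNK=$((1 << 30))    # 1 GiB per sample
for ((off = 0; off < SIZE; off += CHUNK)); do
    t0=$(date +%s%N)
    dd if="$DEV" of=/dev/null bs=1M count=1024 skip=$((off >> 20)) iflag=direct status=none
    t1=$(date +%s%N)
    echo "$off $(( CHUNK * 1000 / (t1 - t0) )) MB/s"    # sustained read speed of this chunk
done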