Ubuntu update bricked my SSD

I’m having quite a peculiar problem with my NVMe SSD that I haven’t been able to solve on my own; perhaps someone here can help.

Here’s a TL;DR of what happened: I was updating my Linux Mint install when the system suddenly went into read-only mode. I went into recovery mode and ran ‘dpkg --configure -a’, but after booting up again the OS immediately went into read-only mode once more.
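For reference, the recovery-mode steps were roughly this (the remount line is from memory, so it may not be exact):

# remount the root filesystem read-write so dpkg can write to it
mount -o remount,rw /
# finish configuring any half-installed packages from the interrupted update
dpkg --configure -a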

This time I booted from a live usb and ran:

mint@mint:~$ sudo e2fsck -cfpv /dev/mapper/vgmint-root 
/dev/mapper/vgmint-root: Updating bad block inode.

      579180 inodes used (0.47%, out of 122011648)
        2162 non-contiguous files (0.4%)
        1290 non-contiguous directories (0.2%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 469942/772/2
   239252046 blocks used (49.02%, out of 488031232)
         125 bad blocks
          35 large files

      402062 regular files
       67201 directories
          55 character device files
          25 block device files
           0 fifos
       55787 links
      109812 symbolic links (108360 fast symbolic links)
          16 sockets
------------
      634958 files

Also:

mint@mint:~$ sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 38 C
available_spare                     : 100%
available_spare_threshold           : 5%
percentage_used                     : 2%
data_units_read                     : 88660714
data_units_written                  : 60548722
host_read_commands                  : 845681396
host_write_commands                 : 584031973
controller_busy_time                : 2137
power_cycles                        : 1705
power_on_hours                      : 16467
unsafe_shutdowns                    : 35
media_errors                        : 5261
num_err_log_entries                 : 8061
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count   : 31
Thermal Management T2 Trans Count   : 24
Thermal Management T1 Total Time    : 14467
Thermal Management T2 Total Time    : 672

One of the packages I updated was ‘linux-firmware’, which I suspect is what caused all this.

I’ve also tried running ‘sudo nvme error-log /dev/nvme0’, but it only gives me 63 entries, all of which are basically empty.


What errors do you get when you read-write mount it? Something in messages or dmesg?
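For example, something along these lines should show whatever is forcing the remount (the grep pattern is just a guess at the relevant keywords):

# kernel messages around the moment the filesystem flips to read-only
dmesg -T | grep -iE 'nvme|ext4|i/o error|read-only'
# or the same from the journal for the current boot
journalctl -k -b | grep -iE 'nvme|ext4|i/o error'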

How about a block-by-block test of the partition, like:
pv /dev/mapper/vgmint-root > /dev/null

I can mount it in read/write, but soon after something triggers it to go into read-only mode.

Ran the test; about halfway through it gives me “pv: /dev/mapper/vgmint-root: read failed: Input/output error”.

Well, that’s something to work with. Is it identifying your device as the correct size? Can you read ANYTHING past that halfway mark? (dd with a skip= parameter or ddrescue should help you find out.) If it’s just a few bad blocks that can’t be read, you can overwrite them and your drive will be good to go.
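Something along these lines, for example (the skip/count values are placeholders; start near where pv died):

# spot-check a read well past the failing region (bs=1M, so skip/count are in MiB)
sudo dd if=/dev/mapper/vgmint-root of=/dev/null bs=1M skip=1000000 count=2048 status=progress
# or map every unreadable region with ddrescue; the map file records what failed
sudo ddrescue --force --no-scrape /dev/mapper/vgmint-root /dev/null bad.map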

The easy answer is to backup all your data from the drive elsewhere, then wipe the drive and restore everything.
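As a rough sketch, assuming you have a big enough backup disk already mounted at /mnt/backup (the device and mount point names are just examples):

# copy everything off while the damaged filesystem is mounted read-only
sudo mkdir -p /mnt/root
sudo mount -o ro /dev/mapper/vgmint-root /mnt/root
sudo rsync -aHAX /mnt/root/ /mnt/backup/root-backup/
sudo umount /mnt/root
# wipe: discarding the whole LV lets the SSD retire the bad flash, then rebuild the fs
sudo blkdiscard /dev/mapper/vgmint-root
sudo mkfs.ext4 /dev/mapper/vgmint-root
# restore
sudo mount /dev/mapper/vgmint-root /mnt/root
sudo rsync -aHAX /mnt/backup/root-backup/ /mnt/root/

Since this is your root filesystem, you may also need to update the UUID in /etc/fstab and regenerate the grub config (or reinstall grub from a chroot) before it will boot again.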

smartctl -x /dev/nvme0 may give some insight into whether the drive hardware is failing.

smartctl output:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        33 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    2%
Data Units Read:                    90,621,754 [46.3 TB]
Data Units Written:                 60,548,869 [31.0 TB]
Host Read Commands:                 853,407,535
Host Write Commands:                584,035,810
Controller Busy Time:               2,142
Power Cycles:                       1,723
Power On Hours:                     16,530
Unsafe Shutdowns:                   35
Media and Data Integrity Errors:    5,275
Error Information Log Entries:      8,116
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Thermal Temp. 1 Transition Count:   31
Thermal Temp. 2 Transition Count:   24
Thermal Temp. 1 Total Time:         14467
Thermal Temp. 2 Total Time:         672

Error Information (NVMe Log 0x01, max 63 entries)
No Errors Logged

I’ve also run badblocks and it reported 610 bad blocks.
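For reference, it was just a read-only scan, something like:

# read-only badblocks scan; -s shows progress, -v lists each bad block found
sudo badblocks -sv /dev/mapper/vgmint-root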

How can I overwrite specific bad blocks?

Found this after a bit of searching:

hdparm --write-sector <sector> --yes-i-know-what-i-am-doing /dev/nvme0

Would this do it?

You have no idea what you are doing. This is asking for trouble. If there are any valuable files on that disk, make a backup now!

You can’t, not without severe risk of corrupting your file system. A backup, full wipe, and restore is the best option. It’s good to know it isn’t the whole second half of your drive being inaccessible or something like that, and it rules out any way I can think of that your system update could have caused the drive to malfunction.

If you really want to try something, you can mount the partition rw, create a huge file that fills up all unused space, then erase it, sync, and fstrim, and see how your drive fares in subsequent tests. It might fix several issues, but it might not.
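Roughly like this, assuming the filesystem is mounted read-write at /mnt/root (the filler filename is arbitrary):

# fill all free space so every free block gets written (dd stopping with "No space left" is expected)
sudo dd if=/dev/zero of=/mnt/root/fillfile bs=1M status=progress
# delete the filler, flush, and tell the SSD those blocks are free again
sudo rm /mnt/root/fillfile
sync
sudo fstrim -v /mnt/root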

Seems like quite a few Media and Data Integrity Errors there. Your drive should work again after a wipe, but it is now suspect, so I would recommend frequent backups and weekly Tripwire-style integrity checks of all files on it, to be sure data isn’t being silently corrupted going forward.
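If you don’t want to set up Tripwire or AIDE, even a plain checksum manifest gets you most of the way. A minimal sketch, assuming your data lives under /home and the manifest is kept outside the scanned tree:

# build a checksum manifest of every file under /home
sudo find /home -xdev -type f -print0 | sudo xargs -0 sha256sum > /var/tmp/manifest.sha256
# a week later: re-verify; --quiet prints only files that no longer match
sudo sha256sum --quiet --check /var/tmp/manifest.sha256

Files you have legitimately changed in the meantime will show up too, so expect some noise; anything you know you haven’t touched that fails the check is the red flag.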