NVMe + Linux stability issues

Hi,

I am looking for some help/advice to troubleshoot a stability issue I am having with a Samsung 960 Pro SSD on Ubuntu 18.04.

Machine Specs

  • i7 8700K
  • Gigabyte Z370m-DS3H
  • 16GB RAM
  • Samsung 960 Pro 512GB

Issue description

  • Ubuntu 18.04 64 bit server OS “hangs” with the following errors in dmesg
    INFO: task jbd2/nvme0n1p2-432 blocked for more than 120 seconds (see image attached)
  • I can consistently reproduce the issue by running the Unixbench tool
  • I tried disabling power saving for NVMe via the command below, without any success
    sudo nvme set-feature -f 0x0c -v=0 /dev/nvme0
  • Ubuntu 18.04 and 16.04 desktop versions hang during the installation process; only 18.04 server version installed without hanging
  • Ubuntu 18.04 server is stable when I switch out the 960 Pro with a SATA III M.2 drive (Transcend MTS400 256GB SSD)
  • This system (with the 960 Pro) was stable in Windows 10 under Aida64 stress test
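
For anyone following along, one way to confirm whether the power-saving change took effect is to read the kernel's APST latency limit. This is only a sketch: it assumes the nvme_core module exposes its default_ps_max_latency_us parameter under /sys/module on your kernel, and the function name is made up for illustration. A value of 0 means the kernel will not enable APST.

```shell
#!/bin/sh
# Report whether the kernel is allowed to enable NVMe APST.
# Assumption: the parameter file below exists on your kernel.
apst_status() {
    param_file="${1:-/sys/module/nvme_core/parameters/default_ps_max_latency_us}"
    if [ ! -r "$param_file" ]; then
        echo "unknown"
        return 1
    fi
    latency="$(cat "$param_file")"
    if [ "$latency" = "0" ]; then
        echo "disabled"
    else
        echo "enabled (max latency ${latency} us)"
    fi
}
```

Call `apst_status` with no arguments on the affected machine. The commonly suggested way to turn APST off at boot is to add `nvme_core.default_ps_max_latency_us=0` to the kernel command line, and `sudo nvme get-feature -f 0x0c -H /dev/nvme0` shows what the controller itself reports.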

Any advice would be much appreciated :slight_smile:

Check for firmware updates for the SSD. Some of the 800 series Samsung SSDs were known to not play nicely with ACPI and TRIM on GNU/Linux. That was mostly fixed with firmware updates by Samsung. I don’t know if the 900 series has/had the same issue.

How would one check and update the firmware on a drive in Linux?

Edit: I was guessing you could view and update firmware revisions through hdparm, and found the following after a brief search. Does it look about right?
It was specifically for Seagate HDDs, but I’m presuming it could be adapted?

Might be a dumb question, but why are you using these specs for Ubuntu Server? Just wondering.

I am setting it up as a build server for AOSP builds :slight_smile:

post your partition table

I am doing some stability tests at the moment with Ubuntu 18.04 server on a VM (Windows host), so I don’t have the problem environment available to capture the partition table right now.
Await my findings from the VM experiment…

Coming from the AUR world, I’m not sold on using Ubuntu Server. I have been trying to use it for about a month, and the documentation is lacking. I would also check the #ubuntu IRC channel to solve your problem.

You can try some of the suggestions from here. You most likely need to disable discards, and maybe play around with powersaving settings. https://wiki.archlinux.org/index.php/Solid_State_Drive/NVMe
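
To see whether discards are being issued continuously, you can look for the `discard` mount option. A minimal sketch, assuming the usual /proc/mounts layout (device, mount point, filesystem type, options); the function name is just for illustration:

```shell
#!/bin/sh
# List mounted filesystems that use the continuous "discard" mount option.
# Input format is that of /proc/mounts: device mountpoint fstype options dump pass
mounts_with_discard() {
    mounts_file="${1:-/proc/mounts}"
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "discard") { print $1, $2; break }
    }' "$mounts_file"
}
```

Anything it prints is mounted with online discard; removing `discard` from that entry in /etc/fstab and remounting turns it off. `systemctl status fstrim.timer` shows whether periodic trimming is scheduled instead.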

Easiest is to use the manufacturer’s tools. Crucial provides a GNU/Linux boot image.
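
Before updating anything, it helps to know which firmware the drive is running now. As far as I know hdparm talks ATA, so it is not the right tool for an NVMe drive. The sketch below reads the values from sysfs instead, assuming the controller shows up as nvme0 and that your kernel exposes the `model` and `firmware_rev` attributes there; the function name is made up.

```shell
#!/bin/sh
# Print model and firmware revision for one NVMe controller.
# Assumption: the controller is nvme0 and sysfs exposes "model" and "firmware_rev".
nvme_fw_info() {
    ctrl_dir="${1:-/sys/class/nvme/nvme0}"
    # sysfs may pad these fields with trailing spaces, so trim them
    model="$(sed 's/[[:space:]]*$//' "$ctrl_dir/model")"
    fw="$(sed 's/[[:space:]]*$//' "$ctrl_dir/firmware_rev")"
    echo "$model firmware $fw"
}
```

With nvme-cli installed, `sudo nvme id-ctrl /dev/nvme0` reports the same revision in its `fr` field.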

I’m running a 960 Pro on Arch without any issues. I guess try installing Manjaro or Antergos, and see if the issue clears up.

I had this exact issue with a drive that had not been trimming correctly. After doing a secure erase (the drive got EXTREMELY hot) and letting it sit overnight, everything was just peachy.

I think the situation was that at some point I had used something on it that wasn’t trimming properly. Later I switched to something that was, but the current software assumes that the previous software already issued its trim commands, so it only trims space that it has itself written and then deleted.

In other words, only the parts of the drive you have erased have been trimmed, not all the space from all of the past history of the device. So the OS may be assuming the “already empty” space when it started was in a proper ready-for-writing trimmed state.

Remember, Samsung’s firmware is the kind that is NTFS/FAT32 aware, to try to mitigate some of this when the drive’s internal trim table gets out of sync with the OS for whatever reason.

You can try issuing a command to trim ALL deleted space and see if that helps; otherwise use the manufacturer’s utility to reformat and condition the drive.
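
On Linux, trimming all free space on every mounted filesystem that supports it can be done with fstrim from util-linux. A minimal sketch: the wrapper name and the FSTRIM override variable are made up here so the command line can be checked without touching a real drive, and the real thing needs root.

```shell
#!/bin/sh
# Trim unused blocks on all mounted filesystems that support discard.
# FSTRIM is a hypothetical override used only for testing the command line;
# by default the real fstrim binary is run.
trim_all() {
    "${FSTRIM:-fstrim}" --all --verbose
}
```

On a real system this amounts to `sudo fstrim --all --verbose`, which prints how many bytes were trimmed per mount point.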

I also experienced something similar, but not to this extent, on a Surface Pro device which was ultimately returned to MS due to a defective SSD. It was a Samsung NVMe in that case.

I have never had issues with Crucial drives, but they don’t make an NVMe ssd yet.

I have had problems with the Samsung 800 series and returned every one of them.

I am using GNU/Linux though.

Thank you all for the advice.

Wendell’s answer made the most sense. It best matches my observations. But I haven’t solved my problem yet.
@wendell What tool did you use to secure erase the drive? And what is the reason to leave the device overnight?
I tried secure erasing with Samsung’s Magician software and with Parted Magic (and also DiskPart clean all). That made no difference to the stability. But I didn’t leave the machine idling overnight.

A few other updates/clues:

  • Windows continues to be stable; I can run Ubuntu inside a VM on Windows without any issues
  • I used my 960 Pro SSD on a friend’s Ryzen 1700/X370 system and Ubuntu was stable. So it has something to do with my MB/SSD combo.
  • I reached out to Gigabyte support and they advised me to enable CSM and set the storage boot option control to Legacy. Gigabyte support sent screenshots of them installing Ubuntu on a Samsung NVMe drive (960 EVO 1TB) without any issues. However, it didn’t make any difference for me.
  • Antergos was as unstable as Ubuntu; I haven’t tried vanilla Arch yet

Did you take a look at the Arch Wiki link that Zerophase provided? If you have not, then I would recommend it. Your installation may have tried to be smart or conservative on some options that could be causing issues, like the scheduler and whatnot. If you do not see anything in dmesg or journalctl, then you either have some defective hardware or you are missing some important configuration somewhere.
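
To cut the kernel log down to the interesting lines, something like the filter below can help. It is only a sketch: the function name is made up, and the pattern simply matches anything mentioning NVMe plus the "blocked for more than N seconds" warning quoted in the first post.

```shell
#!/bin/sh
# Keep only kernel log lines that mention NVMe or hung-task warnings.
# Reads the log from standard input.
filter_storage_errors() {
    grep -i -E 'nvme|blocked for more than [0-9]+ seconds'
}
```

Typical use would be `dmesg | filter_storage_errors`, or `journalctl -k -b -1 | filter_storage_errors` to look at the previous boot after a hang (that only works if the journal is persistent).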

I guess broken pins could always be an issue.

To iron out issues with NVME on the RVE, with a SATAIII drive also installed, I had to enable “all drives” in the boot options. Previously, I just had “boot drives” only. But, that had only caused issues with machine check exceptions when booting into Windows. Might have caused problems with Linux too, but I didn’t check the logs. It’s worth a shot.

I would also go ahead and disable power saving features in the kernel, just to rule that out. And make sure discards aren’t being issued.

Which tests are you running in UnixBench? I could always check if I get errors too under the stock and ck kernels. I’m using “none” for the disk scheduler.
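
To compare scheduler settings between machines, the active one is the bracketed entry in the block device's sysfs scheduler file. A small sketch, assuming the namespace is nvme0n1; the function name is hypothetical:

```shell
#!/bin/sh
# Print the active I/O scheduler for a block device.
# The kernel lists all available schedulers and marks the active one in [brackets].
active_scheduler() {
    sched_file="${1:-/sys/block/nvme0n1/queue/scheduler}"
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$sched_file"
}
```

Writing a scheduler name into the same file as root, for example `echo none | sudo tee /sys/block/nvme0n1/queue/scheduler`, switches it until the next reboot.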

@Zerophase and @Mastic_Warrior: thanks for the advice.
It’s been a couple of frustrating days playing around with Arch. The problem still exists and here are a few updates:

  • I can get the Arch-based system to either get stuck or hit a kernel panic related to timeouts/watchdogs as soon as I do something that is disk I/O intensive, like building a package or enabling a swap file daemon
  • Disabling Autonomous Power State Transition didn’t make a difference

Just about ready to give up and go to Windows :cry:

Maybe you are having the below issue, even though your hardware is not old

Again, I have had nothing but bad luck with the Samsung SATA SSDs (I don’t have u.2 or m.2 ports).

So, secure erase and possible firmware update? I see the same (failed to unmount /old_root and hang).

There was a brickage with the late 2017 firmware, but I see whispers that this has been fixed as of Jan 2018. Anyone know otherwise or have reason for caution against firmware update? (caution given the very bad result of the Q4 2017 update).