NVMe + Linux stability issues

thoughtlessruvi · May 15, 2018, 4:32am

Hi,

I am looking for some help/advice to troubleshoot a stability issue I am having with a Samsung 960 Pro SSD on Ubuntu 18.04.

Machine Specs

i7 8700K
Gigabyte Z370m-DS3H
16GB RAM
Samsung 960 Pro 512GB

Issue description

Ubuntu 18.04 64 bit server OS “hangs” with the following errors in dmesg
INFO: task jbd2/nvme0n1p2-432 blocked for more than 120 seconds (see image attached)

[Attached image: 20180514_083632.jpg, 2138x2851, 2.46 MB]
I can consistently reproduce the issue by running the Unixbench tool
I tried disabling power saving for NVMe via the command below, without any success
sudo nvme set-feature -f 0x0c -v=0 /dev/nvme0
Ubuntu 18.04 and 16.04 desktop versions hang during the installation process; only 18.04 server version installed without hanging
Ubuntu 18.04 server is stable when I switch out the 960 Pro with a SATA III M.2 drive (Transcend MTS400 256GB SSD)
This system (with the 960 Pro) was stable in Windows 10 under an Aida64 stress test

Any advice would be much appreciated

Mastic_Warrior · May 16, 2018, 8:31pm

Check for firmware updates for the SSD. Some of the 800 series Samsung SSDs were known to not play nicely with ACPI and TRIM on GNU/Linux. That was mostly fixed with firmware updates by Samsung. I don’t know if the 900 series has/had the same issue.

Trooper_ish · May 17, 2018, 11:23am

How would one check and update the firmware on a drive in Linux?

Edit: I was guessing you could view and update firmware revisions through hdparm, and found the following after a brief search. Does it look about right?
It was specifically for Seagate HDD’s, but I’m presuming it could be adapted?
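
On the question of checking firmware: hdparm targets ATA/SATA devices, so for an NVMe drive the revision is usually read another way. Below is a minimal sketch that reads it from sysfs. It assumes the controller appears as nvme0 (adjust to your system) and that the kernel exposes the usual `model` and `firmware_rev` attributes; the function name and its second argument are made up for illustration.

```shell
#!/bin/sh
# Print model and firmware revision for an NVMe controller from sysfs.
# "nvme0" is an assumed controller name; pass another as the first argument.
# The second argument (sysfs root) is hypothetical and only there to ease testing.
nvme_fw_info() {
    ctrl="${1:-nvme0}"
    root="${2:-/sys/class/nvme}"
    dir="$root/$ctrl"
    if [ ! -r "$dir/firmware_rev" ]; then
        echo "no firmware_rev attribute under $dir" >&2
        return 1
    fi
    # The attributes can be space-padded, so strip trailing blanks.
    model=$(sed 's/[[:space:]]*$//' "$dir/model")
    fw=$(sed 's/[[:space:]]*$//' "$dir/firmware_rev")
    printf '%s: model=%s firmware=%s\n' "$ctrl" "$model" "$fw"
}

# Report on the default controller if one is present; stay quiet otherwise.
nvme_fw_info "$@" 2>/dev/null || true
```

With nvme-cli installed, `sudo nvme id-ctrl /dev/nvme0` should show the same revision in its `fr` field. Applying an update is normally left to the vendor's own tool.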

Ctrl_Null · May 17, 2018, 12:12pm

might be a dumb question but why are you using these specs for ubuntu server? just wondering

Ctrl_Null · May 17, 2018, 12:16pm

thoughtlessruvi · May 17, 2018, 12:17pm

I am setting it up as a build server for AOSP builds

Ctrl_Null · May 17, 2018, 12:18pm

post your partition table

thoughtlessruvi · May 17, 2018, 12:23pm

I am doing some stability tests at the moment with Ubuntu 18.04 server on a VM (Windows host). So, don’t have the problem environment to capture the partition table at the moment.
Await my findings from the VM experiment…

Ctrl_Null · May 17, 2018, 12:26pm

Coming from the AUR world, I'm not sold on using Ubuntu Server. I have been trying to use it for about a month, and the documentation is lacking. I would also ask in the #ubuntu IRC channel to solve your problem.

Zerophase · May 18, 2018, 12:19am

You can try some of the suggestions from here. You most likely need to disable discards, and maybe play around with powersaving settings. https://wiki.archlinux.org/index.php/Solid_State_Drive/NVMe
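
To see whether continuous discards are in play at all, one rough check is to look for the `discard` mount option. The sketch below assumes a Linux system with `/proc/mounts`; the function name is hypothetical, and it only matches the plain `discard` option (variants such as `discard=async` are not detected).

```shell
#!/bin/sh
# List filesystems mounted with the continuous "discard" option.
# Reads a mounts table in /proc/mounts format (default: /proc/mounts).
mounts_with_discard() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "discard") { print $1, $2; break }
    }' "${1:-/proc/mounts}"
}

# Only run against the live table where it exists.
if [ -r /proc/mounts ]; then
    mounts_with_discard
fi
```

No output means no filesystem is mounted with `discard`. Periodic trimming is a separate mechanism; on systemd-based systems `systemctl status fstrim.timer` shows whether it is scheduled.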

Mastic_Warrior · May 18, 2018, 6:19pm

Easiest is to use the manufacturer’s tools. Crucial provides a GNU/Linux boot image.

Zerophase · May 18, 2018, 7:46pm

I’m running 960 pro on arch without any issues. I guess try installing Manjaro or Antergos, and see if the issue clears up.

wendell · May 18, 2018, 8:14pm

I had this exact issue with a drive that had not been trimming correctly. Doing a secure erase (the drive got EXTREMELY hot) and letting it sit overnight, everything was just peachy.

I think the situation was that at some point I had used something on it that wasn't trimming properly. Now I was trimming properly, but the current software assumes that the previous software already issued its trim commands, so it only trims space that it has itself written and then deleted.

In other words, only the parts of the drive you have erased have been trimmed, not all the space from all of the past history of the device. So the OS may be assuming the “already empty” space when it started was in a proper ready-for-writing trimmed state.

Remember, Samsung's firmwares are the ones that are NTFS/FAT32 aware, to try to mitigate some of this when the drive's internal trim table gets out of sync with the OS for whatever reason.

You can try issuing a command to trim ALL deleted space and see if that helps, otherwise use mfg utility to reformat and condition the drive.

I also experienced something similar, but not to this extent, on a surface pro device which was ultimately returned to MS due to defective ssd. Samsung nvme in that case.
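
The "trim ALL deleted space" step is usually done with `fstrim` from util-linux. A minimal sketch follows, with a hypothetical wrapper name and mode argument. It assumes a util-linux build recent enough to have `--dry-run`; older builds (possibly including the one shipped with Ubuntu 18.04) may only accept `--all --verbose`.

```shell
#!/bin/sh
# Trim unused blocks on all mounted filesystems that support it.
# $1 = "dry" (default) only simulates; "real" actually issues the trims.
# Needs root, e.g. run the script with sudo.
trim_all() {
    case "${1:-dry}" in
        dry)  fstrim --all --verbose --dry-run ;;
        real) fstrim --all --verbose ;;
        *)    echo "usage: trim_all [dry|real]" >&2; return 2 ;;
    esac
}
```

Calling `trim_all` with no argument shows what would be trimmed; `trim_all real` performs the trim and prints how much was discarded per mount point.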

Mastic_Warrior · May 18, 2018, 9:53pm

I have never had issues with Crucial drives, but they don’t make an NVMe ssd yet.

I have had problems with the Samsung 800 series and returned every one of them.

I am using GNU/Linux though.

thoughtlessruvi · May 22, 2018, 10:41am

Thank you all for the advice.

Wendell’s answer made the most sense. It best matches my observations. But I haven’t solved my problem yet.
@wendell what tool did you use to secure erase the drive? and what is the reason to leave the device overnight?
I tried secure erasing with Samsung’s Magician software and using Parted Magic (and also Diskpart clean all). That made no difference to the stability. But I didn’t leave the machine idling overnight.

Few other updates/clues:

Windows continues to be stable; I can run Ubuntu inside a VM on Windows without any issues
I used my 960 Pro SSD on a friend’s Ryzen 1700/X370 system and Ubuntu was stable. So it has something to do with my MB/SSD combo.
I reached out to Gigabyte support and they advised to enable CSM and set the storage boot option control to Legacy. Gigabyte support sent screen shots of them installing Ubuntu on a samsung NVMe drive (960 EVO 1TB) without any issues. However it didn’t make any difference to me.
Antergos was as unstable as Ubuntu; I haven’t tried vanilla Arch yet

Mastic_Warrior · May 22, 2018, 4:05pm

Did you take a look at the Arch Wiki link that Zerophase provided? If you have not, then I would recommend it. Your installation may have tried to be smart, or conservative, on some options that could be causing issues, like the scheduler and whatnot. If you do not see anything in dmesg or journalctl, then you either have some defective hardware or you are missing some important configuration somewhere.
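
One way to go through dmesg or the journal for this kind of problem is to filter for NVMe and hung-task messages. This is a rough sketch; the function name and the pattern list are illustrative guesses, not an exhaustive set of relevant kernel messages.

```shell
#!/bin/sh
# Keep only log lines that mention NVMe, hung tasks, I/O errors or timeouts.
# Reads log text on stdin, so it works with dmesg or journalctl output.
filter_storage_errors() {
    # "|| true" keeps the exit status clean when nothing matches.
    grep -iE 'nvme|blocked for more than|i/o error|timeout' || true
}
```

Typical use would be `dmesg | filter_storage_errors` on the running system, or `journalctl -k -b -1 | filter_storage_errors` for kernel messages from the previous boot (this needs a persistent journal).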

Zerophase · May 22, 2018, 6:05pm

I guess broken pins could always be an issue.

To iron out issues with NVME on the RVE, with a SATAIII drive also installed, I had to enable “all drives” in the boot options. Previously, I just had “boot drives” only. But, that had only caused issues with machine check exceptions when booting into Windows. Might have caused problems with Linux too, but I didn’t check the logs. It’s worth a shot.

I would also go ahead and disable power saving features in the kernel, just to rule that out. And make sure discards aren’t being issued.

Which tests are you running in UnixBench? I could always check if I get errors too under the stock and ck kernel. Using “none” for the disk scheduler.
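
For comparing setups, the active disk scheduler and the APST latency limit can both be read from sysfs. The sketch below assumes the drive is `nvme0n1` and that the `nvme_core` module exposes `default_ps_max_latency_us`; the helper name is made up.

```shell
#!/bin/sh
# Print the active I/O scheduler from a sysfs "scheduler" line, where the
# active entry is the one in square brackets, e.g. "[none] mq-deadline".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

dev=nvme0n1   # assumed device name; adjust to your system
if [ -r "/sys/block/$dev/queue/scheduler" ]; then
    active_scheduler "$(cat "/sys/block/$dev/queue/scheduler")"
fi

# APST latency limit currently in effect (0 means APST is disabled).
param=/sys/module/nvme_core/parameters/default_ps_max_latency_us
if [ -r "$param" ]; then
    cat "$param"
fi
```

To rule out power saving at the kernel level, the usual suggestion is to boot with `nvme_core.default_ps_max_latency_us=0` on the kernel command line and confirm the value above reads 0 afterwards.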

thoughtlessruvi · May 24, 2018, 6:37am

@Zerophase and @Mastic_Warrior: thanks for the advice.
It’s been a couple of frustrating days playing around with Arch. The problem still exists and here are a few updates:

I can get the Arch-based system to either get stuck or hit a kernel panic related to timeouts/watchdogs as soon as I do something that is disk I/O intensive, like building a package or enabling a swap file daemon
Disabling Autonomous Power State Transition didn’t make a difference

Just about ready to give up and go to Windows

Mastic_Warrior · May 24, 2018, 5:35pm

Maybe you are having the below issue, even though your hardware is not old

Again, I have had nothing but bad luck with the Samsung SATA SSDs (I don’t have u.2 or m.2 ports).

cekim · May 24, 2018, 6:33pm

So, secure erase and possible firmware update? I see the same (failed to unmount /old_root and hang).

There was a brickage with the late 2017 firmware, but I see whispers that this has been fixed as of Jan 2018. Anyone know otherwise or have reason for caution against firmware update? (caution given the very bad result of the Q4 2017 update).