NVMe + Linux stability issues

I had this exact issue with a drive that had not been trimming correctly. After doing a secure erase (the drive got EXTREMELY hot) and letting it sit overnight, everything was just peachy.

I think the situation was that at some point I had used software on it that wasn’t trimming properly. By the time I was using software that did, it assumed the previous software had already issued trim commands, so only blocks that were explicitly written and then deleted would get trimmed now.

In other words, only the parts of the drive you have erased recently have been trimmed, not all the space from the device’s entire past history. So the OS may be assuming the “already empty” space it found when it started was in a proper, ready-for-writing trimmed state.

Remember, Samsung’s firmware is NTFS/FAT32-aware to try to mitigate some of this when the drive’s internal trim table gets out of sync with the OS for whatever reason.

You can try issuing a command to trim ALL deleted space and see if that helps; otherwise use the manufacturer’s utility to reformat and condition the drive.
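
On Linux, a minimal way to do the “trim everything” pass is fstrim from util-linux; a small sketch (mount points here are just examples):

```
# Trim all free space on every mounted filesystem that supports discard
sudo fstrim -av

# Or trim a single mount point, e.g. the root filesystem
sudo fstrim -v /
```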

I also experienced something similar, but not to this extent, on a Surface Pro device, which was ultimately returned to MS due to a defective SSD. Samsung NVMe in that case.

I have never had issues with Crucial drives, but they don’t make an NVMe SSD yet.

I have had problems with the Samsung 800 series and returned every one of them.

I am using GNU/Linux though.

Thank you all for the advice.

Wendell’s answer made the most sense. It best matches my observations. But I haven’t solved my problem yet.
@wendell what tool did you use to secure erase the drive? And what is the reason for leaving the device overnight?
I tried secure erasing with Samsung’s Magician software and with Parted Magic (and also diskpart’s clean all). That made no difference to the stability. But I didn’t leave the machine idling overnight.

A few other updates/clues:

  • Windows continues to be stable; I can run Ubuntu inside a VM on Windows without any issues
  • I used my 960 Pro SSD on a friend’s Ryzen 1700/X370 system and Ubuntu was stable. So it has something to do with my MB/SSD combo.
  • I reached out to Gigabyte support and they advised enabling CSM and setting the storage boot option control to Legacy. Gigabyte support sent screenshots of them installing Ubuntu on a Samsung NVMe drive (960 EVO 1TB) without any issues. However, it didn’t make any difference for me.
  • Antergos was as unstable as Ubuntu; haven’t tried vanilla Arch yet

Did you take a look at the Arch Wiki link that Zerophase provided? If you have not, then I would recommend it. Your installation may have tried to be smart, or conservative, on some options that could be causing issues, like the scheduler and whatnot. If you do not see anything in dmesg or journalctl, then you either have some defective hardware or you are missing some important configuration somewhere. See the sketch below for where I’d look first.
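
A minimal sketch of checking those logs (journalctl -b -1 assumes a persistent journal, so the previous boot’s messages survive a hard reset):

```
# Kernel messages about the NVMe controller in the current boot
sudo dmesg | grep -i nvme

# Kernel messages from the previous boot, useful after a hang/hard reset
sudo journalctl -k -b -1 | grep -i nvme
```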

I guess broken pins could always be an issue.

To iron out issues with NVMe on the RVE, with a SATA III drive also installed, I had to enable “all drives” in the boot options. Previously, I had “boot drives” only. That had only caused issues with machine check exceptions when booting into Windows; it might have caused problems with Linux too, but I didn’t check the logs. It’s worth a shot.

I would also go ahead and disable power-saving features in the kernel, just to rule that out, and make sure discards aren’t being issued. A sketch of both checks follows below.
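
For Samsung NVMe drives on Linux, the usual suspect is APST (Autonomous Power State Transitions); here is a hedged sketch of ruling it out and checking for discards (the grub steps assume Ubuntu/Debian conventions):

```
# Disable NVMe power state transitions via a kernel parameter:
# add nvme_core.default_ps_max_latency_us=0 to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub, then regenerate the config and reboot
sudo update-grub

# Check whether discards are being issued:
grep discard /etc/fstab          # "discard" mount option = inline trim
systemctl status fstrim.timer    # periodic trim via systemd timer
```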

Which tests are you running in UnixBench? I could always check if I get errors too under the stock and ck kernel. Using “none” for the disk scheduler.
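
If it helps, the scheduler can be checked and set at runtime like this (nvme0n1 is just an example device name):

```
# Show the active I/O scheduler (the bracketed entry is current)
cat /sys/block/nvme0n1/queue/scheduler

# Switch to "none" until the next reboot
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
```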

@Zerophase and @Mastic_Warrior: thanks for the advice.
It’s been a couple of frustrating days playing around with Arch. The problem still exists, and here are a few updates:

  • I can get the Arch-based system to either get stuck or hit a kernel panic related to timeouts/watchdogs as soon as I do something disk-I/O intensive, like building a package or enabling a swap file daemon
  • Disabling Autonomous Power State Transitions (APST) didn’t make a difference

Just about ready to give up and go to Windows :cry:

Maybe you are having the below issue, even though your hardware is not old

Again, I have had nothing but bad luck with the Samsung SATA SSDs (I don’t have U.2 or M.2 ports).

So, secure erase and possible firmware update? I see the same (failed to unmount /old_root and hang).

There was a brickage with the late-2017 firmware, but I see whispers that this has been fixed as of Jan 2018. Anyone know otherwise, or have reason for caution against a firmware update? (Caution given the very bad result of the Q4 2017 update.)
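
Before deciding, you can at least confirm what firmware the drive is actually running; a quick sketch, assuming the nvme-cli package is installed:

```
# List NVMe devices with model, serial number, and firmware revision
sudo nvme list
```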

@Mastic_Warrior thanks, but it doesn’t appear to be relevant
@cekim the firmware is up to date and I’ve tried multiple secure erases :expressionless:

That was a question, not a suggestion… I’m in the same boat… can’t reboot my machine without a hard reset.

What are the temps of the drive? As a last-ditch effort, keeping temps down might help. Something like the Cryorig heatsink, or a waterblock. Personally, I keep my 960 Pro under water just so I don’t have to deal with some edge cases that might cause throttling.
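
For what it’s worth, the drive’s own temperature reading can be checked without vendor tools; a small sketch using nvme-cli (device name is an example):

```
# Controller temperature (and other health data) reported by the drive
sudo nvme smart-log /dev/nvme0n1 | grep -i temperature
```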

@Zerophase I am going on a slight tangent here; I hope you are only cooling the controller? Because lower operating temperatures are bad for NAND endurance. ref: https://www.anandtech.com/show/9248/the-truth-about-ssd-data-retention

So, I’ve had a breakthrough with my stability issues.
I took my computer to the shop I bought it from and we tried swapping out parts.
Changing the motherboard solved the problem :slight_smile:

Unfortunately we didn’t have another motherboard of the same model around, but we tested it with an Asus Z370-P and it was all good.
So, I returned my Gigabyte board and purchased the Asus board.

I have a 960 Pro 512GB M.2 with Ubuntu 18.04 installed. I previously had an 850 Pro 256GB SATA SSD with 16.04 installed.
Both had no issues. If you want me to verify any software, firmware or configs let me know.

It’s like 50°C under load, and I’m sure I’ll be replacing this drive within 5 years.

Was it a used motherboard? Sounds like a broken pin.

@SudoSaibot: thanks but this is clearly motherboard related.

@Zerophase: it was a brand new motherboard, and a broken pin doesn’t make sense because it was stable on Windows.

Glad you got it fixed :slight_smile:
We tend to trust that hardware is good and often don’t have drop-in replacements to test. It has only happened to me once, when the motherboard in a Q6600 system started corrupting the boot SSD and I needed a platform update because it was all EOL, well overdue for an update at least.

@thoughtlessruvi I usually avoid Gigabyte; they’ve always seemed like a bargain brand to me. I usually just stick to MSI, EVGA, and ASUS. I’ve had a great experience with ASUS boards and cards so far, but I hear their customer service is a bit lacking. Might switch to EVGA in the future; I hear they treat you very well.

Update: updating the firmware alone was not enough to address the “failed to unmount” issue. I had to also run a secure erase. So far so good after that.
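
For anyone going this route without Magician, a hedged sketch of a secure erase using nvme-cli (this DESTROYS all data on the drive; the device name is an example, and the drive must not be in use):

```
# NVMe Format with Secure Erase Setting 1 (user data erase) -- wipes the drive
sudo nvme format /dev/nvme0n1 --ses=1
```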

2x 960 Pro in RAID 0 on an Asus Hyper M.2 x16 in CentOS 7.4.