[Solved] Linux is unstable ever since I upgraded to Ryzen

Try Manjaro if nothing else.

Logs usually end up in /var/log, but every distro likes to do something a little different.

Doesn't seem to be much in /var/log...
.
├── btmp
├── Calamares.log
├── faillog
├── gdm [error opening dir]
├── gssproxy
├── httpd
├── journal
│   ├── 7db0dc23f4004d1a8849a344f33d9c56
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── system.journal
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   ├── [email protected]~
│   │   └── user-1000.journal
│   └── remote
├── lastlog
├── old
├── pacman.log
├── wtmp
├── Xorg.0.log
├── Xorg.0.log.old
├── Xorg.1.log
└── Xorg.1.log.old

system.journal is just filled with random gibberish.
faillog just prints /dev/tty1
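
For what it's worth, system.journal looks like gibberish because it is a binary file written by systemd-journald, not a text log. A minimal sketch for confirming that, assuming the file begins with the journal signature LPKSHHRH; the function name is_journal_file is made up for illustration:

```shell
# Hypothetical helper: succeed if the file starts with the 8-byte
# systemd journal signature "LPKSHHRH" (i.e. it is a binary journal).
is_journal_file() {
    [ "$(head -c 8 "$1" 2>/dev/null)" = "LPKSHHRH" ]
}

# Example (the path is a placeholder, adjust to your machine):
#   is_journal_file /var/log/journal/<machine-id>/system.journal && echo "binary journal"
```

Files like that are meant to be read through journalctl (for example with its --file option pointed at the journal) rather than with cat.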

You said you changed to Manjaro, right?

Correct.

Alright, I'm reading their documentation to find where specific logs are kept.

@tenten8401 https://forum.manjaro.org/t/error-hunting-in-logs/16401

Looks like you need to search your pacman log.

grep -i error /var/log/pacman.log

There might be useful stuff in there.
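
If plain "error" turns out to be too narrow, the same search can be widened a bit. This is only a sketch: the function name log_problems and the choice of keywords are assumptions, not anything from the Manjaro docs.

```shell
# Hypothetical helper: print lines mentioning "error" or "warning"
# (case-insensitive) from a log file, prefixed with their line numbers.
log_problems() {
    grep -inE 'error|warning' "$1"
}

# Example:
#   log_problems /var/log/pacman.log
```

The line numbers make it easier to jump to the surrounding entries in the full log afterwards.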

I piped journalctl -b2 into a text file, and at the end I found this... Could it be what I'm looking for?
https://hastebin.com/mohuvesoro.go

Searching for the line:

NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u32:1:3773]

Reveals several results. So far, people with that issue have solved it either by getting a new power supply or by adding nouveau.modeset=0 to the kernel command line in GRUB, but you're using the proprietary drivers.

A 'soft lockup' is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds, without giving other tasks a chance to run.
The watchdog daemon will send a non-maskable interrupt (NMI) to all CPUs in the system, which in turn print the stack traces of their currently running tasks.

Under normal circumstances those messages may go away once the load decreases.
This 'soft lockup' can happen if the kernel is busy, working on a huge amount of objects which need to be scanned, freed or allocated respectively.
The stack traces of those tasks can give a first idea what the tasks were doing. However, to be able to examine the cause behind the messages, a kernel dump would be needed.
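
To get a rough idea of whether the lockups always hit the same CPU, the saved log can be summarized. A minimal sketch, assuming the journalctl output was written to a text file as above and that the messages follow the "soft lockup - CPU#N" format of the line quoted earlier; the function name soft_lockup_cpus is made up:

```shell
# Hypothetical helper: list the CPU numbers that reported a soft lockup
# in a saved log dump (for example, journalctl output written to a file).
soft_lockup_cpus() {
    grep -o 'soft lockup - CPU#[0-9]*' "$1" | sed 's/.*CPU#//' | sort -n | uniq
}

# Example (file name is a placeholder):
#   soft_lockup_cpus journal-dump.txt
```

This only shows where the watchdog fired, not why. As the quoted explanation says, finding the actual cause would need a kernel dump.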

Would a 650W not be enough for this build? https://pcpartpicker.com/user/tenten8401/saved/TqZRBm
Also, nouveau.modeset=0 solves an entirely different problem from what I can tell.

Could you give a full, detailed description of everything in your rig, PSU ratings and all?

PCPartPicker part list / Price breakdown by merchant

CPU: AMD - Ryzen 5 1600 3.2GHz 6-Core Processor ($199.99 @ SuperBiiz)
Motherboard: Gigabyte - GA-AX370-Gaming K5 ATX AM4 Motherboard ($151.98 @ Newegg)
Memory: Team - Dark 16GB (2 x 8GB) DDR4-3000 Memory ($114.99 @ Newegg)
Storage: Samsung - 850 EVO-Series 250GB 2.5" Solid State Drive ($104.88 @ OutletPC)
Storage: Seagate - Desktop HDD 4TB 3.5" 5900RPM Internal Hard Drive ($108.70 @ Amazon)
Video Card: Asus - GeForce GTX 1070 8GB Video Card
Case: Phanteks - Enthoo Evolv ATX Glass ATX Mid Tower Case ($189.99 @ Amazon)
Power Supply: Cooler Master - 650W 80+ Bronze Certified Semi-Modular ATX Power Supply ($83.98 @ PCM)
Monitor: AOC - G2260VWQ6 21.5" 1920x1080 75Hz Monitor ($124.88 @ OutletPC)
Monitor: AOC - G2260VWQ6 21.5" 1920x1080 75Hz Monitor ($124.88 @ OutletPC)
Headphones: Kingston - HyperX Cloud Revolver Headset ($119.99 @ Amazon)
Total: $1324.26
Prices include shipping, taxes, and discounts when available
Generated by PCPartPicker 2017-07-18 11:58 EDT-0400

EDIT: Realized the PSU was being pulled from a parametric selection, updated it to be the one I have.

Hmmm, that GPU averages about 150-170 W, but it may have spikes that go above that. Still, 650 W should be just about the right amount. Unless your PSU is faulty, I don't think that's the problem here either. Do you have a load tester? Or does the unit have a self-test button?

PCPartPicker estimates about 350W total from the TDP of everything, and I don't think the CPU could be using that much more power at 3.6 GHz.

No, it was a good average. Whatever the estimated power draw is, you want to double it and get the nearest power supply to that number. That would be 700W; you got 650W, but that should be fine.
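
The doubling rule is just this poster's rule of thumb, not a spec, but the arithmetic is easy to sketch (the function name psu_target_watts is made up):

```shell
# Rule of thumb from this thread (not a hard requirement): aim for a PSU
# rated at roughly double the estimated power draw, in watts.
psu_target_watts() {
    echo $(( $1 * 2 ))
}

# Example with the PCPartPicker estimate from this build:
#   psu_target_watts 350
```

With the 350W estimate that comes out to the 700W figure mentioned above, so a 650W unit is only slightly under the target.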

If you want to test the PSU by itself, disconnect everything connected to the PSU, and then jump the PSU with a paperclip or PSU jumper. Then watch it for a little while, and if it shuts off by itself then it's what's at fault.

Anecdote: I had to do this recently with my GF's computer. Her computer would POST, but it would shut off almost immediately after getting into the OS. We determined the motherboard to be at fault.

@tenten8401 just thought of this: if you boot into the UEFI/BIOS and just sit there, how long will your machine stay on? Will it still crash?

Never had a crash in the BIOS yet, may try it but it seems to be stable from what I can tell.

The machine fans and LEDs stay on so I doubt it's PSU related, but the screen just goes black after like 20s or so of freezing.

I run Windows 10 on my Ryzen machine (Taichi with a 1600 and 3200MHz CL14 RAM). In the beginning things were rough, I'm talking throw-the-thing-out-the-window rough. The main issue was that the memory timings really were terrible. I ended up inputting all of the timings manually to get it stable, verified with memtest.

Then everything was hunky-dory until Windows updated and I started getting freezes under heavy load (Cinebench, AutoCAD, etc.) where the mouse, keyboard and monitor would not respond for a few seconds. The commands would catch up once the freeze went away, and there was nothing in the event log. Very weird. I ended up uninstalling all programs one by one and retesting. Found out there is a program my Tascam audio interface installs during driver install, and that was causing the few-second freeze on my system.

This story isn't very helpful, but the first thing I would look into is RAM. We were spoiled by RAM compatibility on Intel for so long: you just put it in and it works. Some of the timing values the Taichi chose for my memory were really off and super loose. I suspect being too loose is just as unstable as being too tight.

Which memory and what timings are you using, by chance? The memory I had was listed as compatible with the X370 SLI Plus I had and it still had the freezing issues, but it's not explicitly listed as compatible with the motherboard I have now...
(Gigabyte X370 Gaming K5)

I am using G.Skill Trident Z 3200MHz CL14 Samsung B-die (not on my board's QVL). I am using 14-14-14-14-28 with a lot of changed sub-timings. Gear down mode disabled, bank group swap disabled, SOC 1.15V, DRAM 1.375V. Have you tested with Memtest just to rule out memory issues?