[SOLVED] PC unstable, freezing, unknown cause

pixelfixer_cg · July 31, 2020, 1:25pm

Hey Hako,
some other things that came to my mind…

BIOS-related:

– Your BIOS should have an Option called “ErP Ready” (energy-saving-policy-thing) …try disabling that!
– Did you try resetting your BIOS to default settings (or) just running without any CPU/SoC undervolts, OC’s & Co? If things are not crashing under mainly “Auto”-conditions this would be a starting point for tweaking things…soft-medium LLC settings should be fine though…
– Do you run your FANS at fixed speed (providing same airflow over night)?

Hardware-related:

– You mentioned that you use an NVMe for temporary/cache… some PCIe 4.0 NVMe drives just shut down or at least throttle when reaching certain temps - the AORUS Gen4 simply stops at 70°C, and a reboot is required to get it up and running again…try a batch render without caching to the SSD to check.

Windows-related:

– did you try disabling hibernation completely (via typing “powercfg.exe /h off” in elevated command prompt)?
– in Device Manager, disable Option “Allow Windows to put this device to sleep” under every USB-Root-Hub
– Set all your GPUs to maximum power usage (for NVIDIA it’s in Control Panel / “Manage 3D settings”) - your renderer might get upset and crash if the OS or other power guidelines attempt to change GPU power states. Even try without undervolting the GPUs: if there is nothing happening on the display (overnight duty), the related GPU might get bored and black out due to lack of juice.

Hako · July 31, 2020, 1:46pm

Thank you pixelfixer for the many suggestions, let me go through them point-by-point:

ErP Ready is disabled (default)
Yes, even with everything on auto there are memory errors and freezes (which is why I started to mess with everything to begin with)
My fans run as fixed as it can get (most at 100% all the time, one fan runs at about 70% but turns up to 100% when temps reach 86°C)
The NVMe drives are Gen3 and not that heavily used when caching, nothing there reaches more than 60°C
Hibernation is off (one of the first things I do when installing a new OS)
I did that now. (Though that shouldn’t have to be done)
Maximum performance is now set. I do have Wallpaper Engine running on the main GPU with always a load of about 20%-35%, will try that for the second GPU that is bored most of the time, as it is mostly there for more monitors.

What I found out recently is that the less voltage my SOC/RAM gets, the fewer memory errors happen.
For example, I just tried a RAM voltage of 1.42V and got memory errors in the first 2 seconds of starting HCI Memtest. So I reduced the RAM voltage to 1.32V and suddenly the first memory error happens at 19XX%. Same with SOC: I can go as low as 1.02V and everything is fine, but if I go anywhere above 1.14V the system freezes when starting the Windows desktop or memory errors happen in the first 10%.
For now I’ll see if reducing RAM voltage even more will reduce memory errors further.

pixelfixer_cg · July 31, 2020, 3:21pm

Be careful with SoC voltage! Setting SoC much higher than the default (1.1V) and raising LLC will stress the fabric towards degradation! If you’re not going to apply stupidly high overclocks (and Ryzen is not very happy with all-core OC) you won’t need to raise SoC voltage manually.

Sounds as if there is (at least) something awkward with your RAM configuration… to start off I would

– open the case and put a room ventilator in front blowing air on everything to find out if your high temps are a factor in the game.

– reset CMOS and only re-enter the most necessary settings, no OC or undervolts etc.

– run your RAM with JEDEC base settings, not XMP

Sometimes it’s helpful to step back to basics to exclude problem causes when fixing and tweaking things to death does not help!

Hako · July 31, 2020, 3:32pm

No worries SoC voltage is “safe enough” up to 1.2V, I’ll never go over.
In many of my testings I never touched the LLC for anything.

I also kinda suspect my RAM the most, which is why I messed with it the most and reduced its speed from 3200 to 3066, which I thought was stable. I’ll reduce it more if the voltage doesn’t help enough. I tried 2866 and 2800 too, just maybe not long enough.
The RAM itself should be fine, as it ran for about 3 years on XMP spec in my old TR system.

I have cleared CMOS about 5 times for all my testing which included just running XMP settings and leaving everything else untouched. That was unstable.

I will try JEDEC specs again if the voltage and speed reductions won’t do it.
Though before that happens I may just buy some other RAM just to test the waters more.

As for the fan, I’ll check if I can borrow one from someone.

Hako · July 31, 2020, 10:17pm

Soo seems I made an interesting observation:
More RAM voltage: Longer operating time before freeze, but more memory errors.
Less RAM voltage: Shorter operating time before freeze, but fewer memory errors.

How does that make sense at all?

pixelfixer_cg · August 1, 2020, 9:35am

There shouldn’t be any memory errors at all… the first thing to find out is which settings make it work 100% free of errors. Check with other or fewer RAM sticks. And get proper cooling, at least with some crappy room ventilator or desk fan blowing into the case, to get this possible cause eliminated or verified!

Hako · August 1, 2020, 10:58am

So I reduced the memory speed and voltage further, now to 2866 and 1.28V, and let it run to 2000%, which already took an enormous amount of time, but at least this time there are no errors. I will report on the freezing if and when it happens. As a next step, though, I will try to get a room ventilator asap.

MisteryAngel · August 1, 2020, 12:42pm

I would suggest just running the memory at 3200MHz with the matching Infinity Fabric speed of 1600MHz, and of course just 1.35V VMEM.

If that does not work out well, then I would try 2933MHz.

Keep in mind that Buildzoid is using a Ryzen system, which is a dual-channel system.
Threadripper is quad-channel, and that is a significant difference.
The memory clocks and timings you are able to achieve really depend on many factors: memory modules (single / dual rank), CPU, board etc.
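
To spell out the 1:1 relationship mentioned above: a DDR4 kit's rated speed is its transfer rate, and the actual memory clock is half of that, so a 1:1 Infinity Fabric clock is simply the rated speed divided by two. A small sketch of that arithmetic (the function name is made up for illustration, it is not from any tool):

```python
def fclk_for_1to1(ddr_rate: float) -> float:
    """Return the fabric clock (MHz) that matches the memory clock 1:1.

    DDR memory transfers data twice per clock, so the memory clock is
    half of the rated speed (DDR4-3200 runs at a 1600 MHz memory clock).
    """
    return ddr_rate / 2


if __name__ == "__main__":
    # Speeds that come up in this thread.
    for rate in (3200, 2933, 2866, 2400):
        print(f"DDR4-{rate}: MEMCLK = FCLK = {fclk_for_1to1(rate):g} MHz")
```

So dropping the RAM from 3200 to 2400 also drops a 1:1 fabric clock from 1600MHz to 1200MHz, which is why stock JEDEC speeds cost fabric performance.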

Hako · August 1, 2020, 12:45pm

Tried that 3 times (at different intervals for sanity checking), it’s giving me memory errors at about 10%.

MisteryAngel · August 1, 2020, 12:48pm

And what if you run the memory modules without the XMP profile?
Just at stock JEDEC speed?
Of course this will tank performance on the Infinity Fabric drastically.
But just as a test, to figure out whether one of the memory modules might be the culprit?
Or in case of a temperature issue, that should shine some light on it then, I guess.

I mean, if you still have issues at JEDEC speeds, then I suppose there might be a module issue.
Or maybe you have the worst CPU ever.

Hako · August 1, 2020, 12:50pm

I tried that once, but haven’t checked enough, since it takes so much time: even if there are no memory errors happening, freezing may still happen.

For now I’ll try the settings I landed on, 2866 with no memory errors even to 2000%, and see what happens. If that works, great; if not, I will of course go further with the suggestions made here.

MisteryAngel · August 1, 2020, 12:54pm

Well, the last thing I could possibly think of might be the motherboard.
But yeah, that is kinda difficult to test without spending a lot of money,
especially if the issue isn’t fully isolated to that part yet.
Because it’s still strange that it does complete renders without issues when you are at the system doing other things as well.

Hako · August 1, 2020, 1:00pm

Oh, I actually changed the title because at some point in my testing (when I decreased the voltage more but left the frequency alone) it froze even while I was at the PC doing stuff. That way I could actually follow what was happening:

The first thing that happens: whatever I click on doesn’t happen anymore, clicking on it again does nothing.
Encoding task stops
The second thing is that the start menu doesn’t open anymore
I can’t close anything anymore nor resize anything anymore
Some programs like HWInfo64 still run and report correctly
Wallpaper Engine freezes
Sound is the last thing that freezes
Total freeze up, nothing works anymore
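
For freezes that happen overnight, where nobody is watching the sequence above, one way to at least pin down the time of the total freeze is a tiny heartbeat logger left running in the background. This is only a rough sketch: the file name and the 5 second interval are arbitrary assumptions, and since some programs keep running for a while during the freeze, the last line marks the final lock-up rather than the first symptom.

```python
import os
import time
from datetime import datetime
from pathlib import Path
from typing import Optional


def write_heartbeat(log_path: Path, note: str = "") -> str:
    """Append one timestamped line and force it onto the disk immediately."""
    line = f"{datetime.now().isoformat(timespec='seconds')} {note}".rstrip()
    with log_path.open("a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())  # do not leave the line sitting in a cache
    return line


def last_heartbeat(log_path: Path) -> Optional[str]:
    """Return the last logged line, or None if nothing was logged yet."""
    if not log_path.exists():
        return None
    lines = log_path.read_text(encoding="utf-8").splitlines()
    return lines[-1] if lines else None


def run(log_path: Path, interval_s: float = 5.0) -> None:
    """Log a heartbeat forever; stop it with Ctrl+C."""
    while True:
        write_heartbeat(log_path)
        time.sleep(interval_s)


# To start logging, call for example:
# run(Path("heartbeat.log"))
```

After the forced reboot, `last_heartbeat(Path("heartbeat.log"))` gives the last moment the system was still able to write to disk, which can then be compared against the temperature and voltage logs from around that time.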

The only difference that I made until now in my testing is what I outlined in my earlier post about the RAM voltage observation.

MisteryAngel · August 1, 2020, 1:06pm

Yeah this is really a difficult one.

If it were me, I would pick up a new memory kit and try that first.
Because the unstable memory seems to be the only factor that always comes back here.

Hako · August 1, 2020, 1:10pm

Yes it is, and in my 9 years of being in the IT business I have never encountered something like it, hence my cry for help.

I think I may actually get a new memory kit with different ICs than Samsung B-die before I go horribly crazy.

MisteryAngel · August 1, 2020, 1:13pm

Micron E-die maybe?

I believe Buildzoid also talks often about those being cheap and good.

Hako · August 1, 2020, 1:18pm

Probably going with that, yes. They have sometimes been talked about as being heat-sensitive, but Buildzoid took a hairdryer to one and got it to 70°C on a good OC and it didn’t spew a memory error.
If that still has the same issues in my system I’ll try another one that is hopefully Hynix CJR and if that also spews the same things I know it’s not the memory.

Hako · August 1, 2020, 8:25pm

Welp, it froze again, this time while actually being idle for a longer time (4H+) and temps being at the low 60s.
That may be a point for it being neither the RAM nor heat, or am I wrong about that?

EDIT:
Holy c**p, on 2666 and 1.28V it froze after about 2H of operating time.
This time it froze while HCI Memtest was running, no errors popped up while it was running. Does that mean it’s something different than RAM?
I’m now down to 2400 JEDEC.

Hako · August 3, 2020, 8:55pm

2400 JEDEC 1.2V also froze. Oh and FCLK was 1:1 for this.
For now I’m trying a different drive with an OS that didn’t get a memory crash yet, maybe the OS is borked.

Hako · August 6, 2020, 10:15am

Update time:

After I kinda gave up on my RAM I got a new one: 3600 1.35V Micron E-die
So far so good.

In the first night it completed HCI Memtest without errors to 1500%.
On the second night, running it again to 2000%, it finished without errors, but when freeing up the RAM again from Memtest use it froze again.

Can that simply happen, or is this again an indicator of something being wrong?
Sadly the new RAM kit doesn’t have temperature sensors.