i9-7920X build spuriously reboots after ~1-2 days uptime

Hi all,

Scratching my head on this one. journalctl --list-boots and journalctl -b -1 helped me pin down the point at which the system took a nose dive:

Dec 21 05:58:40 i9-7920X ntpd[964]: Soliciting pool server 2001:da8:8007:1::3
Dec 21 05:59:45 i9-7920X ntpd[964]: Soliciting pool server 2400:8901::f03c:91ff:fefb:7b7c
Dec 21 06:00:50 i9-7920X ntpd[964]: Soliciting pool server 2a04:3543:1000:2310:d862:f5ff:fe4e:6e9a
Dec 21 06:01:01 i9-7920X CROND[67761]: (root) CMD (run-parts /etc/cron.hourly)
Dec 21 06:01:01 i9-7920X run-parts[67764]: (/etc/cron.hourly) starting 0anacron
Dec 21 06:01:01 i9-7920X run-parts[67770]: (/etc/cron.hourly) finished 0anacron
Dec 21 06:01:57 i9-7920X ntpd[964]: Soliciting pool server 2001:418:3ff::1:53
Dec 21 06:03:02 i9-7920X ntpd[964]: Soliciting pool server 2a04:3543:1000:2310:d862:f5ff:fe4e:6e9a
Dec 21 06:04:09 i9-7920X ntpd[964]: Soliciting pool server 2001:a98:11::40
Dec 21 06:05:15 i9-7920X ntpd[964]: Soliciting pool server 2400:6180:0:d0::1494:e001
Dec 21 06:06:20 i9-7920X ntpd[964]: Soliciting pool server 2001:df1:801:a005:3::1
Dec 21 06:07:25 i9-7920X ntpd[964]: Soliciting pool server 2001:da8:8007:1::30
Dec 21 06:08:30 i9-7920X ntpd[964]: Soliciting pool server 2001:3c8:e10e:399f::20
Dec 21 06:09:36 i9-7920X ntpd[964]: Soliciting pool server 2001:da8:8007:1::3
Dec 21 06:10:44 i9-7920X ntpd[964]: Soliciting pool server 2405:aa00:2::10
Dec 21 06:11:48 i9-7920X ntpd[964]: Soliciting pool server 2001:da8:8007:1::3
Dec 21 06:12:53 i9-7920X ntpd[964]: Soliciting pool server 2406:f000:3:e000::2
Dec 21 06:13:58 i9-7920X ntpd[964]: Soliciting pool server 2001:418:3ff::53
Dec 21 06:15:05 i9-7920X ntpd[964]: Soliciting pool server 2001:da8:d800::1
Dec 21 06:16:12 i9-7920X ntpd[964]: Soliciting pool server 2402:f000:1:416:101:6:6:172
Dec 21 06:17:18 i9-7920X ntpd[964]: Soliciting pool server 2a02:2a50:6::123
Dec 21 06:18:24 i9-7920X ntpd[964]: Soliciting pool server 2001:418:3ff::1:53
Dec 21 06:19:31 i9-7920X ntpd[964]: Soliciting pool server 2a02:2a50:6::123
Dec 21 06:20:36 i9-7920X ntpd[964]: Soliciting pool server 2001:da8:d800::1
Dec 21 06:21:41 i9-7920X ntpd[964]: Soliciting pool server 2a04:3543:1000:2310:d862:f5ff:fe4e:6e9a
Dec 21 06:22:45 i9-7920X ntpd[964]: Soliciting pool server 2001:a98:11::40
Dec 21 06:23:53 i9-7920X ntpd[964]: Soliciting pool server 2a04:3543:1000:2310:d862:f5ff:fe4e:6e9a
Dec 21 06:24:58 i9-7920X ntpd[964]: Soliciting pool server 2001:418:3ff::1:53
Dec 21 06:26:04 i9-7920X ntpd[964]: Soliciting pool server 2001:418:3ff::53
Dec 21 06:27:10 i9-7920X ntpd[964]: Soliciting pool server 2001:da8:8007:1::30
Dec 21 06:28:16 i9-7920X ntpd[964]: Soliciting pool server 2001:da8:9000::81

As you can see, pretty boring. What else should/could I try?
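
One thing a boring log like this does hint at: the journal simply stops, with no shutdown messages, which usually points to a hard reset or power loss rather than an orderly reboot. A rough sketch for checking that is to pipe the tail of the previous boot through a filter that looks for clean-shutdown markers. The function name ended_cleanly and the marker strings are assumptions; the wording of these messages varies between systemd versions, so compare against a boot you know ended cleanly.

```shell
# Hypothetical helper: reads journal text on stdin and succeeds (exit 0)
# if it contains a message that normally appears only during an orderly shutdown.
# The marker strings are assumptions; verify them against a known-clean boot.
ended_cleanly() {
    grep -qE 'Journal stopped|Reached target .*(Shutdown|Power[- ]Off|Reboot)'
}

# Usage: inspect the last 200 lines of the previous boot.
# journalctl -b -1 -n 200 --no-pager | ended_cleanly \
#     && echo "previous boot shut down cleanly" \
#     || echo "previous boot ended abruptly (reset or power loss?)"
```

If the previous boot ends abruptly every time, software is a less likely culprit than power, thermals or an unstable overclock.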

Is the memory running on XMP? You may have to apply an XMP profile, because the timings without XMP are extremely conservative and loose, which could simply cause random shutdowns.

I normally always enable XMP - it's a Crucial Ballistix 2666MT/s kit, I'll double check!

I'll bump the DRAM channel voltage for good measure.

Last crash only had a 1 hr 18 min uptime. It had XMP enabled (to 2666MT/s) + both DDR channels @ 1.4vdc.

I've now bumped it up to 1.48vdc (which is relatively high, but safe given 1.5vdc is considered max).

Sparingly increase the System Agent voltage to 1V. If that doesn't work, increase VTT and leave System Agent at 1V.

Cheers mate. Just checked on the RAM,

[root@i9-7920X ~]# dmidecode -t 17
# dmidecode 3.1
Getting SMBIOS data from sysfs.
SMBIOS 3.0.0 present.

Handle 0x004B, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0049
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM_A1
        Bank Locator: NODE 1
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2666 MT/s
        Manufacturer: CRUCIAL
        Serial Number: A41A7021
        Asset Tag:
        Part Number: BLE8G4D26AFEA.16FAD
        Rank: 2
        Configured Clock Speed: 2666 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V

I noticed a couple of reboot loops taking place; it did about 3 reboots within the span of 2-3 mins and it has been running stable since the last reboot for an hour or so (but I've stopped running the ZCash miner since).

See if that was using AVX. AVX is always an easy way to destabilize a barely stable system.

Don't think it uses AVX, just CUDA offloading onto the GTX1070. Barely any CPU load from the miner.
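
For anyone who wants to check rather than guess: the CPU flags show what the chip supports, and a disassembly of the miner binary shows whether it contains any AVX code at all. A rough sketch, assuming GNU grep and objdump are installed; the function name count_avx_lines and the ./miner path are placeholders. Counting lines that touch ymm/zmm registers is only a heuristic: a non-zero count means AVX code exists in the binary, not that it runs in the hot path, and VEX-encoded instructions that use only xmm registers are not counted.

```shell
# Hypothetical helper: reads disassembly on stdin and prints how many lines
# reference the ymm/zmm registers used by AVX/AVX2/AVX-512 instructions.
count_avx_lines() {
    grep -cE '[yz]mm[0-9]+' || true   # grep -c still prints 0 when nothing matches
}

# Usage (./miner is a placeholder path):
# grep -owE 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u    # AVX flags the CPU supports
# objdump -d ./miner | count_avx_lines                 # AVX-looking lines in the binary
```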

I'm surprised you care enough to try and not just return it.

On an island mate, sending it back is another $100 DHL cost, but I may end up doing it. Moving it onto a different UPS to isolate any power issues.

Re. the pfSense box // 7700K / TUF Z270; another new build. Was working fine, now totally dead. Won't even go past the 5vdc standby voltage. Will most likely spend the weekend taking that apart. Tried switching the AX860i PSU for an AX760W PSU (new), nada.

...Wait where?

I live in Sri Lanka, Colombo to be exact...

Decent music vid, sorry for the weird NSFW thumbnail though.

Oh you live in Just Cause 2 ok.

I tried installing the second kit of Ballistix RAM (total 32GB). Things I noticed -

  • Couldn't get it to POST properly even once
  • However, on one bootup, it ended up in Asus' "Safe Mode" which lets me into the UEFI - I was able to enable XMP, bump DRAM channel voltage to 1.48vdc and it loaded up Fedora.

I had it run the miner for almost a day, but then we had a power-cut and...

Since it wouldn't POST at all like normal, I yanked out the second kit and pushed on the original kit a bit to make sure (again!) that the sticks were seated properly. Didn't notice any extra give or clicking into place.

This time around, I disabled XMP, set DRAM channel voltage to 1.48vdc, manually entered the timings 16-17-17 and left the rest on auto. It boots fine now, and has been running for a while - 04:50:26 up 1 day, 9:20, 3 users, load average: 0.26, 0.24, 0.20. I also disabled Asus' MultiCore enhancement and left it on Intel's default-auto.

Previously, I had Asus' MultiCore enhancement on 'auto' (which means on) - this pushes core clocks to the max. It could be that this boils down to power draw, maybe overheating VRMs - though that seems unlikely: high clocks, but at idle? Remember, the miner just offloads all load to the GPU.

[mdesilva@i9-7920X ~]$ uptime
 04:50:26 up 1 day,  9:20,  3 users,  load average: 0.26, 0.24, 0.20
[mdesilva@i9-7920X ~]$ nvidia-smi
Sun Dec 24 04:53:46 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.98                 Driver Version: 384.98                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:17:00.0  On |                  N/A |
| 60%   59C    P2   123W / 125W |    847MiB /  8110MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

FYI - the GTX1070 has been carefully underclocked to find that sweet-spot between Hashes/s vs. power-draw.

ASUS' Multi-core enhancement = Auto OC. You do want that off because that was overclocking your processor.

I always try to keep that off, but missed it after the UEFI update. Guess that was the cause of the instability, hmm.

I really shouldn't have cut corners on this build; I used a spare Noctua cooler I had, an NH-L12. This is most likely the cause of the random reboots.

Today I ran the Phoronix Cryptography benchmarks and noticed reported CPU temps (in Fedora) were north of 100 degC!! :exploding_head: The Corsair AX860i PSU is about adequate, although I'd need to do more testing to comment on that (remember, another 125W is being pulled through the GTX1070).

Idle temps are around 40 degC per core.
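
For watching temps while a benchmark runs: the kernel exposes sensor readings under /sys/class/hwmon in millidegrees Celsius, so a small shell function can dump them without extra tools. A minimal sketch; the function name hwmon_temps is made up here, and which hwmon device belongs to the CPU depends on the system, so this simply prints every temperature input it finds.

```shell
# Hypothetical helper: print every hwmon temperature input in whole degrees C.
# Optional argument: base directory (defaults to /sys/class/hwmon).
hwmon_temps() {
    base=${1:-/sys/class/hwmon}
    for f in "$base"/hwmon*/temp*_input; do
        [ -r "$f" ] || continue                   # skip if the glob matched nothing
        milli=$(cat "$f" 2>/dev/null) || continue # some sensors refuse to be read
        label_file=${f%_input}_label
        if [ -r "$label_file" ]; then
            label=$(cat "$label_file")            # sensor label, when the driver provides one
        else
            label=${f##*/}                        # otherwise fall back to the file name
        fi
        # sysfs reports millidegrees Celsius; integer division truncates
        printf '%s: %d C\n' "$label" "$(( milli / 1000 ))"
    done
}

# Usage: refresh once a second while the benchmark runs.
# while sleep 1; do clear; hwmon_temps; done
```

Readings that sit near the throttle point under load but around 40 degC at idle would be consistent with a cooler that is too small rather than a bad mount.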

So it makes sense that MCE (auto OC) pushed this over the edge. Got a 4 day 13+ hrs uptime since the last change (disabling MCE).

I'm also going to see if I can get the 32GB of the Ballistix RAM to work at some point.

@FurryJackman should I bump any of the Vcore settings or just leave everything on auto for the time being (apart from MCE...)?

You need the U14 Noctua cooler badly, or a 280mm AIO. Seems that cooler is not enough even at stock. Remember, it's 180+W TDP with turbo boost engaged. If you go the AIO route, I recommend what I currently use, the Cryorig A80 with an included VRM fan. I got it before NCIX ceased to exist due to bankruptcy, so it may be hard to find now. You could get the Fractal Design Celsius S24 as the alternative, but it will have no VRM cooling.

I'd always stick to G.Skill memory and Samsung B-Die if you go down the 32GB route. Two 2x8GB kits should do for that. The non-RGB 3200 Trident Z kit with a model number ending in GTZB is a good bet. That's guaranteed Samsung B-Die.

$230+ for a 16GB kit of RAM. I'll be waiting for RAM prices to drop for a good while...