I thought about getting ECC, but last time I checked, the probability of a bit flip is extremely low, so I don’t think that will be a serious issue for my experiments.
One thing I decided to do was run sensors to see if anything else might be causing the stability issues before I just order a new 128 GB RAM kit.
I upgraded my kernel to 5.8.0-50-generic and ran sensors:
it8792-isa-0a60
Adapter: ISA adapter
in0: 785.00 mV (min = +0.00 V, max = +2.78 V)
in1: 1.50 V (min = +0.00 V, max = +2.78 V)
in2: 1.04 V (min = +0.00 V, max = +2.78 V)
in3: 1.97 V (min = +0.00 V, max = +2.78 V)
in4: 1.80 V (min = +0.00 V, max = +2.78 V)
in5: 1.50 V (min = +0.00 V, max = +2.78 V)
in6: 2.78 V (min = +0.00 V, max = +2.78 V) ALARM
3VSB: 1.66 V (min = +0.00 V, max = +2.78 V)
Vbat: 1.60 V
fan1: 1483 RPM (min = 0 RPM)
fan2: 1527 RPM (min = 0 RPM)
fan3: 1366 RPM (min = 0 RPM)
temp1: +37.0°C (low = +127.0°C, high = +127.0°C) sensor = thermistor
temp2: -55.0°C (low = +127.0°C, high = +127.0°C) sensor = Intel PECI
temp3: +31.0°C (low = +127.0°C, high = +127.0°C) sensor = thermistor
intrusion0: ALARM
k10temp-pci-00db
Adapter: PCI adapter
Tctl: +59.2°C
Tdie: +32.2°C
k10temp-pci-00cb
Adapter: PCI adapter
Tctl: +58.8°C
Tdie: +31.8°C
enp7s0-pci-0700
Adapter: PCI adapter
PHY Temperature: +55.0°C
nvme-pci-4100
Adapter: PCI adapter
Composite: +43.9°C (low = -273.1°C, high = +84.8°C)
(crit = +84.8°C)
Sensor 1: +43.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +41.9°C (low = -273.1°C, high = +65261.8°C)
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: N/A
k10temp-pci-00d3
Adapter: PCI adapter
Tctl: +58.5°C
Tdie: +31.5°C
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +61.9°C
Tdie: +34.9°C
Tccd1: +58.2°C
Tccd2: +58.5°C
Tccd3: +59.8°C
nvme-pci-0900
Adapter: PCI adapter
Composite: +36.9°C (low = -273.1°C, high = +84.8°C)
(crit = +84.8°C)
Sensor 1: +36.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +40.9°C (low = -273.1°C, high = +65261.8°C)
A lot of bizarre stuff here (negative temps???), and I have not yet mapped out what each of these sensors corresponds to.
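To separate the readings that are physically implausible from the ones that are merely unlabeled, a small script can scan the pasted output. This is only a rough sketch: the function name `implausible_temps` and the 0 to 110 °C plausibility window are my own assumptions, not anything defined by lm-sensors, and it looks only at the current reading on each line, not at the low/high limits. A reading like -55.0°C on temp2 is probably an unconnected or misconfigured input rather than a real temperature, but that is a guess.

```python
import re

# Current reading on a line such as "temp2: -55.0°C (low = ...)".
TEMP_RE = re.compile(r"^\s*([^:]+):\s*([+-]?\d+(?:\.\d+)?)\s*°C")
# Chip header lines such as "it8792-isa-0a60" (no spaces, no colon).
CHIP_RE = re.compile(r"[\w-]+")

def implausible_temps(text, lo=0.0, hi=110.0):
    """Return (chip, label, value) for readings outside [lo, hi] (assumed window)."""
    flagged = []
    chip = None
    for line in text.splitlines():
        stripped = line.strip()
        if CHIP_RE.fullmatch(stripped):
            chip = stripped
            continue
        m = TEMP_RE.match(line)
        if m:
            value = float(m.group(2))
            if not lo <= value <= hi:
                flagged.append((chip, m.group(1).strip(), value))
    return flagged

sample = """it8792-isa-0a60
Adapter: ISA adapter
temp1: +37.0°C (low = +127.0°C, high = +127.0°C) sensor = thermistor
temp2: -55.0°C (low = +127.0°C, high = +127.0°C) sensor = Intel PECI
"""
print(implausible_temps(sample))  # [('it8792-isa-0a60', 'temp2', -55.0)]
```

Running it on the full output above should flag only temp2 on the it8792 chip; the odd low/high limits (-273.1°C, +65261.8°C) are ignored because they are limits, not readings.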
EDIT: Ran a build of LLVM on full load and it failed. Temps didn’t exceed 32C. Booted into memtest86 and there are now tons of errors.
One thing that is bizarre here is the sequence of events:
- I removed half the RAM modules when the system first became unstable months ago.
- memtest86 then returned zero errors.
- The system ran stable for a couple of months.
- The system became unstable again, and now the modules don’t pass memtest.
Why would RAM break after passing tests? The modules passed memtest on arrival as well. Could a misconfigured or broken motherboard be damaging the DRAM modules?