I’m running into an issue with my Threadripper build where, intermittently, all cores throttle down to ~545 MHz. They stay throttled for anywhere from ~10 seconds to ~2 minutes.
System specs:
- Threadripper Pro 7985WX (64 cores)
- ASUS WRX90E-SAGE SE
- 256GB (8x32GB) Kingston 6000 MT/s memory kit, running at 4800 MT/s
- Corsair AX1600i PSU
- Crucial T700 4TB
- Arctic Freezer 4U-M
- GTX1080 GPU
- Ubuntu 23.10 (kernel 6.5.0)
I’ll notice it when I’m running an all-cores workload because the fans slow down audibly. If I’m logged in through ssh, the terminal gets noticeably sluggish and any commands run slowly. I haven’t spent much time yet at the keyboard+monitor, but the one time I was there while it throttled, the entire UI got extremely laggy, including a video I was watching at the time.
I don’t yet know if this happens while the system is mostly idle.
Things I’ve ruled out so far:
- It’s not the workload waiting on non-CPU resources. It has happened while I had two different 128-thread jobs running at the same time, plus my terminal session.
- There are no messages in either the system logs or the IPMI.
- Thermals are fine (at least the sensors I can read; there are a couple that aren’t exposed by the IPMI such as the chipset and VRMs. I hope ASUS fixes that.)
Here are four different occurrences, which I logged by polling /proc/cpuinfo at a sleep 1 interval.
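For anyone who wants to reproduce this kind of logging, here is a minimal sketch of a /proc/cpuinfo poller in Python. It is not the exact script used for the logs in this post. The 1000 MHz threshold and the function names (parse_cpu_mhz, all_cores_throttled, poll) are assumptions picked for illustration, and it relies on the "cpu MHz" lines that x86 Linux kernels expose.

```python
import re
import time
from datetime import datetime

# Matches lines like "cpu MHz\t\t: 545.000" in /proc/cpuinfo (x86 Linux).
CPU_MHZ_RE = re.compile(r"^cpu MHz\s*:\s*([0-9.]+)", re.MULTILINE)


def parse_cpu_mhz(cpuinfo_text):
    """Return the per-core frequencies (MHz) found in /proc/cpuinfo text."""
    return [float(m) for m in CPU_MHZ_RE.findall(cpuinfo_text)]


def all_cores_throttled(freqs, threshold_mhz=1000.0):
    """True when every reported core is below the threshold (an assumed cutoff)."""
    return bool(freqs) and max(freqs) < threshold_mhz


def poll(interval_s=1.0, threshold_mhz=1000.0):
    """Print a timestamped line whenever the throttled state changes."""
    throttled = False
    while True:
        with open("/proc/cpuinfo") as f:
            freqs = parse_cpu_mhz(f.read())
        now_throttled = all_cores_throttled(freqs, threshold_mhz)
        if now_throttled != throttled:
            state = "THROTTLED" if now_throttled else "recovered"
            peak = max(freqs, default=0.0)
            print(f"{datetime.now():%H:%M:%S} {state} max={peak:.0f} MHz", flush=True)
            throttled = now_throttled
        time.sleep(interval_s)

# To start logging, call poll() and stop it with Ctrl-C.
```

Logging only the transitions keeps the output short, and the timestamps can then be lined up against the IPMI and sensors readings.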
Sensor readings captured immediately after the onset of the 14:43:01 event:
ID | Name | Type | Reading | Units | Event
2 | +VCORE_0 | Voltage | 0.67 | V | 'OK'
3 | +VCORE_1 | Voltage | 0.67 | V | 'OK'
4 | +VSOC | Voltage | 0.93 | V | 'OK'
5 | +VDDIO | Voltage | 1.11 | V | 'OK'
6 | +VDD_11_S3 | Voltage | 1.12 | V | 'OK'
7 | +CHIPSET_1.8V | Voltage | 1.86 | V | 'OK'
8 | +CHIPSET_1.05V | Voltage | 1.07 | V | 'OK'
9 | +12V | Voltage | 12.00 | V | 'OK'
10 | +5V | Voltage | 4.97 | V | 'OK'
11 | +3.3V | Voltage | 3.31 | V | 'OK'
12 | +5VSB | Voltage | 4.97 | V | 'OK'
13 | +3.3VSB | Voltage | 3.30 | V | 'OK'
14 | VBAT | Voltage | 3.25 | V | 'OK'
15 | CPU Package Temp | Temperature | 66.00 | C | 'OK'
17 | LAN Temp | Temperature | 60.00 | C | 'OK'
18 | DIMMA1_Temp | Temperature | 69.00 | C | 'OK'
19 | DIMMB1_Temp | Temperature | 70.00 | C | 'OK'
20 | DIMMC1_Temp | Temperature | 70.00 | C | 'OK'
21 | DIMMD1_Temp | Temperature | 69.00 | C | 'OK'
22 | DIMME1_Temp | Temperature | 67.00 | C | 'OK'
23 | DIMMF1_Temp | Temperature | 69.00 | C | 'OK'
24 | DIMMG1_Temp | Temperature | 71.00 | C | 'OK'
25 | DIMMH1_Temp | Temperature | 68.00 | C | 'OK'
nvme-pci-4100
Adapter: PCI adapter
Composite: +51.9°C (low = -0.1°C, high = +86.8°C)
(crit = +88.8°C)
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +65.8°C
Tccd1: +52.2°C
Tccd2: +54.0°C
Tccd3: +54.8°C
Tccd4: +52.2°C
Tccd5: +51.4°C
Tccd6: +54.0°C
Tccd7: +54.5°C
Tccd8: +51.9°C
For comparison, here is what the sensor readings normally look like under the same load:
ID | Name | Type | Reading | Units | Event
2 | +VCORE_0 | Voltage | 1.15 | V | 'OK'
3 | +VCORE_1 | Voltage | 1.16 | V | 'OK'
4 | +VSOC | Voltage | 1.04 | V | 'OK'
5 | +VDDIO | Voltage | 1.11 | V | 'OK'
6 | +VDD_11_S3 | Voltage | 1.14 | V | 'OK'
7 | +CHIPSET_1.8V | Voltage | 1.87 | V | 'OK'
8 | +CHIPSET_1.05V | Voltage | 1.06 | V | 'OK'
9 | +12V | Voltage | 12.00 | V | 'OK'
10 | +5V | Voltage | 4.97 | V | 'OK'
11 | +3.3V | Voltage | 3.30 | V | 'OK'
12 | +5VSB | Voltage | 4.97 | V | 'OK'
13 | +3.3VSB | Voltage | 3.30 | V | 'OK'
14 | VBAT | Voltage | 3.25 | V | 'OK'
15 | CPU Package Temp | Temperature | 67.00 | C | 'OK'
17 | LAN Temp | Temperature | 59.00 | C | 'OK'
18 | DIMMA1_Temp | Temperature | 70.00 | C | 'OK'
19 | DIMMB1_Temp | Temperature | 71.00 | C | 'OK'
20 | DIMMC1_Temp | Temperature | 70.00 | C | 'OK'
21 | DIMMD1_Temp | Temperature | 69.00 | C | 'OK'
22 | DIMME1_Temp | Temperature | 67.00 | C | 'OK'
23 | DIMMF1_Temp | Temperature | 70.00 | C | 'OK'
24 | DIMMG1_Temp | Temperature | 71.00 | C | 'OK'
25 | DIMMH1_Temp | Temperature | 69.00 | C | 'OK'
nvme-pci-4100
Adapter: PCI adapter
Composite: +51.9°C (low = -0.1°C, high = +86.8°C)
(crit = +88.8°C)
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +66.8°C
Tccd1: +64.5°C
Tccd2: +65.5°C
Tccd3: +66.2°C
Tccd4: +63.6°C
Tccd5: +63.9°C
Tccd6: +66.0°C
Tccd7: +66.0°C
Tccd8: +66.0°C
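The difference between the two snapshots can be pulled out mechanically. This is a rough Python sketch, assuming the pipe-delimited "ID | Name | Type | Reading | Units | Event" layout shown above. The 10% cutoff and the function names (parse_readings, changed_sensors) are illustrative choices, not part of any tool.

```python
def parse_readings(table_text):
    """Map sensor name -> (reading, units) from pipe-delimited sensor rows."""
    readings = {}
    for line in table_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # not a sensor row
        try:
            readings[fields[1]] = (float(fields[3]), fields[4])
        except ValueError:
            continue  # header row or non-numeric reading
    return readings


def changed_sensors(throttled, normal, min_rel_change=0.10):
    """List (name, throttled, normal, units) where the reading moved by more
    than min_rel_change relative to the normal value."""
    changed = []
    for name, (base, units) in normal.items():
        if name not in throttled or base == 0:
            continue
        value = throttled[name][0]
        if abs(value - base) / abs(base) > min_rel_change:
            changed.append((name, value, base, units))
    return changed


# A few rows copied from the two tables in this post.
throttled_rows = """
2 | +VCORE_0 | Voltage | 0.67 | V | 'OK'
4 | +VSOC | Voltage | 0.93 | V | 'OK'
9 | +12V | Voltage | 12.00 | V | 'OK'
15 | CPU Package Temp | Temperature | 66.00 | C | 'OK'
"""
normal_rows = """
2 | +VCORE_0 | Voltage | 1.15 | V | 'OK'
4 | +VSOC | Voltage | 1.04 | V | 'OK'
9 | +12V | Voltage | 12.00 | V | 'OK'
15 | CPU Package Temp | Temperature | 67.00 | C | 'OK'
"""

for name, value, base, units in changed_sensors(
    parse_readings(throttled_rows), parse_readings(normal_rows)
):
    print(f"{name}: {value} {units} while throttled vs {base} {units} normally")
```

Run over the full tables above, the only IPMI sensors that move by more than 10% are +VCORE_0, +VCORE_1, and +VSOC. The sketch does not parse the k10temp block, where the Tccd readings are roughly 12°C lower during the event while Tctl is about the same.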
Other than trying to see if it also happens at near-idle, I don’t know what my next troubleshooting steps would be. Has anybody else experienced this?