When is 8 memory channels not 8 memory channels?

Threadripper DDR5 Throttling problem!?

I have this awesome G.Skill kit of memory for Threadripper and EPYC systems: 8 sticks, 512GB total, DDR5-6400 (64GB per DIMM). It has quickly become the top contender for a 512GB configuration in one of these high-end workstation builds.

The results below show that the 9995WX (96 cores, and also the 64-core WX part) achieves the highest memory bandwidth, whereas the 64-core TRX50 Threadripper, while showing a modest uplift at 6400, still can't quite match a 32-core 7975WX at 6400 with 8 memory channels.

The most accurate way to put it: memory bandwidth doesn't strictly follow the socket or the channels available on the platform; it follows the core count. You (probably) need 48 or more cores, because of the number of chiplets involved, to reach maximum memory bandwidth.
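
For a quick sanity check on what "max" even means here: DDR5 moves 8 bytes per transfer per channel, so the nominal peak is channels x MT/s x 8 bytes. A back-of-the-envelope in shell (the configurations listed are just the ones discussed in this thread, nothing measured):

# nominal peak = channels * MT/s * 8 bytes per transfer
for cfg in "8 6400" "4 6400"; do
  set -- $cfg
  awk -v ch="$1" -v mts="$2" 'BEGIN { printf "%d ch @ %d MT/s -> %.1f GB/s nominal peak\n", ch, mts, ch*mts*8/1000 }'
done

That puts 8 channels @ 6400 at 409.6 GB/s nominal and 4 channels at 204.8 GB/s, so the ~322 GB/s below is roughly 79% of peak, the 7975WX's ~216 GB/s is roughly 53%, and the 4-channel parts' ~190-195 GB/s is 93-95%.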

Of course, for EPYC this isn't so; there we have EPYC "F" series CPUs with as few as one core per chiplet enabled.

Long live 8 (or more) chiplets for 8 memory channels!!

Results with the 9995WX (8 channels @ 6400 MT/s):

root@amdrocm:/home/w/Linux#     sudo sysctl -w vm.nr_hugepages=4000
vm.nr_hugepages = 4000
root@amdrocm:/home/w/Linux# ls
mlc  mlc_script.sh  redist.txt
root@amdrocm:/home/w/Linux# ./mlc_script.sh
Running initial MLC to establish target bandwidth...
Intel(R) Memory Latency Checker - v3.12
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0         117.0

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      322238.8
3:1 Reads-Writes :      310136.1
2:1 Reads-Writes :      304908.7
1:1 Reads-Writes :      286958.1
Stream-triad like:      301840.5
All NT writes    :      286935.8
1:1 Read-NT write:      288508.9

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node              0
       0        322268.1

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  212.45  322487.4
 00002  213.74  322620.9
 00008  214.76  322318.3
 00015  209.25  322413.8
 00050  215.09  323994.2
 00100  225.98  324406.1
 00200  226.60  324009.1
 00300  222.09  323698.0
 00400  209.87  323047.2
 00500  152.58  307602.3
 00700  137.87  233079.9
 01000  132.78  170425.3
 01300  130.75  134462.0
 01700  133.89  106836.2
 02500  133.09   74461.4
 03500  128.21   54171.2
 05000  127.16   38504.3
 09000  125.84   21886.6
 20000  125.02   10215.0

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        16.1
Local Socket L2->L2 HITM latency        16.1

Target Memory Bandwidth: About 322gb/sec
Running MLC (2); Bandwidth 322377.5 MB/sec; 100.0% of original
Running MLC (3); Bandwidth 322480.7 MB/sec; 100.0% of original
Running MLC (4); Bandwidth 322433.1 MB/sec; 100.0% of original
Running MLC (5); Bandwidth 322405.0 MB/sec; 100.0% of original
Running MLC (6); Bandwidth 322550.1 MB/sec; 100.0% of original
Running MLC (7); Bandwidth 322430.6 MB/sec; 100.0% of original
Running MLC (8); Bandwidth 322468.5 MB/sec; 100.0% of original
Running MLC (9); Bandwidth 322466.0 MB/sec; 100.0% of original
Running MLC (10); Bandwidth 322463.7 MB/sec; 100.0% of original
Running MLC (11); Bandwidth 322472.8 MB/sec; 100.0% of original
Running MLC (12); Bandwidth 322530.1 MB/sec; 100.0% of original
Running MLC (13); Bandwidth 322481.9 MB/sec; 100.0% of original
Running MLC (14); Bandwidth 322556.2 MB/sec; 100.0% of original
Running MLC (15); Bandwidth 322449.6 MB/sec; 100.0% of original
Running MLC (16); Bandwidth 322513.0 MB/sec; 100.0% of original
Running MLC (17); Bandwidth 322535.9 MB/sec; 100.0% of original
Running MLC (18); Bandwidth 322403.2 MB/sec; 100.0% of original
Running MLC (19); Bandwidth 322466.2 MB/sec; 100.0% of original
Running MLC (20); Bandwidth 322480.3 MB/sec; 100.0% of original
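
(For anyone wanting to repeat this: mlc_script.sh isn't anything fancy. It's roughly the loop sketched below, reconstructed from the output rather than copied from the exact script; the "ALL Reads" parsing and the --peak_injection_bandwidth re-runs are assumptions.)

#!/bin/bash
# Rough sketch of the re-run loop: run mlc once in full to establish a target,
# then re-run the peak-injection test and report each pass as a % of the first.
# Assumes ./mlc is Intel Memory Latency Checker and huge pages are already
# reserved (sysctl -w vm.nr_hugepages=4000, as above).
echo "Running initial MLC to establish target bandwidth..."
out=$(./mlc)
echo "$out"
base=$(echo "$out" | awk '/ALL Reads/ {print $NF; exit}')
echo "Target Memory Bandwidth: About $(awk -v b="$base" 'BEGIN {printf "%d", b/1000}')gb/sec"
for i in $(seq 2 20); do
  bw=$(./mlc --peak_injection_bandwidth 2>/dev/null | awk '/ALL Reads/ {print $NF; exit}')
  pct=$(awk -v b="$bw" -v t="$base" 'BEGIN {printf "%.1f", 100*b/t}')
  echo "Running MLC ($i); Bandwidth $bw MB/sec; $pct% of original"
done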

7975WX (32 core, 8 channel) @ 6400 MT/s

Intel(R) Memory Latency Checker - v3.12
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0          93.3

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      216582.6
3:1 Reads-Writes :      295526.7
2:1 Reads-Writes :      313226.1
1:1 Reads-Writes :      245388.7
Stream-triad like:      290518.8
All NT writes    :      122426.9
1:1 Read-NT write:      244245.9

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node              0
       0        216461.6

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  360.20  217970.4
 00002  355.01  217937.4
 00008  355.42  217965.2
 00015  354.84  217911.1
 00050  355.52  217931.8
 00100  356.75  217942.6
 00200  106.37  184581.9
 00300  103.13  124888.8
 00400  103.36   92040.3
 00500  101.21   74323.9
 00700  100.59   53659.9
 01000  100.15   37985.9
 01300   99.92   29483.3
 01700   99.84   22760.8
 02500   95.66   15747.9
 03500   95.47   11458.6
 05000   95.28    8234.4
 09000   94.99    4879.9
 20000   94.70    2570.3

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        18.4
Local Socket L2->L2 HITM latency        18.4



9970X (32 core, 4 channel / 256GB configuration)

Running initial MLC to establish target bandwidth...
Intel(R) Memory Latency Checker - v3.12
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0          98.9

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      189341.6
3:1 Reads-Writes :      174562.8
2:1 Reads-Writes :      174456.9
1:1 Reads-Writes :      175813.7
Stream-triad like:      175140.0

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node              0
       0        189460.7

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  177.81  189504.1
 00002  169.49  187525.9
 00008  163.60  189010.9
 00015  176.44  189444.9
 00050  178.06  189840.1
 00100  170.74  189873.7
 00200  171.04  189770.9
 00300  154.43  186897.0
 00400  115.51  151963.7
 00500  113.11  126630.1
 00700  110.46   94382.4
 01000  106.84   68299.5
 01300  107.99   53926.4
 01700  104.91   41803.9
 02500  107.99   29102.0
 03500  108.57   20975.2
 05000  133.54   14158.9
 09000  109.96    8591.2
 20000  109.78    4249.9

Measuring cache-to-cache transfer latency (in ns)...
Unable to enable large page allocation
Using small pages for allocating buffers
Local Socket L2->L2 HIT  latency        18.1
Local Socket L2->L2 HITM latency        18.6

Target Memory Bandwidth: About 190gb/sec
Running MLC (2); Bandwidth 186161.3 MB/sec; 98.2% of original
Running MLC (3); Bandwidth 188476.8 MB/sec; 99.5% of original
Running MLC (4); Bandwidth 188806.1 MB/sec; 99.6% of original
Running MLC (5); Bandwidth 187688.7 MB/sec; 99.0% of original
Running MLC (6); Bandwidth 184859.1 MB/sec; 97.5% of original
Running MLC (7); Bandwidth 188693.9 MB/sec; 99.6% of original
Running MLC (8); Bandwidth 187970.9 MB/sec; 99.2% of original
Running MLC (9); Bandwidth 184880.7 MB/sec; 97.6% of original
Running MLC (10); Bandwidth 188165.8 MB/sec; 99.3% of original
Running MLC (11); Bandwidth 188652.5 MB/sec; 99.6% of original
Running MLC (12); Bandwidth 188066.8 MB/sec; 99.2% of original
Running MLC (13); Bandwidth 188746.5 MB/sec; 99.6% of original
Running MLC (14); Bandwidth 188655.2 MB/sec; 99.6% of original
Running MLC (15); Bandwidth 188888.1 MB/sec; 99.7% of original
Running MLC (16); Bandwidth 187420.9 MB/sec; 98.9% of original
Running MLC (17); Bandwidth 186783.4 MB/sec; 98.6% of original
Running MLC (18); Bandwidth 185938.2 MB/sec; 98.1% of original
Running MLC (19); Bandwidth 188119.9 MB/sec; 99.3% of original
Running MLC (20); Bandwidth 184304.3 MB/sec; 97.3% of original

7980X (64 core, 256GB, 4 DIMMs)

./mlc_script.sh
Running initial MLC to establish target bandwidth...
Intel(R) Memory Latency Checker - v3.12
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0          78.7

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      194581.7
3:1 Reads-Writes :      174510.3
2:1 Reads-Writes :      170525.6
1:1 Reads-Writes :      164374.6
Stream-triad like:      168151.9
All NT writes    :      170939.1
1:1 Read-NT write:      163267.2

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node              0
       0        194468.2

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  575.31  194593.6
 00002  575.85  194355.5
 00008  571.85  194675.6
 00015  577.73  194682.4
 00050  562.69  194236.2
 00100  554.93  194250.7
 00200  544.65  194644.4
 00300  422.93  194485.3
 00400  109.55  170936.4
 00500  103.88  139046.1
 00700  100.61  101083.4
 01000   98.49   71824.4
 01300   97.31   55883.1
 01700   96.51   43070.0
 02500   91.74   29704.6
 03500   91.19   21483.2
 05000   90.79   15285.6
 09000   90.43    8831.4
 20000   90.23    4373.4

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        17.6
Local Socket L2->L2 HITM latency        17.5

Target Memory Bandwidth: About 195gb/sec

Enter the Throttling Problem

When we did our testing for the July video, we were testing mainly Asus, Gigabyte, and ASRock boards. Every board showed proper throttling behavior, and it was interesting how much heatspreaders helped.

What I didn't realize then was that Asus's memory throttling kicks in on 7000 series Threadripper CPUs, but not on 9000 series CPUs.

The memory will sail well past 95°C, 110°C, even 125°C on Asus TRX50 and WRX90 boards when throttling is not working correctly. That will permanently damage most memory.
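
(If you want to keep an eye on this yourself: DDR5 DIMMs carry a temperature sensor behind the SPD hub, and on newer kernels lm-sensors can usually read it via the spd5118 hwmon driver. Whether the sensors actually show up depends on the board and kernel, so treat the grep pattern below as a guess and adjust it to whatever sensors reports on your system.)

# watch DIMM temps once a second; requires lm-sensors and a kernel with spd5118
sudo modprobe spd5118 2>/dev/null   # no-op if already loaded or built in
watch -n 1 'sensors | grep -iA2 "spd5118\|dimm"'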

The G.Skill memory here adds a safety fuse, which will blow, but that still leaves you with a dead DIMM that has to be repaired by replacing the fuse.

Generally, this G.Skill kit doesn't draw as many watts at 6400 as our JEDEC 1.1V SK kit, except under the most extreme circumstances.

Color me surprised! But Asus is aware of the issue and working on a BIOS update, due soon, to resolve it.

We’ll update this thread when that happens.


It looks like something fairly substantial changed in the way latency is managed under memory-bandwidth load between Zen 4 and Zen 5 TR. That is a massive improvement in latency.

On TR non-PRO, both processors have the same inject-delay inflection point, but Zen 4 TR sits at triple the latency of Zen 5 TR.
On TR PRO, Zen 4 still hits a latency wall, but it isn't as severe (I wonder if this is because the IO die has fewer chiplets vying for attention on the low-die-count SKU).

Turns out that's one of the sus BIOS options I covered at the Zen 5 desktop launch. It's good for gamers the way it is on Zen 4, for some games but not all, and it's on by default on Zen 5.

It always makes the system way snappier under load.


Reference for the chiplet/bandwidth situation:
https://old.reddit.com/r/threadripper/comments/1azmkvg/comparing_threadripper_7000_memory_bandwidth_for/kseqkrg/


It makes me think that architectural nuances will begin to rear their heads at this scale. Specifically, I wanted to mention memory mapping/interleaving across channels and the different modes that are available, and also how the bandwidth test itself is set up. What if cores waste interconnect bandwidth in these tests by thrashing caches? (Allocation and access are very likely happening in powers of 2, which is quite the pitfall territory.)
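
(For anyone retesting along these lines, it's worth confirming how the platform actually presents memory and cores to the OS before blaming the benchmark. A quick check, assuming numactl is installed:)

# how many NUMA nodes, how much memory per node, and the node distances
numactl --hardware
# sockets, cores per socket, and NUMA-to-CPU mapping
lscpu | grep -i 'numa\|socket\|core'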

Preview of our handy-dandy fan brackets, @GigaBusterEXE


That is very cool. I might do something like that for the Alphacool Threadripper AIOs too, though my AIO is currently out of service and needs maintenance, since it's a couple of years old and I need longer tubes.


Cool fan bracket! I need something similar for the Thermaltake AW420 though.

My V-Color RAM would get dangerously close to 95°C in Prime95 tests. I had to add heatspreaders and attach heatsinks to the top of them to get it somewhat under control. Now they top out at 80-82°C.

The only reason I haven't added RAM fans yet is that I lack proper mounts.

P.S. Good to know about the issues with the Asus WRX90 board. I'll keep an eye out for BIOS updates.

Hey all, I've been watching the RAM temperature issue across several recent videos and created an account to share my experience and solution for hot RAM on Threadripper.

I got a Threadripper 7960, an ASUS TRX50 Sage WiFi, and 192GB of V-Color RAM running at 7000 MT/s for a great price on eBay a few months back. I was legit surprised at how hot the RAM was running, though, hitting over 90 degrees during Prime95.

I have little to no experience modeling things, but I found a basic 3D-printed fan bracket and modified it to fit my needs on the TRX50. It's designed for a 60mm fan and, instead of just blowing air AT the RAM, it's a funnel design that blows air THROUGH the RAM. My worst-case temps dropped 21 degrees, to a max of 69 (nice) during Prime95 stress testing. Normal-use temps are down to the mid-40s.

I tried to attach the STL files for the 60mm and 40mm versions, but I can't share files yet with my new account. The original conversation, including file links, is over on Serve the Home's forums under post 481505 if you're interested!


Fixed you up; you should be able to post now. Post pics! :smiley:

Back when I built my SPR-WS homelab machine, I remember trying some Bykski heat spreaders for RDIMMs; at the time they interfered with the PMIC inductors.

Memory idled between 40-50°C, which isn't bad, except that the edge DIMMs were active hotspots. Since I had a waterblock on the CPU, I just decided plopping a 120mm fan on them was good enough.

Maybe it would have been smarter of me to spread that airflow around back then. In general, we can't take relying 100% on ambient airflow for granted, for either storage or memory. With Gen5 (and soon Gen6) pushing speeds and DDR5 going ever faster, active cooling really is a must. Servers basically rely on a large amount of air flowing through the memory, and even there it's not enough; they specifically duct things to guarantee everything gets cooled properly. So DIY builders also have to pay special attention there. I do like those 3D-printed brackets; they'd have been better than what I did anyway.

When is 8 memory channels not 8 memory channels?

When you can't afford to populate them…