Slow Performance with Corsair MP600 PRO NH PCIe-4

Hi All,

I am running a Linux server with the following specs:

NVMe Drives: 2x Corsair MP600 PRO NH 8 TB PCIe4 NVMe in Software (mdadm) RAID-1
Motherboard: AsRockRack B650D4U
CPU: AMD Ryzen 7950X

I notice a significant slowdown, regardless of the disk benchmarking tool (I tried both dd and fio), after the server has been powered on for a few hours. If I reboot the server, the disk I/O results are back up in the 2-3 GB/s range, but if I try again a few hours later, it's down to the 300 MB/s range.

To illustrate, here's what it's supposed to look like:

fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/md126):

Block Size | 4k (IOPS) | 64k (IOPS)
------ | ---- | ----
Read | 903.14 MB/s (225.7k) | 2.44 GB/s (38.2k)
Write | 905.52 MB/s (226.3k) | 2.45 GB/s (38.4k)
Total | 1.80 GB/s (452.1k) | 4.90 GB/s (76.6k)
| |
Block Size | 512k (IOPS) | 1m (IOPS)
------ | ---- | ----
Read | 3.08 GB/s (6.0k) | 3.14 GB/s (3.0k)
Write | 3.24 GB/s (6.3k) | 3.35 GB/s (3.2k)
Total | 6.33 GB/s (12.3k) | 6.49 GB/s (6.3k)

And here’s what it looks like after the server has been powered on for more than a few hours:

fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/md126):

Block Size | 4k (IOPS) | 64k (IOPS)
------ | ---- | ----
Read | 85.42 MB/s (21.3k) | 328.25 MB/s (5.1k)
Write | 85.65 MB/s (21.4k) | 329.98 MB/s (5.1k)
Total | 171.07 MB/s (42.7k) | 658.23 MB/s (10.2k)
| |
Block Size | 512k (IOPS) | 1m (IOPS)
------ | ---- | ----
Read | 412.86 MB/s (806) | 414.53 MB/s (404)
Write | 434.80 MB/s (849) | 442.14 MB/s (431)
Total | 847.67 MB/s (1.6k) | 856.67 MB/s (835)
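For anyone who wants to reproduce numbers like these, the yabs script drives fio roughly along these lines. This is a sketch only: the exact flags yabs uses are an assumption, and `TESTFILE` is a placeholder that should point at the filesystem sitting on /dev/md126.

```shell
# Sketch of a mixed 50/50 random read/write run, similar in spirit to the
# yabs 4k pass. Flags are assumptions, not the exact yabs invocation.
TESTFILE=${TESTFILE:-./fio-test.bin}
if command -v fio >/dev/null 2>&1; then
  fio --name=rw_4k --filename="$TESTFILE" --size=256M \
      --rw=randrw --rwmixread=50 --bs=4k --iodepth=64 --numjobs=2 \
      --ioengine=libaio --direct=1 --runtime=15 --time_based \
      --group_reporting
  rm -f "$TESTFILE"
else
  echo "fio not installed; install it first (e.g. dnf install fio)"
fi
```

Repeating the same bs with 64k, 512k, and 1m gives the other columns; `--direct=1` matters here, since buffered runs mostly measure the page cache.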

At first I thought it might be an issue with the NVMes filling up, since the I/O tests were consistently good when I first deployed this server. However, some of our other servers with the exact same build are at 80-90% capacity and still have decent disk I/O performance, so I'm not convinced that's the case here. I also noticed the issue occurs on this server when the NVMes are only at 30% usage, so I don't think it's related to how much storage capacity is used.

iotop shows less than 50-100 MB/s of disk usage at any given time, and CPU usage is low.

Total DISK READ : 768.05 K/s | Total DISK WRITE : 11.91 M/s
Actual DISK READ: 768.05 K/s | Actual DISK WRITE: 12.09 M/s

Firmware looks to be up to date according to nvme list:

[root@server ~]# nvme list
Node          SN              Model                 Namespace  Usage              Format       FW Rev
------------- --------------- --------------------- ---------- ------------------ ------------ --------
/dev/nvme0n1  A5LIB340001QRC  Corsair MP600 PRO NH  1          8.00 TB / 8.00 TB  512 B + 0 B  EIFM51.3
/dev/nvme1n1  A5LIB340001PT7  Corsair MP600 PRO NH  1          8.00 TB / 8.00 TB  512 B + 0 B  EIFM51.3

Here are the temperature reading/smart log data:

[root@server ~]# nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 63 C (336 Kelvin)
available_spare : 100%
available_spare_threshold : 5%
percentage_used : 0%
endurance group critical warning summary: 0
data_units_read : 115,067,351
data_units_written : 36,177,997
host_read_commands : 925,472,295
host_write_commands : 831,088,874
controller_busy_time : 2,688
power_cycles : 3
power_on_hours : 1,185
unsafe_shutdowns : 1
media_errors : 0
num_err_log_entries : 4
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
[root@server ~]# nvme smart-log /dev/nvme1n1
Smart Log for NVME device:nvme1n1 namespace-id:ffffffff
critical_warning : 0
temperature : 61 C (334 Kelvin)
available_spare : 100%
available_spare_threshold : 5%
percentage_used : 0%
endurance group critical warning summary: 0
data_units_read : 137,448,840
data_units_written : 20,550,570
host_read_commands : 1,113,104,387
host_write_commands : 810,468,119
controller_busy_time : 2,616
power_cycles : 3
power_on_hours : 1,185
unsafe_shutdowns : 1
media_errors : 0
num_err_log_entries : 4
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
[root@server ~]# dmesg | grep nvme
[ 1.234403] nvme nvme0: pci function 0000:0c:00.0
[ 1.234415] nvme nvme1: pci function 0000:09:00.0
[ 1.257469] nvme nvme1: Shutdown timeout set to 10 seconds
[ 1.260143] nvme nvme0: Shutdown timeout set to 10 seconds
[ 1.535500] nvme nvme1: 32/0/0 default/read/poll queues
[ 1.538258] nvme1n1: p1 p2 p3 p4 p5
[ 1.584838] nvme nvme0: 32/0/0 default/read/poll queues
[ 1.587866] nvme0n1: p1 p2 p3 p4 p5
[root@server ~]# cat /proc/mdstat
Personalities : [raid1]
md123 : active raid1 nvme1n1p4[0] nvme0n1p4[1]
52160 blocks super 1.0 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk

md124 : active raid1 nvme1n1p5[0] nvme0n1p5[1]
7661665280 blocks super 1.2 [2/2] [UU]
bitmap: 19/58 pages [76KB], 65536KB chunk

md125 : active raid1 nvme0n1p3[1] nvme1n1p3[0]
1047552 blocks super 1.2 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active raid1 nvme0n1p1[1] nvme1n1p1[0]
83885056 blocks super 1.2 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk

md127 : active raid1 nvme0n1p2[1] nvme1n1p2[0]
67107840 blocks super 1.2 [2/2] [UU]

unused devices: &lt;none&gt;
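Since the mdstat snapshot above only shows one moment in time, it may also be worth confirming that none of the arrays kicks off a scheduled resync or check later on (the periodic raid-check job can depress I/O long after boot). A quick sketch:

```shell
# Sketch: report the current sync_action for every md array.
# "idle" means no resync/check/repair is running; array names
# are taken from the mdstat output above.
for md in /sys/block/md*/md/sync_action; do
  [ -e "$md" ] || { echo "no md arrays visible"; break; }
  printf '%s: ' "$md"
  cat "$md"
done
```

Running this during a slow period would rule the resync angle in or out definitively.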

As you can see above, temperatures look fine for both NVMes as well, so I don't think it's a thermal throttling issue (unless I'm missing something here).

I've already tried updating to kernel-lt (5.x) as well as kernel-ml (6.x); the same symptoms persist.

What am I missing here? I've already verified there's nothing crazy in terms of resource usage (iotop and top look fine), the RAID array is not rebuilding, and pcie_aspm is already set to performance:

[root@server ~]# cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
[root@server ~]# cat /sys/module/pcie_aspm/parameters/policy
default [performance] powersave powersupersave
[root@server ~]# cat /sys/block/nvme0n1/queue/write_cache
write back
[root@server ~]#
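One thing the checks above don't cover is NVMe APST (Autonomous Power State Transitions). That's power management done by the drive itself, separate from the PCIe ASPM policy, and it's a fairly common cause of NVMe slowdowns that only show up after the machine has been up for a while. A sketch of how to inspect it with nvme-cli (device name taken from the post above):

```shell
# Sketch: dump the APST table (NVMe feature 0x0c) -- drive-side power
# management, independent of the pcie_aspm policy shown above.
if command -v nvme >/dev/null 2>&1; then
  nvme get-feature -f 0x0c -H /dev/nvme0 2>/dev/null \
    || echo "could not read APST feature (device missing or permission denied)"
else
  echo "nvme-cli not installed"
fi
```

If APST does turn out to be implicated, booting with `nvme_core.default_ps_max_latency_us=0` keeps the drives out of the deeper power states; that's a commonly used workaround, at the cost of some idle power.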

Thanks in advance for any help or guidance here.

I have no experience with the Corsair MP600 PRO, but my WD SN850s throttle above 55C.

I have an MP600 Pro (1TB, heatsink).

It's filled to roughly one third. If you give me the fio command, I can run the test here for comparison.

Could be as simple as that.

Hi MazeFrame, thanks for the help. Here is the command:

curl -sL yabs.sh | bash -s -- -ig

You can also run a simple dd test:

dd if=/dev/zero of=test bs=64k count=16k conv=fdatasync;unlink test

Let me know what you see.


Thanks for the feedback. I'll look into some angles for reducing the heat on the NVMes and see if that helps.

That said, Corsair's website lists the operating temperature range as "0°C to +65°C", so I'm not sure why it would be throttling at 63C/61C respectively.

fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/nvme0n1p2):
---------------------------------
Block Size | 4k            (IOPS) | 64k           (IOPS)
  ------   | ---            ----  | ----           ---- 
Read       | 1.38 GB/s   (346.2k) | 1.86 GB/s    (29.1k)
Write      | 1.38 GB/s   (347.2k) | 1.87 GB/s    (29.2k)
Total      | 2.77 GB/s   (693.5k) | 3.73 GB/s    (58.3k)
           |                      |                     
Block Size | 512k          (IOPS) | 1m            (IOPS)
  ------   | ---            ----  | ----           ---- 
Read       | 2.51 GB/s     (4.9k) | 2.73 GB/s     (2.6k)
Write      | 2.64 GB/s     (5.1k) | 2.91 GB/s     (2.8k)
Total      | 5.16 GB/s    (10.0k) | 5.64 GB/s     (5.5k)
maze@TheFrame ~ % dd if=/dev/zero of=test bs=64k count=16k conv=fdatasync;unlink test
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.477096 s, 2.3 GB/s

Hmm, yeah, that’s the expected result.

What temperature is your NVMe running at when you check with nvme smart-log?

What sort of heatsink are you using? (I don’t have one at the moment).

nvme smart-log /dev/nvme0n1

sensors has composite temp at 30°C

It is the MP600 with the heatsink and it sits in the airflow from the CPU cooler (can be seen here).

Heh yeah, you're getting significantly lower temperatures. I'll look into that, thanks!

Tried installing M.2 heatsinks, but I'm still facing the same symptoms. After a couple of days, the server's disk I/O throttles again to the 500 MB/s level. Upon rebooting, it's fine again and can hit 2-3 GB/s in an I/O test, until a couple of days later when it drops back down, even with the NVMes at 45C and 51C respectively.
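Since the drop takes days to appear, one way to narrow down when it happens is to log a cheap throughput sample alongside the drive temperature every so often (from cron, say) and look for the inflection point. A sketch; the log path, sample size, and device name are arbitrary placeholders:

```shell
#!/bin/sh
# Sketch: append a timestamped dd throughput sample (plus drive temps,
# when nvme-cli is available) to a log, to correlate the slowdown with
# uptime rather than guessing.
LOG=${LOG:-./io-samples.log}
{
  date '+%Y-%m-%d %H:%M:%S'
  # Small sequential write sample (16 MiB), cheap enough to run hourly.
  dd if=/dev/zero of=./ddtest.bin bs=64k count=256 conv=fdatasync 2>&1 | tail -n 1
  rm -f ./ddtest.bin
  command -v nvme >/dev/null 2>&1 \
    && nvme smart-log /dev/nvme0n1 2>/dev/null | grep -i '^temperature' \
    || true
} >> "$LOG"
```

If throughput in the log falls off at a consistent uptime regardless of temperature, that would point away from thermals and toward something time-dependent like power management.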

Any other ideas here?
