ASRock X670E Taichi - SSD speed disparity (SK Hynix P41 Platinum)

I have a weird issue with NVMe SSDs on the ASRock X670E Taichi.

I have two identical drives: SK Hynix P41 Platinum 2 TB.

One is in slot M2_1 and the other in slot M2_3 (though the same issue happens with M2_2 too).

The one in M2_3 is significantly slower: fio shows consistently lower write speeds for it.

I dug into the sudo lspci -vv output and see this difference.

Faster one:

5c:00.0 Non-Volatile memory controller: SK hynix Platinum P41/PC801 NVMe Solid State Drive (prog-if 02 [NVM Express])
	Subsystem: SK hynix Platinum P41/PC801 NVMe Solid State Drive
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 26
	IOMMU group: 33
	Region 0: Memory at df500000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [b0] MSI-X: Enable+ Count=33 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 512 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s, Width x4

Slower one:

57:00.0 Non-Volatile memory controller: SK hynix Platinum P41/PC801 NVMe Solid State Drive (prog-if 02 [NVM Express])
	Subsystem: SK hynix Platinum P41/PC801 NVMe Solid State Drive
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 26
	IOMMU group: 30
	Region 0: Memory at a0400000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [b0] MSI-X: Enable+ Count=33 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 26W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s, Width x4

Note how the faster one has:

			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75W
...
			MaxPayload 512 bytes, MaxReadReq 512 bytes

And the slower one has:

			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 26W
...
			MaxPayload 128 bytes, MaxReadReq 512 bytes

Can anyone please explain whether the issue is with the slot, or whether the SSDs are somehow different? Why does one have a lower power limit and lower MaxPayload, and can anything be done about it?

Thanks!
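
(In case it helps anyone compare on their own board: the fields that differ can be pulled out in one go with something like the loop below. The 5c:00.0 and 57:00.0 addresses are from my dumps above and will differ on other systems.)

for dev in 5c:00.0 57:00.0; do
    echo "== $dev =="
    sudo lspci -vv -s "$dev" | grep -E 'LnkSta:|MaxPayload|SlotPowerLimit'
done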

Just to show the difference with fio.

For faster one:

sudo fio --randrepeat=1 --ioengine=io_uring --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=2G --readwrite=randwrite

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.36
Starting 1 process
test: Laying out IO file (1 file / 2048MiB)
Jobs: 1 (f=1): [w(1)][-.-%][w=584MiB/s][w=150k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=9170: Thu Mar  7 01:54:33 2024
  write: IOPS=137k, BW=535MiB/s (561MB/s)(2048MiB/3826msec); 0 zone resets
   bw (  KiB/s): min=463624, max=614664, per=100.00%, avg=549829.71, stdev=51997.08, samples=7
   iops        : min=115906, max=153666, avg=137457.43, stdev=12999.27, samples=7
  cpu          : usr=6.93%, sys=41.86%, ctx=480263, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,524288,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=535MiB/s (561MB/s), 535MiB/s-535MiB/s (561MB/s-561MB/s), io=2048MiB (2147MB), run=3826-3826msec

Disk stats (read/write):
  nvme0n1: ios=0/500750, sectors=0/4013070, merge=0/0, ticks=0/8810, in_queue=8818, util=97.46%

For slower one:

sudo fio --randrepeat=1 --ioengine=io_uring --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=2G --readwrite=randwrite

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.36
Starting 1 process
test: Laying out IO file (1 file / 2048MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=138MiB/s][w=35.4k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=9572: Thu Mar  7 01:55:49 2024
  write: IOPS=37.4k, BW=146MiB/s (153MB/s)(2048MiB/14032msec); 0 zone resets
   bw (  KiB/s): min=114896, max=201424, per=100.00%, avg=149458.29, stdev=27123.73, samples=28
   iops        : min=28724, max=50356, avg=37364.57, stdev=6780.93, samples=28
  cpu          : usr=2.74%, sys=13.59%, ctx=692700, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,524288,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=146MiB/s (153MB/s), 146MiB/s-146MiB/s (153MB/s-153MB/s), io=2048MiB (2147MB), run=14032-14032msec

Disk stats (read/write):
  nvme1n1: ios=0/535215, sectors=0/5212586, merge=0/0, ticks=0/18356, in_queue=19284, util=98.81%

The answer is pretty much identical for all consumer platforms to date: the slots are not equally provisioned. Only the first slot is connected directly to the CPU, with x4 PCIe lanes allocated to it.

All the others go through the chipset and share a common x4 uplink with every other chipset device. The third one is daisy-chained via the second one, so performance might be even worse, depending on current system activity.
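
You can sanity-check that topology from software, too. A minimal sketch (the 5c:00.0 / 57:00.0 addresses come from the dumps above): the tree view should show the CPU-attached drive directly under a root port, and the chipset-attached ones behind one or more downstream bridges.

sudo lspci -tv    # follow the branches down to buses 5c and 57 to see how many bridges each drive sits behind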

Well, yeah, I already looked at that diagram and I get that it goes through the chipset, but does that mean worse throughput, or that the MaxPayload size is expected to be lower, especially if no other devices are using the bandwidth?

I.e. I can expect latency to be worse due to the longer path, but not throughput? Or is that normal for a chipset connection?

Oops, ASRock wasted one of the PCIe x4 connections from the CPU on the USB4 controller.

Right now, motherboards are designed by the marketing team, not by engineers.

I guess that's not necessarily a bad thing? You could connect a USB 4 card, like a 10 Gbps SFP+ adapter or something, or a whole USB 4 dock, and use all that bandwidth somehow. I.e. you can see it as an alternative to a fast PCIe connector.

But the fact that the secondary SSDs are so nerfed compared to the primary one seems pretty wrong.

The chipset is connected to the CPU only by PCIe Gen 4 x4, and there is also some overhead.
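
Back-of-the-envelope (counting only the 128b/130b line encoding, so the real usable figure is a bit lower once packet overhead is added), that shared uplink tops out around:

echo '16 * 4 * 128 / 130 / 8' | bc -l    # 16 GT/s per lane x 4 lanes, /8 for bytes -> ~7.88 GB/s

and that ceiling is shared by everything hanging off the chipset (the other M.2 slots, SATA, USB, LAN).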

If throughput is important for you, such as in a RAID setup, return the board and find one with two M.2 slots wired directly to the CPU.

It's not a huge deal, but I still don't get it. The primary one is also PCIe 4.0 x4, because my SSD is PCIe 4.0, not PCIe 5.0. So being directly connected to the CPU should help latency, not throughput, if the number of lanes is the same, unless I'm missing something.

There is the MaxPayload issue, though. Maybe the chipset limits it, and that can affect throughput even with the same number of lanes.

I'm just curious why the write speed is more than 3 times worse here than with the primary one.
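
If anyone wants to poke at the MaxPayload theory, I think the thing to compare is the bridge directly upstream of the slow drive, since the endpoint's MaxPayload gets programmed based on what the path supports. Rough sketch (0000:57:00.0 is from my dump above; BB:DD.F is just a placeholder for whatever bridge the first command prints):

dirname "$(readlink -f /sys/bus/pci/devices/0000:57:00.0)"    # the last path element is the upstream bridge
sudo lspci -vv -s BB:DD.F | grep MaxPayload                   # what that bridge advertises (DevCap) and has configured (DevCtl)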

I would call that bad design. ASRock could just have put a PCIe Gen 4 x4 slot there instead of hard-wiring a USB 4 controller.
The X670E can provide a USB 3.2 20 Gbps connection natively without occupying extra PCIe lanes, and there is little difference between USB 3.2 20 Gbps and USB 4: USB 4 adds PCIe tunneling and DisplayPort passthrough on top of the normal USB 3.2 20 Gbps functionality.

Under your fio test, and under most real-world benchmark workloads, latency affects the throughput of storage.

The MaxPayload size difference is a red herring; it likely isn't causing the disparity in performance. The extra latency and overhead introduced by the first chipset is what's causing it, and the second chipset is even further downstream of the first, so it will have even worse latency and overhead. This latency wouldn't be as noticeable if “slow” devices were all that was attached to the chipset.
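
One way to make that visible: --gtod_reduce=1 in the runs above turns off fio's latency accounting, so rerunning without it (and at iodepth=1, to isolate per-I/O latency) should show the clat difference between the two drives directly. A minimal sketch along the lines of the original command:

sudo fio --name=latprobe --filename=test --ioengine=io_uring --direct=1 --bs=4k --iodepth=1 --readwrite=randwrite --size=1G

Then compare the clat averages/percentiles reported for each drive.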

We know there is a significant and varying performance impact, but the technical details of how, and how much, each variable impacts the final observed performance are pretty much black-boxed magic.

  • how much impact does the PCIe switch alone incur if there is no bandwidth contention?
  • how much impact is there if there is bandwidth contention? (a crude way to probe this is sketched after this list)
  • does the impact scale linearly, or worse, with the number of switched streams/active devices?
  • is switching wasteful for low-bandwidth devices/streams, i.e. is there a per-device overhead?
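
For the contention question, a crude probe is to hammer two chipset-attached devices at the same time and compare against their solo numbers (e.g. with the drives moved to M2_2 and M2_3). A sketch, assuming made-up mount points for the two drives:

sudo fio --ioengine=io_uring --direct=1 --bs=128k --iodepth=32 --rw=read --size=4G --runtime=30 --time_based --name=slot_a --filename=/mnt/m2_2/contention.test --name=slot_b --filename=/mnt/m2_3/contention.test

Options before the first --name are shared by both jobs; each job then reads its own file, so both drives load the chipset uplink at once.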

Maybe there is a research paper on this somewhere in academia, but I doubt it.

OEMs themselves will definitely not publish an exposé on their own cost-cutting. I am pretty sure good PLX switch chips and a well-thought-out design could alleviate this, but the chips alone would cost more than the board itself.

Very high-end Threadripper boards use them in fascinating layouts (each x16 slot has x8 wired directly and x8 switched), but those boards start at $1000+.

God damn you Broadcom.

It's just interesting that the whole secondary connection slashes I/O write performance for the SSD by more than 3x. Not something I'd expect, but I guess it's good to know that's how it often goes on these types of motherboards.