PCIe switches, DMA vs non-DMA, and NVMe - PEX8749 vs 48/47/24, etc

Hi all, I’m working on a pretty amusing W680 build that is configured well beyond rational justification. Let’s just say that multiple PCIe switches are involved, along with a lot of Optane (4x 960GB 905p, plus spare drives for up to 8x 118GB P1600X, 2x 380GB 905p, and 1x 400GB P5800X). While the W790 platform is exciting, I currently have a 12900K in this machine and could grab the 13900KS out of another if I want to consolidate a gaming VM into it, and there is no touching that single-core performance. Plus, I missed the memo on RDIMM incompatibility, so I bought into DDR5 UDIMMs before that was common knowledge.

One of the switches is a PEX88048, and I also have some PEX8749s (both of which support DMA), but the convenient m.2 carrier card formats tend to use the PEX8747/8748 or other variants which do not include the DMA engines (on Broadcom’s product pages, select the Specifications tab and look under DMA).

While the point is moot for U.2 drives, signal integrity is better when m.2 drives use a carrier instead of adapters and cables, so the lack of a DMA-capable PCIe 3.0 switch in an m.2 carrier format is annoying (I don’t see any based on the 8749). Happy to link a bunch of the cards I’ve looked at.

There are of course some PCIe 4.0 carriers, like the Accelsior 8M2 or Rocket 1508, that support DMA, but these are expensive for the use case (they would not be running at 4.0 here regardless, so the DMA engines would be the only potential benefit).

Anyway, this leads me to several questions that I’ve searched extensively for answers to, without finding clear information:

  • Is it a reasonable assumption that an NVMe drive behind a PCIe switch will end up using the DMA engines, if available? (assuming a modern Linux 6.x kernel, Proxmox in my case)
  • Is any information available on the CPU cost / performance impact of lacking the DMA engines?
  • Is there any lspci or lshw incantation to check if the kernel is recognizing the DMA engines on a particular switch? (even lspci -vvPPQnn didn’t yield an instance of “DMA” on my system with two switches known to support it)
  • Are there strace or other mechanisms to verify that DMA-related interrupts are occurring in the activity of a given NVMe? (I am relatively close to figuring this out; roughly what I’ve been trying is sketched below the list, but I’m not quite at a working answer)
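
For reference, here is roughly what I have been poking at from the shell for the last two questions. This is only a sketch: the plx_dma module name comes from the upstream kernel's dmaengine driver for the PEX87xx switch DMA engines, and I'm assuming the DMA engine would enumerate as its own PCI function (which may depend on the switch EEPROM and kernel config), so please correct me if any of this is off base.

# List every PLX/Broadcom function the kernel enumerated; if a switch's DMA
# engine is exposed, it should appear as an extra non-bridge function next to
# the 8749/88048 bridge functions.
lspci -nnk -d 10b5:

# If a DMA function shows up, see whether the plx_dma dmaengine driver
# (if your kernel was built with it) binds and registers any channels.
sudo modprobe plx_dma
lsmod | grep plx_dma
ls /sys/class/dma/

# Watch per-queue NVMe interrupts while running I/O against a drive behind the
# switch; each nvmeXqY line is one I/O queue's MSI-X vector.
watch -n1 'grep nvme /proc/interrupts'

(strace won't help for the interrupt question, since interrupts never surface at the syscall layer; watching /proc/interrupts is the closest easy view I've found.)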

Thanks for any kind wisdom or ideas on this subject! I also didn’t find a clear category this belonged in, so I contemplated both Motherboards and Enterprise - feel free to suggest a topic move.

I’ve been toying with cabled AIC HBAs that use PCIe switches (a Broadcom P411W-32P with the PEX88048 and a Delock 90504 with the PEX8749) for quite a while.

I can’t say anything about DMA (if there are easy-to-follow instructions, I could test it for you), and the one with the PEX8749 works fine after a firmware update.

The PEX88048 on the other hand reliably crashes a system after it wakes from S3 standby/sleep.

When you have your system up and running, I’d be very curious whether the AIC with the PEX88048 causes the same issues for you.

I’m mostly using AM4 X570 systems with 5900X/5950X and 128 GB ECC memory with the PCIe switches.
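
If you want a quick way to cycle suspend/resume for that kind of test once it's built, something along these lines works from the CLI (assuming the platform is set to real S3/"deep" rather than s2idle; rtcwake arms an RTC alarm so the box wakes itself):

# Check which suspend variant the platform will actually use ("deep" = S3).
cat /sys/power/mem_sleep

# Suspend to RAM and auto-wake after 60 seconds, then look for PCIe/NVMe
# trouble after resume.
sudo rtcwake -m mem -s 60
sudo dmesg | grep -iE 'pcie|nvme|aer' | tail -50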

Sure! I have been following your threads closely :slight_smile:. They are highly relevant to my interests!

I do have a couple of X570 machines available to test with, too (ECC of course, 5800X / 5950X). I have had great success with the PEX8749 on those systems, as well as on W480 (with a Xeon W-1290P) and C246 (Xeon E-2288G) in the past.

So far I have only tried the PCIe 4.0 switch, in my case a Rocket 1580, with the W680 system. It has not had any issues whatsoever, although I have not been testing sleep/suspend as a use case. While this unit has an MSRP of $699, I got insanely lucky and snagged one for just over $400 thanks to a weird Amazon pricing glitch. BTW, please do let me know if you come across any open-box or other deals on the P411W-32P. Even despite the firmware shenanigans, I have the sense it is clearly more “open” to tinkering, and I would like to have one of each if the opportunity arises.

Among the many parallel subjects here, one I wanted to ask you about in particular is lane allocation. I have been able to reprogram the EEPROM of my PEX8749 switches to set up lane counts in a variety of allocations, which is fantastically useful for many reasons (including … a x16 for a second switch, LOL!). It has been terribly disappointing that I can’t confirm any PCIe 4.0 switch has this capability exposed.

PLXMon doesn’t seem to be able to do this with the Rocket 1580, and Broadcom only publishes x1, x2, and x4 as supported for your card… but I wondered if it technically supports x8 and x16, just outside the envisioned use case. I have a number of PCIe slot adapters from the artful C-Payne PCBs and some pretty decent quality AliExpress ones too, so physical connectivity is not an issue for x8 or x16.
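
A quick empirical check, once something is cabled up through one of those x8/x16 adapters, would be to compare the advertised versus trained link width on the relevant port (the BDF below is just a placeholder for the switch downstream port or the device behind it):

# LnkCap is the maximum width/speed the port advertises; LnkSta is what the
# link actually trained at.
sudo lspci -vv -s <downstream-port-BDF> | grep -E 'LnkCap:|LnkSta:'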

The following EEPROM configuration is for the DiLinker LRNV9349-8I, which I’ve found to be very reliable and generally awesome. The fan is quite quiet too.

NVMe drives have their own DMA engines built in. The drives read/write memory themselves. A PCI switch would just route those memory accesses to your host memory. An NVMe drive wouldn’t know or care if a PCI switch has an additional DMA engine built in. I would guess those DMA engines built into the switches are probably for some specific application that knows how to use them, but wouldn’t be used by normal NVMe drives.
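
You can see this reflected in plain lspci output, for what it's worth (the 03:00.0 below is just a placeholder for your drive's address):

# BusMaster+ on the Control line means the endpoint itself is allowed to issue
# memory reads/writes, i.e. the drive is doing its own DMA to host RAM.
sudo lspci -vv -s 03:00.0 | grep 'Control:'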

Thank you — I had wondered about this, as I thought it would be necessary in the CPU-direct-attach case too. But, to be honest, it was not yet clear to me whether the switch would interfere with such operations.

A related curiosity is whether the ASMedia switches’ typical max payload size of 512 bytes carries any penalty compared to the PEX’s 2KB. I’m relatively close to testing this out (I figured that before testing the DMA question, I might as well make sure I could detect a difference between the 8749 and an ASM2824 first).

If anyone knows of a historical or specialized use case for the DMA engines in the 8749, I’d be interested in conceptualizing how they function in practice. Perhaps they can be used with multicast, fetching data from RAM to dispatch out to multiple endpoints without CPU involvement?

From a closer look at the lspci output, I think the answer to this will prove to be no, as most devices I’ve looked at claim a 256-512 byte max payload size.
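
For reference, this is roughly the sort of thing I was grepping for; DevCap is what each device is capable of, DevCtl is what actually got programmed, and the effective payload on a path is limited by the smallest value along it:

# Print each device followed by its DevCap/DevCtl lines that mention MaxPayload.
sudo lspci -vv | awk '/^[0-9a-f]/ {dev=$0} /MaxPayload/ {print dev ": " $0}'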

Much more interesting, however, is this slide deck about the NVMe Controller Memory Buffer (CMB), especially slides 4 and 7:

This is interesting because it even includes an example of drives connected to a switch that are able to use the address region mapped as CMB to perform a peer-to-peer transfer without generating upstream traffic on the switch.

However, in my system these CMB buffers are unfortunately not being mapped, and I’m not sure why; there is only the single required 16K address region. This doesn’t mean that DMA is disabled (I’m still curiously digging into how to introspect it), but it does make me wonder how CMB is intended to be used. It may be specific to SPDK or other user-space implementations, so I’ll try researching in that direction (one register-level check is sketched after the lspci output below).

For an Optane P1600X, here is the solitary 16K BAR (distinct from slide 4 in the CMB deck):

Subsystem: Intel Corporation Device [8086:380a]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 17
NUMA node: 0
IOMMU group: 33
Region 0: Memory at 82600000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
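
One way to check whether the drive advertises a CMB at all (as opposed to the kernel just not mapping it) is to read the controller registers; I believe nvme-cli's show-regs decodes CMBLOC/CMBSZ, and a zero CMBSZ would mean the controller simply doesn't implement one:

# Dump the controller registers in human-readable form and pick out the
# Controller Memory Buffer location/size registers.
sudo nvme show-regs -H /dev/nvme0 | grep -i cmb

# The nvme driver will only place submission queues in a CMB if this is
# enabled (it defaults to on in the kernels I've looked at).
cat /sys/module/nvme/parameters/use_cmb_sqes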

Very interesting post about the reconfiguration of the PEX8749. As I check with PLXMon, I’m not sure where you insert the values you listed in your table…
Did you remove all the existing ports 8-11/16-19 and then add each one back with offset 007C and the value from your table? For example, for one x8 and the rest at x4, would it be:
port 8 / offset 007c / val: 000000E1
And ports 20/21 are not added or modified?
