Random thoughts on PCIe switching

With most consumer CPUs limited to 20 or 20+4 PCIe lanes it is easy to run out of lanes. If you need x16 worth of bandwidth to your GPU, everything else is locked into x4 devices. PCIe switching can get you more lanes, but you're still limited to those four lanes to the CPU, and from what I understand link speed is negotiated end to end. That is, if you fan out those four lanes (let's say they're Gen 4 for this example) into eight lanes and connect a Gen 3 device, the uplink also negotiates down to Gen 3. How is it that a chipset can connect to the CPU at Gen 4, and PCIe slots off the chipset running at lower generations don't downgrade that uplink to the CPU, yet there is no switching/muxing/whatever to convert one generation of PCIe to another (at least moving backward)?
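To put rough numbers on what I mean, here's a back-of-the-envelope sketch (mine, counting only line-encoding overhead and ignoring packet overhead) of per-direction link bandwidth:

```python
# Rough per-direction PCIe bandwidth: transfer rate x encoding efficiency x lanes.
# Gen 1/2 use 8b/10b encoding; Gen 3 and later use 128b/130b.
RATE_GT_S = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0}
ENCODING = {1: 8 / 10, 2: 8 / 10, 3: 128 / 130, 4: 128 / 130, 5: 128 / 130}

def link_gb_s(gen: int, lanes: int) -> float:
    """Approximate usable GB/s per direction for a link."""
    return RATE_GT_S[gen] * ENCODING[gen] * lanes / 8  # Gb/s -> GB/s

print(f"Gen 4 x4: {link_gb_s(4, 4):.1f} GB/s")  # ~7.9 GB/s uplink
print(f"Gen 3 x8: {link_gb_s(3, 8):.1f} GB/s")  # ~7.9 GB/s: same raw rate, twice the lanes
print(f"Gen 3 x4: {link_gb_s(3, 4):.1f} GB/s")  # ~3.9 GB/s: what's left if the uplink drops to Gen 3
```

A Gen 4 x4 uplink carries the same raw rate as Gen 3 x8 downstream, which is exactly the conversion I'm wishing for.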

What got me thinking about this is that I have a bunch of old NICs that are mostly PCIe 3. I either have to handicap my GPU or my NIC even though I have enough bandwidth; it just can't be efficiently utilized because of the way PCIe works.

Is it just not worth anybody's time (too niche), or would developing this be a threat to non-consumer sales?

I think it’s both.

95% of customers don't connect anything other than a GPU to their PC. That's why we have motherboards with useless x16 PCIe 5.0 slots no one can utilize.

HEDT is too profitable to allow peasants to have PCIe slots. That's why you see motherboards with four M.2 slots but only x1 electrical PCIe slots.

I made a similar post in the storage subforum. M.2 NVMe SSDs are cheaper than SATA SSDs, but depending on the motherboard, populating all four M.2 slots takes lanes away from the CPU.

I have old PCIe 3 M.2 NVMe SSDs, so they don't need four lanes of PCIe 4 or PCIe 5 each. A PCIe switch would be nice; the problem is that PCIe switch chips are expensive, so for my “problem” with M.2 SSDs it doesn't make economic sense.

But I like the idea. I have an old RTX 2080 I'm not using; that card shouldn't need many PCIe 5 lanes (converted to PCIe 3) to keep it busy, and then I could have a second card to pass through to a VM.

Yes, the current situation with consumer CPUs, or rather their motherboards, is frustrating.

Current-gen CPUs have, and arguably need, 20+ Gen 5 PCIe lanes to provide enough data to keep the CPU busy. AMD clearly realized that this class of CPU is good enough for many workloads with the release of the EPYC 4xxx server CPUs compatible with AM5 boards.

There are rare boards that allow flexible bifurcation (x4/x4/x4/x4, x8/x8, or x8/x4/x4). Combined with relatively cheap add-on cards (splitting a physical x16 slot into multiple PCIe slots, or an x8 PEX switch card feeding four Gen 3 x4 U.2 ports), this can enable efficient use of Gen 3 NICs alongside relatively fast NVMe storage, enabling server workloads that current motherboards don't support out of the box.
However, by widening the PCIe bottleneck I expect you'll run into the dual-channel memory bottleneck on that platform.

So the consumer platform looks tantalizingly like it can be used for server workloads, and it can, offering significant savings over more capable hardware platforms. But it will always be limited to entry-level workloads and a game of compromises when pushing the boundaries.

Businesses should look at HEDT platforms for mid-level workloads to make their lives easy.
Advanced amateurs can do the same, as these platforms are expensive but obtainable. It's hard to see amateur (homelab) workloads that require the level of performance current-gen HEDT (AMD TRX50, Xeon W-2xxx) offers.
Server-level hardware (Xeon, EPYC) is reserved for businesses that understand their needs and can afford the performance these platforms provide.

As you point out, switching with speed shifting is provided to consumers via chipsets. Broadcom's switch niche seems too small for Intel to care much about (see the Stratix-Agilex scalable switch) and misaligned with ASMedia.

There’s a couple other ways to address this.

  • Update 3.0 x8 devices to 4.0 x4 or 5.0 x2. Or similar.
  • Provide a 4.0 x8 chipset slot from DMI x8 (Intel Z) or 5.0 x4 (Promontory).

That neither of these has happened seems to me to indicate a market deemed too small to be a priority, as does the general lack of PCIe 4.0 x1, x2, and non-NVMe x4 devices. The only ones I can think of are Aquantia's x1 10 Gb NICs.

There’s Intel’s x8/x8, the x8 switch on AMD Taichis, and past that I can only think of one motherboard offhand offering x16 + x4(16) + x4(16) slots. I’m pretty sure if Intel or AMD supported it there’d be 5.0 x16 + 4.0 x8 motherboards for dual GPU which could also be used as GPU + legacy server NIC.

I think it’s worth also noting x8 or even x4 isn’t necessarily much of a GPU handicap. Scaling tests often find negligible differences and -2% (x8) or 6-7% (x4 equivalent) on a 4090’s fairly minor. So x8/x8 to full rate a 50 or 100 Gb NIC’s not terrible.

Not sure about this. If HEDT were making good money, Intel wouldn't have exited the segment, or shrunk it to Xeon W, depending on how you define things. AMD also presumably wouldn't be as on-again, off-again with X (versus WX) Threadrippers. From what I read, a lot of the peasants are unhappy with LGA 1700 and AM5 PCIe motherboard prices and aren't going to pay for LGA 4677, sTR5, or RDIMMs.

It seems to me AMD and Intel have something of a bundling problem here, but either haven't really recognized it or are overestimating workstation upsellability. For example, Storm Peak's non-Pro IO die (quad channel, 48 Gen 5 + 24 Gen 4 lanes) doesn't need to be tied to a package and socket capable of 12 CCDs, 8 channels, and 128 lanes.

Based on phrasing and content I'm getting an LLM vibe from this post. If that's in the right direction, a reminder of the AI rule seems appropriate. Anyways…

I’ve not exhaustively surveyed Intel boards but it’s my impression x8/x8 and x8/x4/x4 support’s typical of upper end Z. Every recent-ish AMD board I’ve checked supports several x16 bifurcation patterns, including x4/x4/x4/x4. I suspect the OP’s more concerned with x16 GPU + x8 NIC or maybe x8/x8 than with the NVMe possibilities, though.

Widening’s often unnecessary. My experience is moving ~7 GB/s with a 4.0 x4 NVMe easily saturates 12-16 core CPUs and dual channel DDR if the data wants even fairly minimal processing or effort to produce. That’s mostly because it’s seldom practical to keep code single pass or maintain close enough locality that data flows through L3 only once. For example, a single Zen 5 core clocked at 5.2 GHz demands 1 TB/s data bandwidth to fully utilize its 2x512 load + 1x256 bit store units but has less than 250 GB/s available when operating on 64+ kB of data.

If it’s pure IO though, then ~84 GB/s copy bandwidth is possible with dual channel DDR5-5600. DMAing full rate 100 GbE onto a 5.0 x4 NVMe takes 24 Gb/s (12 GB/s from source plus 12 to the sink). 84 GB/s would be more like full rate DirectStorage between three 5.0 x4 NVMes and three dGPUs.

To peasants like me HEDT looks like a money grab; to AMD/Intel, every sale of a HEDT system to a business is lost Xeon/EPYC revenue.

Nope. Hand written. I basically know AMD land and tried to be inclusive of other options in my response.
I think we’ll see similar discussions for a while.

B550/X570 AM4 boards typically enable x4/x4/x4/x4 bifurcation on their x16 slot.
I have a more recent board that only enables x8/x8, and while I was appalled at first, I found I could make it work using a $50 adapter.
I have read about x8/x4/x4 bifurcation on consumer motherboards but have no experience with it.

I found 4x 4.0 NVMe to be bottlenecked to 12 GB/s in random reads/writes on AM4. Maybe things are better on newer platforms.

Hard tangent:

I don’t know why but this triggered my OCD and sent me looking for a post back from May of last year where a similar thing happened:

To actually add something useful to the conversation:
To me it seems like the larger, more useful PCIe switches are basically at parity with, or exceed, the cost of just buying a “bigger” platform, which definitely limits their appeal.
An entire current-generation 80-lane Xeon or 64-lane EPYC CPU can be had for ~400 USD, which is less than the cost of the PCIe switch chip alone, with no CPU functionality.

I think you’re referring to ES or similar pre production samples (otherwise I am interested in a link).
While you may be comfortable with using these chips (they never were products) many will not for their limitations (require special BIOS versions, may not have all the features of the production version, etc.).
If they provide what you need and you have the means to get them working they are certainly a steal.

The other aspect here is that it seems hard to even find suitable products with Gen 4/5 PCIe switches, money no object.
I know that C-Payne is often referenced here, but I have a hard time finding additional products accessible to consumers.

I’m not sure why I used the word “entire”, I think I was trying to emphasize that it was a CPU aswell as PCIe connectivity; one would still need to buy the motherboards for them which adds cost (although PCIe switches will have the same platform costs).
But these are prices for current gen, high-ish PCIe lane count processors:

https://ark.intel.com/content/www/us/en/ark/products/236639/intel-xeon-bronze-3508u-processor-22-5m-cache-2-10-ghz.html

I was thinking about the HEDT and workstation markets as one thing. Intel exited HEDT because they weren't really competitive, but AMD is doing much better in the workstation segment. The non-Pro Threadripper 5000 was supposedly canceled in favor of the Pro one because it would make them more money.

I think that hobbyists are too small a group for them to care about. HEDT became the business workstation, and businesses have deep pockets.

This is a very interesting concept and I’d be stoked to see something like that working.
But bandwidth is achieved through frequency, so messing with the timing and signaling to downgrade a faster connection seems almost impossible to me.

I think switching would be just as effective when multiplying lanes for slower devices.

Thanks for the links. I had basically discounted the low-core-count CPUs at the time of release and had since forgotten about their existence.
Both AMD and Intel really made sure that nobody has an excuse not to buy their server products, with long price lists starting at the ones you linked and going all the way up to the astronomical. :slight_smile:

Yeah. The 7950X to 7960X cost riser is something like US$850, with another US$300-500 or so on the motherboard and US$2ish/GB going from (non-ECC) UDIMMs to RDIMMs. The high end of that is roughly the low end of the more capable switches I'm aware of and, while switches can be moved from build to build over time, it seems to me kind of a niche situation where the cost-performance-hassle tradeoffs work out favorably, even without external constraints.

At my work it's also pretty hard to keep the cost of a higher-end desktop with a capable switch under the capital asset threshold. I'm not sure I could convince accounting to let us part out an asset and, if it is possible, it might well be cheaper to just buy another entry-level Threadripper than to do the necessary paperwork.

PCIe negotiates up from slower to faster speeds at startup and dynamic downgrades are a thing, so this is mostly already supported. Link speed shifts probably aren't necessary, though. I'm unsure how deep buffers need to be for PCIe speed translation but, from what I'm aware of, 4 kB of send plus 4 kB of receive buffer per port should cover it, so it doesn't look like a big deal. I haven't really dug into it, but in data-rate-matched cases (like 3.0 x8 to 4.0 x4) the required buffer is only a few bits.
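For a rough sense of scale (my own sketch, not a spec reference): a store-and-forward port only needs to hold about one maximum-size TLP per direction, which is where a ~4 kB figure comes from.

```python
# Worst-case buffering for a store-and-forward, speed-translating switch port.
max_payload_bytes = 4096    # maximum PCIe TLP payload; most devices negotiate 256 or 512
tlp_overhead_bytes = 30     # header, sequence number, LCRC; roughly
per_direction = max_payload_bytes + tlp_overhead_bytes
print(f"~{per_direction} B per direction, i.e. ~4 kB send + ~4 kB receive per port")
# If both sides already run at the same data rate (e.g. Gen 3 x8 and Gen 4 x4),
# a cut-through design only has to absorb clock offset: a handful of symbols, not kilobytes.
```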

Figured as much, yep. I don't disagree about the hobbyist-enthusiast niche and business pricing, but I can also say large OEMs regularly upsell workstations. We pretty much always tell them we're not interested in paying +300% for maybe +30% performance, and usually they want more like +300% for -20%. At least some (and I suspect quite possibly most) of the businesses we work with have similar interactions.

What I’m unsure of is that all this negotiation should happen on a separate chip that should then negotiate the full bandwidth to a reduced but faster number of lanes on the CPU side. I’m not an expert when it comes to this sort of things so I might be wrong.
It’s the signal handling between two different negotiated speeds on a different number of lanes that makes me doubt about it.

It’s seldom of difficulty. PHYs have been clocking bits in and out of buffers this way for decades.

It’s a little more interesting over links like USB where the transmit clock has to be recovered from the data. But that’s also been well solved for a very long time; see phase lock loops. The same tech’s used to clean up clocks on redrives.
