Building a poor man's P5800X

Something really weird/interesting that I've been looking for is an x8 PCIe 3.0 to four M.2 x2 card, or alternatively an x16 PCIe 3.0 to eight M.2 x2 card. Why something so strange and needlessly specific?

The older Optane M.2 32GB modules aren't completely insane pricing-wise these days, and I'd love to play with the idea of a RAID 10 of either 4 or 8 of those drives together on all CPU lanes as a sort of poor man's P5800X.

We'd use more PCIe (x16 3.0 vs x4 4.0, or roughly double), and lose some of the latency benefits because of software RAID overhead in certain scenarios, but capacity would be comparable and speeds should be in the same ballpark depending on workload (if we can make it work, of course).
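
Back-of-envelope on the link budgets, just illustrative arithmetic with the usual post-encoding per-lane figures, not measurements (and the modules themselves would bottleneck well before the slot does):

```python
# Rough lane-budget comparison: 8x Optane M.2 modules at PCIe 3.0 x2 each
# behind an x16 slot, vs a single P5800X on PCIe 4.0 x4. Per-lane figures
# are the usual ~1 GB/s (3.0) and ~2 GB/s (4.0) ballparks after encoding.
GEN3_LANE_GBS = 0.985
GEN4_LANE_GBS = 1.969

array_link = 8 * 2 * GEN3_LANE_GBS   # eight x2 modules, 16 lanes total
p5800x_link = 4 * GEN4_LANE_GBS      # one drive, 4 lanes

print(f"Optane M.2 array link budget: {array_link:.1f} GB/s over 16 lanes")
print(f"P5800X link budget:           {p5800x_link:.1f} GB/s over 4 lanes")
```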

Ideally I'd use the 58GB Optane 800P modules, but the form factor of those disks isn't ideal and compatibility would be rough.

Has anyone tried anything like this, and/or does anyone have advice on the best way to go about it?

I’ve come across this card from Aplicata https://www.aplicata.com/quattro-400/

Unfortunately it uses PLX switch chips instead of allowing direct lane access.

They have a 410 which is x16, but it also only supports 4 drives (thankfully it uses system bifurcation).

In an x16 form factor, the closest thing I've been able to find is this ASUS AI accelerator card for the Google Coral TPU M.2 modules (it also has fans and a 6-pin connector, but running those is optional in this case and they could be disabled to run it single-slot, full-height): ASUS IoT: Industrial Motherboard, Intelligent Edge Computer, Single Board Computer

Unfortunately it also uses a PCIe switch, but it does have the same amount of bandwidth in as bandwidth out, presumably for systems that don't support bifurcation.

Another option is this card from Amfeltec, which supports 6 devices on PCIe x16, single-slot and full-height, without any modifications: https://www.amfeltec.com/pci-express-gen-3-carrier-board-for-6-m2-or-ngsff-nf1-pcie-ssd-modules/

2 Likes

I have tried all this. It was disappointing all the way around.

It adds a lot of overhead, software-RAIDing things. You're better off getting a 900P or 905P, which has 90% of the latency advantage of the P5800X, just not as much throughput.

But For Normal Person Use Cases that’s totally fine.

A close 2nd is the H20, which has a lot of tuning over the H10 and on an Intel chipset can work really well as a cache. The latency here is kind of the exception vs the rule, though.

Honestly, you don't "feel" the P5800X until you're running up against memory limits or you have truly random I/O.

The poor man's P5800X is to add an obscene amount more RAM than you need and hope your operating system has a decent RAM-based block device cache.
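
If you go that route, a rough sanity check on whether the page cache is actually soaking up your working set is just /proc/meminfo (minimal Linux-only sketch):

```python
# Minimal sketch (Linux only): how much RAM the kernel is currently using as
# page/buffer cache vs. total memory, read straight from /proc/meminfo.
def meminfo_kb():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.strip().split()[0])  # values reported in kB
    return info

m = meminfo_kb()
cached_gib = (m["Cached"] + m.get("Buffers", 0)) / 1024**2
total_gib = m["MemTotal"] / 1024**2
print(f"cache: {cached_gib:.1f} GiB of {total_gib:.1f} GiB "
      f"({100 * cached_gib / total_gib:.0f}%)")
```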

5 Likes

Gotcha.

Use case is a personal CFD and HPC development workstation before pushing to different clusters.

Currently testing an Alder Lake 12700K with E-cores off and AVX-512 on, with different stacks and compilers. There's a little bit of mad science going on with trying to trick the compiler into doing weird things for me, including forcing ICC and ICX to target Sapphire Rapids while disabling the AMX tiles, so as to keep everything inside the Golden Cove cores.

I've been documenting it all on Twitter here:

(Thanks again for the help/tip with VTune on Twitter the other day!)

The idea with the above was using some of the CPU lanes to make a poor man's P5800X, which would in turn serve as the workstation's DCPMM stand-in.

But if I understand correctly, that’s no dice.

Thoughts on a 90xP as a massive cache/swap/working drive for this use case?

As for software RAID, I wonder if using the chipset lanes of the Z690 with VROC might be of use. That way we could take advantage of the VMD offload support provided to keep the load off the cores.

We'd take a latency hit for using the DMI, but IIRC the VMD on the chipset can use DMA without going through a core to transfer data, so we might get some of that back by not dealing with context switching and interrupts.

1 Like

Wouldn't it be nice to be good (more like a God among Men) with VHDL, so you could implement the RAID logic in an FPGA…
Then again, an FPGA that can handle >40 Gbit/s is expensive AF.

Ah, it's not throughput that's the killer. It's latency.

Is this even doable, back of the envelope, while adding less than 2 microseconds of overhead? We're on the order of about 10 microseconds for Optane and on the order of 100 for NAND. (As a practical matter, though, it isn't a 10x gain, more like 2x-3x.)
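
Rough arithmetic on that budget (the 10/100 microsecond figures are the same ballpark numbers as above, not measurements):

```python
# Back-of-envelope: how much of Optane's latency edge over NAND survives
# a fixed amount of added overhead (FPGA hop, software RAID, etc.).
# All numbers are order-of-magnitude ballparks.
OPTANE_US = 10.0   # ~10 us 4K random read, Optane-class
NAND_US = 100.0    # ~100 us, decent NAND

for overhead_us in (0.0, 2.0, 10.0, 25.0):
    effective = OPTANE_US + overhead_us
    print(f"+{overhead_us:4.1f} us overhead -> {effective:5.1f} us "
          f"({NAND_US / effective:.1f}x faster than NAND)")
```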

2 Likes

No idea. I'll ask someone with a better chance of knowing than me, rather than digging through hundreds of pages of specs myself.

The main reason I'm guessing it's possible is that high-end FPGA accelerator boards have dual 100G QSFP slots, so the capacity to go fast has to be there.

Edit: Lul Wut?

Any thoughts on whether VMD via the chipset might help offload some of the overhead/processing cost?

Finding VMD information specific to Alder Lake outside of the video you've posted is quite hard at the moment.

As far as I can tell from this paper from Lenovo, which discusses VMD/VROC: https://lenovopress.com/lp1466.pdf

and the Intel documentation/white-papers on the Z690 chipset (Intel Z690 Chipset Product Specifications), specifically https://cdrdv2.intel.com/v1/dl/getContent/655258?explicitVersion=true

The hardware blocks should be capable of dealing with most of the overhead, as long as I'm not trying to pass through to ZFS or some other higher-level filesystem / volume management / alien technology.

Platform is Debian Linux with a mainline kernel.

This will probably work best on Linux vs Windows, but the RST stuff in Windows is pretty good for what you're talking about.

It no longer supports the 900P/905P for acceleration as it once did, and otherwise mixed device types are a no-go.

I was reasonably satisfied with a RAID 0 of two 375GB P4800Xs, however.

Gotcha. So if I understand correctly, acceleration is no longer an option, but a pure Optane RAID still is? And pure RAID CAN be completely offloaded to the integrated VMD in this case.

So a 4-drive RAID 10 of 900Ps, all via the chipset for example, shouldn't add any processing overhead, but would act as a very high-speed, low-latency single volume. Latency would be higher than a single drive on CPU lanes, since we'd need to hop through the PCH and deal with the extra hardware block, but that should be manageable.

That would come out to 16 PCIe 3.0 lanes and 560GB usable (4x 280GB drives in RAID 10) before formatting. There's a slight concern about that taking up the same amount of bandwidth as is provided to the PCH (PCIe 4.0 x8 to the CPU), shared with all other downstream devices, but keeping other traffic to a minimum by disabling devices in the BIOS shouldn't be an issue.
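
Quick sanity check on the capacity and uplink math (same ballpark per-lane figures as before, just arithmetic):

```python
# Sanity-check the RAID 10 capacity and the PCH uplink bandwidth math.
DRIVES = 4
CAPACITY_GB = 280                      # per 900P
usable_gb = DRIVES * CAPACITY_GB / 2   # RAID 10: mirror pairs, then stripe
print(f"RAID 10 usable: {usable_gb:.0f} GB")   # -> 560 GB

GEN3_LANE_GBS = 0.985                  # ~GB/s per PCIe 3.0 lane
GEN4_LANE_GBS = 1.969                  # ~GB/s per PCIe 4.0 lane
drive_side = DRIVES * 4 * GEN3_LANE_GBS   # four x4 Gen3 drives behind the PCH
uplink = 8 * GEN4_LANE_GBS                # DMI 4.0 x8 back to the CPU
print(f"drive side: {drive_side:.1f} GB/s vs PCH uplink: {uplink:.1f} GB/s")
```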

The same could not be done with 800Ps, since I would need 2 VMD domains but Alder Lake desktop only supports the one (not to mention I'd be down to four x2 devices' worth of bandwidth, instead of either eight x2 or four x4 Optane devices).

extra stuff that's just context but not super important

Use case is a personal CFD and HPC development workstation before pushing to different clusters.

Currently testing an Alder Lake 12700K with E-cores off and AVX-512 on, with different stacks and compilers.

Off topic, but there's a little bit of mad science going on with trying to trick the compiler into doing weird things for me, including forcing ICC and ICX to target Sapphire Rapids while disabling the AMX tiles, so as to keep everything inside the Golden Cove cores (since this keeps the cost-model prioritization for AVX-512 vs standard AVX and AVX2, as opposed to the brute-force approach of adding -mavx512[specific instruction here] on top of -march=native).
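
For reference, the combination I'm playing with looks roughly like the sketch below. These are clang-style flag names (ICX is clang-based; classic ICC may want different spellings), and the compiler name and file names are just placeholders, not a verified recipe:

```python
# Hypothetical compile invocation: target Sapphire Rapids so the AVX-512 cost
# model kicks in, then strip the AMX feature bits so nothing is generated that
# the Golden Cove cores can't run. Flag names follow GCC/Clang conventions;
# "icx" and the file names are placeholders.
import shlex

flags = [
    "-O3",
    "-march=sapphirerapids",   # pretend the target is SPR
    "-mno-amx-tile",           # but drop the AMX extensions Alder Lake lacks
    "-mno-amx-int8",
    "-mno-amx-bf16",
]

cmd = ["icx", *flags, "-c", "solver_kernel.c", "-o", "solver_kernel.o"]
print(shlex.join(cmd))   # print rather than run, since this is only a sketch
```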

Did you find a PCIe x16 card with four NVMe slots? I've got a PCIe x8 card with two slots, but I have a few x16 slots on my server that I want to use.

If your board supports bifurcation and/or VMD, then this should be trivial using the inexpensive ASUS or ASRock accelerator cards.

My concern is what Wendell said above.

FYI: Don't update the BIOS anymore on Alder Lake systems if you want to keep AVX-512; Intel is apparently strong-arming vendors into disabling it.

3 Likes

Ahh, just like when they strong-armed vendors into disabling registered ECC from working on X299.

1 Like

For now, anyway, I've been using it daily as a workstation.

Disabling the E-cores also increases the ring frequency by 50% and increases the effective available L3 cache per core, so that's another bonus :wink:

On the BIOS front, I haven't been able to test myself. MSI hasn't released a BIOS revision for my board since the version that enables AVX-512, so I'll have to wait and see.