Need help with an AI setup

Something you might consider is a "hero card" setup: you buy one beast of a card that does the bulk of the work, and the secondary cards hold the rest of the model in memory and do a light amount of the processing.

So, you could go 1 5090 and 1 3090 for 56GB of memory, or 1 5090 and 2 3090s for 80GB of memory, and either way come in cheaper than an RTX Pro 6000.
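If you go that route, one way to bias the heavy lifting toward the hero card is Hugging Face's device_map with per-GPU memory caps. This is just a minimal sketch, assuming GPU 0 is the 5090 and GPU 1 is the 3090; the model name and the memory caps are placeholders, not a tested config:

```python
# Sketch of a hero-card split with Hugging Face transformers/accelerate.
# Assumes GPU 0 = 5090 (32GB) and GPU 1 = 3090 (24GB); the caps below
# leave headroom on each card for activations and the KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-70b-model"  # placeholder, use whatever you actually run

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                     # let accelerate place layers across GPUs
    max_memory={0: "28GiB", 1: "20GiB"},   # push most of the model onto the hero card
)

# device_map="auto" normally puts the embeddings on GPU 0, so inputs go there
inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```

If you're running GGUF quants instead, llama.cpp's --tensor-split flag does roughly the same thing, splitting the layers across cards in whatever ratio you give it.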

For speed, it matters more that the whole model and the context fit in VRAM than how fast your PCIe lanes are, so long as the secondary GPUs each have at least Gen 3 x4. At least, that's what the r/LocalLLaMA posts suggest from folks running two-machine setups with 8 3090s each; even 10Gb/s networking is enough for that workload, and they're accepting multiple requests at once.
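To sanity-check whether a given model and context length will actually fit, a rough back-of-envelope estimate is usually enough. Here's a minimal sketch; the architecture numbers are example values for a 70B-class model with grouped-query attention, so pull the real layer/head counts from your model's config:

```python
# Rough back-of-envelope check that the weights plus the KV cache fit in VRAM.
# Example numbers only; check your model's config.json for the real values.

GIB = 1024**3

def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / GIB

model = weights_gib(70, bits_per_weight=4.5)   # ~4-bit quant with some overhead
cache = kv_cache_gib(layers=80, kv_heads=8, head_dim=128, context_tokens=32768)
total = model + cache
print(f"weights ~{model:.1f} GiB, KV cache ~{cache:.1f} GiB, total ~{total:.1f} GiB")

for name, vram in [("5090 + 3090", 56), ("5090 + 2x3090", 80)]:
    fits = total < vram * 0.9   # keep ~10% headroom for activations
    print(f"{name} ({vram} GB): {'fits' if fits else 'does not fit'}")
```

With those example numbers you land around 47 GiB total, which is why a 70B-class model at a 4-bit quant with decent context is comfortable on the 56GB pairing.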

Now, you could also go with one Pro 6000 (the Max-Q is almost as fast at half the power, and around $1K cheaper) and then add a smaller card for the extra VRAM if that still isn't enough.

Be aware that many motherboards have problems going above 8 GPUs when every card is asking the BIOS for 32GB of address space.

You can also go really cheap on the secondary cards, though I would try to stick with one vendor. I have no idea how mixing Nvidia and AMD works out for LLMs, other than that the context would probably be fine to split between them.