Hi all, I’ve been going round in circles a bit and think I’m missing some options.
Basic requirements:
64 PCIe lanes (Gen 4), 3 slots at x16.
good memory bandwidth (8-channel DDR4 RDIMM, or 4-channel DDR5)
Considering:
5955WX or 5975WX on a WRX80E Sage II or Supermicro M12SWA-TF (good deals avail)
1P Epyc 7003 series, probably on a ROMED8-2T (good deals avail)
Xeon W-2400/W-3400 build ($$$ in the UK)
Scares:
seen multiple issues on forums w/ WRX80 builds
seen multiple scares around Epyc w/ Windows 11 not working
worried I’m missing something going for a server build and that I’ll run into unexpected issues
Case, PSU, cards, NVMe drives etc. all sorted
Question:
am I missing some boards here? Finding a workstation board w/ 64 PCIe lanes and 3 x16 slots seems really hard to do; 2 @ x8 won’t do, and 7 (6 in reality) @ x16 is overkill but seems like the only option.
Context / Usage:
I’m a developer using Windows 11 + WSL. Typically I’m just doing normal programming work; under WSL I’m working on largish databases, ~500 GB to 2 TB, MongoDB + some vector databases, and in another WSL instance I run different AI setups, usually just running data through a few different multimodal models, stock LLMs, Whisper, vision models, doing various tasks over data sets.
I currently have 2x 4080 Super + 1x 4070 Super; I load a smaller model onto each, then run data through different processes (rough sketch of the pattern below). I can’t see myself going beyond a 3-GPU setup (3090 Ti, 4090 at most); beyond that I’ll just use on-demand instances. Thus PCIe 5 is nice to have, but I only need Gen 4.
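For anyone curious what I mean by that, here’s a minimal sketch of the one-process-per-GPU pattern, assuming PyTorch; the stand-in model and random inputs are placeholders, not my actual pipeline:

```python
# Minimal sketch: pin one model per GPU, fan work out with one
# process per device. The Linear layer and random batches stand in
# for real models (Whisper, vision models, etc.) and real data.
import torch
import torch.multiprocessing as mp

def worker(device_id, jobs):
    device = torch.device(f"cuda:{device_id}")
    model = torch.nn.Linear(512, 512).to(device)  # placeholder model
    with torch.no_grad():
        for _ in jobs:
            x = torch.randn(8, 512, device=device)  # placeholder batch
            _ = model(x)

if __name__ == "__main__":
    mp.set_start_method("spawn")  # required for CUDA in subprocesses
    work = [range(100), range(100), range(100)]  # one job list per GPU
    procs = [mp.Process(target=worker, args=(i, jobs))
             for i, jobs in enumerate(work)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```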
In production the DBs run perfectly fine w/ high traffic balanced over a 5950X w/ 256 GB RAM. My local use case w/ GPU-bound tasks is where the issue arises, specifically PCIe lanes and cards being stuck in x4 mode.
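For reference, this is roughly how I confirm the negotiated link per card, assuming the NVML Python bindings (pip install nvidia-ml-py) are available. Worth running under load, since idle cards often train down to a lower PCIe generation to save power:

```python
# Rough check of negotiated PCIe generation/width per GPU via NVML.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(h)
        cur_gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)
        cur_width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
        max_gen = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(h)
        max_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(h)
        print(f"GPU {i} {name}: Gen{cur_gen} x{cur_width} "
              f"(max Gen{max_gen} x{max_width})")
finally:
    pynvml.nvmlShutdown()
```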
The first thing we need to understand is the general configuration you are using for development. In other words, are you using a client/server configuration, or are you doing all the work locally on one machine? If the latter is the case and LLMs are the workload, then I’d say you want a Threadripper system (and I’d opt for a 7000 series), but if the former is true, then an Epyc system is probably the better choice. With that being said, the issues you are describing are not a given, but they are possible with any new build. If it were me building this, I’d opt for a client/server configuration with a consumer-grade client machine, but this really has more to do with what type of development you are working on.
Now for my personal recommendation on the hardware. If you want a server (which it sounds like you do), then build a server. Thus, an Epyc (7004 series or newer) system would be my choice. Load that up with half a terabyte of RDIMMs, set up redundant storage pools of whatever size your workload requires, and load all those GPUs into said server. Then interact with it via a client machine.
Client/server currently, and most of the time. I’m fortunate to have multiple remote development and build servers, plus a couple for utilizing Vertex, OpenAI, etc. en masse for bigger job runs. I think I have about 30 bare-metal machines and multiple AWS instances at my disposal. 0 GPUs, but I can get them on demand.
However, I’d like the ability to also develop locally (for slower job runs with limited value, to learn, as a capable dev environment for personal projects with LLM usage if I ever get time, or more likely for getting into fine-tuning and training). It’s the GPU-requiring dev loads where I’m coming unstuck; memory bandwidth is secondary.
This is the approach I’ve always taken too. I’m currently on an i9 w/ a Z790 and never noticed anything lacking until I moved to a 2- and then 3-GPU setup.
Perhaps I’m throwing myself off here by wanting to build a machine that can do anything I throw at it, rather than isolated devices with set purposes. It’s probably just a want, but I know that if I have a single device with multiple capabilities, I’ll use them intermittently.
Also, something really bugs me about not being able to do things locally. I don’t want to have to rely on a major player or a remote instance, especially not for proofs of concept or hacking things together to give them a try.
I agree that local resources are preferable, which is why my own network includes my own servers. That was really my point: if you need a server, then build and run one; if you need a workstation, then do that; if you need a desktop, then build that. Which brings me back to my earlier comment. For development purposes such as yours, a client/server configuration with an on-premise server is ideal, as you can work through development issues on your own LAN without deviating from the production model such things will be deployed in.