I am in the process of planning a private AI small business solution based on Strix Halo. The goal is a private cloud running different kinds of language/vision/speech/generative models. I am not aiming for the biggest models ever, but for models large enough to be useful. In my experience, anything below 20B-30B in the language model space is really limited in its abilities.
Here is what I have planned so far:
4x Minisforum MS-S1 AI Max + 4x Radeon AI Pro R9700
Each MS-S1 node would be connected to one graphics card using a Razer Core X V2 over USB4 at 80Gbps. The cluster would be connected via the 2x 10GbE interfaces of the nodes, using bonding for a combined 20GbE link.
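Once the bond is up I plan to sanity-check it with something like the sketch below (assuming Linux in-kernel bonding and the device named bond0). One caveat I'm aware of: with 802.3ad-style bonding a single TCP flow still tops out at 10Gbps, so the combined 20GbE only shows up across multiple flows.

```python
# Sanity check for the 2x10GbE bond, assuming in-kernel bonding with
# the device named bond0 (adjust the name to your setup).
from pathlib import Path

def bond_status(name: str = "bond0") -> None:
    proc = Path(f"/proc/net/bonding/{name}")
    if not proc.exists():
        raise SystemExit(f"{name} not found; is the bonding driver loaded?")
    for line in proc.read_text().splitlines():
        # Mode, member NICs, and link state are the lines that matter.
        if line.startswith(("Bonding Mode:", "Slave Interface:", "MII Status:")):
            print(line)

if __name__ == "__main__":
    bond_status()
```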
In total this would add up to 4x 128GB unified memory + 4x 32GB of dedicated GPU VRAM. At a price point of around 20k that is kind of amazing. There will be latency/bandwidth limits for sure, but I don't think it will be all that bad: it does not need to serve many users, and as long as performance is consistent and the output useful, I don't think it will matter all that much.
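The back-of-envelope, for anyone checking my numbers (the 20k figure is my rough budget, not a quote):

```python
# Pooled memory and rough cost per GB across the 4 nodes.
nodes, unified_gb, vram_gb = 4, 128, 32
total_gb = nodes * (unified_gb + vram_gb)
print(f"pooled memory: {total_gb} GB")             # 640 GB
print(f"rough cost/GB: {20_000 / total_gb:.0f}")   # ~31 per GB at 20k
```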
I have quite some experience doing Linux plumbing and don't mind fiddling with drivers to get a bleeding-edge stack up and running.
For software I am thinking about running Proxmox, mostly because I have quite some experience managing a cluster of Proxmox instances and am very comfortable with the interface and setup. The idea is to have different virtual machines running llama.cpp or vLLM. I have already tested both on my current desktop PC (AMD CPU + GPU), but my hardware resources there are too limited to run anything sensible.
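For what it's worth, both llama-server and vLLM expose an OpenAI-compatible endpoint, so the client side stays identical regardless of which backend a VM runs. This is roughly the smoke test I use today (host, port, and model name are placeholders):

```python
# Minimal smoke test against either backend; llama-server and vLLM both
# speak the OpenAI-compatible API. Host, port, and model name below are
# placeholders for whatever the VM actually serves.
from openai import OpenAI

client = OpenAI(base_url="http://10.0.0.11:8000/v1", api_key="none")

resp = client.chat.completions.create(
    model="qwen2.5-32b-instruct",  # placeholder: use the loaded model's name
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```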
So what do you guys think about this setup? Are you interested in seeing how this plays out? I still need to make sure I get the funding for the project locked in, but if all goes as planned I will for sure post updates to this thread.
At first glance, I'm really not sure the AI Max is that good a value when building a cluster.
You can get more RAM and faster bandwidth by going Epyc.
10k used, or maybe 15k new, should be fine for a 12-channel system.
Inference on the AI Max is kinda slow once you hit RAM with 120B models, too slow for real use.
For 30B, it's fine though.
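For comparison, the theoretical memory bandwidth peaks (datasheet math, assuming DDR5-4800 across all 12 channels; sustained numbers are lower):

```python
# Theoretical peak bandwidth: channels * MT/s * 8 bytes per transfer.
def ddr_bw_gb_s(channels: int, mt_s: int) -> float:
    return channels * mt_s * 8 / 1e3  # 64-bit channel = 8 bytes/transfer

# Strix Halo's 256-bit LPDDR5X bus modeled as 4x 64-bit channels.
print(f"Epyc, 12ch DDR5-4800:     {ddr_bw_gb_s(12, 4800):.0f} GB/s")  # ~461
print(f"Strix Halo, LPDDR5X-8000: {ddr_bw_gb_s(4, 8000):.0f} GB/s")   # ~256
```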
If you're adding GPUs into the balance anyway, then just build a computer (or two) with the same number of GPUs.
You could probably fit the four GPUs in a Threadripper build.
I get what you are saying, but a single-server setup has drawbacks as well. As far as I am aware, Epyc chips do not have a built-in GPU or NPU. So on that front the MS-S1 does have advantages, if the workload fits on a single node.
In a single-server system I would be limited to GPU compute, which would then equal 128GB of VRAM, so it might be faster for a single huge model. And going with bigger GPUs quickly inflates the budget. What use would more RAM in a single server really be? Pinned memory for async GPU usage is quite limited in usability as far as model compatibility is concerned, and it also hurts performance.
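For reference, the partial offload I mean is llama.cpp's layer split between VRAM and system RAM. A minimal sketch with the llama-cpp-python bindings; model path and layer count are placeholders you'd tune to the card:

```python
# Partial offload sketch: keep as many transformer layers in VRAM as
# fit, run the remainder from system RAM on the CPU. Model path and
# n_gpu_layers are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/model-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=48,  # layers resident in VRAM; the rest stay on CPU
    n_ctx=8192,       # context window
)

out = llm("Explain KV cache in one sentence.", max_tokens=96)
print(out["choices"][0]["text"])
```

Every layer that stays on the CPU drags the whole token loop down, which is the performance hit I meant.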
I have tried to find more information about a comparable build, but information on this kind of setup is thin. I'm glad for any input if somebody has attempted something similar.
Forget about NPUs. They may be somewhat interesting for mobile devices, since they allow you to do some specific processing more power-efficiently. iGPUs are a lot more powerful, even on laptops; they just use more power.
Strix Halo, Nvidia Spark, and similar stuff like Apple Silicon with unified memory is somewhat interesting sometimes, but IMO not at your budget. They have much less compute/memory bandwidth compared to a ‘proper’ GPU (M3 Ultra perhaps excluded). So even if you can fit pretty large models, you might not like the performance. And keep in mind that clustering or multi-GPU lets you fit larger models, but performance generally does not improve, since the layers are distributed yet still have to be worked through sequentially.
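To put rough numbers on the bandwidth point: single-stream decode on a dense model is memory-bound, so tokens/s is capped at roughly memory bandwidth divided by the bytes of weights read per token. A quick sketch, using ballpark datasheet bandwidths that you should treat as assumptions:

```python
# Naive decode ceiling for a dense model: every generated token streams
# all weights once, so tokens/s <= memory_bandwidth / model_size.
# Bandwidth figures are ballpark datasheet peaks, not measurements.
def max_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

model_gb = 40  # e.g. a ~70B dense model quantized to ~4-bit
for name, bw in [
    ("Strix Halo (LPDDR5X-8000)", 256),
    ("Radeon AI Pro R9700 (GDDR6)", 640),
    ("RTX 6000 Ada (GDDR6)", 960),
]:
    print(f"{name:28s} ~{max_tps(bw, model_gb):4.1f} tok/s ceiling")
```

Real-world decode lands well below those ceilings once compute, KV cache, and framework overhead enter, but the relative ranking is the point.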
So my advice is to think about which models you want to run and what acceptable performance would be (keeping in mind that ‘thinking’ models sometimes chew through a lot of tokens before you get an answer), and to compare benchmarks for Strix Halo, the Radeon Pro, and probably also 5090s/RTX 6000s. For your budget you can consider a single system with 1-2 RTX 6000s, or 1 RTX 6000 and a couple of smaller cards IMO. It will be easier to set up and more reliable than clustering with eGPUs. Screwing around with eGPUs when you have 20k to spend seems … crazy when you could get a single machine with professional-grade features like ECC, IPMI, etc.
By the way, on e.g. an RTX 6000 you can also load multiple models at the same time, if they fit in VRAM…
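One way to do that is simply two server processes, each capped to a slice of VRAM. A sketch using vLLM's --gpu-memory-utilization flag; the model names, ports, and the 0.45/0.45 split are placeholders you'd size to what actually fits:

```python
# Two vLLM servers sharing one GPU, each capped to a fraction of VRAM.
# Model names, ports, and the 0.45/0.45 split are placeholders.
import subprocess

servers = [
    ("your-org/model-a", 8000, 0.45),
    ("your-org/model-b", 8001, 0.45),
]
procs = [
    subprocess.Popen([
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(frac),
    ])
    for model, port, frac in servers
]
for p in procs:
    p.wait()  # keep both servers in the foreground
```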
A single machine seems like a recipe for downtime, but unless these nodes are going through different ISP connections (and are frankly geographically dispersed) there is a single point of failure anyway.
But I think taking out a loan or soliciting investments without developing a minimum viable product to showcase, and without validating your chosen (bleeding-edge, as you say) hardware, is not a great idea. Hopefully you are planning for that.
You should also ensure that your business plan complies with any applicable software licenses.
Relying on alphabet-soup brand PCs with basically no support + external GPUs + a mix of iGPU/eGPU + slow interconnect for business purposes sounds… well, risky.
I would suggest going back to the drawing board on the hardware. And even before that, do more research on the models you’ll need to run to get a better idea of your hardware requirements.
I hear your concerns, and I am indeed still at the drawing-board stage of this project. I am also very much aware that there may be better options out there.
I have started looking more into NVIDIA's offerings and will mock up a comparison machine at roughly the same price point.
In the end, every project is about trade-offs. And in general, spending less money is an option, especially in times when money is tight in a lot of places!
The goal here is not to build the next world-facing product based on generative technologies. In fact, I am kind of annoyed by these chat interfaces that I constantly have to steer toward the output I want.
This is more about having an internal platform to implement AI-based workflows and automations while remaining flexible enough to experiment.
So I am absolutely aware that I am just starting out. I have not completely written off the idea of a cluster-based solution, but maxing out at a total of 4 nodes plus GPUs in external enclosures is probably pushing the envelope way too far. If I reduce the cluster to 2 nodes and use older InfiniBand hardware as the interconnect, it might be a lot cheaper and still perform reasonably enough for my purposes.
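To convince myself a 2-node layer split can live with a modest interconnect, here is a rough per-token estimate. The hidden size, dtype, and link latencies are my assumptions for a 70B-class dense model split once across two nodes:

```python
# Rough per-token interconnect cost for layer-split (pipeline) decode:
# each node boundary ships one hidden-state vector per generated token.
# Hidden size, dtype, and latency figures below are assumptions.
hidden, dtype_bytes, boundaries = 8192, 2, 1   # fp16, single 2-node split
payload_bits = hidden * dtype_bytes * boundaries * 8   # ~131 kbit/token

for name, gbit_s, rtt_us in [
    ("bonded 10GbE (single flow)", 10, 50.0),
    ("used 100Gb InfiniBand", 100, 2.0),
]:
    wire_us = payload_bits / (gbit_s * 1e3)    # Gbit/s -> bits per microsecond
    print(f"{name:27s} ~{wire_us + rtt_us:5.1f} us per token")
```

At, say, 20 tokens/s the hop adds well under 1% even over 10GbE, so for pipeline-style splits the interconnect should mostly matter for model loading and latency jitter, not steady-state decode. Tensor parallelism is a different story: that hammers the link on every layer, and there the InfiniBand option would earn its keep.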