Serving LLMs on an MI300A node

Hello everyone, I'm Carlos.

Basically, in our research lab we have a 4x MI300A compute node (4x 128 GB HBM3) that sits idle most of the time.

I was thinking of running some LLMs on it to see whether it could be useful for working with sensitive data locally, and more generally to explore what it can do out of curiosity.

So far, I was able to run Qwen2.5-72B on AMD's rocm/vllm Docker image (rocm6.4.1_vllm_0.10.0_20250812), and it seemed to run at >100 tok/s with tensor parallel 4, though I really have no idea how good or bad that is.
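For reference, a serving setup like the one described might look roughly like this. This is only a sketch: the image tag matches the one in the post, but the model identifier, port, and device flags are assumptions, not a verified config.

```shell
# Pull the ROCm vLLM image mentioned above (tag from the post)
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812

# Launch vLLM with tensor parallelism across the 4 MI300A devices.
# /dev/kfd and /dev/dri expose the AMD GPUs to the container.
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host \
  -p 8000:8000 \
  rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812 \
  vllm serve Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4
```

Once up, the server exposes an OpenAI-compatible API on port 8000, which makes it easy to point existing tooling at it.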

Then I tried to run a larger model, Qwen3-235B-A22B (FP8 quantization), but with no luck; I keep hitting errors.

There isn't much info out there about the MI300A, so I was wondering if anyone with experience on AMD cards has some advice?

More generally, I'm looking for recommendations on serving LLMs, and on what this node could be useful for outside of our main research.

Docker would go a long way, and AMD has some good blog posts about setting up workloads.

You could also offer Jupyter hosting to your internal team, and that's really nice too.