Good to know. Did CUDA not work in WSL2, or was CUDA 12.8 not compatible on the Windows side? (just curious)
BTW, if you guys are going to do simple testing, a regular llama.cpp CUDA build plus a llama or llama2-7b q4_0 quant run through llama-bench would be a good pipecleaner (it's basically a "hello world"/benchy that's fast to run and lets you compare against all the Macs and other devices pretty easily).
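As a rough sketch, the whole pipecleaner is basically one llama-bench call (the default run does pp512 prompt processing + tg128 token generation); the build path and GGUF filename here are just placeholders for wherever your build and model live, wrapped in Python so it can be scripted with the rest of your testing:

```python
# Rough sketch: run llama-bench from a CUDA build of llama.cpp on a q4_0 quant.
# Paths/model filename are assumptions - point these at your build and GGUF.
import subprocess

LLAMA_BENCH = "./llama.cpp/build/bin/llama-bench"   # assumed build location
MODEL = "./models/llama-2-7b.Q4_0.gguf"             # assumed quant filename

# Default llama-bench run: pp512 (prompt processing) + tg128 (token generation)
result = subprocess.run(
    [LLAMA_BENCH, "-m", MODEL],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```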
llama.cpp is a good engine to test with since it's probably the most used local inference engine (Ollama, LM Studio, Jan, Mozilla, and others all tend to use it). If you're looking for something to stretch its legs, or for more rigorous testing, here's an example of me using vLLM's benchmark_serving to test Qwen2.5-Coder-32B's speculative decoding perf between a W7900 and a 3090.
If you are going to do more involved testing on the LLM side, I think vLLM and sglang are both good targets, with vLLM's benchmark_serving.py or sglang.bench_serving. Here's some recent testing I was doing on throughput vs TTFT on a concurrency sweep, with mean, p50 (median), and p99:
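If you want to roll a quick-and-dirty version of that kind of sweep yourself, here's a minimal sketch against an OpenAI-compatible endpoint (both vLLM and sglang serve one). The URL, model name, and prompt are placeholders, and benchmark_serving.py / sglang.bench_serving do this far more rigorously (request-rate distributions, real datasets, proper token accounting); this just shows the shape of the measurement (TTFT + throughput):

```python
# Sketch of a concurrency sweep against an OpenAI-compatible server.
# URL/model/prompt are assumptions - swap in whatever you're actually serving.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"   # assumed vLLM/sglang server
MODEL = "Qwen/Qwen2.5-Coder-32B-Instruct"      # assumed model name

def one_request(prompt: str, max_tokens: int = 128) -> dict:
    """Stream one completion, return time-to-first-token and total time."""
    start = time.perf_counter()
    ttft = None
    with requests.post(
        URL,
        json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens, "stream": True},
        stream=True,
        timeout=300,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line and ttft is None:
                ttft = time.perf_counter() - start   # first streamed chunk
    return {"ttft": ttft, "total": time.perf_counter() - start}

def percentile(values, p):
    return statistics.quantiles(values, n=100)[p - 1]

for concurrency in (1, 2, 4, 8, 16, 32, 64):
    prompts = ["Write a short poem about GPUs."] * (concurrency * 4)
    sweep_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, prompts))
    elapsed = time.perf_counter() - sweep_start
    ttfts = [r["ttft"] for r in results]
    print(
        f"c={concurrency:3d}  req/s={len(results) / elapsed:6.2f}  "
        f"TTFT mean={statistics.mean(ttfts):.3f}s  "
        f"p50={percentile(ttfts, 50):.3f}s  p99={percentile(ttfts, 99):.3f}s"
    )
```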
Note, for this testing it's probably best to check everything into a repo for reproducibility; it's very easy to misconfigure/screw something up. You'll also want to dump system info and all library versions, as perf is still changing dramatically across versions. llama.cpp numbers aren't very useful without a build #, and the same goes for full version numbers for your PyTorch and vLLM/sglang (also Python, CUDA, drivers, etc.).
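Something like this committed next to your results goes a long way; the package list here is just an example, add whatever you're actually benchmarking with:

```python
# Sketch of an environment dump worth committing alongside benchmark results.
# Only reports what's installed; package names below are examples.
import importlib.metadata
import json
import platform
import subprocess

info = {
    "python": platform.python_version(),
    "platform": platform.platform(),
}
for pkg in ("torch", "vllm", "sglang", "transformers", "flash-attn"):
    try:
        info[pkg] = importlib.metadata.version(pkg)
    except importlib.metadata.PackageNotFoundError:
        info[pkg] = None

try:
    import torch
    info["torch_cuda"] = torch.version.cuda
    info["gpu"] = torch.cuda.get_device_name(0) if torch.cuda.is_available() else None
except ImportError:
    pass

# NVIDIA driver version straight from the driver
try:
    info["driver"] = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
except (FileNotFoundError, subprocess.CalledProcessError):
    info["driver"] = None

print(json.dumps(info, indent=2))
```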
Also, I highly recommend mamba for managing your environment; it can manage not just Python but also CUDA and other system-level libs. Alternatively, you could target a specific Docker container and maybe save yourself some headache.
For training, I think something simple like a basic pytorch-lightning, trl, or torchtune script would be easy.
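For example, a pytorch-lightning "hello world" run might look something like this (a toy model on random data with purely illustrative sizes/hyperparameters, just enough to exercise the GPU end-to-end; older versions import `pytorch_lightning as pl` instead of `lightning`):

```python
# Toy sketch of the pytorch-lightning route: tiny model, random data.
# Everything here is made up for illustration - swap in a real dataset/model.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning as L

class TinyRegressor(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-3)

# Random data just to have something to fit
x = torch.randn(4096, 128)
y = torch.randn(4096, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=256, shuffle=True)

trainer = L.Trainer(max_epochs=3, accelerator="auto", devices=1)
trainer.fit(TinyRegressor(), loader)
```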
BTW, if you want to test efficiency, you can set the power limit (PL) and do a simple inferencing sweep pretty quickly, e.g. testing a 3090:
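A sketch of what that sweep might look like (the wattage steps are a 3090-ish guess, the paths are placeholders, and changing the power limit needs root/admin):

```python
# Rough sketch of a power-limit (PL) sweep: step nvidia-smi -pl through a
# range of wattages and re-run llama-bench at each setting.
import subprocess

LLAMA_BENCH = "./llama.cpp/build/bin/llama-bench"   # assumed build location
MODEL = "./models/llama-2-7b.Q4_0.gguf"             # assumed quant
GPU = "0"

for watts in (150, 200, 250, 300, 350):             # 3090 stock PL is ~350 W
    # Set the GPU power limit (requires elevated privileges)
    subprocess.run(["sudo", "nvidia-smi", "-i", GPU, "-pl", str(watts)], check=True)
    result = subprocess.run(
        [LLAMA_BENCH, "-m", MODEL],
        capture_output=True, text=True, check=True,
    )
    print(f"--- {watts} W ---")
    print(result.stdout)
```

From there you can divide tokens/s by the power limit to get a rough efficiency curve.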
Here’s all my code for the sweep and charts (as well as some percent and percent delta charts).