Currently we need the bitsandbytes library for Python when loading 8-bit LLM models.
arlo-phoenix has done a great job on a fork, but we want to take this to prime time with support in the main library.
If someone has stronger C or GPU programming skills, this would be a great project to review or advise on!
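For anyone new to the topic, a minimal sketch of the absmax int8 quantization idea behind 8-bit weight loading (this is a conceptual illustration only, not bitsandbytes' actual kernel code):

```python
# Conceptual sketch of absmax int8 quantization, the core idea behind
# 8-bit weight loading. Illustration only; the real library does this
# in fused C/CUDA (or HIP) kernels with outlier handling on top.

def quantize_absmax(weights):
    """Map float weights onto int8 range [-127, 127] via the absolute maximum."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from int8 codes."""
    return [q * scale for q in quantized]

w = [0.5, -1.0, 0.25, 0.75]
q, scale = quantize_absmax(w)
w_restored = dequantize(q, scale)
```

The round trip loses at most about half a scale step per weight, which is why 8-bit loading roughly halves memory versus fp16 with little quality loss.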
See the PR here:
This helped to get started: given the file's 8000 lines, seeing the commit gave a good starting point to grok it.
Looking into it, they have a few checks to stop older cards from using the wrong feature set.
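A hypothetical sketch of what that kind of gating looks like. The gfx names are real AMD architecture identifiers (HIP's device properties report strings like "gfx1030:sramecc+:xnack-"), but the feature table here is illustrative, not the library's actual list:

```python
# Hypothetical arch-gating check, in the spirit of the checks mentioned
# above. The supported set below is illustrative only.

SUPPORTED_8BIT_ARCHES = {"gfx906", "gfx908", "gfx90a", "gfx1030", "gfx1100"}

def supports_8bit(gcn_arch_name: str) -> bool:
    """Return True if this GPU arch is on the allow-list for 8-bit kernels."""
    # The arch string may carry feature flags after a colon, e.g.
    # "gfx1030:sramecc+:xnack-", so strip them before comparing.
    base = gcn_arch_name.split(":")[0]
    return base in SUPPORTED_8BIT_ARCHES
```

Older cards (e.g. gfx803-class) would fall through and get a clear error instead of a crash in an unsupported kernel.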
Might start dumping research here
Code for 4-bit with hipBLAS:
Hard mode - go the RDNA assembly route:
MLC can compile models for faster inference (built on Apache TVM).
It has pre-built wheels and models:
https://llm.mlc.ai/docs/prebuilt_models.html#llama-2
vLLM seems focused on throughput, via batching requests:
https://docs.vllm.ai/en/latest/getting_started/amd-installation.html
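To make the batching point concrete, a toy model of why batching lifts throughput (illustrative arithmetic only, nothing to do with vLLM's actual scheduler): serving requests one at a time costs one model step per token per request, while a batched step generates a token for every active request at once.

```python
# Toy throughput model: batching shares each model step across all
# unfinished requests. Numbers are illustrative only.

def sequential_steps(token_counts):
    """Steps needed when requests are served one after another."""
    return sum(token_counts)

def batched_steps(token_counts):
    """Steps needed when all requests decode together in one batch:
    each step emits a token for every unfinished request, so the total
    equals the longest request."""
    return max(token_counts)

requests = [32, 64, 128, 128]  # tokens to generate per request
```

For this batch, sequential serving needs 352 steps versus 128 batched, a 2.75x throughput win before accounting for padding or memory limits.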