Better 8-bit support on AMD devices!

Currently we need the bitsandbytes library for Python when loading 8-bit LLM models.
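
For context, a minimal sketch of how 8-bit loading usually looks with transformers + bitsandbytes (the model id is illustrative, not from this issue):

```python
# Sketch: load an LLM in 8-bit via bitsandbytes (assumes a CUDA/ROCm build
# of PyTorch and a transformers version that supports load_in_8bit).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # routes linear layers through bitsandbytes int8
    device_map="auto",   # let accelerate place the weights on the GPU
)
```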

arlo-phoenix has done a great job on a fork, but we want to take this to prime time with support in the main library.

If someone has stronger C or GPU programming skills, this would be a great project to review or give advice on!

See the PR here:

This helped to get started: given the file's ~8,000 lines, reading through the commit gave a good starting point for grokking it.

Looking into it, they have a few checks to stop older cards from using the wrong feature set.
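
As a rough illustration of that kind of gating (this is not the fork's actual check, just the idea): on ROCm builds of PyTorch the device properties expose a gcnArchName string, which one could match against an allowlist of architectures. The function name and arch prefixes below are assumptions for the sketch.

```python
# Hypothetical sketch of gating a feature on GPU architecture.
# gcnArchName (e.g. "gfx1030") is ROCm-specific; on CUDA builds the
# attribute is absent, so we fall back to an empty string.
import torch

def supports_int8_path(device: int = 0) -> bool:
    props = torch.cuda.get_device_properties(device)
    arch = getattr(props, "gcnArchName", "")  # ROCm-only attribute
    # Assumed allowlist: newer RDNA2+/CDNA parts; older cards fall back.
    return any(arch.startswith(p) for p in ("gfx1030", "gfx1100", "gfx90a"))

if not supports_int8_path():
    print("Older GPU detected; falling back to the fp16 path.")
```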

Might start dumping research here

Code for 4-bit w/ hipBLAS:

Hard mode - go the RDNA assembly route:

MLC can compile models for faster inference (built on Apache TVM).

Has pre-built wheels and models:
https://llm.mlc.ai/docs/prebuilt_models.html#llama-2
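
A sketch of using one of those prebuilt models via the Python package (package and class names per the MLC docs at the time; treat them as assumptions if versions have moved on):

```python
# Sketch: chat with a prebuilt MLC model (assumes the mlc_chat wheel and a
# prebuilt quantized Llama-2 model/lib are installed per the docs above).
from mlc_chat import ChatModule

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")  # prebuilt quantized model
print(cm.generate(prompt="What is 8-bit quantization?"))
```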

vLLM seems focused on throughput via batching requests:
https://docs.vllm.ai/en/latest/getting_started/amd-installation.html
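
That batching focus shows in the API: you hand vLLM a list of prompts and it schedules them together. A minimal sketch (the model id is illustrative):

```python
# Sketch: batched generation with vLLM; throughput comes from serving
# many prompts together in one batched pass.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # illustrative model id
params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = ["Explain int8 quantization.", "What is ROCm?"]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```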