Currently we need the bitsandbytes library for Python when loading 8-bit LLM models.
arlo-phoenix has done a great job on a fork, but we want to take this to prime time with support in the main library.
If someone has stronger C or GPU programming skills, this would be a great project to review or advise on!
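For anyone new to the topic, a minimal sketch of the absmax int8 quantization idea behind 8-bit weight loading (this is a conceptual illustration only, not bitsandbytes' actual kernel code):

```python
# Conceptual sketch of absmax int8 quantization, the core idea behind
# 8-bit weight loading. Illustration only; the real library does this
# in fused C/CUDA (or HIP) kernels with outlier handling on top.

def quantize_absmax(weights):
    """Map float weights onto int8 range [-127, 127] via the absolute maximum."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from int8 codes."""
    return [q * scale for q in quantized]

w = [0.5, -1.0, 0.25, 0.75]
q, scale = quantize_absmax(w)
w_restored = dequantize(q, scale)
```

The round trip loses at most about half a scale step per weight, which is why 8-bit loading roughly halves memory versus fp16 with little quality loss.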
See the PR here:
This helped to get started: given the file's 8000 lines, seeing the commit gave a good starting point to grok it.
Looking into it, they have a few checks to stop older cards from using the wrong feature set.
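A hypothetical sketch of what that kind of gating looks like. The gfx names are real AMD architecture identifiers (HIP's device properties report strings like "gfx1030:sramecc+:xnack-"), but the feature table here is illustrative, not the library's actual list:

```python
# Hypothetical arch-gating check, in the spirit of the checks mentioned
# above. The supported set below is illustrative only.

SUPPORTED_8BIT_ARCHES = {"gfx906", "gfx908", "gfx90a", "gfx1030", "gfx1100"}

def supports_8bit(gcn_arch_name: str) -> bool:
    """Return True if this GPU arch is on the allow-list for 8-bit kernels."""
    # The arch string may carry feature flags after a colon, e.g.
    # "gfx1030:sramecc+:xnack-", so strip them before comparing.
    base = gcn_arch_name.split(":")[0]
    return base in SUPPORTED_8BIT_ARCHES
```

Older cards (e.g. gfx803-class) would fall through and get a clear error instead of a crash in an unsupported kernel.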
Might start dumping research here
Code for 4-bit with hipBLAS:
Hard mode - go the RDNA assembly route:
MLC can compile models for faster inference (built on Apache TVM).
It has pre-built wheels and models:
https://llm.mlc.ai/docs/prebuilt_models.html#llama-2
vLLM seems focused on throughput, via batching requests:
https://docs.vllm.ai/en/latest/getting_started/amd-installation.html
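To make the batching point concrete, a toy model of why batching lifts throughput (illustrative arithmetic only, nothing to do with vLLM's actual scheduler): serving requests one at a time costs one model step per token per request, while a batched step generates a token for every active request at once.

```python
# Toy throughput model: batching shares each model step across all
# unfinished requests. Numbers are illustrative only.

def sequential_steps(token_counts):
    """Steps needed when requests are served one after another."""
    return sum(token_counts)

def batched_steps(token_counts):
    """Steps needed when all requests decode together in one batch:
    each step emits a token for every unfinished request, so the total
    equals the longest request."""
    return max(token_counts)

requests = [32, 64, 128, 128]  # tokens to generate per request
```

For this batch, sequential serving needs 352 steps versus 128 batched, a 2.75x throughput win before accounting for padding or memory limits.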