Hi I’m progressEdd, I build AI and AI Accessories.
I wanted to formally start my own corner for exploring open source models and anything AI related. Stick around if you are interested.
With the release of Llama 3.1, I decided to revisit this code. I learned about the availability of new models through the Ollama and LM Studio Discord servers.
Initially I thought the issue was with LM Studio's system prompt, but when I tried Ollama with the same model, the outputs between the two were the same.
I was going to post this as a help/support thread in the AutoGen Discord, but it wouldn't be accessible to the public, and it's hard to find stuff on Discord servers.
I am a computer janitor getting into coding, and lordy this stuff has captured my interest. I have a locally hosted Llama3.1:8b running in a box, and I am learning how to configure a RAG. I will be around to see how you are doing here! Nice to meet you!
LLMs and Image Diffusion models are definitely fun to play with locally to push the limits of your build!
I’m running on Linux as there are a variety of inference engines available. Typically I use good old llama.cpp by git pulling it and doing a build. (Building llama.cpp with make -j$(nproc) was a reliable way to crash my old degraded i9-14900k btw hah…)
The biggest pro of llama.cpp is support for partial offload of GGUF models. With 96GB RAM and 24GB VRAM I can run Llama-3.1-70B IQ3_XXS at just over 7 tok/sec (maybe 8k kv cache fp16). I tried litellm but wanted llama.cpp-specific support, so I made my own Python async streaming library, llama-cpp-api-client. Not perfect, but I was able to build a simple Discord bot with it and kick the tires on some models.
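If anyone is curious what the streaming side of that looks like, here is a minimal sketch (not the actual llama-cpp-api-client code) that pulls tokens from a local llama.cpp server over its SSE /completion endpoint; the localhost:8080 URL and the content/stop fields are assumptions based on the server's documented API:

```python
import asyncio
import json

import aiohttp  # assumed dependency; any async HTTP client works


async def stream_completion(prompt: str, url: str = "http://localhost:8080/completion"):
    """Stream generated tokens from a local llama.cpp server via server-sent events."""
    payload = {"prompt": prompt, "n_predict": 256, "stream": True}
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as resp:
            async for raw_line in resp.content:
                line = raw_line.decode("utf-8").strip()
                if not line.startswith("data: "):
                    continue  # skip keep-alives / blank SSE lines
                chunk = json.loads(line[len("data: "):])
                yield chunk.get("content", "")
                if chunk.get("stop"):
                    break


async def main():
    async for token in stream_completion("Tell me a joke about GPUs."):
        print(token, end="", flush=True)


asyncio.run(main())
```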
I may try Mistral-Large or Deepseek-v2 / v2.5 in the biggest quant that will memory map into 96GB soon. It is unfortunate that I’m limited to 2x DIMMs to get full memory bandwidth, as these big LLMs gobble up RAM fast when inferencing on CPU but are so bottle-necked by bandwidth. Will be lucky to get 2 tok/sec with the big models probably hah…
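A rough way to sanity check that guess: for bandwidth-bound CPU inference, every generated token has to stream roughly the whole quantized model through memory once, so tokens/sec is about memory bandwidth divided by model size. The numbers below are assumptions for illustration, not measurements:

```python
# Back-of-the-envelope for bandwidth-bound CPU inference:
# tok/s ≈ memory bandwidth / bytes read per token (≈ quantized model size on disk).
def est_tok_per_sec(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    return bandwidth_gb_per_s / model_size_gb

# Assumed: 2x DDR5 DIMMs ≈ ~85 GB/s, a big quant that memory-maps into 96GB ≈ 45 GB.
print(f"{est_tok_per_sec(45, 85):.1f} tok/s")  # ≈ 1.9 tok/s, i.e. "lucky to get 2"
```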
If you have enough VRAM to avoid offload, (or use 8B models), you can get faster inference with other engines. Pretty good recent benchmark on r/localllama the other day.
For Image Diffusion I’m using ComfyUI and mostly FLUX.1-dev with random LoRA’s from civit haha… It is kind of a wild space though, so be prepared to sort through a lot of “interesting” models lol…
Generally I run my 3090TI FE cap’d at 300 watts and have not yet added a second GPU to the mix as tbh the quality isn’t there yet imho…
Cheers, curious to hear how your adventures go!
got the suggestion to use litellm as a model endpoint working. TIL litellm is a simple open source model endpoint. The performance isn’t as good, but I suspect the ollama models don’t have the same quantization as LM Studio. LM Studio offers Q3-Q8 levels of quantization; the lower a model is quantized, the more constrained it is.
For those who want to read more, see this article
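For reference, here is roughly what that litellm setup looks like, a minimal sketch using litellm's OpenAI-style completion() call routed to a local Ollama server; the model name and api_base are assumptions, swap in whatever you have pulled:

```python
from litellm import completion  # pip install litellm

# Route an OpenAI-style chat call to a locally running Ollama server.
# "ollama/llama3.1" and the api_base are assumptions; adjust to your setup.
response = completion(
    model="ollama/llama3.1",
    messages=[{"role": "user", "content": "Summarize what quantization does to an LLM."}],
    api_base="http://localhost:11434",
)

# litellm mirrors the OpenAI response shape.
print(response.choices[0].message.content)
```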
There have been a lot of headlines discussing the diminishing returns of throwing more data, model parameters, and compute at improving LLM performance. This is a great math/technical explanation of why performance gains are slowing
made this meme of Dario Amodei and Yann LeCun
ChatGPT helped with the phrasing and making Dario transparent
The edit spawned during some messages with some of my AI friends going over the following article about LLMs for chip design. I noticed the following phrase
my friend responded with this quote
after realizing the duality of man, I knew I had to create the “I have 2 sides” meme, but with the AI leaders Dario Amodei and Yann LeCun, who have semi-opposing takes on AI and AGI
ollama + Mistral or Mixtral
Maybe for brainstorming phrasing. Have you found any good front ends for image instructions? LM Studio is definitely up there, but not as convenient as ChatGPT.
I’ll write a post for the making of this meme
Alpaca or OpenWebUI
will have to look into it
While replying to this post, I remembered my conversation with an Intel rep at Microsoft Ignite about model quantization and how it scales to hardware, as well as his experience using quantized image models with FLUX.
Apparently with model quantization, you can change the weight data types for each layer within the neural network.
In an unquantized model, the weights are stored as a 32-bit floating point (fp32) datatype, and as we quantize the model, we reduce those 32 bits down to lower precision, for example 8 bits. He showed me that as the quantization is lowered, the quality of outputs decreases, but inference is much faster.
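A toy illustration of that idea (a naive symmetric int8 quantizer, not what llama.cpp or GGUF actually do, just enough to see the size vs precision trade-off):

```python
import numpy as np


def quantize_int8(weights: np.ndarray):
    """Naive symmetric int8 quantization: one fp32 scale for the whole tensor."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


# Fake fp32 "layer" weights.
w_fp32 = np.random.randn(4096, 4096).astype(np.float32)

q, scale = quantize_int8(w_fp32)
w_restored = dequantize(q, scale)

print(f"fp32 size: {w_fp32.nbytes / 1e6:.1f} MB")   # 32 bits per weight
print(f"int8 size: {q.nbytes / 1e6:.1f} MB")        # 8 bits per weight, 4x smaller
print(f"mean abs error: {np.abs(w_fp32 - w_restored).mean():.5f}")
```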
If that explanation didn’t make sense, this Hugging Face article goes more in depth into the techniques.
I may butcher the explanation, but with hardware, having more memory allows more weights to be stored, and having dedicated neural processing units reduces the latency of the weight calculations.
That depends on hardware support, fwiw.
FP16 has great support in modern-ish hardware and gets pretty much double the speed of your regular fp32 weights model, with basically no performance loss.
Going down to smaller data types (be it int or float) will save you memory since the data is smaller (duh), but the performance depends on the underlying hardware support. For CPU use, that’s pretty much irrelevant.
But for GPUs or other dedicated hardware, the chip may not support the specific format, and then performance won’t be sped up; it can even end up slower than your regular fp32/fp16 inference run.
Of course the above is moot if we’re talking a model that’s normally way too big for your memory, since running a quantized model, no matter the speed, is going to be way faster than not being able to run at all lol
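If you want to see the hardware dependence for yourself, here is a rough PyTorch sketch (a toy stack of linear layers standing in for a model, not a real LLM); on a tensor-core GPU the fp16 copy should come out roughly twice as fast, while on hardware without fast fp16 you mostly just save memory:

```python
import copy
import time

import torch


def bench(model, inp, n_iter=20):
    """Time n_iter forward passes; returns seconds per pass."""
    with torch.no_grad():
        for _ in range(3):  # warm-up
            model(inp)
        if inp.is_cuda:
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(n_iter):
            model(inp)
        if inp.is_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - t0) / n_iter


# Toy stand-in for a model: a stack of big matmuls.
model_fp32 = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)])
x = torch.randn(32, 4096)

if torch.cuda.is_available():
    model_fp32, x = model_fp32.cuda(), x.cuda()
    model_fp16 = copy.deepcopy(model_fp32).half()  # same weights, fp16 storage/compute
    print(f"fp32: {bench(model_fp32, x) * 1e3:.2f} ms/step")
    print(f"fp16: {bench(model_fp16, x.half()) * 1e3:.2f} ms/step")
else:
    # Without fast fp16 hardware, the smaller dtype mostly saves memory;
    # it won't necessarily run faster, which is the point above.
    print("No CUDA device; fp16 speedup depends on hardware support.")
```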
like my hardware, an M2 Pro or a Fedora box with a 5800X3D + 2070 8GB?
neat I didn’t know about that
Based on this, I’m assuming you mean latency when you are talking about performance in the previous sentence.
yeah I’m really interested to see how far companies can push small models. Microsoft’s Phi and Apple seem to be leading the charge on this front
Modern as in your GPU (Turing) has tensor cores with support for mixed precision. FP16 inference should be twice as fast as regular FP32.
No idea on the specifics of the M2 chips, nor the software support for those.
I assume that by latency they meant the time it takes to perform a single feed-forward step, so yeah, more performance = less latency.
I mean, you can have a 10B Q8 model and a 20B Q4 model. The former has half the parameters, but it’s about equal to the latter when it comes to memory.
I see research advancing both in larger models with less precise weights (like those 1-bit models) and in models with fewer parameters (like the smaller Phi models from MS), which is indeed very nice.
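To put rough numbers on that 10B Q8 vs 20B Q4 comparison (weights only, ignoring KV cache and quantization block overhead):

```python
# Weight memory ≈ parameter count x bits per weight / 8.
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

print(f"10B at Q8: {weight_gb(10e9, 8):.0f} GB")  # ~10 GB
print(f"20B at Q4: {weight_gb(20e9, 4):.0f} GB")  # ~10 GB, same footprint
```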
This was a really interesting read that makes me want to buy the book
Had a colleague ask me about image editing models. This was a pretty cool blog
This video inspired me to go back and make some progress on some of my past projects
For example, I updated my Python environment setup guide
The old poetry guide from
is this meme