So my wife is in Nurse Practitioner school and has a metric ton of PDF notes and other coursework she’s piled into folders over the last couple of years, starting with her Master’s program. On my side of things, I use Obsidian with the Textgenerator plugin and the OpenAI API to quickly brain-dump ideas and projects into my Vault with AI’s help. I’d like to stop relying on an external OpenAI API and run a self-hosted box here as a hardware project. After watching the video on the RTX 4000 SFF, I feel this is the route I want to go; it’s overpriced, but it looks like a fun box for doing very lazy AI work. My first thought is Ubuntu running Ollama, but I’m not fully sure it supports CORS, which is what the Textgenerator Obsidian plugin uses to send requests and wait for generation. I think it does, but I haven’t tried it yet. Also, what would be the best way to have the local LLM ingest all of her saved coursework for inference, so she can ask it questions based on her own work?
Lastly, for the fun part… help me build the little LLM server! Case/Mobo/CPU/Ram suggestions! I’d hate to just grab an HP Elite 800 G9 SFF, though it looks like it would fit the bill.
Can you elaborate on your wife’s use case? Does she want to search her notes corpus, or query an AI that uses the corpus as context? Also, what is the nature of these documents: are they handwritten? Do they contain diagrams?
W.R.T. CORS, I don’t know anything about the plugin’s API, but provided it isn’t too involved you could always write a wrapper around your chosen LLM and add flask-cors. Of course this requires some amount of Python programming and digging into what appears to be TypeScript on the plugin end (it may be sufficient to look at what I assume are REST calls in the repo).
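To make that concrete, here’s a rough sketch of the wrapper idea. flask-cors would do this in a couple of lines, but I’ve written it in pure stdlib so it runs with no extra dependencies; the upstream address assumes an Ollama-style server on its default port, so adjust to whatever you actually run:

```python
# A minimal CORS-adding reverse proxy, pure stdlib. It answers the browser's
# preflight itself and forwards POST bodies to the local LLM server.
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://127.0.0.1:11434"  # assumed local LLM endpoint (Ollama default)

class CorsProxy(BaseHTTPRequestHandler):
    def _send_cors_headers(self):
        self.send_header("Access-Control-Allow-Origin", "*")  # tighten for real use
        self.send_header("Access-Control-Allow-Headers", "Content-Type, Authorization")
        self.send_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS")

    def do_OPTIONS(self):
        # Answer the CORS preflight locally; nothing goes upstream.
        self.send_response(204)
        self._send_cors_headers()
        self.end_headers()

    def do_POST(self):
        # Forward the request body to the LLM and relay the reply with CORS headers.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        req = urllib.request.Request(
            UPSTREAM + self.path, data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        self.send_response(200)
        self._send_cors_headers()
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, fmt, *args):
        pass  # keep the console quiet

# To run: HTTPServer(("127.0.0.1", 8080), CorsProxy).serve_forever()
```

Then point the plugin at port 8080 instead of the LLM directly. Not production-grade (no streaming, no error relay), just the shape of the thing.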
Basically she’d like to have her notes and concepts summarized and searchable. She sporadically uses ChatGPT to look up concepts and get quick answers, but it would make more sense for most of the content to be generated from her actual typed PDF notes and/or a PDF upload of the book she is using at the time. I did this once before a while back with H2O GPT, but it didn’t work overly well at the time.
As for the CORS portion, that would only be to integrate the system directly with my Obsidian vault so I could generate far longer content without having to pay for OpenAI API calls.
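(Update: from skimming the Ollama docs, CORS looks like it’s handled with the `OLLAMA_ORIGINS` environment variable, so something like this may be enough without any wrapper; I haven’t tested it with the plugin yet, so treat the exact origin value as an assumption.)

```shell
# Allow cross-origin requests to Ollama on a systemd install:
sudo systemctl edit ollama
# then add:
#   [Service]
#   Environment="OLLAMA_ORIGINS=*"
sudo systemctl restart ollama

# Or for a one-off foreground run:
OLLAMA_ORIGINS="*" ollama serve

# "*" is the lazy option; "app://obsidian.md" is reportedly the origin
# Obsidian requests come from, if you want to lock it down.
```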
Not a complete solution, but here are a few points to ponder.
Gotta pay attention to the context window size of the LLMs you are planning to use; it may be smaller than that of the large commercial LLMs.
Obviously you cannot feed your entire collection of documents to an LLM on every prompt. So you probably want to break it down into small chunks and search for the relevant pieces before sending the prompt to the LLM. If you want to search semantically, you could feed the chunks into a vector database.
Inference is going to run much quicker on GPUs. Check how much GPU memory you would need to run the models you choose.
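To make the chunk-and-retrieve idea concrete, here’s a toy sketch. In real life you’d use a proper embedding model plus a vector database (e.g. Ollama embeddings with Chroma or Qdrant); the bag-of-words cosine below is only a dependency-free stand-in so the flow is runnable end to end:

```python
# Toy retrieval: split documents into overlapping word chunks, "embed" each
# chunk, then pull the best-matching chunks to stuff into the LLM prompt.
import math
from collections import Counter

def chunk(text, size=200, overlap=50):
    """Split text into overlapping chunks of `size` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    # Stand-in embedding: a bag-of-words count. Swap for a real model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(question, chunks, k=3):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

The retrieved chunks then get pasted into the prompt ahead of her question, which is the whole trick that keeps you inside the context window.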
That is kind of where I am at. I’m torn between going the tiny path with the RTX 4000 SFF and its 20GB, or just going nuts and getting a DGX Station with 128GB lol… but I’m not sure spending ~$15k plus feeding a 20A circuit is cost-effective for this project, though it would be fun.
Have a look at the NVIDIA NeMo stuff. They have a few LLMs ready to go (Docker images and LLM models of your choice) along with good docs and a few sample apps if you dig around on their site, including one that does RAG on PDFs.
The problem is that the large models (the ones that give the best results) need plenty of VRAM.
I’ve only tried Llama 2 locally, and it was disappointing compared to Claude Sonnet. Maybe the latest local models are better; I have yet to try them.