So, this summer I purchased a Bosgame M5 AI mini-PC with a Ryzen AI Max+ 395, a Radeon 8060S, and 128GB of unified memory. I want to take one of its NVMe (PCIe x4) slots out to an external PCIe board and run an R9700 Pro on it.
I have been developing an advanced RAG system that runs natively on Windows. Talk about a headache. I have tried different approaches along the way; I am not that great a coder, but I am excellent at troubleshooting hardware.
The most recent incarnation is an MCP server with a hybrid RAG configuration built on PostgreSQL 18 + pgvector and Neo4j. Ingestion runs in multiple stages: a docling parser (with Tesseract or RapidOCR for OCR); a custom chunker that works down to the paragraph level with footnote awareness (in progress), falling back to sliding windows when a paragraph must be split further to avoid truncation; and domain detection that selects the appropriate NER models (there are tons, organized by domain) in a multiphase NER system for entity extraction, before everything is run through GLiREL for relationship extraction. The idea is to keep a base core of entities and relationships, then layer domain-specific ones on top: you can define hundreds or thousands of types, but automatically narrow them to only those relevant to the scientific, legal, or medical text at hand, so the graph database is not bogged down with irrelevant entities and relationships while the environment stays rich rather than oversimplified. That includes sub-domains (legal breaking into civil and criminal, civil into civil-securities, down to individual statutes). The domain detection is also used to enrich the metadata for the Postgres + pgvector database.
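The narrowing step can be sketched as a hierarchical label lookup. This is a minimal illustration, not the actual code: the names (`CORE_LABELS`, `DOMAIN_LABELS`, `select_labels`) and the example label sets are hypothetical, and the detected domain is assumed to arrive as a dotted path like `legal.civil.securities`.

```python
# Hypothetical sketch of domain -> NER label selection; all names and
# label sets below are illustrative, not from the real pipeline.

CORE_LABELS = {"person", "organization", "location", "date"}

# Domain-specific labels keyed by a dotted domain path.
DOMAIN_LABELS = {
    "legal": {"court", "judge", "party"},
    "legal.civil": {"plaintiff", "defendant"},
    "legal.civil.securities": {"security", "issuer", "statute"},
    "medical": {"drug", "dosage", "condition"},
}

def select_labels(domain_path: str) -> set[str]:
    """Union the core labels with every level of the detected domain path,
    so 'legal.civil.securities' inherits 'legal' and 'legal.civil' too."""
    labels = set(CORE_LABELS)
    parts = domain_path.split(".")
    for i in range(1, len(parts) + 1):
        labels |= DOMAIN_LABELS.get(".".join(parts[:i]), set())
    return labels
```

An unknown domain simply falls back to the core set, which matches the goal of always keeping a base level of entities and relationships.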
Then another small model, like Qwen3 or granite4-tiny, goes through and generates summaries for the document as a whole (3 sentences), each section (1 sentence), and each paragraph (1 sentence), to further enrich the information used when selecting chunks relevant to a query.
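The three granularities can be sketched as a small prompt table plus a driver around whatever local model serves the summaries. Everything here is illustrative: the prompt wording, the function name, and the `generate` callable (standing in for a call to e.g. granite4-tiny) are assumptions.

```python
# Hypothetical sketch of multi-level summarization for metadata enrichment.

SUMMARY_PROMPTS = {
    "document": "Summarize the following document in exactly 3 sentences.",
    "section": "Summarize the following section in exactly 1 sentence.",
    "paragraph": "Summarize the following paragraph in exactly 1 sentence.",
}

def summarize_levels(texts: dict[str, str], generate) -> dict[str, str]:
    """`texts` maps a level name to its raw text; `generate` wraps the small
    local model. Returns level -> summary, ready to store as chunk metadata."""
    return {
        level: generate(f"{SUMMARY_PROMPTS[level]}\n\n{texts[level]}")
        for level in texts
    }
```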
And that is just the ingestion system. There are more models involved, but to recap: Qwen3-embed, DeBERTa-MNLI for domain classification, granite4-tiny for summarization, and domain-specific NER + GLiNER + GLiREL for the two-stage NER. Retrieval, meanwhile, uses ColBERT, Qwen3-embed, BM25 (built into PostgreSQL 18), and graph retrieval (repurposing and custom-training a GraphSAGE/GraphRAG combo, to come later), with RRF fusing the results before the Qwen3 reranker feeds them to the main model.
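The fusion step ahead of the reranker is standard Reciprocal Rank Fusion: each retriever contributes 1/(k + rank) per document, and the summed scores decide the order. A minimal sketch, with the function name and the conventional k = 60 default as my assumptions:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over retrievers of 1/(k + rank),
    where rank is the 1-based position of d in that retriever's list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; this order feeds the reranker.
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in several retrievers' lists (say, both BM25 and the dense results) outranks one that appears high in only a single list, which is exactly why RRF is a good pre-reranker step.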
All of this runs on Windows 11 without WSL, which has meant living through the growing pains of the HIP SDK and migrating environments as the Python ROCm packages matured. I started when Ryzen AI 1.5.0 was on Python 3.10, where some AI/ML packages were unavailable and the VitisAI and DirectML execution providers were easy to break; I am now on Ryzen AI 1.6.1 with Python 3.12, ROCm 7.10 from repo.amd.com, and torch 2.9. I could regale you with stories of compiling a custom Triton for AMD on Windows, the headaches of trying to get FlashAttention-2 to work, falling back to SDPA's math backend for docling, and so on. Development on Windows has been… fun.
So let’s throw another variable into the mix: that R9700 Pro. I would connect it with a daughter board (OCuLink NVMe-to-PCIe), then offload some of the models to it (or possibly the main model, for something like a smaller AI coder such as Qwen3-Coder-30B, or granite4-tiny with a large context window) to ensure I do not run out of memory. Everything running today already uses about 66% of the 128GB; add a main local model and you are at the edge of what this machine can handle. I keep the models warm for now, but may have to move to lazy loading otherwise.
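If I do move off keeping everything warm, lazy loading could look something like the sketch below: models load on first use and can be evicted to reclaim memory. The class, method names, and loader convention are all hypothetical, not the current code.

```python
# Hypothetical sketch of lazy model loading instead of keeping all models warm.

class LazyModelRegistry:
    def __init__(self, loaders):
        # loaders: model name -> zero-argument callable that loads the model
        self._loaders = loaders
        self._cache = {}

    def get(self, name):
        """Load the model on first access; return the cached instance after."""
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
        return self._cache[name]

    def evict(self, name):
        """Drop a model from the cache so its memory can be reclaimed."""
        self._cache.pop(name, None)
```

The trade-off is the usual one: warm models answer immediately, while lazy loading pays a first-use latency cost in exchange for headroom.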
I also have an advanced ingestion path specifically for code repositories: when that domain is detected, per-language custom AST-based analysis enriches the Neo4j graph database.
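For the Python case, that kind of AST pass can be sketched with the standard-library ast module: walk the tree, find function definitions, and emit caller/callee pairs that could become graph edges. The function name and the edge shape are illustrative; a real pass would also resolve imports, methods, and nested scopes.

```python
import ast

def extract_call_edges(source: str) -> list[tuple[str, str]]:
    """Sketch of a per-language AST pass (Python shown): yield
    (caller, callee) pairs for simple name calls inside each function."""
    tree = ast.parse(source)
    edges = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for child in ast.walk(node):
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                    edges.append((node.name, child.func.id))
    return edges
```

Each edge would then map onto a relationship in Neo4j (e.g. a CALLS relationship between function nodes), alongside whatever the other per-language analyzers contribute.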
But this is a work in progress; it is not done. Still, the R9700 Pro would be an amazing addition, and the project shows how small, dedicated models can add up to a whole greater than the sum of its parts.
