AI Medical Record Processing

I’m new to the forum but I’ve been a fan of the YouTube channel for years.

I’m building a Python program to read medical records, identify key items (name, date of birth, etc.), and summarize and classify them for later searches. The end goal is to improve quality of care by flagging things like when follow-up is needed or when preventive services are due. One of the stretch goals is to process all of the handwritten historical vaccine records that go back for decades.

I’m using Qwen3 VL 8B for OCR of scanned or faxed records and GPT OSS 20B for the text processing. The output is then converted into a JSON file for upload to the EMR.
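In rough terms, the per-document flow looks something like this (a simplified sketch, not my exact code; the base URL, model names, and prompts are stand-ins, and both models are assumed to sit behind an OpenAI-compatible endpoint):

```python
import base64, json
from openai import OpenAI

# Any OpenAI-compatible server works here; URL and model names are placeholders.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def ocr_page(png_bytes: bytes) -> str:
    """Send one scanned page to the vision model, get clean text back."""
    b64 = base64.b64encode(png_bytes).decode()
    resp = client.chat.completions.create(
        model="qwen3-vl-8b",  # placeholder model id
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this medical document exactly."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def classify(text: str) -> dict:
    """Ask the text model to pull out the key fields as JSON for the EMR upload."""
    resp = client.chat.completions.create(
        model="gpt-oss-20b",  # placeholder model id
        messages=[
            {"role": "system", "content": "Extract patient name, DOB, date of service, "
                                          "document type, and a one-line summary as JSON."},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```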

Right now I have an AMD Strix Point system, which has been phenomenal for the development work. Being able to keep both models loaded with the context size I need maxes out the shared memory. It can do everything but at the cost of speed.

Now that I’m processing documents, it’s taking about 24 minutes per document with the current workflow. I’d use the 9700 PRO to speed up document processing to keep up with new records and work through the huge archive of old records.

Once everything has all the bugs worked out, I’d like to build it into an open-source tool that works with standard FHIR interfaces, meaning it can work with any EMR.


24 minutes? Do you not run multiple requests?

Not right now. I think I can improve it by running two at a time, but I’m mostly GPU limited. 24 minutes is the average; a few larger documents bring that up. Better NPU support would also help, mostly with prompt processing.

Which platform are you on (Windows or Linux)? Which Strix Point processor? How much RAM?

So, have you given consideration to DeepSeek-OCR - deepseek-ai/DeepSeek-OCR at main (just under 7 GB)?

Or possibly doing a RapidOCR or Tesseract with Docling implementation for the PDF parsing into a simple ingestion pipeline (like Qwen3 embedding). What databases, if any, are you using? Even though it has the NPU, that support is, eh… There’s also a chance that running the models on the NPU may be slower than doing DirectML on the GPU or using a ROCm + Torch + Transformers implementation for dedicated tasks.
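For reference, a Docling-based parse is only a few lines (a rough sketch; the OCR backend and options depend on the version you install):

```python
from docling.document_converter import DocumentConverter

# Default pipeline handles layout + OCR; pass a PDF path (or URL) in.
converter = DocumentConverter()
result = converter.convert("scan.pdf")

# Markdown output is easy to chunk for embedding/ingestion afterwards.
markdown = result.document.export_to_markdown()
print(markdown[:500])
```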

I need more information to help.

I get it, because I’m trying to do an advanced RAG MCP server that can do what you are asking (mostly) and more for personal injury law, but where I want the more complex hybrid RAG with Postgres + pgvector + a Neo4j graph database for relationships.

It’s a Ryzen HX370. I was thinking about the 395+ but couldn’t justify almost twice the price for basically an experimental project. My thinking was that even if it’s slow, I can prove the concept without spending too much. Heck, if I didn’t get anything working I’d still have a nice workstation. I have 96 GB of RAM, split 64 for the system and 32 for VRAM. I know it’s only 5600 MT/s RAM, but I don’t think that’s the biggest factor in it running slow; it’s spending most of its time at 100% GPU. I tried a couple of different models with NPU support and they seemed faster, but I didn’t run any speed tests.

I started off with Ubuntu 25.10 and was having trouble getting LLMs to work on the GPU. Lemonade didn’t work at all, and Ollama was running on the CPU. I switched to Windows and LM Studio. Everything is built in Python and uses API calls for the LLM inference, to work across platforms. Mostly. I still have some stuff to work out. I should also be able to use any LLM platform by just passing a different base URL.
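In other words, switching platforms is just a different client config; something like this, with the default local ports as examples:

```python
from openai import OpenAI

# Same calling code either way; only the base URL (and dummy API key) changes.
lm_studio = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
ollama = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
```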

Tesseract was fast but not as accurate as Qwen3-VL, and I never tried RapidOCR. I tried some traditional OCR tools; too many errors with handwritten text. DeepSeek-OCR was giving me errors when I tried to pass it files, and I didn’t spend a lot of time trying to get it to work. I’ve heard good things about it though. Qwen3-VL has been almost flawless, even if a little slow.

I only just got to the point where it even works this past weekend. I know I have a lot that I can optimize from here.

I’m genuinely curious (i.e. not throwing stones) - how are you handling the accuracy problem? A friend of mine asked me to build something similar, and when I looked into it, the question of legal liability for errors in the resulting data was pretty much unanswerable here in the UK without manual effort significant enough to render the time saving meaningless.


It’s a legit question. I looked at the liability of each step and made the process fail in a safe way. I’m running at a 90%+ success rate, so the time savings are definitely worth it.

Patient Name & DOB – Liability high. The patient will be matched to the EMR using this data, so I could potentially load the document into the wrong chart if this is wrong. If the search returns multiple patients or no patient, the result is flagged and discarded, and the document is processed manually (see the sketch below).

Document description – Liability low. Documents have already been reviewed by a physician and any action needed has been taken. Misspelled words are common already.

Document Metatags – Liability low. I’m working off a predefined list of metatags that the model chooses from. If it deviates from the list the tag is ignored and the document processes normally.

Vaccine Type – Liability medium – The original scanned data is retained for reference. Again, I’m working off a predefined list of known good results and will flag and manually process bad results.

Other vaccine information – Liability low. Lot numbers, site given, etc. Nobody really cares about this, but we are required to record it. We can always look at the original scan.
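The matching step itself is deliberately conservative; roughly like this (a sketch only - the real EMR search sits behind a FHIR-style Patient query, and the endpoint and field names here are placeholders):

```python
import requests

FHIR_BASE = "https://emr.example.org/fhir"  # placeholder endpoint

def match_patient(name: str, dob: str) -> str | None:
    """Return a patient ID only if the search comes back with exactly one hit.

    Zero or multiple hits -> return None so the caller can flag the document
    for manual processing instead of guessing.
    """
    r = requests.get(
        f"{FHIR_BASE}/Patient",
        params={"name": name, "birthdate": dob},  # dob as YYYY-MM-DD
        timeout=30,
    )
    r.raise_for_status()
    entries = r.json().get("entry", [])
    if len(entries) != 1:
        return None
    return entries[0]["resource"]["id"]
```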

@Cronus_k - aha, I hadn’t understood that you were linking historical records to patients who already exist on an accessible database. That makes more sense; you obviously have a lot of reference data for safeguards (although these poor AI models still have to deal with doctors’ handwriting…I’ll light a candle for their sacrifice…).

I was coming at it from the perspective of a blank data store, where it would obviously be much harder to get reliable data.

Then you’ve had roughly the same pains I have had, in part. Are you using Ryzen AI 1.6.1? Have you had the growing pains with ROCm on Windows?

https://repo.amd.com/rocm/whl/

Comparatively, I am getting better results recently, but I have had to migrate Python builds from 3.10 to 3.12, etc. Once I saw the limits of XDNA ONNX (it is good for certain tasks only), the GPU DirectML route and the Torch, Transformers, and Sentence-Transformers stack became more important to me.

But custom coding the loaders and usage instead of just calling multiple models from Ollama has become its own chore: setting the context windows by hand, etc.

I started knowing nothing in April and know a lot more now, but it is a learning process.

I can say that when I have it working, it takes less time than yours for ingestion, but it has its own quirks.

I considered Qwen3-VL recently, as I’m doing a V2 on the ingestion pipeline at the moment.

I’m wondering whether, given your setup, part of the models or some layers are being pushed to the CPU-allotted memory, or if the entire thing is running on the CPU.

With LM Studio, I have a fine time at smaller context windows, but when expanding them I’ve had issues crop up, which has been frustrating.

I may need to take a look at Qwen3-VL. I use Qwen3 embedding and reranking already, which are great.

I don’t remember the version, but I had to sign up for an AMD developer account to get the drivers. Come on, AMD, do better. I like the extra allocated VRAM of Strix Point, so I don’t need to load and unload models as I’m processing each file, and the ability to at least try larger models. It does come with quite a few quirks though.

I custom coded as little as possible. I’m using Ghostscript for pre-processing the files and ollama-ocr by imanoop7 to call the OCR model. Then I pass that text to GPT OSS for classification. I did code the EMR API calls and error handling myself.
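The Ghostscript pre-processing is basically just rasterizing each page to a high-DPI image before the OCR model sees it; roughly like this (a sketch - the exact flags and resolution are illustrative, and on Windows the binary is gswin64c rather than gs):

```python
import subprocess
from pathlib import Path

def pdf_to_pngs(pdf: Path, out_dir: Path, dpi: int = 300) -> list[Path]:
    """Rasterize each PDF page to a PNG the vision model can read."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pattern = out_dir / "page-%03d.png"
    subprocess.run(
        ["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m",
         f"-r{dpi}", f"-sOutputFile={pattern}", str(pdf)],
        check=True,
    )
    return sorted(out_dir.glob("page-*.png"))
```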

I don’t know that Qwen3-VL is the best; I tried olmOCR 2, llama3.2-vision, granite3.2-vision, and DeepSeek-OCR. Qwen had the lowest error rate. I think Granite was the fastest.

Everything is running on the GPU through Vulkan; it was taking 2 hours per document running on the CPU. I know Vulkan has a performance penalty over ROCm, but from what I was researching it’s maybe 10%. It was worth it because it just worked right away. I could batch process files for better performance, but that introduces more complexity.


I’m director of scientific computing at a cancer center and would love to hear more about this. We have something similar in clinical trials right now.

I’m doing a 2-phase NER design. It is likely overkill for your use case, but it is great for mine (personal injury law). My intake involves an MNLI model for domain detection and JSON definitions for categories and sub-categories per domain. It has to discover the domain to understand whether it is an academic paper, a legal case, a medical record, etc. Then the parsing sets it up for two things: 1) chunking (with document, section, and paragraph summaries, plus table, chart, and footnote awareness) down to the paragraph level, and 2) entity extraction, which requires proper parsing and domain detection for categories and sub-categories to then select the proper NER for entity extraction and relationship building. I use multiple NER models from HF for the medical entity extraction and relationship building, then GliREL for the relationships.
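The domain-detection piece is essentially zero-shot classification against the per-domain labels, along these lines (a sketch; the stock MNLI model and labels here are just examples):

```python
from transformers import pipeline

# Stock MNLI model as an example; the real labels come from the JSON definitions.
detector = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

domains = ["medical record", "legal case filing", "academic paper", "correspondence"]
result = detector("Patient presented with shoulder pain following the collision...",
                  candidate_labels=domains)
best_domain = result["labels"][0]  # highest-scoring domain
print(best_domain, result["scores"][0])
```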

OpenMed has a host of open medical NER models for different uses in the medical community. With the proper domain detection and parsing, you could tag the metadata for specific things easily, or summarize it easily and then have that pull the relevant information, including diagnostic coding per procedure.
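And the extraction itself is mostly the standard token-classification pipeline once you have picked a model; roughly (a sketch - the model id below is a placeholder for whichever medical NER checkpoint fits the category):

```python
from transformers import pipeline

# Placeholder model id; substitute the medical NER checkpoint you settle on.
ner = pipeline("token-classification",
               model="some-org/medical-ner-model",
               aggregation_strategy="simple")

entities = ner("Patient received Tdap vaccine, lot A1234, in the left deltoid.")
for ent in entities:
    # entity_group is the label (e.g. DRUG, DATE); word is the matched span.
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 2))
```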

That’s why I am interested in your project, as yours seems like a more targeted approach to what I am creating and I’m hoping to glean what I can from yours to work on my own. LOL.

But, as you can tell from the description, this is much more involved and code intensive, and it requires studying the per-model licenses carefully to find models that allow for commercial use without license fees. Then you have the limitations of the Neo4j license if you give each patient their own database instead of doing proper RLS, with restrictions both per project (per case or per client ID) and per tenant ID, among other things, to prevent access by unauthorized personnel.

I’m still learning the database management side myself, plus it needs to qualify for HIPAA. Being locally hosted helps, but restrictions on access for each patient or client are still needed to guarantee compliance.

I just remembered, there are two reasons to sign up for that. One is the HIP SDK, which is replaced if you are using the ROCm from the AMD repo (but you should create an env_vars.bat or env_vars.ps1 file in the conda env’s activate.d folder with the env keys to switch the reference from the 6.4 HIP ROCm folder to the 7.10 ROCm in your conda/Python environment, so it auto-points to the right files for clang, HIP drivers, etc.). The other is downloading the NPU driver to get that going, which is up to Ryzen AI 1.6.1 and now works with Python 3.12. It is faster than the older 1.5.0.

Installation Instructions — Ryzen AI Software 1.6.1 documentation
https://repo.amd.com/rocm/whl/

With the newer ROCm, you get Torch 2.9 running and newer Transformers, among other benefits (like the packages being on Python 3.12 instead of 3.10). But I am still waiting on 7.0 or 7.1 to be downloadable and easily installed on Windows.

If you need some info on getting that working, let me know. But you will want to create a constraints.txt file for Python so that as you install new modules, you do not overwrite the working ones for NPU support or the ROCm 7.10 and Torch builds that are custom to the AMD hardware. You will also want to block other onnxruntime modules from installing, and you may have to install dependencies manually when a package tries to pull in the wrong onnxruntime and is blocked by the constraints.txt file. Maintaining your Python dev environment is essential.

I’m glad some people are interested. I thought this would be a boring topic.

Thanks for the link to OpenMed; I’ll give those models a try. Also, thanks for introducing me to NER. I didn’t know that was a thing; there is a lot of info there that gave me some ideas for things to change. I was just working off the assumption that breaking the problem down into smaller chunks with hyper-detailed, specific prompts would give me the best results.

Right now, I’m passing a list of acronyms, definitions, and explanations through the system prompt, user prompt, and assistant prompt depending on how important I think that item is. GPT OSS seems to be great at following instructions to the letter, so I haven’t felt the need to explore a bunch of different models. To an extent, I’m replacing a lot of what could be accomplished in code with instructions in the prompt. “Use the format mm/dd/yyyy. If multiple dates appear, choose the date explicitly labeled as DOS, Date of service, encounter date, or service date. If there are multiple dates of service, pick the earliest one.”
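Concretely, a call ends up looking something like this (a sketch; the real prompts are much longer, the acronym list is loaded from a file, and the base URL and model id are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

SYSTEM = (
    "You extract fields from medical documents.\n"
    "Use the format mm/dd/yyyy. If multiple dates appear, choose the date explicitly "
    "labeled as DOS, Date of service, encounter date, or service date. "
    "If there are multiple dates of service, pick the earliest one.\n"
    "Acronym reference: DOS = date of service, DOB = date of birth, ..."
)

def extract_date_of_service(document_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-oss-20b",  # placeholder model id
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": document_text},
        ],
    )
    return resp.choices[0].message.content.strip()
```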

I’m also planning a build for a “production” computer to host the LLMs, so that I can keep using the HX370 for development. I already have most of the parts that I need, the GPU being a big missing component. I was looking at a W6800 for the VRAM and relative cost efficiency. That’s going to be running Ubuntu again, so I’m definitely going to be spending more time getting ROCm working. Now that I know the rest of the process works, it will be worth the time investment.

A little bit of an update: I ran a batch of 106 files over the weekend. 100 passed successfully, 5 failed due to patient matching, and one failed to upload.

I tweaked the prompt some along the way. Summary descriptions have been good all along, so I haven’t made any changes there yet. A couple of notes from the changes I’ve made:

  • If you say “all”, even in a limited context, GPT OSS heavily weights that phrase in the prompt.
  • Added a couple of new definitions for patient name, e.g. member or enrollee.
  • OCR still needs some work; there were maybe 5 files that initially failed to match a patient but worked on a second pass.
  • Checkboxes and fill-in-the-blank fields are a problem. The OCR reproduces them exactly in the text output, but GPT OSS sees the field label and assumes that it’s true, even if the field is blank or there’s no checkmark. I’m not sure what the answer to this is yet.

Next steps
Now that I know the workflow is, well, working, I’m going to start experimenting with different models. DeepSeek-OCR and GLM-4.6V-Flash are two that I’m going to try for OCR. I’m going to try some negative prompting for the checkbox problem; not sure if a different model will have more success here.

Honestly…4% failure isn’t terrible. There’s also the human nature element here…if it was a perfect 0% failure rate, nobody would trust it (and nor should they…as a mentor said early in my career, a test with no failures is a test that failed).

I’ve not done much with LLM OCR stuff, but…my first thought is that it’s not actually desirable for the OCR model to output with perfect fidelity; the goal is to output something that the next model can deal with more effectively. Therefore, I’d be aiming to get the OCR model to output true/false for checkboxes and <placeholder> for fill-in-the-blanks so that gpt-oss has less work to do.
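i.e. an OCR instruction along these lines (just an illustration of the idea, not something I’ve tested):

```python
# Hypothetical instruction for the OCR/vision model: normalize form fields up front
# so the downstream model never has to guess about unmarked checkboxes or blanks.
OCR_INSTRUCTIONS = (
    "Transcribe the document. For every checkbox, output the label followed by "
    "true or false depending on whether it is actually marked. For empty "
    "fill-in-the-blank fields, output the label followed by <blank> instead of "
    "guessing a value."
)
```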

Apologies if that’s stating the obvious and you’ve already tried and discarded it 🙂

That’s a really good idea; I’ll try that out. Also, if you are in need of OCR for complicated documents, I highly recommend trying an LLM. The text I’m getting out of it is very clean and well formatted.

Glad to be (potentially) useful 🙂 Do let us know how that goes…I’m curious as to whether it’ll improve matters (and whether the OCR LLM can reliably follow that kind of instruction).

Must confess, I don’t do a lot with OCR - last time I needed to get an LLM to process documents, they were primarily graphical (football training drill diagrams, for my employer) and it had all sorts of trouble with them.