Struggling to get any use out of LLMs, am I missing something?

I hope this is the correct category, since it’s a broader AI topic.
I recently tested ChatGPT’s “deep research” feature with what I assumed would be a challenging but not insurmountable task.
I had 2 websites that hosted episode lists of a cartoon show.
One with the German episode numbers, titles and plot synopses, the other website with the English episode numbers, titles and plot synopses.
Cartoons, especially ones without an overarching story, where episodes are almost always standalone, were often released in a different order in German. Season 1 of the German release might contain a smattering of episodes from Seasons 1, 2, 3 and 4 of the original English release, etc.
I asked ChatGPT 4o, with deep research enabled, to create a mapping table between the German and English releases, with the columns “German episode number”, “German title”, “corresponding English episode number”, “corresponding English title”.
I also included the information that the title translations are not word-for-word, literal translations of the English titles, and that it should try to match episodes based on the plot synopses rather than just the titles.
After 10+ minutes of “deep research” I got a mapping table that literally just mapped German episode 1 to English episode 1, German episode 2 to English episode 2, etc.
I tried again, telling it to only do it for German Season 2, since I had already mapped Season 1 manually, and this time I got nonsense hallucination episode numbers and titles. Titles that sounded plausible, but didn’t actually exist.
Out of curiosity I asked my 10-year-old nephew, whose English isn’t that great, to try it, and he did 3 episodes successfully before getting too bored. So I’m assuming that most 14-15-year-old interns could do this task quite successfully.

Maybe I’m just getting jaded, or old and cranky, but OpenAI is advertising this “deep research” as an agentic capability that can handle complex tasks. Yet it completely fails at something that literal kids can do.
I’ve also tried using LLMs in a professional context. But I keep getting hallucinations when I ask for PowerShell or Azure CLI code snippets, so I have to double-check everything anyway, which means I’m usually faster just going to the first-party documentation and writing it myself.
The same goes for more complex sysadmin tasks: I don’t need it for basic stuff, and for more complex issues it’s completely unreliable and not as helpful as first-party documentation.
And when it comes to writing emails or general communication, I can always tell when a colleague or a customer is using one of these AI tools. It always sticks out and never sounds natural. And for technical documentation I, again, can’t trust it.
I’ve also seen AI-generated images show up in billboard advertisements where I live, and again, you can very much tell that they are AI-generated. It feels incredibly cheap, and it definitely won’t convince me to buy anything from these companies.

Am I not using the tools correctly? Am I missing something? I thought all of the scenarios I described here should be doable and within scope for these tools, based on the claims about and general hype around LLMs, but I’m struggling to get any actual use out of them.

9 Likes

No, that’s pretty much par for the course.

11 Likes

I don’t think so. A corollary of so-called AI not understanding anything is that if it gets anything right, it’s by accident. I toss simpler tasks than yours at ChatGPT occasionally as tests and it has yet to produce anything even vaguely useful. So far it’s been a total waste of time.

IIRC Gemini was benching 97% incorrect on simpler tasks than this at one point. Generally I’d figure about 97% correct to be the minimum to break even on the labor of checking results, meaning using an LLM would actually save time, but I haven’t encountered any study benching that high. 60%, yes, but at that level, if you care at all about quality, there’s very little value in running the LLM, as you have to check every response.

Maybe I’m also missing something, but about the only reasons I can see to run an LLM at the moment are if you need to measure their error rates or are researching how to make them not suck. Maybe they’ll get useful; maybe they’ll be like fusion, where decades go by.

3 Likes

People need to understand that AI is just picking the next likely token in a sequence; it’s basically an advanced autocomplete algorithm. The data it has been trained on determines the most likely output.
They can be fine-tuned for a specific task, or, if you want AI to be actually useful for work/study, people feed them documents via RAG (Retrieval-Augmented Generation) so the model actually has a reference instead of making stuff up. Temperature and some other settings also determine the behavior: a higher temperature means more creativity, but that means more hallucination.
(But don’t go uploading your sensitive documents to a cloud provider; that’s what local LLMs are for. Or consult your company’s IT team.)
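To make the temperature point concrete, here’s a rough Python sketch (toy vocabulary and made-up numbers, not any real model) of how temperature reshapes the next-token probabilities:

```python
import numpy as np

# Toy vocabulary and raw model scores (logits) for the next token.
# These numbers are invented purely for illustration.
vocab = ["cat", "dog", "pizza", "the", "quantum"]
logits = np.array([2.0, 1.8, 0.5, 3.0, -1.0])

def sample_next_token(logits, temperature=1.0):
    # Temperature rescales logits before the softmax: low temperature
    # sharpens the distribution (more deterministic), high temperature
    # flattens it (more "creative", i.e. more random).
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return np.random.choice(vocab, p=probs)

print(sample_next_token(logits, temperature=0.2))  # almost always "the"
print(sample_next_token(logits, temperature=2.0))  # much more varied
```

The model never decides anything is true or false; it only ever draws from that distribution.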

Another trick is to provide it with the proper context, and knowing how to ask the right questions to get the desired answer.
Still, never take anything it says as gospel. They’re just tools to increase efficiency (or to have a fun chat with), they’re not a replacement for humans.

1 Like

Trusting marketing bluster always ends up like this, and it’s even worse in the hype-propelled bubble that current LLMs are in right now.

The best way to treat any current model, even the best ones, is as a completely unreliable, eager-to-please puppy that has read everything but understands nothing. It will confidently vomit a nice-sounding answer to any input, but there is no human-like thought process involved.

Since it has ingested everything published, easy queries have reasonably correct answers despite the lack of thought involved. Bullshit still appears randomly, but not frequently. The more complex the query, or the further off the beaten path of the training material, the higher the chance of total nonsense.

Advanced models use the same approach: they silently rewrite your input, use the output of that as new input, and then respond to you (chain of thought).
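As a rough sketch of that loop, where generate() is a hypothetical stand-in for whatever completion call the vendor actually makes (not a real API):

```python
def chain_of_thought(question: str, generate) -> str:
    # generate(prompt) -> str is a hypothetical stand-in for one LLM call.
    # Pass 1: the model "thinks out loud" -- still just token generation.
    reasoning = generate(
        f"Question: {question}\nThink through this step by step first."
    )
    # Pass 2: that intermediate text is silently fed back in as new input.
    # Nothing in it is verified; it is generation conditioned on generation.
    return generate(
        f"Question: {question}\nReasoning: {reasoning}\nFinal answer:"
    )
```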

There is no “intelligence”, methodology, or truth here, just generation. Generation, plus a plethora of tricks for adjusting the human input before generation.

But since it responds in near-perfect human language, laypeople take that as a sign of actual internal intelligence, and hence massively overestimate its capabilities.

“Hallucinations” are structurally not an anomaly, just normal operation, internally indistinguishable from correct answers. If there is no internal difference, you cannot correct for it. There is no solution to this, not even a theoretical one.

Output quality matches that. Nothing it outputs can be trusted, hence agentic capabilities seem like Russian roulette with a semi-auto to me.

The only way not to be disappointed is to treat it as an unreliable search engine with natural-language input. Or as a low-quality sounding board for creativity-challenged individuals like me.

TL;DR: Treat it as an unreliable assistive tool or you will waste more time than you save using it. Some people will find it a more natural fit for their workflow; to me it’s more of a hindrance than an assistance.

5 Likes

AI gives a lot of wrong answers, unless they’re simple questions that have already been answered a thousand times. Then it might accidentally get it right 90% of the time.

If you’re a software engineer and took a Machine Learning class back in college/university, and you remember the Back Propagation Algorithm (you don’t have to remember how it works, just that it exists), 3Blue1Brown does a really great set of videos on modern LLMs:

These are some of the best videos I’ve seen that explain how tokens are mapped into these very large matrices, and how those are used for next-word (really next-token) prediction.

If you don’t know what neural networks are, the previous videos in the series can get you up to speed (although his explanation of the Back Propagation Calculus is way more complicated than the one I remember).

4 Likes

It’s good to hear that I’m not going insane.
It’s not that I just believed their marketing claims, but I didn’t expect the exaggerations and lies to be this extreme.
With the amount of money and manpower that’s going into developing and running these LLMs, and with the omnipresence of “AI” online and in the real world, I guess I just expected more than fancy chatbots that are wrong most of the time and can’t even do simple tasks.

1 Like

I find them sorta useful depending on task/context, after learning how to use them and how not to use them.

I run some local self-hosted LLMs that are good for boilerplate code or small functions, generating documentation or unit tests, or asking what a command does. Kinda like an offline search engine. (Though often I have to refer to the docs or search anyway, so this is pretty hit or miss.)

For example, I don’t have it integrated into my IDE for auto-completion or “doing everything for me” (nor would I even want that). I still want to design and drive and think.

If I didn’t already know how to program, I might find LLMs much less useful given their high error rate; I’d just be blindly copy/pasting without any understanding of what the code is actually doing or why.

There is some utility, but I’m not worried about it coming for my job anytime soon. I will automate tasks if I can of course. It’s a tool, not a replacement. Largely I’m tired of the hype and marketing noise.

I am open to being wrong of course. After all I am wrong about many things, if not most things. Perhaps it will be magic in the future?

3 Likes

Sounds like it worked for season 1, but not season 2.

For me, I found that deep research is much better when there are more pages about a given subject. I also start a new session once I have a lot of messages or long text sequences, because that’s when it starts to hallucinate.

For my day job I have to research industries to synthesize data for demos, so I can have it take a first pass at finding me sources. The output really depends on how well documented the topic is on the web. Although domain experts may call out some hallucinations, it is much easier to have them review a shortened summary than to ask them to create a table of the fields and relationships for me.

If you don’t mind sharing the show and links to the pages, I’m kinda curious to see if I can get a better output with different models, tools, or prompting.

1 Like

In our enterprise meetings, LLMs are being hyped as well. When I suggested that structured and complete documentation, with a system that lets me do a sensible search, goes a lot further than handing a pile of data to an AI black box or feeding internal data to an external service, I wasn’t exactly the star of that meeting.

3 Likes

Yeah, as much as I love the hardware/software challenge of running big LLMs locally, I don’t actually use them for very many things.

I haven’t tried the paid API versions of “deep research”, but I did set up a local self-hosted sandbox to try some open-source alternatives, e.g. dzhng/deep-research + mendableai/firecrawl, and was not impressed with the results.

I’ve found that while “agentic workflows” are interesting and hold promise, in practice adding up lots and lots of mediocre results typically does not yield a good result.

Specific to your task of translating/mapping English/German episodes, my guess is that it involves a fair amount of “context” that you had to copy-paste in. Most LLMs I’ve played with don’t do well with questions that involve many areas across a long context. They are trained mostly to pick out a “needle in a haystack”, a single small reference in a lot of text, not more complex relationships…

A few things I have had success with while running larger open-weight models like DeepSeek-V3-0324 (and now experimenting with Qwen3-235B-A22B):

  1. Quickly translating technical discussions on GitHub to/from Chinese/English and some other languages. It is good enough at translating while having some contextual tech information to get the ideas across. It still takes a lot of copy-pasting though.
  2. Summarizing longer text into a few short key bullet points. Not perfect, but can be handy given how much content there is out there and never enough time to read it all (including my long azz response lmao) xD.
  3. Generating throwaway Python code for common tasks like making graphs from arbitrarily formatted data files (see the sketch after this list). It is pretty good at parsing text files and making simple charts, but I do have to double-check and fix up a few things, or know enough to suggest the approach.
  4. Refactoring code to make formatting, comments, static type checking, and other suggestions. You can vibe code up some quick prototypes to test your ideas, but I definitely treat it as throw away code most of the time.
  5. Entertainment like co-writing silly stories with your friends by taking turns prompting the bot.
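To give an idea of (3), this is roughly the level of throwaway script I mean (the file name and the “label value” format are made up for the example):

```python
import matplotlib.pyplot as plt

# Read "label value" pairs from a text file and bar-chart them.
# "data.txt" is a made-up example input.
labels, values = [], []
with open("data.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        label, value = line.rsplit(maxsplit=1)
        labels.append(label)
        values.append(float(value))

plt.bar(labels, values)
plt.ylabel("value")
plt.tight_layout()
plt.savefig("chart.png")
```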

If you can keep your context under 8k tokens (like ~5000 English words), many of the bigger models you can run at home have a few handy uses, in my opinion.

But, as others have said here, it is definitely unfortunate that LLMs and AI in general are over-prescribed, and simpler existing solutions get skipped over due to all the hype.

4 Likes

This is only partially on topic, but it is related. I have become quite interested in “smart glasses”, as those might be a great way for me to use AI-based assistants in everyday situations. My questions are:

  1. Aside from the Ray-Ban smart glasses from Meta, what is known about where the servers for them are located, and which privacy and confidentiality laws and rules apply? I know what to expect and not to expect from Meta on this :stuck_out_tongue_winking_eye:, but for the others? Usually no information is given or available. I was (am) interested in the glasses from Halliday (interesting design and features), but they aren’t saying anything about where the information is processed. Does anyone here know more?

  2. Also, does anyone here (also @wendell) know if one or more of these smart glasses can be used with one’s own self-hosted distills? As much as I’d like to try the Ray-Ban or Halliday glasses, I don’t trust Meta, or an even less defined entity in [location unknown], to handle my data and the data of the people I interact with.
    I am posting these questions here since I couldn’t find another thread on the forum, and it is related to AI and its application. Any information is appreciated!

No, Season 1 I did manually.
I asked ChatGPT to do it for all seasons at first, but when that failed I tried to limit the scope to just Season 2, which failed as well.

That actually sounds fun; it would be interesting if you got a better output out of it. Maybe I’m just really bad at prompting.

Here is the list of English episode titles and synopses: List of Crayon Shin-chan episodes (1992–2001) - Wikipedia
And here are the 2 German lists I worked with:
Shin Chan Episodenguide – fernsehserien.de
https://www.wunschliste.de/serie/shin-chan/episoden
You can safely ignore the sub-episodes. The same 3 sub-episodes are always grouped together into the same episode; it’s just the release order of the episodes that is completely different for the German vs. the English release.
Maybe providing a mapping of character names would also help, since the characters have different names in German than in English, but that’s honestly something that deep research should be able to figure out, IMO.

3 Likes

I recently needed to find some facility administrators. I figured I could probably do it with some LinkedIn searches and other methods, but it would definitely take a while.

I asked perplexity to give me the list, and it did. I cross checked it and it was current and correct. It saved me tons of time.

I asked it to do some other things I was very familiar with, like shopping. It missed the hard-to-understand nuance I asked for, but was otherwise correct.

I like the tool and would use it for a first round of online research. It generally doesn’t give me hallucinations, and is accurate and current which is all nice.

Other models have been much less reliable, and their training sets modestly dated.

I do use a local coding model. I mostly ignore it, except when my brain is fried and I just need something relatively simple that I don’t want to think about and can test. AI occasionally saves me a bit of effort here.

I built a chatbot for our website docs and it works pretty well for answering queries about our products. I keep a close eye on it so I can tune it. The first round of models I used were much harder to work with than more recent models. Material in the context window generally works much better than pulling data from the training set.
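The basic pattern is just putting the retrieved doc chunks into the prompt instead of hoping the model memorized them. A rough sketch, where retrieve() and llm() are stand-ins for whatever search and completion call you use (not any particular library):

```python
def answer_from_docs(question: str, retrieve, llm) -> str:
    # retrieve(question, top_k) -> list[str] and llm(prompt) -> str are
    # hypothetical stand-ins; plug in your own search and model call.
    chunks = retrieve(question, top_k=4)
    context = "\n\n---\n\n".join(chunks)
    prompt = (
        "Answer using ONLY the documentation below. "
        "If the answer is not in it, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)
```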

Hoping to try some fine tuning some day. If I ever find the time.

2 Likes

Oh this does sound fun! I started a very naive attempt with the following prompt:

## Background
I have provided three lists of Episode Numbers, Titles, and Plot Synopses for a TV show in both English and German versions (which have slightly different ordering and naming).

You can safely ignore the sub-episodes as the same 3 sub-episodes are always grouped together into the same episode, it’s just the release order of the episodes that is completely different for the german vs the english release.

## Task
1. Create a mapping table between the English and German releases, with the columns "German episode number", "German title", "English episode number", "English title".
2. Cross reference the lists providing a mapping of character names between English and German versions.

## English List 1
#https://en.wikipedia.org/wiki/List_of_Crayon_Shin-chan_episodes_(1992%E2%80%932001)

## German List 1
#https://www.fernsehserien.de/shin-chan/episodenguide

## German List 2
#https://www.wunschliste.de/serie/shin-chan/episoden

I have a custom LLM chat client that uses regex to expand out any #https:// or #file:// type URI and tries to fetch a markdown version with primp etc. However, this resulted in over 166k tokens :sweat_smile:, which is way too long for most LLMs.
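(The expansion itself is nothing fancy; a simplified sketch of the idea, not my actual client code, and assuming primp’s Client/text_markdown interface:)

```python
import re
import primp  # assumption: primp exposes Client(...).get(url).text_markdown

def expand_uris(prompt: str) -> str:
    # Replace each "#https://..." token with the fetched page as markdown.
    client = primp.Client(impersonate="chrome_131")  # version string is a guess
    def fetch(match: re.Match) -> str:
        return client.get(match.group(1)).text_markdown
    return re.sub(r"#(https?://\S+)", fetch, prompt)
```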

I went back and manually copy-pasted (after clicking mehr… [“more”] and cutting out some of the spammy HTML sidebar junk), which resulted in around 92k tokens, still very long.

For this long a context I’ll probably need to fire up DeepSeek-V3-0324, which is a beast to run; I hope to report back soon. In the meantime I’m running locally on my 3090 Ti FE (24GB VRAM) like so for a quick test:

/build/bin/llama-server \
    --alias "bartowski/THUDM_GLM-Z1-32B-0414-IQ4_XS" \
    --model /mnt/astrodata/llm/models/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf \
    -fa  \
    -ctk q8_0 -ctv q8_0 \
    --ctx-size 131072 \
    --n-gpu-layers 99 \
    --threads 1 \
    --host 127.0.0.1 \
    --port 8080

The GLM-4 series has a very efficient kv-cache given how its grouped-query attention (GQA) is set up. But it might still give garbage output; we shall see!

4 Likes

Well, my poor THUDM/GLM-Z1-32B-0414-IQ4_XS barfed when I threw 90+k tokens at it. I am currently running ubergarm/DeepSeek-V3-0324-IQ2_K_R4 like so and waiting for it to trickle out some tokens, haha:

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-server \
    --alias ubergarm/DeepSeek-R1-IQ2_K_R4 \
    --model /mnt/raid/hf/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
    --ctx-size 131072 \
    -ctk f16 -ctv f16 \
    -mla 2 -fa \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080

It took like 20 minutes to process, and here is the output thus far. I’ll edit and update once it gets further along, but at 1.8 tok/sec I couldn’t wait to see if this is at all correct, given I don’t know the show.


1. Mapping Table between English and German Releases

Here is the mapping table between the English and German episodes based on their titles and synopses. The German episodes are grouped into three segments (a, b, c) per episode number, so each English episode is matched to one of these segments.

| German Episode Number | German Title | English Episode Number | English Title |
| --- | --- | --- | --- |
| 1.01a | Fett For Fun | 1 | Running an Errand / Mom’s Mornings are Busy / Drawing |
| 1.01b | Sport extrem | 2 | Tricycles are Fun / My Stomach Is Going to Burst / A Nightmare for Dad |
| 1.01c | Braue um Braue, Zahn um Zahn | 3 | Watching Action Mask / School Lunch is Fun / Going to the Dentist |
| 1.02a | Eine wirklich schreckliche Familie | 4 | The Sunflower Class / Going on a Picnic |

To spare everyone the spam, I posted the full response in a GitHub gist: Having fun with ai trying to solve some text mapping problems for a show! · GitHub

No idea if any of this is accurate or not. It might be better to pre-parse each copy-pasted list into its own table first, before passing them all together for cross-referencing.

Unfortunately my client threw an Error and didn’t get to finish the inference! Oh well, maybe someone else can do better!

2 Likes

You got pretty much the same output that I got from ChatGPT in deep research mode.
It doesn’t realize that 1.01a is Season 1, Episode 1, Sub-episode 1, and that it should group 1.01a, 1.01b and 1.01c into one episode. So it starts to match sub-episode 1 of episode 1 with all 3 sub-episodes of the first English release.
And your result also shows the exact same behaviour, where it simply matches German episode 1 with English episode 1, German episode 2 (or what it thinks is episode 2; in reality it is episode 1, sub-episode 2) with English episode 2, etc.

Here is a screenshot from the output of ChatGPT for Season 2:


It starts at episode 26 because the first season of the German release has 25 episodes, but it does the exact same thing. It matches them 1:1 instead of doing the actual task: reading the synopses and making sure the plots match.

I also have no idea how or why it started researching the laws of robotics while looking up Shin-chan episodes:
Or what interesting global trends it can discover in a cartoon show from the 90s:

4 Likes

Isn’t that just a Markov chain?

1 Like

Actually, I think this could be done, but not like that. IMO, this should be done on an episode-by-episode basis, trying to match one at a time, either from English to German, vice versa, or both ways and then comparing.

But I would rather try to write this tool myself than rely on a hosted “general purpose” solution.

You could try building a text-embedding database from all the episode synopses and then matching the most likely episodes (sort of like face matching).
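A rough sketch of that idea with sentence-transformers (the model name is just one example of a multilingual embedding model, and the synopsis lists are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

# A multilingual model, so German and English synopses share one vector space.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

german_synopses = ["..."]   # one entry per German episode (placeholders)
english_synopses = ["..."]  # one entry per English episode (placeholders)

de_emb = model.encode(german_synopses, convert_to_tensor=True)
en_emb = model.encode(english_synopses, convert_to_tensor=True)

# Cosine-similarity matrix: rows = German episodes, cols = English episodes.
sim = util.cos_sim(de_emb, en_emb)
for i in range(sim.shape[0]):
    j = int(sim[i].argmax())
    print(f"German {i + 1} -> English {j + 1} (score {sim[i][j].item():.2f})")
```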

I agree with most of your post, but not this part. Historically, neural networks excel at classification. If you consider hallucination detection as a classification problem, at least detection should be possible, and detection is the first step toward correction.
You may construct the classification problem as: given this knowledge base, can that sentence be derived from it?

This would of course require the knowledge base to exist in the first place.
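As a sketch of that framing, an off-the-shelf natural language inference (NLI) model can score whether a sentence is entailed by a reference text (facebook/bart-large-mnli here, purely as one example of the genre):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def entailment_score(knowledge: str, claim: str) -> float:
    # "Can this sentence be derived from the knowledge base?" as NLI:
    # premise = knowledge base text, hypothesis = the generated sentence.
    inputs = tokenizer(knowledge, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1)[0]
    # This model's label order is [contradiction, neutral, entailment].
    return probs[2].item()

# A low score flags the claim as unsupported, i.e. a candidate hallucination.
print(entailment_score("Episode 1 aired in 1992.", "Episode 1 aired in 2005."))
```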