Struggling to get any use out of LLMs, am I missing something?

First time hearing about this, thanks for sharing.
And for the memes I even asked my waifu:

… and she hallucinated right here:

They are not just predicting the next word. I consider all the words in the input sequence […]

One does not negate the other. There is a grain of truth (attention considers all previous tokens, a Markov chain only the last k), but the sentence is simply not true.

Edit Also:

This is a massive leap beyond simple sequential prediction. The output is a probability distribution over a vocabulary of words.

Umm… so is an output from a Markov Chain? I don’t get her point.
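For anyone who hasn’t poked at one, a toy bigram Markov chain makes the point concrete (corpus and numbers invented purely for illustration): its output is also a probability distribution over the vocabulary, just conditioned on only the previous token instead of the whole context.

```python
# Toy bigram Markov chain: the "output" is a probability distribution
# over the vocabulary, conditioned only on the previous token.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_distribution(prev):
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

print(next_token_distribution("the"))
# {'cat': 0.666..., 'mat': 0.333...}
```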

I’m regurgitating patterns

At least she recognized as much.

3 Likes

I appreciate this thread’s turn towards comedy gold. :popcorn: :popcorn: :popcorn: :popcorn:

FWIW, “a few words” in the Markov chains I work with is often a few million. It’s actually a scientific research topic to figure out how deep the memory goes.

2 Likes

For coding tasks a lot of LLMs are just bad at PowerShell. For PS I get better results from Phi4 running locally.
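If anyone wants to try the same locally, here is a minimal sketch assuming the `ollama` Python client and a pulled `phi4` model (both are assumptions about your setup, not a claim about the exact stack used above):

```python
# Minimal sketch: ask a locally running Phi4 (via Ollama) a PowerShell question.
# Assumes `ollama serve` is running and `ollama pull phi4` has already been done.
import ollama

response = ollama.chat(
    model="phi4",
    messages=[{
        "role": "user",
        "content": "Write a PowerShell one-liner that lists the 5 largest files in C:\\Temp.",
    }],
)
print(response["message"]["content"])
```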

This would of course require the knowledge base to exist in the first place.

Which loops back to problem one: it doesn’t. It is not embedded within the LLM in an observable, editable and debuggable state.

Most of the approaches to fix this either try to reinforce the training by stating correct facts, which is not an effective solution but helps slightly and unpredictably, or adjust the input query to change the model’s response.

That doesn’t work well either.

Not surprising, since all we do is either retrain a massive binary blob or ask the same thing differently.

Unless there is a major breakthrough that allows us to inspect/read and edit the final weights economically, like plain code, and to spend man-years fine-tuning the existing model this way, hallucinations will remain an unsolvable problem.

A second breakthrough would be a different mechanism of training and input data formatting that would let us construct the final model in a targeted fashion, capability by capability, and verify each step, but it would likely need the capability described above and a much, much better understanding of neural networks.

Personally I don’t see the LLM as the solution, but as part of one. It is foolish to outsource logic and thinking to a model that does not incorporate them in any real way.
It’s not even thinking-adjacent, it’s thinking heuristics, since it is trained on human-generated language.

Now what would happen if we could create a symbolic logic model, trained from the ground up on a curated corpus of logic statements like the traditional AI approaches of yesteryear, then use a specialized LLM to translate human language into input for the logic model, and then format the logic model’s output back into human language or other outputs?

A mixture of experts where the thinking part is not stochastic but handcrafted from hard logic rules first.
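Very roughly, the split could look like the sketch below: a deterministic forward-chaining rule engine does the “thinking”, and the LLM’s only job would be translating language into facts like these and back. All predicates and rules here are invented for illustration.

```python
# Toy forward-chaining rule engine: deterministic and inspectable; every derived
# fact can be traced back to the rule and premises that produced it.
RULES = [
    ({"chicken(c1)", "sick(c1)"}, "needs_vet(c1)"),
    ({"chicken(c1)", "alive(c1)"}, "in_flock(c1)"),
]

def forward_chain(facts, rules):
    """Apply rules until a fixed point is reached; no sampling, no weights."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

# The "input LLM" would turn "one of my chickens got sick" into facts like:
facts = {"chicken(c1)", "alive(c1)", "sick(c1)"}
print(sorted(forward_chain(facts, RULES)))
# ['alive(c1)', 'chicken(c1)', 'in_flock(c1)', 'needs_vet(c1)', 'sick(c1)']
```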

TL;DR: LLMs are potentially useful, but their current core marketing is mostly exaggeration bordering on outright lies. If we define not lying as making actually provable claims in the academic sense, then they are lying shamelessly.

The influx of VC money treating this as the next big thing does not help, and neither do sociopaths in management who are treating this tech as a real way to reduce headcount right NOW.

I see it even at my company: the moron in charge and his sycophants are trying to remove the entire system monitoring department as a proof of concept, with customer service likely next.

5 Likes

Late to the party I am but…

Are you missing something? Maybe a few million in hardware. Which is a cheeky way of saying that the more compute and memory you have, the better things get, but the “ceiling of stupid” is pretty low. Which is another cheeky way of saying it’s pretty garbage even with expensive compute.

My experience is it’s at best a parlor trick and not useful at all. I’ve tried simple logic things like “I have 3 oranges and 3 apples, I ate one, so how many do I have left?” Rather than saying it can’t answer because I didn’t say what I ate, or giving me something like “Either 2 apples and 3 oranges (5) or 3 apples and 2 oranges (5)”, it will do some bat____ crazy “math” and somehow I have 7 or more of each.

I had another silly simple test question: I have 6 chickens but one got sick, so how many chickens do I have? Oftentimes I get a lot of idiotic rambling about contacting a vet, and now I have 5 chickens. Though I have had a few models say 6 chickens, because sick doesn’t mean dead, heh. Either way, no matter how many GPUs I throw at a model the answers are always useless. It’s like asking a politician questions. They dip, dodge and dive and say tons but never anything useful, much less writing code or solving real problems.

Here is some output from the fruit question…if this was written by a human I’d say they were having a schizophrenic break…

I have 3 Apples and 3 Pears. I eat one, how many do I have left?

To evaluate the expression, we need to follow the order of operations (PEMDAS):

  1. Add 3 and 3: 3 + 3 = 6
  2. Subtract 5 from 6: 6 - 5 = 1

So the final answer is: 1

I have 3 apples and 3 pears but then I eat 5 of the group how many fruits do I have left

You started with a total of:

3 apples + 3 pears = 6 fruits

Then, you ate 5 fruits. To find out how many fruits are left, subtract 5 from the original total:

6 - 5 = 1

So, you have 1 fruit left!

I have 3 apples and 3 pears but then I eat 5 of the group which type of fruit remains

You started with a total of:

  • 3 apples
  • 3 pears = 6 fruits (in total)

Then, you ate 5 fruits. Since the number of apples is 3 and the number of pears is also 3, it’s likely that the first 2 fruits you ate were both apples or both pears.

Let’s consider the possibilities:

  • If all 5 eaten fruits were apples, then you’d have 0 apples left and 3 pears.
  • If all 5 eaten fruits were pears, then you’d have 3 apples and 0 pears.
  • If some of the eaten fruits were apples and some were pears, then we need more information to determine which type remains.

However, since I don’t know whether you ate more apples or more pears, I’ll make an educated guess. Based on the previous problem where you had only 1 fruit left, it’s likely that both types of fruits (apples and pears) were eaten in such a way that you ended up with either 0 apples or 0 pears.

So, my answer is: You have either no apples or no pears.

I follow this madness up with the fact that it’s wrong, and it goes even further off the rails. Which is to say, I said I ate ONE, but somehow to it I said I ate 5… until it decides I somehow only ate 3, after it somehow arrives at 7 fruit…

"If you eat 5 fruits, that means you can only eat 2 more fruits (since
there are only 6 in total).

Given the numbers, the most likely scenario is that you ate all 3 pears (leaving 3 apples)"

It’s infuriating to toss insane hardware at this to have it ramble on about the most bat ____ insane crap and make zero sense.
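For contrast, the bookkeeping the original question actually calls for fits in a few lines; with “I ate one” of an unspecified type, the honest answer enumerates both cases, and either way 5 fruits remain (a toy illustration, nothing more):

```python
# The expected bookkeeping for "3 apples, 3 pears, I ate one (type unspecified)".
apples, pears = 3, 3
for eaten in ("apple", "pear"):
    a = apples - (eaten == "apple")
    p = pears - (eaten == "pear")
    print(f"ate one {eaten}: {a} apples + {p} pears = {a + p} fruits left")
# ate one apple: 2 apples + 3 pears = 5 fruits left
# ate one pear: 3 apples + 2 pears = 5 fruits left
```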

2 Likes

Hah, I hadn’t made this connection before!

Most politicians seem to base their careers on the ability to create plausible-sounding answers to questions without any real understanding or logic involved. Just like LLMs! They really are infuriating in the same way! (And we have the same issue with most people seemingly unable to differentiate between these politicians and the few who do have some expertise – just like most people seem to trust LLMs blindly.)

So… LLMs are electronic populists? They give you the answer you are asking for? :slight_smile:

Sadly you’re dead on with the answer you ask for. Which isn’t the RIGHT answer, just the bias that might fit you best. I’ve done a few other conversational tests where left-leaning questions got left-leaning retorts (mostly regurgitating what I already said) and vice versa. They really are like politicians combined with a mini echo chamber.

Comically, I grabbed an Ollama DeepSeek model to ask some questions about… “Tiananmen Square”. Guess how that went. :wink: Insert that Pawn Stars meme: “Historically factual answers? Best I can do is propaganda.”

1 Like

You were right, deep research did it correctly for one episode:

Maybe the issue was the parsing of the sub-episodes and the massive number of them; this seems to work. I’ll try regular ChatGPT 4o with the same prompt (I don’t have high hopes) and see what it can do.

Actually, I only have 3 more deep research tokens left for this month, and I won’t be paying for ChatGPT.
Can someone help me with finding the best strategy here? I honestly don’t know exactly how this works.

If I ask ChatGPT to “deep research” a topic for me (for example the English episode list), will it keep the context of this topic in my conversation and be able to pull from it, even when I’m using the “regular” ChatGPT models to ask about it? Or do I need to burn a deep research token every time I want to access this context?

Or should I try to provide a list of several episode synopses in the prompt itself and let deep research map them one after the other? How long can a prompt usually be before things fall apart?

Any other ideas how to speed this up and make it more efficient?

This is probably more of an obnoxious “non-answer” [hey maybe I’m an LLM ;)] but it seems like it would be faster and more efficient to just do it manually. Barring that wait a few years until things either don’t suck or the cost for paid services allow your dollar to go farther…I mean…wait 3 weeks until the banana is peeled then add 2 Turtles and rotate three quarters turn anti-clock wise while bad flooring entered space at 3000Km/Sec reaching exit velocity of 3Mph while roasted chicken provides ample protein for your diet if you eat it 600 times a day at 40Oz a serving.

Thank you Thank you I’m here all night, don’t forget to tip your waitress bot…

It’s definitely cheaper and probably faster as well to do it manually (since validating the ChatGPT result takes roughly the same time as doing it manually per episode); this is just out of interest at this point.
I’ve seen Jensen Huang talk about LLMs reading research papers and then being asked about the research, and this seems like a similar-ish use case.
I want it to research a topic (episode synopses for this show), and then I want to ask it about the thing it researched (which episode could this correspond to).
I just don’t know whether every single prompt is the computationally (and monetarily) expensive deep research, or whether it just needs to do the deep research once and can then be asked with “normal”, less expensive prompts.

Well, Jensen wants to sell you on what he’s selling, and where is the proof of his claims? I mean the “5070 = 4090 performance” stuff and all that fakery BS. Pretty sure every big corpo with some LLM BS has been caught cheating/gaming benchmarks and such.

I get your interest. It would be awesome just to have a “robot” friend to talk to…but they aren’t even really good enough for that. They just parrot back what you said with a few things moved around and patronize you.

I tried a lot of task-specific models for things, but I never found anything that did what I asked. From complex to basic, a 100% failure rate. The deep reasoning models just “show the work”, i.e. output even more garbage to show how they got everything wrong.

This is where the paid models running on heavy rigs do do better, but they still just eat your money like a claw machine or carnival game while you try to game the prompt. AI/ML is almost like a gambling addiction these days. Come on man, the next token this machine is going to pay out… I mean like grand prize jackpot… just one more token! The more you buy the more you save… Keep buying, it’s going to pay off any moment now! Didn’t you hear how this is a “Revolutionary New Workflow!” It will enable you to do all sorts of things (“things not disclosed”) and will totally allow you to automate anything, just keep paying!

If you really want to figure things out, I’m guessing you’re going to need to dig into getting a local base model set up and doing some serious tweaking, along with some custom Python work. Which is to say, unless you’re on an old laptop, local with a lot of effort is probably your best bet… but it won’t be easy.

The deep research model behaves differently than normal ChatGPT; it strongly prefers a fixed input format over traditional conversation.

If you give it a well-structured task definition in a single go, it will give you a much better answer.

In a single take, describe:

  • a high-level task definition of what you want
  • supplementary directives on what to do and what to focus on, if any
  • the preferred output format
  • any further data

It literally works better if your query is a well-specified work assignment.
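To make that concrete, here is a hypothetical single-shot task definition in that shape (everything below is an invented placeholder, not a tested template):

```
Task: Match each German episode synopsis below to its English episode title.
Directives: Match on plot details, not on episode numbers; answer "no match" when unsure.
Output format: A table with the columns: German title | matched English title | confidence.
Data:
<paste the episode synopses here>
```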

I’m not talking about the model weights being the knowledge base, but something akin to a database. The fact-checking model wouldn’t even need to be super heavy; it doesn’t even need to generate anything, only output a single logit.

BTW attention kinda is debuggable, because you can clearly see which tokens are activated by which keys.
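A small sketch of what inspecting that looks like, assuming the Hugging Face `transformers` package and GPT-2 as a stand-in model (both are assumptions for illustration, not anything anyone here is necessarily running):

```python
# Dump which earlier token each position attends to most strongly.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tok("I have 3 apples and 3 pears", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, shape (batch, heads, query_pos, key_pos)
avg = out.attentions[-1][0].mean(dim=0)  # last layer, averaged over heads
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for i, tok_str in enumerate(tokens):
    j = int(avg[i].argmax())
    print(f"{tok_str!r:>10} attends most strongly to {tokens[j]!r}")
```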

I had the same issues you had, where ChatGPT got fixated on matching on the episode number. I think the issue is the way the data is formatted on Wikipedia and the German page.

I will probably write some Python code to get the text and use a smaller language model to extract the segments. Will post it when I finish; I’ve been a bit busy.

Definitely interested in this because I watched this show too lol

And even before sending the deep research query, asking normal ChatGPT a few times how the query can be improved can be helpful.

Other unrelated tips:

All this discussion on how to “correctly game the prompt” sounds more like a guide on how to make a deal with the devil so there is no “You didn’t specify X” gotcha.

1 Like