One thing I appreciated about this post, unlike a lot of AI-skeptic posts, is that it actually makes a concrete falsifiable prediction; specifically, "LLMs will never manage to deal with large code bases 'autonomously'". So in the future we can look back and see whether it was right.
For my part, I'd give 80% confidence that LLMs will be able to do this within two years, without fundamental architectural changes.
"Deal with" and "autonomously" are doing a lot of heavy lifting there. Cursor already does a pretty good job indexing all the files in a code base in a way that lets it ask questions and get answers pretty quickly. It's just a matter of where you set the goalposts.
Cursor fails miserably for me even just trying to replace function calls with method calls consistently, as I said in the post. This, I would hope, is fixable. By "dealing autonomously" I mean: you don't need a programmer - a PM talks to an LLM and that's how the code base is maintained, and this happens a lot (rather than in one or two famous cases where it's pretty well known how they are special and different from most work).
By "large" I mean 300K lines (strong prediction), or 10 times the context window (weaker prediction)
I don't shy away from looking stupid in the future, you've got to give me this much
I'm pretty sure you can do that right now in Claude Code with the right subagent definitions.
(For what it's worth, I respect and greatly appreciate your willingness to put out a prediction based on real evidence and your own reasoning. But I think you must be lacking experience with the latest tools & best practices.)
If you're right, there will soon be a flood of software teams with no programmers on them - either across all domains, or in some domains where this works well. We shall see.
Indeed I have no experience with Claude Code, but I use Claude via chat, and it fails all the time on things not remotely as hard as orienting itself in a large code base. Claude Code is the same thing with the ability to run tools. Of course tools help to ground its iterations in reality, but I don't think that's a panacea absent a consistent ability to model the reality you observe through your use of tools. Let's see...
I was very skeptical of Claude Code but was finally convinced to try it, and it does feel very different to use. I made three hobby projects in a weekend that I had put off for years due to "it's too much hassle to get started" inertia. Two of the projects it did very well with; the third I had to fight with it, and it's still subtly wrong (SwiftUI animations and Claude Code are seemingly not a good mix!)
That being said, I think your analysis is 100% correct. LLMs are fundamentally stupid beyond belief :P
> SwiftUI animations and Claude Code are seemingly not a good mix
Where is the corpus of SwiftUI animations to teach Claude what probable soup you want regurgitated?
Hypothesis: iOS devs don't share their work openly for reasons associated with how the App Store ecosystem (mis)behaves.
Relatedly, the models don't know about Swift 6 except from maybe the mid-2024 WWDC announcements. It's worth feeding them your own context. If you are on Swift 5.10, great. If you want to ship iOS 26 changes, wait until 2026 or, again, roll your own context.
In my case the big issue seems to be that if you hide a component in SwiftUI, it's by default animated with a fade. This is not shown in the API surface area at all.
> Indeed I have no experience with Claude Code, but I use Claude via chat...
These are not even remotely similar, despite the name. Things are moving very fast, and the sort of chat-based interface that you describe in your article is already obsolete.
Claude is the LLM model. Claude Code is a combination of internal tools for the agent to track its goals, current state, priorities, etc., and a looped mechanism for keeping it on track, focused, and debugging its own actions. With the proper subagents it can keep its context from being poisoned from false starts, and its built-in todo system keeps it on task.
Really, try it out and see for yourself. It doesn't work magic out of the box, and absolutely needs some hand-holding to get it to work well, but that's only because it is so new. The next generation of tooling will have these subagent definitions auto selected and included in context so you can hit the ground running.
We are already starting to see a flood of software coming out with very few active coders on the team, as you can see on the HN front page. I say "very few active coders" not "no programmers" because using Claude Code effectively still requires domain expertise as we work out the bugs in agent orchestration. But once that is done, there aren't any obvious remaining stumbling blocks to a PM running a no-coder, all-AI product team.
Claude Code isn't an LLM. It's a hybrid architecture where an LLM provides the interface and some of the reasoning, embedded inside a broader set of more or less deterministic tools.
It's obvious LLMs can't do the job without these external tools, so the claim above - that LLMs can't do this job - is on firm ground.
But it's also obvious these hybrid systems will become more and more complex and capable over time, and there's a possibility they will be able to replace humans at every level of the stack, from junior to CEO.
If that happens, it's inevitable these domain-specific systems will be networked into a kind of interhybrid AGI, where you can ask for specific outputs, and if the domain has been automated you'll be guided to what you want.
It's still a hybrid architecture though. LLMs on their own aren't going to make this work.
It's also short of AGI, never mind ASI, because AGI requires a system that would create high quality domain-specific systems from scratch given a domain to automate.
Nearly every definition I’ve seen that involves AGI (there are many) includes the ability to self learn and create “novel ideas”. The LLM behind it isn’t capable of this, and I don’t think the addition of the current set of tools enables this either.
Artificial general intelligence was a phrase invented to draw distinction from “narrow intelligence” which are algorithms that can only be applied to specific problem domains. E.g. Deep Blue was amazing at playing chess, but couldn’t play Go much less prioritize a grocery list. Any artificial program that could be applied to arbitrary tasks not pre-trained on is AGI. ChatGPT and especially more recent agentic models are absolutely and unquestionably AGI in the original definition of the term.
Goalposts are moving though. Through the efforts of various people in the rationalist-connected space, the word has since morphed to be implicitly synonymous with the notion of superintelligence and self-improvement, hence the vague and conflicting definitions people now ascribe to it.
Also, fwiw, the training process behind the generation of an LLM is absolutely able to discover new and novel ideas, in the same sense that Kepler's laws of planetary motion were new and novel if all you had were Tycho Brahe's astronomical observations. Inference can tease out these novel discoveries, if nothing else. But I suspect that your definition of creative and novel would also exclude human creativity if it were rigorously applied: our brains, after all, are merely remixing our own experiences too.
I feel like refutations like this (you aren't using the tool right | you should try this other tool) pop up often but are fundamentally worthless, because as long as you're not showing code you might as well be making it up. The blog post gives examples of clear failures that anyone can reproduce by themselves; I think it's time vibe-code defenders are held to the same standard.
The very first example is that LLMs lose their mental model of chess when playing a game. Ok, so instead ask Claude Code to design an MCP for tracking chess moves, and vibe code it. That’s the very first thing that comes to mind, and I expect it would work well enough.
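The deterministic core of such a tool is tiny; here is a rough sketch of what I have in mind, using the python-chess library (the class name and the MCP wiring around it are my own guesses, not anything Claude produced):

```python
# Sketch of a "chess move tracker" tool an agent could call; MCP plumbing omitted.
import chess

class MoveTracker:
    def __init__(self):
        self.board = chess.Board()  # standard starting position

    def play(self, move_san: str) -> str:
        """Apply a move in SAN (e.g. 'Nf3'); reject illegal or malformed moves."""
        try:
            self.board.push_san(move_san)
        except ValueError:
            return f"Illegal move: {move_san}"
        return self.board.fen()  # current position, handed back to the LLM as context

    def legal_moves(self) -> list[str]:
        """List legal moves so the LLM never has to guess them."""
        return [self.board.san(m) for m in self.board.legal_moves]
```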
"LLM" as well, because coding agents are already more than just an LLM. There is very useful context management around it, and tool calling, and ability to run tests/programs, etc. Though they are LLM-based systems, they are not LLMs.
The author would still be wrong in the tool-calling scenario. There are already perfect (or at least superhuman) chess engines. There is no perfect "coding engine". LLMs + tools being able to reliably work on large codebases would be a new thing.
Correct - as long as the tools the LLM uses are non-ML-based algorithms existing today, and it operates on a large code base with no programmers in the loop, I would be wrong. If the LLM uses a chess engine, then it does nothing on top of the engine; similarly if an LLM will use another system adding no value on top, I would not be wrong. If the LLM uses something based on a novel ML approach, I would not be wrong - it would be my "ML breakthrough" scenario. If the LLM uses classical algorithms or an ML algo known today and adds value on top of them and operates autonomously on a large code base - no programmer needed on the team - then I am wrong
« Autonomously »? What happens when subtle updates that are not bugs change the meaning of some features and break the workflow in some external parts of a client's system? It happens all the time, and because it's really hard to have the whole meaning and all the business rules written down and kept up to date, an LLM might never be able to grasp some of the meaning.
Maybe if, instead of developing code and infrastructure, the whole industry shifted toward only writing impossibly precise spec sheets that make meaning and intent crystal clear, then « autonomously » might be possible to pull off.
Not exactly. It depends on how the software is written and whether there are ADRs in the project. I had to work on projects where there were bugs because someone coded business rules in a very bad and unclear way. You move an if somewhere and something breaks somewhere else. When software is not clear enough - and often it isn't, because we have to go fast - you ask « is this condition the way it's supposed to work, or is it a bug » and get people to confirm the rule.
My point is this: amazingly written software surely works best with LLMs. That's not how most software is written right now, because businesses sometimes value speed over engineering (or it's a lack of skills).
Right: software is not necessarily a sufficiently-clear specification, but a sufficiently-clear specification would be software – and you've correctly identified that a good part of your job is ensuring the software provides a sufficiently-clear specification.
Amazingly-written software is necessary for LLMs to work well, but it isn't sufficient: LLMs tend to make nonsensical changes that, while technically implementing what they're asked to do (much of the time), reduce the quality of the software. As this repeats, the LLMs become less and less able to modify the program. This is because they can't program: they can translate, plagiarise, and interpolate, but they're missing several key programming skills, and probably cannot learn them.
In two years there will probably be no new 'autonomous' LLMs. They will most likely be integrated into 'products', trained and designed for this. We see the beginning of it today with agents and tools.
The whole of modern science is based on the idea that we can never prove a theory about the world to be true, but that we can provide experiments which allow us to show that some theories are closer to the truth than others.
Eh, if the hypothesis remains unfalsified for longer and longer, we can have increased confidence.
Similarly, Newton's laws say that bodies always stay at rest unless acted upon by a force. Strictly speaking, if a billiard ball jumped up without cause tomorrow, that would disprove Newton. So we'd have to wait an infinite amount of time to prove Newton right.
However no one has to wait so long, and we found ways to express how Newton's ideas are _better_ than those of Aristotle without waiting an eternity.
The post is based on a misconception. If you read the blog post linked at the end of this message, you'll see how a very small GPT-2-like transformer (Karpathy's nanoGPT trained at a very small size), after seeing just PGN games and nothing more, develops an 8x8 internal representation of which chess piece is where. This representation can be extracted by linear probing (and can even be altered by using the probe in reverse). LLMs are decent but not very good chess players for other reasons, not because they don't have a world model of the chess board.
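For anyone unfamiliar with the technique: linear probing means freezing the transformer, collecting its hidden activations, and fitting a simple linear classifier per square to predict the piece standing there. A minimal sketch of the idea (the arrays here are random placeholders, not the actual setup from that post):

```python
# Minimal illustration of linear probing for board state.
# `activations` stands in for (n_positions, d_model) hidden states from the
# frozen chess transformer; `piece_on_e4` stands in for per-position labels
# (0 = empty, 1..12 = piece type) for one square.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(5000, 512))    # placeholder for real activations
piece_on_e4 = rng.integers(0, 13, size=5000)  # placeholder for real labels

X_tr, X_te, y_tr, y_te = train_test_split(activations, piece_on_e4, test_size=0.2)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # one probe per square
print("probe accuracy:", probe.score(X_te, y_te))
# High accuracy on *real* activations is the evidence that board state is
# linearly decodable from the model's internals; random data stays near chance.
```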
Ironically, that lesswrong article is more wrong than right.
First, chess is perfect for such modeling. The game is basically a tree of legal moves. The "world model" representation is already encoded in the dataset itself, and at a certain scale the chance of making an illegal move is minimal, as the dataset includes an insane number of legal moves compared to illegal ones - let alone when you are training on a chess dataset like a PGN one.
Second, the probing is quite... a subjective thing.
We are cherry-picking activations across an arbitrary number of dimensions, on a model specifically trained for chess, taking these arbitrary representations and displaying them on a 2D graph.
Well yeah, with enough dimensions and cherry-picking, we can also show how "all zebras are elephants, because all elephants are horses and look their weights overlap in so many dimensions - large four-legged animals you see on safari!" - especially if we cherry-pick it. Especially if we tune a dataset on it.
This shows nothing other than "training LLMs on a constrained move dataset makes LLM great at predicting next move in that dataset".
And if it knew every possible board configuration and the optimal move, it could potentially do well; but if instead it were to just recognize "this looks like a chess game" and use an optimized tool to determine the next move, that would seem to be a better use of training.
The post or rather the part you refer to is based on a simple experiment which I encourage you to repeat. (It is way likelier to reproduce in the short to medium run than the others.)
From your link: "...The first was gpt-3.5-turbo-instruct's ability to play chess at 1800 Elo"
These things don't play at 1800 Elo - though maybe someone measured that Elo without cheating, relying instead on some artifact of how an engine told to play at a low rating does against an LLM (engines are weird when you ask them to play badly, as a rule); a good start to a decent measurement would be to try it on Chess960. These things do lose track of the pieces within 10 moves. (As do I absent a board to look at, but I understand enough to say "I can't play blindfold chess, let's set things up so I can look at the current position somehow".)
This essay could probably benefit from some engagement with the literature on “interpretability” in LLMs, including the empirical results about how knowledge (like addition) is represented inside the neural network. To be blunt, I’m not sure being smart and reasoning from first principles after asking the LLM a lot of questions and cherry picking what it gets wrong gets to any novel insights at this point. And it already feels a little out date, with LLMs getting gold on the mathematical Olympiad they clearly have a pretty good world model of mathematics. I don’t think cherry-picking a failure to prove 2 + 2 = 4 in the particular specific way the writer wanted to see disproves that at all.
LLMs have imperfect world models, sure. (So do humans.) That's because they are trained to be generalists and because their internal representations of things are massively compressed, since they don't have enough weights to encode everything. I don't think this means there are some natural limits to what they can do.
Your being blunt is actually very kind, if you're describing what I'm doing as "being smart and reasoning from first principles"; and I agree that I am not saying something very novel, at most it's slightly contrarian given the current sentiment.
My goal is not to cherry-pick failures for its own sake as much as to try to explain why I get pretty bad output from LLMs much of the time, which I do. They are also very useful to me at times.
Let's see how my predictions hold up; I have made enough to look very wrong if they don't.
Regarding "failure disproving success": it can't, but it can disprove a theory of how this success is achieved. And, I have much better examples than the 2+2=4, which I am citing as something that sorta works these says
Your LLM output seems abnormally bad, like you are using old models, bad models, or intentionally poor prompting. I just copied and pasted your Krita example into ChatGPT and got a reasonable answer, nothing like what you paraphrased in your post.
I imagine people give up silently more often than they write a well syndicated article about it. The actual adoption and efficiencies we see in enterprises will be the most verifiable data on if LLMs are generally useful in practice. Everything so far is just academic pontificating or anecdata from strangers online.
However, I'm not completely sure. Eg object oriented programming was basically a useless fad full of empty, never-delivered-on promises, but software companies still lapped it up. (If you happen to like OOP, you can probably substitute your own favourite software or wider management fad.)
Another objection: even an LLM with limited capabilities and glaring flaws can still be useful for some commercial use-cases. Eg the job of first-line call centre agents who aren't allowed to deviate from a fixed script can be reasonably automated with even a fairly bad LLM.
Will it suck occasionally? Of course! But so does interacting with the humans placed into these positions without authority to get anything done for you. So if the bad LLM is cheaper, it might be worthwhile.
This. I think we've about reached the limit of the usefulness of anecdata "hey I asked an LLM this, this and this" blog posts. We really need more systematic, large-scale data and studies on the latest models and tools - the recent one on Cursor (which had mixed results) was a good start, but it was carried out before Claude Code was even released, i.e. prehistoric times in terms of AI coding progress.
For my part I don’t really have a lot of doubts that coding agents can be a useful productivity boost on real-world tasks. Setting aside personal experience, I’ve talked to enough developers at my company using them for a range of tickets on a large codebase to know that they are. The question is more, how much: are we talking a 20% boost, or something larger, and also, what are the specific tasks they’re most useful on. I do hope in the next few years we can get some systematic answers to that as an industry, that go beyond people asking LLMs random things and trying to reason about AI capabilities from first principles.
I think it's hard to take any LLM criticism seriously if they don't even specify which model they used. Saying "an LLM model" is totally useless for deriving any kind of conclusion.
When talking about the capabilities of a class of tools long term, it makes sense to be general. I think deriving conclusions at all is pretty difficult given how fast everything is moving, but there is some realities we do actually know about how LLMs work and we can talk about that.
Knowing that ChatGPT output good tokens last tuesday but Sonnet didn't does not help us know much about the future of the tools on general.
I've seen plenty of blunders, but in general it's better than their previous models.
Well, it depends a bit on what you mean by blunders. But eg I've seen it confidently assert mathematically wrong statements with nonsense proofs, instead of admitting that it doesn't know.
I mean yeah, it’s a good essay in that it made me think and try to articulate the gaps, and I’m always looking to read things that push back on AI hype. I usually just skip over the hype blogging.
I think my biggest complaint is that the essay points out flaws in LLM’s world models (totally valid, they do confidently get things wrong and hallucinate in ways that are different, and often more frustrating, from how humans get things wrong) but then it jumps to claiming that there is some fundamental limitation about LLMs that prevents them from forming workable world models. In particular, it strays a bit towards the “they’re just stochastic parrots” critique, e.g. “that just shows the LLM knows to put the words explaining it after the words asking the question.” That just doesn’t seem to hold up in the face of e.g. LLMs getting gold on the Mathematical Olympiad, which features novel questions. If that isn’t a world model of mathematics - being able to apply learned techniques to challenging new questions - then I don’t know what is.
A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the alpha zero chess engine was trained, in fact.
I do think there’s a lot that the essay gets right. If I was to recast it, I’d put it something like this:
* LLMs have imperfect models of the world, conditioned by how they're trained on next-token prediction.
* We've shown we can drastically improve those world models for particular tasks by reinforcement learning. You kind of allude to this already by talking about how they've been "flogged" to be good at math.
* I would claim that there’s no particular reason these RL techniques aren’t extensible in principle to beat all sorts of benchmarks that might look unrealistic now. (Two years ago it would have been an extreme optimist position to say an LLM could get gold on the mathematical Olympiad, and most LLM skeptics would probably have said it could never happen.)
* Of course it’s very expensive, so most world models LLMs have won’t get the RL treatment and so will be full of gaps, especially for things that aren’t amenable to RL. It’s good to beware of this.
I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. (If an LLM was given tons of RL training on that codebase, it could build a better world model, but that’s expensive and very challenging to set up.) This problem is hinted at in your essay, but the lack of on-the-job learning isn’t centered. But it’s the real elephant in the room with LLMs and the one the boosters don’t really have an answer to.
I'm not saying that LLMs can't learn about the world - I even mention how they obviously do it, even at the learned embeddings level. I'm saying that they're not compelled by their training objective to learn about the world and in many cases they clearly don't, and I don't see how to characterize the opposite cases in a more useful way than "happy accidents."
I don't really know how they are made "good at math," and I'm not that good at math myself. With code I have a better gut feeling of the limitations. I do think that you could throw them off terribly with unusual math questions to show that what they learned isn't math, but I'm not the guy to do it; my examples are about chess and programming, where I am more qualified. (You could say that my question about the associativity of blending and how caching works sort of shows that it can't use the concept of associativity in novel situations; not sure if this can be called an illustration of its weakness at math.)
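For context, that blending question hinges on the fact that the "over" operator is associative when colors are premultiplied by alpha, which is exactly what justifies caching a pre-composited group of layers. A toy check of that fact (my own illustration here, not the prompt I gave the LLMs):

```python
# Toy check that "over" compositing is associative with premultiplied alpha,
# which is what makes caching a pre-composited group of layers legitimate.
import numpy as np

def over(top, bottom):
    """top, bottom: (r, g, b, a) with premultiplied color channels."""
    c_t, a_t = np.asarray(top[:3]), top[3]
    c_b, a_b = np.asarray(bottom[:3]), bottom[3]
    return (*(c_t + (1 - a_t) * c_b), a_t + (1 - a_t) * a_b)

A = (0.4, 0.1, 0.1, 0.5)   # premultiplied RGBA layers
B = (0.0, 0.3, 0.1, 0.6)
C = (0.2, 0.2, 0.2, 1.0)

left  = over(over(A, B), C)   # composite strictly top-down
right = over(A, over(B, C))   # cache B-over-C first, then apply A
print(np.allclose(left, right))  # True: the cached intermediate is safe to reuse
```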
>LLMs are not "compelled" by the training algorithms to learn symbolic logic.
I think "compell" is such a human trait that machine will never replicate to the T.
The article did mention specifically about this very issue:
"And of course people can be like that, too - eg much better at the big O notation and complexity analysis in interviews than on the job. But I guarantee you that if you put a gun to their head or offer them a million dollar bonus for getting it right, they will do well enough on the job, too. And with 200 billion thrown at LLM hardware last year, the thing can't complain that it wasn't incentivized to perform."
If it's not already evident: an LLM is in itself a limited stochastic AI tool by definition, and its distant cousins are deterministic logic, optimization, and constraint programming [1],[2],[3]. Perhaps one of the two breakthroughs the author was predicting will be in this deterministic domain, in order to assist the LLM, and it will be a hybrid approach rather than purely LLM.
[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:
It’s not just on the job learning though. I’m no AI expert, but the fact that you have “prompt engineers” and AI doesn’t know what it doesn’t know, gives me pause.
If you ask an expert, they know the bounds of their knowledge and can understand questions asked to them in multiple ways. If they don’t know the answer, they could point to someone who does or just say “we don’t know”.
LLMs just lie to you and we call it “hallucinating“ as though they will eventually get it right when the drugs wear off.
> I’m no AI expert, but the fact that you have “prompt engineers” [...] gives me pause.
Why? A bunch of human workers can get a lot more done with a capable leader who helps prompt them in the right direction and corrects oversights etc.
And overall, prompt engineering seems like exactly the kind of skill AI will be able to develop by itself. You already have a bit like this happening: when you ask Gemini to create a picture for you, then the language part of Gemini will take your request and engineer a prompt for the picture part of Gemini.
> A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the alpha zero chess engine was trained, in fact.
It's closer to AlphaGo, which first trained on expert human games and then 'fine tuned' with self-play.
AlphaZero specifically did not use human training data at all.
I am waiting for an AlphaZero style general AI. ('General' not in the GAI sense but in the ChatGPT sense of something you can throw general problems at and it will give it a good go, but not necessarily at human level, yet.) I just don't want to call it an LLM, because it wouldn't necessarily be trained on language.
What I have in mind is something that first solves lots and lots of problems, eg logic problems, formally posed programming problems, computer games, predicting of next frames in a web cam video, economic time series, whatever, as a sort-of pre-training step and then later perhaps you feed it a relatively small amount of human readable text and speech so you can talk to it.
Just to be clear: this is not meant as a suggestion for how to successfully train an AI. I'm just curious whether it would work at all and how well / how badly.
Presumably there's a reason why all SOTA models go 'predict human produced text first, then learn problem solving afterwards'.
> I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. [...]
Yes, I agree. But 'on-the-job' training is also such an obvious idea that plenty of people are working on making it work.
With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever. That addition (something which only takes a few gates in digital logic) happens to be overfit into a few nodes on multi-billion node networks is hardly a surprise to anyone except the most religious of AI believers.
The core issue there isn't that the LLM isn't building internal models to represent its world, it's that its world is limited to tokens. Anything not represented in tokens, or token relationships, can't be modeled by the LLM, by definition.
It's like asking a blind person to count the number of colors on a car. They can give it a go and assume glass, tires, and metal are different colors as there is likely a correlation they can draw from feeling them or discussing them. That's the best they can do though as they can't actually perceive color.
In this case, the LLM can't see letters, so asking it to count them causes it to try and draw from some proxy of that information. If it doesn't have an accurate one, then bam, strawberry has two r's.
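To make the "can't see letters" point concrete: the model receives token ids, not characters. A quick illustration with OpenAI's tiktoken library (the exact split depends on the tokenizer, so treat this as illustrative):

```python
# Illustration: what a tokenizer actually hands the model for "blueberry".
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("blueberry")
pieces = [enc.decode([i]) for i in ids]
print(ids, pieces)
# The model sees a handful of opaque ids, not nine characters, so "how many
# b's?" has to be answered from memorized associations rather than by looking.
```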
LLMs are able to encode geospatial relationships because they can be represented well by token relationships. Two countries that are close together will be talked about together much more often than two countries far from each other.
That is just not a solid argument. There are countless examples of LLMs splitting "blueberry" into "b l u e b e r r y", which would contain one token per letter. And then they still manage to get it wrong.
Your argument is based on a flawed assumption, that they can't see letters. If they didn't they wouldn't be able to spell the word out. But they do. And when they do get one token per letter, they still miscount.
> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.
Train your model on characters instead of on tokens, and this problem goes away. But I don't think this teaches us anything about world models more generally.
Actually I forgive them those issues that stem from tokenization. I used to make fun of them for listing datum as a noun whose plural form ends with an i, but once I learned how tokenization works, I no longer do it - it feels like mocking a person's intelligence because of a speech impediment or something... I am very kind to these things, I think
It’s a historical thing that people still falsely claim is true, bizarrely without trying it on the latest models. As you found, leading LLMs don’t have a problem with it anymore.
The question is, did these LLMs figure it out by themselves, or has someone programmed a specific routine to address this „issue“, to make them look smarter than they are?
On a trillion-dollar budget, you could just crawl the web for AI tests people came up with and solve them manually. We know it's a massively curated game. With that kind of money you can do a lot of things. You could feed every human on earth countless blueberries, for starters.
Calling an algorithm to count letters in a word isn’t exactly worth the hype tho is it?
The point is, we tend to find new ways these LLMs can’t figure out the most basic shit about the world. Horses can count. Counting is in everything. If you read every text ever written and still can’t grasp counting you simply are not that smart.
Some LLMs do better than others, but this still sometimes trips up even "frontier" non-reasoning models. People were showing this on this very forum with GPT-5 in the past couple days.
Of course they do stuff like that, otherwise it would look like they are stagnating. Fake it till you make it. Tho, at this point, the world is in deep shit, if they don’t make it…
My prediction is that this will be like the 2000 dot-com bubble. Both dot-com and AI are real and really useful technologies, but hype and share prices have got way ahead of them, so things will need to readjust.
A major economic crisis, yes. I think the web is already kinda broken because of AI, and it's going to get a lot worse. I also question its usefulness… Is it useful for solving any real problems, and if so, how long before we run out of those problems? Because we conflated a lot of bullshit with innovation right before AI. Right now people may be getting a slight edge, but it's like getting a dishwasher: once expectations adjust, things will feel like a grind again, and I really don't think people will like that new reality in regard to their experience of self-efficacy (which is important for mental health). I presume the struggle to get information, figuring things out yourself, may be a really important part of putting pressure toward process optimization and of learning and cognitive development. We may collectively regress there. With so many major crises, and a potential economic crisis on top, I am not sure we can afford to lose problem-solving capabilities to any extent. And I really, really don't think AI is worth the fantastical energy expenditure, waste of resources, and human exploitation, so far.
It depends on context. English is often not very precise and relies on implied context clues. And that's good. It makes communication more efficient in general.
To spell it out: in this case I suspect you are talking about English letter case? Most people don't care about case when they ask these questions, especially in an informal question.
> they clearly don't have any world model whatsoever
Then how did an LLM get gold on the mathematical Olympiad, where it certainly hadn’t seen the questions before? How on earth is that possible without a decent working model of mathematics? Sure, LLMs might make weird errors sometimes (nobody is denying that), but clearly the story is rather more complicated than you suggest.
> where it certainly hadn’t seen the questions before?
What are you basing this certainty on?
And even if you're right that the specific questions had not come up, it may still be that the questions from the math olympiad were rehashes of similar questions in other texts, or happened to correspond well to a composition of some other problems that were part of the training set, such that the LLM could 'pick up' on the similarity.
It's also possible that the LLM was specifically trained on similar problems, or may even have a dedicated sub-net or tool for it. Still impressive, but possibly not in a way that generalizes even to math like one might think based on the press releases.
Like the other reply said, each exam has entirely new questions which are of course secret until the test is taken.
Sure, the questions were probably in a similar genre as existing questions or required similar techniques that could be found in solutions that are out there. So what? You still need some kind of world model of mathematics in which to understand the new problem and apply the different techniques to solve it.
Are you really claiming that SOTA LLMs don’t have any world model of mathematics at all? If so, can you tell us what sort of example would convince you otherwise? (Note that the ability to do novel mathematics research is setting the bar too high, because many capable mathematics majors never get to that point, and they clearly have a reasonable model of mathematics in their heads.)
I think both the literature on interpretability and explorations of internal representations actually reinforce the author's conclusion. Internal-representation research tends to show that nets dealing with a single "model" don't necessarily share the same representation, and don't necessarily have a single representation.
And doing well on XYZ isn't evidence of a world model in particular. The point that these things aren't always using a world model is reinforced by systems being easily confused by extraneous information, even systems as sophisticated as those that can solve Math Olympiad questions. The literature has said "ad-hoc predictors" for a long time, and I don't think much has changed - except things do better on benchmarks.
And, humans too can act without a consistent world model.
> LLMs are not by themselves sufficient as a path to general machine intelligence; in some sense they are a distraction because of how far you can take them despite the approach being fundamentally incorrect.
I don't believe that it is a fundamentally incorrect approach. I believe that the human mind does something like that all the time; the difference is that our minds have some additional processes that can, for example, filter hallucinations.
Kids at a specific age range are afraid of their imagination. Their imagination can place a monster into any dark place where nothing can be seen. Adult mind can do the same easily, but the difference is kids have difficulties distinguishing imagination and perception, while adult generally manage.
I believe the ability of the human mind to tell imagination/hallucinations on the one hand from perception and memory on the other is not a fundamental thing stemming from the architecture of the brain but a learned skill. Moreover, people can be tricked into acquiring false memories[1]. If an LLM fell for the tricks of Elizabeth Loftus, we'd say the LLM hallucinated.
What LLMs need is to learn some tricks to detect hallucinations. Probably they will not get 100% reliable detector, but to get to the level of humans they don't need 100% reliability.
I have recently lived through something called a psychotic break, which was an unimaginably horrible thing, but it did let me see from the inside what insanity does to your thinking. And what's fascinating, coming out the other side of this, is how similar LLMs are to someone in psychosis. Someone in psychosis can have all the ability LLMs have to recognise patterns and sound like they know what they're talking about, but their brain is not working well enough to have proper self-insight, to be able to check their thoughts actually fully make sense. (And “making sense” turns out to be a sliding scale — it is not as if you just wake up one day suddenly fully rational again, there's a sliding scale of irrational thinking and you have to gradually re-process your older thoughts into more and more coherent shapes as your brain starts to work more correctly again.) I believe this isn't actually a novel insight either, many have worried about this for years! Psychosis might be an interesting topic to read about if you want to get another angle to understand the AI models from. I won't claim that it's exactly the same thing, but I will say that most people probably have a very undeveloped idea of what mental illness actually is or how it works, and that leaves them badly prepared for interacting with a machine that has a strong resemblance to a mentally ill person who's learned to pretend to be normal.
Thank you for sharing, and sorry you had to go through that. I had a good friend go through a psychotic break and I spent a long time trying to understand what was going on in his brain. The only solid conclusion I could come to was that I could not relate to what he was going through, but that didn’t change that he was obviously suffering and needed whatever support I could offer. Thanks for giving me a little bit of insight into his brain. Hope you were/are able to find support out there.
If we just take a panic attack: many people have no clue what it feels like, which is unfortunate, because they lack empathy for those who do experience it. My psychiatrists definitely need to experience it to understand.
Do you have many memories of that time, around 3 to 5, and remember what your cognitive processes were?
When the child is afraid of the monster in the dark, they are not literally visually hallucinating a beast in the dark; they are worried that there could be a beast in the dark, and they are not sure that there is due to a lack of sensory information confirming a lack of the monster. They are not being hyper precise because they are 3, so they say "there is a monster under my bed"! Children have instincts to be afraid of the dark.
Similarly with imaginary friends and play, it's an instinct to practice through smaller stakes simulations. When they are emotionally attached to their imaginary friends, it's much like they are emotionally attached to their security blanket. They know that the "friend" is not perceptible.
It's much like the projected anxieties of adults or teenagers, who are worried that everyone thinks they are super lame and thus act like people do, because on the balance of no information, they choose the "safer path".
That is pretty different than the hallucinations of LLMs IMO.
From my perspective, the fundamental problem arises from the assumption that all of the brain's functions are self-contained; however, there are feedback loops in the body which support the functions of the brain.
The simplest one is fight/flight/freeze. The brain starts the process by being afraid and hormones get released, but the next step is triggered by the nerve feedback coming from the body. If you are using beta-blockers and can't get panicked, the initial trigger fizzles and you return to your pre-panic state.
An LLM doesn't model a complete body. It just models language, which is a very small part of what the brain handles, so assuming that modelling language, or even the whole brain, is going to answer all the questions we have is a flawed approach.
The latest research shows the body is a much more complicated and interconnected system than we learnt in school 30 years ago.
Sure, your points about the body aren’t wrong, but (as you say) LLMs are only modelling a small subset of a brain’s functions at the moment: applied knowledge, language/communication, and recently interpretation of visual data. There’s no need or opportunity for an LLM (as they currently exist) to do anything further. Further, just because additional inputs exist in the human body (gut-brain axis, for example) it doesn’t mean that they are especially (or at all) relevant for knowledge/language work.
The point is that knowledge/language work can't work reliably unless it's grounded in something outside of itself. Without it you don't get an oracle, you get a superficially convincing but fundamentally unreliable idiot savant who lacks a stable sense of self, other, or real world.
The fundamental foundation of science and engineering is reliability.
If you start saying reliability doesn't matter, you're not doing science and engineering any more.
I'm really struggling to understand what you're trying to communicate here; I'm even wondering if you're an LLM set up to troll, due to the weird language and confusing non-sequiturs.
> The point is that knowledge/language can't work reliably unless it's grounded in something outside of itself.
Just, what? Knowledge is facts, somehow held within a system allowing recall and usage of those facts. Knowledge doesn't have a 'self', and I'm totally not understanding how pure knowledge as a concept or medium needs "grounding"?
Being charitable, it sounds more like you're trying to describe "wisdom" - which might be considered as a combination of knowledge, lived experience, and good judgement? Yes, this is valuable in applying knowledge more usefully, but has nothing to do with the other bodily systems which interact with the brain, which is where you started?
> The fundamental foundation of science and engineering is reliability.
> If you start saying reliability doesn't matter, you're not doing science and engineering any more.
No-one mentioned reliability - not you in your original post, or me in my reply. We were discussing whether the various (unconscious) systems which link to the brain in the human body (like the gut:brain axis) might influence its knowledge/language/interpretation abilities.
You probably know the Law of Archimedes. Many people do. But do you know it in the same way Archimedes did? No. You were told the law, then taught how to apply it. But Archimedes discovered it without any of that.
Can we repeat the feat of Archimedes? Yes, we can, but first we'd have to forget what we were told and taught.
The way we actually discover things is very different from amassing lots of hearsay. Indeed, we do have an internal part that behaves the same way an LLM does. But to get to real understanding we actually shut down that part, forget what we "know", and start from a clean slate. That part does not help us think; it helps us to avoid thinking. The reason it exists is that it is useful: thinking is hard and slow, but recalling is easy and fast. But it is not thinking; it is the opposite.
> But to get to the real understanding we actually shut down that part, forget what we "know", start from a clean slate.
Close, but not exactly. To start from a clean slate is not very difficult, the trick is to reject some chosen parts of existing knowledge, or more specifically the difficulty is to choose what to reject. Starting from a clean slate you'll end up spending millennia to get the knowledge you've just rejected.
So the overall process of generating knowledge is to look under the streetlight till finding something new becomes impossible or too hard, and then you start experimenting with rejecting some bits of your knowledge to rethink them. I was taught to read works of Great Masters of the past critically, trying to reproduce their path while looking for forks where you can try to go the other way. It is a little bit like starting from a clean slate, but not exactly.
> If LLM fell to tricks of Elizabet Loftus, we'd say LLM hallucinated.
She's strongly oversold how and when false memories can be created. She testified in defense of Ghislaine Maxwell at her 2021 trial that financial incentives can create false memories and only later admitted that there were no studies to back this up when directly questioned.
She's spent a career over-generalizing data about implanting false minor memories to make money discrediting victims' traumatic memories and defend abusers.
You conflate "hallucination" with "imagination" but the former has much more in common with lieing than it does with imagining.
> She testified in defense of Ghislaine Maxwell at her 2021 trial that financial incentives can create false memories and only later admitted that there were no studies to back this up when directly questioned.
Did she have financial incentives? Was this a live demonstration? :P
I really don't like how you reject the idea completely. People do have online one-shot learning, but have you tried to learn how to play the piano? To learn it you need a lot of repetitions. Really a lot. You need a lot of repetitions to learn how to walk, or how to do arithmetic, or how to read English. This is very similar to LLMs, isn't it? So they are not completely different architectures, are they? It is more like human brains have something on top of an "LLM" that allows them to do tricks that LLMs can't do.
That whole bit about color blending and transparency and LLMs "not knowing colors" is hard to believe. I am literally using LLMs every day to write image-processing and computer vision code using OpenCV. It seamlessly reasons across a range of concepts like color spaces, resolution, compression artifacts, filtering, segmentation and human perception. I mean, removing the alpha from a PNG image was a preprocessing step it wrote by itself as part of a larger task I had given it, so it certainly understands transparency.
I even often describe the results e.g. "this fails when in X manner when the image has grainy regions" and it figures out what is going on, and adapts the code accordingly. (It works with uploading actual images too, but those consume a lot of tokens!)
And all this in a rather niche domain that seems relatively less explored. The images I'm working with are rather small and low-resolution, which most literature does not seem to contemplate much. It uses standard techniques well known in the art, but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.
If it can reason about images and vision and write working code for niche problems I throw at it, whether it "knows" colors in the human sense is a purely philosophical question.
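To give a flavor of the kind of preprocessing step I mean, here is a reconstruction along the lines of what it wrote (my own sketch, not its exact code):

```python
# Reconstruction of an alpha-removal preprocessing step: flatten RGBA over white.
import cv2
import numpy as np

img = cv2.imread("input.png", cv2.IMREAD_UNCHANGED)  # keeps the alpha channel if present
if img.ndim == 3 and img.shape[2] == 4:
    alpha = img[:, :, 3:4].astype(np.float32) / 255.0
    bgr = img[:, :, :3].astype(np.float32)
    white = np.full_like(bgr, 255.0)
    img = (alpha * bgr + (1.0 - alpha) * white).astype(np.uint8)  # composite over white
cv2.imwrite("flattened.png", img)
```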
> it wrote by itself as part of a larger task I had given it, so it certainly understands transparency
Or it’s a common step or a known pattern or combination of steps that is prevalent in its training data for certain input. I’m guessing you don’t know what’s exactly in the training sets. I don’t know either. They don’t tell ;)
> but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.
We tend to overestimate the novelty of our own work and methods and, at the same time, underestimate the vastness of the data and information available online for machines to train on. LLMs are very sophisticated pattern recognizers. It doesn't mean that what you are doing specifically has been done in this exact way before; rather, the adapted patterns and the approach may not be one of a kind.
> is a purely philosophical question
It is indeed. A question we need to ask ourselves.
> We tend to overestimate the novelty of our own work and our methods and at the same time, underestimate the vastness of the data and information available online for machines to train on. LLMs are very sophisticated pattern recognizers.
If LLMs are stochastic parrots, but also we’re just stochastic parrots, then what does it matter? That would mean that LLMs are in fact useful for many things (which is what I care about far more than any abstract discussion of free will).
We're not just stochastic parrots though; we can parrot things stochastically when that has utility, but we can also be original. The first time that work was done, it was done by a person, autonomously. Current LLMs couldn't have done it the first time.
0(?): there's no provided definition of what a 'world model' is. Is it playing chess? Is it remembering facts like how computers use math to blend colors? If so, then ChatGPT: https://chatgpt.com/s/t_6898fe6178b88191a138fba8824c1a2c has a world model, right?
1. The author seems to conflate context windows with failing to model the world in the chess example. I challenge them to ask a SOTA model with an image of a chess board or notation and ask it about the position. It might not give you GM level analysis but it definitely has a model of what’s going on.
2. Without explaining which LLM they used or sharing the chats these examples are just not valuable. The larger and better the model, the better its internal representation of the world.
You can try it yourself. Come up with some question involving interacting with the world and / or physics and ask GPT-5 Thinking. It’s got a pretty good understanding of how things work!
In my opinion the author refers to an LLM's inability to create an inner world, a world model.
That means it does not build a mirror of a system based on its interactions.
It just outputs fragments of the world models it was built on and tries to give you a string of fragments that should match the fragment of your world model that you provided through some input method.
It cannot abstract the code base fragments you share, and it cannot extend them with details using a model of the whole project.
A "world model" depends on the context which defines which world the problem is in. For chess, which moves are legal and needing to know where the pieces are to make legal moves are parts of the world model. For alpha blending, it being a mathematical operation and the visibility of a background given the transparency of the foreground are parts of the world model.
The examples are from all the major commercial American LLMs as listed in a sister comment.
You seem to conflate context windows with tracking chess pieces. The context windows are more than large enough to remember 10 moves. The model should either track the pieces, or mention that it would be playing blindfold chess absent a board to look at and it isn't good at this, so could you please list the position after every move to make it fair, or it doesn't know what it's doing; it's demonstrably the latter.
If you train an LLM on chess, it will learn that too. You don't need to explain the rules, just feed it chess games, and it will stop making illegal moves at some point. It is a clear example of an inferred world model from training.
This is the best and clearest explanation I have yet seen that describes a tricky thing, namely that LLMs, which are synonymous with "AI" for so many people, are just one variation of many possible types of machine intelligence.
Which I find important because, well, hallucinating facts is what you would expect from an LLM, but it isn't necessarily an inherent issue with machine intelligence writ large if it's trained from the ground up on different principles, or modelling something else. We use LLMs as a stand-in for tutors because being really good at language incidentally makes them able to explain math or history as a side effect.
Importantly it doesn't show that hallucinating is a baked in problem for AI writ large. Presumably different models will have different kinds of systemic errors based on their respective designs.
Great quote at the end that I think I resonate a lot with:
> Feeding these algorithms gobs of data is another example of how an approach that must be fundamentally incorrect at least in some sense, as evidenced by how data-hungry it is, can be taken very far by engineering efforts — as long as something is useful enough to fund such efforts and isn’t outcompeted by a new idea, it can persist.
I think they know this but don't have causality built in, in the sense that they aren't incentivised to understand holistically. Kids around 4 years old spam "why? why? why?" questions, and I think this is a process we have yet to reproduce. (BTW I suspect they ask this as a manifestation of what is going on in their brain and not out of real curiosity, as they ask the same question multiple times.)
I think it's mostly because they are incentivised to answer verbatim, like medicine students, and not from their own understanding. RL methods change that.
What with this and your previous post about why sometimes incompetent management leads to better outcomes, you are quickly becoming one of my favorite tech bloggers. Perhaps I enjoyed the piece so much because your conclusions basically track mine. (I'm a software developer who has dabbled with LLMs, and has some hand-wavey background on how they work, but otherwise can claim no special knowledge.) Also your writing style really pops. No one would accuse your post of having been generated by an LLM.
Actually we got far more data and training than any LLM. We've been gathering and processing sensory data every second at least since birth (more processing than gathering when asleep), and are only really considered fully intelligent in our late teens to mid-20s.
It seems clear to me that LLMs are a useful sort of dumb smart activity. They can take some pretty useful stabs in the dark, and therefore do much better in an environment which can give them feedback (coding) or where there is no objective correct answer (write a poem). It opens the door for some novel type of computational tasks, and the more feedback you can provide within the architecture of your application, the more useful the LLM will probably be. I think the hype of their genuine intelligence is overblown, but doesn’t mean they are not useful.
When I went to uni, we had tutorials several times a week. Two students, one professor, going over whatever was being studied that week. The professor would ask insightful questions, and the students would try to answer.
Sometimes, I would answer a question correctly without actually understanding what I was saying. I would be spewing out something that I had read somewhere in the huge pile of books, and it would be a sentence, with certain special words in it, that the professor would accept as an answer.
But I would sometimes have this weird feeling of "hmm I actually don't get it" regardless. This is kinda what the tutorial is for, though. With a bit more prodding, the prof will ask something that you genuinely cannot produce a suitable word salad for, and you would be found out.
In math-type tutorials it would be things like realizing some equation was useful for finding an answer without having a clue about what the equation actually represented.
In economics tutorials it would be spewing out words about inflation or growth or some particular author but then having nothing to back up the intuition.
This is what I suspect LLMs do. They can often be very useful to someone who actually has the models in their minds, but not the data to hand. You may have forgotten the supporting evidence for some position, or you might have missed some piece of the argument due to imperfect memory. In these cases, LLM is fantastic as it just glues together plausible related words for you to examine.
The wheels come off when you're not an expert. Everything it says will sound plausible. When you challenge it, it just apologizes and pretends to correct itself.
Good on you for having the meta-cognition to recognize it.
I've graded many exams in my university days (and set some myself), and it's exceedingly obvious that that's what many students are doing. I do wonder though how often they manage to fly under the radar. I'm sure it happens, as you described.
(This is also the reason why I strongly believe that in exams where students write free-form answers, points should be subtracted for incorrect statements even if a correct solution is somewhere in the word salad.)
This article appeared on HN a while ago. https://dynomight.net/more-chess/
It basically is in agreement with this article and provides a few more trials and explanations.
No, I think they're suggesting the LLM should literally be "talking shit", e.g. in a chat window alongside the game UI, as if you're in a live chat with another player. As in, use the LLM for processing language, and the chess engine for playing chess.
I think this is quite an amusing idea, as the LLM would see the moves the chess engine made and comment along the lines of "wow, I didn't see that one coming!" very Roger Sperry.
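A minimal sketch of that split, assuming the python-chess library, a local Stockfish binary at a made-up path, and a placeholder ask_llm function standing in for a chat-model call:

    import chess
    import chess.engine

    def ask_llm(prompt: str) -> str:
        # Placeholder for a real chat-model call; only the banter goes through the LLM.
        return "Wow, I didn't see that one coming!"

    board = chess.Board()
    engine = chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish")  # path is an assumption
    for _ in range(4):
        result = engine.play(board, chess.engine.Limit(time=0.1))  # the engine picks the move
        san = board.san(result.move)                               # describe it before pushing
        board.push(result.move)
        print(san, "-", ask_llm(f"Comment, in character, on the move {san}."))
    engine.quit()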
As far as I can tell they don’t say which LLM they used which is kind of a shame as there is a huge range of capabilities even in newly released LLMs (e.g. reasoning vs not).
ChatGPT, Claude, Grok and Google AI Overviews (whatever powers the latter) were all used in one or more of these examples, in various configurations. I think they can perform differently, and I often try more than one when the first try doesn't work great. But I don't think there's any fundamental difference in the principle of their operation, and I don't think there ever will be - not until there's another major breakthrough.
Each of these models has a thinking/reasoning variant and a default non-thinking variant. I would expect the reasoning variants (o3 or “GPT5 Thinking”, Gemini DeepThink, Claude with Extended Thinking, etc) to do better at this. I think there is also some chance that in their reasoning traces they may display something you might see as closer to world modelling. In particular, you might find them explicitly tracking positions of pieces and checking validity.
My hypothesis is that a model fails to switch into a deep thinking mode (if it has one) and blurts out whatever it got from all the internet data during autoregressive training. I tested this with the alpha-blending example: Gemini 2.5 Flash fails, Gemini 2.5 Pro succeeds.
How does the presence or absence of a world model, er, blend into all this? I guess "having a consistent world model at all times" is an incorrect description of humans, too. We seem to have one because we have mechanisms to notice errors, correct errors, remember the results, and use the results when similar situations arise, while slowly updating intuitions about the world to incorporate changes.
The current models lack "remember/use/update" parts.
> I don't think there's any fundamental difference in the principle of their operation
Yeah, they seem to be subject to the universal approximation theorem (this needs to be checked more thoroughly, but I think we can build a transformer that is equivalent to any given fully-connected multilayer network).
That is, at a certain size they can do anything a human can do at a given point in their life (that is, with no additional training), regardless of whether humans have world models and what those models are at the neuronal level.
But there are additional nuances that are related to their architectures and training regimes. And practical questions of the required size.
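For reference, a classical one-hidden-layer statement of the universal approximation theorem (Cybenko/Hornik) that this argument leans on; note it guarantees existence only and says nothing about how large N must be or whether training can find the weights, which is exactly the "required size" caveat above:

    % For any continuous f on a compact K \subset \mathbb{R}^n, any non-polynomial
    % activation \sigma, and any \varepsilon > 0, there exist N and parameters
    % a_i, b_i \in \mathbb{R}, w_i \in \mathbb{R}^n such that
    \Bigl| \, f(x) - \sum_{i=1}^{N} a_i \,\sigma\!\left( w_i^{\top} x + b_i \right) \Bigr| < \varepsilon
    \qquad \text{for all } x \in K .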
This is interesting. The "professional level" rating of <1800 isn't actually professional level, but still.
However:
"A significant Elo rating jump occurs when the model’s Legal Move accuracy reaches
99.8%. This increase is due to the reduction in errors after the model learns to generate legal moves,
reinforcing that continuous error correction and
learning the correct moves significantly improve ELO"
You should be able to reach the move legality of around 100% with few resources spent on it. Failing to do so means that it has not learned a model of what chess is, at some basic level. There is virtually no challenge in making legal moves.
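As an illustration of how cheap legality is once you keep any explicit board state, here is a sketch using the python-chess library; wrapping a model so it may only sample from this set forces 100% legal moves by construction:

    import chess

    board = chess.Board()
    board.push_san("e4"); board.push_san("e5"); board.push_san("Nf3")

    legal = {board.san(m) for m in board.legal_moves}   # every legal reply for Black
    print(len(legal), "legal moves in this position")
    print("Nc6 legal?", "Nc6" in legal)                 # True
    print("Qh5 legal?", "Qh5" in legal)                 # False: the queen can't reach h5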
> Failing to do so means that it has not learned a model of what chess is, at some basic level.
I'm not sure about this. Among a typical set of amateur chess players, how often do they attempt an illegal move when they lack any kind of guidance from a computer? I played chess for years throughout elementary, middle and high school, and I would easily say that even after hundreds of hours of playing, I might have made two mistakes out of a thousand moves where the move was actually illegal - usually because I had missed that moving the piece would leave me in check due to a discovered check.
It's hard to conclude from that experience that players that are amateurs lack even a basic model of chess.
Can you say with 100% certainty that you can generate a good next move (the example from the paper) without using tools, and that you will never accidentally slip up and give an illegal move?
Language models aren't world models for the same reason languages aren't world models.
Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".
They only have meaning to sentient beings, and that meaning is heavily subjective and contextual.
But there appear to be some who think that we can grasp truth through mechanical symbol manipulation. Perhaps we just need to add a few million more symbols, they think.
If we accept the incompleteness theorem, then there are true propositions that even a super-intelligent AGI would not be able to express, because all it can do is output a series of placeholders. Not to mention the obvious fallacy of knowing super-intelligence when we see it. Can you write a test suite for it?
This is missing the lesson of the Yoneda Lemma: symbols are uniquely identified by their relationships with other symbols. If those relationships are represented in text, then in principle they can be inferred and navigated by an LLM.
Some relationships are not represented well in text: tacit knowledge like how hard to twist a bottle cap to get it to come off, etc. We aren't capturing those relationships between all your individual muscles and your brain well in language, so an LLM will miss them or have very approximate versions of them, but... that's always been the problem with tacit knowledge: it's the exact kind of knowledge that's hard to communicate!
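For the curious, the standard statement behind "symbols are uniquely identified by their relationships" is roughly the following; the second line is the consequence actually being invoked:

    % Yoneda lemma: for a locally small category C, an object A, and a functor F : C -> Set,
    \mathrm{Nat}\bigl(\mathrm{Hom}_{\mathcal{C}}(A,-),\, F\bigr) \;\cong\; F(A)
    % and since the Yoneda embedding is fully faithful,
    \mathrm{Hom}_{\mathcal{C}}(A,-) \cong \mathrm{Hom}_{\mathcal{C}}(B,-) \;\Longrightarrow\; A \cong B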
I don't think it's a communication problem so much as that there is no possible relation between a word and a (literal) physical experience. They are, quite literally, on different planes of existence.
When I have a physical experience, sometimes it results in me saying a word.
Now, maybe there are other possible experiences that would result in me behaving identically, such that from my behavior (including what words I say) it is impossible to distinguish between different potential experiences I could have had.
But, “caused me to say” is a relation, is it not?
Unless you want to say that it wasn’t the experience that caused me to do something, but some physical thing that went along with the experience, either causing or co-occurring with the experience, and also causing me to say the word I said. But, that would still be a relation, I think.
Yes, but it's a unidirectional relation: it was the result of the experience. The word cannot represent the context (the experience), in a meaningful way.
It's like trying to describe a color to a blind person: poetic subjective nonsense.
I don’t know what you mean by “unidirectional relation”. I get that you gave an explanation after the colon, but I still don’t quite get what you mean. Do you just mean that what words I use doesn’t pick out a unique possible experience? That’s true of course, but I don’t know why you call that “unidirectional”
I don’t think describing colors to a blind person is nonsense. One can speak of how the different colors relate to one-another. A blind person can understand that a stop sign is typically “red”, and that something can be “borderline between red and orange”, but that things will not be “borderline between green and purple”. A person who has never had any color perception won’t know the experience of seeing something red or blue, but they can still have a mental model of the world that includes facts about the colors of things, and what effects these are likely to have, even though they themselves cannot imagine what it is like to see the colors.
IMO, the GP's idea is that you can't explain sounds to a deaf man, or emotions to someone who doesn't feel them. All that needs direct experience and words only point to our shared experience.
You exist in the full experience. That lossy projection to words is still meaningful to you, in your reading, because you know the experience it's referencing. What do I mean by "lossy projection"? It's the experience of seeing the color blue to the word "blue". The word "blue" is meaningless without already having experienced it, because the word is not a description of the experience, it's a label. The experience itself can't be sufficiently described, as you'll find if you try to explain a "blue" to a blind person, because it exists outside of words.
The concept here is that something like an LLM, trained on human text, can't have meaningful comprehension of some concepts, because some words are labels of things that exist entirely outside of text.
You might say "but multimodal models use tokens for color!", or even extend that to "you could replace the tokens used in multimodal models with color names!", and I would agree. But the understanding wouldn't come from the relation of words in human text; it would come from the positional relation of colors across a space, which is not much different from our experience of the color on our retina.
tldr: to get AI to meaningfully understand something, you have to give it a meaningful relation. Meaningful relations sometimes aren't present in human writing.
> Symbols, by definition, only represent a thing. They are not the same as the thing
First of all, the point isn't about the map becoming the territory, but about whether LLMs can form a map that's similar to the map in our brains.
But to your philosophical point, assuming there are only a finite number of things and places in the universe - or at least the part of which we care about - why wouldn't they be representable with a finite set of symbols?
What you're rejecting is the Church-Turing thesis [1] (essentially, that all mechanical processes, including that of nature, can be simulated with symbolic computation, although there are weaker and stronger variants). It's okay to reject it, but you should know that not many people do (even some non-orthodox thoughts by Penrose about the brain not being simulatable by an ordinary digital computer still accept that some physical machine - the brain - is able to represent what we're interested in).
> If we accept the incompleteness theorem
There is no if there. It's a theorem. But it's completely irrelevant. It means that there are mathematical propositions that can't be proven or disproven by some system of logic, i.e. by some mechanical means. But if something is in the universe, then it's already been proven by some mechanical process: the mechanics of nature. That means that if some finite set of symbols could represent the laws of nature, then anything in nature can be proven in that logical system.
Which brings us back to the first point: the only way the mechanics of nature cannot be represented by symbols is if they are somehow infinite, i.e. they don't follow some finite set of laws. In other words - there is no physics. Now, that may be true, but if that's the case, then AI is the least of our worries.
Of course, if physics does exist - i.e. the universe is governed by a finite set of laws - that doesn't mean that we can predict the future, as that would entail both measuring things precisely and simulating them faster than their operation in nature, and both of these things are... difficult.
> First of all, the point isn't about the map becoming the territory, but about whether LLMs can form a map that's similar to the map in our brains.
It should be capable of something similar (fsvo similar), but the largest difference is that humans have to be power-efficient and LLMs do not.
That is, people don't actually have world models, because modeling something is a waste of time and energy insofar as it's not needed for anything. People are capable of taking out the trash without knowing what's in the garbage bag.
Well, the physical universe will still exist, but I don't think that physics - the scientific study of said universe - would become sort of meaningless, would it?
Why meaningless? Imperfect knowledge can still be useful, and ultimately that's the only kind we can ever have about anything.
"We could learn to sail the oceans and discover new lands and transport cargo cheaply... But in a few centuries we'll discover we were wrong and the Earth isn't really a sphere and tides are extra-complex so I guess there's no point."
Because if there's an infinite number of laws, are they laws at all? You can't predict anything, because you don't even know whether some of the laws you don't know yet (which is pretty much all of them) make an exception to the ~0% of laws you do know. I'm not saying it's not interesting, but it's more history - today the apple fell down rather than up or sideways - than physics.
> course, if physics does exist - i.e. the universe is governed by a finite set of laws
That statement is problematic. It implies a metaphysical set of laws that make physical stuff relate a certain way.
The Humean way of looking at physics is that we notice relationships and model those with various symbols. The symbols form incomplete models because we can't get to the bottom of why the relationships exist.
> that doesn't mean that we can predict the future, as that would entail both measuring things precisely and simulating them faster than their operation in nature, and both of these things are... difficult.
The indeterminism of quantum mechanics limits how precise measurement can be and how predictable the future is.
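The quantitative version of that limit, for reference, is the usual uncertainty relation between position and momentum:

    \sigma_x \,\sigma_p \;\ge\; \frac{\hbar}{2}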
> That statement is problematic. It implies a metaphysical set of laws that make physical stuff relate a certain way.
What I meant was that since physics is the scientific search for the laws of nature, then if there's an infinite number of them, then the pursuit becomes somewhat meaningless, as an infinite number of laws aren't really laws at all.
> The symbols form incomplete models because we can't get to the bottom of why the relationships exist.
Why would a model be incomplete if we don't know why the laws are what they are? A model pretty much is a set of laws; it doesn't require an explanation (we may want such an explanation, but it doesn't improve the model).
I knew someone would call me out on that. I used the wrong word; what I meant was "expressed in a way that would satisfy" which implies proof within the symbolic order being used. I don't claim to be a mathematician or philosopher.
Well, you don't get it. The LLM definitely can state propositions "that satisfy" (let's just call them true propositions), and the fact that this is not the same as having a proof for them is what the incompleteness theorem is about.
Why would you require an LLM to have proof for the things it says? I mean, that would be nice, and I am actually working on that, but it is not anything we would require of humans and/or HN commenters, would we?
I clearly do not meet the requirements to use the analogy.
I am hearing the term super intelligence a lot and it seems to me the only form that would take is the machine spitting out a bunch of symbols which either delight or dismay the humans. Which implies they already know what it looks like.
If this technology will advance science or even be useful for everyday life, then surely the propositions it generates will need to hold up to reality, either via axiomatic rigor or empirically. I look forward to finding out if that will happen.
But it's still just a movement from the known to the known, a very limited affair no matter how many new symbols you add in whatever permutation.
> Language models aren't world models for the same reason languages aren't world models.
Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".
Symbols, maps, descriptions, and words are useful precisely because they are NOT what they represent. Representation is not identity. What else could a “world model” be other than a representation? Aren’t all models representations, by definition? What exactly do you think a world model is, if not something expressible in language?
> Aren’t all models representations, by definition? What exactly do you think a world model is, if not something expressible in language?
I was following the string of questions, but I think there is a logical leap between those two questions.
Another question: is Language the only way to define models?
An imagined sound or an imagined picture of an apple in my mind's eye are models to me, but they don't use language.
Gödel’s incompleteness theorems aren’t particularly relevant here. Given how often people attempt to apply them to situations where they don’t say anything of note, I think the default should generally be to not publicly appeal to them unless one either has worked out semi-carefully how to derive the thing one wants to show from them, or at least have a sketch that one is confident, from prior experience working with it, that one could make into a rigorous argument. Absent these, the most one should say, I think, is “Perhaps one can use Gödel’s incompleteness theorems to show [thing one wants to show].” .
Now, given a program that is supposed to output text that encodes true statements (in some language), one can probably define some sort of inference system that corresponds to the program such that the inference system is considered to “prove” any sentence that the program outputs (and maybe also some others based on some logical principles, to ensure that the inference system satisfies some good properties), and upon defining this, one could (assuming the language allows making the right kinds of statements about arithmetic) show that this inference system is, by Gödel’s theorems, either inconsistent or incomplete.
This wouldn’t mean that the language was unable to express those statements. It would mean that the program either wouldn’t output those statements, or that the system constructed from the program was inconsistent (and, depending on how the inference system is obtained from the program, the inference system being inconsistent would likely imply that the program sometimes outputs false or contradictory statements).
But, this has basically nothing to do with the “placeholders” thing you said. Gödel’s theorem doesn’t say that some propositions are inexpressible in a given language, but that some propositions can’t be proven in certain axiom+inference systems.
Rather than the incompleteness theorems, the “undefinability of truth” result seems more relevant to the kind of point I think you are trying to make.
Still, I don’t think it will show what you want it to, even if the thing you are trying to show is true. Like, perhaps it is impossible to capture qualia with language, sure, makes sense. But logic cannot show that there are things which language cannot in any way (even collectively) refer to, because to show that there is a thing it has to refer to it.
————
“Can you write a test suite for it?”
Hm, might depend on what you count as a “suite”, but a test protocol, sure. The one I have in mind would probably be a bit expensive to run if it fails the test though (because it involves offering prize money).
There is an important implication of learning and indexing being equivalent problems. A number of important data models and data domains exist for which we do not know how to build scalable indexing algorithms and data structures.
It has been noted for several years in US national labs and elsewhere that there is an almost perfect overlap between data models LLMs are poor at learning and data models that we struggle to index at scale. If LLMs were actually good at these things then there would be a straightforward path to addressing these longstanding non-AI computer science problems.
The incompleteness is that the LLM tech literally can't represent elementary things that are important enough that we spend a lot of money trying to represent them on computers for non-AI purposes. A super-intelligent AGI being right around the corner implies that we've solved these problems that we clearly haven't solved.
Perhaps more interesting, it also implies that AGI tech may look significantly different than the current LLM tech stack.
And, by various universality theorems, a sufficiently large AGI could approximate any sequence of human neuron firings to an arbitrary precision. So if the incompleteness theorem means that neural nets can never find truth, it also means that the human brain can never find truth.
Human neuron firing patterns, after all, only represent a thing; they are not the same as the thing. Your experience of seeing something isn't recreating the physical universe in your head.
> And, by various universality theorems, a sufficiently large AGI could approximate any sequence of human neuron firings to an arbitrary precision.
Wouldn't it become harder to simulate a human brain the larger a machine is? I don't know much, but I think that pesky speed-of-light thing might pose a challenge.
Everything is just a low resolution representation of a thing. The so-called reality we supposedly have access to is at best a small number of sound waves and photons hitting our face. So I don't buy this argument that symbols are categorically different. It's a gradient and symbols are more sparse and less rich of a data source, yes. But who are we to say where that hypothetical line exists, beyond which further compression of concepts into smaller numbers of buckets becomes a non-starter for intelligence and world modelling. And then there's multi modal LLMs which have access to data of a similar richness that humans have access to.
There are no "things" in the universe. You say this wave and that photon exist and represent this or that, but all of that is conceptual overlay. Objects are parts of speech, reality is undifferentiated quanta. Can you point to a particular place where the ocean becomes a particular wave? Your comment already implies an understanding that our mind is behind all the hypothetical lines; we impose them, they aren't actually there.
Reminds me of this [1] article. If we humans, after all these years we've been around, can't relay our thoughts exactly as we perceive them in our heads, what makes us think that we can make a model that does it better than us?
I’m not a math guy but the incompleteness theorem applies to formal systems, right? I’ve never thought about LLMs as formal systems, but I guess they are?
Nor am I. I'm not claiming an LLM is a formal system, but it is mechanical and operates on symbols. It can't deal in anything else. That should temper some of the enthusiasm going around.
> Language models aren't world models for the same reason languages aren't world models.
> Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".
There are a lot of negatives in there, but I feel like it boils down to: a model of a thing is not the thing. Well, duh. It's a model. A map is a model.
Right. It's a dead thing that has no independent meaning. It doesn't even exist as a thing except conceptually. The referent is not even another dead thing, but a reality that appears nowhere in the map itself. It may have certain limited usefulness in the practical realm, but expecting it to lead to new insights ignores the fact that it's fundamentally an abstraction of the real, not in relationship to it.
I just tried a few things that are simple and a world model would probably get right. Eg
Question to GPT5:
I am looking straight on to some objects. Looking parallel to the ground.
In front of me I have a milk bottle, to the right of that is a Coca-Cola bottle. To the right of that is a glass of water. And to the right of that there’s a cherry. Behind the cherry there’s a cactus and to the left of that there’s a peanut. Everything is spaced evenly. Can I see the peanut?
Answer (after choosing thinking mode)
No.
The cactus is directly behind the cherry (front row order: milk, Coke, water, cherry). “To the left of that” puts the peanut behind the glass of water. Since you’re looking straight on, the glass sits in front and occludes the peanut.
It doesn't consider transparency until you mention it; then it apologises and says it didn't think of transparency.
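For what it's worth, the bookkeeping the riddle asks for is tiny once you commit to a layout; a toy sketch, assuming the same straight-on, orthographic simplification the model made:

    # Front row, left to right, with the back row keyed by which column it sits behind.
    front_row = ["milk bottle", "coca-cola bottle", "glass of water", "cherry"]
    back_row = {"cactus": 3, "peanut": 2}      # peanut is one column left of the cactus

    transparent = {"glass of water"}           # the detail GPT-5 missed until prompted

    occluder = front_row[back_row["peanut"]]   # what stands directly in front of the peanut
    print(f"The peanut sits behind the {occluder}; visible: {occluder in transparent}")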
This seems like a strange riddle. In my mind I was thinking that, regardless of the glass, all of the objects can be seen (due to perspective, and also the fact you mentioned the locations, meaning you're aware of them).
It seems to me it would only actually work with an orthographic perspective, which is not how our reality works.
Have you asked five adults this riddle? I suspect at least two of them would get it wrong or have some uncertainty about whether or not the peanut was visible.
This. Was also thinking "yes" first because of the glass of water, transparency, etc, but then got unsure: The objects might be spaced so widely that the milk or coke bottle would obscure the view due to perspective - or the peanut would simply end up outside the viewer's field of vision.
Shows that even if you have a world model, it might not be the right one.
This is significant in general because I personally would love to get these things to code-switch into "hackernews poster" or "writer for the Economist" or "academic philosopher", but I think the "chat" format makes it impossible. The inaccessibility of this makes me want to host my own LLM...
King Frederick the Great of Prussia had a very fine army, and none of the soldiers in it were finer than the Giant Guards, who were all extremely tall men. It was difficult to find enough soldiers for these Guards, as there were not many men who were tall enough.
Frederick had made it a rule that no soldiers who did not speak German could be admitted to the Giant Guards, and this made the work of the officers who had to find men for them even more difficult. When they had to choose between accepting or refusing a really tall man who knew no German, the officers used to accept him, and then teach him enough German to be able to answer if the King questioned him.
Frederick, sometimes, used to visit the men who were on guard around his castle at night to see that they were doing their job properly, and it was his habit to ask each new one that he saw three questions: “How old are you?” “How long have you been in my army?” and “Are you satisfied with your food and your conditions?”
The officers of the Giant Guards therefore used to teach new soldiers who did not know German the answers to these three questions.
One day, however, the King asked a new soldier the questions in a different order. He began with, "How long have you been in my army?" The young soldier immediately answered, "Twenty-two years, Your Majesty." Frederick was very surprised. "How old are you then?" he asked the soldier. "Six months, Your Majesty," came the answer. At this Frederick became angry. "Am I a fool, or are you one?" he asked. "Both, Your Majesty," the soldier answered politely.
LOL, you really think that intelligence (however you want to define or measure the concept) is a guarantee that people won't make mistakes, misremember, make stuff up, or lie?
I agree with the article. I will be very surprised if LLMs end up being "it". I say this as a language geek who has always been amazed how language drives our thinking. However, I think language exists between brains, not inside them. There's something else in us and LLMs aren't it.
I wonder how the nature of the language used to train an LLM affects its model of the world. Would a language designed for the maximum possible information content and clarity like Ithkuil make an LLMs world model more accurate?
I'm surprised the models haven't been enshittified by capitalism. I think in a few years we're going to see lightning-fast LLMs generating better output compared to what we're seeing today. But it won't be 1000x better, it will be 10x better, 10x faster, and completely enshittified with ads and clickbait links. Enjoy ChatGPT while it lasts.
Claude is the LLM model. Claude Code is a combination of internal tools for the agent to track its goals, current state, priorities, etc., and a looped mechanism for keeping it on track, focused, and debugging its own actions. With the proper subagents it can keep its context from being poisoned from false starts, and its built-in todo system keeps it on task.
Really, try it out and see for yourself. It doesn't work magic out of the box, and absolutely needs some hand-holding to get it to work well, but that's only because it is so new. The next generation of tooling will have these subagent definitions auto selected and included in context so you can hit the ground running.
We are already starting to see a flood of software coming out with very few active coders on the team, as you can see on the HN front page. I say "very few active coders" not "no programmers" because using Claude Code effectively still requires domain expertise as we work out the bugs in agent orchestration. But once that is done, there aren't any obvious remaining stumbling blocks to a PM running a no-coder, all-AI product team.
Claude Code isn't an LLM. It's a hybrid architecture where an LLM provides the interface and some of the reasoning, embedded inside a broader set of more or less deterministic tools.
It's obvious LLMs can't do the job without these external tools, so the claim above - that LLMs can't do this job - is on firm ground.
But it's also obvious these hybrid systems will become more and more complex and capable over time, and there's a possibility they will be able to replace humans at every level of the stack, from junior to CEO.
If that happens, it's inevitable these domain-specific systems will be networked into a kind of interhybrid AGI, where you can ask for specific outputs, and if the domain has been automated you'll be guided to what you want.
It's still a hybrid architecture though. LLMs on their own aren't going to make this work.
It's also short of AGI, never mind ASI, because AGI requires a system that would create high quality domain-specific systems from scratch given a domain to automate.
If you want to be pedantic about word definitions, it absolutely is AGI: artificial general intelligence.
Whether you draw the system boundary of an LLM to include the tools it calls or not is a rather arbitrary distinction, and not very interesting.
Nearly every definition I’ve seen that involves AGI (there are many) includes the ability to self learn and create “novel ideas”. The LLM behind it isn’t capable of this, and I don’t think the addition of the current set of tools enables this either.
Artificial general intelligence was a phrase invented to draw distinction from “narrow intelligence” which are algorithms that can only be applied to specific problem domains. E.g. Deep Blue was amazing at playing chess, but couldn’t play Go much less prioritize a grocery list. Any artificial program that could be applied to arbitrary tasks not pre-trained on is AGI. ChatGPT and especially more recent agentic models are absolutely and unquestionably AGI in the original definition of the term.
Goalposts are moving, though. Through the efforts of various people in the rationalist-connected space, the word has since morphed to be implicitly synonymous with the notion of superintelligence and self-improvement, hence the vague and conflicting definitions people now ascribe to it.
Also, fwiw, the training process behind the generation of an LLM is absolutely able to discover new and novel ideas, in the same sense that Kepler's laws of planetary motion were new and novel if all you had were Tycho Brahe's astronomical observations. Inference can tease out these novel discoveries, if nothing else. But I suspect also that your definition of creative and novel would exclude human creativity if it were rigorously applied - our brains, after all, are merely remixing our own experiences too.
I feel like refutations like this ("you aren't using the tool right", "you should try this other tool") pop up often but are fundamentally worthless, because as long as you're not showing code you might as well be making it up. The blog post gives examples of clear failures that anyone can reproduce themselves; I think it's time vibe-code defenders were held to the same standard.
The very first example is that LLMs lose their mental model of chess when playing a game. Ok, so instead ask Claude Code to design an MCP for tracking chess moves, and vibe code it. That’s the very first thing that comes to mind, and I expect it would work well enough.
"LLM" as well, because coding agents are already more than just an LLM. There is very useful context management around it, and tool calling, and ability to run tests/programs, etc. Though they are LLM-based systems, they are not LLMs.
Indeed. If the LLM calls a chess engine tool behind the scenes, it would be able to play excellent chess as well.
The author would still be wrong in the tool-calling scenario. There is already perfect (or at least superhuman) chess engines. There is no perfect "coding engine". LLM's + tools being able to reliably work on large codebases would be a new thing.
Correct - as long as the tools the LLM uses are non-ML-based algorithms existing today, and it operates on a large code base with no programmers in the loop, I would be wrong. If the LLM uses a chess engine, then it does nothing on top of the engine; similarly if an LLM will use another system adding no value on top, I would not be wrong. If the LLM uses something based on a novel ML approach, I would not be wrong - it would be my "ML breakthrough" scenario. If the LLM uses classical algorithms or an ML algo known today and adds value on top of them and operates autonomously on a large code base - no programmer needed on the team - then I am wrong
This rapidly gets philosophical. If I use tools am I not handling the codebase? Are we classing LLM as tool or user in this scenario?
True, there'd be a need to operationalize these things a bit more than is done in the post to have a good advance prediction.
"Autonomously"? What happens when subtle updates - not bugs, but changes to the meaning of some features - break the workflow in some external part of a client's system? It happens all the time, and because it's really hard to keep the whole meaning and business rules written down and up to date, an LLM might never be able to grasp some of that meaning. Maybe if, instead of developing code and infrastructure, the whole industry shifts toward only writing impossibly precise spec sheets that make meaning and intent crystal clear, then maybe "autonomously" might be possible to pull off.
Those spec sheets exist: they're called software.
Not exactly. It depends on how the software is written and whether there are ADRs in the project. I've had to work on projects where there were bugs because someone coded business rules in a very bad and unclear way. You move an if somewhere and something breaks somewhere else. When the software isn't clear enough - and often it isn't, because we have to go fast - you ask "is this condition the way it's supposed to work, or is it a bug?" and get people to confirm the rule. My point is this: amazingly written software surely works best with LLMs. That's not how most software gets written for now, because businesses sometimes value speed over engineering (or it's a lack of skills).
Right: software is not necessarily a sufficiently-clear specification, but a sufficiently-clear specification would be software – and you've correctly identified that a good part of your job is ensuring the software provides a sufficiently-clear specification.
Amazingly-written software is necessary for LLMs to work well, but it isn't sufficient: LLMs tend to make nonsensical changes that, while technically implementing what they're asked to do (much of the time), reduce the quality of the software. As this repeats, the LLMs become less and less able to modify the program. This is because they can't program: they can translate, plagiarise, and interpolate, but they're missing several key programming skills, and probably cannot learn them.
I don't think that statement is falsifiable until you define "deal with" and "large code bases."
That feels like a statement that's far too loosely defined to be meaningful to me.
I work on codebases that you could describe as 'large', and you could describe some of the LLM driven work being done on them as 'autonomous' today.
In two years there will probably be no new 'autonomous' LLMs. They will most likely be integrated into 'products', trained and designed for this. We see the beginning of it today as agents and tools.
How large? What does "deal" mean here? Autonomously - is that on its own whim, or at the behest of a user?
>LLMs will never manage to deal
time to prove hypothesis: infinity years
The whole of modern science is based on the idea that we can never prove a theory about the world to be true, but that we can provide experiments which allow us to show that some theories are closer to the truth than others.
Eh, if the hypothesis remains unfalsified for longer and longer, we can have increased confidence.
Similarly, Newton's laws say that bodies always stay at rest unless acted upon by a force. Strictly speaking, if a billiard ball jumped up without cause tomorrow, that would disprove Newton. So we'd have to wait an infinite amount of time to prove Newton right.
However no one has to wait so long, and we found ways to express how Newton's ideas are _better_ than those of Aristotle without waiting an eternity.
> LLMs will never manage to deal with large code bases 'autonomously'
Absolutely nothing about that statement is concrete or falsifiable.
Hell, you can already deal with large code bases 'autonomously' without LLMs - grep and find and sed goes a long way!
The post is based on a misconception. If you read the blog post linked at the end of this message, you'll see how a very small GPT-2-like transformer (Karpathy's nanoGPT trained at a very small size), after seeing just PGN games and nothing more, develops an 8x8 internal representation of which chess piece is where. This representation can be extracted by linear probing (and can even be altered by using the probe in reverse). LLMs are decent but not very good chess players for other reasons, not because they don't have a world model of the chess board.
https://www.lesswrong.com/posts/yzGDwpRBx6TEcdeA5/a-chess-gp...
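A sketch of what "extracted by linear probing" means in practice, with stand-in random arrays where the real setup would use cached residual-stream activations and per-square piece labels (names and shapes here are assumptions, not the linked post's code):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    acts = rng.normal(size=(2000, 512))             # stand-in for (n_positions, d_model) activations
    square_label = rng.integers(0, 13, size=2000)   # stand-in for "which piece occupies square e4"

    probe = LogisticRegression(max_iter=1000)       # a purely linear readout
    probe.fit(acts[:1500], square_label[:1500])
    print("probe accuracy:", probe.score(acts[1500:], square_label[1500:]))
    # On random data this is chance level; on real chess-GPT activations the linked
    # post reports high accuracy, which is the evidence for an internal board representation.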
Ironically, that lesswrong article is more wrong than right.
First, chess is perfect for such modeling. The game is basically a tree of legal moves. The "world model" representation is already encoded in the dataset itself, and at a certain scale the chance of making an illegal move is minimal, as the dataset contains an overwhelming proportion of legal moves compared to illegal ones - let alone when you are training on a chess dataset like the PGN one.
Second, the probing is quite... a subjective thing.
We are cherry-picking activations across an arbitrary number of dimensions, on a model specifically trained for chess, taking these arbitrary representations and displaying them on a 2D graph.
Well yeah, with enough dimensions and cherry-picking, we can also show how "all zebras are elephants, because all elephants are horses and look their weights overlap in so many dimensions - large four-legged animals you see on safari!" - especially if we cherry-pick it. Especially if we tune a dataset on it.
This shows nothing other than "training LLMs on a constrained move dataset makes LLM great at predicting next move in that dataset".
And if it knew every possible board configuration and optimal move, it could potentially do about as well as possible; but if instead it were to just recognize "this looks like a chess game" and use an optimized tool to determine the next move, that would be a better use of training, it would seem.
Way better use - at this point that engine is more like the world's most expensive Monte Carlo search.
The post or rather the part you refer to is based on a simple experiment which I encourage you to repeat. (It is way likelier to reproduce in the short to medium run than the others.)
From your link: "...The first was gpt-3.5-turbo-instruct's ability to play chess at 1800 Elo"
These things don't play at 1800 Elo - though maybe someone measured this Elo without cheating, relying instead on some artifact of how an engine told to play at a low rating does against an LLM (engines are weird when you ask them to play badly, as a rule); a good start toward a decent measurement would be to try it on Chess960. These things do lose track of the pieces within 10 moves. (As do I absent a board to look at, but I understand enough to say "I can't play blindfold chess, let's set things up so I can look at the current position somehow.")
This essay could probably benefit from some engagement with the literature on "interpretability" in LLMs, including the empirical results about how knowledge (like addition) is represented inside the neural network. To be blunt, I'm not sure being smart and reasoning from first principles after asking the LLM a lot of questions and cherry-picking what it gets wrong gets to any novel insights at this point. And it already feels a little out of date: with LLMs getting gold on the mathematical olympiad, they clearly have a pretty good world model of mathematics. I don't think cherry-picking a failure to prove 2 + 2 = 4 in the particular specific way the writer wanted to see disproves that at all.
LLMs have imperfect world models, sure. (So do humans.) That's because they are trained to be generalists and because their internal representations of things are massively compressed, since they don't have enough weights to encode everything. I don't think this means there are some natural limits to what they can do.
Your being blunt is actually very kind, if you're describing what I'm doing as "being smart and reasoning from first principles"; and I agree that I am not saying something very novel, at most it's slightly contrarian given the current sentiment.
My goal is not to cherry-pick failures for its own sake as much as to try to explain why I get pretty bad output from LLMs much of the time, which I do. They are also very useful to me at times.
Let's see how my predictions hold up; I have made enough to look very wrong if they don't.
Regarding "failure disproving success": it can't, but it can disprove a theory of how this success is achieved. And, I have much better examples than the 2+2=4, which I am citing as something that sorta works these says
Your LLM output seems abnormally bad, like you are using old models, bad models, or intentionally poor prompting. I just copied and pasted your Krita example into ChatGPT and got a reasonable answer, nothing like what you paraphrased in your post.
https://imgur.com/a/O9CjiJY
This seems like a common theme with these types of articles
Perhaps the people who get decent answers don't write articles about them?
I imagine people give up silently more often than they write a well syndicated article about it. The actual adoption and efficiencies we see in enterprises will be the most verifiable data on if LLMs are generally useful in practice. Everything so far is just academic pontificating or anecdata from strangers online.
I am inclined to agree.
However, I'm not completely sure. Eg object oriented programming was basically a useless fad full of empty, never-delivered-on promises, but software companies still lapped it up. (If you happen to like OOP, you can probably substitute your own favourite software or wider management fad.)
Another objection: even an LLM with limited capabilities and glaring flaws can still be useful for some commercial use-cases. E.g. the job of first-line call centre agents who aren't allowed to deviate from a fixed script can be reasonably automated with even a fairly bad LLM.
Will it suck occasionally? Of course! But so does interacting with the humans placed into these positions without authority to get anything done for you. So if the bad LLM is cheaper, it might be worthwhile.
This. I think we’ve about reached the limit of the usefulness of anecdata “hey I asked an LLM this this and this” blog posts. We really need more systematic large scale data and studies on the latest models and tools - the recent one on cursor (which had mixed results) was a good start but it was carried out before Claude Code was even released, i.e. prehistoric times in terms of AI coding progress.
For my part I don’t really have a lot of doubts that coding agents can be a useful productivity boost on real-world tasks. Setting aside personal experience, I’ve talked to enough developers at my company using them for a range of tickets on a large codebase to know that they are. The question is more, how much: are we talking a 20% boost, or something larger, and also, what are the specific tasks they’re most useful on. I do hope in the next few years we can get some systematic answers to that as an industry, that go beyond people asking LLMs random things and trying to reason about AI capabilities from first principles.
I think it's hard to take any LLM criticism seriously if they don't even specify which model they used. Saying "an LLM model" is totally useless for deriving any kind of conclusion.
When talking about the capabilities of a class of tools long term, it makes sense to be general. I think deriving conclusions at all is pretty difficult given how fast everything is moving, but there are some realities we do actually know about how LLMs work, and we can talk about those.
Knowing that ChatGPT output good tokens last Tuesday but Sonnet didn't does not help us know much about the future of the tools in general.
Yes, I’d be curious about his experience with GPT-5 Thinking model. So far I haven’t seen any blunders from it.
I've seen plenty of blunders, but in general it's better than their previous models.
Well, it depends a bit on what you mean by blunders. But eg I've seen it confidently assert mathematically wrong statements with nonsense proofs, instead of admitting that it doesn't know.
In a very real sense it doesn’t even know that it doesn’t know.
Maybe. But in math you can either produce the proof (with each step checkable) or you can't.
I mean yeah, it’s a good essay in that it made me think and try to articulate the gaps, and I’m always looking to read things that push back on AI hype. I usually just skip over the hype blogging.
I think my biggest complaint is that the essay points out flaws in LLM’s world models (totally valid, they do confidently get things wrong and hallucinate in ways that are different, and often more frustrating, from how humans get things wrong) but then it jumps to claiming that there is some fundamental limitation about LLMs that prevents them from forming workable world models. In particular, it strays a bit towards the “they’re just stochastic parrots” critique, e.g. “that just shows the LLM knows to put the words explaining it after the words asking the question.” That just doesn’t seem to hold up in the face of e.g. LLMs getting gold on the Mathematical Olympiad, which features novel questions. If that isn’t a world model of mathematics - being able to apply learned techniques to challenging new questions - then I don’t know what is.
A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the alpha zero chess engine was trained, in fact.
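To make "made to solve tons of math problems and rewarded for correct answers" concrete, here is a toy REINFORCE loop on verifiable arithmetic rewards - nothing like production RLHF/GRPO pipelines, just the shape of the idea, with a lookup-table "policy" standing in for the LLM:

    import numpy as np

    rng = np.random.default_rng(0)
    questions = [(a, b) for a in range(5) for b in range(5)]      # "what is a + b?"
    candidates = list(range(9))                                    # possible answers 0..8
    logits = {q: np.zeros(len(candidates)) for q in questions}     # the stand-in "model"

    def sample(q):
        p = np.exp(logits[q]); p /= p.sum()
        return rng.choice(len(candidates), p=p), p

    for step in range(2000):
        q = questions[rng.integers(len(questions))]
        idx, p = sample(q)
        reward = 1.0 if candidates[idx] == q[0] + q[1] else 0.0    # verifiable reward
        grad = -p
        grad[idx] += 1.0                     # gradient of log p(idx) w.r.t. the logits
        logits[q] += reward * grad           # REINFORCE update (learning rate 1, no baseline)

    acc = np.mean([candidates[int(np.argmax(logits[q]))] == q[0] + q[1] for q in questions])
    print("greedy accuracy after RL:", acc)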
I do think there’s a lot that the essay gets right. If I was to recast it, I’d put it something like this:
* LLMs have imperfect models of the world which is conditioned by how they’re trained on next token prediction.
* We’ve shown we can drastically improve those world models for particular tasks by reinforcement learning. You kind of allude to this already by talking about how they’ve been “flogged” to be good at math.
* I would claim that there’s no particular reason these RL techniques aren’t extensible in principle to beat all sorts of benchmarks that might look unrealistic now. (Two years ago it would have been an extreme optimist position to say an LLM could get gold on the mathematical Olympiad, and most LLM skeptics would probably have said it could never happen.)
* Of course it’s very expensive, so most world models LLMs have won’t get the RL treatment and so will be full of gaps, especially for things that aren’t amenable to RL. It’s good to beware of this.
I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. (If an LLM was given tons of RL training on that codebase, it could build a better world model, but that’s expensive and very challenging to set up.) This problem is hinted at in your essay, but the lack of on-the-job learning isn’t centered. But it’s the real elephant in the room with LLMs and the one the boosters don’t really have an answer to.
Anyway thanks for writing this and responding!
I'm not saying that LLMs can't learn about the world - I even mention how they obviously do it, even at the learned embeddings level. I'm saying that they're not compelled by their training objective to learn about the world and in many cases they clearly don't, and I don't see how to characterize the opposite cases in a more useful way than "happy accidents."
I don't really know how they are made "good at math," and I'm not that good at math myself. With code I have a better gut feeling for the limitations. I do think that you could throw them off terribly with unusual math questions to show that what they learned isn't math, but I'm not the guy to do it; my examples are about chess and programming, where I am more qualified. (You could say that my question about the associativity of blending and how caching works sort of shows that it can't use the concept of associativity in novel situations; not sure if this can be called an illustration of its weakness at math.)
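For readers who haven't seen the post: the property in question is that Porter-Duff "over" compositing on premultiplied colors is associative, which is exactly what justifies caching a pre-composited group of layers. A quick numerical check of that property (a sketch, not the post's exact example):

    import numpy as np

    def over(a, b):
        # Porter-Duff "over" on premultiplied RGBA: a composited on top of b.
        return a + b * (1.0 - a[3])

    rng = np.random.default_rng(0)
    def random_premultiplied_rgba():
        alpha = rng.uniform(0, 1)
        return np.append(rng.uniform(0, 1, 3) * alpha, alpha)

    a, b, c = (random_premultiplied_rgba() for _ in range(3))
    # Associativity is what makes caching (b over c) as a single layer legitimate.
    print(np.allclose(over(a, over(b, c)), over(over(a, b), c)))   # True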
But this is parallel to saying LLMs are not "compelled" by the training algorithms to learn symbolic logic.
Which says to me there are two camps on this, and the jury is still out on this and all related questions.
>LLMs are not "compelled" by the training algorithms to learn symbolic logic.
I think "compell" is such a human trait that machine will never replicate to the T.
The article did mention specifically about this very issue:
"And of course people can be like that, too - eg much better at the big O notation and complexity analysis in interviews than on the job. But I guarantee you that if you put a gun to their head or offer them a million dollar bonus for getting it right, they will do well enough on the job, too. And with 200 billion thrown at LLM hardware last year, the thing can't complain that it wasn't incentivized to perform."
It should already be evident that an LLM is, in itself, a limited stochastic AI tool by definition, and that its distant cousins are deterministic logic, optimization, and constraint programming [1],[2],[3] (a tiny constraint-programming sketch follows the references below). Perhaps one of the two breakthroughs that the author was predicting will be in this deterministic domain, in order to assist the LLM, and it will be a hybrid approach rather than a purely LLM one.
[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:
https://www.youtube.com/live/TknN8fCQvRk
[2] "We Really Don't Know How to Compute!" - Gerald Sussman - MIT (2011) [video]:
https://youtube.com/watch?v=HB5TrK7A4pI
[3] Google OR-Tools:
https://developers.google.com/optimization
[4] MiniZinc:
https://www.minizinc.org/
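To make the "deterministic cousins" concrete, here is a minimal sketch (mine, not from the linked talks) of what a constraint-solver call looks like with Google OR-Tools' CP-SAT; the variables and constraints are made up purely for illustration:

    from ortools.sat.python import cp_model

    model = cp_model.CpModel()
    x = model.NewIntVar(0, 100, "x")
    y = model.NewIntVar(0, 100, "y")
    model.Add(x + 2 * y <= 14)   # toy constraints, purely illustrative
    model.Add(3 * x - y >= 0)
    model.Maximize(x + y)

    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        print(solver.Value(x), solver.Value(y))  # the answer comes with a guarantee

The point is that the result is provably consistent with the constraints, no sampling involved, which is exactly the property the stochastic side lacks.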
It’s not just on the job learning though. I’m no AI expert, but the fact that you have “prompt engineers” and AI doesn’t know what it doesn’t know, gives me pause.
If you ask an expert, they know the bounds of their knowledge and can understand questions asked to them in multiple ways. If they don’t know the answer, they could point to someone who does or just say “we don’t know”.
LLMs just lie to you and we call it “hallucinating“ as though they will eventually get it right when the drugs wear off.
> I’m no AI expert, but the fact that you have “prompt engineers” [...] gives me pause.
Why? A bunch of human workers can get a lot more done with a capable leader who helps prompt them in the right direction and corrects oversights etc.
And overall, prompt engineering seems like exactly the kind of skill AI will be able to develop by itself. You already have a bit of this happening: when you ask Gemini to create a picture for you, the language part of Gemini will take your request and engineer a prompt for the picture part of Gemini.
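A hedged sketch of that two-stage pattern, with entirely hypothetical ask_llm / generate_image functions standing in for whatever APIs you actually use:

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError  # hypothetical: call your chat model of choice

    def generate_image(prompt: str) -> bytes:
        raise NotImplementedError  # hypothetical: call your image model of choice

    def picture_from_request(user_request: str) -> bytes:
        # stage 1: the language model rewrites the casual request into a detailed image prompt
        detailed_prompt = ask_llm(
            "Rewrite this as a detailed, concrete image-generation prompt: " + user_request
        )
        # stage 2: the image model only ever sees the engineered prompt
        return generate_image(detailed_prompt)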
This is the goalpost flip which happens in AI conversations, if goalpost is even the right term; maybe conversation switch?
There are two AI conversations on HN occurring simultaneously.
Convo A: Is it actually reasoning? Does it have a world model? etc.
Convo B: Is it good enough right now? (for X, Y, or Z workflow)
Maybe, yes. It's good to acknowledge that both of these conversations are worthwhile to have.
An LLM comprehends but does not understand. It is interesting to see these two qualities separated; until now they were synonyms.
> A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the alpha zero chess engine was trained, in fact.
It's closer to AlphaGo, which first trained on expert human games and then 'fine tuned' with self-play.
AlphaZero specifically did not use human training data at all.
I am waiting for an AlphaZero-style general AI. ('General' not in the AGI sense but in the ChatGPT sense of something you can throw general problems at and it will give them a good go, but not necessarily at human level, yet.) I just don't want to call it an LLM, because it wouldn't necessarily be trained on language.
What I have in mind is something that first solves lots and lots of problems, e.g. logic problems, formally posed programming problems, computer games, predicting the next frames in a webcam video, economic time series, whatever, as a sort of pre-training step, and then later perhaps you feed it a relatively small amount of human-readable text and speech so you can talk to it.
Just to be clear: this is not meant as a suggestion for how to successfully train an AI. I'm just curious whether it would work at all and how well / how badly.
Presumably there's a reason why all SOTA models go 'predict human produced text first, then learn problem solving afterwards'.
> I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. [...]
Yes, I agree. But 'on-the-job' training is also such an obvious idea that plenty of people are working on making it work.
With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever. That addition (something which only takes a few gates in digital logic) happens to be overfit into a few nodes of a multi-billion-node network is hardly a surprise to anyone except the most religious of AI believers.
The core issue there isn't that the LLM isn't building internal models to represent its world, it's that its world is limited to tokens. Anything not represented in tokens, or token relationships, can't be modeled by the LLM, by definition.
It's like asking a blind person to count the number of colors on a car. They can give it a go and assume glass, tires, and metal are different colors as there is likely a correlation they can draw from feeling them or discussing them. That's the best they can do though as they can't actually perceive color.
In this case, the LLM can't see letters, so asking it to count them causes it to try and draw from some proxy of that information. If it doesn't have an accurate one, then bam, strawberry has two r's.
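You can see that proxy directly with an open tokenizer. A small sketch using the tiktoken library (the exact split depends on the vocabulary, so treat the output as illustrative, not as how any particular chatbot tokenizes):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("blueberry")
    print([enc.decode([i]) for i in ids])  # a few multi-character chunks, not nine letters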
I think a good example of LLMs building models internally is this: https://rohinmanvi.github.io/GeoLLM/
LLMs are able to encode geospatial relationships because they can be represented well by token relationships. Two countries that are close together will be talked about together much more often than two countries far from each other.
> It's like asking a blind person to count the number of colors on a car.
I presume if I asked a blind person to count the colors on a car, they would reply “sorry, I am blind, so I can’t answer this question”.
That is just not a solid argument. There are countless examples of LLMs splitting "blueberry" into "b l u e b e r r y", which would contain one token per letter. And then they still manage to get it wrong.
Your argument is based on a flawed assumption, that they can't see letters. If they didn't they wouldn't be able to spell the word out. But they do. And when they do get one token per letter, they still miscount.
> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.
Train your model on characters instead of on tokens, and this problem goes away. But I don't think this teaches us anything about world models more generally.
Actually I forgive them the issues that stem from tokenization. I used to make fun of them for listing datum as a noun whose plural form ends with an i, but once I learned how tokenization works, I no longer do it; it feels like mocking a person's intelligence because of a speech impediment or something... I am very kind to these things, I think.
Tokenization makes things harder, but it doesn't make them impossible. Just takes a bit more memorization.
Other writing systems come with "tokenization" built in making it still a live issue. Think of answering:
1. How many n's are in 日本?
2. How many ん's are in 日本?
(Answers are 2 and 1.)
> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.
Is this a real defect, or some historical thing?
I just asked GPT-5 and it got the count right. I also asked it how many Rs are in Carrot and how many Ps are in Pineapple, and it answered both questions correctly too.
It's a historical thing that people still falsely claim is true, bizarrely without trying it on the latest models. As you found, leading LLMs don't have a problem with it anymore.
The question is, did these LLMs figure it out by themselves, or has someone programmed a specific routine to address this "issue" to make them look smarter than they are?
On a trillion dollar budget, you could just crawl the web for AI tests people came up with and solve them manually. We know it's a massively curated game. With that kind of money you can do a lot of things. You could feed every human on earth countless blueberries, for starters.
Calling an algorithm to count letters in a word isn't exactly worth the hype though, is it?
The point is, we keep finding new ways these LLMs can't figure out the most basic shit about the world. Horses can count. Counting is in everything. If you read every text ever written and still can't grasp counting, you simply are not that smart.
Depends how you define historical. If by historical you mean more than two days ago then, yeah, it's ancient history.
Some LLMs do better than others, but this still sometimes trips up even "frontier" non-reasoning models. People were showing this on this very forum with GPT-5 in the past couple days.
It is not historical:
https://kieranhealy.org/blog/archives/2025/08/07/blueberry-h...
Perhaps they have a hot fix that special cases HN complaints?
They clearly RLHF out the embarrassing cases and make cheating on benchmarks into a sport.
I wouldn't be surprised if some models get set up to identify that type of question and run the word through string processing function.
Of course they do stuff like that, otherwise it would look like they are stagnating. Fake it till you make it. Tho, at this point, the world is in deep shit, if they don’t make it…
What deep shit do you foresee?
My prediction is that this will be like the 2000 dot-com bubble. Both dot-com and AI are real and really useful technologies, but hype and share prices have got way ahead of them, so things will need to readjust.
A major economic crisis, yes. I think the web is already kinda broken because of AI, and it's going to get a lot worse. I also question its usefulness... Is it useful for solving any real problems, and if so, how long before we run out of those problems? Because we conflated a lot of bullshit with innovation right before AI. Right now people may be getting a slight edge, but it's like getting a dishwasher: once expectations adjust, things will feel like a grind again, and I really don't think people will like that new reality in regard to their experience of self-efficacy (which is important for mental health). I presume the struggle to get information, figuring it out yourself, may be a really important part of putting pressure towards process optimization and of learning and cognitive development. We may collectively regress there. With so many major crises, and a potential economic crisis on top, I am not sure we can afford to lose problem-solving capabilities to any extent. And I really, really don't think AI is worth the fantastical energy expenditure, waste of resources and human exploitation, so far.
It was discussed and reproduced on GPT-5 on HN couple of days ago: https://news.ycombinator.com/item?id=44832908
Sibling poster is probably mistakenly thinking of the strawberry issue from 2024 on older LLM models.
Shouldn't the correct answer be that there is not a "B" in "blueberry"?
No, why?
It depends on context. English is often not very precise and relies on implied context clues. And that's good: it makes communication more efficient in general.
To spell it out: in this case I suspect you are talking about English letter case? Most people don't care about case when they ask these questions, especially in an informal question.
Ask a kid that doesn't know how to read and write how many Bs there are in blueberry.
For a kid that doesn't know how to read or write, ChatGPT writes way too much.
I don’t solve math problems with my poetry writing skills:
https://chatgpt.com/share/689ba837-8ae0-8013-96d2-7484088f27...
> they clearly don't have any world model whatsoever
Then how did an LLM get gold on the mathematical Olympiad, where it certainly hadn’t seen the questions before? How on earth is that possible without a decent working model of mathematics? Sure, LLMs might make weird errors sometimes (nobody is denying that), but clearly the story is rather more complicated than you suggest.
> where it certainly hadn’t seen the questions before?
What are you basing this certainty on?
And even if you're right that the specific questions had not come up, it may still be that the questions from the math olympiad were rehashes of similar questions in other texts, or happened to correspond well to a composition of some other problems that were part of the training set, such that the LLM could 'pick up' on the similarity.
It's also possible that the LLM was specifically trained on similar problems, or may even have a dedicated sub-net or tool for it. Still impressive, but possibly not in a way that generalizes even to math like one might think based on the press releases.
> What are you basing this certainty on?
People make up new questions for each IMO.
Didn’t OpenAI get caught bribing their way to pre-tournament access of the questions?
This is the first time I hear about this. (It's certainly possible, but I'd need to see some evidence or at least a write-up.)
OpenAI got flamed over announcing their results before the embargo was up:
IMO had asked companies to wait at least a week or so after the human winners were announced to announce the AI results. OpenAI did not wait.
Like the other reply said, each exam has entirely new questions which are of course secret until the test is taken.
Sure, the questions were probably in a similar genre as existing questions or required similar techniques that could be found in solutions that are out there. So what? You still need some kind of world model of mathematics in which to understand the new problem and apply the different techniques to solve it.
Are you really claiming that SOTA LLMs don’t have any world model of mathematics at all? If so, can you tell us what sort of example would convince you otherwise? (Note that the ability to do novel mathematics research is setting the bar too high, because many capable mathematics majors never get to that point, and they clearly have a reasonable model of mathematics in their heads.)
https://arxiv.org/abs/2508.01191
I think both the interpretability literature and explorations of internal representations actually reinforce the author's conclusion. Internal representation research tends to show that nets dealing with a single "model" don't necessarily have the same representation, and don't necessarily have a single representation.
And doing well on XYZ isn't evidence of a world model in particular. The point that these things aren't always using a world model is reinforced by systems being easily confused by extraneous information, even systems as sophisticated as those that can solve Math Olympiad questions. The literature has said "ad-hoc predictors" for a long time and I don't think much has changed, except that things do better on benchmarks.
And, humans too can act without a consistent world model.
Any suggestions from this literature?
The papers from Anthropic on interpretability are pretty good. They look at how certain concepts are encoded within the LLM.
> LLMs are not by themselves sufficient as a path to general machine intelligence; in some sense they are a distraction because of how far you can take them despite the approach being fundamentally incorrect.
I don't believe that it is a fundamentally incorrect approach. I believe that the human mind does something like that all the time; the difference is that our minds have some additional processes that can, for example, filter out hallucinations.
Kids at a specific age range are afraid of their imagination. Their imagination can place a monster into any dark place where nothing can be seen. An adult mind can do the same easily, but the difference is that kids have difficulty distinguishing imagination from perception, while adults generally manage.
I believe the ability of the human mind to tell the difference between imagination/hallucination on the one hand and perception and memory on the other is not a fundamental thing stemming from the architecture of the brain but a learned skill. Moreover, people can be tricked into acquiring false memories [1]. If an LLM fell for the tricks of Elizabeth Loftus, we'd say the LLM hallucinated.
What LLMs need is to learn some tricks to detect hallucinations. They probably will not get a 100% reliable detector, but to get to the level of humans they don't need 100% reliability.
I have recently lived through something called a psychotic break, which was an unimaginably horrible thing, but it did let me see from the inside what insanity does to your thinking. And what's fascinating, coming out the other side of this, is how similar LLMs are to someone in psychosis. Someone in psychosis can have all the ability LLMs have to recognise patterns and sound like they know what they're talking about, but their brain is not working well enough to have proper self-insight, to be able to check their thoughts actually fully make sense. (And “making sense” turns out to be a sliding scale — it is not as if you just wake up one day suddenly fully rational again, there's a sliding scale of irrational thinking and you have to gradually re-process your older thoughts into more and more coherent shapes as your brain starts to work more correctly again.) I believe this isn't actually a novel insight either, many have worried about this for years! Psychosis might be an interesting topic to read about if you want to get another angle to understand the AI models from. I won't claim that it's exactly the same thing, but I will say that most people probably have a very undeveloped idea of what mental illness actually is or how it works, and that leaves them badly prepared for interacting with a machine that has a strong resemblance to a mentally ill person who's learned to pretend to be normal.
Thank you for sharing, and sorry you had to go through that. I had a good friend go through a psychotic break and I spent a long time trying to understand what was going on in his brain. The only solid conclusion I could come to was that I could not relate to what he was going through, but that didn’t change that he was obviously suffering and needed whatever support I could offer. Thanks for giving me a little bit of insight into his brain. Hope you were/are able to find support out there.
If we take simply a panic attack, many people have no clue what it feels like, which is unfortunate, because they lack empathy for those who do experience it. My psychiatrists definitely need to experience it to understand.
Do you have many memories of that time, around 3 to 5, and remember what your cognitive processes were?
When the child is afraid of the monster in the dark, they are not literally visually hallucinating a beast in the dark; they are worried that there could be one, and they cannot be sure there isn't, due to a lack of sensory information confirming the absence of a monster. They are not being hyper-precise, because they are 3, so they say "there is a monster under my bed"! Children have instincts to be afraid of the dark.
Similarly with imaginary friends and play, it's an instinct to practice through smaller stakes simulations. When they are emotionally attached to their imaginary friends, it's much like they are emotionally attached to their security blanket. They know that the "friend" is not perceptible.
It's much like the projected anxieties of adults or teenagers, who are worried that everyone thinks they are super lame and thus act like people do, because on the balance of no information, they choose the "safer path".
That is pretty different than the hallucinations of LLMs IMO.
From my perspective, the fundamental problem arises from the assumption that all of the brain's functions are self-contained; however, there are feedback loops in the body which support the functions of the brain.
The simplest one is fight/flight/freeze. The brain starts the process by being afraid and hormones get released, but the next step is triggered by nerve feedback coming from the body. If you are on beta-blockers and can't panic, the initial trigger fizzles and you return to your pre-panic state.
An LLM doesn't model a complete body. It just models language, which is only a small part of what the brain handles, so assuming that modelling language, or even the whole brain, is going to answer all the questions we have is a flawed approach.
The latest research shows the body is a much more complicated and interconnected system than we learnt in school 30 years ago.
Sure, your points about the body aren’t wrong, but (as you say) LLMs are only modelling a small subset of a brain’s functions at the moment: applied knowledge, language/communication, and recently interpretation of visual data. There’s no need or opportunity for an LLM (as they currently exist) to do anything further. Further, just because additional inputs exist in the human body (gut-brain axis, for example) it doesn’t mean that they are especially (or at all) relevant for knowledge/language work.
The point is that knowledge/language work can't work reliably unless it's grounded in something outside of itself. Without it you don't get an oracle, you get a superficially convincing but fundamentally unreliable idiot savant who lacks a stable sense of self, other, or real world.
The fundamental foundation of science and engineering is reliability.
If you start saying reliability doesn't matter, you're not doing science and engineering any more.
I'm really struggling to understand what you're trying to communicate here; I'm even wondering if you're an LLM set up to troll, due to the weird language and confusing non-sequiturs.
> The point is that knowledge/language can't work reliably unless it's grounded in something outside of itself.
Just, what? Knowledge is facts, somehow held within a system allowing recall and usage of those facts. Knowledge doesn't have a 'self', and I'm totally not understanding how pure knowledge as a concept or medium needs "grounding"?
Being charitable, it sounds more like you're trying to describe "wisdom" - which might be considered as a combination of knowledge, lived experience, and good judgement? Yes, this is valuable in applying knowledge more usefully, but has nothing to do with the other bodily systems which interact with the brain, which is where you started?
> The fundamental foundation of science and engineering is reliability.
> If you start saying reliability doesn't matter, you're not doing science and engineering any more.
No-one mentioned reliability - not you in your original post, or me in my reply. We were discussing whether the various (unconscious) systems which link to the brain in the human body (like the gut:brain axis) might influence its knowledge/language/interpretation abilities.
You probably know the Law of Archimedes. Many people do. But do you know it in the same way Archimedes did? No. You were told the law, then taught how to apply it. But Archimedes discovered it without any of that.
Can we repeat the feat of Archimedes? Yes, we can, but first we'd have to forget what we were told and taught.
The way we actually discover things is very different from amassing lots of hearsay. Indeed, we do have an internal part that behaves the same way an LLM does. But to get to real understanding we actually shut down that part, forget what we "know", and start from a clean slate. That part does not help us think; it helps us avoid thinking. The reason it exists is that it is useful: thinking is hard and slow, but recalling is easy and fast. But it is not thinking; it is the opposite.
> But to get to the real understanding we actually shut down that part, forget what we "know", start from a clean slate.
Close, but not exactly. To start from a clean slate is not very difficult, the trick is to reject some chosen parts of existing knowledge, or more specifically the difficulty is to choose what to reject. Starting from a clean slate you'll end up spending millennia to get the knowledge you've just rejected.
So the overall process of generating knowledge is to look under the streetlight till finding something new becomes impossible or too hard, and then you start experimenting with rejecting some bits of your knowledge to rethink them. I was taught to read works of Great Masters of the past critically, trying to reproduce their path while looking for forks where you can try to go the other way. It is a little bit like starting from a clean slate, but not exactly.
> If LLM fell to tricks of Elizabet Loftus, we'd say LLM hallucinated.
She's strongly oversold how and when false memories can be created. She testified in defense of Ghislaine Maxwell at her 2021 trial that financial incentives can create false memories and only later admitted that there were no studies to back this up when directly questioned.
She's spent a career over-generalizing data about implanting false minor memories to make money discrediting victims' traumatic memories and defend abusers.
You conflate "hallucination" with "imagination", but the former has much more in common with lying than it does with imagining.
> She testified in defense of Ghislaine Maxwell at her 2021 trial that financial incentives can create false memories and only later admitted that there were no studies to back this up when directly questioned.
Did she have financial incentives? Was this a live demonstration? :P
> I believe, that human mind does something like that all the time
Absolutely not. Human brains have online one-shot training. LLMs' weights are fixed, and fine-tuning them is a huge multi-year enterprise.
Fundamentally they are two completely different architectures.
I really don't like how you reject the idea completely. People have online one-shot training, but have you tried to learn to play the piano? To learn it you need a lot of repetitions. Really a lot. You need a lot of repetitions to learn how to walk, or how to do arithmetic, or how to read English. This is very similar to LLMs, isn't it? So they are not completely different architectures, are they? It is more like human brains have something on top of an "LLM" that allows them to do tricks that LLMs couldn't do.
That whole bit about color blending and transparency and LLMs "not knowing colors" is hard to believe. I am literally using LLMs every day to write image-processing and computer vision code using OpenCV. It seamlessly reasons across a range of concepts like color spaces, resolution, compression artifacts, filtering, segmentation and human perception. I mean, removing the alpha from a PNG image was a preprocessing step it wrote by itself as part of a larger task I had given it, so it certainly understands transparency.
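For the curious, a minimal sketch of that kind of preprocessing step with OpenCV (my own illustration, not the model's actual output; the file names are hypothetical): flatten a BGRA image onto a plain background before further processing.

    import cv2
    import numpy as np

    img = cv2.imread("input.png", cv2.IMREAD_UNCHANGED)  # hypothetical path; keeps alpha if present
    if img is not None and img.ndim == 3 and img.shape[2] == 4:
        alpha = img[:, :, 3:4].astype(np.float32) / 255.0
        bgr = img[:, :, :3].astype(np.float32)
        # composite onto white rather than just dropping alpha, so semi-transparent pixels don't go dark
        img = (bgr * alpha + 255.0 * (1.0 - alpha)).astype(np.uint8)
    if img is not None:
        cv2.imwrite("flattened.png", img)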
I even often describe the results e.g. "this fails when in X manner when the image has grainy regions" and it figures out what is going on, and adapts the code accordingly. (It works with uploading actual images too, but those consume a lot of tokens!)
And all this in a rather niche domain that seems relatively less explored. The images I'm working with are rather small and low-resolution, which most literature does not seem to contemplate much. It uses standard techniques well known in the art, but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.
If it can reason about images and vision and write working code for niche problems I throw at it, whether it "knows" colors in the human sense is a purely philosophical question.
> it wrote by itself as part of a larger task I had given it, so it certainly understands transparency
Or it’s a common step or a known pattern or combination of steps that is prevalent in its training data for certain input. I’m guessing you don’t know what’s exactly in the training sets. I don’t know either. They don’t tell ;)
> but it adapts and combines them well to suit my particular requirements. So they seem to handle "novel" pretty well too.
We tend to overestimate the novelty of our own work and our methods and at the same time, underestimate the vastness of the data and information available online for machines to train on. LLMs are very sophisticated pattern recognizers. It doesn't mean that what you are doing specifically has been done in this exact way before; rather, the patterns adapted and the approach may just not be one of a kind.
> is a purely philosophical question
It is indeed. A question we need to ask ourselves.
> We tend to overestimate the novelty of our own work and our methods and at the same time, underestimate the vastness of the data and information available online for machines to train on. LLMs are very sophisticated pattern recognizers.
If LLMs are stochastic parrots, but also we’re just stochastic parrots, then what does it matter? That would mean that LLMs are in fact useful for many things (which is what I care about far more than any abstract discussion of free will).
We're not just stochastic parrots though; we can parrot things stochastically when that has utility, but we can also be original. The first time that work was done, it was done by a person, autonomously. Current LLMs couldn't have done it the first time.
This article is interesting but pretty shallow.
0(?): there's no provided definition of what a 'world model' is. Is it playing chess? Is it remembering facts like how computers use math to blend colors? If so, then ChatGPT: https://chatgpt.com/s/t_6898fe6178b88191a138fba8824c1a2c has a world model, right?
1. The author seems to conflate context windows with failing to model the world in the chess example. I challenge them to give a SOTA model an image of a chess board (or the notation) and ask it about the position. It might not give you GM-level analysis, but it definitely has a model of what's going on.
2. Without explaining which LLM they used or sharing the chats these examples are just not valuable. The larger and better the model, the better its internal representation of the world.
You can try it yourself. Come up with some question involving interacting with the world and / or physics and ask GPT-5 Thinking. It’s got a pretty good understanding of how things work!
https://chatgpt.com/s/t_689903b03e6c8191b7ce1b85b1698358
In my opinion the author refers to an LLM's inability to create an inner world, a world model.
That means it does not build a mirror of a system based on its interactions.
It just outputs fragments of the world models it was built on and tries to give you a string of fragments that match the fragment of your world model that you provided through some input method.
It cannot abstract the code base fragments you share; it cannot extend them with details using a model of the whole project.
A "world model" depends on the context which defines which world the problem is in. For chess, which moves are legal and needing to know where the pieces are to make legal moves are parts of the world model. For alpha blending, it being a mathematical operation and the visibility of a background given the transparency of the foreground are parts of the world model.
The examples are from all the major commercial American LLMs as listed in a sister comment.
You seem to conflate context windows with tracking chess pieces. The context windows are more than large enough to remember 10 moves. The model should either track the pieces, or mention that it would be playing blindfold chess absent a board to look at and it isn't good at this, so could you please list the position after every move to make it fair, or it doesn't know what it's doing; it's demonstrably the latter.
If you train an LLM on chess, it will learn that too. You don't need to explain the rules, just feed it chess games, and it will stop making illegal moves at some point. It is a clear example of an inferred world model from training.
https://arxiv.org/abs/2501.17186
PS "Major commercial American LLM" is not very meaningful, you could be using GPT4o with that description.
This is the best and clearest explanation I have yet seen of a tricky thing, namely that LLMs, which are synonymous with "AI" for so many people, are just one variation of many possible types of machine intelligence.
Which I find important because, well, hallucinating facts is what you would expect from an LLM, but it isn't necessarily an inherent issue with machine intelligence writ large if it's trained from the ground up on different principles, or modelling something else. We use LLMs as a stand-in for tutors because being really good at language incidentally makes them able to explain math or history as a side effect.
Importantly, it doesn't show that hallucinating is a baked-in problem for AI writ large. Presumably different models will have different kinds of systemic errors based on their respective designs.
Great quote at the end that I think I resonate a lot with:
> Feeding these algorithms gobs of data is another example of how an approach that must be fundamentally incorrect at least in some sense, as evidenced by how data-hungry it is, can be taken very far by engineering efforts — as long as something is useful enough to fund such efforts and isn’t outcompeted by a new idea, it can persist.
I think they know this but don't have causality built in, in the sense that they aren't incentivised to understand holistically. Kids around 4 years old spam "why? why? why?" questions, and I think this is a process we have yet to reproduce. (BTW I suspect they ask this as a manifestation of what is going on in their brains and not out of real curiosity, as they ask the same question multiple times.)
I think it's mostly because they are incentivised to answer verbatim, like medical students, and not from their own understanding. RL methods change that.
What with this and your previous post about why sometimes incompetent management leads to better outcomes, you are quickly becoming one of my favorite tech bloggers. Perhaps I enjoyed the piece so much because your conclusions basically track mine. (I'm a software developer who has dabbled with LLMs, and has some hand-wavey background on how they work, but otherwise can claim no special knowledge.) Also your writing style really pops. No one would accuse your post of having been generated by an LLM.
thank you for your kind words!
Agree in general with most of the points, except
> but because I know you and I get by with less.
Actually we got far more data and training than any LLM. We've been gathering and processing sensory data every second at least since birth (more processing than gathering when asleep), and are only really considered fully intelligent in our late teens to mid-20s.
Don't forget the millions of years of pre-training! ;)
It seems clear to me that LLMs are a useful sort of dumb smart activity. They can take some pretty useful stabs in the dark, and therefore do much better in an environment which can give them feedback (coding) or where there is no objective correct answer (write a poem). It opens the door for some novel type of computational tasks, and the more feedback you can provide within the architecture of your application, the more useful the LLM will probably be. I think the hype of their genuine intelligence is overblown, but doesn’t mean they are not useful.
Maybe pure language models aren't world models, but Genie 3 for example seems to be a pretty good world model:
https://deepmind.google/discover/blog/genie-3-a-new-frontier...
We also have multimodal AIs that can do both language and video. Genie 3 made multimodal with language might be pretty impressive.
Focusing only on what pure language models can do is a bit of a straw man at this point.
Here's what LLMs remind me of.
When I went to uni, we had tutorials several times a week. Two students, one professor, going over whatever was being studied that week. The professor would ask insightful questions, and the students would try to answer.
Sometimes, I would answer a question correctly without actually understanding what I was saying. I would be spewing out something that I had read somewhere in the huge pile of books, and it would be a sentence, with certain special words in it, that the professor would accept as an answer.
But I would sometimes have this weird feeling of "hmm I actually don't get it" regardless. This is kinda what the tutorial is for, though. With a bit more prodding, the prof will ask something that you genuinely cannot produce a suitable word salad for, and you would be found out.
In math-type tutorials it would be things like realizing some equation was useful for finding an answer without having a clue about what the equation actually represented.
In economics tutorials it would be spewing out words about inflation or growth or some particular author but then having nothing to back up the intuition.
This is what I suspect LLMs do. They can often be very useful to someone who actually has the models in their minds, but not the data to hand. You may have forgotten the supporting evidence for some position, or you might have missed some piece of the argument due to imperfect memory. In these cases, LLM is fantastic as it just glues together plausible related words for you to examine.
The wheels come off when you're not an expert. Everything it says will sound plausible. When you challenge it, it just apologizes and pretends to correct itself.
Good on you for having the meta-cognition to recognize it.
I've graded many exams in my university days (and set some myself), and it's exceedingly obvious that that's what many students are doing. I do wonder though how often they manage to fly under the radar. I'm sure it happens, as you described.
(This is also the reason why I strongly believe that in exams where students write free-form answers, points should be subtracted for incorrect statements even if a correct solution is somewhere in the word salad.)
> When you challenge it, it just apologizes and pretends to correct itself.
Even when it was right the first time!
This article appeared on HN a while ago. https://dynomight.net/more-chess/ It basically is in agreement with this article and provides a few more trials and explanations.
Don’t: use LLMs to play chess against you
Do: use LLMs to talk shit to you while a real chess AI plays chess against you.
The above applies to a lot of things besides chess, and illustrates a proper application of LLMs.
Are you suggesting that we use an LLM as an interface between the AI and the player?
Why would anyone choose to awkwardly play using natural language rather than a reliable, fast and intuitive UI?
No, I think they're suggesting the LLM should literally be "talking shit", e.g. in a chat window alongside the game UI, as if you're in a live chat with another player. As in, use the LLM for processing language, and the chess engine for playing chess.
I think this is quite an amusing idea, as the LLM would see the moves the chess engine made and comment along the lines of "wow, I didn't see that one coming!" very Roger Sperry.
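A sketch of that division of labour, with the python-chess library driving a real UCI engine for the moves and an entirely hypothetical ask_llm call supplying the commentary (assumes a stockfish binary on PATH):

    import chess
    import chess.engine

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError  # hypothetical: your chat model of choice

    board = chess.Board()
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # assumed engine binary
    try:
        move = engine.play(board, chess.engine.Limit(time=0.1)).move  # the engine picks the move
        board.push(move)
        # the LLM only talks; it never chooses a move
        print(ask_llm(f"You just played {move.uci()}. Talk some light trash to your opponent."))
    finally:
        engine.quit()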
I don’t think you even need such complex tests as chess to see it doesn’t have a world model - just ask it to do any 5+ digit multiplication
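A throwaway probe along those lines (ask_llm is a hypothetical stand-in for whichever chat API you use); exact arithmetic makes the check trivial to grade:

    import random

    a, b = random.randint(10_000, 99_999), random.randint(10_000, 99_999)
    prompt = f"What is {a} * {b}? Reply with only the number."
    expected = a * b
    # answer = ask_llm(prompt)                 # hypothetical call
    # print(int(answer.strip()) == expected)   # compare against exact arithmetic
    print(prompt, "->", expected)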
As far as I can tell they don’t say which LLM they used which is kind of a shame as there is a huge range of capabilities even in newly released LLMs (e.g. reasoning vs not).
ChatGPT, Claude, Grok and Google AI Overviews (whatever powers the latter) were all used in one or more of these examples, in various configurations. I think they can perform differently, and I often try more than one when the first try doesn't work great. I don't think there's any fundamental difference in the principle of their operation, and I think there never will be; there will be another major breakthrough instead.
Each of these models has a thinking/reasoning variant and a default non-thinking variant. I would expect the reasoning variants (o3 or “GPT5 Thinking”, Gemini DeepThink, Claude with Extended Thinking, etc) to do better at this. I think there is also some chance that in their reasoning traces they may display something you might see as closer to world modelling. In particular, you might find them explicitly tracking positions of pieces and checking validity.
My hypothesis is that a model fails to switch into a deep thinking mode (if it has it) and blurts whatever it got from all the internet data during autoregressive training. I tested it with alpha-blending example. Gemini 2.5 flash - fails, Gemini 2.5 pro - succeeds.
How does the presence or absence of a world model, er, blend into all this? I guess "having a consistent world model at all times" is an incorrect description of humans, too. We seem to have one because we have mechanisms to notice errors, correct errors, remember the results, and use the results when similar situations arise, while slowly updating intuitions about the world to incorporate changes.
The current models lack "remember/use/update" parts.
> I don't think there's any fundamental difference in the principle of their operation
Yeah, they seem to be subject to the universal approximation theorem (it needs to be checked more thoroughly, but I think we can build a transformer that is equivalent to any given fully-connected multilayered network).
That is, at a certain size they can do anything a human can do at a given point in their life (that is, with no additional training), regardless of whether humans have world models and what those models are at the neuronal level.
But there are additional nuances that are related to their architectures and training regimes. And practical questions of the required size.
It doesn't matter. These limitations are fundamental to LLMs, so all of them that will ever be made suffer from these problems.
https://arxiv.org/abs/2501.17186
This is interesting. The "professional level" rating of <1800 isn't really professional level, but still.
However:
"A significant Elo rating jump occurs when the model’s Legal Move accuracy reaches 99.8%. This increase is due to the reduction in errors after the model learns to generate legal moves, reinforcing that continuous error correction and learning the correct moves significantly improve ELO"
You should be able to reach move legality of around 100% with few resources spent on it. Failing to do so means that it has not learned a model of what chess is, at some basic level. There is virtually no challenge in making legal moves.
> Failing to do so means that it has not learned a model of what chess is, at some basic level.
I'm not sure about this. Among a standard amateur set of chess players, how often when they lack any kind of guidance from a computer do they attempt to make a move that is illegal? I played chess for years throughout elementary, middle and high school, and I would easily say that even after hundreds of hours of playing, I might make two mistakes out of a thousand moves where the move was actually illegal, often because I had missed that moving that piece would continue to leave me in check due to a discovered check that I had missed.
It's hard to conclude from that experience that players that are amateurs lack even a basic model of chess.
> r4rk1/pp6/8/4p2Q/3n4/4N3/qP5P/2KRB3 w - - 3 27
Can you say 100% you can generate a good next move (example from the paper) without using tools, and will never accidentally make a mistake and give an illegal move?
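For contrast, a minimal sketch with the python-chess library: finding a good move in that position is hard, but enumerating the legal ones is trivial for a conventional program.

    import chess

    # the paper's example position, with the usual '/' rank separators restored
    board = chess.Board("r4rk1/pp6/8/4p2Q/3n4/4N3/qP5P/2KRB3 w - - 3 27")
    legal = list(board.legal_moves)
    print(len(legal), "legal moves, e.g.", board.san(legal[0]))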
Language models aren't world models for the same reason languages aren't world models.
Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".
They only have meaning to sentient beings, and that meaning is heavily subjective and contextual.
But there appear to be some who think that we can grasp truth through mechanical symbol manipulation. Perhaps we just need to add a few million more symbols, they think.
If we accept the incompleteness theorem, then there are true propositions that even a super-intelligent AGI would not be able to express, because all it can do is output a series of placeholders. Not to mention the obvious fallacy of knowing super-intelligence when we see it. Can you write a test suite for it?
> Symbols, by definition, only represent a thing.
This is missing the lesson of the Yoneda Lemma: symbols are uniquely identified by their relationships with other symbols. If those relationships are represented in text, then in principle they can be inferred and navigated by an LLM.
Some relationships are not represented well in text: tacit knowledge like how hard to twist a bottle cap to get it to come off, etc. We aren't capturing those relationships between all your individual muscles and your brain well in language, so an LLM will miss them or have very approximate versions of them, but... that's always been the problem with tacit knowledge: it's the exact kind of knowledge that's hard to communicate!
I don't think it's a communication problem so much as that there is no possible relation between a word and a (literal) physical experience. They're, quite literally, on different planes of existence.
When I have a physical experience, sometimes it results in me saying a word.
Now, maybe there are other possible experiences that would result in me behaving identically, such that from my behavior (including what words I say) it is impossible to distinguish between different potential experiences I could have had.
But, “caused me to say” is a relation, is it not?
Unless you want to say that it wasn’t the experience that caused me to do something, but some physical thing that went along with the experience, either causing or co-occurring with the experience, and also causing me to say the word I said. But, that would still be a relation, I think.
Yes, but it's a unidirectional relation: it was the result of the experience. The word cannot represent the context (the experience), in a meaningful way.
It's like trying to describe a color to a blind person: poetic subjective nonsense.
I don’t know what you mean by “unidirectional relation”. I get that you gave an explanation after the colon, but I still don’t quite get what you mean. Do you just mean that what words I use doesn’t pick out a unique possible experience? That’s true of course, but I don’t know why you call that “unidirectional”
I don’t think describing colors to a blind person is nonsense. One can speak of how the different colors relate to one-another. A blind person can understand that a stop sign is typically “red”, and that something can be “borderline between red and orange”, but that things will not be “borderline between green and purple”. A person who has never had any color perception won’t know the experience of seeing something red or blue, but they can still have a mental model of the world that includes facts about the colors of things, and what effects these are likely to have, even though they themselves cannot imagine what it is like to see the colors.
IMO, the GP's idea is that you can't explain sounds to a deaf man, or emotions to someone who doesn't feel them. All that needs direct experience and words only point to our shared experience.
Well shit, I better stop reading books then.
I think you've missed the concept here.
You exist in the full experience. That lossy projection to words is still meaningful to you, in your reading, because you know the experience it's referencing. What do I mean by "lossy projection"? It's the experience of seeing the color blue to the word "blue". The word "blue" is meaningless without already having experienced it, because the word is not a description of the experience, it's a label. The experience itself can't be sufficiently described, as you'll find if you try to explain a "blue" to a blind person, because it exists outside of words.
The concept here is that something like an LLM, trained on human text, can't have meaningful comprehension of some concepts, because some words are labels for things that exist entirely outside of text.
You might say "but multimodal models use tokens for color!", or even extend that to "you could replace the tokens used in multimodal models with color names!", and I would agree. But the understanding wouldn't come from the relation of words in human text; it would come from the positional relation of colors across a space, which is not much different from our experience of the color on our retina.
tldr: to get AI to meaningfully understand something, you have to give it a meaningful relation. Meaningful relations sometimes aren't present in human writing.
> Symbols, by definition, only represent a thing. They are not the same as the thing
First of all, the point isn't about the map becoming the territory, but about whether LLMs can form a map that's similar to the map in our brains.
But to your philosophical point, assuming there are only a finite number of things and places in the universe - or at least the part of which we care about - why wouldn't they be representable with a finite set of symbols?
What you're rejecting is the Church-Turing thesis [1] (essentially, that all mechanical processes, including that of nature, can be simulated with symbolic computation, although there are weaker and stronger variants). It's okay to reject it, but you should know that not many people do (even some non-orthodox thoughts by Penrose about the brain not being simulatable by an ordinary digital computer still accept that some physical machine - the brain - is able to represent what we're interested in).
> If we accept the incompleteness theorem
There is no if there. It's a theorem. But it's completely irrelevant. It means that there are mathematical propositions that can't be proven or disproven by some system of logic, i.e. by some mechanical means. But if something is in the universe, then it's already been proven by some mechanical process: the mechanics of nature. That means that if some finite set of symbols could represent the laws of nature, then anything in nature can be proven in that logical system. Which brings us back to the first point: the only way the mechanics of nature cannot be represented by symbols is if they are somehow infinite, i.e. they don't follow some finite set of laws. In other words - there is no physics. Now, that may be true, but if that's the case, then AI is the least of our worries.
Of course, if physics does exist - i.e. the universe is governed by a finite set of laws - that doesn't mean that we can predict the future, as that would entail both measuring things precisely and simulating them faster than their operation in nature, and both of these things are... difficult.
[1]: https://plato.stanford.edu/entries/church-turing/
> First of all, the point isn't about the map becoming the territory, but about whether LLMs can form a map that's similar to the map in our brains.
It should be capable of something similar (fsvo similar), but the largest difference is that humans have to be power-efficient and LLMs do not.
That is, people don't actually have world models, because modeling something is a waste of time and energy insofar as it's not needed for anything. People are capable of taking out the trash without knowing what's in the garbage bag.
> Of course, if physics does exist - i.e. the universe is governed by a finite set of laws
Wouldn't physics still "exist" even if there were an infinite set of laws?
Well, the physical universe will still exist, but physics - the scientific study of said universe - will become sort of meaningless, I would think?
Why meaningless? Imperfect knowledge can still be useful, and ultimately that's the only kind we can ever have about anything.
"We could learn to sail the oceans and discover new lands and transport cargo cheaply... But in a few centuries we'll discover we were wrong and the Earth isn't really a sphere and tides are extra-complex so I guess there's no point."
Because if there's an infinite number of laws, are they laws at all? You can't predict anything, because you don't even know whether one of the laws you don't know yet (which is pretty much all of them) makes an exception to the 0% of laws you do know. I'm not saying it's not interesting, but it's more history - today the apple fell down rather than up or sideways - than physics.
In the infinite set of all laws is there an infinite set of laws that do not conflict with each other?
.000000000000001% of infinity is still infinite.
> course, if physics does exist - i.e. the universe is governed by a finite set of laws
That statement is problematic. It implies a metaphysical set of laws that make physical stuff relate a certain way.
The Humean way of looking at physics is that we notice relationships and model those with various symbols. The symbols form incomplete models because we can't get to the bottom of why the relationships exist.
> that doesn't mean that we can predict the future, as that would entail both measuring things precisely and simulating them faster than their operation in nature, and both of these things are... difficult.
The indeterminism of quantum mechanics limits how precise measurement can be and how predictable the future is.
> That statement is problematic. It implies a metaphysical set of laws that make physical stuff relate a certain way.
What I meant was that since physics is the scientific search for the laws of nature, then if there's an infinite number of them, then the pursuit becomes somewhat meaningless, as an infinite number of laws aren't really laws at all.
> They symbols form incomplete models because we can't get to the bottom of why the relationships exist.
Why would a model be incomplete if we don't know why the laws are what they are? A model pretty much is a set of laws; it doesn't require an explanation (we may want such an explanation, but it doesn't improve the model).
First: true propositions (that are not provable) can definitely be expressed, if they couldn't, the incompleteness theorem would not be true ;-)
It would be interesting to know what percentage of the people who invoke the incompleteness theorem have no clue what it actually says.
Most people don't even know what a proof is, so that cannot be a hindrance on the path to AGI ...
Second: ANY world model that can be digitally represented would be subject to the same argument (if stated correctly), not only LLMs.
I knew someone would call me out on that. I used the wrong word; what I meant was "expressed in a way that would satisfy" which implies proof within the symbolic order being used. I don't claim to be a mathematician or philosopher.
Well, you don't get it. The LLM definitely can state propositions "that satisfy", let's just call them true propositions, and that this is not the same as having a proof for it is what the incompleteness theorem says.
Why would you require an LLM to have proof for the things it says? I mean, that would be nice, and I am actually working on that, but it is not anything we would require of humans and/or HN commenters, would we?
I clearly do not meet the requirements to use the analogy.
I am hearing the term super intelligence a lot and it seems to me the only form that would take is the machine spitting out a bunch of symbols which either delight or dismay the humans. Which implies they already know what it looks like.
If this technology will advance science or even be useful for everyday life, then surely the propositions it generates will need to hold up to reality, either via axiomatic rigor or empirically. I look forward to finding out if that will happen.
But it's still just a movement from the known to the known, a very limited affair no matter how many new symbols you add in whatever permutation.
> Language models aren't world models for the same reason languages aren't world models. Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".
Symbols, maps, descriptions, and words are useful precisely because they are NOT what they represent. Representation is not identity. What else could a “world model” be other than a representation? Aren’t all models representations, by definition? What exactly do you think a world model is, if not something expressible in language?
> Aren’t all models representations, by definition? What exactly do you think a world model is, if not something expressible in language?
I was following the string of questions, but I think there is a logical leap between those two questions.
Another question: is Language the only way to define models? An imagined sound or an imagined picture of an apple in my minds-eye are models to me, but they don't use language.
Gödel’s incompleteness theorems aren’t particularly relevant here. Given how often people attempt to apply them to situations where they don’t say anything of note, I think the default should generally be to not publicly appeal to them unless one either has worked out semi-carefully how to derive the thing one wants to show from them, or at least have a sketch that one is confident, from prior experience working with it, that one could make into a rigorous argument. Absent these, the most one should say, I think, is “Perhaps one can use Gödel’s incompleteness theorems to show [thing one wants to show].” .
Now, given a program that is supposed to output text that encodes true statements (in some language), one can probably define some sort of inference system that corresponds to the program such that the inference system is considered to “prove” any sentence that the program outputs (and maybe also some others based on some logical principles, to ensure that the inference system satisfies some good properties), and upon defining this, one could (assuming the language allows making the right kinds of statements about arithmetic) show that this inference system is, by Gödel’s theorems, either inconsistent or incomplete.
This wouldn’t mean that the language was unable to express those statements. It would mean that the program either wouldn’t output those statements, or that the system constructed from the program was inconsistent (and, depending on how the inference system is obtained from the program, the inference system being inconsistent would likely imply that the program sometimes outputs false or contradictory statements).
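To make that construction concrete, here is one way it could be written down (a sketch in my own notation; T_P, Cn, and G_P are names I'm introducing for illustration, not anything from the thread):

```latex
% Sketch only. P is a program emitting sentences of a language L that can
% encode basic arithmetic. Collect everything P ever outputs and close it
% under first-order consequence:
\[
  T_P \;=\; \mathrm{Cn}\bigl(\{\, \varphi \in L \;:\; P \mbox{ eventually outputs } \varphi \,\}\bigr)
\]
% T_P is recursively enumerable, since P is a program. So if T_P is
% consistent (and proves enough arithmetic), the Gödel/Rosser argument gives
% a sentence G_P of L that T_P neither proves nor refutes:
\[
  T_P \not\vdash G_P
  \qquad \mbox{and} \qquad
  T_P \not\vdash \neg G_P .
\]
% That is a fact about what P outputs, not about what L can express.
```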
But, this has basically nothing to do with the “placeholders” thing you said. Gödel’s theorem doesn’t say that some propositions are inexpressible in a given language, but that some propositions can’t be proven in certain axiom+inference systems.
Rather than the incompleteness theorems, Tarski’s “undefinability of truth” result seems more relevant to the kind of point I think you are trying to make.
Still, I don’t think it will show what you want it to, even if the thing you are trying to show is true. Like, perhaps it is impossible to capture qualia with language, sure, makes sense. But logic cannot show that there are things which language cannot in any way (even collectively) refer to, because to show that there is a thing it has to refer to it.
————
“Can you write a test suite for it?”
Hm, might depend on what you count as a “suite”, but a test protocol, sure. The one I have in mind would probably be a bit expensive to run if it fails the test though (because it involves offering prize money).
There is an important implication of learning and indexing being equivalent problems. A number of important data models and data domains exist for which we do not know how to build scalable indexing algorithms and data structures.
It has been noted for several years in US national labs and elsewhere that there is an almost perfect overlap between data models LLMs are poor at learning and data models that we struggle to index at scale. If LLMs were actually good at these things then there would be a straightforward path to addressing these longstanding non-AI computer science problems.
The incompleteness is that the LLM tech literally can't represent elementary things that are important enough that we spend a lot of money trying to represent them on computers for non-AI purposes. A super-intelligent AGI being right around the corner would imply that we've solved problems we clearly haven't.
Perhaps more interesting, it also implies that AGI tech may look significantly different than the current LLM tech stack.
> If we accept the incompleteness theorem
And, by various universality theorems, a sufficiently large AGI could approximate any sequence of human neuron firings to an arbitrary precision. So if the incompleteness theorem means that neural nets can never find truth, it also means that the human brain can never find truth.
Human neuron firing patterns, after all, only represent a thing; they are not the same as the thing. Your experience of seeing something isn't recreating the physical universe in your head.
> And, by various universality theorems, a sufficiently large AGI could approximate any sequence of human neuron firings to an arbitrary precision.
Wouldn't it become harder to simulate a human brain the larger a machine is? I don't know much, but I think that pesky speed-of-light thing might pose a challenge.
simulate ≠ simulate-in-real-time
All simulation is realtime to the brain being simulated.
Sure, but that’s not the clock that’s relevant to the question of the light speed communication limits in a large computer?
Everything is just a low-resolution representation of a thing. The so-called reality we supposedly have access to is at best a small number of sound waves and photons hitting our face. So I don't buy the argument that symbols are categorically different. It's a gradient, and symbols are a sparser and less rich data source, yes. But who are we to say where that hypothetical line exists, beyond which further compression of concepts into smaller numbers of buckets becomes a non-starter for intelligence and world modelling? And then there are multimodal LLMs, which have access to data of a similar richness to what humans have access to.
There are no "things" in the universe. You say this wave and that photon exist and represent this or that, but all of that is conceptual overlay. Objects are parts of speech, reality is undifferentiated quanta. Can you point to a particular place where the ocean becomes a particular wave? Your comment already implies an understanding that our mind is behind all the hypothetical lines; we impose them, they aren't actually there.
Reminds me of this [1] article. If we humans, after all these years we've been around, can't relay our thoughts exactly as we perceive them in our heads, what makes us think that we can make a model that does it better than us?
[1]: https://www.experimental-history.com/p/you-cant-reach-the-br...
I’m not a math guy but the incompleteness theorem applies to formal systems, right? I’ve never thought about LLMs as formal systems, but I guess they are?
Anything that runs on a computer is a formal system. "Formal" (the manipulation of forms) is an old term for what, after Turing, we call "mechanical".
Nor am I. I'm not claiming an LLM is a formal system, but it is mechanical and operates on symbols. It can't deal in anything else. That should temper some of the enthusiasm going around.
I don't think you can apply the incompleteness theorem like that; LLMs aren't constrained to formal systems.
> Language models aren't world models for the same reason languages aren't world models.
> Symbols, by definition, only represent a thing. They are not the same as the thing. The map is not the territory, the description is not the described, you can't get wet in the word "water".
There are a lot of negatives in there, but I feel like it boils down to: a model of a thing is not the thing. Well, duh. It's a model. A map is a model.
Right. It's a dead thing that has no independent meaning. It doesn't even exist as a thing except conceptually. The referent is not even another dead thing, but a reality that appears nowhere in the map itself. It may have certain limited usefulness in the practical realm, but expecting it to lead to new insights ignores the fact that it's fundamentally an abstraction of the real, not in relationship to it.
> but expecting it to lead to new insights ignores the fact that it's fundamentally an abstraction of the real, not in relationship to it.
Where do humans get new insights from?
Generally the experience of insight is prior to any discursive expression. We put our insights into words; they do not arise as such.
I just tried a few simple things that a world model would probably get right. E.g.:
Question to GPT5: I am looking straight on to some objects. Looking parallel to the ground.
In front of me I have a milk bottle, to the right of that is a Coca-Cola bottle. To the right of that is a glass of water. And to the right of that there’s a cherry. Behind the cherry there’s a cactus and to the left of that there’s a peanut. Everything is spaced evenly. Can I see the peanut?
Answer (after choosing thinking mode)
No. The cactus is directly behind the cherry (front row order: milk, Coke, water, cherry). “To the left of that” puts the peanut behind the glass of water. Since you’re looking straight on, the glass sits in front and occludes the peanut.
It doesn’t consider transparency until you mention it, then apologises and says it didn’t think of it.
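If anyone wants to script this against the API rather than the chat UI, something like the sketch below should work with the official OpenAI Python SDK (the model name is just a placeholder for whatever you're testing, and the wording of the prompt is my transcription of the riddle above):

```python
# Rough sketch: run the same spatial riddle programmatically.
# Requires the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

PROMPT = (
    "I am looking straight on to some objects, looking parallel to the ground. "
    "In front of me I have a milk bottle, to the right of that is a Coca-Cola "
    "bottle. To the right of that is a glass of water. And to the right of that "
    "there's a cherry. Behind the cherry there's a cactus and to the left of "
    "that there's a peanut. Everything is spaced evenly. Can I see the peanut?"
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-5",  # placeholder; substitute whichever model you want to test
    messages=[{"role": "user", "content": PROMPT}],
)
print(response.choices[0].message.content)
```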
This seems like a strange riddle. In my mind I was thinking that regardless of the glass, all of the objects can be seen (due to perspective, and also the fact that you mentioned the locations, meaning you're aware of them).
It seems to me it would only actually work in an orthographic projection, which is not how our reality works.
You can tell from the response that it does understand the riddle just fine; it just gets it wrong.
Have you asked five adults this riddle? I suspect at least two of them would get it wrong or have some uncertainty about whether or not the peanut was visible.
This. I was also thinking "yes" at first because of the glass of water, transparency, etc., but then got unsure: the objects might be spaced so widely that the milk or Coke bottle would obscure the view due to perspective, or the peanut would simply end up outside the viewer's field of vision.
Shows that even if you have a world model, it might not be the right one.
Worked for me: https://chatgpt.com/share/689bc3ef-fa1c-800f-9275-93c2dbc11b...
Gemini 2.5 Pro gets this correct on the first attempt, and specifically points out the transparency of the glass of water.
https://g.co/gemini/share/362506056ddb
Time to get the ol' goalpost-moving gloves out.
https://www.youtube.com/watch?v=LtG0ACIbmHw
SOTA LLMs do play legal moves in chess; I don't know why the article seems to say otherwise.
Technically yes, but... it's moderately tricky to get an LLM to play good chess even though it can.
https://dynomight.net/more-chess/
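One way to check the "legal moves" claim for yourself is to replay a model's suggested moves through a rules engine. A rough sketch using the python-chess library; `get_llm_move` is a hypothetical stand-in for however you actually query the model:

```python
# Sketch: count how often an LLM suggests an illegal move, using the
# python-chess library to enforce the rules.
import chess


def get_llm_move(board: chess.Board) -> str:
    """Placeholder: return the model's next move in SAN, e.g. 'Nf3'."""
    raise NotImplementedError


def count_illegal_moves(max_plies: int = 40) -> int:
    board = chess.Board()
    illegal = 0
    for _ in range(max_plies):
        if board.is_game_over():
            break
        san = get_llm_move(board)
        try:
            board.push_san(san)  # raises ValueError if the move is illegal or ambiguous
        except ValueError:
            illegal += 1
            break
    return illegal
```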
This is significant in general because I personally would love to get these things to code-switch into "hackernews poster" or "writer for the Economist" or "academic philosopher", but I think the "chat" format makes it impossible. The inaccessibility of this makes me want to host my own LLM...
Yes, LLMs can play chess, and yes, they can model it fine:
https://arxiv.org/pdf/2403.15498v2
Haha. I enjoyed that Soviet-era joke at the end.
Yes, I hadn't heard that before. It's similar in spirit to this Norwegian folk tale about a deaf man guessing what someone is saying to him:
https://en.wikipedia.org/wiki/%22Good_day,_fellow!%22_%22Axe...
Another similar story:
King Frederick the Great of Prussia had a very fine army, and none of the soldiers in it were finer than the Giant Guards, who were all extremely tall men. It was difficult to find enough soldiers for these Guards, as there were not many men who were tall enough.
Frederick had made it a rule that no soldiers who did not speak German could be admitted to the Giant Guards, and this made the work of the officers who had to find men for them even more difficult. When they had to choose between accepting or refusing a really tall man who knew no German, the officers used to accept him, and then teach him enough German to be able to answer if the King questioned him.
Frederick sometimes used to visit the men who were on guard around his castle at night to see that they were doing their job properly, and it was his habit to ask each new one that he saw three questions: “How old are you?”, “How long have you been in my army?”, and “Are you satisfied with your food and your conditions?”
The officers of the Giant Guards therefore used to teach new soldiers who did not know German the answers to these three questions.
One day, however, the King asked a new soldier the questions in a different order; he began with, “How long have you been in my army?” The young soldier immediately answered, “Twenty-two years, Your Majesty.” Frederick was very surprised. “How old are you then?” he asked the soldier. “Six months, Your Majesty,” came the answer. At this Frederick became angry. “Am I a fool, or are you one?” he asked. “Both, Your Majesty,” the soldier answered politely.
https://archive.org/details/advancedstoriesf0000hill
An LLM is not AI; it's a dumbass, too stupid NOT to assume and hallucinate.
Lots of people assume, confabulate, misremember, and lie every day.
They are not intelligent.
LOL, you really think that intelligence (however you want to define or measure the concept) is a guarantee that people won't make mistakes, misremember, make stuff up, or lie?
I agree with the article. I will be very surprised if LLMs end up being "it". I say this as a language geek who has always been amazed how language drives our thinking. However, I think language exists between brains, not inside them. There's something else in us and LLMs aren't it.
A slight tangent: I think/wonder if the one place where AIs could be really useful, might be in translating alien languages :)
As in, an alien could teach one of our AIs their language faster than an alien could teach an human, and vice versa..
..though the potential for catastrophic disasters is also great there lol
I wonder how the nature of the language used to train an LLM affects its model of the world. Would a language designed for the maximum possible information content and clarity, like Ithkuil, make an LLM's world model more accurate?
I'm surprised the models haven't been enshittified by capitalism. I think in a few years we're going to see lightning-fast LLMs generating better output compared to what we're seeing today. But it won't be 1000x better, it will be 10x better, 10x faster, and completely enshittified with ads and clickbait links. Enjoy ChatGPT while it lasts.
Not yet