Tight feedback loops are the key to working productively with software. I see that in codebases up to 700k lines of code (legacy 30yo 4GL ERP systems).
The best part is that AI-driven systems are fine with running even tighter loops than a sane human would tolerate.
Eg. running full linting, testing and E2E/simulation suite after any minor change. Or generating 4 versions of PR for the same task so that the human could just pick the best one.
> Or generating 4 versions of PR for the same task so that the human could just pick the best one.
That sounds awful. A truly terrible and demotivating way to work and produce anything of real quality. Why are we doing this to ourselves and embracing it?
A few years ago, it would have been seen as a joke to say “the future of software development will be to have a million monkey interns banging on one million keyboards and submit a million PRs, then choose one”. Today, it’s lauded as a brilliant business and cost-saving idea.
We’re beyond doomed. The first major catastrophe caused by sloppy AI code can’t come soon enough. The sooner it happens, the better chance we have to self-correct.
I'm not sure that AI code has to be sloppy. I've had some success with hand coding some examples and then asking codex to rigorously adhere to prior conventions. This can end up with very self consistent code.
Agree though on the "pick the best PR" workflow. This is pure model training work and you should be compensated for it.
Yep this is what Andrej talks about around 20 minutes into this talk.
You have to be extremely verbose in describing all of your requirements. There is seemingly no such thing as too much detail. The second you start being vague, even if it WOULD be clear to a person with common sense, the LLM views that vagueness as a potential aspect of its own creative liberty.
> the LLM views that vagueness as a potential aspect of its own creative liberty.
I think that anthropomorphism actually clouds what’s going on here. There’s no creative choice inside an LLM. More description in the prompt just means more constraints on the latent space. You still have no certainty whether the LLM models the particular part of the world you’re constraining it to in the way you hope it does though.
> You have to be extremely verbose in describing all of your requirements. There is seemingly no such thing as too much detail.
If only there was a language one could use that enables describing all of your requirements in an unambiguous manner, ensuring that you have provided all the necessary detail.
If it's monkeylike quality and you need a million tries, it's shit. If you need four tries and one of those is top-tier professional programmer quality, then it's good.
> A truly terrible and demotivating way to work and produce anything of real quality
You clearly have strong feelings about it, which is fine, but it would be much more interesting to know exactly why it would be terrible and demotivating, and why it cannot produce anything of quality? And what is "real quality" and does that mean "fake quality" exists?
> million monkey interns banging on one million keyboards and submit a million PRs
I'm not sure if you misunderstand LLMs, or the famous "monkeys writing Shakespeare" part, but that example is more about randomness and infinity than about probabilistic machines somewhat working towards a goal with some non-determinism.
> We’re beyond doomed
The good news is that we've been doomed for a long time, yet we persist. If you take a look at how the internet is basically held up by duct-tape at this point, I think you'd feel slightly more comfortable with how crap absolutely everything is. Like 1% of software is actually Good Software while the rest barely works on a good day.
If "AI" worked (which fortunately isn't the case), humans would be degraded to passive consumers in the last domain in which they were active creators: thinking.
Moreover, you would have to pay centralized corporations that stole all of humanity's intellectual output for engaging in your profession. That is terrifying.
The current reality is also terrifying: Mediocre developers are enabled to have a 10x volume (not quality). Mediocre execs like that and force everyone to use the "AI" snakeoil.
The profession becomes even more bureaucratic, tool oriented and soulless.
> And what is "real quality" and does that mean "fake quality" exists?
I think there is no real quality or fake quality, just quality. I am referencing the quality that Pirsig and C. Alexander have written about.
It’s… qualitative, so it’s hard to measure but easy to feel. Humans are really good at perceiving it then making objective decisions. LLMs don’t know what it is (they’ve heard about it and think they know).
It is actually funny that current AI+Coding tools benefit a lot from domain context and other information along the lines of Domain-Driven Design (which was inspired by the pattern language of C. Alexander).
A few teams have started incorporating `CONTEXT.MD` into module descriptions to leverage this.
"If the only tool you have is a hammer, you tend to see every problem as a nail."
I think the world's leaning dangerously into LLMs, expecting them to solve every problem under the sun. Sure, AI can solve problems, but domain 1 that Karpathy shows, if it is the body of new knowledge in the world, doesn't grow with LLMs and agents. Maybe generation and selection is the best method for working with domains 2/3, but there is something fundamentally lost in the rapid embrace of these AI tools.
A true challenge question for people is: would you give up 10 points of IQ for access to the next-gen AI model? I don't ask this in the sense that AI makes people stupid, but rather that it frames the value of intelligence as something you have, rather than something measured by how quickly you can look up or generate an answer that may or may not be correct. How we use our tools deeply shapes what we will do in the future. A cautionary tale is US manufacturing of precision tools, where we gave up on teaching people how to use lathes because they could simply run CNC machines instead. Now that industry has an extreme lack of programmers for CNC machines, making it impossible to keep up with other precision-instrument-producing countries. This of course is a normative statement and has more complex variables, but I fear that in this dead-set charge for AI we will lose sight of what makes programming languages, and programming in general, valuable.
This is why there has to be "write me a detailed implementation plan" step in between. Which files is it going to change, how, what are the gotchas, which tests will be affected or added etc.
It is easier to review one document and point out missing bits, than chase the loose ends.
Once the plan is done and good, it is usually a smooth path to the PR.
In my prompt I ask the LLM to write a short summary of how it solved the problem, run multiple instances of LLM concurrently, compare their summaries, and use the output of whichever LLM seems to have interpreted instructions the best, or arrived at the best solution.
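A minimal sketch of that fan-out step, assuming the OpenAI Python client and an illustrative model name (any provider with a chat API works the same way):

    import concurrent.futures
    from openai import OpenAI  # assumption: any chat-completions client works

    client = OpenAI()
    TASK = (
        "Solve the task described below, then end your reply with a section "
        "titled SUMMARY explaining how you interpreted the instructions and "
        "how you solved it.\n\n<task description here>"
    )

    def attempt(i: int) -> str:
        resp = client.chat.completions.create(
            model="gpt-4.1",  # illustrative model name
            messages=[{"role": "user", "content": TASK}],
        )
        return resp.choices[0].message.content

    # Run several independent attempts concurrently...
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        attempts = list(pool.map(attempt, range(4)))

    # ...then read only the summaries to decide which attempt to keep.
    for i, text in enumerate(attempts):
        print(f"--- attempt {i} ---")
        print(text.split("SUMMARY", 1)[-1].strip()[:500])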
And you trust that the summary matches what was actually done? Your experience with the level of LLMs understanding of code changes must significantly differ from mine.
It is not. The right way to work with generative AI is to get the right answer in the first shot. But it's the AI that is not living up to this promise.
Reviewing 4 different versions of AI code is grossly unproductive. A human co-worker can submit one version of code and usually have it accepted with a single review, no other "versions" to verify. With 4 versions, 75% of the code you're reading is unnecessary. Multiply this across every change ever made to a code base, and you're wasting a shitload of time.
1. People get lazy when presented with four choices they had no hand in creating; they don't look over all four, they just click one, ignoring the others. Why? Because they have ten more of these on the go at once, diminishing their overall focus.
2. Automated tests, end-to-end sim., linting, etc—tools already exist and work at scale. They should be robust and THOROUGHLY reviewed by both AI and humans ideally.
3. AI is good for code reviews and “another set of eyes” but man it makes serious mistakes sometimes.
An anecdote for (1), when ChatGPT tries to A/B test me with two answers, it’s incredibly burdensome for me to read twice virtually the same thing with minimal differences.
Code reviewing four things that do almost the same thing is more of a burden than writing the same thing once myself.
The more tedious the work is, the less motivation and passion you get for doing it, and the more "lazy" you become.
Laziness does not just come from within, there are situations that promote behaving lazy, and others that don't. Some people are just lazy most of the time, but most people are "lazy" in some scenarios and not in others.
A simple rule applies: "No matter what tool created the code, you are still responsible for what you merge into main".
As such, the task of verification still falls on the shoulders of engineers.
Given that and proper processes, modern tooling works nicely with codebases ranging from 10k LOC (mixed embedded device code with golang backends and python DS/ML) to 700k LOC (legacy enterprise applications from the mainframe era).
> As such, the task of verification still falls on the shoulders of engineers.
Even before LLMs it was a common thing to merge changes which completely break the test environment. Some people really skip the verification phase of their work.
Agreed. I think engineers, by following simple Test-Driven Development procedures, can write the code, unit tests, integration tests, debugging, etc. for a small enough unit, which by default forces tight feedback loops. AI may assist in the particulars, not run the show.
I’m willing to bet, short of droid-speak or some AI output we can't even understand, that when considering “the system as a whole”, even with short-term gains in speed, the longevity of any product will be better with real people following current best-practices, and perhaps a modest sprinkle of AI.
Why? Because AI is trained on the results of human endeavors and can only work within that framework.
Agreed. AI is just a tool. Letting it run the show is essentially what vibe-coding is. It is a fun activity for prototyping, but tends to accumulate problems and tech debt at an astonishing pace.
Code, manually crafted by professionals, will almost always beat AI-driven code in quality. Yet one still has to find such professionals and wait for them to get the job done.
I think, the right balance is somewhere in between - let tools handle the mundane parts (e.g. mechanically rewriting that legacy Progress ABL/4GL code to Kotlin), while human engineers will have fun with high-level tasks and shaping the direction of the project.
Unless you are doing something crazy like letting the fuzzer run on every change (cache that shit), the full test suite taking a long time suggests that either your isolation points are way too large or you are letting the LLM cross isolated boundaries and "full testing suite" here actually means "multiple full testing suites". The latter is an easy fix: Don't let it. The former is a lot harder to fix, but I suppose ending up there is a strong indicator that you can't trust the human picking the best LLM result in the first place and that maybe this whole thing isn't a good idea for the people in your organization.
The full test suite is probably tens of thousands of tests.
But AI will do a pretty decent job of telling you which tests are most likely to fail on a given PR. Just run those ones, then commit. Cuts your test time from hours down to seconds.
Then run the full test suite only periodically and automatically bisect to find out the cause of any regressions.
Dramatically cuts the compute costs of tests too, which in big codebase can easily become whole-engineers worth of costs.
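A rough sketch of that selection step, assuming pytest and an OpenAI-style client (model name and prompt wording are illustrative):

    import subprocess
    from openai import OpenAI  # assumption: any chat-completions client works

    client = OpenAI()
    diff = subprocess.run(["git", "diff", "origin/main...HEAD"],
                          capture_output=True, text=True).stdout

    resp = client.chat.completions.create(
        model="gpt-4.1",  # illustrative model name
        messages=[{"role": "user", "content":
            "Given this diff, list the test files most likely to fail, "
            "one path per line, nothing else:\n\n" + diff}],
    )
    likely_tests = resp.choices[0].message.content.split()

    # Fast path: run only the suspected tests now. The full suite still runs
    # periodically, with `git bisect` to locate anything this path missed.
    subprocess.run(["pytest", "-q", *likely_tests])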
It's an interesting idea, but reactive, and could cause big delays due to bisecting and testing on those regressions. There's the 'old' saying that the sooner the bug is found the cheaper it is to fix; it seems weird to intentionally push finding side-effect bugs later in the process because of faster CI runs. Maybe AI will get there but it seems too aggressive right now to me. But yeah, put the automation slider where you're comfortable.
It is kind of a human problem too; the full testing suite taking X hours to run is also not fun, but it makes the human problem larger.
Say you're Human A, working on a feature. Running the full testing suite takes 2 hours from start to finish. Every change you do to existing code needs to be confirmed to not break existing stuff with the full testing suite, so for some changes it takes 2 hours before you have 100% understanding that it doesn't break other things. How quickly do you lose interest, and at what point do you give up and either improve the testing suite, or just skip that feature / implement it some other way?
Now say you're Robot A working on the same task. The robot doesn't care if each change takes 2 hours to appear on their screen, the context is exactly the same, and they're still "a helpful assistant" 48 hours later when they still try to get the feature put together without breaking anything.
If you're feeling brave, you start Robot B and C at the same time.
No one really cares about improving test times. Everyone either suffers in private or gets convinced it's all normal and look at you weird when you suggest something needs to be done.
This is the workflow that ChatGPT Codex demonstrates nicely. Launch any number of «robotic» tasks in parallel, then go on your own. Come back later to review the results and pick good ones.
Well, they're demonstrating it somewhat; it's more of a prototype today. The first tell is the low limit: I think the longest task for me has been 15 minutes before it gives up. The second tell is that it still uses a chat UI, which is simple to implement and familiar, but also kind of lazy. There should be a better UX, especially with the new variations they just added. Off the top of my head, some graph-like UX might have been better.
It works really nicely with the following approach (distilled from experiences reported by multiple companies):
(1) Augment codebase with explanatory texts that describe individual modules, interfaces and interactions (something that is needed for the humans anyway)
(2) Provide an Agent.MD that describes the approach/style/process the AI agent must take. It should also describe how to run all tests (a minimal sketch is shown after this list).
(3) Break down the task into smaller features. For each feature - ask first to write a detailed implementation plan (because it is easier to review the plan than 1000 lines of changes spread across a dozen files)
(4) Review the plan and ask to improve it, if needed. When ready - ask to draft an actual pull request
(5) The system will automatically use all available tests/linting/rules before writing the final PR. Verify and provide feedback, if some polish is needed.
(6) Launch multiple instances of "write me an implementation plan" and "Implement this plan" task, to pick the one that looks the best.
This is very similar to git-driven development of large codebases by distributed teams.
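For step (2), a minimal sketch of what such an Agent.MD could contain (the file name, sections and commands are illustrative, not a standard; the Gradle/Kotlin bits assume a JVM codebase):

    # Agent.MD (illustrative)
    ## Style
    - Kotlin, 4-space indent, no wildcard imports; follow the existing module layout.
    ## Process
    - Write an implementation plan first; wait for approval before editing code.
    - Keep changes scoped to one module per PR.
    ## Testing
    - Run `./gradlew test` before proposing a PR; all tests must pass.
    - Lint with `./gradlew ktlintCheck`.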
Yes, and (some near-future) AI is also more patient and better at multitasking than a reasonable human. It can make a change, submit for full fuzzing, and if there's a problem it can continue with the saved context it had when making the change. It can work on 100s of such changes in parallel, while a human trying to do this would mix up the reasons for the change with all the other changes they'd done by the time the fuzzing result came back.
LLMs are worse at many things than human programmers, so you have to try to compensate by leveraging the things they're better at. Don't give up with "they're bad at such and such" until you've tried using their strengths.
You can't run N bots in parallel with testing between each attempt unless you're also running N tests in parallel.
If you could run N tests in parallel, then you could probably also run the components of one test in parallel and keep it from taking 2 hours in the first place.
To me this all sounds like snake oil to convince people to do something they were already doing, but while also spinning up N times as many compute instances and burning endless tokens along the way. And by the time it's demonstrated that it doesn't really offer anything more than doing it yourself, well, you've already given them all of your money, so their job is done.
In one of the systems (supply chain SaaS) we invested so much effort in having good tests in a simulated environment, that we could run full-stack tests at kHz. Roughly ~5k tests per second or so on a laptop.
I work in web dev, so people sometimes run code formatting as a git commit hook, or sometimes even on file save.
The tests are problematic though. If you work on a huge project it's a no-go at all. If you work on a medium one, the tests are long enough to block you, but short enough that you can't focus on anything else in the meantime.
In my experience with Jules and (worse) Codex, juggling multiple pull requests at once is not advised.
Even if you tell the git-aware Jules to handle a merge conflict within the context window in which the patch was generated, it is like: sorry bro, I have no idea what's wrong, can you send me a diff with the conflict?
I find I have to be in the iteration loop at every stage, or else the agent rapidly forgets what it's doing or why. For instance, don't trust Jules to run your full test suite after every change without handholding and asking for specific run results every time.
It feels like to an LLM, gaslighting you with code that nominally addresses the core of what you just asked while completely breaking unrelated code or disregarding previously discussed parameters is an unmitigated success.
The problem is that every time you run your full automation with linting and tests, you’re filling up the context window more and more. I don’t know how people using Claude do it with its <300k context window. I get the “your message will exceed the length of this chat” message so many times.
I don't know exactly how Claude works, but the way I work around this with my own stuff is prompting it to not display full outputs ever, and instead temporary redirect the output somewhere then grep from the log-file what it's looking for. So a test run outputting 10K lines of test output and one failure is easily found without polluting the context with 10K lines.
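Concretely, the workaround amounts to something like this (a sketch, assuming pytest; the grep pattern depends on your test runner):

    import subprocess

    # Run the suite but keep the full output out of the chat context.
    subprocess.run("pytest -q > /tmp/test.log 2>&1", shell=True)

    # Pull back only the lines that matter.
    failures = subprocess.run("grep -E 'FAILED|ERROR' /tmp/test.log",
                              shell=True, capture_output=True, text=True)
    print(failures.stdout or "all tests passed")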
Cursor.sh agents or especially OpenAI Codex illustrate that a tool doesn't need to keep on stuffing context window with irrelevant information in order to make progress on a task.
And if really needed, engineers report that Gemini Pro 2.5 keeps on working fine within 200k-500k token context. Above that - it is better to reset the context.
I think it's interesting to juxtapose traditional coding, neural network weights and prompts because in many areas -- like the example of the self driving module having code being replaced by neural networks tuned to the target dataset representing the domain -- this will be quite useful.
However I think it's important to make it clear that given the hardware constraints of many environments the applicability of what's being called software 2.0 and 3.0 will be severely limited.
So instead of being replacements, these paradigms are more like extra tools in the tool belt. Code and prompts will live side by side, being used when convenient, but none a panacea.
I kind of say it in words (agreeing with you) but I agree the versioning is a bit confusing analogy because it usually additionally implies some kind of improvement. When I’m just trying to distinguish them as very different software categories.
What do you think about structured outputs / JSON mode / constrained decoding / whatever you wish to call it?
To me, it's a criminally underused tool. While "raw" LLMs are cool, they're annoying to use as anything but chatbots, as their output is unpredictable and basically impossible to parse programmatically.
Structured outputs solve that problem neatly. In a way, they're "neural networks without the training". They can be used to solve similar problems as traditional neural networks, things like image classification or extracting information from messy text, but all they require is a Zod or Pydantic type definition and a prompt. No renting GPUs, labeling data and tuning hyperparameters necessary.
They often also improve LLM performance significantly. Imagine you're trying to extract calories per 100g of product, but some products give you calories per serving and a serving size, calories per pound etc. The naive way to do this is a prompt like "give me calories per 100g", but that forces the LLM to do arithmetic, and LLMs are bad at arithmetic. With structured outputs, you just give it the fifteen different formats that you expect to see as alternatives, and use some simple Python to turn them all into calories per 100g on the backend side.
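A sketch of that idea with Pydantic (type and field names are made up for illustration; the resulting JSON schema would be passed to whatever structured-output mechanism your provider exposes):

    from typing import Literal, Union
    from pydantic import BaseModel, Field

    class Per100g(BaseModel):
        kind: Literal["per_100g"]
        kcal_per_100g: float

    class PerServing(BaseModel):
        kind: Literal["per_serving"]
        kcal_per_serving: float
        serving_size_g: float

    class PerPound(BaseModel):
        kind: Literal["per_pound"]
        kcal_per_pound: float

    class CalorieInfo(BaseModel):
        value: Union[Per100g, PerServing, PerPound] = Field(discriminator="kind")

    def kcal_per_100g(info: CalorieInfo) -> float:
        v = info.value
        if isinstance(v, Per100g):
            return v.kcal_per_100g
        if isinstance(v, PerServing):
            return v.kcal_per_serving / v.serving_size_g * 100
        return v.kcal_per_pound / 453.592 * 100  # grams per pound

The LLM only has to pick the variant it actually sees on the label; the arithmetic stays in ordinary code.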
Even more than that. With Structured Outputs we essentially control layout of the response, so we can force LLM to go through different parts of the completion in a predefined order.
One way teams exploit that - force LLM to go through a predefined task-specific checklist before answering. This custom hard-coded chain of thought boosts the accuracy and makes reasoning more auditable.
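With Pydantic-style structured output that is as simple as declaring the checklist fields before the answer field, so the model has to emit them first (field names are illustrative):

    from pydantic import BaseModel

    class ReviewAnswer(BaseModel):
        # Emitted first: a task-specific checklist the model must walk through.
        touches_public_api: bool
        tests_updated: bool
        follows_module_conventions: bool
        # Only then the actual answer.
        summary: str
        verdict: str  # e.g. "approve" or "request_changes"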
Weights are code being replaced by data; something I've been making heavy use of since the early 00s. After coding for 10 years you start to see the benefits of it and understand where you should use it.
LLMs give us another tool only this time it's far more accessible and powerful.
Great talk, thanks for putting it online so quickly. I liked the idea of making the generation / verification loop go brrr, and one way to do this is to make verification not just a human task, but a machine task, where possible.
Yes, I am talking about formal verification, of course!
That also goes nicely together with "keeping the AI on a tight leash". It seems to clash though with "English is the new programming language". So the question is, can you hide the formal stuff under the hood, just like you can hide a calculator tool for arithmetic? Use informal English on the surface, while some of it is interpreted as a formal expression, put to work, and then reflected back in English? I think that is possible, if you have a formal language and logic that is flexible enough, and close enough to informal English.
Yes, I am talking about abstraction logic [1], of course :-)
So the goal would be to have English (German, ...) as the ONLY programming language, invisibly backed underneath by abstraction logic.
> So the question is, can you hide the formal stuff under the hood, just like you can hide a calculator tool for arithmetic? Use informal English on the surface, while some of it is interpreted as a formal expression, put to work, and then reflected back in English?
The problem with trying to make "English -> formal language -> (anything else)" work is that informality is, by definition, not a formal specification and therefore subject to ambiguity. The inverse is not nearly as difficult to support.
Much like how a property in an API initially defined as being optional cannot be made mandatory without potentially breaking clients, whereas making a mandatory property optional can be backward compatible. IOW, the cardinality of "0 .. 1" is a strict superset of "1".
> The problem with trying to make "English -> formal language -> (anything else)" work is that informality is, by definition, not a formal specification and therefore subject to ambiguity. The inverse is not nearly as difficult to support.
Both directions are difficult and important. How do you determine when going from formal to informal that you got the right informal statement? If you can judge that, then you can also judge if a formal statement properly represents an informal one, or if there is a problem somewhere. If you detect a discrepancy, tell the user that their English is ambiguous and that they should be more specific.
I think type theory is exactly right for this! Being so similar to programming languages, it can piggy back on the huge amount of training the LLMs have on source code.
I am not sure Lean in particular is the right language; there might be challengers rising (or old incumbents like Agda or Rocq could find a boost). But type theory definitely has the most robust formal systems at the moment.
Why? By the completeness theorem, shouldn't first order logic already be sufficient?
The calculus of constructions and other approaches are already available and proven. I'm not sure why we'd need a special logic for LLMs unless said logic somehow accounts for their inherently stochastic tendencies.
If first-order logic is already sufficient, why are most mature systems using a type theory? Because type theory is more ergonomic and practical than first-order logic. I just don't think that type theory is ergonomic and practical enough. That is not a special judgement with respect to LLMs, I want a better logic for myself as well. This has nothing to do with "stochastic tendencies". If it is easier to use for humans, it will be easier for LLMs as well.
when I started coding at the age of 11 in machine code and assembly on the C64, the dream was to create software that creates software.
Nowadays it's almost reality, almost because the devil is always in the details.
When you're used to writing code, writing code is relatively fast.
You need this knowledge to debug issues with generated code.
However you're now telling AI to fix the bugs in the generated code.
I see it kind of like machine code becomes overlaid with asm which becomes overlaid with C or whatever higher level language, which then uses dogma/methodology like MVC and such and on top of that there's now the AI input and generation layer.
But it's not widely available.
Affording more than 1 computer is a luxury. Many households are even struggling to get by.
When you see those what, 5 or 7 Mac Minis: which normal average Joe can afford that, or even has the knowledge to set up an LLM at home?
I don't.
This is a toy for rich people.
Just like with public clouds such as AWS and GCP, which I left out because the cost is too high; running my own is also too expensive, and there are cheaper alternatives that not only cost less but also have way less overhead.
What would be interesting to see is what those kids produced with their vibe coding.
No one, including Karpathy in this video, is advocating for "vibe coding". If nothing else, LLMs paired with configurable tool usage are basically a highly advanced and contextual search engine you can ask questions. Are you not using a search engine today?
Even without LLMs being able to produce code or act as agents they'd be useful, because of that.
But it sucks we cannot run competitive models locally, I agree, it is somewhat of a "rich people" tool today. Going by the talk and theme, I'd agree it's a phase, like computing itself had phases. But you're gonna have to actually watch and listen to the talk itself, right now you're basically agreeing with the video yet wrote your comment like you disagree.
This is most definitely not toys for rich people. Now perhaps depending on your country it may be considered rich but I would comfortably say that for most of the developed world, the costs for these tools are absolutely attainable, there is a reason ChatGPT has such a large subscriber base.
Also, the disconnect for me here is that when I think back on the cost of electronics, prices for the level of compute have generally gone down significantly over time. The C64 launched around the $500-600 price level, not adjusted for inflation. You can go and buy a Mac mini for that price today.
>You should consider how much it actually costs, not how much they charge.
How do people fail to consider this?
Sure, nobody can predict the long-term economics with certainty but companies like OpenAI already have compelling business fundamentals today. This isn’t some scooter startup praying for margins to appear; it’s a platform with real, scaled revenue and enterprise traction.
But yeah, tell me more about how my $200/mo plan is bankrupting them.
It's cheap now. But if you take into account all the training costs, then at such prices they cannot make a profit in any way. This is called dumping to capture the market.
No doubt the complete cost of training and of getting where we are today has been significant, and I don’t know how the accounting will look years from now, but you are just making up the rest based on feelings. We know operationally OpenAI is profitable on purely the runtime side; nobody knows how that will look when accounting for R&D, but you have no qualification to say they cannot make a profit in any way.
Yes, if you do not take into account the cost of training, I think it is very likely profitable. The cost of working models is not so high. This is just my opinion based on open models and I admit that I have not carried out accurate calculations.
Software 3.0 is the code generated by the machine, not the prompts that generated it. The prompts don't even yield the same output; there is randomness.
The new software world is the massive amount of code that will be burped out by these agents, and it should quickly dwarf the human output.
How I understood it is that natural language will form relatively large portions of stacks (endpoint descriptions, instructions, prompts, documentation, etc…). In addition to code generated by agents (which would fall under 1.0)
I think that if you give the same task to three different developers you'll get three different implementations. It's not a random result if you do get the functionality that was expected, and at that, I do think the prompt plays an important role in offering a view of how the result was achieved.
> I think that if you give the same task to three different developers you'll get three different implementations.
Yes, but if you want them to be compatible you need to define a protocol and conformance test suite. This is way more work than writing a single implementation.
The code is the real spec. Every piece of unintentional non-determinism can be a hazard. That’s why you want the code to be the unit of maintenance, not a prompt.
The comparison of our current methods of interacting with LLMs (back and forth text) to old-school terminals is pretty interesting.
I think there's still a lot work to be done to optimize how we interact with these models, especially for non-dev consumers.
On one hand, I'm incredibly impressed by the technology behind that demo. On the other hand, I can't think of many things that would piss me off more than a non-deterministic operating system.
I like my tools to be predictable. Google search trying to predict that I want the image or shopping tag based on my query already drives me crazy. If my entire operating system did that, I'm pretty sure I'd throw my computer out a window.
Ah yes, my operating system, most definitely a place I want to stick the Hallucinotron-3000 so that every click I make yields a completely different UI that has absolutely 0 bearing to reality. We're truly entering the "Software 3.0" days (can't wait for the imbeciles shoving AI everywhere to start overusing that dogshit, made-up marketing term incessantly)
We'll need to boil a few more lakes before we get to that stage I'm afraid, who needs water when you can have your AI hallucinate some for you after all?
Is me not wanting the UI of my OS to shift with every mouse click a hot take? If me wanting to have the consistent "When I click here, X happens" behavior instead of the "I click here and I'm Feeling Lucky happens" behavior is equal to me being dense, so be it I guess.
It's a brand of terribleness I've somewhat gotten used to, opening Google Drive every time, when it takes me to the "Suggested" tab. I can't recall a single time when it had the document I care about anywhere close to the top.
There's still nothing that beats the UX of Norton Commander.
A mixed ever-shifting UI can be excellent though. So you've got some tools which consistently interact with UI components, but the UI itself is altered frequently.
Take for example world-building video games like Cities Skylines / Sim City or procedural sandboxes like Minecraft. There are 20-30 consistent buttons (tools) in the game's UX, while the rest of the game is an unbounded ever-shifting UI.
The rest of the game is very deterministic where its state is controlled by the buttons. The slight variation is caused by the simulation engine and follows consistent patterns (you can’t have building on fire if there’s no building yet).
This talk https://www.youtube.com/watch?v=MbWgRuM-7X8 explores the idea of generative / malleable personal user interfaces where LLMs can serve as the gateway to program how we want our UI to be rendered.
Border-line off-topic, but since you're flagrantly self-promoting, might as well add some more rule breakage to it.
You know websites/apps who let you enter text/details and then not displaying sign in/up screen until you submit it, so you feel like "Oh but I already filled it out, might as well sign up"?
They really suck, big time! It's disingenuous, misleading and wastes people's time. I had no interest in using your thing for real, but thought I'd try it out, potentially leave some feedback, but this bait-and-switch just made the whole thing feel sour and I'll probably try to actively avoid this and anything else I feel is related to it.
Humans are shit at interacting with systems in a non-linear way. Just look at Jupyter notebooks and the absolute mess that arises when you execute code blocks in arbitrary order.
loved the analogies! Karpathy is consistently one of the clearest thinkers out there.
interesting that Waymo could do uninterrupted trips back in 2013, wonder what took them so long to expand? regulation? tail end of driving optimization issues?
noticed one of the slides had a cross over 'AGI 2027'... ai-2027.com :)
Very advanced machine learning models are used in current self driving cars. It all depends what the model is trying to accomplish. I have a hard time seeing a generalist prompt-based generative model ever beating a model specifically designed to drive cars. The models are just designed for different, specific purposes
I could see it being the case that driving is a fairly general problem, and thus models intentionally designed to be general end up doing better than models designed with the misconception that you need a very particular set of driving-specific capabilities.
Speed and Moore's law. You don't need to just make a decision without hallucinations, you need to do it fast enough for it to propagate to the power electronics and hit the gas/brake/turn the wheel/whatever. Over and over and over again on thousands of different tests.
A big problem I am noticing is that the IT culture over the last 70 years has existed in a state of "hardware gonna get faster soon". And over the last ten years we had a "hardware cant get faster bc physics sorry" problem.
The way we've been making software in the 90s and 00s just isn't gonna be happening anymore. We are used to throwing more abstraction layers (C->C++->Java->vibe coding etc) at the problem and waiting for the guys in the fab to hurry up and get their hardware faster so our new abstraction layers can work.
Well, you can fire the guys in the fab all you want but no matter how much they try to yell at the nature it doesn't seem to care. They told us the embedded c++-monkeys to spread the message. Sorry, the moore's law is over, boys and girls. I think we all need to take a second to take that in and realize the significance of that.
[1] The "guys in the fab" are a fictional character and any similarity to the real world is a coincidence.
[2] No c++-monkeys were harmed in the process of making this comment.
Driving is not a general problem, though. It's a contextual landscape of fast reactions and predictions. Both are required, and done regularly by the human element. The exact nature of every reaction, and every prediction, changes vastly within the context window.
You need image processing just as much as you need scenario management, and they're orthogonal to each other, as one example.
If you want a general transport system... We do have that. It's called rail. (And can and has been automated.)
It partially is. You have the specialized part of maneuvering a fast moving vehicle in physical world, trying to keep it under control at all times and never colliding with anything. Then you have the general part, which is navigating the human environment. That's lanes and traffic signs and road works and schoolbuses, that's kids on the road and badly parked trailers.
Current breed of autonomous driving systems have problems with exceptional situations - but based on all I've read about so far, those are exactly of the kind that would benefit from a general system able to understand the situation it's in.
We have multiple parts of the brain that interact in vastly different ways! Your cerebellum won't be running the role of the pons.
Most parts of the brain cannot take over for others. Self-healing is the exception, not the rule. Yes, we have a degree of neuroplasticity, but there are many limits.
Only if that was a singular system, however, it is not. [0]
For example... The nerve cells in your gut may speak to the brain, and interact with it in complex ways we are only just beginning to understand, but they are separate systems that both have control over the nervous system, and other systems. [1]
General Intelligence, the psychological theory, and General Modelling, whilst sharing words, share little else.
exactly! I think that was tesla's vision with self-driving to begin with... so they tried to frame it as problem general enough, that trying to solve it would also solve questions of more general intelligence ('agi') i.e. cars should use vision just like humans would
but in hindsight looks like this slowed them down quite a bit despite being early to the space...
This is (in part) what "world models" are about. While some companies like Tesla bring together a fleet of small specialised models, others like CommaAI and Wayve train generalist models.
One of the issues with deploying models like that is the lack of clear, widely accepted ways to validate comprehensive safety and absence of unreasonable risk. If that can be solved, or regulators start accepting answers like "our software doesn't speed in over 95% of situations", then they'll become more common.
> Karpathy is consistently one of the clearest thinkers out there.
Eh, he ran Tesla's self-driving division and put them in a direction that is never going to fully work.
What they should have done is a) trained a neural net to turn a sequence of frames into a representation of the physical environment, and b) leveraged MuZero, so that the self-driving system basically builds out parallel simulations into the future and does a search for the best course of action to take.
Because thats pretty much what makes humans great drivers. We don't need to know what a cone is - we internally compute that something that is an object on the road that we are driving towards is going to result in a negative outcome when we collide with it.
Aren't continuous, stochastic, partial knowledge environments where you need long horizon planning with strict deadlines and limited compute exactly the sort of environments muzero variants struggle with? Because that's driving.
It's also worth mentioning that humans intentionally (and safely) drive into "solid" objects all the time. Bags, steam, shadows, small animals, etc. We also break rules (e.g. drive on the wrong side of the road), and anticipate things we can't even see based on a theory of mind of other agents. Human driving is extremely sophisticated, not reducible to rules that are easily expressed in "simple" language.
The counter argument is that you can't zoom in and fix a specific bug in this mode of operation. Everything is mashed together in the same neural net process. They needed to ensure safety, so testing was crucial. It is harder to test an end-to-end system than its individual parts.
But if they'd gone for radars and lidars and a bunch of sensors and then enough processing hardware to actually fuse that, then I think they could have built something that had a chance of working.
The programming analogy is convenient but off. The joke has always been “the computer only does exactly what you tell it to do!” regarding logic bugs. Prompts and LLMs most certainly do not work like that.
I loved the parallels with modern LLMs and time sharing he presented though.
> Prompts and LLMs most certainly do not work like that.
It quite literally works like that. The computer is now OS + user-land + LLM runner + ML architecture + weights + system prompt + user prompt.
Taken together, and since you're adding in probabilities (by using ML/LLMs), you're quite literally getting "the computer only does exactly what you tell it to do!", it's just that we have added "but make slight variations to what tokens you select next" (temperature>0.0) sometimes, but it's still the same thing.
Just like when you tell the computer to create encrypted content by using some seed. You're getting exactly what you asked for.
For what it's worth, I've been using it to help me learn math, and I added to my rules an instruction that it should always give me an example in Python (preferably sympy) whenever possible.
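For example, a rule like that tends to produce small self-checking snippets of this shape (plain sympy, nothing exotic):

    from sympy import symbols, diff, integrate, sin

    x = symbols("x")
    f = x**2 * sin(x)

    print(diff(f, x))       # derivative: x**2*cos(x) + 2*x*sin(x)
    print(integrate(f, x))  # an antiderivative, checkable by differentiating it back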
Brutal counter take: If AI tooling makes you so much better, then you started very low. In contrast, if you are already insanely productive in creative ways others can hardly achieve then chances are, AI tools don't make much of a difference.
As someone who is starting very low — I very much agree. I'm basically a hobbyist who can navigate around Python code, and LLMs have been a godsend to me, they increased my hobby output tenfold. But as soon as I get into coding something I'm more familiar with, the LLMs usefulness plummets, because it's easier and faster to directly write code than to "translate" from English to code using an LLM (maybe only apart from using basically a smarter one-line tab completion)
Meanwhile, I asked this morning Claude 4 to write a simple EXIF normalizer. After two rounds of prompting it to double-check its code, I still had to point out that it makes no sense to load the entire image for re-orientating if the EXIF orientation is fine in the first place.
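For reference, the early-exit check being asked for is roughly this, a sketch assuming Pillow (Image.open reads only the header lazily, so no pixel data gets decoded unless a rotation is actually needed):

    from PIL import Image, ImageOps

    ORIENTATION_TAG = 0x0112  # standard EXIF orientation tag

    def normalize_exif_orientation(path: str) -> None:
        img = Image.open(path)  # lazy: header/EXIF only at this point
        orientation = img.getexif().get(ORIENTATION_TAG, 1)
        if orientation == 1:
            return              # already upright: nothing to decode or rewrite
        ImageOps.exif_transpose(img).save(path)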
Vibe vs reality, and anyone actually working in the space daily can attest how brittle these systems are.
Maybe this changes in SWE with more automated tests in verifiable simulators, but the real world is far too complex to simulate in its vastness.
What do you mean "meanwhile", that's exactly (among other things) the kind of stuff he's talking about? The various frictions and how you need to approach it
> anyone actually working in the space
Is this trying to say that Karpathy doesn't "actually work" with LLMs or in the ML space?
I feel like your whole comment is just reacting to the title of the YouTube video, rather than actually thinking and reflecting on the content itself.
We had exactly the opposite experience. CoPilot was able to answer questions accurately and reformatted the existing documentation to fit the context of users' questions, which made the information much easier to understand.
Code examples, which we offer as sort of reference implementations, were also adopted to fit the specific questions without much issues. Granted these aren't whole applications, but 10 - 25 line examples of doing API setup / calls.
We didn't, of course, just send users' questions directly to CoPilot. Instead there's a bit of prompt magic behind the scenes that tweaks the context so that CoPilot can produce better quality results.
They're reliable already if you change the way you approach them. These probabilistic token generators probably never will be "reliable" if you expect them to 100% always output exactly what you had in mind, without iterating in user-space (the prompts).
Which model? Just tried a bunch: ChatGPT, OpenAI's API, Claude, Anthropic's API and DeepSeek's API, with both chat and reasoner; every single one replied with a single "hi".
That's literally the wrong way to use LLMs though.
LLMs think in tokens, the less they emit the dumber they are, so asking them to be concise, or to give the answer before explanation, is extremely counterproductive.
This is relevant. Your example may be simple enough, but for anything more complex, letting the model have its space to think/compute is critical to reliability - if you starve it for compute, you'll get more errors/hallucinations.
Yeah I mean I agree with you, but I'm still not sure how it's relevant. I'd also urge people to have unit tests they treat as production code, and proper system prompts, and X and Y, but it's really beyond the original point of "LLMs aren't reliable" which is the context in this sub-tree.
I remember when people were saying here on HN that AIs will never be able to generate picture of hands with just 5 fingers because they just "don't have common sense"
print("This model that just came out changes everything. It's flawless. It doesn't have any of the issues the model from 6 months ago had. We are 1 year away from AGI and becoming jobless")
sleep(timedelta(days=180).total_seconds)
On the other hand, posts like this are like watching someone writing ask jeeves search queries into google 20 years ago and then gesturing how google sucks while everyone else in the room has figured out how to be productive with it and cringes at his "boomer" queries.
If you're still struggling to make LLMs useful for you by now, you should probably ask someone. Don't let other noobs on HN +1'ing you hold you back.
There's also those instances where Microsoft unleashed Copilot on the .NET repo, and it resulted in the most hilariously terrible PRs that required the maintainers to basically tell Copilot every single step it should take to fix the issue. They were basically writing the PRs themselves at that point, except doing it through an intermediary that was much dumber, slower and less practical than them.
And don't get me started on my own experiences with these things, and no, I'm not a luddite, I've tried my damndest and have followed all the cutting-edge advice you see posted on HN and elsewhere.
Time and time again, the reality of these tools falls flat on their face while people like Andrej hype things up as if we're 5 minutes away from having Claude become Skynet or whatever, or as he puts it, before we enter the world of "Software 3.0" (coincidentally totally unrelated to Web 3.0 and the grift we had to endure there, I'm sure).
To intercept the common arguments,
- no I'm not saying LLMs are useless or have no usecases
- yes there's a possibility if you extrapolate by current trends (https://xkcd.com/605/) that they indeed will be Skynet
- yes I've tried the latest and greatest model released 7 minutes ago to the best of my ability
- yes I've tried giving it prompts so detailed a literal infant could follow along and accomplish the task
- yes I've fiddled with providing it more/less context
- yes I've tried keeping it to a single chat rather than multiple chats, as well as vice versa
- yes I've tried Claude Code, Gemini Pro 2.5 With Deep Research, Roocode, Cursor, Junie, etc.
- yes I've tried having 50 different "agents" running and only choosing the best output from the lot.
I'm sure there's a new gotcha being written up as we speak, probably something along the lines of "Well for me it doubled my productivity!" and that's great, I'm genuinely happy for you if that's the case, but for me and my team who have been trying diligently to use these tools for anything that wasn't a microscopic toy project, it has fallen apart time and time again.
The idea of an application UI or god forbid an entire fucking Operating System being run via these bullshit generators is just laughable to me, it's like I'm living on a different planet.
To add to this, I ran into a lot of issues too.
And similar when using cursor... Until I started creating a mega list of rules for it to follow that attaches to the prompts. Then outputs improved (but fell off after the context window got too large). At that stage I then used a prompt to summarize, to continue with a new context.
You're not the first, nor the last person, to have a seemingly vastly different experience than me and others.
So I'm curious, what am I doing differently from what you did/do when you try them out?
This is maybe a bit out there, but would you be up for sending me like a screen recording of exactly what you're doing? Or maybe even a video call sharing your screen? I'm not working in the space, have no products or services to sell, only curious is why this gap seemingly exists between you and me, and my only motive would be to understand if I'm the one who is missing something, or there are more effective ways to help people understand how they can use LLMs and what they can use them for.
My email is on my profile if you're up for it. Invitation open for others in the same boat as parent too.
I'm a greybeard, 45+ years coding, including active in AI during the mid 80's and used it when it applied throughout my entire career. That career being media and animation production backends, where the work is both at the technical and creative edge.
I currently have an AI integrated office suite, which has attorneys, professional writers, and political activists using the system. It is office software, word processing, spreadsheets, project management and about two dozen types of AI agents that act as virtual co-workers.
No, my users are not programmers, but I do have interns; college students with anything from 3 to 10 years experience writing software.
I see the same AI use problems with my users, and my interns. My office system bends over backwards to address this, but people are people: they do not realize that AI does not know what they are talking about. They will frequently ask questions with no preamble, no introduction to the subject. They will change topics, not bothering to start a new session or tell the AI the topic is now different. There is a huge number of things they do, often with escalating frustration evident in their prompts, that all come down to the same basic issue: the LLM was not given the context to understand the subject at hand, and the user, like many people, keeps explaining past the point of confusion, adding new confusion.
I see this over and over. It frustrates the users to anger, yet at the same time if they acted, communicated to a human, in the same manner they'd have a verbal fight almost instantly.
The problem is one of communications. ...and for a huge number of you I just lost you. You've not been taught to understand the power of communications, so you do not respect the subject. Knowing how to communicate is practically everything when it comes to human collaboration. It is how one orders their mind, how one collaborates with others, AND how one gets AI to respond in the manner they desire.
But our current software development industry, and by extension all of STEM, has been shortchanged by never being taught how to effectively communicate. No, not at all. Presentations and how to sell are not effective communication; that's persuasion, about 5% of what it takes to convey understanding in others, which then unblocks resistance to change.
So AI is simultaneously going to take over everyone's job and do literally everything, including being used as application UI somehow... But you have to talk to it like a moody teenager at their first job lest you get nothing but garbage? I have to put just as much (and usually, more) effort talking to this non-deterministic black box as I would to an intern who joined a week ago to get anything usable out of it?
Yeah, I'd rather just type things out myself, and continue communicating with my fellow humans rather than expending my limited time on this earth appeasing a bullshit generator that's apparently going to make us all jobless Soon™
I'd like to see the prompt. I suspect that "literal infant" is expected to be a software developer without preamble. The initial sentence to an LLM carries far more relevance, it sets the context stage to understand what follows. If there is no introduction to the subject at hand, the response will be just like anyone fed a wall of words: confusion as to what all this is about.
I've got a working theory that models perform differently when used in different timezones... As in, during US working hours they don't work as well due to high load.
When used at 'offpeak' hours not only are they (obviously) snappier but the outputs appear to be a higher standard. Thought this for a while but now noticing with Claude4 [thinking] recently. Textbook case of anecdata of course though.
Interesting thought, if nothing else. Unless I misunderstand, it would be easy to run a study to see if this is true; use the API to send the same but slightly different prompt (as to avoid the caches) which has a definite answer, then run that once per hour for a week and see if the accuracy oscillates or not.
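A minimal version of such a probe, assuming the OpenAI Python client; the model, the question and the schedule are all illustrative:

    import time
    from datetime import datetime, timezone
    from openai import OpenAI  # assumption: any chat-completions client works

    client = OpenAI()
    QUESTION = "What is 17 * 23? Reply with the number only."  # known answer: 391

    for _ in range(24 * 7):  # once per hour for a week
        nonce = datetime.now(timezone.utc).isoformat()  # vary the prompt to dodge caching
        resp = client.chat.completions.create(
            model="gpt-4.1",  # illustrative model name
            messages=[{"role": "user", "content": f"[{nonce}] {QUESTION}"}],
        )
        answer = resp.choices[0].message.content.strip()
        print(nonce, "correct" if "391" in answer else f"wrong: {answer}")
        time.sleep(3600)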
Yes good idea - although it appears we would also have to account for the possibility of providers nerfing their models.
I've read others also think models are being quantized after a while to cut costs.
Unironically, your comment mirrors my opinion as of last month.
Since then I've given it another try last week and was quite literally mind blown how much it improved in the context of vibe coding (Claude Code). It actually improved so much that I thought "I would like to try that on my production codebase" (mostly because I want it to fail, because that's my job ffs) but alas - that's not allowed at my dayjob.
From the limited experience I could gather over the last week as a software dev with over 10 yrs of experience (along with another 5-10 doing it as a hobby before employment) I can say that I expect our industry to get absolutely destroyed within the next 5 yrs.
The skill ceiling for devs is going to get mostly squashed for 90% of devs, this will inevitably destroy our collective bargaining positions. Including for the last 10%, because the competition around these positions will be even more fierce.
It's already starting, even if it's currently very misguided and mostly down to short-sightedness.
But considering the trajectory and looking at how naive current llms coding tools are... Once the industry adjusts and better tooling is pioneered... it's gonna get brutal.
And most certainly not limited to software engineering. Pretty much all desk jobs will get hemorrhaged as soon as an LLM player basically replaces SAP with entirely new tooling.
Frankly, I expect this to go bad, very very quickly. But I'm still hoping for a good ending.
I think part of the problem is that code quality is somewhat subjective and developers are of different skill levels.
If you're fine with things that kinda working okay and you're not the best developer yourself then you probably think coding agents work really really well because the slop they produce isn't that much worse than yourself. In fact I know a mid-level dev who believes agent AIs write better code than himself.
If you're very critical of code quality then it's much tougher... This is even more true in complex codebases where simply following some existing pattern to add a new feature isn't going to cut it.
The degree to which it helps any individual developer will vary, and perhaps it's not that useful for yourself. For me over the last few months the tech has got to the point where I use it and trust it to write a fair percentage of my code. Unit tests are an example where I find it does a really good job.
> If you're very critical of code quality then it's much tougher
I'm not sure, I'm hearing developers I know are sloppy and produce shit code both having no luck with LLMs, and some of them having lots of luck with them.
On the other side, those who really think about the design/architecture and are very strict (which is the group I'd probably put myself into, but who wouldn't?) are split in a similar way.
I don't have any concrete proof, but I'm guessing "expectations + workflow" differences would explain the vast difference in perception of usefulness.
A few days ago, I was introduced to the idea that when you're vibe coding, you're consulting a "genie", much like in the fables, you almost never get what you asked for, but if your wishes are small, you might just get what you want.
ThePrimeagen reviewed this article[1] a few days ago, and (I think) that's where I heard about it. (Can't re-watch it now, it's members-only.) 8(
This was my favorite talk at AISUS because it was so full of concrete insights I hadn't heard before and (even better) practical points about what to build now, in the immediate future. (To mention just one example: the "autonomy slider".)
If it were up to me, which it very much is not, I would try to optimize the next AISUS for more of this. I felt like I was getting smarter as the talk went on.
> suggests a lack of understanding of these smaller models capabilities
If anything, you're showing a lack of understanding of what he was talking about. The context is this specific time, where we're early in an ecosystem and things are expensive and likely centralized (a la mainframes), but if his analogy/prediction is correct, we'll have a "Linux" moment in the future where that equation changes (again) and local models are competitive.
And while I'm a huge fan of local models and run them for maybe 60-70% of what I do with LLMs, they're nowhere near proprietary ones today, sadly. I want them to be, really badly, but it's important to be realistic here and recognize the difference between what a normal consumer can run and what the current mainframes can run.
He understands the technical part, of course; I was referring to his prediction that large models will always be necessary.
There is a point where an LLM is good enough for most tasks (I don't need a megamind AI in order to greet clients), and both large and small/medium models are getting there, with the large models hitting a computing/energy-demand barrier. The small models won't hit that barrier anytime soon.
Did he predict they'd always be necessary? He mostly seemed to predict the opposite, that we're at the early stage of a trajectory that has yet to have its Linux moment.
You can disagree with his conclusions but I don't think his understanding of small models is up for debate. This is the person who created micrograd/makemore/nanoGPT and who has produced a ton of educational materials showing how to build small and local models.
As far as I understood the talk and the analogies, he's saying that local models will eventually replace the current popular "mainframe" architecture. How is that underestimating them?
Sure, but maybe suggesting that the person who literally spent countless hours educating others on how to build small models locally from scratch, is lacking knowledge about local small models is going a bit beyond "people have blind spots".
I was running Qwen3-32B locally even faster, at 70 T/s, and it was still way too slow for me. I'm generating thousands of tokens of output per request (not coding); running locally I could get 6 million tokens per day and pay for the electricity, or I can get more tokens per day from Google Gemini 2.5 Flash for free.
Running models locally is a privilege for the rich and those with too much disposable time.
There were some cool ideas - I particularly liked "psychology of AI".
Overall though I really feel like he is selling the idea that we are going to have to pay large corporations to be able to write code. Which is... terrifying.
Also, as a lazy developer who is always trying to make AI do my job for me, it still kind of sucks, and it's not clear that it will make my life easier any time soon.
He says that now we are in the mainframe phase. We will hit the personal computing phase hopefully soon. He says llama (and DeepSeek?) are like Linux in a way, OpenAI and Claude are like Windows and MacOS.
So, No, he’s actually saying it may be everywhere for cheap soon.
I find the talk to be refreshingly intellectually honest and unbiased. Like the opposite of a cringey LinkedIn post on AI.
Being Linux is not a good thing imo. It took decades for tech like Proton to run Windows games reliably, if not better (as is now the case) than Windows does. Software is still mostly developed for Windows and macOS. Not to mention the Linux desktop never took off; one could mention Android, but there is a large corporation behind it. Sure, Linux is successful in many ways, it's embedded everywhere, but it is nowhere near being the OS of everyday people; the "traditional Linux desktop" never took off.
I think it used to be like that before the GNU people made gcc, completely destroying the market of compilers.
> Also, as a lazy developer who is always trying to make AI do my job for me, it still kind of sucks, and its not clear that it will make my life easier any time soon.
Every time I have to write a simple, self-contained couple of functions I try… and it gets it completely wrong.
It's easier to just write it myself rather than to iterate 50 times and hope it will work, considering iterations are also very slow.
At least proprietary compilers were software you owned and could be airgapped from any network. You didn't create software by tediously negotiating with compilers running on remote machines controlled by a tech corp that can undercut you on whatever you are trying to build (but of course they will not, it says so in the Agreement, and other tales of the fantastic).
On a tangent, I find the analogies interesting as well. However, while Karpathy is an expert in Computer Science, NLP and machine vision, his understanding of how human psychology and the brain work is about as good as yours and mine (non-experts). So I take some of those comparisons as a layperson's feelings about the subject. Still, they are fun to listen to.
Painful to watch. The new tech generation deserves better than hyped presentations from tech evangelists.
This reminds me of the Three Amigos and Grady Booch evangelizing the future of software while ignoring the terrible output from Rational Software and the Unified Process.
And Waymo still requires extensive human intervention. Given Tesla's robotaxi timeline, this should crash their stock valuation...but likely won't.
You can't discuss "vibe coding" without addressing security implications of the produced artifacts, or the fact that you're building on potentially stolen code, books, and copyrighted training data.
And what exactly is Software 3.0? It was mentioned early then lost in discussions about making content "easier for agents."
I find Karpathy's focus on tightening the feedback loop between LLMs and humans interesting, because I've found I am the happiest when I extend the loop instead.
When I have tried to "pair program" with an LLM, I have found it incredibly tedious, and not that useful. The insights it gives me are not that great if I'm optimising for response speed, and it just frustrates me rather than letting me go faster. Worse, often my brain just turns off while waiting for the LLM to respond.
OTOH, when I work in a more async fashion, it feels freeing to just pass a problem to the AI. Then, I can stop thinking about it and work on something else. Later, I can come back to find the AI results, and I can proceed to adjust the prompt and re-generate, to slightly modify what the LLM produced, or sometimes to just accept its changes verbatim. I really like this process.
I would venture that 'tightening the feedback loop' isn't necessarily 'increasing the number of back and forth prompts'- and what you're saying you want is ultimately his argument. i.e. if integral enough it can almost guess what you're going to say next...
I specifically do not want AI as an auto-correct, doing auto-predictions while I am typing. I find this interrupts my thinking process, and I've never been bottlenecked by typing speed anyway.
I want AI as a "co-worker" providing an alternative perspective or implementing my specific instructions, and potentially filling in gaps I didn't think about in my prompt.
Yeah, I am currently enjoying giving the LLM relatively small chunks of code to write and then asking it to write accompanying tests, while I focus on testing the product myself. I then don't even bother to read the code it's written, most of the time.
It's fascinating to see his gears grinding at 22:55 when acknowledging that a human still has to review the thousand lines of LLM-generated code for bugs and security issues if they're "actually trying to get work done". Yet these are the tools that are supposed to make us hyperproductive? This is "Software 3.0"? Give me a break.
Plus coding is the fun bit, reviewing code is the hard and not fun bit, and arguing with an overconfident machine sounds like it'll be even worse than that. Thankfully I'm going to retire soon.
The slide at 13m claims that LLMs flip the script on technology diffusion and give power to the people. Nothing could be further from the truth.
Large corporations, which have become governments in all but name, are the only ones with the capability to create ML models of any real value. They're the only ones with access to vast amounts of information and resources to train the models. They introduce biases into the models, whether deliberately or not, that reinforce their own agenda. This means that the models will either avoid or promote certain topics. It doesn't take a genius to imagine what will happen when the advertising industry inevitably extends its reach into AI companies, if it hasn't already.
Even open weights models which technically users can self-host are opaque blobs of data that only large companies can create, and have the same biases. Even most truly open source models are useless since no individual has access to the same large datasets that corporations use for training.
So, no, LLMs are the same as any other technology, and actually make governments and corporations even more powerful than anything that came before. The users benefit tangentially, if at all, but will mostly be exploited as usual. Though it's unsurprising that someone deeply embedded in the AI industry would claim otherwise.
Well there are cases like OLMo where the process, dataset, and model are all open source. As expected though, it doesn't really compare well to the worst closed model since the dataset can't contain vast amounts of stolen copyrighted data that noticeably improves the model. Llama is not good because Meta knows what they're doing, it's good because it was pretrained on the entirety of Anna's Archive and every pirated ebook they could get their hands on. Same goes for Elevenlabs and pirated audiobooks.
Lack of compute on Ai2's side also means the context OLMo is trained for is minuscule, context being the other thing you need to throw bazillions of dollars at to make a model that's maybe useful in the end, if you're very lucky. Training needs high GPU interconnect bandwidth; it can't be done in a distributed horde in any meaningful way even if people wanted to.
The only ones who have the power now are the Chinese, since they can easily ignore copyright for datasets, patents for compute, and have infinite state funding.
If anything, I wish the conversation turned away from "vibe-coding", which was essentially coined as a "lol look at this go" thing but which media and corporations somehow picked up as "this is the new workflow all developers are adopting".
LLMs as another tool in your toolbox? Sure, use it where it makes sense, don't try to make them do 100% of everything.
LLMs as an "English to E2E product I'm charging for"? Let's maybe make sure the thing works well as a tool before letting it be responsible for stuff.
I'd like to hear from Linux kernel developers. There is no significant software that has been written (plagiarized) by "AI". Why not ask the actual experts who deliver instead of talk?
I think you should not turn things around here. Up to 2021 we had a vibrant software environment that obviously had zero "AI" input. It has made companies and some developers filthy rich.
Since "AI" became a religion, it is used as an excuse for layoffs while no serious software is written by "AI". The "AI" people are making the claims. Since they invading a functioning software environment, it is their responsibility to back up their claims.
Still wonder what your definition of "serious software" is. I kinda concur - I consider most of the webshit to be not serious, but then, this is where the software industry makes the bulk of its profits, and that space is absolutely being eaten by agentic coding, right now, today.
So if we s/serious/money-making/, you are wrong - or at least about to be proven wrong, as these things enter prod and get talked about.
Microsoft (famously developing somewhat popular office-like software) seems to be going in the direction of almost forcing developers to use LLMs to assist with coding, at least going by what people are willing to admit publicly and seeing some GitHub activity.
Google (made a small browser or something) also develops their own models, I don't think it's far fetched to imagine there is at least one developer on the Chrome/Chromium team that is trying to dogfood that stuff.
As for Autodesk, I have no idea what they're up to, but corporate IT seems hellbent on killing itself; I'm not sure Autodesk would do anything differently, so they're probably also trying to jam LLMs down their employees' throats.
Microsoft is also selling "AI", so they want headlines like "30% of our code is written by AI". So they force open source developers to babysit the tools and suffer.
It's also an advertisement for potential "AI" military applications that they will undoubtedly propose after the HoloLens failure.
> Can you point to any significant open source software that has any kind of significant AI contributions?
No, but I haven't looked. Can you?
As an actual open source developer too, I do get some value from replacing search engine usage with LLMs that can do the searching and collation for me, as long as they have references I can use for diving deeper, they certainly accelerate my own workflow. But I don't do "vibe-coding" or use any LLM-connected editors, just my own written software that is mostly various CLIs and chat-like UIs.
I was trying to do some reverse engineering with Claude using an MCP server I wrote for a game trainer program that supports Python scripts. The context window gets filled up _so_ fast. I think my server is returning too many addresses (hex) when Claude searches for values in memory, but it’s annoying. These things are so flaky.
The beginning was painful to watch as is the cheering in this comment section.
The 1.0, 2.0, and 3.0 labels simply don't make sense. They imply a kind of succession and replacement, and demonstrate a lack of understanding of how programming works. It sounds as marketing-oriented as "Web 3.0", born inside an echo chamber. And yet, halfway through, the need for determinism/validation is being reinvented.
The analogies make use of cherry picked properties, which could apply to anything.
The whole AI scene is starting to feel a lot like the cryptocurrency bubble before it burst. Don’t get me wrong, there’s real value in the field, but the hype, the influencers, and the flashy “salon tricks” are starting to drown out meaningful ML research (like Apple's critical research that actually improves AI robustness). It’s frustrating to see solid work being sidelined or even mocked in favor of vibe-coding.
Meanwhile, this morning I asked Claude 4 to write a simple EXIF normalizer. After two rounds of prompting it to double-check its code, I still had to point out that it makes no sense to load the entire image for re-orienting if the EXIF orientation is fine in the first place.
Vibe vs reality, and anyone actually working in the space daily can attest how brittle these systems are.
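For what it's worth, the fix being described is tiny; a minimal sketch with Pillow, where the image is only re-encoded when the orientation tag actually needs correcting (the function name and the overwrite-in-place behavior are my assumptions, not whatever Claude produced):

```python
from PIL import Image, ImageOps

ORIENTATION_TAG = 0x0112  # standard EXIF orientation tag

def normalize_orientation(path: str) -> None:
    with Image.open(path) as img:
        if img.getexif().get(ORIENTATION_TAG, 1) == 1:
            return  # already upright: no need to decode the pixels or rewrite the file
        fixed = ImageOps.exif_transpose(img)  # applies the stored rotation/flip
    fixed.save(path)  # re-encode only when something actually changed
```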
> "Because they all have slight pros and cons, and you may want to program some functionality in 1.0 or 2.0, or 3.0, or you're going to train in LLM, or you're going to just run from LLM"
He doesn't say they will fully replace each other (or have fully replaced each other, since his definition of 2.0 is quite old by now).
Couldn’t believe my eyes. The www is truly bankrupt. If anyone has a browser plugin which automatically redirects to llms.txt sign me up.
Website too confusing for humans? Add more design, modals, newsletter pop ups, cookie banners, ads, …
Website too confusing for LLMs? Add an accessible, clean, ad-free, concise, high entropy, plain text summary of your website. Make sure to hide it from the humans!
PS: it should be /.well-known/llms.txt but that feels futile at this point..
> If anyone has a browser plugin which automatically redirects to llms.txt sign me up.
Not a browser plugin, but you can prefix URLs with `pure.md/` to get the pure markdown of that page. It's not quite a 1:1 to llms.txt as it doesn't explain the entire domain, but works well for one-off pages. [disclaimer: I'm the maintainer]
Even with this future approach, it still can live under the `/.well-known`, think of `/.well-known/llm/<mirrored path>` or `/.well-known/llm.json` with key/value mappings.
The web started dying with mobile social media apps, in which hyperlinks are a poor UX choice. Then again with SEO banning outlinks. Now this. The web of interconnected pages that was the World Wide Web is dead. Not on social media? No one sees you. Run a website? More bots than humans visit it. Unless you sell something on the side, a website isn't profitable. Hyperlinking to other websites is dead.
Gen Alpha doesn't know what a web page is, and if they do, it's for stuff like Neocities, i.e. as a curiosity or art form only, not as a source of information anymore. I don't blame them. Apps (social media apps) have less friction than websites but a higher barrier for people to create. We are going back to pre-World Wide Web days in a way, kind of like Bulletin Board Systems on dial-up without hyperlinking, and centralized (social media). Some countries, mostly ones with few technical people, like those in Central America, have moved away from the web almost entirely and onto social media like Instagram.
Due to the death of the web, Google search and friends now rely mostly on matching queries with titles, so just like before the internet you have to know people to learn new stuff, wait for an algorithm to show it to you or for someone to mention it online, or enroll in a university. Maybe that's why search results have declined and people search using ChatGPT or maybe Perplexity. Scholarly search engines are a bit better but frankly irrelevant for most people.
Now I understand why Google established their own DNS server at 8.8.8.8. If you have a directory of all domains on DNS, you can still index sites without hyperlinks between them, even if the web dies. They saw it coming.
If you have different representations of the same thing (llms.txt / HTML), how do you know they are actually equivalent to each other? I am wondering if there are scenarios where webpage publishers would be interested in gaming this.
I love the "people spirits" analogy. For casual tasks like vibecoding or boiling an egg, LLM errors aren't a big deal. But for critical work, we need rigorous checks—just like we do with human reasoning. That's the core of empirical science: we expect fallibility, so we verify. A great example is how early migration theories based on pottery were revised with better data like ancient DNA (see David Reich). Letting LLMs judge each other without solid external checks misses the point—leaderboard-style human rankings are often just as flawed.
I think that Andrej presents “Software 3.0” as a revolution, but in essence it is a natural evolution of abstractions.
Abstractions don't eliminate the need to understand the underlying layers - they just hide them until something goes wrong.
Software 3.0 is a step forward in convenience. But it is not a replacement for developers with a solid foundation; it is a tool for acceleration, amplification and scaling.
If you know what is under the hood — you are irreplaceable.
If you do not know — you become dependent on a tool that you do not always understand.
Funny thing is that in more than one of the Iron Man movies the suits end up being bad robots. Even the AI Iron Man made shows up to ruin the day in an Avengers movie. So it's a little on the nose that they'd try to pitch it this way.
I'm old enough to remember when Twitter was new, and for a moment it felt like the old utopian promise of the Internet finally fulfilled: ordinary people would be able to talk, one-on-one and unmediated, with other ordinary people across the world, and in the process we'd find out that we're all more similar than different and mainly want the same things out of life, leading to a new era of peace and empathy.
I believe the opposite happened. People found out that there are huge groups of people with wildly differing views on morality from them and that just encouraged more hate. I genuinely think old school facebook where people only interacted with their own private friend circles is better.
Broadcast networks like Twitter only make sense for influencers, celebrities and people building a brand. They're a net negative for literally anyone else.
> old school facebook where people only interacted with their own private friend circles is better.
100% agree but crazy that option doesn't exist anymore.
Why does vibe coding still involve any code at all? Why can't an AI directly control the registers of a computer processor and graphics card, controlling a computer directly? Why can't it draw on the screen directly, connected to the rows and columns of an LCD screen? What if an AI agent were implemented in hardware, with a processor for AI, a normal computer processor for logic, and a processor that correlates UI elements to touches on the screen? Plus a network card, some RAM for temporary stuff like UI elements, and some persistent storage for vectors that represent UI elements and past conversations.
I'm not sure this makes sense as a question. Registers are 'controlled' by running code for a given state. An AI can write code that changes registers, as all code does in operation. An AI can't directly 'control registers' in any other way, just as you or I can't.
I would like to make an AI agent that directly interfaces with a processor by setting bits in a processor register, thus eliminating the need for even assembly code or any kind of code. The only software you would ever need would be the AI.
The safety part will probably be either solved or a non-issue or ignored, similarly to how GPT-3 was often seen as dangerous before ChatGPT was released. Some people who have only ever vibe coded are finding jobs today, ignoring safety entirely and lacking any notion of it or what it means. They just copy-paste output from ChatGPT or an agentic IDE. To me it's already JIT with extra steps. Or companies have pivoted their software engineers to vibe coding most of the time, and they don't even touch code anymore, doing JIT with extra steps again.
In a way he's making sense. If the "code" is the prompt, the output of the llm is an intermediate artifact, like the intermediate steps of gcc.
So why should we still need gcc?
The answer, of course, is that we need it because the LLM's output is shit 90% of the time and debugging assembly or binary directly is even harder, so putting aside the difficulties of training the model, the output would be unusable.
Probably too much snark from me. But the gulf between interpreter and compiler can be decades of work, often discovering new mathematical principles along the way.
The idea that you're fine risking everything, in the way agentic things allow [0], and want that while messing around with raw memory is... a return to DOS-era crashes, but with HAL along for the ride.
Because any precise description of what the computer is supposed to do is already code as we know it. AI can fill in the gaps between natural language and programming by guessing, and because you don't always care about the "how", only about the "what". The more you care about the "how", the more precise you have to become in your language to reduce the AI's guesswork, to the point that your input to the AI is already code.
The question is: how much do we really care about the "how", even when we think we care about it? Modern programming languages don't do guesswork, but they already abstract away quite a lot of the "how".
I believe that's the original argument in favor of coding in assembler and that it will stay relevant.
Following this argument, what AI is really missing is determinism, to a large extent. I can't just save the input I have given to an AI and be sure that it will produce the exact same output a year from now.
With vibe coding, I am under the impression that the only thing that matters for vibe coders is whether the output is good enough in the moment to fulfill a desire. For companies going AI-first, that's how it seems to be done. I see people in other places too, and those people have lost interest in the "how".
Should we not treat LLMs more as a UX feature for interacting with a domain-specific model (highly contextual), rather than expecting LLMs to provide the intelligence needed for software to act as a partner to humans?
I guess Karpathy won't ever become a multi-millionaire/billionaire, seeing as he's now at the stage of presenting TEDx-like thingies.
That also probably shows that he's out of the loop when it comes to the work now being done in "AI", because had he been in it, he wouldn't have had time for this kind of fluffy presentation.
It's an interesting presentation, no doubt. The analogies eventually fail as analogies usually do.
A recurring theme presented, however, is that LLM's are somehow not controlled by the corporations which expose them as a service. The presenter made certain to identify three interested actors (governments, corporations, "regular people") and how LLM offerings are not controlled by governments. This is a bit disingenuous.
Also, the OS analogy doesn't make sense to me. Perhaps this is because I do not subscribe to LLM's having reasoning capabilities nor able to reliably provide services an OS-like system can be shown to provide.
A minor critique regarding the analogy equating LLM's to mainframes:
Mainframes in the 1960's never "ran in the cloud" as it did not exist. They still do not "run in the cloud" unless one includes simulators.
Terminals in the 1960's - 1980's did not use networks. They used dedicated serial cables or dial-up modems to connect either directly or through stat-mux concentrators.
"Compute" was not "batched over users." Mainframes either had jobs submitted and run via operators (indirect execution) or supported multi-user time slicing (such as found in Unix).
> The presenter made certain to identify three interested actors (governments, corporations, "regular people") and how LLM offerings are not controlled by governments. This is a bit disingenuous.
I don't think that's what he said, he was identifying the first customers and uses.
>> A recurring theme presented, however, is that LLM's are somehow not controlled by the corporations which expose them as a service. The presenter made certain to identify three interested actors (governments, corporations, "regular people") and how LLM offerings are not controlled by governments. This is a bit disingenuous.
> I don't think that's what he said, he was identifying the first customers and uses.
The portion of the presentation I am referencing starts at or near 12:50[0]. Here is what was said:
> I wrote about this one particular property that strikes me as very different this time around. It's that LLM's like flip they flip the direction of technology diffusion that is usually present in technology. So for example with electricity, cryptography, computing, flight, internet, GPS, lots of new transformative that have not been around. Typically it is the government and corporations that are the first users because it's new expensive etc. and it only later diffuses to consumer. But I feel like LLM's are kind of like flipped around.
> So maybe with early computers it was all about ballistics and military use, but with LLM's it's all about how do you boil an egg or something like that. This is certainly like a lot of my use. And so it's really fascinating to me that we have a new magical computer it's like helping me boil an egg. It's not helping the government do something really crazy like some military ballistics or some special technology.
Note the identification of historic government interest in computing along with a flippant "regular person" scenario in the context of "technology diffusion."
You are right in that the presenter identified "first customers", but this is mentioned in passing when viewed in context. Perhaps I should not have characterized this as "a recurring theme." Instead, a better categorization might be:
The presenter minimized the control corporations have by keeping focus on governmental topics and trivial customer use-cases.
Yeah that's explicitly about first customers and first uses, not about who controls it.
I don't see how it minimizes the control corporations have to note this. Especially since he's quite clear about how everything is currently centralized / time share model, and obviously hopeful we can enter an era that's more analogous to the PC era, even explicitly telling the audience maybe some of them will work on making that happen.
The team adapted quickly, which is a good sign. I believe getting the videos out sooner (as in why-not-immediately) is going to be a priority in the future.
You can't just put things there any time you want - the RFC requires that they go through a registration process.
Having said that, this won't work for llms.txt, since in the next version of the proposal they'll be allowed at any level of the path, not only the root.
> You can't just put things there any time you want - the RFC requires that they go through a registration process.
Actually, I can, for two reasons. The first is of course that the RFC mentions items can be registered after the fact, if it's found that a particular well-known suffix is being widely used. The second is a bit more chaotic: website owners are under no obligation to consult a registry, much like with port registrations; in many cases they won't even know it exists and may think of it as a place that should reflect their mental model.
It can make things awkward and difficult though, that is true, but that comes with the free-text nature of the well-known space. That's made evident in the GitHub issue linked: a large group of very smart people didn't know that there was a registry for it.
There was no "large group of very smart people" behind llms.txt. It was just me. And I'm very familiar with the registry, and it doesn't work for this particular case IMO (although other folks are welcome to register it if they feel otherwise, of course).
"""
A well-known URI is a URI [RFC3986] whose path component begins with the characters "/.well-known/", and whose scheme is "HTTP", "HTTPS", or another scheme that has explicitly been specified to use well-known URIs.
Applications that wish to mint new well-known URIs MUST register them, following the procedures in Section 5.1.
"""
What is this "clerk" library he used at this timestamp to tell him what to do? https://youtu.be/LCEmiRjPEtQ?si=XaC-oOMUxXp0DRU0&t=1991
Gemini found it via screenshot or context: https://clerk.com/
This is what he used for login on MenuGen: https://karpathy.bearblog.dev/vibe-coding-menugen/
https://github.com/EvolvingAgentsLabs/llmunix
An experiment to explore Karpathy's ideas.
how do i install this thing?
>That sounds awful.
Not for the cloud provider. AWS bill to the moon!
> You have to be extremely verbose in describing all of your requirements. There is seemingly no such thing as too much detail.
If only there was a language one could use that enables describing all of your requirements in a unambiguous manner, ensuring that you have provided all the necessary detail.
Oh wait.
If the thing producing the four PRs can't distinguish the top-tier one, I have strong doubts that it can even produce it.
If "AI" worked (which fortunately isn't the case), humans would be degraded to passive consumers in the last domain in which they were active creators: thinking.
Moreover, you would have to pay centralized corporations that stole all of humanity's intellectual output for engaging in your profession. That is terrifying.
The current reality is also terrifying: Mediocre developers are enabled to have a 10x volume (not quality). Mediocre execs like that and force everyone to use the "AI" snakeoil. The profession becomes even more bureaucratic, tool oriented and soulless.
People without a soul may not mind.
> And what is "real quality" and does that mean "fake quality" exists?
I think there is no real quality or fake quality, just quality. I am referencing the quality that Pirsig and C. Alexander have written about.
It’s… qualitative, so it’s hard to measure but easy to feel. Humans are really good at perceiving it then making objective decisions. LLMs don’t know what it is (they’ve heard about it and think they know).
It is actually funny that current AI+Coding tools benefit a lot from domain context and other information along the lines of Domain-Driven Design (which was inspired by the pattern language of C. Alexander).
A few teams have started incorporating `CONTEXT.MD` into module descriptions to leverage this.
> That sounds awful. A truly terrible and demotivating way to work and produce anything of real quality
This is the right way to work with generative AI, and it already is an extremely common and established practice when working with image generation.
"If the only tool you have is a hammer, you tend to see every problem as a nail."
I think the world's leaning dangerously into LLMs, expecting them to solve every problem under the sun. Sure, AI can solve problems, but the domain 1 that Karpathy shows, if it is the body of new knowledge in the world, doesn't grow with LLMs and agents. Maybe generation and selection is the best method for working with domains 2/3, but there is something fundamentally lost in the rapid embrace of these AI tools.
A true challenge question for people is: would you give up 10 points of IQ for access to the next-gen AI model? I don't ask this in the sense that AI makes people stupid, but rather because it frames the value of intelligence as something you have, rather than as how quickly you can look up or generate an answer that may or may not be correct. How we use our tools deeply shapes what we will do in the future. A cautionary tale is US manufacturing of precision tools, where we gave up on teaching people how to use lathes because they could simply run CNC machines instead. Now that industry has an extreme lack of programmers for CNC machines, making it impossible to keep up with other precision-instrument-producing countries. This is of course a normative statement with more complex variables, but I fear that in this dead-set charge for AI we will lose sight of what makes programming languages, and programming in general, valuable.
I can recognize images in one look.
How about that 400-line change that touches 7 files?
Exactly!
This is why there has to be "write me a detailed implementation plan" step in between. Which files is it going to change, how, what are the gotchas, which tests will be affected or added etc.
It is easier to review one document and point out missing bits, than chase the loose ends.
Once the plan is done and good, it is usually a smooth path to the PR.
In my prompt I ask the LLM to write a short summary of how it solved the problem. I run multiple instances of the LLM concurrently, compare their summaries, and use the output of whichever one seems to have interpreted the instructions best, or arrived at the best solution.
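Roughly the shape of that workflow, as a sketch; `run_agent` is a hypothetical stand-in for whatever agent or API call is actually used, and the SUMMARY: marker is just a convention for pulling the summaries out:

```python
from concurrent.futures import ThreadPoolExecutor

PROMPT = """<task description goes here>
Finish by writing a short SUMMARY: section explaining how you solved the problem."""

def run_agent(prompt: str) -> str:
    """Hypothetical call to an LLM/agent; returns its full output, including the summary."""
    raise NotImplementedError

N = 4
with ThreadPoolExecutor(max_workers=N) as pool:
    outputs = list(pool.map(run_agent, [PROMPT] * N))

# Read only the summaries, then dig into whichever candidate interpreted the task best.
for i, out in enumerate(outputs):
    summary = out.split("SUMMARY:", 1)[-1].strip()
    print(f"--- candidate {i} ---\n{summary}\n")
```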
And you trust that the summary matches what was actually done? Your experience with the level of LLMs understanding of code changes must significantly differ from mine.
It is not. The right way to work with generative AI is to get the right answer in the first shot. But it's the AI that is not living up to this promise.
Reviewing 4 different versions of AI code is grossly unproductive. A human co-worker can submit one version of code and usually have it accepted with a single review, no other "versions" to verify. With 4 versions, 75% of the code you're reading is wasted effort. Multiply this across every change ever made to a code base, and you're wasting a shitload of time.
> Reviewing 4 different versions of AI code is grossly unproductive.
You can have another AI do that for you. I review manually for now though (summaries, not the code, as I said in another message).
Here’s a few problems I foresee:
1. People get lazy when presented with four choices they had no hand in creating; they don't look over all four, they just click one and ignore the others. Why? Because they have ten more of these on the go at once, diminishing their overall focus.
2. Automated tests, end-to-end sim., linting, etc—tools already exist and work at scale. They should be robust and THOROUGHLY reviewed by both AI and humans ideally.
3. AI is good for code reviews and “another set of eyes” but man it makes serious mistakes sometimes.
An anecdote for (1), when ChatGPT tries to A/B test me with two answers, it’s incredibly burdensome for me to read twice virtually the same thing with minimal differences.
Code reviewing four things that do almost the same thing is more of a burden than writing the same thing once myself.
With lazy people the same applies to everything: code they write themselves, or code they review from peers. The issue is not the tooling, but the hands.
The more tedious the work is, the less motivation and passion you get for doing it, and the more "lazy" you become.
Laziness does not just come from within, there are situations that promote behaving lazy, and others that don't. Some people are just lazy most of the time, but most people are "lazy" in some scenarios and not in others.
A simple rule applies: "No matter what tool created the code, you are still responsible for what you merge into main".
As such, the task of verification still falls into the hands of engineers.
Given that and proper processes, modern tooling works nicely with codebases ranging from 10k LOC (mixed embedded device code with golang backends and python DS/ML) to 700k LOC (legacy enterprise applications from the mainframe era)
> As such, the task of verification still falls into the hands of engineers.
Even before LLMs it was common to merge changes which completely break the test environment. Some people really do skip the verification phase of their work.
Agreed. I think engineers who follow simple Test-Driven Development procedures can write the code, unit tests, integration tests, debugging, etc. for a small enough unit that tight feedback loops are forced by default. AI may assist in the particulars, not run the show.
I'm willing to bet, short of droid-speak or some AI output we can't even understand, that when considering "the system as a whole", even with short-term gains in speed, the longevity of any product will be better with real people following current best practices, and perhaps a modest sprinkle of AI.
Why? Because AI is trained on the results of human endeavors and can only work within that framework.
Agreed. AI is just a tool. Letting it run the show is essentially what vibe-coding is. It is a fun activity for prototyping, but it tends to accumulate problems and tech debt at an astonishing pace.
Code, manually crafted by professionals, will almost always beat AI-driven code in quality. Yet, one has still to find such professionals and wait for them to get the job done.
I think, the right balance is somewhere in between - let tools handle the mundane parts (e.g. mechanically rewriting that legacy Progress ABL/4GL code to Kotlin), while human engineers will have fun with high-level tasks and shaping the direction of the project.
I don't think the human is the problem here, but the time it takes to run the full testing suite.
Unless you are doing something crazy like letting the fuzzer run on every change (cache that shit), the full test suite taking a long time suggests that either your isolation points are way too large or you are letting the LLM cross isolated boundaries and "full testing suite" here actually means "multiple full testing suites". The latter is an easy fix: Don't let it. The former is a lot harder to fix, but I suppose ending up there is a strong indicator that you can't trust the human picking the best LLM result in the first place and that maybe this whole thing isn't a good idea for the people in your organization.
Humans tend to lack inhuman patience.
The full test suite is probably tens of thousands of tests.
But AI will do a pretty decent job of telling you which tests are most likely to fail on a given PR. Just run those ones, then commit. Cuts your test time from hours down to seconds.
Then run the full test suite only periodically and automatically bisect to find out the cause of any regressions.
Dramatically cuts the compute costs of tests too, which in big codebase can easily become whole-engineers worth of costs.
It's an interesting idea, but reactive, and could cause big delays due to bisecting and testing on those regressions. There's the 'old' saying that the sooner the bug is found the cheaper it is to fix, seems weird to intentionally push finding side effect bugs later in the process because faster CI runs. Maybe AI will get there but it seems too aggressive right now to me. But yeah, put the automation slider where you're comfortable.
It is kind of a human problem too, although that the full testing suite takes X hours to run is also not fun, but it makes the human problem larger.
Say you're Human A, working on a feature. Running the full testing suite takes 2 hours from start to finish. Every change you make to existing code needs to be confirmed not to break existing stuff with the full testing suite, so for some changes it takes 2 hours before you know with 100% certainty that nothing else breaks. How quickly do you lose interest, and at what point do you give up and either improve the testing suite or just skip that feature/implement it some other way?
Now say you're Robot A working on the same task. The robot doesn't care if each change takes 2 hours to appear on their screen, the context is exactly the same, and they're still "a helpful assistant" 48 hours later when they still try to get the feature put together without breaking anything.
If you're feeling brave, you start Robot B and C at the same time.
Worked in such a codebase for about 5 years.
No one really cares about improving test times. Everyone either suffers in private or gets convinced it's all normal and look at you weird when you suggest something needs to be done.
This is the workflow that ChatGPT Codex demonstrates nicely. Launch any number of «robotic» tasks in parallel, then go on your own. Come back later to review the results and pick good ones.
Well, they're demonstrating it somewhat; it's more of a prototype today. The first tell is the low limit: I think the longest task for me has been 15 minutes before it gives up. The second tell is that it still uses a chat UI, which is simple and familiar to implement, but also kind of lazy. There should be a better UX, especially with the new variations they just added. Off the top of my head, some graph-like UX might have been better.
I guess, it depends on the case and the approach.
It works really nicely with the following approach (distilled from experiences reported by multiple companies):
(1) Augment codebase with explanatory texts that describe individual modules, interfaces and interactions (something that is needed for the humans anyway)
(2) Provide Agent.MD that describes the approach/style/process that the AI agent must take. It should also describe how to run all tests.
(3) Break down the task into smaller features. For each feature, first ask for a detailed implementation plan (because it is easier to review the plan than 1000 lines of changes spread across a dozen files)
(4) Review the plan and ask to improve it, if needed. When ready - ask to draft an actual pull request
(5) The system will automatically use all available tests/linting/rules before writing the final PR. Verify and provide feedback, if some polish is needed.
(6) Launch multiple instances of "write me an implementation plan" and "Implement this plan" task, to pick the one that looks the best.
This is very similar to git-driven development of large codebases by distributed teams.
Edit: added newlines
Yes, and (some near-future) AI is also more patient and better at multitasking than a reasonable human. It can make a change, submit for full fuzzing, and if there's a problem it can continue with the saved context it had when making the change. It can work on 100s of such changes in parallel, while a human trying to do this would mix up the reasons for the change with all the other changes they'd done by the time the fuzzing result came back.
LLMs are worse at many things than human programmers, so you have to try to compensate by leveraging the things they're better at. Don't give up with "they're bad at such and such" until you've tried using their strengths.
You can't run N bots in parallel with testing between each attempt unless you're also running N tests in parallel.
If you could run N tests in parallel, then you could probably also run the components of one test in parallel and keep it from taking 2 hours in the first place.
To me this all sounds like snake oil to convince people to do something they were already doing, but also spinning up N times as many compute instances and burning endless tokens along the way. And by the time it's demonstrated that it doesn't really offer anything more than doing it yourself, well, you've already given them all of your money, so their job is done.
Running tests is already an engineering problem.
In one of the systems (supply chain SaaS) we invested so much effort in having good tests in a simulated environment, that we could run full-stack tests at kHz. Roughly ~5k tests per second or so on a laptop.
I work in web dev, so people sometimes hook code formatting into a git commit hook, or sometimes even run it on file save. The tests are problematic, though. If you work on a huge project it's a no-go idea at all. If you work on a medium one, the tests are long enough to block you, but short enough that you can't focus on anything else in the meantime.
In my experience with Jules and (worse) Codex, juggling multiple pull requests at once is not advised.
Even if you tell the git-aware Jules to handle a merge conflict within the context window where the patch was generated, it's like "sorry bro, I have no idea what's wrong, can you send me a diff with the conflict?"
I find I have to be in the iteration loop at every stage, or else the agent rapidly forgets what it's doing or why. For instance, don't trust Jules to run your full test suite after every change without handholding and asking for specific run results every time.
It feels like to an LLM, gaslighting you with code that nominally addresses the core of what you just asked while completely breaking unrelated code or disregarding previously discussed parameters is an unmitigated success.
The problem is that every time you run your full automation with linting and tests, you’re filling up the context window more and more. I don’t know how people using Claude do it with its <300k context window. I get the “your message will exceed the length of this chat” message so many times.
I don't know exactly how Claude works, but the way I work around this with my own stuff is prompting it to never display full outputs, and instead temporarily redirect the output to a file, then grep the log file for what it's looking for. So a test run producing 10K lines of output and one failure is easily handled without polluting the context with 10K lines.
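As a concrete illustration of that pattern (assuming a pytest suite; the log path and failure markers are placeholders), the kind of command sequence you want the model to run looks roughly like this:

```python
import subprocess

# Run the suite, but keep the full output in a log file instead of the context window.
with open("/tmp/test_run.log", "w") as log:
    subprocess.run(["pytest", "-q"], stdout=log, stderr=subprocess.STDOUT)

# Surface only the interesting lines; this is all the model needs to see.
with open("/tmp/test_run.log") as log:
    for line in log:
        if "FAILED" in line or "ERROR" in line:
            print(line.rstrip())
```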
Claude's approach is currently a bit dated.
Cursor.sh agents or especially OpenAI Codex illustrate that a tool doesn't need to keep on stuffing context window with irrelevant information in order to make progress on a task.
And if really needed, engineers report that Gemini Pro 2.5 keeps on working fine within 200k-500k token context. Above that - it is better to reset the context.
I started to use sub agents for that. That does not pollute the context as much
I think it's interesting to juxtapose traditional coding, neural network weights and prompts because in many areas -- like the example of the self driving module having code being replaced by neural networks tuned to the target dataset representing the domain -- this will be quite useful.
However I think it's important to make it clear that given the hardware constraints of many environments the applicability of what's being called software 2.0 and 3.0 will be severely limited.
So instead of being replacements, these paradigms are more like extra tools in the tool belt. Code and prompts will live side by side, being used when convenient, but none a panacea.
I kind of say it in words (agreeing with you), but I agree the versioning is a bit of a confusing analogy because it usually also implies some kind of improvement, when I'm just trying to distinguish them as very different software categories.
What do you think about structured outputs / JSON mode / constrained decoding / whatever you wish to call it?
To me, it's a criminally underused tool. While "raw" LLMs are cool, they're annoying to use as anything but chatbots, as their output is unpredictable and basically impossible to parse programmatically.
Structured outputs solve that problem neatly. In a way, they're "neural networks without the training". They can be used to solve similar problems as traditional neural networks, things like image classification or extracting information from messy text, but all they require is a Zod or Pydantic type definition and a prompt. No renting GPUs, labeling data and tuning hyperparameters necessary.
They often also improve LLM performance significantly. Imagine you're trying to extract calories per 100g of product, but some products give you calories per serving and a serving size, calories per pound, etc. The naive way to do this is a prompt like "give me calories per 100g", but that forces the LLM to do arithmetic, and LLMs are bad at arithmetic. With structured outputs, you just give it the fifteen different formats that you expect to see as alternatives, and use some simple Python to turn them all into calories per 100g on the backend side.
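A minimal sketch of that calories example with Pydantic; the class names are made up, and the `NutritionAnswer` model is what you would hand to whichever structured-output interface your provider exposes, with the arithmetic staying in Python:

```python
from typing import Literal, Union
from pydantic import BaseModel

class PerHundredGrams(BaseModel):
    kind: Literal["per_100g"]
    kcal_per_100g: float

class PerServing(BaseModel):
    kind: Literal["per_serving"]
    kcal_per_serving: float
    serving_size_g: float

class PerPound(BaseModel):
    kind: Literal["per_pound"]
    kcal_per_pound: float

class NutritionAnswer(BaseModel):
    # The LLM only has to pick one of the shapes it actually saw on the label.
    value: Union[PerHundredGrams, PerServing, PerPound]

def kcal_per_100g(answer: NutritionAnswer) -> float:
    v = answer.value
    if isinstance(v, PerHundredGrams):
        return v.kcal_per_100g
    if isinstance(v, PerServing):
        return v.kcal_per_serving * 100 / v.serving_size_g
    return v.kcal_per_pound * 100 / 453.6  # 1 lb is roughly 453.6 g
```

The point of the design is that the model only chooses a shape; the unit conversion happens deterministically in code.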
Even more than that: with structured outputs we essentially control the layout of the response, so we can force the LLM to go through different parts of the completion in a predefined order.
One way teams exploit that is to force the LLM to go through a predefined, task-specific checklist before answering. This custom, hard-coded chain of thought boosts accuracy and makes the reasoning more auditable.
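Continuing the Pydantic sketch above, the ordering trick is literally just field order: the checklist fields come before the answer, so the model has to emit them first (field names here are purely illustrative):

```python
from pydantic import BaseModel

class AuditedAnswer(BaseModel):
    # Emitted first, in order, because structured decoding follows the field layout.
    restated_question: str
    relevant_facts: list[str]
    checks_performed: list[str]
    # Only after the checklist does the model commit to an answer.
    final_answer: str
```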
Weights are code being replaced by data; something I've been making heavy use of since the early 00s. After coding for 10 years you start to see the benefits of it and understand where you should use it.
LLMs give us another tool only this time it's far more accessible and powerful.
Great talk, thanks for putting it online so quickly. I liked the idea of making the generation / verification loop go brrr, and one way to do this is to make verification not just a human task, but a machine task, where possible.
Yes, I am talking about formal verification, of course!
That also goes nicely together with "keeping the AI on a tight leash". It seems to clash though with "English is the new programming language". So the question is, can you hide the formal stuff under the hood, just like you can hide a calculator tool for arithmetic? Use informal English on the surface, while some of it is interpreted as a formal expression, put to work, and then reflected back in English? I think that is possible, if you have a formal language and logic that is flexible enough, and close enough to informal English.
Yes, I am talking about abstraction logic [1], of course :-)
So the goal would be to have English (German, ...) as the ONLY programming language, invisibly backed underneath by abstraction logic.
[1] http://abstractionlogic.com
> So the question is, can you hide the formal stuff under the hood, just like you can hide a calculator tool for arithmetic? Use informal English on the surface, while some of it is interpreted as a formal expression, put to work, and then reflected back in English?
The problem with trying to make "English -> formal language -> (anything else)" work is that informality is, by definition, not a formal specification and therefore subject to ambiguity. The inverse is not nearly as difficult to support.
Much like how a property in an API initially defined as being optional cannot be made mandatory without potentially breaking clients, whereas making a mandatory property optional can be backward compatible. IOW, the cardinality of "0 .. 1" is a strict superset of "1".
> The problem with trying to make "English -> formal language -> (anything else)" work is that informality is, by definition, not a formal specification and therefore subject to ambiguity. The inverse is not nearly as difficult to support.
Both directions are difficult and important. How do you determine when going from formal to informal that you got the right informal statement? If you can judge that, then you can also judge if a formal statement properly represents an informal one, or if there is a problem somewhere. If you detect a discrepancy, tell the user that their English is ambiguous and that they should be more specific.
lean 4/5 will be a rising star!
You would definitely think so, Lean is in a great position here!
I am betting though that type theory is not the right logic for this, and that Lean can be leapfrogged.
I think type theory is exactly right for this! Being so similar to programming languages, it can piggy back on the huge amount of training the LLMs have on source code.
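For concreteness, here's a tiny Lean 4 sketch (my own, not from the talk) of what that machine verification looks like: the statement reads like code, and the kernel either accepts the proof or rejects it, so there's no "plausible but wrong" output left to review.

    -- The kernel checks the proof term against the statement;
    -- an LLM-generated proof either type-checks or is rejected outright.
    theorem my_add_comm (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b

    -- Definitional computation: `n + 0` reduces to `n`, so `rfl` closes the goal.
    example (n : Nat) : n + 0 = n := rfl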
I am not sure lean in part is the right language, there might be challengers rising (or old incumbents like Agda or Roq can find a boost). But type theory definitely has the most robust formal systems at the moment.
> Being so similar to programming languages
I think it is more important to be close to English than to programming languages, because that is the critical part:
"As close to a programming language as necessary, as close to English as possible"
is the goal, in my opinion, without sacrificing constraints such as simplicity.
Why? By the completeness theorem, shouldn't first order logic already be sufficient?
The calculus of constructions and other approaches are already available and proven. I'm not sure why we'd need a special logic for LLMs unless said logic somehow accounts for their inherently stochastic tendencies.
If first-order logic is already sufficient, why are most mature systems using a type theory? Because type theory is more ergonomic and practical than first-order logic. I just don't think that type theory is ergonomic and practical enough. That is not a special judgement with respect to LLMs, I want a better logic for myself as well. This has nothing to do with "stochastic tendencies". If it is easier to use for humans, it will be easier for LLMs as well.
When I started coding at the age of 11 in machine code and assembly on the C64, the dream was to create software that creates software. Nowadays it's almost reality; almost, because the devil is always in the details. When you're used to writing code, writing code is relatively fast, and you need that knowledge to debug issues with generated code. Yet now you're telling the AI to fix the bugs in the code it generated. I see it kind of like machine code being overlaid with asm, which gets overlaid with C or whatever higher-level language, which then adopts dogma/methodology like MVC and such, and on top of that there's now the AI input and generation layer.

But it's not widely available. Affording more than one computer is a luxury; many households are even struggling to get by. When you see those setups with what, 5-7 Mac Minis, which normal average Joe can afford that, or even has the knowledge to construct an LLM at home? I don't. This is a toy for rich people. Just like with public clouds such as AWS and GCP, which I left out because the cost is too high; running my own is also too expensive, and there are cheaper alternatives that not only cost less but also have way less overhead.
What would be interesting to see is what those kids produced with their vibe coding.
> those kids produced with their vibe coding
No one, including Karpathy in this video, is advocating for "vibe coding". If nothing more, LLMs paired with configurable tool usage are basically a highly advanced, contextual search engine you can ask questions. Are you not using a search engine today?
Even without LLMs being able to produce code or act as agents they'd be useful, because of that.
But it sucks we cannot run competitive models locally, I agree, it is somewhat of a "rich people" tool today. Going by the talk and theme, I'd agree it's a phase, like computing itself had phases. But you're gonna have to actually watch and listen to the talk itself, right now you're basically agreeing with the video yet wrote your comment like you disagree.
This is most definitely not a toy for rich people. Perhaps, depending on your country, it may be considered rich, but I would comfortably say that for most of the developed world the costs for these tools are absolutely attainable; there is a reason ChatGPT has such a large subscriber base.
Also, the disconnect for me here is that when I think back on the cost of electronics, prices for the same level of compute have generally gone down significantly over time. The C64 launched at around the $500-600 price level, not adjusted for inflation. You can go and buy a Mac mini for that price today.
> This is a toy for rich people
GitHub copilot has a free tier.
Google gives you thousands of free LLM API calls per day.
There are other free providers too.
1st dose is free
Agreed. It is worth noting how search has evolved over the years.
LLM APIs are pretty darn cheap for most of the developed world's income levels.
Yeah, because they're bleeding money like crazy now.
You should consider how much it actually costs, not how much they charge.
How do people fail to consider this?
> You should consider how much it actually costs, not how much they charge.
> How do people fail to consider this?
Sure, nobody can predict the long-term economics with certainty but companies like OpenAI already have compelling business fundamentals today. This isn’t some scooter startup praying for margins to appear; it’s a platform with real, scaled revenue and enterprise traction.
But yeah, tell me more about how my $200/mo plan is bankrupting them.
how much does it cost?
It's cheap now. But if you take into account all the training costs, then at such prices they cannot make a profit in any way. This is called dumping to capture the market.
No doubt the complete cost of training and of getting to where we are today has been significant, and I don't know how the accounting will look years from now, but you are just making up the rest based on feelings. We know OpenAI is operationally profitable on purely the runtime side; nobody knows how that will look when accounting for R&D, but you are in no position to say they cannot make a profit in any way.
Yes, if you do not take into account the cost of training, I think it is very likely profitable. The cost of running the models is not that high. This is just my opinion based on open models, and I admit that I have not carried out accurate calculations.
Software 3.0 is the code generated by the machine, not the prompts that generated it. The prompts don't even yield the same output; there is randomness.
The new software world is the massive amount of code that will be burped out by these agents, and it should quickly dwarf the human output.
How I understood it is that natural language will form relatively large portions of stacks (endpoint descriptions, instructions, prompts, documentation, etc.), in addition to code generated by agents (which would fall under 1.0).
I think that if you give the same task to three different developers you'll get three different implementations. It's not a random result if you do get the functionality that was expected, and at that, I do think the prompt plays an important role in offering a view of how the result was achieved.
> I think that if you give the same task to three different developers you'll get three different implementations.
Yes, but if you want them to be compatible you need to define a protocol and conformance test suite. This is way more work than writing a single implementation.
The code is the real spec. Every piece of unintentional non-determinism can be a hazard. That’s why you want the code to be the unit of maintenance, not a prompt.
I know! Let's encode the spec into a format that doesn't have the ambiguities of natural language.
The comparison of our current methods of interacting with LLMs (back and forth text) to old-school terminals is pretty interesting. I think there's still a lot work to be done to optimize how we interact with these models, especially for non-dev consumers.
Audio may be the better option.
It's fascinating to think about what a true GUI for an LLM could be like.
It immediately makes me think of an LLM that can generate a customized GUI for the topic at hand, which you can interact with in a non-linear way.
Fun demo of an early idea was posted by Oriol just yesterday :)
https://x.com/OriolVinyalsML/status/1935005985070084197
On one hand, I'm incredibly impressed by the technology behind that demo. On the other hand, I can't think of many things that would piss me off more than a non-deterministic operating system.
I like my tools to be predictable. Google search trying to predict that I want the image or shopping tag based on my query already drives me crazy. If my entire operating system did that, I'm pretty sure I'd throw my computer out a window.
> incredibly impressed by the technology behind that demo
An LLM generating some HTML?
Ah yes, my operating system, most definitely a place I want to stick the Hallucinotron-3000 so that every click I make yields a completely different UI that has absolutely 0 bearing on reality. We're truly entering the "Software 3.0" days (can't wait for the imbeciles shoving AI everywhere to start overusing that dogshit, made-up marketing term incessantly)
Maybe we can collect all of this salt and operate a Thorium reactor with it, this in turn can then power AI.
We'll need to boil a few more lakes before we get to that stage I'm afraid, who needs water when you can have your AI hallucinate some for you after all?
Who needs water when all these hot takes come from sources so dense, they're about to collapse into black holes.
Is me not wanting the UI of my OS to shift with every mouse click a hot take? If me wanting to have the consistent "When I click here, X happens" behavior instead of the "I click here and I'm Feeling Lucky happens" behavior is equal to me being dense, so be it I guess.
it's impressive, but it seems like a crappier UX? None of the patterns can really be memorized.
Having different documents come up every time you go into the documents directory seems hellishly terrible.
It's a brand of terribleness I've somewhat gotten used to, opening Google Drive every time, when it takes me to the "Suggested" tab. I can't recall a single time when it had the document I care about anywhere close to the top.
There's still nothing that beats the UX of Norton Commander.
This is crazy cool, even if not necessarily the best use case for this idea
An ever-shifting UI sounds unlearnable, and therefore unusable.
A mixed ever-shifting UI can be excellent though. So you've got some tools which consistently interact with UI components, but the UI itself is altered frequently.
Take for example world-building video games like Cities Skylines / Sim City or procedural sandboxes like Minecraft. There are 20-30 consistent buttons (tools) in the game's UX, while the rest of the game is an unbounded ever-shifting UI.
The rest of the game is very deterministic where its state is controlled by the buttons. The slight variation is caused by the simulation engine and follows consistent patterns (you can’t have building on fire if there’s no building yet).
Like Spotify ugh
It wouldn't be unlearnable if it fits the way the user is already thinking.
AI is not mind reading.
My friend Eric Pelz started a company called Malleable to do this very thing: https://www.linkedin.com/posts/epelz_every-piece-of-software...
This talk https://www.youtube.com/watch?v=MbWgRuM-7X8 explores the idea of generative / malleable personal user interfaces where LLMs can serve as the gateway to program how we want our UI to be rendered.
Like a HyperCard application?
We (https://vibes.diy/) are betting on this
Borderline off-topic, but since you're flagrantly self-promoting, might as well add some more rule breakage to it.
You know those websites/apps that let you enter text/details and only show the sign-in/up screen once you submit, so you feel like "Oh, but I already filled it out, might as well sign up"?
They really suck, big time! It's disingenuous, misleading and wastes people's time. I had no interest in using your thing for real, but thought I'd try it out, potentially leave some feedback, but this bait-and-switch just made the whole thing feel sour and I'll probably try to actively avoid this and anything else I feel is related to it.
I love this concept and would love to know where to look for people working on this type of thing!
Humans are shit at interacting with systems in a non-linear way. Just look at Jupyter notebooks and the absolute mess that arises when you execute code blocks in arbitrary order.
loved the analogies! Karpathy is consistently one of the clearest thinkers out there.
interesting that Waymo could do uninterrupted trips back in 2013, wonder what took them so long to expand? regulation? the long tail of driving optimization issues?
noticed one of the slides had a cross over 'AGI 2027'... ai-2027.com :)
You don't "solve" autonomous driving as such. There's a long, slow grind of gradually improving things until failures become rare enough.
I wonder at what point all the self-driving code becomes replaceable with a multimodal generalist model with the prompt “drive safely”
Very advanced machine learning models are used in current self driving cars. It all depends what the model is trying to accomplish. I have a hard time seeing a generalist prompt-based generative model ever beating a model specifically designed to drive cars. The models are just designed for different, specific purposes
I could see it being the case that driving is a fairly general problem, and this models intentionally designed to be general end up doing better than models designed with the misconception that you need a very particular set of driving-specific capabilities.
Speed and Moore's law. You don't need to just make a decision without hallucinations, you need to do it fast enough for it to propagate to the power electronics and hit the gas/brake/turn the wheel/whatever. Over and over and over again on thousands of different tests.
A big problem I am noticing is that IT culture over the last 70 years has existed in a state of "hardware gonna get faster soon". And over the last ten years we've had a "hardware can't get faster bc physics, sorry" problem.
The way we've been making software in the 90s and 00s just isn't gonna be happening anymore. We are used to throwing more abstraction layers (C->C++->Java->vibe coding etc) at the problem and waiting for the guys in the fab to hurry up and get their hardware faster so our new abstraction layers can work.
Well, you can fire the guys in the fab all you want, but no matter how much they yell at nature, it doesn't seem to care. They told us, the embedded C++-monkeys, to spread the message: sorry, Moore's law is over, boys and girls. I think we all need to take a second to take that in and realize the significance of it.
[1] The "guys in the fab" are a fictional character and any similarity to the real world is a coincidence.
[2] No c++-monkeys were harmed in the process of making this comment.
Driving is not a general problem, though. It's a contextual landscape of fast reactions and predictions. Both are required, and done regularly by the human element. The exact nature of every reaction, and every prediction, changes vastly within the context window.
You need image processing just as much as you need scenario management, and they're orthogonal to each other, as one example.
If you want a general transport system... We do have that. It's called rail. (And can and has been automated.)
It partially is. You have the specialized part of maneuvering a fast-moving vehicle in the physical world, trying to keep it under control at all times and never colliding with anything. Then you have the general part, which is navigating the human environment. That's lanes and traffic signs and road works and schoolbuses, that's kids on the road and badly parked trailers.
The current breed of autonomous driving systems has problems with exceptional situations - but based on everything I've read so far, those are exactly the kind that would benefit from a general system able to understand the situation it's in.
> Driving is not a general problem, though.
But what's driving a car? A generalist human brain that has been trained for ~30 hours to drive a car.
Human brains aren't generalist!
We have multiple parts of the brain that interact in vastly different ways! Your cerebellum won't be running the role of the pons.
Most parts of the brain cannot take over for others. Self-healing is the exception, not the rule. Yes, we have a degree of neuroplasticity, but there are many limits.
(Sidenote: Driver's license here is 240 hours.)
> We have multiple parts of the brain that interact in vastly different ways!
Yes, and thanks to that human brains are generalist
Only if that was a singular system, however, it is not. [0]
For example... The nerve cells in your gut may speak to the brain, and interact with it in complex ways we are only just beginning to understand, but they are separate systems that both have control over the nervous system, and other systems. [1]
General Intelligence, the psychological theory, and General Modelling, whilst sharing words, share little else.
[0] https://doi.org/10.1016/j.neuroimage.2022.119673
[1] https://doi.org/10.1126/science.aau9973
> Human brains aren't generalist!
What? Human intelligence is literally how AGI is defined. Brain’s physical configuration is irrelevant.
A human brain is not a general model. We have multiple overlapping systems. The physical configuration is extremely relevant to that.
AGI is defined in terms of "General Intelligence", a theory that general modelling is irrelevant to.
exactly! I think that was Tesla's vision with self-driving to begin with... so they tried to frame it as a problem general enough that trying to solve it would also solve questions of more general intelligence ('AGI'), i.e. cars should use vision just like humans would
but in hindsight it looks like this slowed them down quite a bit despite being early to the space...
This is (in part) what "world models" are about. While some companies like Tesla bring together a fleet of small specialised models, others like CommaAI and Wayve train generalist models.
One of the issues with deploying models like that is the lack of clear, widely accepted ways to validate comprehensive safety and absence of unreasonable risk. If that can be solved, or regulators start accepting answers like "our software doesn't speed in over 95% of situations", then they'll become more common.
> Karpathy is consistently one of the clearest thinkers out there.
Eh, he ran Tesla's self-driving division and put them in a direction that is never going to fully work.
What they should have done is a) trained a neural net to map sequences of frames into a physical environment, and b) leveraged MuZero, so that the self-driving system basically builds out parallel simulations into the future and searches for the best course of action to take.
Because that's pretty much what makes humans great drivers. We don't need to know what a cone is - we internally compute that colliding with an object on the road we're driving towards will result in a negative outcome.
Aren't continuous, stochastic, partial-knowledge environments where you need long-horizon planning with strict deadlines and limited compute exactly the sort of environments MuZero variants struggle with? Because that's driving.
It's also worth mentioning that humans intentionally (and safely) drive into "solid" objects all the time. Bags, steam, shadows, small animals, etc. We also break rules (e.g. drive on the wrong side of the road), and anticipate things we can't even see based on a theory of mind of other agents. Human driving is extremely sophisticated, not reducible to rules that are easily expressed in "simple" language.
> We don't need to know what a cone is
The counter argument is that you can't zoom in and fix a specific bug in this mode of operation. Everything is mashed together in the same neural net process. They needed to ensure safety, so testing was crucial. It is harder to test an end-to-end system than its individual parts.
I don't think that would have worked either.
But if they'd gone for radars and lidars and a bunch of sensors and then enough processing hardware to actually fuse that, then I think they could have built something that had a chance of working.
That's absolutely not what makes humans great drivers?
Is that the approach that waymo uses?
Where do these analogies break down?
1. Similar cost structure to electricity, but non-essential utility (currently)?
2. Like an operating system, but with non-determinism?
3. Like programming, but ...?
Where does the programming analogy break down?
Define non-essential.
The way I see dependency in office ("knowledge") work:
- pre-(computing) history. We are at the office, we work
- dawn of the pc: my computer is down, work halts
- dawn of the lan: the network is down, work halts
- dawn of the Internet: the Internet connection is down, work halts (<- we are basically all here)
- dawn of the LLM: ChatGPT is down, work halts (<- for many, we are here already)
I see your point. It's nearing essential.
> programming
The programming analogy is convenient but off. The joke has always been “the computer only does exactly what you tell it to do!” regarding logic bugs. Prompts and LLMs most certainly do not work like that.
I loved the parallels with modern LLMs and time sharing he presented though.
> Prompts and LLMs most certainly do not work like that.
It quite literally works like that. The computer is now OS + user-land + LLM runner + ML architecture + weights + system prompt + user prompt.
Taken together, and since you're adding in probabilities (by using ML/LLMs), you're still quite literally getting "the computer only does exactly what you tell it to do!"; it's just that we have added "but make slight variations to which token you select next" (temperature > 0.0) sometimes, and it's still the same thing.
Just like when you tell the computer to create encrypted content by using some seed. You're getting exactly what you asked for.
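A toy sketch of what that "slight variation" boils down to: with temperature 0 the same logits always yield the same token, and any randomness is an explicit sampling step you opted into (the toy logits below are made up):

    import math
    import random

    def sample_next_token(logits: dict[str, float], temperature: float, seed: int | None = None) -> str:
        # temperature == 0 means greedy decoding: same input -> same output, every time.
        if temperature == 0.0:
            return max(logits, key=logits.get)
        rng = random.Random(seed)                      # fixed seed -> reproducible "randomness"
        scaled = {tok: l / temperature for tok, l in logits.items()}
        m = max(scaled.values())
        weights = {tok: math.exp(l - m) for tok, l in scaled.items()}  # numerically stable softmax
        total = sum(weights.values())
        return rng.choices(list(weights), weights=[w / total for w in weights.values()], k=1)[0]

    toy_logits = {"hi": 5.0, "Hi!": 3.5, "Hey!": 3.0}
    print(sample_next_token(toy_logits, temperature=0.0))            # always "hi"
    print(sample_next_token(toy_logits, temperature=1.0, seed=42))   # occasionally a variant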
only in English, and also non-deterministic.
Yeah, wherever possible I try to have the llm answer me in Python rather than English (especially when explaining new concepts)
English is soooooo ambiguous
For what it's worth, I've been using it to help me learn math, and I added to my rules an instruction that it should always give me an example in Python (preferably sympy) whenever possible.
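For example, the kind of sympy-backed answer I mean looks roughly like this (assuming sympy is installed): the code is the unambiguous version of the English explanation, and you can run it to check it.

    import sympy as sp

    x = sp.symbols('x')

    # d/dx sin(x)^2 = 2*sin(x)*cos(x), which is the same thing as sin(2x)
    derivative = sp.diff(sp.sin(x)**2, x)
    print(derivative)                                    # 2*sin(x)*cos(x)
    print(sp.simplify(derivative - sp.sin(2*x)))         # 0 -> the identity holds symbolically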
The quite good blog post mentioned by Karpathy for working with LLMs when building software:
- https://blog.nilenso.com/blog/2025/05/29/ai-assisted-coding/
See also:
- https://news.ycombinator.com/item?id=44242051
[flagged]
Brutal counter take: If AI tooling makes you so much better, then you started very low. In contrast, if you are already insanely productive in creative ways others can hardly achieve then chances are, AI tools don't make much of a difference.
As someone who is starting very low — I very much agree. I'm basically a hobbyist who can navigate around Python code, and LLMs have been a godsend to me, they increased my hobby output tenfold. But as soon as I get into coding something I'm more familiar with, the LLMs usefulness plummets, because it's easier and faster to directly write code than to "translate" from English to code using an LLM (maybe only apart from using basically a smarter one-line tab completion)
Meanwhile, I asked Claude 4 this morning to write a simple EXIF normalizer. After two rounds of prompting it to double-check its code, I still had to point out that it makes no sense to load the entire image for re-orienting if the EXIF orientation is fine in the first place.
Vibe vs reality, and anyone actually working in the space daily can attest how brittle these systems are.
Maybe this changes in SWE with more automated tests in verifiable simulators, but the real world is far too complex to simulate in its vastness.
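For reference, the check I had to point out is roughly this (a sketch using Pillow; Image.open is lazy, so reading the EXIF orientation tag doesn't decode the pixel data):

    from PIL import Image, ImageOps

    ORIENTATION_TAG = 0x0112  # standard EXIF "Orientation" tag

    def normalize_orientation(path: str, out_path: str) -> None:
        with Image.open(path) as img:                    # lazy: header/EXIF only at this point
            orientation = img.getexif().get(ORIENTATION_TAG, 1)
            if orientation == 1:                         # already upright: skip decoding entirely
                return
            fixed = ImageOps.exif_transpose(img)         # decodes and rotates/flips only when needed
            fixed.save(out_path)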
> Meanwhile
What do you mean "meanwhile", that's exactly (among other things) the kind of stuff he's talking about? The various frictions and how you need to approach it
> anyone actually working in the space
Is this trying to say that Karpathy doesn't "actually work" with LLMs or in the ML space?
I feel like your whole comment is just reacting to the title of the YouTube video, rather than actually thinking and reflecting on the content itself.
I'm pretty sure "actually work" part refers to SWE space rather than LLM/ML space
Seems to me that this is just another level of throwing compute at the problem.
The same way programs were way more efficient before, and now they are "bloated" with packages, abstractions, slow implementations of algos and scaffolding.
The concept of what is good software development might be changing as well.
LLMs might not write the best code, but they sure can write a lot of it.
A manager in our company introduced Gemini as a chatbot coupled to our documentation.
> It failed to write out our company name. The rest was riddled with hallucinations too, hardly worth mentioning.
I wish this were rage bait aimed at others, but what should my feelings be? After all, this is the tool that's sold to me and that I am expected to work with.
We had exactly the opposite experience. CoPilot was able to answer questions accurately and reformatted the existing documentation to fit the context of users' questions, which made the information much easier to understand.
Code examples, which we offer as sort of reference implementations, were also adapted to fit the specific questions without much issue. Granted these aren't whole applications, but 10 - 25 line examples of doing API setup / calls.
We didn't, of course, just send users' questions directly to CoPilot. Instead there's a bit of prompt magic behind the scenes that tweaks the context so that CoPilot can produce better quality results.
The real question is how long it'll take until they're not brittle
Or will they ever be reliable. Your question is already making an assumption.
They're reliable already if you change the way you approach them. These probabilistic token generators probably never will be "reliable" if you expect them to 100% always output exactly what you had in mind, without iterating in user-space (the prompts).
I also think they might never become reliable.
There is a bar below which they are reliable.
"Write a Python script that adds three numbers together".
Is that bar going up? I think it probably is, although not as fast/far as some believe. I also think that "unreliable" can still be "useful".
But what does that mean? If you tell the LLM "Say just 'hi' without any extra words or explanations", do you not get "hi" back from it?
Sometimes I get "Hi!", sometimes "Hey!".
Which model? I just tried a bunch: ChatGPT, OpenAI's API, Claude, Anthropic's API and DeepSeek's API with both chat and reasoner, and every single one replied with a single "hi".
That's literally the wrong way to use LLMs though.
LLMs think in tokens: the fewer they emit, the dumber they are, so asking them to be concise, or to give the answer before the explanation, is extremely counterproductive.
I was trying to make a point regarding "reliability", not a point about how to prompt or how to use them for work.
This is relevant. Your example may be simple enough, but for anything more complex, letting the model have its space to think/compute is critical to reliability - if you starve it for compute, you'll get more errors/hallucinations.
Yeah I mean I agree with you, but I'm still not sure how it's relevant. I'd also urge people to have unit tests they treat as production code, and proper system prompts, and X and Y, but it's really beyond the original point of "LLMs aren't reliable" which is the context in this sub-tree.
It's perfectly reliable for the things you know it to be, such as operations within its context window size.
Don't ask LLMs to "Write me Microsoft Excel".
Instead, ask it to "Write a directory tree view for the Open File dialog box in Excel".
Break your projects down into the smallest chunks you can for the LLMs. The more specific you are, the more reliable it's going to be.
The rest of this year is going to be companies figuring out how to break down large tasks into smaller tasks for LLM consumption.
I remember when people were saying here on HN that AIs will never be able to generate a picture of hands with just 5 fingers because they just "don't have common sense".
∞
“Treat it like a junior developer” … 5 years later … “Treat it like a junior developer”
Usable LLMs are 3 years old at this point. ChatGPT, not Github Copilot, is the marker.
while True:
On the other hand, posts like this are like watching someone typing Ask Jeeves-style search queries into Google 20 years ago and then gesturing at how Google sucks, while everyone else in the room has figured out how to be productive with it and cringes at his "boomer" queries.
If you're still struggling to make LLMs useful for you by now, you should probably ask someone. Don't let other noobs on HN +1'ing you hold you back.
https://theeducationist.info/everything-amazing-nobody-happy...
AI Snake Oil: https://press.princeton.edu/books/hardcover/9780691249131/ai...
There are also those instances where Microsoft unleashed Copilot on the .NET repo, and it resulted in the most hilariously terrible PRs that required the maintainers to basically tell Copilot every single step it should take to fix the issue. They were basically writing the PRs themselves at that point, except doing it through an intermediary that was much dumber, slower and less practical than them.
And don't get me started on my own experiences with these things, and no, I'm not a luddite, I've tried my damndest and have followed all the cutting-edge advice you see posted on HN and elsewhere.
Time and time again, the reality of these tools falls flat on its face while people like Andrej hype things up as if we're 5 minutes away from having Claude become Skynet or whatever, or as he puts it, before we enter the world of "Software 3.0" (coincidentally totally unrelated to Web 3.0 and the grift we had to endure there, I'm sure).
To intercept the common arguments,
- no I'm not saying LLMs are useless or have no usecases
- yes there's a possibility if you extrapolate by current trends (https://xkcd.com/605/) that they indeed will be Skynet
- yes I've tried the latest and greatest model released 7 minutes ago to the best of my ability
- yes I've tried giving it prompts so detailed a literal infant could follow along and accomplish the task
- yes I've fiddled with providing it more/less context
- yes I've tried keeping it to a single chat rather than multiple chats, as well as vice versa
- yes I've tried Claude Code, Gemini Pro 2.5 With Deep Research, Roocode, Cursor, Junie, etc.
- yes I've tried having 50 different "agents" running and only choosing the best output from the lot.
I'm sure there's a new gotcha being written up as we speak, probably something along the lines of "Well for me it doubled my productivity!" and that's great, I'm genuinely happy for you if that's the case, but for me and my team who have been trying diligently to use these tools for anything that wasn't a microscopic toy project, it has fallen apart time and time again.
The idea of an application UI or god forbid an entire fucking Operating System being run via these bullshit generators is just laughable to me, it's like I'm living on a different planet.
To add to this, I ran into a lot of issues too, and similar when using Cursor... until I started creating a mega list of rules for it to follow that gets attached to the prompts. Then outputs improved (but fell off after the context window got too large). At that stage I used a prompt to summarize, then continued with a new context.
You're not the first, nor the last person, to have a seemingly vastly different experience than me and others.
So I'm curious, what am I doing differently from what you did/do when you try them out?
This is maybe a bit out there, but would you be up for sending me something like a screen recording of exactly what you're doing? Or maybe even a video call sharing your screen? I'm not working in the space and have no products or services to sell; I'm only curious why this gap seemingly exists between you and me, and my only motive would be to understand whether I'm the one who is missing something, or whether there are more effective ways to help people understand how they can use LLMs and what they can use them for.
My email is on my profile if you're up for it. Invitation open for others in the same boat as parent too.
I'm a greybeard, 45+ years coding, including being active in AI during the mid '80s, and I've used it when it applied throughout my entire career. That career being media and animation production backends, where the work is at both the technical and creative edge.
I currently have an AI integrated office suite, which has attorneys, professional writers, and political activists using the system. It is office software, word processing, spreadsheets, project management and about two dozen types of AI agents that act as virtual co-workers.
No, my users are not programmers, but I do have interns; college students with anything from 3 to 10 years experience writing software.
I see the same AI use issues with my users and my interns. My office system bends over backwards to address this, but people are people: they do not realize that the AI does not know what they are talking about. They will frequently ask questions with no preamble, no introduction to the subject. They will change topics without bothering to start a new session or tell the AI the topic is now different. There is a huge number of things they do, often with escalating frustration evident in their prompts, that all come down to the same basic issue: the LLM was not given the context to understand the subject at hand, and the user, like many people, keeps explaining past the point of confusion, adding new confusion on top.
I see this over and over. It frustrates the users to the point of anger, yet if they communicated with a human in the same manner, they'd have a verbal fight almost instantly.
The problem is one of communication. ...and for a huge number of you, I just lost you. You've not been taught to understand the power of communication, so you do not respect the subject. How to communicate is practically everything when it comes to human collaboration. It is how one orders one's mind, how one collaborates with others, AND how one gets AI to respond in the manner one desires.
But our current software development industry, and by extension all of STEM, has been shortchanged by never being taught how to communicate effectively, no, not at all. Presentations and how to sell are not effective communication; that's persuasion, about 5% of what it takes to convey understanding in others, which is what then unblocks resistance to change.
So AI is simultaneously going to take over everyone's job and do literally everything, including being used as application UI somehow... But you have to talk to it like a moody teenager at their first job lest you get nothing but garbage? I have to put just as much (and usually, more) effort talking to this non-deterministic black box as I would to an intern who joined a week ago to get anything usable out of it?
Yeah, I'd rather just type things out myself, and continue communicating with my fellow humans rather than expending my limited time on this earth appeasing a bullshit generator that's apparently going to make us all jobless Soon™
But parent explicitly mentioned:
> - yes I've tried giving it prompts so detailed a literal infant could follow along and accomplish the task
Which you are saying that might have missed in the end regardless?
I'd like to see the prompt. I suspect that "literal infant" is expected to be a software developer without preamble. The initial sentence to an LLM carries far more relevance, it sets the context stage to understand what follows. If there is no introduction to the subject at hand, the response will be just like anyone fed a wall of words: confusion as to what all this is about.
You and me both :) But I always try to read the comments here with the most charitable interpretation I can come up with.
I've got a working theory that models perform differently when used in different timezones... as in, during US working hours they don't work as well due to high load. When used at 'off-peak' hours, not only are they (obviously) snappier, but the outputs appear to be of a higher standard. I've thought this for a while, but I'm now noticing it with Claude 4 [thinking] recently. Textbook case of anecdata, of course.
Interesting thought, if nothing else. Unless I misunderstand, it would be easy to run a study to see if this is true: use the API to send the same but slightly different prompt (to avoid the caches) which has a definite answer, then run that once per hour for a week and see whether the accuracy oscillates.
Yes good idea - although it appears we would also have to account for the possibility of providers nerfing their models. I've read others also think models are being quantized after a while to cut costs.
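If anyone wants to actually run that study, a minimal probe script could look roughly like this (assuming the official openai Python client and an API key in the environment; the model name and probe question are arbitrary choices, and you'd want many probes per hour rather than one):

    import csv
    import time
    from openai import OpenAI

    client = OpenAI()
    PROBE = "What is 17 * 23? Reply with only the number."
    EXPECTED = "391"

    with open("hourly_accuracy.csv", "a", newline="") as f:
        writer = csv.writer(f)
        for hour in range(24 * 7):                           # one probe per hour for a week
            resp = client.chat.completions.create(
                model="gpt-4o-mini",                         # hypothetical choice; keep it fixed
                messages=[{"role": "user", "content": f"(run {hour}) {PROBE}"}],  # vary wording to dodge caches
                temperature=0,
            )
            answer = resp.choices[0].message.content.strip()
            writer.writerow([time.time(), hour, answer, answer == EXPECTED])
            f.flush()
            time.sleep(3600)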
Unironically, your comment mirrors my opinion as of last month.
Since then I gave it another try last week and was quite literally mind-blown by how much it has improved in the context of vibe coding (Claude Code). It improved so much that I thought "I would like to try that on my production codebase" (mostly because I want it to fail, because that's my job ffs), but alas - that's not allowed at my day job.
From the limited experience I could gather over the last week as a software dev with over 10 yrs of experience (along with another 5-10 doing it as a hobby before employment) I can say that I expect our industry to get absolutely destroyed within the next 5 yrs.
The skill ceiling for devs is going to get mostly squashed for 90% of devs, this will inevitably destroy our collective bargaining positions. Including for the last 10%, because the competition around these positions will be even more fierce.
It's already starting, even if it's currently very misguided and mostly down to short-sightedness.
But considering the trajectory and looking at how naive current llms coding tools are... Once the industry adjusts and better tooling is pioneered... it's gonna get brutal.
And most certainly not limited to software engineering. Pretty much all desk jobs will get hemorrhaged as soon as an LLM player basically replaces SAP with entirely new tooling.
Frankly, I expect this to go bad, very very quickly. But I'm still hoping for a good ending.
I think part of the problem is that code quality is somewhat subjective and developers are of different skill levels.
If you're fine with things that kinda work okay and you're not the best developer yourself, then you probably think coding agents work really, really well, because the slop they produce isn't that much worse than your own. In fact, I know a mid-level dev who believes agent AIs write better code than he does.
If you're very critical of code quality then it's much tougher... This is even more true in complex codebases where simply following some existing pattern to add a new feature isn't going to cut it.
The degree to which it helps any individual developer will vary, and perhaps it's not that useful for yourself. For me over the last few months the tech has got to the point where I use it and trust it to write a fair percentage of my code. Unit tests are an example where I find it does a really good job.
> If you're very critical of code quality then it's much tougher
I'm not sure. Among developers I know who are sloppy and produce shit code, some are having no luck with LLMs and some are having lots of luck with them.
On the other side, those who really think about the design/architecture and are very strict (which is the group I'd probably put myself into, but who wouldn't?) are split in a similar way.
I don't have any concrete proof, but I'm guessing "expectations + workflow" differences would explain the vast difference in perception of usefulness.
A few days ago, I was introduced to the idea that when you're vibe coding, you're consulting a "genie": much like in the fables, you almost never get exactly what you asked for, but if your wishes are small, you might just get what you want.
The Primeagen reviewed this article [1] a few days ago, and (I think) that's where I heard about it. (Can't re-watch it now, it's members only) 8(
[1] https://medium.com/@drewwww/the-gambler-and-the-genie-08491d...
“You are an expert 10x software developer. Make me a billion dollar app.” Yeah this checks out
that's a really good analogy! It feels like a wicked joke that LLMs behave in such a way that they're both intelligent and stupid at the same time.
This was my favorite talk at AISUS because it was so full of concrete insights I hadn't heard before and (even better) practical points about what to build now, in the immediate future. (To mention just one example: the "autonomy slider".)
If it were up to me, which it very much is not, I would try to optimize the next AISUS for more of this. I felt like I was getting smarter as the talk went on.
His dismissal of smaller and local models suggests he underestimates their improvement potential. Give phi4 a run and see what I mean.
> suggests a lack of understanding of these smaller models capabilities
If anything, you're showing a lack of understanding of what he was talking about. The context is this specific point in time, where we're early in an ecosystem and things are expensive and likely centralized (a la mainframes), but if his analogy/prediction is correct, we'll have a "Linux" moment in the future where that equation changes (again) and local models are competitive.
And while I'm a huge fan of local models and run them for maybe 60-70% of what I do with LLMs, they're nowhere near the proprietary ones today, sadly. I want them to be, really badly, but it's important to be realistic here and recognize the difference between what a normal consumer can run and what the current mainframes can run.
He understands the technical part, of course; I was referring to his prediction that large models will always be necessary.
There is a point where an LLM is good enough for most tasks (I don't need a megamind AI in order to greet clients), and both large and small/medium-sized models are getting there, with the large models hitting a computing/energy demand barrier. The small models won't hit that barrier anytime soon.
Did he predict they'd always be necessary? He mostly seemed to predict the opposite, that we're at the early stage of a trajectory that has yet to have its Linux moment.
I edited to make it clearer
You can disagree with his conclusions but I don't think his understanding of small models is up for debate. This is the person who created micrograd/makemore/nanoGPT and who has produced a ton of educational materials showing how to build small and local models.
I'm going to edit; it was badly formulated. What I meant is that he underestimates their potential for growth.
> underestimates their potential for growth
As far as I understood the talk and the analogies, he's saying that local models will eventually replace the current popular "mainframe" architecture. How is that underestimating them?
Of all the things you could suggest, a lack of understanding is not one that can be pinned on Karpathy. He does know his technical stuff.
We all have blind spots
Sure, but maybe suggesting that the person who literally spent countless hours educating others on how to build small models locally from scratch, is lacking knowledge about local small models is going a bit beyond "people have blind spots".
Their potential, not how they work; it was very badly formulated, I just corrected it.
He ain't dismissing them. Comparing local/"open" models to Linux (and closed services to Windows and macOS) is high praise. It's also accurate.
This is a bad comparison
I tried the local small models. They are slow, much less capable, and ironically much more expensive to run than the frontier cloud models.
Phi4-mini runs on a basic laptop CPU at 20T/s… how is that slow? Without optimization…
I was running Qwen3-32B locally even faster, 70T/s, still way too slow for me. I'm generating thousands of tokens of output per request (not coding), running locally I could get 6 mil tokens per day and pay electricity, or I can get more tokens per day from Google Gemini 2.5 Flash for free.
Running models locally is a privilege for the rich and those with too much disposable time.
There were some cool ideas- I particularly liked "psychology of AI"
Overall though I really feel like he is selling the idea that we are going to have to pay large corporations to be able to write code. Which is... terrifying.
Also, as a lazy developer who is always trying to make AI do my job for me, it still kind of sucks, and its not clear that it will make my life easier any time soon.
He says that now we are in the mainframe phase. We will hit the personal computing phase hopefully soon. He says llama (and DeepSeek?) are like Linux in a way, OpenAI and Claude are like Windows and MacOS.
So, No, he’s actually saying it may be everywhere for cheap soon.
I find the talk to be refreshingly intellectually honest and unbiased. Like the opposite of a cringey LinkedIn post on AI.
Being Linux is not a good thing imo. It took decades for tech like Proton to run Windows games reliably, if not better than Windows does now. Software is still mostly developed for Windows and macOS. Not to mention the Linux desktop that never took off; I mean, one could mention Android, but there is a large corporation behind it. Sure, Linux is successful in many ways, it's embedded everywhere, but it's nowhere near being the OS of everyday people; the "traditional Linux desktop" never took off.
I think it used to be like that before the GNU people made gcc, completely destroying the market of compilers.
> Also, as a lazy developer who is always trying to make AI do my job for me, it still kind of sucks, and its not clear that it will make my life easier any time soon.
Every time I have to write a couple of simple, self-contained functions I try… and it gets them completely wrong.
It's easier to just write it myself rather than to iterate 50 times and hope it will work, considering iterations are also very slow.
At least proprietary compilers were software you owned and could be airgapped from any network. You didn't create software by tediously negotiating with compilers running on remote machines controlled by a tech corp that can undercut you on whatever you are trying to build (but of course they will not, it says so in the Agreement, and other tales of the fantastic).
On a tangent, I find the analogies interesting as well. However, while Karpathy is an expert in Computer Science, NLP and machine vision, his understanding of how human psychology and the brain work is about as good as yours and mine (non-experts). So I take some of those comparisons as a lay person's feelings about the subject. Still, they are fun to listen to.
Is it possible to vibe code NFT smart contracts with Software 3.0?
In the era of AI and illiteracy...
Painful to watch. The new tech generation deserves better than hyped presentations from tech evangelists.
This reminds me of the Three Amigos and Grady Booch evangelizing the future of software while ignoring the terrible output from Rational Software and the Unified Process.
At least we got acknowledgment that self-driving remains unsolved: https://youtu.be/LCEmiRjPEtQ?t=1622
And Waymo still requires extensive human intervention. Given Tesla's robotaxi timeline, this should crash their stock valuation...but likely won't.
You can't discuss "vibe coding" without addressing security implications of the produced artifacts, or the fact that you're building on potentially stolen code, books, and copyrighted training data.
And what exactly is Software 3.0? It was mentioned early then lost in discussions about making content "easier for agents."
I find Karpathy's focus on tightening the feedback loop between LLMs and humans interesting, because I've found I am the happiest when I extend the loop instead.
When I have tried to "pair program" with an LLM, I have found it incredibly tedious, and not that useful. The insights it gives me are not that great if I'm optimising for response speed, and it just frustrates me rather than letting me go faster. Worse, often my brain just turns off while waiting for the LLM to respond.
OTOH, when I work in a more async fashion, it feels freeing to just pass a problem to the AI. Then, I can stop thinking about it and work on something else. Later, I can come back to find the AI results, and I can proceed to adjust the prompt and re-generate, to slightly modify what the LLM produced, or sometimes to just accept its changes verbatim. I really like this process.
I would venture that 'tightening the feedback loop' isn't necessarily 'increasing the number of back and forth prompts'- and what you're saying you want is ultimately his argument. i.e. if integral enough it can almost guess what you're going to say next...
I specifically do not want AI as an auto-correct, doing auto-predictions while I am typing. I find this interrupts my thinking process, and I've never been bottlenecked by typing speed anyway.
I want AI as a "co-worker" providing an alternative perspective or implementing my specific instructions, and potentially filling in gaps I didn't think about in my prompt.
Yeah I am currently enjoying giving the LLM relatively small chunks of code to write and then asking it to write accompanying tests. While I focus on testing the product myself. I then don't even bother to read the code it's written most of the time
It's fascinating to see his gears grinding at 22:55 when acknowledging that a human still has to review the thousand lines of LLM-generated code for bugs and security issues if they're "actually trying to get work done". Yet these are the tools that are supposed to make us hyperproductive? This is "Software 3.0"? Give me a break.
Plus, coding is the fun bit; reviewing code is the hard and not-fun bit, and arguing with an overconfident machine sounds like it'll be even worse than that. Thankfully I'm going to retire soon.
Thank you YC for posting this before the talk became deprecated[1]
1: https://x.com/karpathy/status/1935077692258558443
We couldn't let that happen!
The slide at 13m claims that LLMs flip the script on technology diffusion and give power to the people. Nothing could be further from the truth.
Large corporations, which have become governments in all but name, are the only ones with the capability to create ML models of any real value. They're the only ones with access to vast amounts of information and resources to train the models. They introduce biases into the models, whether deliberately or not, that reinforces their own agenda. This means that the models will either avoid or promote certain topics. It doesn't take a genius to imagine what will happen when the advertising industry inevitably extends its reach into AI companies, if it hasn't already.
Even open weights models which technically users can self-host are opaque blobs of data that only large companies can create, and have the same biases. Even most truly open source models are useless since no individual has access to the same large datasets that corporations use for training.
So, no, LLMs are the same as any other technology, and actually make governments and corporations even more powerful than anything that came before. The users benefit tangentially, if at all, but will mostly be exploited as usual. Though it's unsurprising that someone deeply embedded in the AI industry would claim otherwise.
Well there are cases like OLMo where the process, dataset, and model are all open source. As expected though, it doesn't really compare well to the worst closed model since the dataset can't contain vast amounts of stolen copyrighted data that noticeably improves the model. Llama is not good because Meta knows what they're doing, it's good because it was pretrained on the entirety of Anna's Archive and every pirated ebook they could get their hands on. Same goes for Elevenlabs and pirated audiobooks.
Lack of compute on Ai2's side also means the context OLMo is trained for is minuscule, the other thing you need to throw bazillions of dollars at to make a model that's maybe useful in the end, if you're very lucky. Training needs high GPU interconnect bandwidth; it can't be done in a distributed horde in any meaningful way, even if people wanted to.
The only ones who have the power now are the Chinese, since they can easily ignore copyright for datasets, patents for compute, and have infinite state funding.
I hope this excellent talk brings some much needed sense into the discourse around vibe coding.
If anything, I wish the conversation turned away from "vibe coding", which was essentially coined as a "lol look at this go" thing, but which media and corporations somehow picked up as "this is the new workflow all developers are adopting".
LLMs as another tool in your toolbox? Sure, use it where it makes sense, don't try to make them do 100% of everything.
LLMs as a "English to E2E product I'm charging for"? Lets maybe make sure the thing works well as a tool before letting it be responsible for stuff.
He sounds like Terrence Howard with his nonsense.
Can't believe they wanted to postpone this video by a few weeks
You can generate 1.0 programs with 3.0 programs. But can you generate 2.0 programs the same way?
2.0 programs (model weights) are created by running 1.0 programs (training runs).
I don't think it's currently possible to ask a model to generate the weights for a model.
But you can generate synthetic data using a 3.0 program to train a smaller, faster, cheaper-to-run 2.0 program.
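A toy sketch of that distillation idea: an expensive 3.0 program labels data, and a cheap, classic 2.0-style model is trained on it (the two examples here stand in for the thousands you'd actually generate):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # In practice these come from prompting a large model, e.g.
    # "write short product reviews and label each one positive or negative".
    generated_examples = [
        ("Arrived quickly and works perfectly, would buy again.", "positive"),
        ("Stopped charging after two days, complete waste of money.", "negative"),
    ]

    texts, labels = zip(*generated_examples)
    small_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    small_model.fit(texts, labels)                # the cheap-to-run "2.0" artifact you deploy

    print(small_model.predict(["Battery died within a week."]))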
I'd like to hear from Linux kernel developers. There is no significant software that has been written (plagiarized) by "AI". Why not ask the actual experts who deliver instead of talk?
This whole thing is a religion.
> There is no significant software that has been written (plagiarized) by "AI".
How do you know?
As you haven't evidenced your claim, you could start by providing explicit examples of what is significant.
Even if you are correct, the amount of llm-assisted code is increasing all the time, and we are still only a couple of years in - give it time.
> Why not ask the actual experts
Many would regard Karpathy in the expert category I think?
I think you should not turn things around here. Up to 2021 we had a vibrant software environment that obviously had zero "AI" input. It has made companies and some developers filthy rich.
Since "AI" became a religion, it is used as an excuse for layoffs while no serious software is written by "AI". The "AI" people are making the claims. Since they invading a functioning software environment, it is their responsibility to back up their claims.
Still wondering what your definition of "serious software" is. I kinda concur - I consider most of the webshit to be not serious, but then, this is where the software industry makes the bulk of its profits, and that space is absolutely being eaten by agentic coding, right now, today.
So if we s/serious/money-making/, you are wrong - or at least about to be proven wrong, as these things enter prod and get talked about.
The AI people are the ones making the extraordinary claims here.
What counts as "significant software"? Only kernels I guess?
Office software, CAD systems, Web Browsers, the list is long.
Microsoft (famously developing somewhat popular office-like software) seems to be going in the direction of almost forcing developers to use LLMs to assist with coding, at least going by what people are willing to admit publicly and seeing some GitHub activity.
Google (made a small browser or something) also develops their own models, I don't think it's far fetched to imagine there is at least one developer on the Chrome/Chromium team that is trying to dogfood that stuff.
As for Autodesk, I have no idea what they're up to, but corporate IT seems hellbent on killing themselves, not sure Autodesk would do anything differently so they're probably also trying to jam LLMs down their employees throats.
Microsoft is also selling "AI", so they want headlines like "30% of our code is written by AI". So they force open source developers to babysit the tools and suffer.
It's also an advertisement for potential "AI" military applications that they undoubtedly propose after the HoloLens failure:
https://www.theverge.com/2022/10/13/23402195/microsoft-us-ar...
The HoloLens failure is a great example of overhyped technology, just like the bunker busters that are now in the headlines for overpromising.
Can you point to any significant open source software that has any kind of significant AI contributions?
As an actual open source developer I'm not seeing anything. I am getting bogus pull requests full of AI slop that are causing problems though.
> Can you point to any significant open source software that has any kind of significant AI contributions?
No, but I haven't looked. Can you?
As an actual open source developer too, I do get some value from replacing search engine usage with LLMs that can do the searching and collation for me; as long as they provide references I can use for diving deeper, they certainly accelerate my own workflow. But I don't do "vibe-coding" or use any LLM-connected editors, just my own written software, which is mostly various CLIs and chat-like UIs.
Vibe coding is building LEGO furniture; getting it to run in the cloud is assembling the IKEA table for a busy restaurant.
I was trying to do some reverse engineering with Claude using an MCP server I wrote for a game trainer program that supports Python scripts. The context window gets filled up _so_ fast. I think my server is returning too many addresses (hex) when Claude searches for values in memory, but it’s annoying. These things are so flaky.
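One way to tame that, sketched below with made-up names (nothing from the commenter's actual server): cap how many addresses a single tool call returns and hand back a cursor, so the model only pulls more results when it actually needs them.

    def scan_results_page(addresses: list[int], cursor: int = 0, page_size: int = 50) -> dict:
        # Return at most page_size matches per tool call, plus a cursor for the rest,
        # so a huge memory scan doesn't land in the context window all at once.
        page = addresses[cursor:cursor + page_size]
        next_cursor = cursor + page_size if cursor + page_size < len(addresses) else None
        return {
            "addresses": [hex(a) for a in page],
            "total_matches": len(addresses),
            "next_cursor": next_cursor,
        }

    # e.g. scan_results_page(all_matches)            -> first 50 matches
    #      scan_results_page(all_matches, cursor=50) -> the next 50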
The beginning was painful to watch as is the cheering in this comment section.
The 1.0, 2.0, and 3.0 labels simply aren't making sense. They imply a kind of succession and replacement, and demonstrate a lack of understanding of how programming works. It sounds as marketing-oriented as "Web 3.0", born inside an echo chamber. And yet, halfway through, the need for determinism/validation is being reinvented.
The analogies make use of cherry picked properties, which could apply to anything.
The whole AI scene is starting to feel a lot like the cryptocurrency bubble before it burst. Don’t get me wrong, there’s real value in the field, but the hype, the influencers, and the flashy “parlor tricks” are starting to drown out meaningful ML research (like Apple's critical research that actually improves AI robustness). It’s frustrating to see solid work being sidelined or even mocked in favor of vibe-coding.
Meanwhile, this morning I asked Claude 4 to write a simple EXIF normalizer. After two rounds of prompting it to double-check its code, I still had to point out that it makes no sense to load the entire image for re-orienting if the EXIF orientation is already fine.
Vibe vs reality, and anyone actually working in the space daily can attest how brittle these systems are.
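The point being made is a small one. A minimal sketch with Pillow (Pillow is an assumption; the commenter didn't say which library was in play): check the EXIF orientation tag on the lazily opened image and only decode and rotate when it actually differs from the default.

    from PIL import Image, ImageOps

    def normalize_orientation(src_path: str, dst_path: str) -> None:
        img = Image.open(src_path)                  # lazy open: header/EXIF only, no full decode yet
        orientation = img.getexif().get(0x0112, 1)  # 0x0112 is the EXIF Orientation tag
        if orientation == 1:
            return                                  # already upright, don't touch the pixels
        ImageOps.exif_transpose(img).save(dst_path) # decode, rotate and re-save only when needed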
> "Because they all have slight pros and cons, and you may want to program some functionality in 1.0 or 2.0, or 3.0, or you're going to train in LLM, or you're going to just run from LLM"
He doesn't say they will fully replace each other (or have fully replaced each other, since his definition of 2.0 is quite old by now).
I think Andrej is trying to elevate the conversation in an interesting way.
That in and of itself makes it worth it.
No one has a crystal clear view of what is happening, but at least he is bringing a novel and interesting perspective to the field.
The version numbers mean abrupt changes.
Analogy: how we "moved" from using Google to ChatGPT is an abrupt change, and we still use Google.
llms.txt makes a lot of sense, especially for LLMs to interact with http APIs autonomously.
Seems like you could set an LLM loose and, like the Googlebot, have it start converting all HTML pages into llms.txt. Man, the future is crazy.
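A crude sketch of that kind of bot (example.com URL only; a real pipeline would probably have an LLM summarize rather than just dump text): fetch a page, strip the chrome, keep the prose.

    import requests
    from bs4 import BeautifulSoup

    def page_to_text(url: str) -> str:
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "nav", "header", "footer"]):
            tag.decompose()                          # drop non-content chrome
        lines = (line.strip() for line in soup.get_text().splitlines())
        return "\n".join(line for line in lines if line)

    print(page_to_text("https://example.com"))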
Couldn’t believe my eyes. The www is truly bankrupt. If anyone has a browser plugin which automatically redirects to llms.txt sign me up.
Website too confusing for humans? Add more design, modals, newsletter pop ups, cookie banners, ads, …
Website too confusing for LLMs? Add an accessible, clean, ad-free, concise, high entropy, plain text summary of your website. Make sure to hide it from the humans!
PS: it should be /.well-known/llms.txt but that feels futile at this point..
PPS: I enjoyed the talk, thanks.
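For reference, such a summary under the llms.txt proposal looks roughly like this, as I understand the spec (the site, sections and URLs are made up):

    # Example Project

    > One-paragraph summary of what this site/product is and who it is for.

    ## Docs

    - [Quick start](https://example.com/docs/quickstart.md): install and first run
    - [API reference](https://example.com/docs/api.md): endpoints and parameters

    ## Optional

    - [Changelog](https://example.com/changelog.md): release history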
> If anyone has a browser plugin which automatically redirects to llms.txt sign me up.
Not a browser plugin, but you can prefix URLs with `pure.md/` to get the pure markdown of that page. It's not quite a 1:1 to llms.txt as it doesn't explain the entire domain, but works well for one-off pages. [disclaimer: I'm the maintainer]
The next version of the llms.txt proposal will allow an llms.txt file to be added at any level of a path, which isn't compatible with /.well-known.
(I'm the creator of the llms.txt proposal.)
Even with this future approach, it still can live under the `/.well-known`, think of `/.well-known/llm/<mirrored path>` or `/.well-known/llm.json` with key/value mappings.
[flagged]
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
https://news.ycombinator.com/newsguidelines.html
Fair
The web started dying with mobile social media apps, in which hyperlinks are a poor UX choice. Then it died a bit more when SEO practices started punishing outlinks. Now this. The web of interconnected pages that was the World Wide Web is dead. Not on social media? No one sees you. Run a website? More bots than humans. Unless you sell something on the side, a website isn't profitable. Hyperlinking to other websites is dead.
Gen Alpha doesn't know what a web page is, and if they do, it's for stuff like Neocities, i.e. as a curiosity or art form only, not as a source of information anymore. I don't blame them. Apps (social media apps) have less friction than websites but a higher barrier for people to create. We are going back to pre-World Wide Web days in a way, kind of like dial-up Bulletin Board Systems without hyperlinking, and centralized (social media). Some countries, mostly ones with few technical people like those in Central America, have moved away from the web almost entirely and into social media like Instagram.
Due to the death of the web, Google Search and friends now rely mostly on matching queries with titles, so just like before the internet you have to know people to learn new stuff, or wait for an algorithm to show it to you, or for someone to mention it online, or enroll in a university. Maybe that's why search results have declined and people search using ChatGPT or maybe Perplexity instead. Scholarly search engines are a bit better but frankly irrelevant for most people.
Now I understand why Google established their own DNS server at 8.8.8.8. If you run the resolver everyone queries, you can see which domains exist and are being visited, so you can still index sites without hyperlinks between them, even if the web dies. They saw it coming.
If you have different representations of the same thing (llms.txt / HTML), how do you know it is actually equivalent to each other? I am wondering if there are scenarios where webpage publishers would be interested in gaming this.
<link rel="alternate" /> is a standards-friendly way to semantically represent the same content in a different format
That's not what llms.txt is. You can just use a regular markdown URL or similar for that.
llms.txt is a description for an LLM of how to find the information on your site needed for an LLM to use your product or service effectively.
I love the "people spirits" analogy. For casual tasks like vibecoding or boiling an egg, LLM errors aren't a big deal. But for critical work, we need rigorous checks—just like we do with human reasoning. That's the core of empirical science: we expect fallibility, so we verify. A great example is how early migration theories based on pottery were revised with better data like ancient DNA (see David Reich). Letting LLMs judge each other without solid external checks misses the point—leaderboard-style human rankings are often just as flawed.
I think that Andrej presents “Software 3.0” as a revolution, but in essence it is a natural evolution of abstractions.
Abstractions don't eliminate the need to understand the underlying layers - they just hide them until something goes wrong.
Software 3.0 is a step forward in convenience. It is not a replacement for developers with a solid foundation, but a tool for acceleration, amplification and scaling.
If you know what is under the hood, you are irreplaceable. If you do not, you become dependent on a tool that you do not always understand.
Love his analogies and clear eyed picture
"We're not building Iron Man robots. We're building Iron Man suits"
Funny thing is that in more than one of the Iron Man movies the suits end up being bad robots. Even the AI Iron Man made shows up to ruin the day in an Avengers movie. So it's a little on the nose that they'd try to pitch it this way.
That’s reading too much into it. It’s just an obvious plot twist to justify making another movie, nothing else.
[flagged]
I'm old enough to remember when Twitter was new, and for a moment it felt like the old utopian promise of the Internet finally fulfilled: ordinary people would be able to talk, one-on-one and unmediated, with other ordinary people across the world, and in the process we'd find out that we're all more similar than different and mainly want the same things out of life, leading to a new era of peace and empathy.
It was a nice feeling while it lasted.
I believe the opposite happened. People found out that there are huge groups of people with wildly differing views on morality from them and that just encouraged more hate. I genuinely think old school facebook where people only interacted with their own private friend circles is better.
Broadcast networks like Twitter only make sense for influencers, celebrities and people building a brand. They're a net negative for literally anyone else.
> old school facebook where people only interacted with their own private friend circles is better.
100% agree but crazy that option doesn't exist anymore.
Believe it or not, humans did in fact have forms of written language and communication prior to twitter.
Can you please make your substantive points without snark? We're trying for something a bit different here.
https://news.ycombinator.com/newsguidelines.html
You missed the point, but that's fine, it happens.
His claim that governments don't use AI, or are behind the curve, is not accurate.
Modern military drones are very much AI agents
Why does vibe coding still involve any code at all? Why can't an AI directly control the registers of a computer processor and graphics card, controlling a computer directly? Why can't it draw on the screen directly, connected to the rows and columns of an LCD panel? What if an AI agent was implemented in hardware, with a processor for AI, a normal computer processor for logic, and a processor that correlates UI elements to touches on the screen? Plus a network card, some RAM for temporary stuff like UI elements, and some persistent storage for vectors that represent UI elements and past conversations.
I'm not sure this makes sense as a question. Registers are 'controlled' by running code for a given state. An AI can write code that changes registers, as all code does in operation. An AI can't directly 'control registers' in any other way, just as you or I can't.
I would like to make an AI agent that directly interfaces with a processor by setting bits in a processor register, thus eliminating the need for even assembly code or any kind of code. The only software you would ever need would be the AI.
That's called a JIT compiler. And ignoring how bad an idea blending those two would be... it wouldn't be that difficult a task.
The hardest part of a JIT is the safety aspect, and AI already violates most of that.
The safety part will probably be either solved, a non-issue, or ignored, similarly to how GPT-3 was often seen as dangerous before ChatGPT was released. Some people who have only ever vibe coded are finding jobs today, ignoring safety entirely and lacking any notion of what it means; they just copy-paste output from ChatGPT or an agentic IDE. To me that's already JIT with extra steps. Or companies have pivoted their software engineers to vibe coding most of the time, so they barely touch code anymore: JIT with extra steps again.
As "jit" to you means running code, and not "building and executing machine code", maybe you could vibe code this. And enjoy the segfaults.
In a way he's making sense. If the "code" is the prompt, the output of the LLM is an intermediate artifact, like the intermediate steps of gcc.
So why should we still need gcc?
The answer, of course, is that we need it because the LLM's output is shit 90% of the time and debugging assembly or binary directly is even harder; so, putting aside the difficulties of training such a model, the output would be unusable.
Probably too much snark from me. But the gulf between interpreter and compiler can be decades of work, often discovering new mathematical principles along the way.
The idea that you're fine with risking everything, in the way agentic things allow [0], and that you want that messing around with raw memory, is... a return to DOS-era crashes, but with HAL along for the ride.
[0] https://msrc.microsoft.com/update-guide/vulnerability/CVE-20...
Ah, don't worry, LLMs are a return to crashes as it is :)
The other day one managed to produce code that made Python segfault.
What he means is: why aren't the tokens directly machine-code tokens?
Because any precise description of what the computer is supposed to do is already code as we know it. AI can fill the gap between natural language and programming by guessing, and because you don't always care about the "how", only about the "what". The more you care about the "how", the more precise your language has to become to reduce the AI's guesswork, to the point that your input to the AI is already code.
The question is: how much do we really care about the "how", even when we think we care about it? Modern programming languages don't do guesswork, but they already abstract away quite a lot of the "how".
I believe that's the original argument in favor of coding in assembler and that it will stay relevant.
Following this argument, what AI is really missing is determinism, to a large extent. I can't just save the input I gave to an AI and be sure that it will produce the exact same output a year from now.
With vibe coding, I am under the impression that the only thing that matters for vibe coders is whether the output is good enough in the moment to fulfill a desire. For companies going AI-first, that's how it seems to be done. I see people in other places who have lost interest in the "how".
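The closest approximation today is pinning everything you can and archiving the request next to the output; with the OpenAI API a seed is only best-effort, so this still isn't the determinism of a saved program (the model name and prompt below are placeholders):

    import json
    from openai import OpenAI

    client = OpenAI()
    request = {
        "model": "gpt-4o-mini",                 # placeholder model
        "messages": [{"role": "user", "content": "Write a simple EXIF normalizer."}],
        "temperature": 0,
        "seed": 1234,                           # best-effort reproducibility only
    }
    response = client.chat.completions.create(**request)

    # Keep the exact request and the output together so the run can at least be audited later.
    with open("run_record.json", "w") as f:
        json.dump({"request": request,
                   "output": response.choices[0].message.content}, f, indent=2)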
Nice try, AI.
All you need is a framebuffer and AI.
Should we not treat LLMs more as a UX feature for interacting with a domain-specific model (highly contextual), rather than expecting LLMs to provide the intelligence needed for software to act as a partner to humans?
He's selling something.
I guess Karpathy won't ever become a multi-millionaire/billionaire, seeing as he's now at the stage of presenting TEDx-like thingies.
That also probably shows that he's out of the loop when it comes to the present work being done in "AI", because had he been there, he wouldn't have had time for this kind of fluffy presentation.
[flagged]
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
"Don't be snarky."
https://news.ycombinator.com/newsguidelines.html
Wait ... but this is true.
Maybe I missed a source but I assumed it was somehow common knowledge.
https://en.m.wikipedia.org/wiki/List_of_Tesla_Autopilot_cras...
[flagged]
It's an interesting presentation, no doubt. The analogies eventually fail as analogies usually do.
A recurring theme presented, however, is that LLMs are somehow not controlled by the corporations which expose them as a service. The presenter made certain to identify three interested actors (governments, corporations, "regular people") and how LLM offerings are not controlled by governments. This is a bit disingenuous.
Also, the OS analogy doesn't make sense to me. Perhaps this is because I do not subscribe to LLMs having reasoning capabilities, nor being able to reliably provide services an OS-like system can be shown to provide.
A minor critique regarding the analogy equating LLMs to mainframes:
Hang in there! Your comment makes some really good points about the limits of analogies and the real control corporations have over LLMs.
Plus, your historical corrections were spot on. Sometimes, good criticisms just get lost in the noise online. Don't let it get to you!
> The presenter made certain to identify three interested actors (governments, corporations, "regular people") and how LLM offerings are not controlled by governments. This is a bit disingenuous.
I don't think that's what he said, he was identifying the first customers and uses.
>> A recurring theme presented, however, is that LLMs are somehow not controlled by the corporations which expose them as a service. The presenter made certain to identify three interested actors (governments, corporations, "regular people") and how LLM offerings are not controlled by governments. This is a bit disingenuous.
> I don't think that's what he said, he was identifying the first customers and uses.
The portion of the presentation I am referencing starts at or near 12:50[0]. Here is what was said:
Note the identification of historic government interest in computing, along with a flippant "regular person" scenario, in the context of "technology diffusion." You are right in that the presenter identified "first customers", but this is mentioned in passing when viewed in context. Perhaps I should not have characterized this as "a recurring theme." Instead, a better categorization might be:
[0] https://youtu.be/LCEmiRjPEtQ?t=770
Yeah, that's explicitly about first customers and first uses, not about who controls it.
I don't see how it minimizes the control corporations have to note this. Especially since he's quite clear about how everything is currently centralized / time share model, and obviously hopeful we can enter an era that's more analogous to the PC era, even explicitly telling the audience maybe some of them will work on making that happen.
Well that showed up significantly faster than they said it would.
The team adapted quickly, which is a good sign. I believe getting the videos out sooner (as in why-not-immediately) is going to be a priority in the future.
Classic under promise and over deliver.
I'm glad they got it out quickly.
Me too. It was my favorite talk of the ones I saw.
Can we please stop standardizing on putting things in the root?
/.well-known/ exists for this purpose.
example.com/.well-known/llms.txt
https://en.m.wikipedia.org/wiki/Well-known_URI
You can't just put things there any time you want - the RFC requires that they go through a registration process.
Having said that, this won't work for llms.txt, since in the next version of the proposal they'll be allowed at any level of the path, not only the root.
> You can't just put things there any time you want - the RFC requires that they go through a registration process.
Actually, I can, for two reasons. First, the RFC of course mentions that items can be registered after the fact, if it's found that a particular well-known suffix is being widely used. But the second is a bit more chaotic: website owners are under no obligation to consult a registry, much like with port registrations; in many cases they won't even know it exists and may think of it as a place that should reflect their mental model.
It can make things awkward and difficult though, that is true, but that comes with the free-text nature of the well-known space. That's made evident in the GitHub issue linked: a large group of very smart people didn't know that there was a registry for it.
https://github.com/AnswerDotAI/llms-txt/issues/2#issuecommen...
There was no "large group of very smart people" behind llms.txt. It was just me. And I'm very familiar with the registry, and it doesn't work for this particular case IMO (although other folks are welcome to register it if they feel otherwise, of course).
I put stuff in /.well-known/ all the time whenever I want. They’re my servers.
> You can't just put things there any time you want - the RFC requires that they go through a registration process.
Excuse me???
From the RFC:
""" A well-known URI is a URI [RFC3986] whose path component begins with the characters "/.well-known/", and whose scheme is "HTTP", "HTTPS", or another scheme that has explicitly been specified to use well- known URIs.
Applications that wish to mint new well-known URIs MUST register them, following the procedures in Section 5.1. """
https://github.com/AnswerDotAI/llms-txt/issues/2