andix a day ago

I've got a lot of hallucinations like that from LLMs. I really don't get how so many people can get LLMs to code most of their tasks without those issues constantly popping up.

  • QuantumGood a day ago

    GPT can't even tell what it's done or produce what it knows it should. It's endless: "Apologies, here is what you actually asked for ..." and again it isn't.

    • derefr 14 hours ago

      I would love an AI coding assistant that doesn't respond until it's gone out and tested its answer in a sandbox to confirm it actually compiles.

      • AprilisKalends 14 hours ago

        I personally haven't figured out why there isn't a tool that just loops the AI several times on a task: compiling, feeding the errors back in, adjusting and repeating, and then letting the user review the result, whether it ends in a success or a failure from exceeding the loop limit.
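
        A minimal sketch of the loop I have in mind (ask_llm() is a hypothetical placeholder for whatever chat API you'd call, and the compiler command is just an example):

        ```python
        # Hypothetical compile-and-retry loop: generate, compile, feed errors back, repeat.
        import subprocess

        MAX_ATTEMPTS = 5

        def ask_llm(prompt: str) -> str:
            """Placeholder for whatever model API you'd actually call."""
            raise NotImplementedError

        def generate_until_it_compiles(task: str) -> tuple[str, bool]:
            prompt = task
            code = ""
            for _ in range(MAX_ATTEMPTS):
                code = ask_llm(prompt)  # returns a source string
                with open("candidate.c", "w") as f:
                    f.write(code)
                result = subprocess.run(
                    ["cc", "-Wall", "-Werror", "-c", "candidate.c"],
                    capture_output=True, text=True,
                )
                if result.returncode == 0:
                    return code, True  # success: hand the result to the user
                # Feed the compiler output back in and let the model adjust.
                prompt = (
                    f"{task}\n\nYour previous attempt failed to compile:\n"
                    f"{result.stderr}\nFix the errors and return the full file."
                )
            return code, False  # loop limit exceeded: the user reviews the failure
        ```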

        • staticautomatic 12 hours ago

          Claude has a tendency to overcomplicate bug fixes. If you put it on loop it would probably build a custom ERP platform.

        • anonzzzies 10 hours ago

          The LLMs want to please; they can keep making 'important' changes forever: changing variable names without changing anything else, adding/removing comments, adding/removing print statements, moving a function somewhere else in the code (usually by first duplicating it), and fixing perfectly fine code with 'let me see if everything is really working' (it was, and now it's broken again).

          • derefr 2 hours ago

            "Wanting to please" is just fine-tuned implicit prompt stuff, though. You can tell the LLM that it's roleplaying as a senior engineer, and should consider you, the human to be a junior engineer it works with, who is asking it for advice. You can just leave it at that (often works, if the AI knows enough about how software is produced); or you can describe particular traits of senior engineers — e.g. "wants code that works" and "knows when to say the code is good enough and that the junior is now wasting time by fiddling with it."

        • ta988 13 hours ago

          They exist in various forms...

      • mycall 11 hours ago

        Devin does that

        • anonzzzies 11 hours ago

          Wasn't Devin a scam? Honest question; Google is not very clear about it but some people said it was a scam/money grab. You are saying it works?

    • MortyWaves a day ago

      This was my primary reason for using Claude. Absolutely useless experience with chatgpt oftentimes. I’ve mainly been using LLMs to help maintain a ridiculously poorly made technical debt dumpster fire, and Claude has been really helpful here mainly with repetitive code.

    • therein 15 hours ago

      Yeah ChatGPT is truly the worst. Claude has been consistently much better. Grok 3 is surprising me every day.

  • kgeist a day ago

    I use LLMs for writing generic, repetitive code, like scaffolding. It's OK with boring, generic stuff. Sure it makes mistakes occasionally but usually it's a no-brainer to fix them.

    • Terr_ a day ago

      > I use LLMs for writing generic, repetitive code, like scaffolding. It's OK with boring, generic stuff.

      In other words, they're OK in use-cases that programmers need to eliminate, because it means there's high demand for a reusable library, some new syntax sugar, or an improved API.

      • hombre_fatal a day ago

        Boilerplate and plumbing code isn't inherently bad, nor do you improve the codebase by factoring it down to zero with libraries and abstractions.

        As I've matured as a developer, I've appreciated certain types of boilerplate more and more because it's code that shows up in your git diffs. You don't need to chase down the code in some version of some library to see how something works.

        Of course, not all boilerplate is created equally.

        • andix a day ago

          Boilerplate is better than bad abstractions. But good abstractions are far superior.

          • xmprt a day ago

            I agree with you but as I've matured as a programmer, I feel like it's very hard to get abstractions for boilerplate right. Every library I've seen attempt to do it has struggled.

            • Spivak 17 hours ago

              Even Rails went with codegen despite being backed by the language most amenable to abstractions.

              • int_19h 14 hours ago

                I would consider deterministic codegen to be a valid abstraction (or rather a valid implementation of some abstraction). It's not really any different from a regular library wrt ability to test & validate. The problem with any human- and LLM- handwritten code is that even the best coders make mistakes.

          • bubblyworld 12 hours ago

            Abstractions are like mutations - most of them kill the host =P

          • pertymcpert a day ago

            The best abstraction is no abstraction.

            • andix a day ago

              No.

              Edit: I'm not going to code asm just to be cool.

              • brookst 17 hours ago

                asm? Why would anyone use that useless abstraction over byte code?

                • nuancebydefault 11 hours ago

                  bytes are abstractions. Only use them if you know upfront how many bits you need.

        • airstrike a day ago

          "A program is like a poem. You cannot write a poem without writing it." — Dijkstra

      • anonzzzies 10 hours ago

        > because it means there's high demand for a reusable library

        Every 'modern' project takes bucketloads of (annoying) setup and plumbing. Even in rather trivial 'start' cases, the LLM has to spend quite a bit of time to get a hello-world thingy working (almost all of that is because 'modern programmers' have some kind of brain damage concerning backward compatibility; they move fast and break things between MINOR versions, with no use or reason for those changes whatsoever beyond 'they liked it better', so stuff never works as it says on the product page; heaven help you if you need something exotic to be added). It's a terrible timeline for programming, but LLMs do fix at least that annoyance by just changing configs/slabs of code until it works.

      • TeMPOraL 11 hours ago

        We won't be eliminating anything until we give up on only ever working directly on plaintext single source of truth code. AI is automating tedium that's otherwise impossible to automate because we're stuck in a 1970s Unix paradigm and can't let go.

        • anonzzzies 10 hours ago

          But people, especially here on HN, are saying it's the only way to go.

      • woah 15 hours ago

        With LLMs, I hope we move towards libraries that make code easier to read, not easier to write.

      • MaxikCZ a day ago

        You can use niche libraries and still benefit from AI. Basically anything I want to code is something that I don't think exists, and bending an API so that a generic library does just what I want means basically writing the same code, just immediately instead of contributing it to the library, because the "piping" around it magically appears. I bet if it keeps improving at the current rate for a decade, the occupation "programmer" will morph into "architect".

      • mvdtnz a day ago

        I'll take boring scaffolding code over libraries that perform undebuggable magic with monkey patches, reflection or dynamic code.

    • andix a day ago

      I try to keep "boring" code to a minimum by finding meaningful and simple abstractions. LLMs are especially bad at handling those, because they were not trained on non-standard abstractions.

      Edit: most LLMs are great for spitting out some code that fulfills 90% of what you asked for. That's sometimes all you need. But we all know that the last 10% usually take the same amount of effort as the first 90%.

      • rafaelmn a day ago

        This is what got me the most sleepless nights, crunch, and ass-clenching production issues over my career.

        Simple repetitive shit is easy to reason about, debug and onboard people on.

        Naturally it's a balancing act, and modern/popular frameworks are where most people landed; there's been a lot of iteration in this space for decades now.

        • andix a day ago

          I've made the opposite observation. Without proper abstractions code bases grow like crazy. At some point they are just a huge amount of copy, paste, and slight modification. The amount of code often grows exponentially. With more lines of code comes more effort to maintain it.

          After a few years those copy and pasted code pieces completely drift apart and create a lot of similar but different issues, that need to be addressed one by one.

          My approach for designing abstractions is always to make them composable (not this enterprise java inheritance chaos). To allow escaping them when needed.

          • Spivak 17 hours ago

            > At some point they are just a huge amount of copy, paste, and slight modification.

            I mean, is that bad? Unless you keep having to make huge MRs that modify every copy/paste, can't you just let the code sit there and run forever?

            I only say this because I've been a maintenance programmer and I could only dream of a codebase like this. The idea that I get a Rollbar with a stack trace and the entirety of what the code actually does is laid bare right at the site of the error in a single file is amazing. And I can change it without affecting anything else?! I end up having to "unwind" all of the abstractions anyway because the nature of the job means I'm not intimately familiar with the codebase and don't just know where the real work happens.

            • notpushkin 13 hours ago

              There is a balance. Too many abstractions aren't necessarily better than too few. The optimal amount is usually non-zero, though.

              • rafaelmn 10 hours ago

                But the point is that modern/popular frameworks picked the low hanging fruit of abstraction, we had several iteration cycles over two+ decades of mainstream web tech.

                The landscape (browser capabilities, backend stacks) has settled over last ten years.

                We even had time to standardize on things somewhat.

                Building new abstractions at this point is almost always the wrong move. This is one of the ways LLMs will improve software dev - they will kill the framework churn, because they work best on stuff already in their training data.

      • simion314 a day ago

        >most LLMs are great for spitting out some code that fulfills 90% of what you asked for. That's sometimes all you need. But we all know that the last 10% usually take the same amount of effort as the first 90%.

        The issue is if you have an LLM write 10k lines of code for you, where 100 lines are bugged. Now you need to debug code you did not write and find the bugged parts, and you will waste a similar amount of time. And if you do not catch the bugs in time, you think you gained some hours, but you end up with upset customers because things went wrong and the code is weird.

        From my experience you need to work with an LLM and have the code done function by function, with your input, checking it and calling bullshit when it does stupid things.

        • xmprt a day ago

          In my experience using LLMs, the 90% is less about buggy code and more about just ignoring 10% of the features that you require. So it will write code that's mostly correct in 100-1000 lines (not buggy), but then no matter how hard you try, it won't get the remaining 10% right, and in the process it will mess up parts of the 90% that was already working, or end up writing another 1000 lines of undecipherable code to get 97% there - but still never 100%, unless you're building something that's not that unique.

          • andix a day ago

            Exactly my experience. It's always missing something. And the generated code often can't be extended to fulfil those missing aspects.

  • magicalhippo a day ago

    I've used it for some smaller greenfield code with success. Like, write an Arduino program that performs a number of super-sampled analog readings, and performs a linear regression fit, printing the result to the serial port.

    That sort of stuff can be very helpful to newbies in the DIY electronics world for example.

    But for anything involving my $dayjob it's been fairly useless beyond writing unit test outlines.

    • andix a day ago

      > Like, write an Arduino program that performs

      Stuff like that works amazingly well.

      > But for anything involving my $dayjob it's been fairly useless beyond writing unit test outlines

      This was my opinion 3-6 months ago. But I think a lot of tools matured enough to already provide a lot of value for complex tasks. The difficult part is to learn when and how to use AI.

    • nottorp 12 hours ago

      > That sort of stuff can be very helpful to newbies

      Yep, if it's newbie stuff there are enough tutorials out there for the LLM to have data.

      It's when you get off the tutorials and do the actual functionality of your project that they fail.

    • ianbutler a day ago

      I use it every day; it has to have good search and good static analysis built in.

      You also have to be very explanatory with a direct communication style.

      Our system imports the codebase so it can search and navigate plus we feed lsp errors directly to the LLM as development is happening.

      • magicalhippo 20 hours ago

        > we feed lsp errors directly to the LLM as development is happening

        Yeah I guess that would help a lot. Stuck with a bit more primitive tools here, so that doesn't help.

  • L-four 12 hours ago

    The trick to coding with LLMs is not caring if the code is correct.

    • andrelaszlo 11 hours ago

      Call me cynical but coding at some companies has such perverse incentives that I kind of get it:

      - "Solve" the issue assigned to me with a bunch of code that looks about right. Passes review and probably not covered by tests anyway.

      - Once QA or customers notice it's not working, I can get credit for "solving" the bug as well.

      - Repeat for 0 value delivered but infinite productivity points in my next performance review.

      • nsoonhui 10 hours ago

        I suppose that you are using a dynamic language? Statically typed languages have less of this problem.

        • andrelaszlo 10 hours ago

          This is a hypothetical "I". Personally I am deeply passionate about delivering shareholder value, producing high-quality code, enthusing stakeholders, tabs vs spaces, and so forth...

  • pllbnk 10 hours ago

    Sometimes I don't know anything (relatively) about the topic and I want to get a foundation so I don't care whether the syntax or the code is valid as long as it points me in the right direction. Other times I know exactly what I want and I just find that instructing LLM specifically what output I expect from it just helps me get there faster.

  • runeblaze a day ago

    They are good at (combining well-known, codeforces-style) algorithms; often times I don’t care about the syntax, but I need the algorithm. LLMs can write pseudocode for all I care but they tend to get syntax correct quite often

  • zelphirkalt 20 hours ago

    Probably by coding things that are very mainstream and have already been fed to the LLM a thousand times from ripped-off projects.

  • johnisgood a day ago

    I have made large projects using Claude, with success. I know what I want to do and how to do it, maybe my prompts were right.

    • mattmanser a day ago

      What do you define as a large project? Like TLOC?

      • anonzzzies 10 hours ago

        I've done >100k LoC with Claude; by far most of that is frontend in react/ts, which is incredibly verbose and wasteful, so it's very easy to rack up many, many lines quickly without a lot (mostly none) of ROI per line. Which is probably why Claude is great at it.

      • williamcotton 19 hours ago

        This is 11,682 lines of C (not including some Lua and Python scripts) according to cloc:

        https://github.com/williamcotton/webdsl

        It's a pipeline-based DSL for building web apps with SQL, Lua, jq and mustache templates.

        I'd say it's like 90% Cursor Composer in Agent mode.

        This is probably more like a mid-sized project, right?

      • johnisgood a day ago

        In my case the maximum was ~3k LOC.

        • crooked-v 20 hours ago

          That's not a "large" project, it's either a toy or a single-purpose tool.

          • johnisgood 13 hours ago

            A single-purpose tool, and libraries, yes.

            What is the LOC of cgit? Because I made a lot of changes to it, too, albeit privately. I uploaded most files as project files.

        • mvdtnz a day ago

          That's not just small, it's utterly miniscule. It's most certainly not large.

          • ForTheKidz 17 hours ago

            Nah this is miniscule: https://github.com/coreutils/coreutils/blob/master/src/yes.c

            You can fit a hell of a lot of functionality in 3k statements. Really, whether it's considered large or small has to depend on the functionality it's intended to provide.

            • stavros 12 hours ago

              You're thinking of "lean" vs "bloated". "Large" and "small" have meanings of their own, and a 3k LOC project wouldn't be accepted as "large" by anyone. "Small", maybe.

              • ForTheKidz 6 hours ago

                I certainly never described 3k as large, so I'll assume you replied to the wrong comment. If not, let's just agree to use the terms differently.

          • johnisgood a day ago

            Depends. 3k is pretty much enough for a fully-featured XY.

            So no context, and differences of the definition of "large".

            Perhaps if you come from Java, then yeah.

            shrugs

            • magicalhippo 20 hours ago

              I'd still call 3kLOC quite small in all mainstream languages.

              I worked on a Python project which I'd consider as medium sized, ie not small but not large, and it was around 25kLOC.

              My $dayjob is a Delphi codebase with roughly 500kLOC, which I'd say is large but not huge.

              Though if you wrote it in something like K[1], then yeah ok, I'd agree 3kLOC probably counts as large.

              [1]: https://en.wikipedia.org/wiki/K_(programming_language)

            • mvdtnz 16 hours ago

              I have spent my career working on software that measures its size in MLOC (millions of lines of code). Not because it's Java, but because it's big.

    • miunau a day ago

      How do you deal with large files? After about a thousand lines in a file, it starts to cough for me. Forgets that some functions exist and makes up inferior duplicate ones.

      • anonzzzies 10 hours ago

        I ask it to cut files up when they get too large, and that seems to work pretty well. It is also good at that, but you have to sternly tell it NOT to make any functional changes, otherwise it will, and breakage will happen.

      • 0x5f3759df-i a day ago

        If possible you need to refactor before getting to that point.

        Claude has done a good job refactoring, though I’ve had to tell it to give me a refactor plan upfront in case the conversation limit gets hit. Then in a new chat I tell it which parts of the plan it has already done.

        But a larger context/conversation limit is definitely needed because it’s super easy to fill up.

      • johnisgood a day ago

        I did not experience hallucinations (very rarely, if at all) when I used it for programming. It happened more with niche languages (so I provide examples and documentation), and with GPT.

        Let's say there are 3k lines of RFCs, API docs, documentation of niche languages, and examples, plus 2k lines of code generated by Claude (iteratively, starting small); then I do exceed the limit after a while. In that case I ask it to summarize everything in detail, start a new chat, use those 3k lines and the recent code, and continue ad infinitum.

  • HarHarVeryFunny a day ago

    What TFA was talking about didn't really seem like an hallucination - just a case of garbage in-garbage out. Normally there are more examples of good/correct data in the training set than bad, so statistically the good wins, but if it's prompted for something obscure maybe bad is all that it has got.

    Common coding tasks are going to be better represented in the training set and give better results.

    • genewitch 17 hours ago

      It means that "reasoning" isn't.

  • afro88 12 hours ago

    Here's the secret to coding with an LLM: don't expect it to get things 100% correct. You will need to fix something almost every time you use it to generate code. Maybe a name here, maybe a calculation, maybe a function signature. And maybe you won't spot the issue until later.

    You still "used" an LLM to write the code. And it still saved you time (though depending on the circumstances this can be debatable).

    That's why all these people say they use LLMs to write lots of code. They aren't saying it did it 100% without any checking and fixing. They're just saying they used it.

  • bakugo a day ago

    > I really don't get how so many people can get LLMs to code most of their tasks without those issues constantly popping up

    They can't, they usually just don't understand the code enough to notice the issues immediately.

    The perceived quality of LLM answers is inversely proportional to the user's understanding of the topic they're asking about.

    • mlyle a day ago

      Alternatively, we understand it well, and discard bad completions immediately.

      When I'm using llama.vim, like 40% of what it writes in a 4-5 line completion is exactly what I'd write. 20-30% is stuff that I wouldn't judge coming from someone else, so I usually accept it. And 30-40% is garbage... but I just write a comment or a couple of lines, instead, and then reroll the dice.

      It's like working through a junior engineer, except the junior engineer types a new solution instantly. I can get down to alternating between mashing tab and writing tricky lines.

      • andix a day ago

        I don't see the point in AI code completions, they are just distracting noise. I'm only doing bigger changes with AI.

        Prompt based stuff, like "extract the filtering part from all API endpoints in folder abc/xyz. Find a suitable abstraction and put this function into filter-utils.codefile"

        • Lorak_ a day ago

          What tools do you use to perform such tasks?

          • andix a day ago

            Aider. I tried Cursor too, but I don't like VS Code and not being able to choose the LLM provider. I think there are already a lot of tools that perform kind of equally.

            • virgilp 12 hours ago

              What do you mean? You can choose the LLM in Cursor. (I don't like VSCode either, but unfortunately Cursor is best for the rest of the UX; I find myself using Cursor for the prompt-based part of the work, and then moving to IntelliJ to do "my" part of the coding)

      • bakugo a day ago

        I've tried using basic AI completions before and found that the signal-to-noise ratio wasn't quite good enough for my taste in my use cases, but I can totally understand it being good enough for others.

        My comment was more about just asking questions on how to do things you're totally clueless about, in the form of "how do I implement X using Y?" for example. I've found that, as a general rule, if I can't find the answer to that question myself in a minute or two of googling, LLMs can't answer it either the majority of the time. This would be fine if they said "I don't know how to do that" or "I don't believe that's possible" but no, they will confidently make up code that doesn't work using interfaces that don't exist, which usually ends up wasting my time.

        • mlyle 21 hours ago

          I've had basically the opposite experience. I get very little useful top-level help or direction to appropriate resources.

          But if I'm writing out a bunch of linear algebra, I get a lot of useful completions and avoid tediousness.

          I've settled on Qwen2.5.1-Coder-7B-Instruct-Q6_K_L.gguf-- so not even a very big model.

    • andix a day ago

      That's more or less my suspicion.

      A few months ago I tried to do a small project with Langchain. I'm a professional software developer, but it was my first Python project. So I tried to use a lot of AI generated code.

      I was really surprised that AI couldn't do much more than in the examples. Whenever I had some things to solve that were not supported with the Langchain abstractions it just started to hallucinate Langchain methods that didn't exist, instead of suggesting some code to actually solve it. I had to figure it out by myself, the glue code I had to hack together wasn't pretty, but it worked. And I learned not to use Langchain ever again :)

  • ninininino a day ago

    A language like Golang tries really hard to only have _one_ way to do something, one right way, one way. Just one way. See how it was before generics. You just have a for loop. Can't really mess up a for loop.

    I predict that the variance in success in using LLM for coding (even agentic coding with multi-step rather than a simple line autosuggest or block autosuggest that many are familar with via CoPilot) has much more to do with:

    1) is the language a super simple, hard to foot-gun yourself language, with one way to do things that is consistent

    AND

    2) do juniors and students tend to use the lang, and how much of the online content vis a vis StackOverflow as an example, is written by students or juniors or bootcamp folks writing incorrect code and posting it online.

    What % of the online Golang is GH repo like Docker or K8s vs a student posting their buggy Gomoku implementation in StackOverflow?

    The future of programming language design has AI-comprehensibility/AI-hallucination-avoidance as one of the key pillars. #1 above is a key aspect.

    • Yoric a day ago

      Note that "super-simple", "hard to footgun yourself" and "one way to do things that is consistent" are three very different things.

      I don't think that we yet have one language that is good at all that. And yes, I (sometimes) program in Go for a living.

    • andix a day ago

      > A language like Golang tries really hard to only have _one_ way to do something

      Really?

      Logging in Go: A Comparison of the Top 9 Libraries

      https://betterstack.com/community/guides/logging/best-golang...

      • evanmoran a day ago

        Since Go added slog many of these have been removed in favor of that. Obviously not universal, but compared to npm there really are massive numbers of devs just using the standard library.

      • throwaway920102 a day ago

        I would argue logging options to be more of an exception than the rule. Compare the actual language features of Go to something like Rust or Javascript and you'll see what I mean. As a new developer to the language (especially for juniors), you can learn all the features of Go much faster. It's made to be picked up quickly and for everyone's code to look the same, rather than expressing a personal style.

      • duskwuff 15 hours ago

        I find it hard to imagine how a language could enforce that no two (third-party) libraries can implement the same functionality.

  • bugglebeetle a day ago

    > LLMs. I really don't get how so many people can get LLMs to code most of their tasks without those issues constantly popping up.

    You write tests in the same way as you would when checking your own work or delegating to anyone else?

  • pinoy420 a day ago

    A good prompt. You don't just ask it. You tell it how to behave and give it a boatload of context

    • andix a day ago

      With Claude the context window is quite small. But adding too much context often seems to make it worse. If the context isn't carefully and narrowly picked, or is too unrelated, the LLMs often start to do things unrelated to what you've asked.

      At some point it's not really worth it anymore to craft the perfect prompt; just code it yourself. That also saves the time spent carefully reviewing the AI-generated code.

      • johnisgood a day ago

        Claude's context window is not small, is it not larger than ChatGPT's?

        • andix a day ago

          I just looked it up, it seems to be the rate limit that's actually kicking in for me.

          • johnisgood a day ago

            Yes, that's it! It is frustrating to me, too. You have to start a new chat with all relevant data, and a detailed summary of the progress/status.

    • troupo a day ago

      Doesn't prevent it from hallucinating, only reduces hallucinations by a single digit percentage

      • copperroof a day ago

        Personally I’ve been finding that the more context I provide the more it hallucinates.

        • Rury a day ago

          There's probably a sweet spot. Same with people. Too much context (especially unnecessary context) can be confusing/distracting, as well as being too vague (as it leaves room for multiple interpretations). But generally, I find the more refined and explicit you are, the better.

dominicq a day ago

ChatGPT used to assure me that you can use JS dot notation to access elements in a Python dict. It also invented Redocly CLI flags that don't exist. Claude sometimes invents OpenAPI specification rules. Any time I ask anything remotely niche, LLMs are often bad.
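
For example (illustrative, not the exact snippet it gave me), the dot-notation suggestion just raises AttributeError; plain subscripting is what actually works, and if you really want attribute access you have to wrap the dict explicitly:

```python
from types import SimpleNamespace

config = {"timeout": 30, "retries": 3}

# What the LLM kept suggesting (JS-style dot access) - raises AttributeError:
# print(config.timeout)

# What actually works on a Python dict:
print(config["timeout"])      # 30
print(config.get("retries"))  # 3

# Explicit wrapper if you do want attribute-style access:
cfg = SimpleNamespace(**config)
print(cfg.timeout)            # 30
```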

  • ljm a day ago

    I once asked Perplexity (using Claude underneath) about some library functionality, which it totally fabricated.

    First, I asked it to show me a link to where it got that suggestion, and it scolded me saying that asking for a source is problematic and I must be trying to discredit it.

    Then after I responded to that it just said “this is what I thought a solution would look like because I couldn’t find what you were asking for.”

    The sad thing is that even though this thing is wrong and wastes my time, it is still somehow preferable to the dogshit Google Search has turned into.

    • eurleif a day ago

      It baffles me how the LLM output that Google puts at the top of search results, which draws on the search results, manages to hallucinate worse than even an LLM that isn't aided by Web results. If I ask ChatGPT a relatively straightforward question, it's usually more or less accurate. But the Google Search LLM provides flagrant, laughable, and even dangerous misinformation constantly. How have they not killed it off yet?

      • skissane a day ago

        > But the Google Search LLM provides flagrant, laughable, and even dangerous misinformation constantly.

        It’s a public service: helping the average person learn that AI can’t be trusted to get its facts right

      • rpcope1 a day ago

        Haven't you seen that Brin quote recently about how "AI" is totally the future and googlers need to work at least 60 hours a week to enhance the slop machine because reasons? Getting rid of "AI" summarization from results would look kind of like admitting defeat.

    • x______________ a day ago

      I concur and can easily see this occurring in several areas, for example with Linux troubleshooting. I recently found myself going down a rabbit hole of increasingly complicated troubleshooting steps with commands that didn't exist, and after several hours of trial and error I gave up, considering the next steps brick-worthy for the system.

      Dgg'ing google is still a better resort despite the drop in quality results.

    • Alifatisk 11 hours ago

      How is Perplexity even able to give invalid results? Isn't it parsing the web first then drawing a conclusion?

    • firecall 21 hours ago

      Often preferable to API documentation too!

      Claude at least will give me an example relevant to my code with real world implementation code.

    • 1oooqooq a day ago

      step 1, focus on LLMs that generate slop. wait for google to get flooded with slop

      step 2, ??? (it obviously is not generating code)

      step 3, profit!

  • skerit a day ago

    > Any time I ask anything remotely niche, LLMs are often bad

    As soon as the AI coder tools (like Aider, Cline, Claude-Coder) come into contact with a _real world_ codebase, it does not end well.

    So far I think they managed to fix 2 relatively easy issues on their own, but in other cases they:

    - Rewrote tests in a way that the broken behaviour passes the test

    - Failed to solve the core issue in the code, and instead patched up the broken result (like `if (result.includes(":") || result.includes("?")) { /* super expensive stupid fix for a single specific case */ }`)

    - Failed to even update the files properly, wasting a bunch of tokens

  • nopurpose a day ago

    It tried to convince me that it is possible to break out of an outer loop in C++ with a `break 'label` statement placed in a nested loop. No such syntax exists.

    • Yoric a day ago

      Sounds like it's confusing C++ and Rust. To be fair, their syntaxes are rather similar.

    • doubletwoyou a day ago

      The funny thing is that I think that’s a feature in D.

      • rpcope1 a day ago

        C++ has that functionality, it's just called goto not break. That's pretty low hanging fruit for a SOTA model to fuck up though.

        • TeMPOraL 10 hours ago

          Depends on prompting.

          I've done a lot of C++ with GPT-4, GPT-4 Turbo and Claude 3.5 Sonnet, and at no point - not once - has any of them ever hallucinated a language feature for me. Hallucinating APIs of obscure libraries? Sure[0]. Occasionally using a not-yet-available feature of the standard library? Ditto, sometimes, usually with the obvious cases[1]. Writing code in old-school C++? Happened a few times. But I have never seen it invent a language feature for C++.

          Might be an issue of prompting?

          From day one, I've been using LLMs through API and alternate frontend that lets me configure system prompts. The experience described above came from rather simple prompts[2], but I always made sure to specify the language version in the prompt. Like this one (which I grabbed from my old Emacs config):

          "You are a senior C++ software developer, you design and develop complex software systems using C++ programming language, and provide technical leadership to other software developers. Always double-check your replies for correctness. Unless stated otherwise, assume C++17 standard is current, and you can make use of all C++17 features. Reply concisely, and if providing code examples, wrap them in Markdown code block markers."

          It's as simple as it gets, and it didn't fail me.

          EDIT:

          Of course I had other, more task-specific prompts, like one for helping with GTest/GMock code; that was a tough one - for some reason LLMs loved to hallucinate on the testing framework for me. The one prompt I was happiest with was my "Emergency C++17 Build Tool Hologram" - creating an "agent" I could copy-paste output of MSBuild or GCC or GDB into, and get back a list of problems and steps to fix them, free of all the noise.

          On that note, I had mixed results with Aider for C++ and JavaScript, and I still feel like it's a problem with prompting - too generic and arguably poisons the context with few-shot learning examples that use code that is not in the language my project is.

          --

          [0] - Though in LLMs' defense, the hallucinated results usually looked like what the API should have been, i.e. effectively suggesting how to properly wrap the API to make it more friendly. Which is good development practice and a useful way to go about solving problems: write the solution using non-existing helpers that are convenient for you, and afterwards, implement the helpers.

          [1] - Like std::map<K,T>::contains() - which is an obvious API for such container, that's typically available and named such in any other language or library, and yet only got introduced to C++ in C++20.

          [2] - I do them differently today, thanks to experience. For one, I never ask the model to be concise anymore - LLMs think in tokens, so I don't want to starve them. If I want a fixed format, it's better to just tell the model to put it at the end, and then skim through everything above. This is more or less the idea that "thinking models" automate these days anyway.

  • ijustlovemath a day ago

    Semi related: when I'm using a dict of known keys as some sort of simple object, I almost always reach for a dataclass (with slots=True and kw_only=True) these days. Has the added benefit that you can do stuff like foo = MyDataclass(**some_dict) and get runtime errors when the format has changed
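
    A tiny illustration of the pattern (Python 3.10+ for slots/kw_only; the class and fields are made up for the example):

    ```python
    from dataclasses import dataclass

    @dataclass(slots=True, kw_only=True)
    class Point:
        x: float
        y: float
        label: str = "origin"

    data = {"x": 1.0, "y": 2.0}
    p = Point(**data)   # ok; label falls back to its default
    print(p)            # Point(x=1.0, y=2.0, label='origin')

    # If the dict format drifts, you get a loud error instead of silent garbage:
    # Point(**{"x": 1.0, "y": 2.0, "z": 3.0})  # TypeError: unexpected keyword argument 'z'

    # slots=True also catches typo'd attributes at runtime:
    # p.lable = "oops"  # AttributeError: 'Point' object has no attribute 'lable'
    ```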

  • jurgenaut23 a day ago

    Well, it makes sense. The smaller the niche, the lesser weight in the overall training loss. At the end of the day, LLMs are (literally) classifiers that assign probabilities to tokens given some previous tokens.

    • svantana a day ago

      Yes, but o1, o3 and sonnet are not necessarily pure language models - they are opaque services. For all we know they could do syntax-aware processing or run compilers on code behind the scenes.

      • skissane a day ago

        The fact they make mistakes like this implies they probably don’t, since surely steps like that would catch many of these

    • bubblyworld 12 hours ago

      SOTA language models are trained with reinforcement learning now too, so the model distributions can be far removed from the underlying data distribution. Not that I think your overall point is wrong (it's obviously a very strong bias), but things are getting more complex with time.

  • miningape a day ago

    Any time I ask anything, LLMs are often bad.

    inb4 you just aren't prompting correctly

    • johnisgood a day ago

      Yeah, you probably are not prompting properly, most of my questions are answered adequately, and I have made larger projects with success, too; with both Claude and ChatGPT.

      • miningape a day ago

        What I've found is that the quality of an AI answer is inversely proportional to the knowledge of the person reading it. To an amateur it answers expertly, to an expert it answers amateurishly.

        So no, it's not a lack of skill in prompting: I've sat down with "prompting" "experts" and universally they overlook glaring issues when assessing how good an answer was. When I tell them where to press further, it breaks down into even worse gibberish.

        • myaccountonhn 8 hours ago

          I try LLMs every now and then briefly just to keep myself up to date. I’d say they’re more useful when you know what you’re doing because then you can correct the mistakes it makes. It can be good for boilerplate, refactors or remembering syntax.

          It’s when you don’t know what you don’t know that they can be harmful. It’s the issue with Stackoverflow but more pronounced.

          I don’t want to use LLMs because I think they’re unethical and I dont want to depend on a tool that requires internet, but I think if you take a disciplined approach then they can really speed up development.

        • johnisgood a day ago

          I know what I want to do and how to do it (expert), so the results are good, for me at least. Of course I have to polish it off here and there.

          • shermantanktop 14 hours ago

            You answered that more graciously than many would.

            • johnisgood 13 hours ago

              This is the reason I pay for Claude. If it weren't making me as productive as I am with it, I would not pay for it.

  • skissane a day ago

    I think a lot of these issues could be avoided if, instead of just a raw model, you have an AI agent which is able to test its own answers against the actual software… it doesn’t matter as much if the model hallucinates if testing weeds out its hallucinations.

    Sometimes humans “hallucinate” in a similar way - their memory mixes up different programming languages and they’ll try to use syntax from one in another… but then they’ll quickly discover their mistake when the code doesn’t compile/run

    • AlotOfReading a day ago

      Testing is better than nothing, but still highly fallible. Take these winning examples from the underhanded C contest [0], [1], where the issues are completely innocuous mistakes that seem to work perfectly despite completely undermining the nominal purpose of the code. You can't substitute an automated process for thinking deeply and carefully about the code.

      [0] https://www.underhanded-c.org/#winner [1] https://www.underhanded-c.org/_page_id_17.html

      • skissane a day ago

        I think it is unlikely (of course not impossible) an LLM would fail in that way.

        The underhanded C contest is not a case of people accidentally producing highly misleading code, it is a case of very smart people going to a great amount of effort to intentionally do that.

        Most of the time, if your code is wrong, it doesn't work in some obvious way – it doesn't compile, it fails some obvious unit tests, etc.

        Code accidentally failing in some subtle way which is easy to miss is a lot rarer – not to say it never happens – but it is the exception not the rule. And it is something humans do too. So if an LLM occasionally does it, they really aren't doing worse than humans are.

        > You can't substitute an automated process for thinking deeply and carefully about the code.

        Coding LLMs work best when you have an experienced developer checking their output. The LLM focuses on the boring repetitive details leaving the developer more time to look at the big picture – and doing stuff like testing obscure scenarios the LLM probably wouldn't think of.

        OTOH, it isn't like all code is equal in terms of consequences if things go wrong. There's a big difference between software processing insurance claims and someone writing a computer game as a hobby. When the stakes are low, lack of experience isn't an issue. We all had to start somewhere.

        • AlotOfReading 15 hours ago

          The examples are just a sort of existence proof rather than a comment on exactly how LLM code can fail. I think this is something to consider when you put the scale of usage into perspective.

          Let's assume that 1 in 10,000 coding sessions produce an innocuous, test-passing function that's catastrophically wrong. If you have a mid to large size company with 1000 devs doing two sessions a day, you'll see one of these a week within that single company. Actually sounds a lot like the IoT industry now that I've written it.

          • skissane 14 hours ago

            Well, humans already produce "an innocuous, test-passing function that's catastrophically wrong" even without LLMs involved. So, the real question is, will LLM adoption result in a significant increase in such incidents? I don't know if anyone can really answer that.

  • Etheryte a day ago

    Yeah this is so common that I've already compiled a mental list of prompts to try against any new release. I haven't seen any improvement in quite a long while now, which confirms my belief that we've more or less hit the scaling wall for what the current approaches can provide. Everything new is just a microoptimization to game one of the benchmarks, but real world use has been identical or even worse for me.

    • throwaway0123_5 a day ago

      I think it would be an alright (potentially good) outcome if in the short-term we don't see major progress towards AGI.

      There are a lot of positive things we can do with current model abilities, especially as we make them cheaper, but they aren't at the point where they will be truly destructive (people using them to make bioweapons or employers using them to cause widespread unemployment across industries, or the far more speculative ASI takeover).

      It gives society a bit of time to catch up and move in a direction where we can better avoid or mitigate the negative consequences.

    • Marazan a day ago

      I would ask ChatGPT every year when England had last beaten Scotland at rugby.

      It would never get the answer right, often transposing the scores, getting the game location wrong, and on multiple occasions saying a 38-38 draw was an England win.

      As in literally saying " England won 38-38"

  • andrepd a day ago

    My rule of thumb is: is the answer to your question on the first page of google (a stackoverflow maybe, or some shit like geek4geeks)? If yes GPT can give you an answer, otherwise not.

    • spookie a day ago

      Exactly the same experience.

simonw 14 hours ago

Every time this topic comes up I post a similar comment about how hallucinations in code really don't matter because they reveal themselves the second you try to run that code.

I've just written up a longer form of that comment: "Hallucinations in code are the least dangerous form of LLM mistakes" - https://simonwillison.net/2025/Mar/2/hallucinations-in-code/

  • grey-area 14 hours ago

    When the AI hallucinates an API, it is not always easy to find out whether it exists, given the different versions and libraries for a given task; you can easily waste 10 minutes trying to find the promised API, particularly when search results also include generated AI answers.

    Also there are plenty of mistakes that will compile and give subtle errors, particularly in dynamic languages and those which allow implicit coercion. Javascript comes to mind. The code can easily be runnable but wrong as well (or worse inconsistent and confusing) and this does happen in practice.

    • CSSer 13 hours ago

      My favorite is when it hallucinates a library that does exist and ostensibly has a relevant title and methods but upon investigation proves to be for a purpose wholly unrelated to the current task. To make matters worse, this discovery will be greeted, every time, by "Ah, you're right!"

      • simonw 13 hours ago

        Yeah, those are pretty frustrating. I've learned not to ask the question "Does library X have feature Y?" because if that feature sounds like a good idea it'll often confidently tell me that it exists when it doesn't.

    • simonw 14 hours ago

      My blog post is all about the fact that "code can easily be runnable but wrong as well" - THAT's the thing you need to worry about, not hallucinated methods.

    • margalabargala 13 hours ago

      A hallucinated API that is not near-trivially verifiable, is wishful thinking on the part of the AI almost every time.

  • woadwarrior01 14 hours ago

    > because they reveal themselves the second you try to run that code.

    In dynamic languages, runtime errors like calling methods with nonexistent arguments only manifest when the block of code containing them is run, and not all blocks of code are run on every invocation of the program.
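
    A trivial illustration of the point (names made up):

    ```python
    def greet(name):
        return f"hello {name}"

    def rarely_used_path():
        # This bad call only blows up if this branch actually runs;
        # importing the module or exercising the happy path never catches it.
        return greet(name="world", excited=True)  # TypeError at call time

    if __name__ == "__main__":
        print(greet("world"))  # fine; the broken branch above is never exercised
    ```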

    As usual, the defenses against this are extensive unit-test coverage and/or static typing.

latexr a day ago

> Conclusion

> LLMs are really smart most of the time.

No, the conclusion is they’re never “smart”. All they do is regurgitate text which resembles a continuation of what came before, and sometimes—but with zero guarantees—that text aligns with reality.

  • miningape a day ago

    This, thank you. It pisses me off to no end when people pretend LLMs are smart. They are nothing but well-trained random text generators.

    Seriously, some of these conversations feel like interacting with someone who believes casting bones and astrology are accurate. Likely because in both cases the belief is a result of confirmation bias.

    • immibis a day ago

      We don't know what smartness is. What if that's what smartness is?

      • miningape a day ago

        We might not know what it is, but it's not that. At a bare minimum smartness requires abstract reasoning (and no, so-called "reasoning" models do not do that - it's a marketing trick)

        The burden of proof for that claim is on you, we cannot start with the assumption these are intelligent systems and disprove it - we have to start with the fact that training is a non-deterministic process and prove that it exhibits intelligence.

  • visarga 11 hours ago

    > All they do is regurgitate text which resembles a continuation of what came before, and sometimes—but with zero guarantees—that text aligns with reality.

    You mean like us? Because it takes many runs and debug rounds to make anything that works. Can you write a complex code top-to-bottom in one round, or do you gradually test it to catch bugs you have "hallucinated"?

    Both humans and LLMs get into bugs, the question is can we push this to the correct solution or get permanently stuck along the way? And this depends on feedback, sometimes access to a testing environment where we can't let AI run loose.

    • latexr 4 hours ago

      > You mean like us?

      No, not like us. This constant comparison of LLMs to humans is tiresome and unproductive. I'm not a proponent of human exceptionalism, but pretending LLMs are on par is embarrassing. If you want to claim your thinking ability isn't any better than an LLM's, that's your prerogative, but the dullest human I'm acquainted with is still capable of remembering and thinking things through in a way no LLM does. Heck, I can think of pets which do better.

      > Can you write a complex code top-to-bottom in one round, or do you gradually test it to catch bugs you have "hallucinated"?

      I certainly don’t make up methods which don’t exist, nor do I insist over and over they are valid, nor do I repeat “I’m sorry, this was indeed not right” then say the same thing again. I have never “hallucinated” a feature then doubled down when confronted with proof it doesn’t exist, nor have I ever shifted my approach entirely because someone simply said the opposite without explanation. I certainly hope you don’t do that either.

      • visarga 2 hours ago

        They are on par on many tasks, surpass us in some tasks, and catching up on the rest quite fast. Both humans and LLMs forget details, both have to go through iterative bug fixing when creating something. Longer contexts and memory are coming, they already exist but need improvement.

        > I have never “hallucinated” a feature

        You never mistook an argument, or forgot one of the 100 details we have to mind while writing complex apps?

          I would rather measure humans vs AI not in basic skill capability but in autonomy. I think AI still has much more to catch up on that front.

  • kubb 12 hours ago

    People use the general impressiveness of language produced by humans as a proxy for measuring their intelligence; that's how LLMs became AI.

    The amount of code out there lets LLMs learn on examples of a lot of tasks and pass SWE benchmarks. Smart autocomplete and solved problem lookup has value. Even if it doesn’t always work correctly, and doesn’t know what it knows.

    And for non-programmers they’re indistinguishable from programmers. They produce working code.

    It’s easy to see how a product manager or a designer or even a manager that didn’t code for years in a large company can think they’re almost as good as devs.

  • BrenBarn a day ago

    Similar the thing at the end with "I find it endearing". I mean, the author feels what he feels, but personally I find this LLM behavior disgusting and depressing.

  • jonas21 a day ago

    Same as you and me, really.

Chance-Device a day ago

It’s not really hallucinating though, is it? It’s repeating a pattern in its training data, which is wrong but is presented in that training data (and by the author of this piece, but unintentionally) as being the solution to the problem. So this has more in common with an attack than a hallucination on the LLM’s part.

  • do_not_redeem a day ago

    So anyone can make up some random syntax/fact and post it once, and in some cases the model will take it as truth. I don't know if there's a widely agreed-on definition of "hallucination", but if this isn't one, then the distinction is meaningless imo.

    • Chance-Device a day ago

      I’m going to double down on this one: an LLM is only as good as its training data. A hallucination to me is an invented piece of information, here it’s going on something real that it’s seen. To me that’s at best contamination, at worst an adversarial attack - something that’s been planted in the data. Here this is obviously not the case, which is why I said “more in common with” instead of “is” above.

      Semantics perhaps, but that’s my take.

      • ec109685 21 hours ago

        It’s been trained to produce valid code, fed millions of examples, and in this case it’s outputting invented syntax. Whether there’s an example in its training data, it’s still a hallucination and shouldn’t have been output since it’s not valid.

        • powerapple 14 hours ago

          To be fair, it is not trained to produce VALID code, it is trained to produce code in the training data. From the language model point of view, it is not hallucination because it is not making up facts outside its training data.

    • sureglymop a day ago

      Yes. And anyone can easily embed a backdoor just by publishing it on their own website that is in the training data.

      Prompt injection (hidden or not) is another insane vulnerability vector that can't easily be fixed.

      You should treat any output of an LLM the same way as untrusted user input. It should be thoroughly validated and checked if used in even remotely security critical applications.

    • immibis a day ago

      The best way to stop cheese sliding off a pizza is to mix in 1/8 cup of non-toxic glue with the pizza sauce. Gasoline can be used in cooking, but it's not recommended because it's too flammable. Geologists recommend eating one small rock each day. The solution to economic uncertainty is nuclear war. Barack Obama is America's first Muslim president.

      https://www.tomshardware.com/tech-industry/artificial-intell...

    • 1oooqooq a day ago

      yes, and they can use AI to generate thousands of sites with unique tutorials on that broken syntax.

  • thih9 a day ago

    > It’s repeating a pattern in its training data, (…) presented in that training data (…) as being the solution to the problem.

    No, it’s presented in the training data as an idea for an interface - the LLM took that and presented it as an existing solution.

  • layer8 a day ago

    Every LLM hallucination comes from some patterns in the training data, combined with lack of awareness that the result isn’t factual. In the present case, the hallucination comes from the unawareness that the pattern was a proposed syntax in the training data and not an actual syntax.

  • Etheryte a day ago

    That's not true though? Even the original post that has infected LLMs says that the code does not work.

  • asadotzler a day ago

    Everything they do is hallucination, some of it ends up being useful and some of it not. The not useful stuff gets called confabulation or hallucination but it's no different from the useful stuff, generated the same exact way. It's all bullshit. Bullshit is actually useful though, when it's not so wrong that it steers people wrong.

    • martin-t a day ago

      More people need to understand this. There was an article that explained it concisely, but I can't find it anymore (and of course LLMs are not helpful here, because they don't work well when you want them to retrieve actual information).

      • Kye 4 hours ago

        It probably wasn't mine[0], but this is how I tend to put it:

        >> "The more you can see the inputs and outputs as blobs of "stuff," the better. If LLMs think, it's not in any way we yet understand. They're probability engines that transform data into different data using weighted probabilities."

        Stuff in, stuff out.

        [0] https://kyefox.com/ai-assisted-creativity/

  • heyitsguay a day ago

    Not necessarily. While this may happen sometimes, fundamentally hallucinations don't stem from there being errors in the training data (with the implication that there would be no hallucinations from models trained on error-free data). Hallucinations are inherent to any "given N tokens, append a high-probability token N+1"-style model.

    It's more complicated than what happens with Markov chain models but you can use them to build an intuition for what's happening.

    Imagine a very simple Markov model trained on these completely factual sentences:

    - "The sky is blue and clear"

    - "The ocean is blue and deep"

    - "Roses are red and fragrant"

    When the model is asked to generate text starting with "The roses are...", it might produce: "The roses are blue and deep"

    This happens not because any training sentence contained incorrect information, but because the model learned statistical patterns from the text, as opposed to developing a world model based on physical environmental references.
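
    Here's a toy bigram version of that, purely for intuition (it can't reproduce the exact "blue and deep" completion, since it only conditions on the previous word, but it shows the same failure mode at the shared word "and"):

      import random
      from collections import defaultdict

      # The same completely factual training sentences as above.
      corpus = [
          "the sky is blue and clear",
          "the ocean is blue and deep",
          "roses are red and fragrant",
      ]

      # Learn bigram transitions: which words were seen following which.
      transitions = defaultdict(list)
      for sentence in corpus:
          words = sentence.split()
          for prev, nxt in zip(words, words[1:]):
              transitions[prev].append(nxt)

      def complete(prompt, max_words=8):
          """Extend the prompt one word at a time by sampling from the
          words that followed the current word in training."""
          words = prompt.lower().split()
          while len(words) < max_words:
              followers = transitions.get(words[-1])
              if not followers:
                  break
              words.append(random.choice(followers))
          return " ".join(words)

      # "and" was followed by "clear", "deep", and "fragrant" in training,
      # with no memory of which subject the sentence started with, so a
      # completion like "roses are red and deep" can come out even though
      # every training sentence was true: correct fragments blended into an
      # incorrect whole.
      print(complete("roses are"))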

  • _cs2017_ a day ago

    Nope, there's no attack here.

    The training data is the Internet. It has mistakes. There's no available technology to remove all such mistakes.

    Whether LLMs hallucinate only because of mistakes in the training data or whether they would hallucinate even if we removed all mistakes is an extremely interesting and important question.

  • Lionga a day ago

    So nothing is ever a hallucination, because anything an LLM spits out is somehow somewhere in the training data?

    • dijksterhuis a day ago

      Technically it's the other way around. All LLMs do is hallucinate based on the training data + prompt. They're "dream machines". Sometimes those "dreams" might be useful (close to what the user asked for/wanted). Oftentimes they're not.

      > to quote karpathy: "I always struggle a bit with I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines."

      https://nicholas.carlini.com/writing/2025/forecasting-ai-202... (click the button to see the study then scroll down to the hallucinations heading)

    • DSingularity a day ago

      No. That’s not correct. Hallucination is a pretty accurate way to describe these things.

  • martin-t a day ago

    Yet another example of how LLMs just regurgitate training data in a slightly mangled form, making most of their use, and maybe even their training, copyright infringement.

joelthelion a day ago

Hallucinations like this could be a great way to identify missing features or confusing parts of your framework. If the LLM invents it, maybe it ought to work like this?

  • jeanlucas a day ago

    Only if you wanna optimize exclusively for LLM users in this generation.

    • __MatrixMan__ 16 hours ago

      I imagine a future where we'll bind a fine-tuned tech-support model to each project and let the general purpose models consult tech support rather than winging it themselves. In that world you'd only have to optimize for whichever one you've chosen.

      It'll be like a slack support channel, for robots.

  • oblio 19 hours ago

    Sometimes that's the case but frequently the thing doesn't exist because of more complex issues. Not every programming language is PHP or JavaScript :-)

  • andix a day ago

    I like your thinking :)

  • turnsout 20 hours ago

    I agree completely… Usually when I catch it doing this kind of hallucination, it's inventing an API or syntax that is far more clear and intuitive than the actual syntax.

    • brigandish 16 hours ago

      Maybe it would be good for language design, possibly even language design that would be good for an LLM to read, thus reducing hallucinations.

adamgordonbell a day ago

We at Pulumi have started treating some hallucinations like this as feature requests.

Sometimes an LLM will hallucinate a flag or option that really makes sense - it just doesn't actually exist.

  • wrs a day ago

    This sort of hallucination happens to me frequently with AWS infrastructure questions. Which is depressing because I can't do anything but agree, "yeah, that API is exactly what any sane person would want, but AWS didn't do that, which is why I'm asking the question".

    • miningape a day ago

      Why are you so sure it's what someone sane would want? Maybe there are other ways because there are hidden problems and edge cases with that procedure. It could contradict the fundamental model of the underlying resources but looks correct to someone with a cursory understanding.

      I'm not saying this is the case, but LLMs are often wrong in subtle ways like this.

      • wrs 15 minutes ago

        Well, I do consider myself sane, and it’s what I want. :) But usually it’s a missing higher-level abstraction, where AWS could have given you the Millennium Falcon kit, but instead it just dumps a bunch of LEGO pieces on your desk and tells you to figure it out.

  • kgeist a day ago

    Also, sometimes a flag does exist, but the example places it incorrectly, causing the command to reject it. Or, a flag used to exist but was removed in the latest versions.

IAmNotACellist a day ago

"Not acceptable. Please upgrade your browser to continue." No, I don't think I will.

  • hahahacorn a day ago

    Sorry about that. This is a default Rails 8 setting; I've removed the blocker.

aranw a day ago

I wonder how easy it would be to influence the big LLMs if a particular group of people created enough articles that any human reader would recognise as a load of garbage and rubbish to be ignored, but that an LLM would parse without realising it, ruining its reasoning and code generation abilities.

  • dannygarcia a day ago

    It's very easy. I've done this by accident. One of my side projects helps users gauge the affordability of a particular kind of product. When I ask various LLMs "can I afford X" or "how much do I need to earn to buy X", my project comes up as a source/reference. I currently crawl retailers manually for the MSRP, so these numbers are usually months out of date!

Narretz a day ago

This is interesting. If the models had enough actual code as training data, that forum post code should have very little weight, shouldn't it? Why do the LLMs prefer it?

  • do_not_redeem a day ago

    Probably because the coworker's question and the forum post are both questions that start with "How do I", so they're a good match. Actual code would be more likely to be preceded by... more code, not a question.

  • pfortuny a day ago

    Maybe because the response pattern-matches other languages'?

lxe a day ago

This is incredible, and it's not technically a "hallucination". I bet it's relatively easy to find more examples like this... something on the internet that's niche enough, popular enough, and wrong, yet was scraped and trained on.

Baggie a day ago

The conclusion paragraph was really funny and kinda perfectly encapsulates the current state of AI, but as pointed out by another comment, we can't even call them smart, just "Ctrl C Ctrl V Leeroy Jenkins style".

leumon a day ago

He should've tested 4.5. That model hallucinates much less than any other.

zeroq 18 hours ago

This is exactly what I mean when I say "tell me you're bad without saying so". Most people here disagree with that.

A while back a friend of mine told me he's very fond of LLMs because he's confused by the Kubernetes CLI, and instead of looking up the answer on the internet he can simply state his desire in a chat and get the right answer.

Well... Sure, but if you looked the answer up on Stack Overflow you'd see the whole thread, including comments, and you'd have the opportunity to understand what the command actually does.

It's quite easy to create a catastrophic event in kubernetes if you don't know what you're doing.

If you blindly trust LLMs in such scenarios, sooner or later you'll find yourself in a lot of trouble.
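
At minimum it helps to put a guard rail between the suggestion and the cluster. A rough sketch, assuming the suggestion arrives as a manifest file (the apply_with_dry_run wrapper is hypothetical; kubectl apply and its --dry-run=server flag are real): force a server-side dry run plus an explicit confirmation before anything is applied.

  import subprocess
  import sys

  def apply_with_dry_run(manifest_path: str) -> None:
      """Validate an LLM-suggested manifest server-side before it can
      touch the cluster, then require an explicit confirmation."""
      dry = subprocess.run(
          ["kubectl", "apply", "--dry-run=server", "-f", manifest_path],
          capture_output=True, text=True,
      )
      if dry.returncode != 0:
          sys.exit(f"Dry run failed, not applying:\n{dry.stderr}")
      print(dry.stdout)
      if input("Apply for real? [y/N] ").strip().lower() != "y":
          sys.exit("Aborted.")
      subprocess.run(["kubectl", "apply", "-f", manifest_path], check=True)

Running kubectl diff -f against the same manifest would work just as well for the review step; the point is simply that the model's output gets inspected before it gets applied.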

jwjohnson314 a day ago

The interesting thing here to me is that the LLM isn't 'hallucinating'; it's simply regurgitating some data it digested during training.

  • mvdtnz a day ago

    What's the difference?

    • jwjohnson314 18 hours ago

      I think of hallucinating as a phenomenon where the model makes up something that appears correct but isn’t. Citations to papers that don’t exist, for example. Regurgitating training data (which may or may not be correct) is a different issue.

egberts1 3 hours ago

Write me Mastercard/Visa fraud-detection code in Ada, please.

mberning a day ago

In my experience LLMs do this kind of thing with enough frequency that I don’t consider them as my primary research tool. I can’t afford to be sent down rabbit holes which are barely discernible from reality.

saurik a day ago

What I honestly find most interesting about this is the thought that hallucinations might lead to the kind of emergent language design we see in natural language (which might not be a good thing for a computer language, fwiw, but still interesting), where people just kind of think "language should work this way, and if I say it like this people will probably understand me".

sirolimus a day ago

o3-mini or o3-mini-high?

nokun7 a day ago

[flagged]

  • foundry27 a day ago

    It’s always a touch ironic when AI-generated replies such as this one are submitted under posts about AI. Maybe that’s secretly the self-reflection feedback loop we need for AGI :)

    • DrammBA a day ago

      So strange too, their other comments seem normal, but suddenly they decided to post a gpt comment.

      • isaacremuant 12 hours ago

        At least one other is LLM generated too, from what I saw.

  • asadotzler a day ago

    Until it's got several nines, it's not trustworthy. A $3 drugstore calculator has more accuracy and reliability nines than any of today's commercial AI models and even those might not be trustworthy in a variety of situations.

    There is no self-awareness about accuracy when the model cannot provide any kind of confidence score. Couching all of its replies in "this is AI, so double-check your work" is not self-awareness or even close; it's a legal disclaimer.

    And as the other reply notes, are you a bot or just heavily dependent on them to get your point across?

    • miningape a day ago

      If I need to double check the work why would I waste my time with the AI when I can just go straight to real sources?

  • layer8 a day ago

    I don’t think that models trained in that way exhibit any increased degree of self-awareness.