renegade-otter 4 hours ago

In every single system I have worked on, tests were not just tests - they were their own parallel application, one that required careful architecture and constant refactoring to keep it from getting out of hand.

"More tests" is not the goal - you need to write high impact tests, you need to think about how to test the most of your app surface with least amount of test code. Sometimes I spend more time on the test code than the actual code (probably normal).

Also, I feel like people would be inclined to go with whatever the LLM gives them, as opposed to really sitting down and thinking about all the unhappy paths and edge cases of UX. Using an autocomplete to "bang it out" seems foolish.

  • swatcoder 3 hours ago

    Fully agreed.

    It's bad enough when human team members submit useless, brittle tests with their PRs just to satisfy some org pressure to write them. The lazy ones provide a false sense of security even though they neglect critical scenarios, the unstable ones undermine trust in the test output because they intermittently raise false alarms that nobody has time to debug, and the pointless ones do nothing but cement the current architecture so it becomes too laborious to refactor anything.

    As contextually aware generators, LLMs doubtless have good uses in test development, but (as with many other domains) they threaten to amplify an already troubling problem with low-quality, high-volume content spam.

  • nrnrjrjrj 39 minutes ago

    There is an art to writing tests, especially getting abstraction levels right. For example, do you integration-test the password field with 1,000 cases, or do that as a unit test - and does doing it as a unit test give you sufficient coverage?

    AI could do all this thinking in the future, but not yet, I believe!

    On top of that, the codebase is likely already a mess of bad practice (I've never seen one that isn't! That is life), so part of the job is often leaving the campground a bit better than you found it.

    LLMs can help now on last mile stuff. Fill in this one test. Generate data for 100 test cases. Etc.

  • skissane 3 hours ago

    > "More tests" is not the goal - you need to write high impact tests, you need to think about how to test the most of your app surface with least amount of test code.

    Are there ways we can measure this?

    One idea that I’ve had is to collect code coverage separately for each test. If a test isn’t covering any unique code or branches, maybe it is superfluous - although not necessarily: it can make sense to separately test all the boundary conditions of a function, even if doing so doesn’t hit any unique branches.
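
    Roughly what I have in mind, as a sketch (this assumes coverage.py's Python API; the test functions and source path here are placeholders):

        import coverage

        def lines_covered_by(test_fn, source_file):
            # Measure which lines of source_file this single test executes.
            # source_file must be the path coverage.py records (usually absolute).
            cov = coverage.Coverage()
            cov.start()
            try:
                test_fn()
            finally:
                cov.stop()
            return set(cov.get_data().lines(source_file) or [])

        def superfluous_candidates(tests, source_file):
            # Flag tests whose covered lines are all covered by the other tests too.
            per_test = {t.__name__: lines_covered_by(t, source_file) for t in tests}
            candidates = []
            for name, lines in per_test.items():
                others = set().union(*(v for k, v in per_test.items() if k != name))
                if lines <= others:
                    candidates.append(name)
            return candidates

    (coverage.py's "dynamic contexts" feature can collect the same per-test data in a single run of the whole suite, which would scale better than re-running per test.)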

    Maybe prefer a smaller test which covers the same code to a bigger one. However, sometimes if a test is very DRY, it can be more brittle, since it can be non-obvious how to update it to handle a code change. Updating a repetitive test can be laborious, but it is at least reasonably obvious how to do so.

    Could an LLM evaluate test quality, if you give it a prompt containing some expert advice on good and bad testing practices?

    • fijiaarone 6 minutes ago

      Sometimes you actually have to think, or hire someone who can. Go join the comments section on the Goodhart's Law post if you want to go on about measuring magical metrics.

  • viraptor 3 hours ago

    Pretty much this, and I prefer the opposite direction: "Here's the new test case from me, make the code pass it" is a decent workflow with Aider.

    I get that occasionally there are some really trivial but important tests that take time and would be nice to automate. But that's a minority in my experience.

mastersummoner 2 hours ago

I actually tested Claude Sonnet to see how it would fare at writing a test suite for a background worker. My previous experience was with some version of GPT via Copilot, and it was... not good.

I was, however, extremely impressed with Claude this time around. Not only did it do a great job off the bat, but it taught me some techniques and tricks available in the language/framework (Ruby, RSpec) that I wasn't familiar with.

I'm certain that it helped having a decent prompt, asking it to consider all the potential user paths and edge cases, and also having a very good understanding of the code myself. Still, this was the first time for me I could honestly say that an LLM actually saved me time as a developer.

nazgul17 2 hours ago

Should we not, instead, write tests ourselves and have LLMs write the code to make them pass?

  • jayd16 2 hours ago

    Just ask it to do both.

    • sdesol an hour ago

      And remember to always challenge the response with both the same and different models. No joke. Just continue the conversation for the example in the blog and ask the LLM "Do you see anything wrong with the code?" and it will spit out "Yes" and explain why.

iambateman 5 hours ago

I did this for Laravel a few months ago and it’s great. It’s basically the same as the article describes, and it has definitely increased the number of tests I write.

Happy to open source if anyone is interested.

simonw 6 hours ago

If you add "white-space: pre-wrap" to the elements containing those prompt examples you'll avoid the horizontal scrollbar (which I'm getting even on desktop) and make them easier to read.

  • johnjwang 5 hours ago

    Thanks for the suggestion -- I'll take a look into adding this!

satisfice 5 hours ago

Like nearly all the articles about AI doing "testing" or any other skilled activity, this one admits in its last part that the method is unreliable. What I don't see in this article -- which I suspect is because they haven't done any -- is any description of a competent and reasonably complete testing process for this method of writing "tests." What they probably did was try this, feel good about it (because testing is not their passion, so they are easily impressed), and then mark it off in their minds as a solved problem.

The retort by AI fanboys is always "humans are unreliable, too." Yes, they are. But they have other important qualities: accountability, humility, legibility, and the ability to learn experientially as well as conceptually.

LLMs are good at instantiating typical or normal patterns (based on their training data). Skilled testing cannot be limited to typicality, although that's a start. What I'd say is that this is an interesting idea that has an important hazard associated with it: complacency on the part of the developer who uses this method, which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.

  • johnjwang 5 hours ago

    Author here: Yes, there are certain functions where writing good tests will be difficult for an LLM, but in my experience I've found that the majority of functions that I write don't need anything out of the ordinary and are relatively straightforward.

    Using LLMs allows us to have much higher coverage than if we didn't use them. To me and our engineering team, this is a pretty good thing: in the time-prioritization matrix, if I can get a higher-quality code base with higher test coverage for minimal extra work, I will definitely take it (and in fact it's something I encourage our engineering teams to do).

    Most of the base tests that we use were created originally by some of our best engineers. The patterns they developed are used throughout our code base and LLMs can take these and make our code very consistent, which I also view as a plus.

    re: Complacency: We actually haven't found this to be the case. In fact, we've seen more tests being written with this method. Just think about how much easier it is to review a PR and make edits vs. write one. You can actually spend your time enforcing higher-quality tests because you don't have to write most of the boilerplate for a test yourself.

    • youoy 5 hours ago

      I would say that the complacency risk is in identifying good tests with good coverage. I agree that writing tests is one of the best use cases for LLMs, and it definitely saves engineers a lot of time. But if you follow them too blindly, it is easy to get carried away by how easy it is to write tests that chase coverage instead of testing the things that actually matter. Which is what the previous comment was pointing at:

      > which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.

    • satisfice 3 hours ago

      Have you systematically tested this approach? It sounds like you are reporting on your good vibes. Your writing is strictly anecdotal.

      I’ve been working with AI, too. I see what I’m guessing is the same unreliability that you admit in the last part of your article. For some reason, you are sanguine about it, whereas I see it as a serious problem.

      You say you aren’t complacent, but your words don’t seem to address the complacency issue. “More tests” does not mean better testing, or even good enough testing.

      Google "automation bias" and tell me what policies, procedures, or training are in place to avoid it.

  • wenc 4 hours ago

    I do use LLMs to bootstrap my unit testing (because there is a lot of boilerplate in unit tests and mocks), but I tend to finish the unit tests myself. This gives me confidence that my tests are accurate to the best of my knowledge.

    Having good tests allows me to be more liberal with LLMs on implementation. I still only use LLMs to bootstrap the implementation, and I finish it myself. LLMs, being generative, are really good for ideating different implementations (they propose implementations that I would never have thought of), but I never take any implementation as-is -- I always try to step through it and finish it off manually.

    Some might argue that it'd be faster if I wrote the entire thing myself, but it depends on the problem domain. So much of what I do involves implementing code for unsolved problems (I'm not writing CRUD apps, for instance) that I really do get a speed-up from LLMs.

    I imagine folks writing conventional code might spend more time fixing LLM mistakes and thus think that LLMs slow them down. But this is not true for my problem domain.

  • simonw 5 hours ago

    The answer to this is code review. If an LLM writes code for you - be it implementation or tests - you review it before you land it.

    If you don't understand how the code works, don't approve it.

    Sure, complacent developers will get burned. They'll find plenty of other non-AI ways to burn themselves too.

    • hitradostava 5 hours ago

      100% agree. We don't expect human developers to be perfect, so why should we expect AI assistants to be? Code going to production should go through review.

      I do think that LLMs will increase the volume of bad code though. I use Cursor a lot, and occasionally it will produce perfect code, but often I need to direct and refine, and sometimes throw away. But I'm sure many devs will get lazy and just push once they've got the thing working...

      • sdesol 4 hours ago

        > 100% agree. We don't expect human developers to be perfect, so why should we expect AI assistants to be?

        I think the issue is that we are currently being sold that they are. I'm blown away by how useful AI is, and how stupid it can be at the same time. Take a look at the following example:

        https://app.gitsense.com/?doc=f7419bfb27c896&highlight=&othe...

        If you click on the sentence, you can see how dumb Sonnet-3.5 and GPT-4 can be. Each model was asked to spell-check and grammar-check the sentence 5 times, and you can see that GPT-4o-mini was the only one that got it right all 5 times. The other models mostly got it comically wrong.

        I believe LLM is going to change things for the better for developers, but we need to properly set expectations. I suspect this will be difficult, since a lot of VC money is being pumped into AI.

        I also think a lot of mistakes can be prevented if you ask the model, in your prompt, to explain how and why it did what it did. For example, the prompt that was used in the blog post should include "After writing the test, summarize how each rule was applied."

        • simonw 4 hours ago

          "I think the issue is that we are currently being sold that it is."

          The message that these systems are flawed appears to be pretty universal to me:

          ChatGPT footer: "ChatGPT can make mistakes. Check important info."

          Claude footer: "Claude can make mistakes. Please double-check responses."

          https://www.meta.ai/ "Messages are generated by AI and may be inaccurate or inappropriate."

          etc etc etc.

          I still think the problem here is science fiction. We have decades of sci-fi telling us that AI systems never make mistakes, but instead will cause harm by following their rules too closely (paperclip factories, 2001: A Space Odyssey etc).

          Turns out the actual AI systems we have make mistakes all the time.

          • sdesol 3 hours ago

            You do have to admit, the footer is extremely small and it's also not in the most prominent place. I think most "AI companies" probably don't go into a sales pitch saying "It's awesome, but it might be full of shit".

            I do see your science fiction angle, but I think the bigger issue is the media, VCs, etc. are not clearly spelling out that we are nowhere near science fiction AI.

          • jazzyjackson 2 hours ago

            I appreciate the footer on Kagi Assistant: "Assistant can make mistakes. Think for yourself when using it" - a reminder that there's a tendency to outsource your own train of thought.

            • sdesol an hour ago

              I would have to imagine 90+ percent of people use LLMs and AI to outsource their thinking, and most will not heed this warning. OpenAI might say "Check important info.", but they know most people probably won't do a Google search or visit their library to fact-check things.

apwell23 4 hours ago

I would love to use it to change code in ways that still compile and see if tests fail. Coverage metrics sometimes don't really tell you if some piece of code is actually covered or not.

  • taberiand 2 hours ago

    I believe that's called mutation testing. Using an LLM to perform the mutation sounds like a great idea.
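
    Roughly, a naive sketch of the idea (not a real tool - the operator swaps and the "pytest -q" call are just illustrative):

        import pathlib
        import subprocess

        # Tiny table of "still compiles, but means something else" edits.
        MUTATIONS = [("<=", "<"), (">=", ">"), ("==", "!="), ("+", "-")]

        def suite_passes():
            # Exit code 0 means every test passed.
            return subprocess.run(["pytest", "-q"]).returncode == 0

        def surviving_mutants(source_path):
            path = pathlib.Path(source_path)
            original = path.read_text()
            survivors = []
            for old, new in MUTATIONS:
                if old not in original:
                    continue
                path.write_text(original.replace(old, new, 1))  # apply one mutation
                try:
                    if suite_passes():          # suite still green -> mutant survived
                        survivors.append((old, new))
                finally:
                    path.write_text(original)   # always restore the original file
            return survivors

    A mutant that "survives" (the suite stays green) points at code that is executed but never actually checked - which is exactly the gap a raw coverage number hides.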

  • sesm 3 hours ago

    Coverage metrics can tell you if lines of code were executed, but they can't tell you whether the execution result was checked.