_pdp_ 6 hours ago

I started a company in this space about 2 years ago. We are doing fine. What we've learned so far is that a lot of these techniques are simply optimisations to tackle some deficiency in LLMs that is a problem "today". These are not going to be problems tomorrow because the technology will shift, as it has many times over the span of the last 2 years.

So yah, cool, caching all of that... but give it a couple of months and a better technique will come out - or more capable models.

Many years ago, when disk encryption on AWS was not an option, my team and I had to spend 3 months coming up with a way to encrypt the disks, and to do it well, because at the time there was no standard way. It was very difficult, as it required pushing encrypted images (as far as I remember). Soon after we started, AWS introduced standard disk encryption that you can turn on by clicking a button. We wasted 3 months for nothing. We should have waited!

What I've learned from this is that oftentimes it is better to do absolutely nothing.

  • siva7 3 hours ago

    This is the most important observation. I'm getting so many workshop invitations from my corporate colleagues about AI and agents. What most people don't get is that these clever patterns they "invented" will be obsolete next week. This nice company blog about agents - one which went viral recently - will be obsolete next month. It's hard for my colleagues to swallow that in this age it's not like when you studied the Gang of Four or a software architecture pattern book and learned a common language - these days the half-life of an AI pattern is about a week. Ask 10 professionals what an agent actually is and you will get 10 different answers, yet each assumes that how they use it is the common understanding.

    • Vinnl 2 hours ago

      This is also why it's perfectly fine to wait out this AI hype and see what sticks afterward. It probably won't cost too much time to catch up, because at that point everyone who knows what they're doing only learned that a month or two ago anyway.

    • hibikir 3 hours ago

      Note that even many of those "long knowledge" things people learned are today obsolete, but people who follow them just haven't figured it out yet. See how many of those object-oriented design patterns look very silly the minute you use immutable data structures and have access to functional programming constructs in your language. And nowadays most do. Many seminal books on how to program from the early 2000s, especially those covering "pure" OO, look quite silly today.

  • lelanthran 2 hours ago

    > I started a company in this space about 2 years ago. We are doing fine.

    You have a positive cash flow from sales of agents? Your revenue exceeds your operating costs?

    I've been very skeptical that it is possible to make money from agents, having seen how difficult it was for the current well-known players to do so.

    What is your secret sauce?

  • gchamonlive 6 hours ago

    I think knowing when to do nothing is being able to evaluate if the problem the team is tackling is essential or tangential to the core focus of the project, and also whether the problem is something new or if it's been around for a while and there is still no standard way to solve it.

    • gessha 6 hours ago

      Yeah, that will be the make-it-or-break-it moment, because if it's too essential, it will be implemented, but if it's not, it may become a competitive advantage

  • toddmorey 29 minutes ago

    In some ways, the fact that the technology will shift is the problem, as model behavior keeps changing. It's rather maddeningly unstable ground to build on. Really hard to gauge the impact on customer experience from a new model.

    • ares623 19 minutes ago

      For a JS dev, it’s just another Tuesday

      • cheschire 4 minutes ago

        Is JS dev really still as mercurial as it was 5 to 10 years ago? I'm not so sure. Back then, there would be a new topic daily about some new JS framework etc etc.

        I still occasionally see a blip of activity but I can't say it's anything like what we witnessed in the past.

        Though I will agree that gen AI trends feel reminiscent of that period of JS dev history.

  • nowittyusername 3 hours ago

    I agree with the sentiment. Things are moving so fast that waiting is now a legitimate strategy. Though it is also easy to fall into the trap of: well, if we continue along these lines, we might as well wait 4-5 years and get AGI. Which, even if true imo, does feel off, as you aren't participating in the process.

  • verdverm an hour ago

    Vendor choice matters.

    You could use the likes of Amazon / Anthropic, or use Google, who has had transparent disk encryption for 10+ years, and Gemini, which already has the transparent caching discussed here built in.

    • te_chris an hour ago

      If you've spent any time with the Vertex LLM APIs you wouldn't be so enthusiastic about using Google's platform (I say this as someone who prefers GCP to AWS for compute and networking).

  • an0malous 5 hours ago

    > These are not going to be problems tomorrow because the technology will shift, as it has many times over the span of the last 2 years.

    What technology shifts have happened for LLMs in the last 2 years?

    • dcre 5 hours ago

      One example is that there used to be a whole complex apparatus around getting models to do chain of thought reasoning, e.g., LangChain. Now that is built in as reasoning and they are heavily trained to do it. Same with structured outputs and tool calls — you used to have to do a bunch of stuff to get models to produce valid JSON in the shape you want, now it’s built in and again, they are specifically trained around it. It used to be you would have to go find all relevant context up front and give it to the model. Now agent loops can dynamically figure out what they need and make the tool calls to retrieve it. Etc etc.
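
      To make the structured-output part concrete, here is a rough sketch of the shift, using the OpenAI Python SDK's JSON mode as one illustration (the model name, keys, and prompts are my own placeholders, and the exact SDK surface may differ by version):

          import json
          from openai import OpenAI

          client = OpenAI()

          # Old way: beg for JSON in the prompt, then parse and hope.
          def old_extract(text: str) -> dict:
              resp = client.chat.completions.create(
                  model="gpt-4o-mini",
                  messages=[{"role": "user",
                             "content": f"Return ONLY JSON with keys name, amount:\n{text}"}],
              )
              return json.loads(resp.choices[0].message.content)  # may blow up on prose

          # Newer way: ask the API for a JSON object directly; models are trained for it.
          def new_extract(text: str) -> dict:
              resp = client.chat.completions.create(
                  model="gpt-4o-mini",
                  response_format={"type": "json_object"},
                  messages=[{"role": "user",
                             "content": f"Extract name and amount as JSON:\n{text}"}],
              )
              return json.loads(resp.choices[0].message.content)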

      • mewpmewp2 2 hours ago

        LangChain generally felt pointless for me to use, not a good abstraction. It keeps you away from the most important thing you need in this fast-evolving ecosystem: a direct prompt-level (if you can even call that low-level) understanding of what is going on.

    • postalcoder 5 hours ago

      If we expand this to 3 years, the single biggest shift that totally changed LLM development is the increase in context window size from 4,000 to 16,000 to 128,000 to 256,000 tokens.

      When we were at 4,000 and 16,000 context windows, a lot of effort was spent on nailing down text splitting, chunking, and reduction.

      For all intents and purposes, the size of current context windows obviates all of that work.

      What else changed?

      - Multimodal LLMs - Text extraction from PDFs was a major issue for rag/document intelligence. A lot of time was wasted trying to figure out custom text extraction strategies for documents. Now, you can just feed the image of a PDF page into an LLM and get back a better transcription.

      - Reduced emphasis on vector search. People have found that for most purposes, having an agent grep your documents is cheaper and better than using a more complex RAG pipeline (rough sketch below). Boris Cherny created a stir when he talked about Claude Code doing it that way[0]

      https://news.ycombinator.com/item?id=43163011#43164253
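
      As a minimal sketch of the "grep as a tool" idea (the tool name and registration shape are my own assumptions, not anyone's actual implementation):

          import subprocess

          def grep_docs(pattern: str, path: str = "docs/", max_lines: int = 50) -> str:
              """Case-insensitive recursive grep over a document folder."""
              result = subprocess.run(["grep", "-rin", pattern, path],
                                      capture_output=True, text=True)
              lines = result.stdout.splitlines()[:max_lines]
              return "\n".join(lines) or "no matches"

          # The agent advertises this as a tool and decides when to call it, refining
          # the pattern across turns instead of relying on one-shot vector recall.
          GREP_TOOL = {
              "name": "grep_docs",
              "description": "Search the documentation for a regex pattern.",
              "parameters": {"pattern": "regex string", "path": "optional subfolder"},
          }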

    • throwaway13337 4 hours ago

      I'm amazed at this question and the responses you're getting.

      These last few years, I've noticed that the tone around AI on HN changes quite a bit by waking time zone.

      EU waking hours have comments that seem disconnected from genAI. And, while the US hours show a lot of resistance, it's more fear than a feeling that the tools are worthless.

      It's really puzzling to me. This is the first time I've noticed such a disconnect in the community about what the reality of things is.

      To answer your question personally, genAI has changed the way I code drastically about every 6 months in the last two years. The subtle capability differences change what sorts of problems I can offload. The tasks I can trust them with get larger and larger.

      It started with better autocomplete, and now, well, agents are writing new features as I write this comment.

      • GoatInGrey 2 hours ago

        The main line of contention is how much autonomy these agents are capable of handling in a competitive environment. One side generally argues that they should be fully driven by humans (i.e. offloading tedious tasks you know the exact output of but want to save time not doing) while the other side generally argues that AI agents should handle tasks end-to-end with minimal oversight.

        Both sides have valid observations in their experiences and circumstances. And perhaps this is simply another engineering "it depends" phenomenon.

      • GiorgioG 4 hours ago

        Despite the latest and greatest models…I still see glaring logic errors in the code produced for anything beyond basic CRUD apps. They still make up fields that don't exist and assign nonsensical values to variables. I'll give you an example: in the code in question, Codex assigned the required field LoanAmount a value from a variable called assessedFeeAmount…simply because, as far as I can tell, it had no idea how to get the correct value from the current function/class.

        • lbreakjai 2 hours ago

          That's why I don't get people who claim to be letting an agent run for an hour on some task. LLMs tend to make so many small errors like that, which are so hard to catch if you aren't super careful.

          I wouldn't want to have to review the output of an agent going wild for an hour.

      • bdangubic 4 hours ago

        the disconnect is quite simple: there are people who are professionals and are willing to put the time in to learn, and then there's the vast majority of others who don't and will bitch and moan about how it is shit etc. if you can't get these tools to make your job easier and more productive you ought to be looking for a different career…

        • overfeed 3 hours ago

          You're not doing yourself any favors by labeling people who disagree with you undereducated or uninformed. There are enough over-hyped products/techniques/models/magical-thinking to warrant skepticism. At the root of this thread is an argument (paraphrasing) encouraging people to just wait until someone solves major problems instead of tackling them themselves. This is a broad statement of faith, if I've ever seen one, in a very religious sense: "Worry not, the researchers and foundation models will provide."

          My skepticism, and my intuition that AI innovations are not exponential but sigmoid, are not because I don't understand what gradient descent, transformers, RAG, CoT, or multi-head attention are. My statement of faith is: the ROI economics are going to catch up with the exuberance way before AGI/ASI is achieved; sure, you're getting improving agents for now, but that's not going to justify the 12- or 13-digit USD investments. The music will stop, and improvements will slow to a drip.

          Edit: I think at its root, the argument is between folks who think AI will follow the same curve as past technological trends, and those who believe "it's different this time".

          • bdangubic 2 hours ago

            > labeling people who disagree with you undereducated or uninformed

            I did neither of those two things... :) I personally could not care less about

            - (over)hype

            - 12/13/14/15 ... digit USD investment

            - exponential vs. sigmoid

            There are basically two groups of industry folk:

            1. those that see technology as absolutely transformational and are already doing amazeballs shit with it

            2. those that argue how it is bad/not-exponential/ROI/...

            If I was a professional (I am) I would do everything in my power to learn everything there is to learn (and then more) and join the Group #1. But it is easier to be in Group #2 as being in Group #1 requires time and effort and frustrations and throwing laptop out the window and ... :)

            • gmm1990 an hour ago

              If there is really amazing stuff happening with this technology, how did we have two recent major outages that were caused by embarrassing problems? I would guess that at least in the Cloudflare instance some of the responsible code was AI generated

              • ctoth 26 minutes ago

                > I would guess that at least in the Cloudflare instance some of the responsible code was AI generated

                Your whole point isn't supported by anything but ... a guess?

                If given the chance to work with an AI that hallucinates sometimes or a human who makes logical leaps like this, I think I know which I'd pick.

                Seriously, just what even? "I can imagine a scenario where AI was involved, therefore I will treat my imagination as evidence."

        • siva7 3 hours ago

          It's disheartening. I have a colleague, very senior, who dislikes AI for a myriad of reasons and doesn't want to adapt unless forced by mgmt. I feel that from 2022-2024 the majority of my colleagues were in this camp - either afraid of AI or looking at it as not something a "real" developer would ever use. In 2025 that seemed to change a bit. American HN seemed to adapt more quickly, while EU companies are still lacking the foresight to see what is happening on the grand scale.

      • nickphx 4 hours ago

        ai is useless. anyone claiming otherwise is dishonest

        • la_fayette 3 hours ago

          I use GenAI for text translation, text 2 voice and voice 2 text, there it is extremely useful. For coding I often have the feeling it is useless, but also sometimes it is useful, like most tools...

        • whattheheckheck 4 hours ago

          What are you doing at your job that AI can't help with at all, to consider it completely useless?

        • ghurtado 4 hours ago

          That could even be argued (with an honest interlocutor, which you clearly are not)

          The usefulness of your comment, on the other hand, is beyond any discussion.

          "Anyone who disagrees with me is dishonest" is some kindergarten level logic.

        • ulfw 4 hours ago

          [Deleted as Hackernews is not for discussion of divergent opinions]

          • wiseowise 3 hours ago

            > It's not useless but it's not good for humanity as a whole.

            Ridiculous statement. Is Google also not good for humanity as a whole? Is Internet not good for humanity as a whole? Wikipedia?

      • the_mitsuhiko 4 hours ago

        > EU waking hours have comments that seem disconnected from genAI. And, while the US hours show a lot of resistance, it's more fear than a feeling that the tools are worthless.

        I don't think it's because the audience is different but because the moderators are asleep when Europeans are up. There are certain topics which don't really survive on the frontpage when moderators are active.

        • jagged-chisel 4 hours ago

          I'm unsure how you're using "moderators." We, the audience, are all 'moderators' if we have the karma. The operators of the site are pretty hands-off as far as content in general.

          This would mean it is because the audience is different.

          • the_mitsuhiko 3 hours ago

            I’m referring to the actual moderators of this website removing posts from the front page.

            • verdverm 8 minutes ago

              that's a conspiracy theory

              The by far more common action is for the mods to restore a story which has been flagged to oblivion by a subset of the HN community, where it then lands on the front page because it already has sufficient pointage

          • uoaei 4 hours ago

            The people who "operate" the website are different from the people who "moderate" the website but both are paid positions.

            This fru-fru about how "we all play a part" is only serving to obscure the reality.

            • delinka 4 hours ago

              I'm sure this site works quite differently from what you say. There's no paid team of moderators flicking stories and comments off the site because management doesn't like them.

              There's dang who I've seen edit headlines to match the site rules. Then there's the army of users upvoting and flagging stories, voting (up and down) and flagging comments. If you have some data to backup your sentiments, please do share it - we'd certainly like to evaluate it.

              • verdverm 3 minutes ago

                HN brought on a second mod (Tim, this year, iirc)

                My email exchanges with Dang, as part of the moderation that happens around here, have all been positive

                1. I've been moderated, got a slowdown timeout for a while

                2. I've emailed about specific accounts, (some egregious stuff you've probably never seen)

                3. Dang once emailed me to ask why I flagged a story that was near the top, but getting heavily flagged by many users. He sought understanding before making moderation choices

                I will defend HN moderation people & policies 'til the cows come home. There is nothing close to what we have here on HN, which is largely about us being involved in the process and HN having a unique UX and size

            • throwaway13337 4 hours ago

              As an anonymous coward on HN for at least a decade, I'd say that's not really true.

              When paul graham was more active and respected here, I spoke negatively about how revered he was. I was upvoted.

              I also think VC-backed companies are not good for society. And have expressed as much. To positive response here.

              We shouldn't shit on one of the few bastions of the internet we have left.

              I regret my negativity around pg - he was right about a lot and seems to be a good guy.

        • jamesblonde 2 hours ago

          Anything sovereign AI or whatever is gone immediately when the mods wake up. Got an EU cloud article? Publish it at 11am CET, it disappears around 12:30.

    • deepdarkforest 4 hours ago

      On the foundational level, test-time compute (reasoning), heavy RL post-training, 1M+ context lengths, etc.

      On the application layer, connecting with sandboxes/VMs is one of the biggest shifts (Cloudflare's Code Mode etc). Giving an LLM a sandbox unlocks on-the-fly computation, calculations, RPA, anything really.

      MCPs, or rather standardized function calling, are another one.

      Also, local LLMs are becoming almost viable because of better and better distillation, relying on quick web search for facts, etc.

    • WA 5 hours ago

      Not the LLMs. The APIs got more capabilities such as tool/function calling, explicit caching etc.

      • dcre 5 hours ago

        It is the LLMs because they have to be RLed to be good at these things.

    • echelon 5 hours ago

      We started putting them in image and video models and now image and video models are insane.

      I think the next period of high and rapid growth will be in media (image, video, sound, 3D), not text.

      It's much harder to adapt LLMs to solving business use cases with text. Each problem is niche, you have to custom tailor the solution, and the tooling is crude.

      The media use cases, by contrast, are low hanging fruit and result in 10,000x speedups and cost reductions almost immediately. The models are pure magic.

      I think more companies would be wise to ignore text for now and focus on visual domain problems.

      Nano Banana has so much more utility than agents. And there are so many low hanging fruit ways to make lots of money.

      Don't sleep on image and video. That's where the growth salient is.

      • wild_egg 5 hours ago

        > Nano Banana has so much more utility than agents.

        I am so far removed from multimedia spaces that I truly can't imagine a universe where this could be true. Agents have done incredible things for me and Nano Banana has been a cool gimmick for making memes.

        Anyone have a use case for media models that'll expand my mind here?

        • echelon 4 hours ago

          We now have capacity to program and automate in the optics, signals, and spatial domains.

          As someone in the film space, here's just one example: we are getting extremely close to being able to make films with only AI tools.

          Nano Banana makes it easy to create character and location consistent shots that adhere to film language and the rules of storytelling. This still isn't "one shot", and considerable effort still needs to be put in by humans. Not unlike AI assistance in IDEs requiring a human engineer pilot.

          We're entering the era of two person film studios. You'll undoubtedly start seeing AI short films next year. I had one art school professor tell me that film seems like it's turning into animation, and that "photorealism" is just style transfer or an aesthetic choice.

          The film space is hardly the only space where these models have utility. There are so many domains. News, shopping, gaming, social media, phone and teleconference, music, game NPCs, GIS, design, marketing, sales, pitching, fashion, sports, all of entertainment, consumer, CAD, navigation, industrial design, even crazy stuff like VTubing, improv, and LARPing. So much of what we do as humans is non-text based. We haven't had effective automation for any of this until this point.

          This is a huge percentage of the economy. This is actually the beating heart of it all.

          • yunwal 4 hours ago

            > we are getting extremely close to being able to make films with only AI tools

            AI still can’t reliably write text on background details. It can’t get shadows right. If you ask it to shoot things from a head on perspective, for example a bookshelf, it fails to keep proportions accurate enough. The bookshelf will not have parallel shelves. The books won’t have text. If in a library, the labels will not be in Dewey decimal order.

            It still lacks a huge amount of understanding about how the world works necessary to make a film. It has its uses, but pretending like it can make a whole movie is laughable.

            • wild_egg 2 hours ago

              I don't think they're suggesting AI could one-shot a whole movie. It would be iterative, just like programming.

            • gabriel666smith 2 hours ago

              I don't think equating "extremely close" with "pretending like it can" is a fair way to frame the sentiment of the comment you were replying to. Saying something is close to doing something is not the same as saying it already can.

              In terms of cinema tech, it took us arguably until the early 1940s to achieve "deep focus in artificial light". About 50 years!

              The last couple of years of development in generative video looks, to me, like the tech is improving more quickly than the tech it is mimicking did. This seems unsurprising - one was definitely a hardware problem, and the other is most likely a mixture of hardware and software problems.

              Your complaints (or analogous technical complaints) would have been acceptable issues - things one had to work around - for a good deal of cinema history.

              We've already reached people complaining about "these book spines are illegible", which feels very close to "it's difficult to shoot in focus, indoors". Will that take four or five decades to achieve, based on the last 3 - 5 years of development?

              The tech certainly isn't there yet, nor am I pretending like it is, and nor was the comment you replied to. To call it close is not laughable, though, in the historical context.

              The much more interesting question is: At what point is there an audience for the output? That's the one that will actually matter - not whether it's possible to replicate Citizen Kane.

  • exe34 4 hours ago

    if we wait long enough, we just end up dead, so it turns out we didn't need to do anything at all whatsoever. of course there's a balance - oftentimes starting out and growing up with the technology gives you background and experience that gives you an advantage when it hits escape velocity.

moinism 5 hours ago

Amen. Been seeing these agent SDKs coming out left and right for a couple of years and thought it'd be a breeze to build an agent. Now I've been trying to build one for ~3 weeks, and I've tried three different SDKs and a couple of architectures.

Here's what I found:

- Claude Code SDK (now called Agent SDK) is amazing, but I think they are still in the process of decoupling it from Claude Code, and that's why a few things are weird. E.g., you can define a subagent programmatically, but not skills. Skills have to be placed in the filesystem and then referenced in the plugin. Also, only Anthropic models are supported :(

- OpenAI's SDK's tight coupling with their platform is a plus point, i.e., you get agent and tool-use traces by default in your dashboard, which you can later use for evaluation, distillation, or fine-tuning. But: 1. They have agent handoffs (which work in some cases), but not subagents. You can use tools as subagents, though. 2. It's not easy to use a third-party model provider. Their docs provide sample code, but it's not as easy as that.

- Google Agent Kit doesn't provide any Typescript SDK yet. So didn't try.

- Mastra, even though it looks pretty sweet, spins up a server for your agent, which you can then use via REST API. umm.. why?

- SmythOS SDK is the one I'm currently testing because it provides flexibility in terms of choosing the model provider and defining your own architecture (handoffs or subagents, etc.). It has its quirks, but I think it'll work for now.

Question: If you don't mind sharing, what is your current architecture? Agent -> SubAgents -> SubSubAgents? Linear? or a Planner-Executor?

I'll write a detailed post about my learnings from architectures (fingers crossed) soon.

  • peab 3 hours ago

    I think the term sub-agent is almost entirely useless. An agent is an LLM loop that has reasoning and access to tools.

    A "sub agent" is just a tool. It's implantation should be abstracted away from the main agent loop. Whether the tool call is deterministic, has human input, etc, is meaningless outside of the main tool contract (i.e Params in Params out, SLA, etc)

    • moinism 2 hours ago

      I agree, technically, "sub agent" is also another tool. But I think it's important to differentiate tools with deterministic input/output from those with reasoning ability. A simple 'Tool' will take the input and try to execute, but the 'subagent' might reason that the action is unnecessary and that the required output already exists in the shared context. Or it can ask a clarifying question from the main agent before using its tools.

    • verdverm an hour ago

      ADK differentiates between tools and subagents based on the ability to escalate or transfer control (subagents), whereas tools are more basic

      I think this is a meaningful distinction, because it impacts control flow, regardless of what they are called. The lexicon varies quite a bit vendor to vendor

      • peab 37 minutes ago

        Are there any examples of implementations of this that actually work, and/or are useful? I've seen people write about this, but I haven't seen it anywhere

        • verdverm 16 minutes ago

          I think in ADK, the most likely place to find them actually used is the Workflow agent interfaces (sequential, parallel, loop). Perhaps looping, where it looks like they suggest you have an agent that determines if the loop is done and escalates with that message to the Looper.

          https://google.github.io/adk-docs/agents/workflow-agents/

          I haven't gotten there yet, still building out the basics like showing diffs instead of blind writing and supporting rewind in a session

    • Vinnl an hour ago

      What does "has reasoning" mean? Isn't that just a system prompt that says something like "make a plan" and includes that in the loop?

      • peab 36 minutes ago

        You actually probably don't need reasoning, as the old non reasoning models like 4o can do this too.

        In the past, the agent-type flows would work better if you prompted the LLM to write down a plan, or reasoning steps on how to accomplish the task with the available tools. These days, the new models are trained to do this without prompting

    • ColinEberhardt 3 hours ago

      Oh, so _that_ is what a sub-agent is. I have been wondering about that for a while now!

  • verdverm an hour ago

    Google's ADK is pretty nice, I'm using the Go version, which is less mature than the Python one. Been at it a bit over a week and progress is great. This weekend I'm aiming for tracking file changes in the session history to allow rewinding / forking

    It has a ton of day 2 features, really nice abstractions, and positioned itself well in terms of the building blocks and constructing workflows.

    ADK supports working with all the vendors and local LLMs

    • dragonwriter an hour ago

      I really wish ADK had a local persistent memory implementation, though.

  • otterley 4 hours ago

    Have you tried AWS’s Strands Agents SDK? I’ve found it to be a very fluent and ergonomic API. And it doesn’t require you to use Bedrock; most major vendor native APIs are supported.

    (Disclaimer: I work for AWS, but not for any team involved. Opinions are my own.)

    • moinism 3 hours ago

      This looks good. Even though it's only in Python, I think it's worth a try. Thanks.

  • ph4rsikal 4 hours ago

    My favourite is Smolagents from Huggingface. You can easily mix and match their models in your agents.

    • moinism 2 hours ago

      Dude, it looks great, but I just spent half an hour learning about its 'CodeAgents' feature. Which essentially is 'actions written as code'.

      This idea has been floating around in my head, but it wasn't refined enough to implement. It's so wild that what you're thinking of may have already been done by someone else on the internet.

      https://huggingface.co/docs/smolagents/conceptual_guides/int...

      For those who are wondering, it's kind of similar to the 'Code Mode' idea implemented by Cloudflare and now being explored by Anthropic: write code to discover and call MCPs instead of stuffing the context window with their definitions.

  • mountainriver 5 hours ago

    The frameworks are all pointless, just use AI assist to create agents in python or ideally a language with concurrency.

    You will be happy you did

    • moinism 5 hours ago

      How do you deal with the different APIs/Tooluse schema in a custom build? As other people have mentioned, it's a bigger undertaking than it sounds.

      • koakuma-chan 3 hours ago

        You can just tell the AI which format you want the input in, in natural language.

        • verdverm an hour ago

          you're wasting valuable context with approaches like that

          save it for more interesting tasks

  • thewhitetulip 2 hours ago

    Did you try langchain/langgraph? Am I confusing what the OP means by agents?

Vanclief 34 minutes ago

We have been working hard for the past two months implementing agents for different tasks. We started with Claude Code, as we really liked working with hooks. However, being vendor-locked and having usage-limit problems, we ended up implementing our own "runtime" that keeps the conversation structure while having hooks. Right now it only works with OpenAI, but it's designed to be able to incorporate Claude / Gemini down the road.

We ended up open sourcing that runtime if anyone is interested:

https://github.com/Vanclief/agent-composer

postalcoder 7 hours ago

I've been building agent type stuff for a couple years now and the best thing I did was build my own framework and abstractions that I know like the back of my hand.

I'd steer clear of any LLM abstraction. There are so many companies with open source abstractions offering the panacea of a single interface that are crumbling under their own weight due to the sheer futility of supporting every permutation of every SDK evolution, all while the same companies try to build revenue-generating businesses on top of them.

  • sathish316 6 hours ago

    I agree with your analysis of building your own Agent framework to have some level of control and fewer abstractions. Agents at their core are about programmatically talking to an LLM and performing these basic operations:

    1. Structured input and string interpolation in prompts

    2. Structured output and unmarshalling the string response into structured output (this is getting easier now with LLMs supporting structured output)

    3. Tool registry/discovery (of MCP and function tools), tool calls and response looping

    4. Composability of tools

    5. Some form of agent-to-agent delegation
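
    A rough sketch of operations 1-3 with plain Pydantic (my own illustration, not tied to any particular framework; llm_call is a hypothetical injected function): interpolate structured input into the prompt, unmarshal the string response into a typed object, and keep tools in a small registry.

        from pydantic import BaseModel

        class Invoice(BaseModel):            # operation 2: structured output target
            vendor: str
            total: float

        PROMPT = "Extract the invoice from:\n{document}\nReturn JSON with vendor and total."

        TOOL_REGISTRY = {                    # operation 3: name -> callable (shape only)
            "lookup_vendor": lambda name: f"vendor record for {name}",
        }

        def run_once(document: str, llm_call) -> Invoice:
            raw = llm_call(PROMPT.format(document=document))   # operation 1: interpolation
            return Invoice.model_validate_json(raw)            # operation 2: unmarshalling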

    I’ve had good luck with using PydanticAI which does these core operations well (due to the underlying Pydantic library), but still struggles with too many MCP servers/Tools and composability.

    I've built an open-source Agent framework called OpusAgents that makes the process of creating Agents, Subagents, and Tools simpler than MCP servers, without overloading the context. Check it out here, with tutorials/demos to see how it's more reliable than generic Agents with MCP servers in Cursor/ClaudeDesktop - https://github.com/sathish316/opus_agents

    It’s built on top of PydanticAI and FastMCP, so that all non-core operations of Agents are accessible when I need them later.

    • drittich 5 hours ago

      This sounds interesting. What about the agent behavior itself? How it decides how to come at a problem, what to show the user along the way, and how it decides when to stop? Are these things you have attempted to grapple with in your framework?

      • sathish316 3 hours ago

        The framework has the following capabilities:

        1. A way to create function tools

        2. A way to create specialised subagents that can use their own tools or their own model. The main agent can delegate to a subagent exposed as a tool. Subagents don't get confused because they have their own context window, tools and even models (mix and match a remote LLM with a local LLM if needed)

        3. Don’t use all tools of the MCP servers you’ve added. Filter out and select only the most relevant ones for the problem you’re trying to solve

        4. HigherOrderTool is a way to callMCPTool(toolName, input) in places where the Agent to MCP interface can be better suited for the problem than what’s exposed as a generic interface by the MCP provider - https://github.com/sathish316/opus_agents/blob/main/docs/GUI... . This is similar to Anthropic’s recent blogpost on Code tools being better than MCP - https://www.anthropic.com/engineering/code-execution-with-mc...

        5. MetaTool is a way to use ready made patterns like OpenAPI and not having to write a tool or add more MCP servers to solve a problem - https://github.com/sathish316/opus_agents/blob/main/docs/GUI... . This is similar to a recent HN post on Bash tools being better for context and accuracy than MCP - https://mariozechner.at/posts/2025-11-02-what-if-you-dont-ne...

        Other than AgentBuilder, CustomTool, HigherOrderTool, MetaTool, SubagentBuilder the framework does not try to control PydanticAI’s main agent behaviour. The high level approach is to use fewer, more relevant tools and let LLM orchestration and prompt tool references drive the rest. This approach has been more reliable and predictable for a given Agent based problem.

    • spacecadet 5 hours ago

      I also recommend this. I have tried all of the frameworks, and still deploy some for some clients - but for my personal agents, it's my own custom framework that is dead simple and very easy to spin up, extend, etc.

  • the_mitsuhiko 7 hours ago

    Author here. I'm with you on the abstractions part. I dumped a lot of my thoughts on this into a follow-up post: https://lucumr.pocoo.org/2025/11/22/llm-apis/

    • thierrydamiba 6 hours ago

      Excellent write up. I've been thinking a lot about caching and agents so this was right up my alley.

      Have you experimented with using semantic cache on the chain of thought(what we get back from the providers anyways) and sending that to a dumb model for similar queries to “simulate” thinking?

  • NitpickLawyer 6 hours ago

    Yes, this is great advice. It also applies to interfaces. When we designed a support "chat bot", we went with a different architecture than what's out there already. We designed the system around "chat rooms" instead, and the frontend just dumps messages to a chatroom (with a session id). Then on the backend we can do lots of things, incrementally adding functionality, while the frontend doesn't have to keep up. We can also do things like group messages, have "system" messages that other services can read, etc. It also feels more natural, as the client can type additional info while the server is working, etc.
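
    A toy sketch of the chat-room idea as I read it (field names, storage, and routing are all made up): the frontend only ever appends messages to a room keyed by session id, and backend workers read the room and append their own messages, so new functionality never requires frontend changes.

        from collections import defaultdict
        from dataclasses import dataclass
        from typing import List

        @dataclass
        class Message:
            sender: str        # "user", "assistant", "system", or another service
            body: str

        ROOMS: dict = defaultdict(list)      # session_id -> list[Message]

        def post(session_id: str, sender: str, body: str) -> None:
            ROOMS[session_id].append(Message(sender, body))

        def history(session_id: str) -> List[Message]:
            # Backend workers poll the room and react to the messages they care about.
            return list(ROOMS[session_id])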

    If you have to use some of the client side SDKs, another good idea is to have a proxy where you can also add functionality without having to change the frontend.

    • verdverm 21 minutes ago

      This is not so unlike the coding agent I'm building for vs code. One of the things I'm doing is keeping a snapshot of the current vs code state (files open, terminal history, etc) in the agent server. Similarly, I track the file changes without actually writing them until the user approves the diff, so there are some "filesystem" like things that need to be carefully managed on each side.

      tl;dr, Both sides are broadcasting messages and listening for the ones they care about.

    • postalcoder 6 hours ago

      Creativity is an underrated hard part of building agents. The fun part of building right now is knowing how little of the design space for building agents has been explored.

      • spacecadet 5 hours ago

        This! I keep telling people that if tool use was not an aha moment relative to AI agents for you, then you need to be more creative...

  • _pdp_ 6 hours ago

    This is a huge undertaking though. Yes, it is quite simple to build some basic abstraction on top of openai.complete or similar, but this is like 1% of what an agent needs to do.

    My bet is that agent frameworks and platforms will become more like game engines. You can spin up your own engine for sure, and it is fun and rewarding. But AAA studios will most likely decide to use a ready-to-go platform with all the batteries included.

    • postalcoder 6 hours ago

      In totality, yes. But you don't need every feature at once. You add to it once you hit boundaries. But I think the most important thing about this exercise is that you leave nothing to the imagination when building agents.

      The non-deterministic nature of LLMs already makes the performance of agents so difficult to interpret. Building agents on top of code that you cannot mentally trace through leads to so much frustration when addressing model underperformance and failure.

      It's hard to argue that after the dust settles companies will default to batteries-included frameworks, but, right now, a lot of people I've seen have regretted adopting a large framework off the bat.

mritchie712 6 hours ago

Some things we've[0] learned on agent design:

1. If your agent needs to write a lot of code, it's really hard to beat Claude Code (cc) / Agent SDK. We've tried many approaches and frameworks over the past 2 years (e.g. PydanticAI), but using cc is the first that has felt magic.

2. Vendor lock-in is a risk, but the bigger risk is having an agent that is less capable than what a user gets out of ChatGPT because you're hand-rolling every aspect of your agent.

3. cc is incredibly self aware. When you ask cc how to do something in cc, it instantly nails it. If you ask cc how to do something in framework xyz, it will take much more effort.

4. Give your agent a computer to use. We use e2b.dev, but Modal is great too. When the agent has a computer, it makes many complex features feel simple.

0 - For context, Definite (https://www.definite.app/) is a data platform with agents to operate it. It's like Heroku for data with a staff of AI data engineers and analysts.

  • CuriouslyC 6 hours ago

    Be careful about what you hand off to Claude versus another agent. Claude is a vibe project monster, but it will fail at hard things, come up with fake solutions, and then lie to you about them. To the point that it'll add random sleeps and do pointless work to cover up the fact that it's reward hacking. It's also very messy.

    For brownfield work, work on hard stuff, or work in big complex codebases, you'll save yourself a lot of pain if you use Codex instead of CC.

    • wild_egg 3 hours ago

      Claude is amazing at brownfield if you take the time to experiment with your approach.

      Codex is stronger out of the box but properly customized Claude can't be matched at the moment

      • CuriouslyC 2 hours ago

        The issues with Claude are twofold:

        1. Poor long context performance compared to GPT5.1, so Claude gets confused about things when it has to do exploration in the middle of a task.

        2. Claude is very completion driven, and very opinionated, so if your codebase has its own opinions Claude will fight you, and if there are things that are hard to get right, rather than come back and ask for advice, Claude will try to stub/mock it ("let's try a simpler solution...") which would be fine, except that it'll report that it completed the task as written.

      • gnat 2 hours ago

        What have you done to make Claude stronger on brownfields work? This is very interesting to me.

  • verdverm 29 minutes ago

    yes, we should all stop experimenting and outsource our agentic workflows to our new overlords...

    this will surely end up better than where big tech has already brought our current society...

    For real though, where did the dreamers about ai / agentic free of the worst companies go? Are we in the seasons of capitulation?

    My opinion... build, learn, share. The frameworks will improve, the time to custom agent will be shortened, the knowledge won't be locked in another unicorn

    anecdotally, I've come quite far in just a week with ADK and VS Code extensions, having never done extensions before, which has been a large part of the time spent

  • faxmeyourcode 4 hours ago

    Point 2 is very often overlooked. Building products that are worse than the baseline ChatGPT website is very common.

  • smcleod 6 hours ago

    It's quite worrying that several times in the last few months I have had to really drive home why people should probably not be building bespoke agentic systems just to essentially act as a half-baked version of an agentic coding tool, when they could just go use Claude Code and instead focus their efforts on creating value rather than instant technical debt.

    • CuriouslyC 6 hours ago

      You can pretty much completely reprogram agents just by passing them through a smart proxy. You don't need to rewrite claude/codex, just add context engineering and tool behaviors at the proxy layer.
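
      A very rough sketch of the idea (my illustration; the payload shape is the common messages/tools request format, and the house rule and tool name are made up): a proxy sits between the agent and the model API and rewrites each request body in flight.

          def proxy_request(payload: dict) -> dict:
              """payload is the JSON body the agent was about to send to the model API."""
              msgs = payload.get("messages", [])
              # Context engineering: prepend house rules the stock agent doesn't know about.
              msgs.insert(0, {"role": "system",
                              "content": "Use the internal build tool, not npm."})
              # Tool behaviors: hide tools we don't want callable in this repo.
              payload["tools"] = [t for t in payload.get("tools", [])
                                  if t.get("name") != "web_search"]
              payload["messages"] = msgs
              return payload   # forward the rewritten body upstream otherwise unchanged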

      • RamtinJ95 17 minutes ago

        This sounds very intriguing, any resources around this approach you can point me towards?

      • mritchie712 5 hours ago

        yep, that's exactly where we've landed.

        focus on the tools and context, let claude handle the execution.

ReDeiPirati 3 hours ago

> We find testing and evals to be the hardest problem here. This is not entirely surprising, but the agentic nature makes it even harder. Unlike prompts, you cannot just do the evals in some external system because there’s too much you need to feed into it. This means you want to do evals based on observability data or instrumenting your actual test runs. So far none of the solutions we have tried have convinced us that they found the right approach here.

I'm curious about the solutions the op has tried so far here.

  • ColinEberhardt 3 hours ago

    Likewise. I have a nasty feeling that most AI agent deployments happen with nothing more than some cursory manual testing. Going with the 'vibes' (to coin an overused term in the industry).

    • radarsat1 an hour ago

      A lot of "generative" work is like this. While you can come up with benchmarks galore, at the end of the day how a model "feels" only seems to come out from actual usage. Just read /r/localllama for opinions on which models are "benchmaxed" as they put it. It seems to be common knowledge in the local LLM community that many models perform well on benchmarks but that doesn't always reflect how good they actually are.

      In my case I was until recently working on TTS and this was a huge barrier for us. We used all the common signal quality and MOS-simulation models that judged so called "naturalness" and "expressiveness" etc. But we found that none of these really helped us much in deciding when one model was better than another, or when a model was "good enough" for release. Our internal evaluations correlated poorly with them, and we even disagreed quite a bit within the team on the quality of output. This made hyperparameter tuning as well as commercial planning extremely difficult and we suffered greatly for it. (Notice my use of past tense here..)

      Having good metrics is just really key and I'm now at the point where I'd go as far as to say that if good metrics don't exist it's almost not even worth working on something. (Almost.)

  • verdverm an hour ago

    ADK has a few pages and some API for evaluating agentic systems

    https://google.github.io/adk-docs/evaluate/

    tl;dr - challenging because different runs produce different output, also how do you pass/fail (another LLM/agent is what people do)

eclipsetheworld 4 hours ago

We're repeating the same overengineering cycle we saw with early LangChain/RAG stacks. Just a couple of months ago the term agent was hard to define, but I've realized the best mental model is just a standard REPL:

Read: Gather context (user input + tool outputs).

Eval: LLM inference (decides: do I need a tool, or am I done?).

Print: Execute the tool (the side effect) or return the answer.

Loop: Feed the result back into the context window.

Rolling a lightweight implementation around this concept has been significantly more robust for me than fighting with the abstractions in the heavy-weight SDKs.
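
A minimal sketch of what such a lightweight loop can look like (llm() and the action format are hypothetical; no real SDK involved):

    def agent_repl(user_input: str, llm, tools: dict, max_turns: int = 20) -> str:
        context = [{"role": "user", "content": user_input}]        # Read
        for _ in range(max_turns):
            action = llm(context)                                   # Eval
            if action["type"] == "final":
                return action["content"]                            # Print: the answer
            result = tools[action["tool"]](**action["args"])        # Print: the side effect
            context.append({"role": "tool",
                            "name": action["tool"],
                            "content": str(result)})                # Loop
        return "gave up after max_turns"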

  • throw310822 3 hours ago

    I don't think this has much to do with SDKs. I've developed my own agent code from scratch (starting from the simple loop) and eventually - unless your use case is really simple - you always have to deal with the need for subagents specialised for certain tasks, that share part of their data (but not all) with the main agent, with internal reasoning and reinforcement messages, etc.

    • eclipsetheworld 3 hours ago

      Interestingly, sticking to the "Agent = REPL" mental model is actually what helped me solve those specific scaling problems (sub-agents and shared data) without the SDK bloat.

      1. Sub-agents are just stack frames. When the main loop encounters a complex task, it "pushes" a new scope (a sub-agent with a fresh, empty context). That sub-agent runs its own REPL loop, returns only the clean result without any context pollution, and is then "popped".

      2. Shared Data is the heap. Instead of stuffing "shared data" into the context window (which is expensive and confusing), I pass a shared state object by reference. Agents read/write to the heap via tools, but they only pass "pointers" in the conversation history. In the beginning this was just a Python dictionary and the "pointers" were keys.

      My issue with the heavy SDKs isn't that they try to solve these problems, but that they often abstract away the state management. I’ve found that explicitly managing the "stack" (context) and "heap" (artifacts) makes the system much easier to debug.
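
      In code, the two points above look roughly like this (a toy sketch building on the agent_repl loop above; everything here is illustrative): sub-agents get a fresh context, artifacts live on a shared dict, and only keys travel through the conversation.

          HEAP: dict[str, object] = {}                 # shared artifacts ("heap")

          def push_subagent(task: str, llm, tools: dict) -> str:
              """Run a sub-agent in a fresh context ("stack frame") and return a pointer."""
              result = agent_repl(task, llm, tools)     # own loop, empty context
              key = f"artifact:{len(HEAP)}"
              HEAP[key] = result                        # write the blob to the heap
              return key                                # hand back only the pointer

          def read_artifact(key: str) -> object:
              """Tool the main agent calls to dereference a pointer when it needs the data."""
              return HEAP[key]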

mitchellh 5 hours ago

This is why I use the agent I use. I won't name the company, because I don't want people to think I'm a shill for them (I've already been accused of it before, but I'm just a happy, excited customer). But it's an agentic coding company that isn't associated with any of the big model providers.

I don't want to keep up with all the new model releases. I don't want to read every model card. I don't want to feel pressured to update immediately (if it's better). I don't want to run evals. I don't want to think about when different models are better for different scenarios. I don't want to build obvious/common subagents. I don't want to manage N > 1 billing entities.

I just want to work.

Paying an agentic coding company to do this makes perfect sense for me.

  • pjm331 2 hours ago

    I’ve been surprised at the lack of discussion about sourcegraph’s Amp here which I’m pretty sure you’re referring to - it started a bit rough but these days I find that it’s really good

havkom 5 hours ago

My tip is: don't use SDKs for agents. Use a while loop and craft your own JSON, handle context size and handle faults yourself. You will in practice need this level of control if you are not doing something trivial.
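
A hedged sketch of the "handle context size and faults yourself" part (the character budget and message shapes are arbitrary placeholders): trim old turns when the context grows, and retry when the model returns malformed JSON.

    import json

    MAX_CHARS = 60_000   # crude stand-in for a real token budget

    def trim(context: list) -> list:
        while sum(len(m["content"]) for m in context) > MAX_CHARS and len(context) > 2:
            context.pop(1)               # drop the oldest non-system message
        return context

    def call_with_retries(llm, context: list, retries: int = 2) -> dict:
        for _ in range(retries + 1):
            raw = llm(trim(context))
            try:
                return json.loads(raw)   # the action the model chose
            except json.JSONDecodeError:
                context.append({"role": "user",
                                "content": "Previous reply was not valid JSON. Reply with JSON only."})
        raise RuntimeError("model kept returning malformed JSON")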

cedws 6 hours ago

Going to hijack this to ask a question that’s been on my mind: does anybody know why there’s no agentic tools that use tree-sitter to navigate code? It seems like it would be much more powerful than having the LLM grepping for strings or rewriting entire files to change one line.

  • stanleykm 2 hours ago

    In my own measurements (I made a framework to test the number of tokens used / amount of reprompting to accomplish a battery of tests) I found that using an AST-type tool makes the results worse. I suspect it just fills the context with distractors. LLMs know how to search effectively so it's better to let them do that, as far as I can tell. At this point I basically don't use MCPs. Instead I just tell it that certain tools are available to it if it wants to use them.

  • cryptoz 6 minutes ago

    My side project parses Python, HTML, CSS, and JS into trees and instructs the LLM to write ast/astor/bs4/etc code to make changes to files. I haven't yet run a large test to see how much better this is (or worse?) than alternative approaches, but in general it is working well for my cases. I've packaged this up in a webapp that builds out AI-powered webapps by iterating on the code with AST transformations for all changes to files.

    You can try it out here: https://codeplusequalsai.com/ (and I would massively appreciate any feedback!)

  • dangoodmanUT 2 hours ago

    This is how the embeddings generation works, they just convert it to embeddings so it can use natural language

  • postalcoder 5 hours ago

    This doesn't fully satisfy your question, but it comes straight from bcherny (claude code dev):

    > Claude Code doesn't use RAG currently. In our testing we found that agentic search out-performed RAG for the kinds of things people use Code for.

    source thread: https://news.ycombinator.com/item?id=43163011#43164253

    • cedws 5 hours ago

      Thanks, I was also wondering why they don't use RAG.

      • the_mitsuhiko 5 hours ago

        They are using RAG. Grep is also just RAG. The better question is why they don’t use a vector database and honestly the answer is that these things are incredibly hard to keep in sync. And if they are not perfectly in sync, the utility drops quickly.

        • esafak 4 hours ago

          By RAG they mean vector search. They're calling grep "agentic search".

      • postalcoder 5 hours ago

        I forgot to mention that Aider uses a tree-sitter approach. It's blazing fast and great for quick changes. It's not the greatest agent for doing heavy edits, but I don't think that has to do with them not using grep.

  • the_mitsuhiko 5 hours ago

    > does anybody know why there’s no agentic tools that use tree-sitter to navigate code?

    I would guess the biggest reason is that there is no RL happening on the base models with tree-sitter as a tool. But there is a lot of RL with bash, and so it knows how to grep. I did experiment with giving tree-sitter and ast-grep to agents, and in my experience the results are mixed.

  • spacecadet 5 hours ago

    Create an agent with these tools and you will. Agent tools are just functions, but think of them like skills. The more knowledge and skills your agents (plural, I'd recommend more than one) have access to, the more useful they become.

lvl155 5 hours ago

Design is hard because models change almost on a weekly basis. Quality abruptly falls off or drastically changes without announcements. It’s like building a house without proper foundation. I don’t know how many tokens I wasted because of this. I want to say 30% of the cost and resources this year for me.

CuriouslyC 6 hours ago

The 'Reinforcement in the Agent Loop' section is a big deal, I use this pattern to enable async/event steered agents, it's super powerful. In long context you can use it to re-inject key points ("reminders"), etc.
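
A toy sketch of the reminder re-injection pattern (my own illustration; the message shape and cadence are assumptions, not the article's code):

    REMINDERS = [
        "Stay within the ./sandbox directory.",
        "Ask before running destructive commands.",
    ]

    def inject_reminders(context: list, turn: int, every_n_turns: int = 5) -> list:
        """Every N turns, append the key constraints again so they stay in recent context."""
        if turn and turn % every_n_turns == 0:
            context.append({"role": "system",
                            "content": "Reminder:\n- " + "\n- ".join(REMINDERS)})
        return context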

  • pjm331 3 hours ago

    Yes that was the one point in here where I thought to myself oh yeah I’m going to go implement that immediately

fudged71 an hour ago

If I understand correctly, Claude Code Agent SDK can’t edit itself (to incrementally improve), can’t edit or create its own agents and skills.

I’ve found that by running Claude Code within Manus sandbox I can inspect the reasoning traces and edit the Agents/Skills with the Manus agent.

elvin_d 2 hours ago

My experience with BAML was really good. For structured outputs, JSON schema is slow and sluggish while BAML performs well. Interestingly, it doesn't get much attention on HN; I'm wondering whether it's the hype around other SDKs or whether BAML doesn't deliver enough.

Frannky 3 hours ago

I use CLI terminals as agent frameworks. You have big tech and open source behind them. All the problems are solved with zero work. They take care of new problems. You don't need to remove all the stuff that becomes outdated because the latest model doesn't make the same mistakes. Free via free tiers, cheap via OpenAI-compatible open source models like z.ai. Insanely smart and can easily talk to MCP servers.

dangoodmanUT 2 hours ago

I think output functions aren't necessary; you should just use the text output when the agentic loop ends, and prompt for what kind of output you want (markdown, summary of changes, etc.)

srameshc 6 hours ago

I still feel there is no sure-shot way to build an abstraction yet. Probably that is why Lovable decided to build on Gemini rather than giving the option of choosing a model. On the other hand, I like the Pydantic AI framework and got myself a decent working solution where my preference is to stick with cheaper models by default and only use expensive ones in cases where the failure rate is too high.

  • _pdp_ 6 hours ago

    This is true. Based on my experience with real customers, they really don't know what the difference is between the different models.

    What they want is to get things done. The model is simply a means to an end. As long as the task is completed, everything else is secondary.

    • morkalork 6 hours ago

      I feel like there's always that "one guy" who has a preference and asks for an option to choose, and, it's a way for developers to offload the decision. You can't be blamed for the wrong one if you put that decision on the user.

      Tbh I agree with your approach and use it when building stuff for myself. I'm so over yak shaving that topic.

jfghi 2 hours ago

Technologically and mathematically this is all quite interesting. However, I have no desire to ever use an agent and I doubt there are many that will.

  • kvirani 2 hours ago

    No desire because you don't need them to optimize anything for yourself, or ?

Yokohiii 6 hours ago

Most of it reads like too-high expectations of an overhyped technology.

jonmagic 5 hours ago

Great post. Re: frameworks, I tried a number of them and then found Pocketflow and haven’t found a reason to try anything else since. It’s now been ported to half a dozen or more languages (including my port to Ruby). The simple api and mental model makes it easy for everyone on my team to jump into, extend, and compose. Highly recommend for anyone frustrated with the larger SDKs.

ColinEberhardt 3 hours ago

> We find testing and evals to be the hardest problem here …

I wonder what this means for the agents that people are deploying into production? Are they tested at all? Or just manual ad-hoc testing?

Sounds risky!

Fiveplus 6 hours ago

I liked reading this but got a silly question as I am a noob in these things. If explicit caching is better, does that mean the agent is just forgetting stuff unless we manually save its notes? Are these things really that forgetful? Also why is there a virtual file system? So the agent is basically just running around a tiny digital desktop looking for its files? Why can't the agent just know where the data is? I'm sorry if these are juvenile questions.

  • the_mitsuhiko 6 hours ago

    > If explicit caching is better, does that mean the agent is just forgetting stuff unless we manually save its notes?

    Caching is unrelated to memory; it's about not doing the same work over and over again, given the distributed nature of state. I wrote a post that goes into detail from first principles, with my current thoughts on that topic [1].

    > Are these things really that forgetful?

    No, not really. They can get side-tracked which is why most agents do a form of reinforcement in-context.

    > Why is there a virtual file system?

    So that you don't have dead-end tools. If a tool creates or manipulates state (which we represent on a virtual file system), another tool needs to be able to pick up the work.
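
    A toy illustration of the idea (not our actual implementation): tools pass paths on a shared virtual file system instead of dumping blobs into the context, so any tool can pick up where another left off.

        class VirtualFS:
            """Dict-backed stand-in for the virtual file system."""
            def __init__(self):
                self._files = {}

            def write(self, path, data):
                self._files[path] = data
                return path          # tools return paths, not payloads

            def read(self, path):
                return self._files[path]

        def fetch_report(vfs):
            # First tool produces state...
            return vfs.write("/reports/q3.csv", "region,revenue\nEMEA,123\n")

        def first_line(vfs, path):
            # ...and a second tool can pick it up by path, so there is no dead end.
            return vfs.read(path).splitlines()[0]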

    > Why can't the agent just know where the data is?

    And where would that data be?

    [1]: https://lucumr.pocoo.org/2025/11/22/llm-apis/

  • pjm331 6 hours ago

    You may be confusing caching and context windows. Caching is mainly about keeping inference costs down.
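
    For reference, "explicit" caching just means you mark the stable prefix yourself. With Anthropic's API it looks roughly like this (from memory, so double-check the current docs):

        import anthropic

        LONG_STABLE_SYSTEM_PROMPT = "..."  # the big, rarely-changing prefix

        client = anthropic.Anthropic()
        response = client.messages.create(
            model="claude-sonnet-4-20250514",   # illustrative model id
            max_tokens=1024,
            system=[{
                "type": "text",
                "text": LONG_STABLE_SYSTEM_PROMPT,
                # Everything up to this marker gets cached and billed at the cheaper cached rate.
                "cache_control": {"type": "ephemeral"},
            }],
            messages=[{"role": "user", "content": "Next step, please."}],
        )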

munro 2 hours ago

I built an agent with LangGraph about 9 months ago; now it seems ReAct agents are in LangChain. Overall I'm pretty happy with that. I just don't use any of the dumb stuff like the embedding/search layer: just tools & state.

But I was actually playing with a few frameworks yesterday and struggling. I want what I want without having to write it. ;) I ended up using the pydantic_ai package; I literally just want tools with pydantic validation, but out of the box it doesn't have good observability (you'd have to use their proprietary SaaS), and it comes bundled with Temporal.io (I hate that project). I had to write my own observability, which was annoying, and it sucks.

If anyone has built anything along these lines, I would love to know, and TypeScript is an option. I want:
- a ReAct agent with tools that have schema validation
- built-in realtime observability with a WebUI
- a customizable playground ChatUI (this is where TypeScript would shine)
- no corporate takeover tentacles

P.S.: I know... I purposely try to avoid hard recommendations on HN, to avoid enshittification. "Reddit best X" has been gamed, and I generally skip these subtly promotional posts.
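
For reference, the kind of thing I mean by "tools with schema validation" really doesn't need a framework (a bare-bones sketch; the tool and its schema are made up):

    import json, time
    from pydantic import BaseModel, ValidationError

    class SearchArgs(BaseModel):
        query: str
        limit: int = 5

    def search(args: SearchArgs) -> list[str]:
        return [f"result {i} for {args.query}" for i in range(args.limit)]

    TOOLS = {"search": (SearchArgs, search)}

    def call_tool(name: str, raw_args: str) -> str:
        """Validate the model's JSON arguments before running the tool, and log the call."""
        schema, fn = TOOLS[name]
        started = time.time()
        try:
            args = schema.model_validate_json(raw_args)
            return json.dumps(fn(args))
        except ValidationError as exc:
            return f"invalid arguments: {exc}"   # feed the error back to the model
        finally:
            print(f"[trace] {name} took {time.time() - started:.2f}s")   # swap for a real UI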

nowittyusername 3 hours ago

If you are building an agent, start from scratch and build your own framework. This will save you headaches and wasted time down the line. One of the issues when you use someone else's framework is that you miss out on learning and understanding important fundamentals about LLMs: how they work, context, and so on. Also, many developers don't learn the fundamentals of running LLMs locally and thus miss crucial context (giggidy) that would have helped them better understand the whole system. It seems to me that the author came to a similar conclusion, like many of us. I do want to add my own insights, though, which might be of use to some.

One of the things he talked about was getting reliable tool calling out of the model. I recommend trying the following approach. Have the agent perform a self-calibration exercise that makes it use its tools at the beginning of the context. Make it perform some complex tasks, and do this many times to test for coherence and accuracy while adjusting the system prompt towards more accurate tool calls. Once the agent has performed that calibration process successfully, you "freeze" the calibration context/history by broadening --keep n to include not just the system prompt in the rolling window but everything up to the end of the calibration session. Then, no matter how far the context window drifts, the conditioned tokens generated by that calibration session steer the agent towards proper tool use. From then on your "system prompt" includes those turns. Note that this is probably not possible with cloud-based models, since you don't have direct access to the inference engine. A hacky way around that is to emulate the conversation turns inside the system prompt.
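
Roughly what I mean by emulating the turns (a sketch only; the tool names and message contents are made up):

    # "Freeze" a successful calibration session by prepending its turns to every
    # request, so the conditioned tool-calling behaviour survives context trimming.
    CALIBRATION_TURNS = [
        {"role": "user", "content": "Calibration: list the files in /data."},
        {"role": "assistant", "content": '<call_tool>{"name": "list_files", "args": {"path": "/data"}}</call_tool>'},
        {"role": "user", "content": '<tool_result>["a.csv", "b.csv"]</tool_result>'},
        {"role": "assistant", "content": "<message_to_user>I found 2 files in /data.</message_to_user>"},
    ]

    def build_prompt(system_prompt, rolling_history):
        # The calibration turns sit right after the system prompt and are never trimmed,
        # similar in spirit to widening --keep n on a local inference engine.
        return [{"role": "system", "content": system_prompt}] + CALIBRATION_TURNS + rolling_history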

On the note of benchmark's. The calibration test is your benchmark from then on. When introducing new tools to the system prompt or adjusting any important variable, you must always rerun the same test to make sure the new adjustments you made don't negatively affect the system stability.

On context engineering: this is a must, since a bloated history will decohere the whole system. It's important to devise an automated system that compresses the context down while retaining the overall essence of the history. There are about a billion ways you could do this, and you will need to experiment a lot. LLMs are conditioned quite heavily by their own outputs, so being able to remove erroneous tool calls from the context is a big boon: the model becomes less likely to repeat the same mistakes. There are trade-offs, though; like he said, caching is a no-go when you go this route, but you gain a lot more control and stability within the whole system if you do it right. It's basically reliability vs. cost, and I tend to lean towards reliability. Also, I don't recommend using the model's whole context size. Most LLMs perform very poorly past a certain point, and I find that using at most 50% of the full context window is best for cohesion. Meaning that if, say, the max context window is 100k tokens, treat 50k as the max limit and start compressing the history around 35k tokens. A granular, step-wise system can be set up where the most recent context is the most detailed and uncompressed, and the further it is from the current turn, the less detailed it gets. Obviously you want to store the full uncompressed history for a subagent that uses RAG; this lets the agent see the whole thing in detail if it needs to.
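
The step-wise part can be as simple as bucketing turns by age; the numbers match the 100k example above, everything else is illustrative (summarize stands in for an LLM call):

    MAX_BUDGET = 50_000     # treat 50% of a 100k window as the ceiling
    COMPRESS_AT = 35_000    # start compressing the history around 35k tokens

    def compress_history(turns, count_tokens, summarize):
        """Recent turns stay verbatim, older turns get summarized harder."""
        if sum(count_tokens(t) for t in turns) < COMPRESS_AT:
            return turns
        recent, middle, old = turns[-10:], turns[-30:-10], turns[:-30]
        compressed = []
        if old:
            compressed.append(summarize(old, detail="one short paragraph"))   # oldest: heavy compression
        if middle:
            compressed.append(summarize(middle, detail="bullet points"))      # middle: lighter
        # The full uncompressed history still gets stored elsewhere for the RAG subagent.
        return compressed + recent                                             # newest: untouched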

Ahh, also on the matter of output: I found great success with making input and output channels for my agent. There are several channels the model is conditioned to use for specific interactions: a <think> channel for CoT and reasoning, a <message_to_user> channel for explicit messages to the user, a <call_agent> channel for calling agents and interacting with them, and <call_tool> for tool use, plus a few other environment and system channels that act as input channels from error scripts and the environment towards the agent. This channel segmentation also allows for better management of internal automated scripts and keeps the model focused. One other important thing: you need at least two separate output layers, meaning you separate the LLM's raw output from what is displayed to the user, each with its own rules. That lets you display information to the real human in a very readable way, hiding all the noise, while still retaining the crucial context the model needs to function appropriately.
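
The two-layer output split is basically just routing on those tags (a sketch):

    import re

    # Split the raw model output into channels, then decide which ones the human sees.
    CHANNEL_RE = re.compile(r"<(think|message_to_user|call_agent|call_tool)>(.*?)</\1>", re.S)
    USER_VISIBLE = {"message_to_user"}

    def route(raw_output: str):
        internal, display = [], []
        for channel, body in CHANNEL_RE.findall(raw_output):
            internal.append((channel, body.strip()))   # full trace kept for the model/context
            if channel in USER_VISIBLE:
                display.append(body.strip())           # only clean messages reach the human
        return internal, "\n".join(display)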

Bah, anyway, I've rambled for long enough. Good luck folks, hope this info helps someone.

nehalem 3 hours ago

I am glad Vercel is working on agents now. After all, Next is absolutely perfect, and that recommends them for greater challenges. /s

callamdelaney 3 hours ago

Agent design is still boring and I’m tired of hearing about it.