calebkaiser 12 hours ago

I'm biased in that I work on an open source project in this space, but I would strongly recommend starting with a free/open source platform for debugging/tracing, annotating, and building custom evals.

This niche of the field has come a very long way just over the last 12 months, and the tooling is so much better than it used to be. Trying to do this from scratch, beyond a "kinda sorta good enough for now" project, is a full-time engineering effort in and of itself.

I'm a maintainer of Opik, but you have plenty of options in the space these days for whatever your particular needs are: https://github.com/comet-ml/opik

  • mbanerjeepalmer 9 hours ago

    Yes, I'm not sure I really want to vibe code something that does auto evals on a sample of my OTEL traces any more than I want to build my own analytics library.

    Alternatives to Opik include Braintrust (closed), Promptfoo (open, https://github.com/promptfoo/promptfoo) and Laminar (open, https://github.com/lmnr-ai/lmnr).

    • Onawa 8 hours ago

      I've used and liked Promptfoo a lot, but I've run into issues when trying to do evaluations with too many independent variables. Works great for `models * prompts * variables`, but broke down when we wanted `models * prompts * variables^x`.
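
      To make the blow-up concrete, here's roughly what that grid looks like if you generate it by hand (plain Python with made-up axis names, not promptfoo's actual config format):

      ```python
      import itertools

      # Hypothetical grid, just to show the scale; these are not promptfoo config keys.
      models = ["model-a", "model-b"]
      prompts = ["terse_v1", "detailed_v2"]
      variables = {
          "tone": ["formal", "casual"],
          "language": ["en", "de"],
          "audience": ["expert", "novice"],
      }

      # models * prompts * variables^x: 2 * 2 * (2 * 2 * 2) = 32 cases already,
      # and every extra independent variable doubles it again.
      axes = [models, prompts, *variables.values()]
      keys = ["model", "prompt", *variables.keys()]
      cases = [dict(zip(keys, combo)) for combo in itertools.product(*axes)]
      print(len(cases))  # 32
      ```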

dbish 5 hours ago

Hamel has really great practical eval advice, and I always share his posts with any new teams developing AI features/agents/assistants that I'm working with, both internally and with new startups in the AI applications space.

What I'd love to see one day is a way to capture this advice in a "Hamel in a box" eval copilot, or the agent that helps eval and improve other AI agents :). An eval expert who can ask the questions he's asking, look at data flowing through your system, make suggestions about how to improve your eval process, and automatically guide non-experts into following good practices for their eval loop.

ReDeiPirati 8 hours ago

> Q: What makes a good custom interface for reviewing LLM outputs? Great interfaces make human review fast, clear, and motivating. We recommend building your own annotation tool customized to your domain ...

Ah! This is horrible advice. Why recommend reinventing the wheel when there is already great open source software available? Just use https://github.com/HumanSignal/label-studio/ or any other open source annotation software you want to get started. These tools already cover pretty much all the possible use cases, and if they don't, you can just build on top of them instead of building from zero.

  • dbish 6 hours ago

    I think the truth is somewhere in between. I find Label Studio to be lacking a lot of niceties and generally built for the average text or image labeling use case; for anything else (like a multi-step agent workflow or some sort of multi-modal, task-specific problem) it's not quite right, and you do end up building a bit of your own custom interface.

    So, imho, you should try Label Studio but timebox it: decide for yourself within a day whether it's going to work for you. If not, go vibecode a different view and try it out, or build labeling into a copy of a front end you're already using for your task if that's quick.
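
    To give a sense of how small that "different view" can be, here's a minimal Streamlit-style sketch (the file paths and field names are placeholders, adapt to your task):

    ```python
    # Minimal sketch of a vibecoded labeling view; paths and fields are placeholders.
    import json
    import streamlit as st

    TRACES = [json.loads(line) for line in open("traces.jsonl")]

    idx = st.number_input("Trace #", min_value=0, max_value=len(TRACES) - 1, value=0)
    trace = TRACES[idx]

    st.subheader("User input")
    st.write(trace["input"])
    st.subheader("Model output")
    st.write(trace["output"])

    label = st.radio("Verdict", ["pass", "fail"])
    note = st.text_area("What went wrong (failure mode)?")

    if st.button("Save annotation"):
        with open("annotations.jsonl", "a") as f:
            f.write(json.dumps({"id": trace["id"], "label": label, "note": note}) + "\n")
        st.success("Saved")
    ```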

    What I think we really need here is a "lovable meets labelstudio" that starts with simple defaults and lets anyone use natural language, sketches, screenshots, to create custom interfaces and modify them quickly.

    • ultrasaurus 4 hours ago

      The SaaS version of Label Studio does have a natural language interface to create custom interfaces: https://docs.humansignal.com/guide/ask_ai

      I'm ostensibly an expert in the product and I probably use that 90%+ of the time (unless I'm testing something specific) -- using a sketch as input is a cool idea though!

      Disclaimer: I'm the VP of Product at HumanSignal, the company behind Label Studio.

  • bbischof 6 hours ago

    Label Studio is fine if it covers your needs, but in many cases the core opportunity in an eval interface is fitting in with the SME’s workflow or current tech stack.

    If Label Studio looks like what they can use, it’s fine. If not, a day of vibecoding is worth the effort to make your partners with special knowledge comfortable.

  • jph00 6 hours ago

    Label Studio is great, but by trying to cover so many use cases, it becomes pretty complex.

    I've found it's often easier to just whip up something for my specific needs, when I need it.

  • abletonlive 7 hours ago

    This awful advice can’t be blanket applied and misses the point: starting from zero is extremely easy now with LLMs; the last 10% is the hardest part. Not only that, if you don’t start from zero you aren’t able to build from whatever you think the new first principles are. SpaceX would not exist if it had tried to extend the old paradigm of rocketry.

    There’s nothing wrong with starting from scratch or rebuilding an existing tool from the ground up. There’s no reason to blindly build from the status quo.

    • ReDeiPirati 6 hours ago

      I'd have agreed with you if the principles were different. But what was shown in the content is EXACTLY what those tools are doing today. Actually, those tools are way more powerful and cover way more scenarios.

      > There’s nothing wrong with starting from scratch or rebuilding an existing tool from the ground up. There’s no reason to blindly build from the status quo.

      Generally speaking, all the options are OK, but not if you want to have something up as fast as you can or if your team is piloting something. I think the time you'd spend vibe coding it is greater than the time it takes to set any of those tools up.

      And BTW, you shouldn't vibe code something that proprietary data flows through; at the very least, you'd want to work with copilots.

afro88 13 hours ago

Some great info, but I have to disagree with this:

> Q: How much time should I spend on model selection?

> Many developers fixate on model selection as the primary way to improve their LLM applications. Start with error analysis to understand your failure modes before considering model switching. As Hamel noted in office hours, “I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence. Does error analysis suggest that your model is the problem?”

If there's a clear jump in evals from one model to the next (e.g., Gemini 2 to 2.5, or Claude 3.7 to 4), that will level up your system pretty easily. Use the best models you can, if you can afford it.

  • simonw 12 hours ago

    I think the key part of that advice is the “without evidence” bit:

    > I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence.

    If you try to fix problems by switching from e.g. Gemini 2.5 Flash to OpenAI o3, but you don't have any evals in place, how will you tell if the model switch actually helped?
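
    Even a tiny harness gives you that evidence. Something along these lines is enough to start (a sketch only: call_model, the grader, and the eval cases are stand-ins for your real setup):

    ```python
    # Sketch: run the same eval set against both models and compare pass rates
    # before deciding to switch. Everything here is a stand-in for your real setup.

    EVAL_SET = [
        {"input": "Summarize this support ticket ...", "must_contain": "refund"},
        # more cases, ideally drawn from real traces
    ]

    def call_model(model: str, prompt: str) -> str:
        # Stand-in for your real client call (Gemini, OpenAI, etc.).
        return "stubbed response mentioning refund"

    def passes(case: dict, output: str) -> bool:
        return case["must_contain"].lower() in output.lower()

    def pass_rate(model: str) -> float:
        results = [passes(c, call_model(model, c["input"])) for c in EVAL_SET]
        return sum(results) / len(results)

    baseline, candidate = pass_rate("gemini-2.5-flash"), pass_rate("o3")
    print(f"baseline={baseline:.0%} candidate={candidate:.0%}")
    ```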

  • phillipcarter 12 hours ago

    > If there's a clear jump in evals from one model to the next (ie Gemini 2 to 2.5, or Claude 3.7 to 4)

    How do you know that their evals match behavior in your application? What if the older, "worse" model actually does some things better, but if you don't have comprehensive enough evals for your own domain, you simply don't know to check the things it's good at?

    FWIW I agree that in general, you should start with the most powerful model you can afford, and use that to bootstrap your evals. But I do not think you can rely on generic benchmarks and evals as a proxy for your own domain. I've run into this several times where an ostensibly better model does no better than the previous generation.

  • softwaredoug 13 hours ago

    I might disagree as these models are pretty inscrutable, and behavior on your specific task can be dramatically different on a new/“better” model. Teams would do well to have the right evals to make this decision rather than get surprised.

    Also, the “if you can afford it” part can be a fairly non-trivial decision.

  • shrumm 10 hours ago

    The ‘with evidence’ part is key, as simonw said. One anecdote from evals at Cleric: it’s rare to see a new model do better on our evals vs. the current one. The reality is that you’ll optimize prompts etc. for the current model.

    Instead, if a new model only does marginally worse - that’s a strong signal that the new model is indeed better for our use case.

  • lumost 13 hours ago

    The vast majority of AI startups will fail for reasons other than model costs. If you crack your use case, model costs should fall exponentially.

  • smcleod 12 hours ago

    Yeah, totally agree. I see so many systems perform badly, only to find out they're using an older-generation model, and simply updating to the current model fixes many of their issues.

  • ndr 10 hours ago

    Quality can drop drastically even moving from Model N to N+1 from the same provider, let alone a different one.

    You'll have to adjust a bunch of prompts and measure. And if you didn't have a baseline to begin with good luck YOLOing your way out of it.

davedx 10 hours ago

I've worked with LLMs for the better part of the last couple of years, including on evals, but I still don't understand a lot of what's being suggested. What exactly is a "custom annotation tool", for annotating what?

  • spmurrayzzz 9 hours ago

    Concrete example from my own workflows: in my IDE whenever I accept or reject a FIM completion, I capture that data (the prefix, the suffix, the completion, and the thumbs up/down signal) and put it in a database. The resultant dataset is annotated such that I can use it for analysis, debugging, finetuning, prompt mgmt, etc. The "custom" tooling part in this case would be that I'm using a forked version of Zed that I've customized in part for this purpose.
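
    If it helps to picture it, the capture side is just a handful of fields in a local database; a rough sketch (the schema and names here are illustrative, not what Zed actually uses):

    ```python
    # Rough sketch of capturing accept/reject signals on FIM completions.
    # Table and column names are illustrative, not from any particular editor.
    import sqlite3, time

    conn = sqlite3.connect("fim_feedback.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS completions (
            ts         REAL,
            prefix     TEXT,
            suffix     TEXT,
            completion TEXT,
            accepted   INTEGER  -- 1 = accepted, 0 = rejected
        )
    """)

    def record(prefix: str, suffix: str, completion: str, accepted: bool) -> None:
        conn.execute(
            "INSERT INTO completions VALUES (?, ?, ?, ?, ?)",
            (time.time(), prefix, suffix, completion, int(accepted)),
        )
        conn.commit()

    # e.g. called from the editor's accept/reject handlers:
    record("def add(a, b):\n    ", "", "return a + b", accepted=True)
    ```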

  • calebkaiser 10 hours ago

    Typically, you would collect a ton of execution traces from your production app. Annotating them can mean a lot of different things, but often it means some mixture of automated scoring and manual review. At the earliest stages, you're usually annotating common modes of failure, so you can say something like "In 30% of failures, the retrieval component of our RAG app is grabbing irrelevant context." or "In 15% of cases, our chat agent misunderstood the user's query and did not ask clarifying questions."

    You can then create datasets out of these traces, and use them to benchmark improvements you make to your application.
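
    A made-up example of what that looks like once traces carry a failure-mode tag:

    ```python
    # Made-up example: annotated traces tallied into the kind of
    # "30% of failures are bad retrieval" summary described above.
    from collections import Counter

    annotated_traces = [
        {"trace_id": "t1", "failure_mode": "irrelevant_retrieval"},
        {"trace_id": "t2", "failure_mode": None},  # no failure
        {"trace_id": "t3", "failure_mode": "missed_clarifying_question"},
        {"trace_id": "t4", "failure_mode": "irrelevant_retrieval"},
    ]

    failures = [t for t in annotated_traces if t["failure_mode"]]
    counts = Counter(t["failure_mode"] for t in failures)

    for mode, n in counts.most_common():
        print(f"{mode}: {n / len(failures):.0%} of failures")
    ```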

pamelafox 9 hours ago

Fantastic FAQ, thank you Hamel for writing it up. We had an open space on AI Evals at PyCon this year, and had lots of discussion around similar questions. I only wrote down the questions, however:

## Evaluation Metrics & Methodology

* What metrics do you use (e.g., BERTScore, ROUGE, F1)? Are similarity metrics still useful?

* Do you use step-by-step evaluations or evaluate full responses?

* How do you evaluate VLM (vision-language model) summarization? Do you sample outputs or extract named entities?

* How do you approach offline (ground truth) vs. online evaluation?

* How do you handle uncertainty or "don’t know" cases? (Temperature settings?)

* How do you evaluate multi-turn conversations?

* A/B comparisons and discrete labels (e.g., good/bad) are easier to interpret.

* It’s important to counteract bias toward your own favorite eval questions—ensure a diverse dataset.

## Prompting & Models

* Do you modify prompts based on the specific app being evaluated?

* Where do you store prompts—text files, Prompty, database, or in code?

* Do you have domain experts edit or review prompts?

* How do you choose which model to use?

## Evaluation Infrastructure

* How do you choose an evaluation framework?

* What platforms do you use to gather domain expert feedback or labels?

* Do domain experts label outputs or also help with prompt design?

## User Feedback & Observability

* Do you collect thumbs up / thumbs down feedback?

* How does observability help identify failure modes?

* Do models tend to favor their own outputs? (There's research on this.)

I personally work on adding evaluation to our most popular Azure RAG samples, and put a Textual CLI interface in this repo that I've found helpful for reviewing the eval results: https://github.com/Azure-Samples/ai-rag-chat-evaluator
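
For anyone who hasn't tried Textual, the core of such a review table is very small; here's a stripped-down sketch of the idea (not the actual code from that repo, and the columns are placeholders):

```python
# Stripped-down sketch of a Textual table for reviewing eval results.
# Columns and rows are placeholders, not the repo's actual schema.
from textual.app import App, ComposeResult
from textual.widgets import DataTable, Footer, Header

RESULTS = [
    ("What is the refund policy?", "30 days ...", 5, 4),
    ("How do I reset my password?", "Go to settings ...", 3, 5),
]

class EvalReview(App):
    def compose(self) -> ComposeResult:
        yield Header()
        yield DataTable()
        yield Footer()

    def on_mount(self) -> None:
        table = self.query_one(DataTable)
        table.add_columns("question", "answer", "groundedness", "relevance")
        table.add_rows(RESULTS)

if __name__ == "__main__":
    EvalReview().run()
```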

  • hamelsmu 6 hours ago

    This is Hamel. Thanks for sharing! I will incorporate these into the FAQ. I love getting additional questions like this.

  • mmanulis 8 hours ago

    Any chance you can share what the answers were for choosing an evaluation framework?

_jonas 5 hours ago

Evals are critical, and I love the practicality of this guide!

One problem not covered here is: knowing which data to review.

If your AI system produces, say, 95% accurate responses, your Evals team will spend too much time reviewing production logs to discover the different AI failure modes.

To enable your Evals team to only spend time reviewing the high-signal responses that are likely incorrect, I built a tool that automatically surfaces the least trustworthy LLM responses:

https://help.cleanlab.ai/tlm/

Hope you find it useful, I made sure it works out-of-the-box with zero-configuration required!
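
The triage loop itself is simple once you have a per-response score; roughly something like this, with trust_score standing in for whatever scorer you plug in (TLM, an LLM judge, etc.):

```python
# Sketch of triaging production logs by trustworthiness so reviewers only see
# the likely-bad responses. `trust_score` is a stand-in for your scorer of choice.

def trust_score(prompt: str, response: str) -> float:
    # Placeholder: return a 0-1 trustworthiness estimate for the response.
    return 0.5

logs = [
    {"prompt": "What's our SLA?", "response": "99.9% uptime"},
    {"prompt": "Refund window?", "response": "We never offer refunds"},
]

scored = sorted(
    (dict(entry, score=trust_score(entry["prompt"], entry["response"])) for entry in logs),
    key=lambda e: e["score"],
)

REVIEW_BUDGET = 50
for entry in scored[:REVIEW_BUDGET]:  # least trustworthy first
    print(f"{entry['score']:.2f}  {entry['prompt']!r} -> {entry['response']!r}")
```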

  • hamelsmu 5 hours ago

    Hamel here. Thanks so much for asking this question! I will work on adding it to the FAQ. Please keep these coming!

andybak 12 hours ago

> About AI Evals

Maybe it's obvious to some - but I was hoping that page started off by explaining what the hell an AI Eval specifically is.

I can probably guess from context but I'd love to have some validation.

  • phren0logy 11 hours ago

    Here's another article by the same author with more background on AI Evals: https://hamel.dev/blog/posts/evals/

    I've appreciated Hamel's thinking on this topic.

    • xpe 9 hours ago

      From that article:

      > On a related note, unlike traditional unit tests, you don’t necessarily need a 100% pass rate. Your pass rate is a product decision, depending on the failures you are willing to tolerate.

      Not sure how I feel about this, given expectations, culture, and tooling around CI. This suggestion seems to blur the line between a score from an eval and the usual idea of a unit test.

      P.S. It is also useful to track regressions on a per-test basis.
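
      In CI terms this tends to look like asserting on an aggregate threshold rather than on every case, while still persisting per-case results so regressions stay visible. A rough pytest-flavored sketch (the threshold, file names, and grader are made up):

      ```python
      # Rough sketch: eval-as-CI with a pass-rate threshold instead of requiring 100%,
      # plus per-case results you can diff between runs to catch regressions.
      import json

      PASS_RATE_THRESHOLD = 0.90  # a product decision, not an engineering constant

      def grade(case: dict) -> bool:
          # Stand-in for your real grader (assertion, LLM judge, human label lookup ...).
          return case["expected"] in case["actual"]

      def test_eval_suite():
          cases = [json.loads(line) for line in open("eval_cases.jsonl")]
          results = {c["id"]: grade(c) for c in cases}

          # Persist per-case results so the next run can be diffed against this one.
          json.dump(results, open("eval_results.json", "w"), indent=2)

          pass_rate = sum(results.values()) / len(results)
          assert pass_rate >= PASS_RATE_THRESHOLD, f"pass rate {pass_rate:.0%} below threshold"
      ```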

  • ethan_smith 3 hours ago

    AI Evals are systematic frameworks for measuring LLM performance against defined benchmarks, typically involving test cases, metrics, and human judgment to quantify capabilities, identify failure modes, and track improvements across model versions.

th0ma5 6 hours ago

People should be demanding consistency and traceability from the model vendors, checked by some tool perhaps like this. This may tell you when the vendor changed something, but there is otherwise no recourse?

satisfice 9 hours ago

This reads like a collection of ad hoc advice overfitted to experience that is probably obsolete or will be tomorrow. And we don’t even know if it does fit the author’s experience.

I am looking for solid evidence of the efficacy of folk theories about how to make AI perform evaluation.

Seems to me a bunch of people are hoping that AI can test AI, and that it can to some degree. But in the end AI cannot be accountable for such testing, and we can never know all the holes in its judgment, nor can we expect that fixing a hole will not tear open other holes.

  • simonw 5 hours ago

    Hamel wrote a whole lot more about the "LLM as a judge" pattern (where you use LLMs to evaluate the output of other LLMs) here: https://hamel.dev/blog/posts/llm-judge/
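
    The core of the pattern is just asking a second model to grade the first one's output against a rubric. A minimal sketch using the OpenAI Python client (the rubric, model name, and PASS/FAIL format are arbitrary choices here, not Hamel's exact setup):

    ```python
    # Minimal LLM-as-judge sketch: a second model grades the first model's output.
    from openai import OpenAI

    client = OpenAI()

    def judge(question: str, answer: str) -> bool:
        verdict = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "You are grading an assistant's answer.\n"
                    f"Question: {question}\nAnswer: {answer}\n"
                    "Reply with exactly PASS if the answer is correct and relevant, "
                    "otherwise reply with exactly FAIL."
                ),
            }],
        )
        return verdict.choices[0].message.content.strip() == "PASS"

    print(judge("What is 2 + 2?", "4"))
    ```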

    • hamelsmu 4 hours ago

      Appreciate it, Simon! I have now edited my post to include links to "intro to evals" for those not familiar.