Mistral Small 3

620 points by jasondavies 5 months ago

simonw 5 months ago

I'm excited about this one - they seem to be directly targeting the "best model to run on a decent laptop" category, hence the comparison with Llama 3.3 70B and Qwen 2.5 32B.

I'm running it on a M2 64GB MacBook Pro now via Ollama and it's fast and appears to be very capable. This downloads 14GB of model weights:

  ollama run mistral-small:24b

Then using my https://llm.datasette.io/ tool (so I can log my prompts to SQLite):

  llm install llm-ollama
  llm -m mistral-small:24b "say hi"

More notes here: https://simonwillison.net/2025/Jan/30/mistral-small-3/

simonw 5 months ago

The API pricing is notable too: they dropped the prices by half from the old Mistral Small - that one was $0.20/million tokens of input, $0.60/million for output.
The new Mistral Small 3 API model is $0.10/$0.30.
For comparison, GPT-4o-mini is $0.15/$0.60.
- 85392_school 5 months ago
  
  Competition will likely be cheaper. (For context, Deepinfra runs larger 32B models at $0.07/$0.16)
isoprophlex 5 months ago

I make very heavy use of structured output (to convert unstructured data into something processable, eg for process mining on customer service mailboxes)
Is it any good for this, if you tested it?
I'm looking for something that hits the sweet spot of runs locally & follows prescribed output structure, but I've been quite underwhelmed so far
- enkrs 5 months ago
  
  I thought structured output is a solved problem now. I've had consistent results with ollama structured outputs [1] by passing Zod schema with the request. Works even with very small models. What are the challenges you're facing?
  [1] https://ollama.com/blog/structured-outputs
  - freehorse 5 months ago
    
    Structured output is solved, it is structuring data that's not, because that is an unbounded problem. There is no limit to how messy your data may be, and no limit to the accuracy and efficiency you may require.
    I have used such models to structure human-generated data into sth a script can then read and process, getting important aspects in this data (eg what time the human reported doing X thing, how long, with whom etc) into like a csv file with columns eg timestamps and whatever variables I am interested in.
  - Der_Einzige 5 months ago
    
    For anyone who thinks it isn't "solved", outlines debunked the paper which claims that "structured generation harms creativity":
    https://blog.dottxt.co/say-what-you-mean.html
- the_mitsuhiko 5 months ago
  
  I get decent JSON from it quite well with the "assistant: {" trick. I'm not sure how well trained it is to do JSON. The template on ollama has tools calls so I assume they made sure JSON works: https://ollama.com/library/mistral-small:24b/blobs/6db27cd4e...
  - a_wild_dandan 5 months ago
    
    And for anyone looking to dig deeper, check out "grammar-based sampling."
  - azinman2 5 months ago
    
    What’s the “assistant: {" trick? You just end your prompt with that?
    
    simonw 5 months ago
    
    Mistral supports prefixes: https://docs.mistral.ai/guides/prefix/
    
    azinman2 5 months ago
    
    That’s cool. However it only shows a few odd. I’d imagine the model needs to explicitly support this (be trained with it). None are about json… do you use that trick with json?
    
    simonw 5 months ago
    
    I use it to get JSON pretty often. See also: https://twitter.com/simonw/status/1885091289554968975
- starik36 5 months ago
  
  The only model that I've found to be useful in processing customer emails is o1-preview. The rest of the models work as well, but don't get all the minutia of the emails.
  My scenario is pretty specific though and is all about determining intent (e.g. what does the customer want) and mapping it onto my internal structures.
  The model is very slow, but definitely worth it.
- d4rkp4ttern 5 months ago
  It does decently well actually. You can test function-calling using Langroid. There are several example scripts you could try from the repo, e.g.
  uv run examples/basic/tool-extract-short-example.py --model ollama/mistral-small
  sample output: https://gist.github.com/pchalasani/662d7f13dbe690d6e2bfef01c...
  Langroid has a ToolMessage mechanism that lets you specify a tool/fn-call using Pydantic, which is then transpiled into system message instructions.
- mohsen1 5 months ago
  
  See function calling being called out here
  https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-...
- mercer 5 months ago
  
  I've found phi4 to be very good for this.
- rkwz 5 months ago
  
  What local models are you currently using and what issues are you facing?
halyconWays 5 months ago

Maybe I'm an outlier but I don't see much value in running tiny local models vs. using a more powerful desktop in my house to host a larger and far more usable model. I run Open WebUI and connect it to my own llama.cpp/koboldcpp that runs a 4-bit 70B model, and can connect to it anywhere easily with Tailscale. For questions that even 70B can't handle I have Open WebUI hit OpenRouter and can choose between all the flagship models.
Every time I've tried a tiny model it's been too questionable to trust.
- kamranjon 5 months ago
  
  Have you tried Gemma 27b? I’ve been using it with llamafile and it’s pretty incredible. I think the winds are changing a bit and small models are becoming much more capable. Worth giving some of the smaller ones a shot if it’s been a while. I can run Gemma 27b on my 32gb MacBook Pro and it’s pretty capable with code too.
pks016 5 months ago

Question for people who spent more time with these small models. What's a current best small model to extract information from a large number of pdfs? I have multiple collection of research articles. I want two tasks 1) Extract info from pdfs 2) classify papers based the content of the paper.
Or point me to right direction
- themanmaran 5 months ago
  
  Hey this is something we know a lot about. I'd say Qwen 2.5 32B would be the best here.
  We've found GPT-4o/Claude 3.5 to benchmark at around 85% accuracy on document extraction. With Qwen 72B at around 70%. Smaller models will go down from there.
  But it really depends on the complexity of the documents, and how much information you're looking to pull out. Is it something easy like document_title or hard like array_of_all_citations.
  - pks016 5 months ago
    
    Most of them are experimental studies. So it would be text extraction of something like title, authors, species of the study, sample size etc. And classify based on the content of the pdfs.
    I tried the GPT-4o, it's good but it'll cost a lot if I want to process all the documents.
    
    SparkyMcUnicorn 5 months ago
    
    1. You can get a 50% discount via batching.
    2. Give a few Sonnet or 4o input/output examples to haiku, 4o-mini, or any other smaller model. Giving good examples to smaller models can bring the output quality closer to (or on par with) the better model.
rahimnathwani 5 months ago

Given you have 64GB RAM, you could run mistral-small:24b-instruct-2501-q8_0
jhickok 5 months ago

Do you know how many tokens per second you are getting? I have a similar laptop that I can test on later but if you have that info handy let me know!
- snickell 5 months ago
  
  M2 max with 64GB: 14 tokens/s running `ollama run mistral-small:24b --verbose`
prettyblocks 5 months ago

Hey Simon - In your experience, what's the best "small" model for function/tool calling? Of the ones I've tested they seem to return the function call even when it's not needed, which requires all kinds of meta prompting & templating to weed out. Have you found a model that more or less just gets it right?
- simonw 5 months ago
  
  I'm afraid I don't have a great answer for that - I haven't spent nearly enough time with function calling in these models.
  I'm hoping to add function calling to my LLM library soon which will make me much better equipped to experiment here.
  - prettyblocks 5 months ago
    
    Cool, thanks for the reply. Looking forward to following along with your progress.
jonas21 5 months ago

I don't understand the joke.
- simonw 5 months ago
  It's hardly a joke at all. Even the very best models tend to be awful at writing jokes.
  I find the addition of an explanation at the end (never a sign of a good joke) amusing at the meta-level:
  Why did the badger bring a puffin to the party? Because he heard puffins make great party 'Puffins'! (That's a play on the word "puffins" and the phrase "party people.")
  - dgacmu 5 months ago
    
    A man walks up to an llm and asks it to tell him a joke about a puffin and a badger.
    The llm replies with a joke that is barely a joke.
    The man says "another."
    The llm gives another unfunny response.
    "Another!"
    Followed by another similarly lacking response.
    "Another!"
    With exasperation, the llm replies "stop badgering me!"
    Except it won't, because that's not a high likelihood output. ;)
    
    airstrike 5 months ago
    
    Now that your comment is in their training data, it's a high likelihood output!
  - becquerel 5 months ago
    
    And yet LLMs tend to in fact be very funny, just very very rarely on purpose.
- emmelaich 5 months ago
  
  Apparently "party puffin" is a company that sells cheap party supplies and decorations. That's all that I can think of.

asb 5 months ago

Note the announcement at the end, that they're moving away from the non-commercial only license used in some of their models in favour of Apache:

We’re renewing our commitment to using Apache 2.0 license for our general purpose models, as we progressively move away from MRL-licensed models

diggan 5 months ago

Note that this seems to be about the weights themselves, AFAIK, the actual training code and datasets (for example) aren't actually publicly available.
It's a bit like developing a binary application and slapping a FOSS license on the binary while keeping the code proprietary. Not saying that's wrong or anything, but people reading these announcements tend to misunderstand what actually got FOSS licensed when the companies write stuff like this.
- crawshaw 5 months ago
  
  It's not the same as slapping an open source license on a binary, because unencumbered weights are so much more generally useful than your typical program binary. Weights are fine-tunable and embeddable into a wide range of software.
  To consider just the power of fine tuning: all of the press DeepSeek have received is over their R1 model, a relatively tiny fine-tune on their open source V3 model. The vast majority of the compute and data pipeline work to build R1 was complete in V3, while that final fine-tuning step to R1 is possible even by an enthusiastic dedicated individual. (And there are many interesting ways of doing it.)
  The insistence every time open sourced model weights come up that it is not "truly" open source is tiring. There is enormous value in open source weights compared to closed APIs. Let us call them open source weights. What you want can be "open source data" or somesuch.
  - diggan 5 months ago
    
    > The insistence every time open sourced model weights come up that it is not "truly" open source is tiring. There is enormous value in open source weights compared to closed APIs. Let us call them open source weights. What you want can be "open source data" or somesuch.
    Agree that there is more value in open source weights than closed APIs, but what I really want to enable, is people learning how to create their own models from scratch. FOSS to me means being able to learn from other projects, how to build the thing yourself, and I wrote about why this is important to me here: https://news.ycombinator.com/item?id=42878817
    It's not a puritan view but purely practical. Many companies started using FOSS as a marketing label (like what Meta does) and as someone who probably wouldn't be a software developer without being able to learn from FOSS, it fucking sucks that the ML/AI ecosystem is seemingly OK with the term being hijacked.
    
    crawshaw 5 months ago
    
    It's not just a marketing label. The term is not being hijacked. Open source models, open source weights, the license chosen, these are all exetrmely valuable concepts.
    The thing you want, open source model data pipelines, is a different thing. It's existence in no way invalidates the concept of an open source model. Nothing has been hijacked.
    
    diggan 5 months ago
    
    We call software FOSS when you can compile (if needed) and build the project yourself, locally, granted you have the resources available. If you have parts that aren't FOSS attached to the project somehow, we call it "Open Core" or similar. You wouldn't call a software project FOSS if the only thing under a FOSS license is the binary itself, or some other output, we require at least the code to be FOSS for it to be considered FOSS.
    Meta/Llama probably started the trend, and they still today say "The open-source AI models" and "Llama is the leading open source model family" which is grossly misleading.
    You cannot download the Llama models or weights without signing a license agreement, you're not allowed to use it for anything you want, you need to add a disclaimer on anything that uses Llama (which almost the entire ecosystem breaks as they seemingly missed this when they signed the agreement) and so on, which to me goes directly against what FOSS means.
    If you cannot reproduce the artifact yourself (again, granted you have the resources), you'd have a really hard time convincing me that that is FOSS.
    
    fzzzy 5 months ago
    
    The data pipeline to build the weights is the source. The weights are a binary. The term is being hijacked. Just call it open weights, not open source models. The source for the models is not available. The weights are openly available.
    
    rcdwealth 5 months ago
    
    Meta’s LLaMa 2 license is not Open Source – Open Source Initiative: https://opensource.org/blog/metas-llama-2-license-is-not-ope...
    If it would not be hijacked, then such articles would not exist.
    META is falsely and deceptively, but also carefully, pretending to be Open Source.
    The Open Source Definition – Open Source Initiative https://opensource.org/osd
    What is Free Software? - GNU Project - Free Software Foundation https://www.gnu.org/philosophy/free-sw.html
    Word "Open" as in "Open Source" - Words to Avoid (or Use with Care) Because They Are Loaded or Confusing https://www.gnu.org/philosophy/words-to-avoid.html#Open
    Please refrain from using "open" or "open source" as a synonym for "free software." These terms originate from different perspectives and values. The free software movement advocates for your freedom in computing, grounded in principles of justice. The open source approach, on the other hand, does not promote a set of values in the same way. When discussing open source views, it's appropriate to use that term. However, when referring to our views, our software, or our movement, please use "free software" or "free (libre) software" instead. Using "open source" in this context can lead to misunderstandings, as it implies our views are similar to those of the open source movement.
    
    crawshaw 5 months ago
    
    Your concern about Meta's license is fair, I have no useful opinion on that. I certainly wish they would use a freer license, though I am loath to look a gift horse in the mouth.
    My concern in this thread is people rejecting the concept of open source model weights as not "true" open source, because there is more that could be open sourced. It discounts a huge amount of value model developers provide when they open source weights. You are doing a variant of that here by trying to claim a narrow definition of "free software". I don't have any interest in the FSF definition.
    
    Miraste 5 months ago
    
    I'm in favor of FOSS, and I'd like to see more truly open models for ideological reasons, but I don't see a lot of practical value for individuals in open-sourcing the process. You still can't build one yourself. How does it help to know the steps when creating a base model still costs >tens of millions of dollars?
    It seems to me that open source weights enable everything the FOSS community is practically capable of doing.
    
    diggan 5 months ago
    
    > How does it help to know the steps when creating a base model still costs >tens of millions of dollars?
    You can still learn web development even though you don't have 10,000s of users with a large fleet of servers and distributed servers. Thanks to FOSS, it's trivial to go through GitHub and find projects you can learn a bunch from, which is exactly what I did when I started out.
    With LLMs, you don't have a lot of options. Sure, you can download and fine-tune the weights, but what if you're interested in how the weights are created in the first place? Some companies are doing a good job (like the folks building OLMo) to create those resources, but the others seems to just want to use FOSS because it's good marketing VS OpenAI et al.
    
    Miraste 5 months ago
    
    Learning resources are nice, but I don't think it's analogous to web dev. I can download nginx and make a useful website right now, no fleet of servers needed. I can even get it hosted for free. Making a useful LLM absolutely, 100% requires huge GPU clusters. There is no entry level, or rather that is the entry level. Because of the scale requirements, FOSS model training frameworks (see GPT-NeoX) are only helpful for large, well-funded labs. It's also difficult to open-source training data, because of copyright.
    Finetuning weights and building infrastructure around that involves almost all the same things as building a model, except it's actually possible. That's where I've seen most small-scale FOSS development take place over the last few years.
    
    fzzzy 5 months ago
    
    This isn't true. Learning how to train a 124M is just as useful as a 700B, and is possible on a laptop. https://github.com/karpathy/nanoGPT
    
    Miraste 5 months ago
    
    To clarify my point:
    Learning how to make a small website is useful, and so is the website.
    Learning how to finetune a large GPT is useful, and so is the finetuned model.
    Learning how to train a 124M GPT is useful, but the resulting model is useless.
    
    diggan 5 months ago
    
    > Finetuning weights and building infrastructure around that involves almost all the same things as building a model
    Those are two completely different roles? One is mostly around infrastructure and the other is actual ML. There are people who know both, I'll give you that, but I don't think that's the default or even common. Fine-tuning is trivial compared to building your own model and deployments/infrastructure is something else entirely.
    
    fzzzy 5 months ago
    
    It wouldn't cost tens of millions of dollars to create a 500m or 1b, and the process of learning is transferrable to larger model weights.
- eldenring 5 months ago
  
  Its not the exact same since you can still finetune it, you can modify the weights, serve it with different engines, etc.
  This kind of purity test mindset doesn't help anyone. They are shipping the most modifiable form of their model.
  - diggan 5 months ago
    
    Agree that it's not exactly the same, all analogies have holes, they're simplifications after all.
    I guess I'm vary of the messaging because I'm a developer 99% thanks to FOSS, and being able to learn from FOSS projects how to build similar stuff myself. Without FOSS, I probably wouldn't have been able to "escape" the working-class my family was "stuck in" when I grew up.
    I want to do whatever I can to make sure others have the same opportunity, and it doesn't matter if the weights themselves are FOSS or not, others cannot learn how to create their own models based on just looking at the weights. You need to be able to learn the model architecture, training and what datasets models are using too, otherwise you won't get very far.
    > This kind of purity test mindset doesn't help anyone. They are shipping the most modifiable form of their model.
    It does help others who might be stuck in the same situation I was stuck in, that's not nothing nor is it about "purity". They're not shipping the most open model they can, they could have done something like OLMo (https://github.com/allenai/OLMo) which can teach people how to build their own models from scratch.
    
    fzzzy 5 months ago
    
    Keep fighting the good fight. Saying Llama is open source is straight up lying. It's open weights.
    
    diggan 5 months ago
    
    Thank you, sometimes it feels weird to argue against people who are generally pro-FOSS but somehow for LLMs are fine with misleading statements. I'm glad at least one other person can see through it, encouraging I'm on the right track :)
    I'm not sure I'd even call Llama "open weights". For me that would mean I can download the weights freely (you cannot download Llama weights without signing a license agreement) and use them freely, you cannot use them freely + you need to add a notice from Meta/Llama on everything that uses Llama saying:
    > prominently display “Built with Llama” on a related website, user interface, blogpost, about page, or product documentation.
    https://www.llama.com/llama3_2/license/
    Not sure what the correct label is, but it's not open source nor open weights, as far as I can tell.
- jacooper 5 months ago
  
  > Note that this seems to be about the weights themselves, AFAIK, the actual training code and datasets (for example) aren't actually publicly available.
  Like every other open source / source available LLM?
  - diggan 5 months ago
    
    Like every other Open Source LLM weights, yes. But looking around, there are models that are 100% FOSS, like OLMo (https://github.com/allenai/OLMo).
    Also, I don't buy the argument that because many in the ecosystem mislabel/mislead people about the licensing, makes it ethically OK for everyone else to do so too.
  - amoss 5 months ago
    
    https://github.com/huggingface/open-r1
    
    pridkett 5 months ago
    
    While I hope the HuggingFace is successful here, a plan for building a model is a long way from releasing a model. Mistral has models out there - they allow you to modify them. Yeah, it’s now like what we’re used to. It probably needs something else, but people are doing some great things with them.
- zamalek 5 months ago
  
  Binaries can do arbitrary things, like report home to a central server. Weights cannot.
  - hiq 5 months ago
    
    Why is that relevant regarding the FOSS aspects of weights / binaries? If I run a binary within a VM and only consider its output and prevent any side-effect host, just like I could just consider the output of an LLM, my binary is still not any closer to being FOSS, is it?
  - actualwitch 5 months ago
    
    Depending on format, they might.
    
    danieldk 5 months ago
    
    Virtually all models are now distributed as Safetensors/gguf/etc. (which are just metadata + data), not pickled Python classes. Many libraries also don't even load pickled checkpoints anymore unless you add an argument explicitly stating that you want to load an unsafe checkpoint.
- dismalaf 5 months ago
  
  But the weights can be modified. Also the real key is that you can host it yourself, fine tune and make money from it without restriction. That's what it's really about. No one (well, few) cares about recreating it because if they could they'd simply have made one from scratch themselves.
  - diggan 5 months ago
    
    The same is true for FOSS in general. You're arguing that because no one (almost) builds their own X from scratch, there is therefore no value in having resources available for how to build your own X from scratch.
    For someone who basically couldn't become a developer with FOSS, this way of thinking is so backwards, especially on Hacker News. I thought we were pro-FOSS in general, but somehow LLMs get a pass because "they're too complicated and no one would build one from scratch".
    
    dismalaf 5 months ago
    
    They get a pass because we know what these companies train on (proprietary or private data) but they can't admit it, but they're still giving away multi million dollar models for free.
    Yes, it'd be nice if it was open and reproducible from start to finish. But let's not let perfect be the enemy of good.
    
    diggan 5 months ago
    
    > Yes, it'd be nice if it was open and reproducible from start to finish. But let's not let perfect be the enemy of good.
    "Let's not let companies exploit well-known definitions for their own gain" is what I'm going for, regardless if we personally gain from it or not.
    
    hiq 5 months ago
    
    I still don't understand why they have to mix definitions to confuse developers, and we on top of that apparently have to give up on the true meaning of FOSS. What's so hard about using the term "open weights" or some new term instead of trying to reuse FOSS terms they don't abide to?
- mcraiha 5 months ago
  
  The binary comparison is a bit bad, since binary can have copyrights. Weights cannot.
  - diggan 5 months ago
    
    Has that actually been tried in court, or is that your guess? Because you seem confident, but I don't think this has been tried (yet)
    
    badsectoracula 5 months ago
    
    It is a guess (not the same author) but it'd make sense: weights are machine output so if the output of AI is not under copyright because it is machine output (which seems to be something that is pretty much universally agreed upon), then the same would apply for the weights themselves.
    I'm not sure how someone would argue (in good faith) that training on copyrighted materials does not cause the weights to be a derivative of those materials and the output of their AI is not protected under copyright but the part in the middle, the weights, does fall under copyright.
    Note that this would be about the weights (i.e. the numbers), not their container.
    
    wongarsu 5 months ago
    
    Photographs are machine output too and are famously subject to copyright. The movie Toy Story is also machine output, but I'm confident Disney is enforcing copyright on that.
    The opinion that AI output isn't copyrightable derives from the opinion of the US Copyright Office, which argues that AI output is more like commissioning an artist than like taking a picture. And since the artist isn't human they can't claim copyright for their work.
    It's not at all obvious to me that the same argument would hold for the output of AI training. Never mind that the above argument about AI output is just the opinion of some US agency and hasn't been tested in court anywhere in the world.
    
    badsectoracula 5 months ago
    
    As mentioned by the other comment, the difference lies in how much human effort was put - something by its nature is not possible to ask in a black and white manner that applies everywhere. But in cases which are close to the extremes, it is easy to answer: even though the Toy Story renders are machine output, these renders are the result of a lot of human effort by the human artists that made the 3D scenes, materials, models, animation sequences, etc for the purpose of being used in those renders. So Disney can claim copyright on that sort of "machine output".
    Similarly, claiming copyright on AI output is like claiming copyright on something like `init_state(42, &s); for (int i=0; i < count; i++) output[i] = next_random(&s);`. While there is a bit of (theoretical) effort involved into choosing 42 as a starting input, ultimately you can't really claim copyright on a bunch of random numbers because you chose the initial seed value.
    Of course you can claim copyright in the code, but doing the same on the output makes no sense: even the if the idea of owning random numbers isn't absurd enough, consider what would happen if -say- 10000 people did the same thing (and to make things even more clear, what if `init_state` used only 8bits of the given number, therefore making sure that there would be a lot of people ending up with the same numbers).
    AI is essentially `init_state` and `next_random`, just with more involved algorithms than a random number generator.
    
    michaelt 5 months ago
    
    A photograph is often subject to copyright - but there's actually some nuance here; some countries also require a certain level of creative input by a human.
    https://en.wikipedia.org/wiki/Threshold_of_originality
    Areas of dispute include photographs of famous paintings (is it more in the character of a photocopy?), photographs taken by animals (does the human get copyright if they deliberately created the situation where the animal would take a photograph?), and videos taken automatically (can a CCTV video have an author?)
    Historically, the results are all over the place.
    
    conradev 5 months ago
    
    The copyright office is actively figuring it out. From yesterday: https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...
    
    realusername 5 months ago
    
    If you make them copyrightable, then it means they are a derivative of the training dataset.
    The only defense these AI companies have is making the weights machine output and thus not copyrightable.
    But then again that's the theory, the copyright system follows money and it wouldn't be surprising to have contradicting ideas being allowed.
youssefabdelm 5 months ago

I guess since they're not ahead anymore they decide to go back to open source.
dismalaf 5 months ago

They must have realized they were becoming irrelevant... I know I forgot about them and have been using other models locally. Openness is a huge win, even if I am using Mistral's hosting service I want to know I can always host it myself too, to protect my business against rug pulls and the like.
No one's going to pay for an inferior closed model...
mythz 5 months ago

Happy to seem them back to releasing OSS models, we used a lot of their OSS models early last year before they were eclipsed by better models and never bothered to try any of their large models which IMO weren't great value.
littlestymaar 5 months ago

I wonder if that's a consequence of the Deepseek distill release: fine-tuned Qwen and Llama were both released by Deepseek, but not Mistral, and that's was a missed PR opportunity for them for no good reason.
globular-toast 5 months ago

What does an Apache licence even mean in this context? It's not software. Is it even copyrightable?

tadamcz 5 months ago

Hi! I'm Tom, a machine learning engineer at the nonprofit research institute Epoch AI [0]. I've been working on building infrastructure to:

* run LLM evaluations systematically and at scale

* share the data with the public in a rigorous and transparent way

We use the UK government's Inspect [1] library to run the evaluations.

As soon as I saw this news on HN, I evaluated Mistral Small 3 on MATH [2] level 5 (hardest subset, 1,324 questions). I get an accuracy of 0.45 (± 0.011). We sample the LLM 8 times for each question, which lets us obtain less noisy estimates of mean accuracy, and measure the consistency of the LLM's answers. The 1,324*8=10,584 samples represent 8.5M tokens (2M in, 6.5M out).

You can see the full transcripts here in Inspect’s interactive interface: https://epoch.ai/inspect-viewer/484131e0/viewer?log_file=htt...

Note that MATH is a different benchmark from the MathInstruct [3] mentioned in the OP.

It's still early days for Epoch AI's benchmarking work. I'm developing a systematic database of evaluations run directly by us (so we can share the full details transparently), which we hope to release very soon.

[0]: https://epoch.ai/

[1]: https://github.com/UKGovernmentBEIS/inspect_ai

[2]: https://arxiv.org/abs/2103.03874

[3]: https://huggingface.co/datasets/TIGER-Lab/MathInstruct

coalteddy 5 months ago

Thanks a lot for this eval!
One question i have regarding evals is, what sampling temperature and/or method do you use? As far as i understand temperature/ method can impact model output alot. Would love to here you're thoughts on how these different settings of the same model can impact output and how to go about evaluating models when its not clear how to use the to their fullest
- tadamcz 5 months ago
  
  Generally, we'll use the API provider's defaults.
  For models we run ourselves from the weights, at the moment we'd use vLLM's defaults, but this may warrant more thought and adjustment. Other things being equal, I prefer to use an AI lab's API, with settings as vanilla as possible, so that we essentially defer to them on these judgments. For example, this is why we ran this Mistral model from Mistral's API instead of from the weights.
  I believe the `temperature` parameter, for example, has different implementations across architectures/models, so it's not as simple as picking a single temperature number for all models.
  However, I'm curious if you have further thoughts on how we should approach this.
  By the way, in the log viewer UI, for any model call, you can click on the "API" button to see the payloads that were sent. In this case, you can see that we do not send any values to Mistral for `top_p`, `temperature`, etc.

mohsen1 5 months ago

Not so subtle in function calling example[1]

        "role": "assistant",
        "content": "---\n\nOpenAI is a FOR-profit company.",

[1] https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-...

spwa4 5 months ago

So the point of this release is

1) code + weights Apache 2.0 licensed (enough to run locally, enough to train, not enough to reproduce this version)

2) Low latency, meaning 11ms per token (so ~90 tokens/sec on 4xH100)

3) Performance, according to mistral, somewhere between Qwen 2.5 32B and Llama 3.3 70B, roughly equal with GPT4o-mini

4) ollama run mistral-small (14G download) 9 tokens/sec on the question "who is the president of the US?" (also to enjoy that the answer ISN'T orange idiot)

freehorse 5 months ago

I tried just a few of the code generating prompts I have used last days, and it looks quite good and promising. It seems at least on par with qwen2.5-coder-32b which was the first local model i would actually use for code. I am also surprised how far we went with small models producing such more polished output in the last year.

On another note, I also wish they would follow up with a new version of the 8x7B mixtral. It was one of my favourite models, but at the time it could barely fit in my ram, and now that I have more ram it is rather outdated. But I don't complain, this model anyway is great and it is great that they are one of the companies which actually publish such models targeted to edge computing.

msp26 5 months ago

Finally, all the recent MoE model releases make me depressed with my mere 24GB VRAM.

> Note that Mistral Small 3 is neither trained with RL nor synthetic data

Not using synthetic data at all is a little strange

colonial 5 months ago

I recall seeing some complaints recently w.r.t. one of the heavily synthetic models (Phi?) - apparently they tend to overfit on STEM "book knowledge" while struggling with fuzzier stuff and instruction following.
I'm not much of an LLM user, though, so take my warmed over recollections with a grain of salt.
bloopernova 5 months ago

I'm surprised no GPU cards are available with like a TB of older/cheaper RAM.
- gr3ml1n 5 months ago
  
  Not surprising at all: Nvidia doesn't want to compete with their own datacenter cards.
  - wongarsu 5 months ago
    
    AMD could arguably do it. But they have to focus to stay above water at all, and "put 128GB or more of DDR5 ram on any previous-gen GPU" is probably not in their focus. With the state of their software it's not even certain if the community could pick up the slack and turn that into a popular solution.
    
    hnuser123456 5 months ago
    
    Their next generation of APUs will have a lot more memory bandwidth and there will probably be lots of AMD APU laptops with 64GB+ of RAM that can use HW acceleration and not be artificially segmented the way Nvidia can do it with VRAM being soldered.
  - papichulo2023 5 months ago
    
    Nvidia upcoming 'minipc' has shared ram up to 128gb for around 3k. No a competitor but pretty good for enthusiast.
    Hopefully is at least quadchannel.
- aurareturn 5 months ago
  
  Because memory bandwidth is the #1 bottleneck for inference, even more than capacity.
  What good is 1TB RAM if the bandwidth is fed through a straw? Models would run very slow.
  You can see this effect on 128GB MacBook Pros. Yes, the model will fit but it’s slow. 500GB/s of memory bandwidth feeds 128GB RAM at a maximum rate of 3.9x per second. This means if your model is 128GB large, your max tokens/s is 3.9. In the real world, it’s more like 2-3 tokens/s after overhead and compute. That’s too slow to use comfortably.
  You’re probably wondering why not increase memory bandwidth too. Well, you need faster memory chips such as HBM and/or more memory channels. These changes will result in drastically more power consumption and bigger memory controllers. Great, you’ll pay for those. Now you’re bottlenecked by compute. Just add more compute? Ok, you just recreated the Nvidia H100 GPU. That’ll be $20k please.
  Some people have tried to use AMD Epyc CPUs with 8 channel memory for inference but those are also painfully slow in most cases.
  - acoard 5 months ago
    
    > Because memory bandwidth is the #1 bottleneck for inference, even more than capacity.
    But there are a ton of models I can't run at all locally due to VRAM limitations. I'd take being able to run those models slower. I know there are some ways to get these running on CPU orders of magnitude slower, but ideally there's some sort of middle ground.
    
    aurareturn 5 months ago
    
    You can load giant models onto normal RAM such as on an Epyc system but they're still mostly bottlenecked by low memory bandwidth.
    
    elcomet 5 months ago
    
    You can offload tensors to the cpu memory. It will make your model run much slower but it will work

bugglebeetle 5 months ago

Interested to see what folks do with putting DeepSeek-style RL methods on top of this. The smaller Mistral models have always punched above their weight and been the best for fine-tuning.

petercooper 5 months ago

It's not RL, but you can get a long way with a thorough system prompt to encourage it to engage in 'thinking' behavior on its own without extra training. Just playing with it myself now with promising results - Mistral Small seems very receptive to this approach (not all models are - cough, Llama).
Update: This is such a prompt: https://gist.github.com/peterc/955d797ee35b3c777d76a2d881d2f...

yodsanklai 5 months ago

I'm curious, what people do with these smaller models?

Beretta_Vexee 5 months ago

RAG mainly, Feature extraction, tagging, Document and e-mail classification. You don't need a 24B parameter to know whether the e-mail should go to accounting or customer support.
- Panoramix 5 months ago
  
  Would this work for non-text data? Like finding outliers in a time series or classifying trends, that kind of thing
  - Beretta_Vexee 5 months ago
    
    Don't use LLM to do something that some python + panda lines could do better. Buy a data scientist a coffee and have a chat.
- pheeney 5 months ago
  
  What models would you recommend for basic classification if you don't need a 24B parameter one?
  - josh-sematic 5 months ago
    
    You might find this comparison chart helpful: https://www.airtrain.ai/blog/how-15-top-llms-perform-on-clas...
    Note: from October; also I work at Airtrain
  - elorant 5 months ago
    
    I’m using Llama-3 8B to classify html files. It’s surprisingly good, and I run it on an RTX 4060 Ti at 8-bit quantization. No complains so far.
  - Beretta_Vexee 5 months ago
    
    There's no alternative to testing with your own data. The majority of our data is in French, and our benchmarks differ greatly from public benchmarks generally based on English documents.
celestialcheese 5 months ago

Classification, tagging tasks. Way easier than older ML techniques and very fast to implement.
- mattgreenrocks 5 months ago
  
  When compared against more traditional ML approaches, how do they fare in terms of quality?
  - spmurrayzzz 5 months ago
    
    Historically the problem with using LLMs for the super simple conventional NLP stuff is that they were hard to control in terms of output. If you wanted a one-word answer for a classification task, you'd often have to deal with it responding in a paragraph. This obviously hurts precision and accuracy quite a bit. There were tricks you could use (like using few-shot examples or GBNF grammars or training low-rank adapters or even re-asking the model) to constrain output a bit, but they weren't perfect.
    Over the last 12-18 months though, the instruction-following capabilities of the models have improved substantially. This new mistral model in particular is fantastic at doing what you ask.
    My approach to this personally and professionally is to just benchmark. If I have a classification task, I use a tiny model first, eval both, and see how much improvement I'd get using an LLM. Generally speaking though, the vram costs are so high for the latter that its often not worth it. It really is a case-by-case decision though. Sometimes you want one generic model to do a bunch of tasks rather than train/finetune a dozen small models that you manage in production instead.
  - andrewgross 5 months ago
    
    Super easy to get started, but lacking for larger datasets where you want to understand a bit more about predictions. You generally lose things like prediction probability (though this can be recovered if you chop the head off and just assign output logits to classes instead of tokens), repeatability across experiments, and the ability to tune the model by changing the data. You can still do fine tuning, though itll be more expensive and painfaul than a BERT model.
    Still, you can go from 0 to ~mostly~ clean data in a few prompts and iterations, vs potentially a few hours with a fine tuning pipeline for BERT. They can actually work well in tandem to bootstrap some training data and then use them together to refine your classification.
  - whymauri 5 months ago
    
    After prompt optimization with something like DSPy and a good eval set, significantly faster and just about as good. Occasionally higher accuracy on held out data than human labelers given a policy/documentation e.g. customer support cases.

ignoramous 5 months ago

Mistral repeatedly emphasize on "accuracy" and "latency" for this Small (24b) model; which to me means (and as they also point out):

- Local virtual assistants.

- Local automated workflows.

Also from TFA:

  Our customers are evaluating Mistral Small 3 across multiple industries, including:

  - Financial services customers for fraud detection
  - Healthcare providers for customer triaging
  - Robotics, automotive, and manufacturing companies for on-device command and control
  - Horizontal use cases across customers include virtual customer service, and sentiment and feedback analysis.

frankfrank13 5 months ago

They're fast, I used 4o mini to run the final synthesis in a CoT app and to do initial entity/value extraction in an ETL. Mistral is pretty good for code completions too, if I was in the Cursor business I would consider a model like this for small code-block level completions, and let the bigger models handle chat, large requests, etc.
_boffin_ 5 months ago

Cleaning messy assessor data. Email draft generation.
superkuh 5 months ago

Not spend $6000 on hardware because they run on computers we already have. But more seriously, they're fine and plenty fun for making recreational IRC bots.

rahimnathwani 5 months ago

Until today, no language model I've run locally on a 32GB M1 has been able to answer this question correctly: "What was Mary J Blige's first album?"

Today, a 4-bit quantized version of Mistral Small (14GB model size) answered correctly :)

https://ollama.com/library/mistral-small:24b-instruct-2501-q...

kamranjon 5 months ago

I just tried your question against Gemma 2 27b llamafile on my M1 Macbook with 32gb of ram, here is the transcript:
>>> What was Mary J Blige's first album?
Mary J. Blige's first album was titled *"What's the 411?"*.
It was released on July 28, 1992, by Uptown Records and became a critical and commercial success, establishing her as the "Queen of Hip-Hop Soul."
Would you like to know more about the album, like its tracklist or its impact on music?
- rahimnathwani 5 months ago
  Ah! I had not tried any gemma models locally. It worked:
  % llm -m gemma2:27b-instruct-q4_0 "What was Mary J Blige's first album?" Mary J. Blige's first album was **"What's the 411?"** It was released in July 1992. Let me know if you have any other questions about Mary J. Blige!
  - svachalek 5 months ago
    
    Gemma2 seems to be the best small model for trivia. Even the 9b model really surprises me sometimes with things it can answer, that seem ridiculous for a small model to know.

cptcobalt 5 months ago

This is really exciting—the 12-32b size range has my favorite model size on my home computer, and the mistrals have been historically great and embraced for various fine-tuning.

At 24b, I think this has a good chance of fitting on my more memory constrained work computer.

ericol 5 months ago

> the mistrals have been historically great and embraced for various fine-tuning Are there any guides on fine tuning them that you can recommend?
- ekam 5 months ago
  
  Unsloth is the one I personally hear the most about

timestretch 5 months ago

Their models have been great, but I wish they'd include the number of parameters in the model name, like every other model.

jbentley1 5 months ago

It's 24B parameters

rcarmo 5 months ago

There's also a 22b model that I appreciate, since it _almost_ fits into my 12GB 3060. But, alas, I might need to get a new GPU if this trend of fatter smaller models continues.

aargh_aargh 5 months ago

That's the older version (4 months old), check the release date.
- rcarmo 5 months ago
  
  Ah. I need to find a tighter quantization then, if it exists at all.

GaggiX 5 months ago

Hopefully they will finetuning it using RL like DeepSeek did, it would be great to have more open reasoning models.

Alifatisk 5 months ago

Is there a good benchmark one can look at that shows the best performing llm in terms of instruction following or overall score?

The only ones I am aware of is benchmarks on Twitter, Chatbot Arena [1] and Aider benchmark [2]

1. https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leade...

2. https://aider.chat/docs/leaderboards

mike31fr 5 months ago

Running it on a MacBook with M1 Pro chip and 32 GB of RAM is quite slow. I expected to be as fast as phi4 but it's much slower.

mike31fr 5 months ago

With eval rate numbers:
- phi4: 12 tokens/s
- mistral-small: 9 tokens/s
On Nvidia RTX 4090 laptop:
- phi4: 36 tokens/s
- mistral-small: 16 tokens/s

Terretta 5 months ago

"When quantized, Mistral Small 3 can be run privately on a single RTX 4090 or a Macbook with 32GB RAM."

jszymborski 5 months ago

The trouble now is finding an RTX 4090.
- hnuser123456 5 months ago
  
  RTX 3090s are easy to find and work just as well.
  - petercooper 5 months ago
    
    Running the Q4 quant (14GB or so in size) at 46 tok/sec on a 3090 Ti right now if anyone's curious to performance. Want the headroom to try and max out the context.
    
    earleybird 5 months ago
    
    Interesting - _q4 on a pair of 12Gb 3060s it runs at 20 tok/sec. _q8 (25Gb) on same is about 4 tok/sec.
    
    petercooper 5 months ago
    
    ~360GB/s memory bandwidth on the 3060, versus ~1008GB/s on the 3090 Ti probably accounts for that.
    Given that, I'd expect a single 3060 (if a large enough one existed) to run at about 16 tok/s so 20 tok/s on two isn't bad not being NVLinked.
- benkaiser 5 months ago
  
  Runs on an AMD 7900 XTX at about ~20 tokens per second using LM Studio + Vulkan.

unraveller 5 months ago

What's this stuff about the model catering to ‘80%’ of generative AI tasks? What model do they expect me to use for the other 20% of the time when my question needs reasoning smarts.

sneak 5 months ago

There are APIs that use a very small model to determine the complexity of the request then route it to different apis or models based on the result of that classifier model.
This way you can do cheap/local automatically without the api client having to know anything about it, and the proxy will send the requests out to an expensive big model only when necessary.
abdullahkhalids 5 months ago

Crazy idea: a small super fast model whose only job is to decide which model to send your task to.
- hahnchen 5 months ago
  
  Not so crazy, it sorta exists https://withmartian.com so it's probably a good idea to pursue
- sneak 5 months ago
  
  This already exists but I forgot the name. It’s an api proxy.
zamadatix 5 months ago

Take your pick based on your use cases and needs?
xnx 5 months ago

Mistral Large

Havoc 5 months ago

Used it a bit today on coding tasks and overall very pleasant. The combination of fast and fits into 24gb is also appreciated

Wouldn’t be surprised if this gets used a fair bit given open license

butz 5 months ago

Is there a gguf version that could be used with llamafile?

simonw 5 months ago

A bunch have started showing up here: https://huggingface.co/models?other=base_model:quantized:mis...
The lmstudio-community ones tend to work well in my experience.

picografix 5 months ago

Tried running locally, gone were the days where you get broken responses on local models (i know this happened earlier but I tried after so many days)

adt 5 months ago

https://lifearchitect.ai/models-table/

mrbonner 5 months ago

Is there a chance for me to get a eGPU (external GPU dock) for my M1 16GB laptop to plunge thru this model?

hnfong 5 months ago

The smaller IQ2/Q3 GGUF quants should run "fine" on your existing 16GB.
(also, I don't know that M1 supports any eGPU...)

rvz 5 months ago

The AI race to zero continues to accelerate and Mistral has shown one card to just stay in the race. (And released for free)

OpenAI's reaction to DeepSeek looked more like cope and panic after they realized they're getting squeezed at their own game.

Notice how Google hasn't said anything with these announcements and didn't rush out a model nor did they do any price cuts? They are not in panic and have something up their sleeve.

I'd expect Google to release a new reasoning model that is competitive with DeepSeek and o1 (or matches o3). Would be even more interesting if they release it for free.

beAbU 5 months ago

Google has been consistently found with their finger up their nose during this entire AI bubble.
The reason why they are so silent is because they are still reacting to ChatGPT 3.5
- jug 5 months ago
  
  Gemini 2.0 Experimental is now a leading LLM. They started out poorly, but after a more reasonable 1.5 Pro, 2.0 is in another class entirely and a direct competitor to o1 (or o1-mini as for Gemini 2.0 Flash). They've made quick strides forward as the DeepMind team is kicking into gear, and I feel like they're neglected a bit too often these days, especially now while usage cost on AI Studio is a nice $0.
  - Arthur_ODC 5 months ago
    
    Definitely. The 1206 is probably my favorite out of any I've ever used.
- christianqchung 5 months ago
  
  The Gemini launch was a complete disaster, but technically speaking since February 2024, Gemini 1.5 pro and the ensuing lineup have been very impressive.
  - rvz 5 months ago
    
    A lot can change in a year.
- staticman2 5 months ago
  
  Gemini 1.5 pro is extremely impressive at reading 2 million tokens of a document and answering questions about it. And at least for the time being it's offered for free on AI studio.
- upbeat_general 5 months ago
  
  imo gemini-exp-1206 is the best public LLM that exists right now.
  - staticman2 5 months ago
    
    Gemini-exp-1206 seems significantly dumber than 1.5 pro on reading and comprehending large context documents.
jiraiya0 5 months ago

Already tried it. It’s called gemini-2.0-flash-thinking-exp-01-21. Looks better than DeepSeek.
- k__ 5 months ago
  
  R1 or V3?
  - jiraiya0 5 months ago
    
    Both

Havoc 5 months ago

How does that fit into a 4090? The files on the repo look way too large. Do they mean a quant?

cbg0 5 months ago

> Mistral Small can be deployed locally and is exceptionally "knowledge-dense", fitting in a single RTX 4090 or a 32GB RAM MacBook once quantized.

fuegoio 5 months ago

Finally something from them

azinman2 5 months ago

They released codestral on Jan 13. What do you mean by “finally”?

strobe 5 months ago

not sure how much worse it than original but mistral-small:22b-instruct-2409-q2_K seems works on 16GB VRAM GPU

resource_waste 5 months ago

Curious how it actually compares to LLaMa.

Last year Mistral was garbage compared to LLaMa. I needed a permissive license, so I was forced to use Mistral, but I had LLaMa that I could compare it to. I was always extremely jealous of LLaMa since the Berkley Sterling finetune was so amazing.

I ended up giving up on the project because Mistral was so unusable.

My conspiracy was that there was some European patriotism that gave Mistral a bit more hype than was merited.

maven29 5 months ago

They're both European. Look at the author names on the llama paper.
- resource_waste 5 months ago
  
  That is a very European thing to say/do/claim.
  - cpldcpu 5 months ago
    
    a goof part of team is actually located in europe
    
    resource_waste 5 months ago
    
    I said it was a very European thing to say, because only a European would stretch that hard.
    Merikan company with Merikan investment get the credit. No one cares except Europeans about the interchangable workers residency is.
    I'm trying to remember the other case where people lol'd at Europe/Italy for taking credit for something that was clearly invented in the US. I think the person was born there, and moved to the US, but Italy still took credit.
    lol no. Its probably even more embarrassing that they left Europe.
    
    cpldcpu 5 months ago
    
    I thought now everything is about meritocracy? Have we been duped?
staticman2 5 months ago

I remember a huge improvement between the Mistral models that came out in January and February 2024 versus their later releases starting in July 2024, so it wouldn't surprise me if this was a good model for it's size.

netdur 5 months ago

seems on par or better than gpt4 mini

mariconrobot 5 months ago

i cunt get past the name

fvv 5 months ago

given new USA ai diffusion rules will mistral be able to survive and attract new capitals ? , I mean, given that france is top tier country

beAbU 5 months ago

This sounds like a USA problem, rather than a Mistral problem.
- lkbm 5 months ago
  
  Not being able to attract capital would clearly be a Mistral problem.
  - halfadot 5 months ago
    
    Luckily for Mistral, capital also exists in countries other than the USA.
solomatov 5 months ago

What are these ai diffusion rules?
- Beretta_Vexee 5 months ago
  
  "Those destinations, which are listed in paragraph (a) to Supplement No. 5 to Part 740, are Australia, Belgium, Canada, Denmark, Finland, France, Germany, Ireland, Italy, Japan, the Netherlands, New Zealand, Norway, Republic of Korea, Spain, Sweden, Taiwan, the United Kingdom, and the United States. For these destinations, this IFR makes minimal changes: companies in these destinations generally will be able to obtain the most advanced ICs without a license as long as they certify compliance with specific requirements provided in § 740.27." [0]
  France seems clearly exempt from most of the requirements. The main requirement of 740.27 is to sign a license under U.S. law, under which customers are prohibited from re-exporting ICs to non-Third 1 countries without U.S. approval.
  What's more, the text refers to AIs, which can have dual uses. The concept of dual civil-military use concerns a large number of technologies, and dates back to the first nuclear technologies.
  The text gives a few examples of dual-use models, such as models that simulate or facilitate the production of chemical compounds that could be used for chemical weapon creation, non conventional weapon creation or that could simplify or replace already identified dual-use goods or technologies.
  These uses are already covered by existing legislation on dual-use goods, and US export control. The American legislator is therefore potentially thinking of other uses, such as satellite and radar image analysis, and electronic warfare.
  As France is a nuclear-armed country with its own version of thoses technologies, it makes little sense to place it under embargo.
  But France isn't going to like being obliged once again to be forced to apply American law and regulation on its soil.
  As a European, I hope that alternatives to American dependence will soon appear.
  [0] https://www.federalregister.gov/documents/2025/01/15/2025-00...
  - wongarsu 5 months ago
    
    Export restrictions on US models seem like a boon for French Mistral, not a problem.
    Even if for now France is strong-armed into applying the same restrictions, they will be in a much better position than US companies if US-Europe relations deteriorate. Something that's not entirely unlikely under Trump. We are a week into his presidency and France is already talking about deploying troops to Greenland
    
    Beretta_Vexee 5 months ago
    
    In the case of Mistral, the restrictions on exporting the model would not apply, since the model is produced in France. What's more, Mistral won't be much help in producing deadly gases or uranium enrichment facilities. It is therefore not subject to this legislation for these two reasons.
    On the other hand, ICs could be subject to restrictions, and France has no alternative for sourcing large-capacity ICs.
    The USA could use dollar-denominated transactions to broaden the scope of the text. It's not insurmountable, but it will complicate matters.
    
    Der_Einzige 5 months ago
    
    Actually, Mistral by virtue of not aligning/safety lobotomizing their models will indeed trivially "help in producing deadly gases or uranium enrichment facilities!"
    
    Beretta_Vexee 5 months ago
    
    It's quite a feat to demonstrate that you know nothing about LLMs, chemistry or the nuclear industry in so few characters.
    
    Der_Einzige 5 months ago
    
    Let's see your google scholar or huggingface profile. You better square up if you're going to talk shit.
    It's telling that the release from OpenAI today warns about exactly this threat in their lengthy security section: https://cdn.openai.com/o3-mini-system-card.pdf
  - solomatov 5 months ago
    
    Thanks, very interesting information. I was aware of IC rules, but have never heard about limitations on model weights.

m3kw9 5 months ago

Sorry to dampen the news but 4o-mini level isn’t really a useful model other than talk to me for fun type of applications.