That "jmp rax" instruction is an indirect branch that can be predicted by the CPU. The CPU effectively guesses for us!
And unlike any kind of guess we might have performed ahead-of-time, the branch predictor will constantly adapt to whatever patterns it is seeing in the data. This is good, because statically guessing ahead-of-time is hard.
> This kind of "expected next field" optimization has a long history in protobuf
You could probably even trace the history of the idea all the way to Van Jacobson’s 30-instruction TCP fastpath[1]. Or to go a bit closer, I’ve found that an interpreter for a stack+accumulator VM (which, compared to the pure stack option, is prone to blowing up the bytecode count and thus dispatch cost with the constant PUSH-accumulator instructions) goes significantly faster if you change the (non-shared) dispatch from
which feels somewhat analogous to the next-field optimization and avoids polluting the indirect branch predictor with the very common PUSH predictions. (It’s still slower than not having those PUSHes in the first place.)
> More recently, we've moved towards making Protobuf parsing more data-driven, where each field's schema is compiled into data that is passed as an argument to a generic Protobuf parser function. We call this "table-driven parsing", and from my read of the blog article, I believe this is what Miguel is doing with hyperpb.
Everything old is new again, I guess—one of the more advertised changes in Microsoft COM as it was maturing (circa 1995) was that you could use data-driven marshalling with “NDR format strings” (bytecode, essentially[1,2]) instead of generating C code. Shortly after there was typelib marshalling (format-compatible but limited), and much later also WinRT’s metadata-driven marshalling (widely advertised but basically completely undocumented).
Fabrice Bellard’s nonfree ASN.1 compiler[3] is also notable for converting schemas into data rather than code, unlike most of its open-source alternatives.
I still can’t help wondering what it is, really, that makes the bytecode-VM approach advantageous. In 1995, the answer seems simple: the inevitable binary bloat was unacceptable for a system that needs to fit into single-digit megabytes of RAM; and of course bytecode as a way to reduce code footprint has plenty of prior art (microcomputer BASICs, SWEET16, Forth, P-code, even as far back as the AGC).
Nowadays, the answer doesn’t seem as straightforward. Sure, the footprint is enormous if you’re Google, but you’re not Google (you’re probably not even Cloudflare), and besides, I hope you’re not Google and can design stuff that’s actually properly adapted to static linking. Sure, the I$ pressure is significant (thinking back to Mike Pall’s explanation[4] why he didn’t find a baseline JIT to be useful), but the bytecode interpreter isn’t going to be a speed demon either.
I don’t get it. I can believe it’s true, but I don’t really feel that I get it.
In some workloads, you can benefit from paying the latency cost of data dependencies instead of the latency cost of conditional control flow, but there's no general rule here. It's best to try out several options on the actual task and the actual production data distribution.
Regular expression engines on this model are often called VMs, certainly thats the terminology and layout I used. A big switch/computed-goto loop thing with state codes? Yeah, thats a VM.
I think its a very illuminating way to describe it. Nicely done with the implementation as well.
> This means that the overall structure of a protobuf parser is conceptually a while() loop surrounding a switch() statement, just like a VM interpreter.
This is a very insightful design. Encoding the parser in a table is an old technique, it's what YACC uses. There's a tradeoff between using the CPU's stack+registers versus having your own stack+state in this kind of work, and people have found situations where a small VM is faster because it gets to stay in the instruction cache and benefit from branch prediction, while the less-predictable data stays in the d-cache.
Based on what I know about the structure of protobufs internally and without having looked deep into what UPB is doing... I'd guess it could probably be a stack machine that treats (byte)+ as opcodes. Most of the time I'd think of it as parser -> AST -> bytecode, but I think the "grammar" of protobufs would allow your parser to essentially emit terminals as they're parsed straight to the VM as instructions to execute.
In the couple days since I posted my confusion (threads merged or something) I consulted the daffy robots and figured out how it all works. Also had them come up with a design document for "a specialized compiler and virtual machine architecture for parsing Protocol Buffer messages that achieves significant performance improvements through a novel compilation pipeline combining protobuf-specific AST optimization, continuation-passing style transformations, and tail call interpreter execution."
So am I identifying the bottlenecks that motivate this design correctly?
1. Go FFI is slow
2. Per-proto generated code specialization is slow, because of icache pressure
I know there's more to the optimization story here, but I guess these are the primary motivations for the VM over just better code generation or implementing a parser in non-Go?
I know that Java resisted improving their FFI for years because they preferred that the JIT get the extra resources. And that customers not bail out of Java every time they couldn’t figure out how to make it faster. There’s a case I recall from when HotSpot was still young, where the Java GUI team moved part of the graphics pipeline to the FFI in one release, hotspot got faster in the next, and then they rolled back the changes because it was now faster without the FFI.
But eventually your compiler is good enough that the FFI Is now your bottleneck, and you need to do something.
Go's goroutines aren't plain C threads (blocking syscalls are magically made async), and Go's stack isn't a normal C stack (it's tiny and grown dynamically).
A C function won't know how to behave in Go's runtime environment, so to call a C function Go needs make itself look more like a C program, call the C function, and then restore its magic state.
Other languages like C++, Rust, and Swift are similar enough to C that they can just call C functions directly. CPython is a C program, so it can too. Golang was brave enough to do fundamental things its own way, which isn't quite C-compatible.
Go (gc) was also a C program originally. It still had the same overhead back then as it does now. The implementation language is immaterial. How things are implemented is what is significant. Go (tinygo), being a different implementation, can call C functions as fast as C can.
> ...so it can too.
In my experience, the C FFI overhead in CPython is significantly higher than Go (gc). How are you managing to avoid it?
I think in case of CPython it's just Python being slow to do anything. There are costs of the interpreter, GIL, and conversion between Python's objects and low-level data representation, but the FFI boundary itself is just a trivial function call.
I wonder if they should be using something like libuv to handle this. Instead of flipping state back and forth, create a playground for the C code that looks more like what it expects.
Java FFI is slow and cumbersome, even more so if you're using the fancy auto-async from recent versions. The JVM community has mostly bitten the bullet and rewritten the entire world in Java rather than using native libraries, you only see JNI calls for niche things like high performance linear algebra; IMO that was the right tradeoff but it's also often seen as e.g. the reason why Java GUIs on the desktop suck.
Other languages generally fall into either camp of having a C-like stack and thread model and easy FFI (e.g. Ruby, TCL, OCaml) and maybe having futures/async but not in an invisible/magic way, or having a radically different threading model at the cost of FFI being slow and painful (e.g. Erlang). JavaScript is kind of special in having C-like stack but being built around calling async functions from a global event loop, so it's technically the first but feels more like the second.
JNI is the second or maybe third FFI for Java. JRI existed before it and that was worse, including performance. The debugging and instrumentation interfaces have been rewritten more times.
C# does marshal/unmarshal for you, with a certain amount of GC-pinning required for structures while the function is executing. It's pretty convenient, although not frictionless, and I wouldn't like to say how fast it is.
> How much slower is it in comparison to other languages?
It's about the same as most other languages that aren't specifically optimized for C calling. Considerably faster than Python.
Which is funny as everyone on HN loves to extol the virtues of Python being a "C DSL" and never think twice about its overhead, but as soon as the word Go is mentioned its like your computer is going to catch fire if you even try.
I've asked ChatGPT to summarize (granted my prompt might not be ideal), but some points to note, here just first in details others in the link at the bottom:
I would love to benchmark it against the old gogo protobuf. One thing that really slows down PB is its insistence on using pointers everywhere. Gogo protobuf used value types where possible, significantly reducing the GC pressure.
It feels like arenas are just an attempt to bring that back.
Not enough programmers talk to QA, and so are oblivious to testing.
The benchmarking here is... pointless. Here's my story (of writing a Protobuf parser) to illustrate why. I already wrote about it before, but it keeps happening, so, maybe repeating it is not such a bad thing.
So... the official (coming from Google) Protobuf parser for Python is ridiculously bad. It generates Python code for the parser. Not only is this unnecessary, it's also executed poorly. So, I decided to roll my own. Also, I decided to make it multi-threaded (I wrote it in C, with Python bindings).
Then I tried to benchmark it against the official implementation. I don't trust the quality of Python language implementation, so, that came under the spotlight first. For example, I chose Python's enumerators to represent Protobuf enumerators. Turned out that was a horrible decision because instantiation of Python's enumerators is extremely resource-intensive. I had to make a few similar adjustments, searching for the less resource-hungry Python built-in data-structures.
And yet I was still far behind on my benchmarks. So, I tried to understand what the Google's C++ code was doing. And then I realized that Google's parser in the "first pass" only extracted the top-level messages, not even trying to parse the hierarchical structure. This structure was only extracted on-demand. So, essentially, I was competing against memcpy()...
But, here's another catch: most real-world applications will not want the "intermediate" Protobuf structure made up of general-purpose data-structures such as dictionaries or enumerators. They will want them immediately parsed into domain-specific data-structures. So, a parser that offers this kind of functionality might be slower to parse into generic data-structures, but will win in real-world benchmarks.
Of course, the percentage of useful payload is also important. Typically, applications underutilize the messages they exchange. But, to what degree is different and completely dependent on application design. Similarly, mapping into domain-specific objects will depend on the sort of information the application wants to exchange. Protobuf is bad for column / tabular data, but somewhat better for object graph kind of data. Protobuf is bad for large messages with duplicated content, but is somewhat better for short messages with little duplication.
All of this, and much more, allows for a lot of variability in benchmark design, and would allow me, or anyone writing a parser to skew the numbers in whatever direction I want... to amuse the general public.
There are two ways to look at this.
First is that, if the parsing library for your codec includes a compiler, VM, and PGO, your codec must be extremely cursed and you should take a step back and think about your life.
Second is that, if the parsing library for your codec includes a compiler, VM, and PGO, your codec must be wildly popular and add enormous value.
If you want to do something several hundreds of billions of times per day, you probably want to do it very efficiently.
Or, you know, several hundreds of billions of times per second...
I'm not sure which part you're objecting to.
If it's compilation at run-time, then I agree: it should be done at build time. But in this case it's really not a big deal.
If you're objecting to needing a compiler, then... you're not even wrong.
You can also parse Protobuf on the very low end, if you don't need super high throughput
https://jpa.kapsi.fi/nanopb/
This is excellent: an in-depth description showing how the Go internals make writing fast interpreters difficult, by someone who is far more determined than I ever was to make it fast anyway.
I’ve assumed that writing fast interpreters wasn’t a use case the Go team cared much about, but if it makes protobuf parsing faster, maybe it will get some attention, and some of these low-level tricks will no longer be necessary?
> and some of these low-level tricks will no longer be necessary?
Don't count on it. This language is Golang.
Hi to my favorite Turkish teenager. You really don't get to generate fast codecs from normal-looking code in any language :(
In as-far-as wuffs looks normal, wuffs?
wuffs is extremely cool, but you do find the expected blobs of intrinsics for specific simd extensions in this code.
Offtopic, but is anyone using CapnProto, the ProtoBuf former maintainer's (kentonv around here) subsequent project?
https://capnproto.org
If so, how does it compare in practice?
(what does Cloudflare Workers use?)
Cloudflare is almost entirely run on Cap'n Proto, including the entire workers platform
The Workers platform uses Cap'n Proto extensively, as one might expect (with me being the author of Cap'n Proto and the lead dev on Workers). Some other parts of Cloudflare use it (including the logging pipeline, which used it before I even joined), but there are also many services using gRPC or just JSON. Each team makes their own decisions.
I have it on my TODO list to write a blog post about Workers' use of Cap'n Proto. By far the biggest wins for us are from the RPC system -- the serialization honestly doesn't matter so much.
That said, the ecosystem around Cap'n Proto is obviously lacking compared to protobuf. For the Cloudflare Workers Runtime team specifically, the fact that we own it and can make any changes we need to balances this out. But I'm hesitant to recommend it to people at other companies, unless you are eager to jump into the source code whenever there's a problem.
A while ago I talked to some team that was planning a migration to GraphQL, long after this was generally thought to be a bad idea. The lead seemed really attached to the "composable RPCs" aspect of the thing, and at the time it seemed like nothing else offered this. It would be quite cool if capnproto became a more credible option for this sort of situation. At the time users could read about the rpc composition/promise passing/"negative latency" stuff, but it was not quite implemented.
This makes me really sad. Protobufs are not all that great, but they were there first and “good enough”.
It’s frustrating when we can’t have nice things because a mediocre Google product has sucked all the air out of the room. I’m not only talking about Protobufs here either.
Thank you for posting here. Always insightful, always a treat.
To piggyback on this, is anyone using flatbuffers? They solve the same problem as Cap'n Proto and are a basis of Arrow.
I've used them, but that was long ago, so I don't know their current state.
We use flatbuffers extensively in our message passing framework, and they are extremely fast if you take care with your implementation. They have a few features that make them especially useful for us:
1) The flatbuffer parser can be configured at runtime from a schema file, so our message passing runtime does not need to know about any schemas at build time. It reads the schema files at startup, and is henceforth capable of translating messages to and from JSON when required. It's also possible to determine that two schemas will be compatible at runtime.
2) Messages can be re-used. For our high-rate messages, we build a message and then modify it to send again, rather than building it from scratch each time.
3) Zero decode overhead - there is often no need to deserialise messages - so we can avoid copying the data therein.
The flatbuffer compiler is also extremely fast, which is nice at build time.
Used them before and they're ok. They were missing some important features like sum types. The code output was a pain, but targeted a few languages. My suspicion is that Captain Proto would be technically superior but less documented than flatbuffers.
However, my preference is to use something like MsgPack or CBOR with compile-time reflection to do serde directly into types. You can design the types to require minimal allocations and to parse messages within a few dozen nanoseconds. That means doing things like using static char arrays for strings. It wastes a bit of space, but it can be very fast. Also, the space you save by skipping 64-bit pointers can cover a lot of shorter text fields.
That said, I should wrap or port this UPB to Nim. It'd be a nice alternative if it's really as fast as claimed. Though how it handles allocating would be the clincher.
> They were missing some important features like sum types.
They support unions now. I haven't had any trouble representing anything including recursive structures.
> The code output was a pain, but targeted a few languages.
You do need a day or so to get used to the code. It's a pointer-based system with a 'flat memory'. Order of operations matters. If you have a parent and a child, you need to write the child first, obtain a pointer to it, and only then create/write the parent containing the pointer to the child. Once you get used to this it goes quickly. The advantage is that you don't have to create an intermediate copy in memory when reading (like protobuf), and you can read any particular part by traversing pointers without having to load the rest of the data into memory.
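To make the ordering concrete, here is a minimal sketch using the low-level Go flatbuffers builder API (github.com/google/flatbuffers/go) instead of schema-generated code; the table layouts, slot numbers, and values are invented for illustration:

    package main

    import (
        "fmt"

        flatbuffers "github.com/google/flatbuffers/go"
    )

    func main() {
        b := flatbuffers.NewBuilder(1024)

        // 1. Write the child first (strings, too, are written before the table
        //    that refers to them) and remember its offset.
        childName := b.CreateString("child")
        b.StartObject(1) // hypothetical child table with one field
        b.PrependUOffsetTSlot(0, childName, 0)
        child := b.EndObject()

        // 2. Only now write the parent, storing the offset ("pointer") to the
        //    already-written child.
        b.StartObject(2) // hypothetical parent table with two fields
        b.PrependUOffsetTSlot(0, child, 0)
        b.PrependInt32Slot(1, 42, 0)
        parent := b.EndObject()

        b.Finish(parent)
        fmt.Println("encoded size:", len(b.FinishedBytes()))
    }

Generated builders are essentially thin wrappers over these calls, which is why the child-before-parent ordering shows through in the generated APIs as well.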
To piggyback on this, is anyone using CBOR? It solves the same problem as Protobuf, but is more like JSON in that a schema is not required.
Amazon uses CBOR extensively. Most AWS services by now should support being called using CBOR. The protocol they're using is publicly documented at: https://smithy.io/2.0/additional-specs/protocols/smithy-rpc-...
The services serve both CBOR and other protocols simultaneously.
Neither of them are well-designed or well thought-through. They address some cosmetic issues of Protobuf, but don't really deal with the major issues. So, you could say they are slightly better, but the original was bad enough to completely disqualify it either.
Flatbuffers lets you directly mmap from disk; that trick alone makes it really good for use cases that can take advantage of it (fast access to read-only data). If you're clever enough to tune the ordering of fields you can give it good cache locality and really make it fly.
We used to store animation data in mmaped flatbuffers at a previous gig and it worked really well. Kernel would happily prefetch on access and page out under pressure, we could have 10s of MBs of animation data and only pay a couple hundred kb based on access patterns.
Yes, I am, it's fast and great, but the UX in Go is a bit annoying since you need to check for errors on literally every set (to catch possible arena errors). So your code has even more `if err != nil {return nil, err}` than usual Go code.
> In other words, a UPB parser is actually configuration for an interpreter VM, which executes Protobuf messages as its bytecode.
This is kind of confusing: is the VM crafted at runtime to parse a single protobuf message type, and only that message type? The Second Futamura Projection, I suppose...
Or is the VM designed around generic protobuf messages, so that it can parse any random message as long as it's a protobuf message?
I've been working on the design of a similar system but for general binary parsing (think bison/yacc for binary data), and hadn't even considered data over a specialized VM vs. bytecode+data over a general VM. Honestly, since it's designed around 'maximum laziness' (it just parses/verifies and creates metadata over the input, so you only pay for decoding bytes you actually use) and I/O overhead is way greater than the VM dispatching, trying this out is probably one of those "premature optimization is the root of all evil" cases, but it's intriguing nonetheless.
I think I can shed some light on this, as the creator and lead of upb.
Calling a Protobuf Parser an "interpreter VM" is a little bit of rhetorical flourish. It comes from the observation that there are some deep structural similarities between the two, which I first observed in an article a few years back: https://blog.reverberate.org/2021/04/21/musttail-efficient-i...
> It may seem odd to compare interpreter loops to protobuf parsers, but the nature of the protobuf wire format makes them more similar than you might expect. The protobuf wire format is a series of tag/value pairs, where the tag contains a field number and wire type. This tag acts similarly to an interpreter opcode: it tells us what operation we need to perform to parse this field’s data. Like interpreter opcodes, protobuf field numbers can come in any order, so we have to be prepared to dispatch to any part of the code at any time.
This means that the overall structure of a protobuf parser is conceptually a while() loop surrounding a switch() statement, just like a VM interpreter.
The tricky part is that the set of "case" labels for a Protobuf parser is message-specific and defined by the fields in the schema. How do we accommodate that?
The traditional answer was to generate a function per message and use the schema's field numbers as the case labels. You can see an example of that here (in C++): https://github.com/protocolbuffers/protobuf/blob/f763a2a8608...
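For intuition, here is roughly what that shape looks like in Go; this is an illustrative sketch using the protowire helpers, not actual protoc-generated code, and the Person message with its two field numbers is invented for the example:

    package pbparse

    import "google.golang.org/protobuf/encoding/protowire"

    // Person is an invented example message:
    //   string name = 1;
    //   int64  id   = 2;
    type Person struct {
        Name string
        ID   int64
    }

    // parsePerson mimics the "generated" style: a loop over tag/value pairs
    // with the schema's field numbers baked in as switch cases.
    func parsePerson(b []byte, m *Person) error {
        for len(b) > 0 {
            num, typ, n := protowire.ConsumeTag(b) // tag = field number + wire type
            if n < 0 {
                return protowire.ParseError(n)
            }
            b = b[n:]
            switch num {
            case 1: // name: length-delimited (a real parser would also check typ)
                v, n := protowire.ConsumeBytes(b)
                if n < 0 {
                    return protowire.ParseError(n)
                }
                m.Name, b = string(v), b[n:]
            case 2: // id: varint
                v, n := protowire.ConsumeVarint(b)
                if n < 0 {
                    return protowire.ParseError(n)
                }
                m.ID, b = int64(v), b[n:]
            default: // unknown field: skip it by wire type
                n := protowire.ConsumeFieldValue(num, typ, b)
                if n < 0 {
                    return protowire.ParseError(n)
                }
                b = b[n:]
            }
        }
        return nil
    }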
More recently, we've moved towards making Protobuf parsing more data-driven, where each field's schema is compiled into data that is passed as an argument to a generic Protobuf parser function. We call this "table-driven parsing", and from my read of the blog article, I believe this is what Miguel is doing with hyperpb.
The trick then becomes how to make this table-driven dispatch as fast as possible, to simulate what the switch() statement would have done. That question is what I cover at length in the article mentioned above.
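And a toy version of the table-driven alternative, for contrast (again only a sketch; upb's and hyperpb's real tables are far more compact and carry much more per-field metadata): the schema-specific part is now plain data, and one generic parse function is shared by every message type.

    package pbtable

    import "google.golang.org/protobuf/encoding/protowire"

    // fieldEntry is the "compiled" form of one field from the schema.
    type fieldEntry struct {
        name string
        kind protowire.Type // expected wire type for this field
    }

    // messageTable maps field numbers to entries; building one of these is
    // the moral equivalent of generating a parser for the message.
    type messageTable map[protowire.Number]fieldEntry

    // parse is generic: all message-specific knowledge comes in via the table.
    func parse(b []byte, table messageTable) (map[string]any, error) {
        msg := make(map[string]any)
        for len(b) > 0 {
            num, typ, n := protowire.ConsumeTag(b)
            if n < 0 {
                return nil, protowire.ParseError(n)
            }
            b = b[n:]
            e, known := table[num]
            if !known || typ != e.kind {
                // Unknown or mismatched field: skip it by wire type.
                if n = protowire.ConsumeFieldValue(num, typ, b); n < 0 {
                    return nil, protowire.ParseError(n)
                }
                b = b[n:]
                continue
            }
            switch typ {
            case protowire.VarintType:
                v, n := protowire.ConsumeVarint(b)
                if n < 0 {
                    return nil, protowire.ParseError(n)
                }
                msg[e.name], b = v, b[n:]
            case protowire.BytesType:
                v, n := protowire.ConsumeBytes(b)
                if n < 0 {
                    return nil, protowire.ParseError(n)
                }
                msg[e.name], b = string(v), b[n:]
            default: // other wire types left out of this sketch: just skip them
                if n = protowire.ConsumeFieldValue(num, typ, b); n < 0 {
                    return nil, protowire.ParseError(n)
                }
                b = b[n:]
            }
        }
        return msg, nil
    }

A table for the invented Person message above would just be messageTable{1: {"name", protowire.BytesType}, 2: {"id", protowire.VarintType}}: data rather than code, which is why one copy of parse can serve any number of message types without growing the instruction-cache footprint.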
Really great. I wonder, for the "types encoded as code" approach, is there any benefit to fast paths for data with fields in ascending order? For some json parsers with types encoded as code I have observed some speedup from either hard-coding a known key order or assuming keys in some order and providing a fallback in case an unexpected key is encountered. For users who are stuck with protobuf forever because of various services using it and various data being encoded this way, the historical data could plausibly be canonicalized and written back in large chunks when it is accessed, so that one need not pay the entire cost of canonicalizing it all at once. But of course the icache concerns are still just as bad.
This kind of "expected next field" optimization has a long history in protobuf, but results are mixed.
The generated code in C++ used to check for the expected next field before falling back to the switch() (example here: https://github.com/protocolbuffers/protobuf/blob/460e7dd7c47...) but this was removed in 2016 when load tests found that it hurt more than it helped.
One tricky part of making this optimization work is making a good guess about what the next field should be. Miguel's article alludes to this:
> Each field specifies which fields to try next. This allows the compiler to perform field scheduling, by carefully deciding which order to try fields in based both on their declaration order and a rough estimation of their “hotness”, much like branch scheduling happens in a program compiler. This avoids almost all of the work of looking up the next field in the common case, because we have already pre-loaded the correct guess.
> I haven’t managed to nail down a good algorithm for this yet, but I am working on a system for implementing a type of “branch prediction” for PGO, that tries to provide better predictions for the next fields to try based on what has been seen before.
One delightful thing about the tail-call parser design is that the CPU's branch predictor effectively takes over the job of guessing. With a tail call parser, the dispatch sequence ends up looking like this:

    cmp QWORD PTR [rdi+0x8],rsi           # Bounds check
    jbe .fallback
    movzx r10d,WORD PTR [rsi]             # Load two bytes of tag
    mov eax,r10d
    and eax,0xf8
    mov r9,QWORD PTR [rcx+rax*2]          # Load table data
    xor r9,r10
    mov rax,QWORD PTR [rcx+rax*2+0x8]     # Load field parser function
    jmp rax                               # Tail call to field parser

That "jmp rax" instruction is an indirect branch that can be predicted by the CPU. The CPU effectively guesses for us!

And unlike any kind of guess we might have performed ahead-of-time, the branch predictor will constantly adapt to whatever patterns it is seeing in the data. This is good, because statically guessing ahead-of-time is hard.
> This kind of "expected next field" optimization has a long history in protobuf
You could probably even trace the history of the idea all the way to Van Jacobson’s 30-instruction TCP fastpath[1]. Or to go a bit closer, I’ve found that an interpreter for a stack+accumulator VM (which, compared to the pure stack option, is prone to blowing up the bytecode count and thus dispatch cost with the constant PUSH-accumulator instructions) goes significantly faster if you change the (non-shared) dispatch from
to which feels somewhat analogous to the next-field optimization and avoids polluting the indirect branch predictor with the very common PUSH predictions. (It’s still slower than not having those PUSHes in the first place.)

[1] https://www.pdl.cmu.edu/mailinglists/ips/mail/msg00133.html
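The two dispatch snippets referred to above are not shown here, so the following is only a guess at the general shape of the trick, sketched in Go with invented names (opPush, handlers, and the vm layout are all assumptions): peel the very common PUSH opcode off with an ordinary compare before falling back to the table-driven indirect dispatch, so the indirect predictor only has to learn the rarer opcodes.

    package interp

    // A hypothetical stack+accumulator VM. run() checks for the hot PUSH
    // opcode with a plain conditional before using the per-opcode handler
    // table, so the indirect call is reserved for the less common opcodes.
    type vm struct {
        code     []byte
        stack    []int64
        acc      int64
        handlers [256]func(*vm, int) int // each handler returns the next pc
    }

    const opPush byte = 0x01 // invented opcode numbering

    func (m *vm) run() {
        pc := 0
        for pc < len(m.code) {
            op := m.code[pc]
            if op == opPush { // ordinary branch: cheap and well predicted
                m.stack = append(m.stack, m.acc)
                pc++
                continue
            }
            pc = m.handlers[op](m, pc) // indirect dispatch for everything else
        }
    }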
> More recently, we've moved towards making Protobuf parsing more data-driven, where each field's schema is compiled into data that is passed as an argument to a generic Protobuf parser function. We call this "table-driven parsing", and from my read of the blog article, I believe this is what Miguel is doing with hyperpb.
Everything old is new again, I guess—one of the more advertised changes in Microsoft COM as it was maturing (circa 1995) was that you could use data-driven marshalling with “NDR format strings” (bytecode, essentially[1,2]) instead of generating C code. Shortly after there was typelib marshalling (format-compatible but limited), and much later also WinRT’s metadata-driven marshalling (widely advertised but basically completely undocumented).
Fabrice Bellard’s nonfree ASN.1 compiler[3] is also notable for converting schemas into data rather than code, unlike most of its open-source alternatives.
I still can’t help wondering what it is, really, that makes the bytecode-VM approach advantageous. In 1995, the answer seems simple: the inevitable binary bloat was unacceptable for a system that needs to fit into single-digit megabytes of RAM; and of course bytecode as a way to reduce code footprint has plenty of prior art (microcomputer BASICs, SWEET16, Forth, P-code, even as far back as the AGC).
Nowadays, the answer doesn’t seem as straightforward. Sure, the footprint is enormous if you’re Google, but you’re not Google (you’re probably not even Cloudflare), and besides, I hope you’re not Google and can design stuff that’s actually properly adapted to static linking. Sure, the I$ pressure is significant (thinking back to Mike Pall’s explanation[4] why he didn’t find a baseline JIT to be useful), but the bytecode interpreter isn’t going to be a speed demon either.
I don’t get it. I can believe it’s true, but I don’t really feel that I get it.
[1] https://learn.microsoft.com/en-us/windows/win32/rpc/rpc-ndr-...
[2] https://gary-nebbett.blogspot.com/2020/04/rpc-ndr-engine-dce...
[3] https://bellard.org/ffasn1/
[4] https://news.ycombinator.com/item?id=14460027
In some workloads, you can benefit from paying the latency cost of data dependencies instead of the latency cost of conditional control flow, but there's no general rule here. It's best to try out several options on the actual task and the actual production data distribution.
Regular expression engines on this model are often called VMs; certainly that's the terminology and layout I used. A big switch/computed-goto loop thing with state codes? Yeah, that's a VM.
I think it's a very illuminating way to describe it. Nicely done with the implementation as well.
> This means that the overall structure of a protobuf parser is conceptually a while() loop surrounding a switch() statement, just like a VM interpreter.
This is a very insightful design. Encoding the parser in a table is an old technique, it's what YACC uses. There's a tradeoff between using the CPU's stack+registers versus having your own stack+state in this kind of work, and people have found situations where a small VM is faster because it gets to stay in the instruction cache and benefit from branch prediction, while the less-predictable data stays in the d-cache.
Based on what I know about the structure of protobufs internally and without having looked deep into what UPB is doing... I'd guess it could probably be a stack machine that treats (byte)+ as opcodes. Most of the time I'd think of it as parser -> AST -> bytecode, but I think the "grammar" of protobufs would allow your parser to essentially emit terminals as they're parsed straight to the VM as instructions to execute.
In the couple days since I posted my confusion (threads merged or something) I consulted the daffy robots and figured out how it all works. Also had them come up with a design document for "a specialized compiler and virtual machine architecture for parsing Protocol Buffer messages that achieves significant performance improvements through a novel compilation pipeline combining protobuf-specific AST optimization, continuation-passing style transformations, and tail call interpreter execution."
Interesting times we live in...
So am I identifying the bottlenecks that motivate this design correctly?
1. Go FFI is slow
2. Per-proto generated code specialization is slow, because of icache pressure
I know there's more to the optimization story here, but I guess these are the primary motivations for the VM over just better code generation or implementing a parser in non-Go?
I know that Java resisted improving their FFI for years because they preferred that the JIT get the extra resources. And that customers not bail out of Java every time they couldn’t figure out how to make it faster. There’s a case I recall from when HotSpot was still young, where the Java GUI team moved part of the graphics pipeline to the FFI in one release, HotSpot got faster in the next, and then they rolled back the changes because it was now faster without the FFI.
But eventually your compiler is good enough that the FFI is now your bottleneck, and you need to do something.
3. The use case is dynamic schemas and access is through the reflection API. Thus PGO has to be done at runtime...
I keep hearing that Go's C FFI is slow, why is that? How much slower is it in comparison to other languages?
Go's goroutines aren't plain C threads (blocking syscalls are magically made async), and Go's stack isn't a normal C stack (it's tiny and grown dynamically).
A C function won't know how to behave in Go's runtime environment, so to call a C function Go needs to make itself look more like a C program, call the C function, and then restore its magic state.
Other languages like C++, Rust, and Swift are similar enough to C that they can just call C functions directly. CPython is a C program, so it can too. Golang was brave enough to do fundamental things its own way, which isn't quite C-compatible.
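If you want to put a number on the boundary cost yourself, a minimal cgo micro-benchmark looks something like this (the no-op C function and the iteration count are arbitrary, and the result varies a lot by CPU and Go version):

    package main

    /*
    static int noop(int x) { return x + 1; }
    */
    import "C"

    import (
        "fmt"
        "time"
    )

    func main() {
        const iters = 1_000_000
        var v C.int

        start := time.Now()
        for i := 0; i < iters; i++ {
            // Each call crosses the Go<->C boundary: switch to a C-sized stack,
            // tell the scheduler the thread may block, call, then switch back.
            v = C.noop(v)
        }
        fmt.Println(v, "average per cgo call:", time.Since(start)/iters)
    }

Compare it against a plain Go function call in the same loop and the difference is the overhead being discussed, roughly the ballpark the numbers elsewhere in the thread suggest.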
> CPython is a C program
Go (gc) was also a C program originally. It still had the same overhead back then as it does now. The implementation language is immaterial. How things are implemented is what is significant. Go (tinygo), being a different implementation, can call C functions as fast as C can.
> ...so it can too.
In my experience, the C FFI overhead in CPython is significantly higher than Go (gc). How are you managing to avoid it?
I think in the case of CPython it's just Python being slow to do anything. There are costs of the interpreter, the GIL, and conversion between Python's objects and low-level data representations, but the FFI boundary itself is just a trivial function call.
I would like to know this, too.
I wonder if they should be using something like libuv to handle this. Instead of flipping state back and forth, create a playground for the C code that looks more like what it expects.
What about languages like Java, or other popular languages with GC?
Java FFI is slow and cumbersome, even more so if you're using the fancy auto-async from recent versions. The JVM community has mostly bitten the bullet and rewritten the entire world in Java rather than using native libraries, you only see JNI calls for niche things like high performance linear algebra; IMO that was the right tradeoff but it's also often seen as e.g. the reason why Java GUIs on the desktop suck.
Other languages generally fall into either camp of having a C-like stack and thread model and easy FFI (e.g. Ruby, TCL, OCaml) and maybe having futures/async but not in an invisible/magic way, or having a radically different threading model at the cost of FFI being slow and painful (e.g. Erlang). JavaScript is kind of special in having C-like stack but being built around calling async functions from a global event loop, so it's technically the first but feels more like the second.
JNI is the second or maybe third FFI for Java. JRI existed before it and that was worse, including performance. The debugging and instrumentation interfaces have been rewritten more times.
https://docs.oracle.com/en/java/javase/24/docs/specs/jni/int... mentions JRI.
But it seems like JNI has been replaced by third party solutions multiple times as well.
https://developer.okta.com/blog/2022/04/08/state-of-ffi-java...
C# does marshal/unmarshal for you, with a certain amount of GC-pinning required for structures while the function is executing. It's pretty convenient, although not frictionless, and I wouldn't like to say how fast it is.
Similar enough to C I guess, at least in their stack layout.
It's explained in the article.
Go's threading model involves a lot of tiny (but growable) stacks and calling C functions almost immediately stack overflows.
Calling C safely is then slow because you have to allocate a larger stack, copy data around and mess with the GC.
> How much slower is it in comparison to other languages?
It's about the same as most other languages that aren't specifically optimized for C calling. Considerably faster than Python.
Which is funny, as everyone on HN loves to extol the virtues of Python being a "C DSL" and never think twice about its overhead, but as soon as the word Go is mentioned it's like your computer is going to catch fire if you even try.
Emotion-driven development is a bizarre world.
Yeah, that is why I am asking.
I've asked ChatGPT to summarize (granted, my prompt might not be ideal), but some points to note; here just the first one in detail, the others in the link at the bottom:

Calling C from Go (or vice versa) often requires switching from Go's lightweight goroutine model to a full OS thread model because:

- Go's scheduler manages goroutines on M:N threads, but C doesn't cooperate with Go's scheduler.

- If C code blocks (e.g., on I/O or mutex), Go must assume the worst and parks the thread, spawning another to keep Go alive.

* Cost: This means entering/exiting cgo is significantly more expensive than a normal Go call. There’s a syscall-like overhead.

This was only the first issue, but then it follows with "Go runtime can't see inside C to know is it allocating, blocking, spinning, etc.", then "Stack switching", "Thread Affinity and TLS", "Debug/Profiling support overhead", "Memory Ownership and GC barriers". All here: https://chatgpt.com/share/688172c3-9fa4-800a-9b8f-e1252b57d0...
Just to roll with your way: https://chatgpt.com/share/688177c9-ebc0-8011-88cc-9514d8e167...
Please do not take the numbers below at face value. I still expect an actual reply to my initial comment.
Seems like Go is still fast enough compared to other programming languages with GC, so I am not sure it is fair to single out Go. Java now has FFM, which is far better and simpler than JNI, FWIW. And ChatGPT says:

Per-call overhead:

    Language/API     | Call Overhead (no-op C)  | Notes
    Go (cgo)         | ~40–60 ns                | Stack switch + thread pinning
    Java FFM         | ~50 ns (downcall)        | Similar to JNI, can be ~30 ns with isTrivial()
    Java FFM (leaf)  | ~30–40 ns                | Optimized (isTrivial=true)
    JNI              | ~50–60 ns                | Slightly slower than FFM
    Rust (unsafe)    | ~5–20 ns                 | Near-zero overhead
    C# (P/Invoke)    | ~20–50 ns                | Depends on marshaling
    Python (cffi)    | 1000–10000 ns            | Orders of magnitude slower
i can't see how these numbers can be anywhere near correct (nor the ones above). in JavaScript on an old Core i5 the overhead of a simple ffi call is on the order of 5 nanoseconds. on a recent x64/arm64 cpu it's more like 2 nanoseconds.
you can verify this easily with Deno ffi which is pretty much optimal for JS runtimes. also, from everything i have seen and read, luajit should be even lower overhead than this.
you really shouldn't be asking chatgpt questions like this imo. these are facts, that need to be proven, not just vibes.
I agree. I was just following the parent's pattern, to make it work for me :)
Thanks, I added it to the list. Keep in mind that the numbers may be off (both yours and mine), so I would not take them at face value. It is interesting how in yours JNI is still pretty good. Also Rust is "~5–20 ns" in yours, so I assume "0" is the baseline.
This is chatgpt. Not my own benchmark. So it is probably hallucinating
> Rust (unsafe)
As if there is an alternative :)
More seriously, it’s “unsafe” from the perspective of the library calling into C, but usually “safe” for any layer above.
Hey, since I am in a thread where we are sharing what ChatGPT spits out, I just copy pasted it from there, too. :)
For what it is worth, I asked about LuaJIT after I have shared the link, and the numbers are now different for some languages, albeit not by much. Go (cgo) became ~50-100 ns. That said, I still believe it is unfair to single out Go when it does way better than some other GC languages.
oh wow I got downvoted a lot - I guess I'm bad at prompting :)
You are being downvoted because pasting AI output with no attempt at fact-checking is not bringing any real value to the discussion.
That is not it. Everyone who copy pastes output from an LLM gets downvoted (and likely flagged). Even though I simply went with your method, I got down-voted too, when down-voting the parent comment (yours) would have sufficed. Oh well.
Noted! Well maybe it makes sense! Thanks for the info!!!
I even got down-voted for telling you the truth. Lmao.
"ants_everywhere" is right.
Don't take it too personally.
The anti-LLM activists that patrol HN will often downvote comments just for stating facts that disagree with their hatred of LLMs.
> Every type contributes to a cost on the instruction cache, meaning that if your program parses a lot of different types, it will essentially flush your instruction cache any time you enter a parser. Worse still, if a parse involves enough types, the parser itself will hit instruction decoding throughput issues.
Interesting. This makes me wonder how nanopb would benchmark against the parsers shown in the graphs. nanopb's whole schtick is that it's pure C99 and it doesn't generate separate parsing functions for each message. nanopb is what a lot of embedded code uses, due to the small footprint.
Related ongoing thread:
Hyperpb: Faster dynamic Protobuf parsing - https://news.ycombinator.com/item?id=44661785
I see some discussion here about protobufs being widely used and preventing innovation because it's good enough and comes from Google.
I see a second effect that's probably more regressive: using protobufs (designed for RPC) as a database schema definition language or a disk data structure description language.
I’d much prefer something like TypeSpec to describe types, and then derive the RPC and disk schema languages from that programmatically.
My first attempt was to use the flatbuffer schema language as a starting point, but being attached to a serialization format can be a net negative. It did not find traction.
Motivation: I cannot stand field numbers in a database schema when other migration mechanisms exist.
I’d really love to see more work bringing the best parts of protobuf to a standardised serialization format like CBOR.
I’d make the same argument for moving from gRPC-Web to something like WHATWG streams and/or WebTransport.
There are a lot of really cool and important learnings in both, but they’re also tied up in weird tooling and assumptions. Let’s rebase on IETF and W3C standards.
It would be good to see support for encoding/decoding CBOR exposed as a browser API; browsers already use CBOR internally for WebAuthn, so I’d hope it’s not too hard.
You can easily do this. Protobuf supports pluggable writers, and iterating over a schema is pretty easy. We do it for the JSONB.
I'm not sure the purpose, though. Protobuf is great for its inflexible schema, and CBOR is great for its flexible data representation.
A separate CBOR schema would be a better fit, there's CDDL but it has no traction.
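To make the "iterate over the schema and plug in a different writer" idea a bit more concrete, here is a hedged Go sketch. It does not use protobuf's actual pluggable-writer machinery, just plain protoreflect plus the fxamacker/cbor library; the helper names and the field-name-keyed map are my own choices, and it only handles scalars and nested messages, skipping lists and maps:

```go
package protocbor

import (
	"github.com/fxamacker/cbor/v2"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protoreflect"
)

// messageToMap walks the populated fields of a protobuf message and
// builds a generic map keyed by field name. Scalars pass through,
// nested messages recurse; lists and maps are left out of this sketch.
func messageToMap(m protoreflect.Message) map[string]interface{} {
	out := make(map[string]interface{})
	m.Range(func(fd protoreflect.FieldDescriptor, v protoreflect.Value) bool {
		switch {
		case fd.IsList(), fd.IsMap():
			// Not handled here.
		case fd.Kind() == protoreflect.MessageKind:
			out[string(fd.Name())] = messageToMap(v.Message())
		default:
			out[string(fd.Name())] = v.Interface()
		}
		return true
	})
	return out
}

// MarshalCBOR re-encodes any protobuf message as CBOR.
func MarshalCBOR(msg proto.Message) ([]byte, error) {
	return cbor.Marshal(messageToMap(msg.ProtoReflect()))
}
```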
Does hyperpb compile PB types into something like a byte-compiled or AST description of them that it then interprets at run-time?
EDIT: Yes https://github.com/bufbuild/hyperpb-go/blob/main/internal/td...
This is very cool. The ASN.1 compiler I [sort of] maintain has an option to generate C code or a "template", and the "template" is just an AST that gets interpreted at run-time. I did not come up with that idea, but the people who did chose it because AST interpretation ends up faster than generating lots of object code: the templates and the template interpreter are smaller than the object code for the alternative, and cache effects add up.
JITting bytecode VMs really seem fantastic at this. A similar trick works in Avro in Python (https://journal.spencerwnelson.com/entries/avro.html).
I suppose the insight is that schema documents are programs, just in a weird language, and compiled programs are fast.
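To make the "schemas are programs" idea concrete, here is a hedged Go sketch of the general table-driven technique (nothing like hyperpb's actual internal representation, and all names are invented): the schema is "compiled" once into a flat table of per-field instructions, and a single shared loop interprets that table against the wire format, so more message types mean more table data rather than more machine code:

```go
package tdp

import (
	"fmt"

	"google.golang.org/protobuf/encoding/protowire"
)

// fieldInstr is one "instruction" in the compiled schema: which wire
// type to expect and where (by name) to store the decoded value.
type fieldInstr struct {
	name string
	typ  protowire.Type
}

// compiledSchema maps field numbers to instructions. Building it is
// the "compile" step, done once per message type.
type compiledSchema map[protowire.Number]fieldInstr

// interpret is the single generic decode loop shared by every message
// type: it walks the wire format and dispatches on the table.
func interpret(schema compiledSchema, data []byte) (map[string]interface{}, error) {
	out := make(map[string]interface{})
	for len(data) > 0 {
		num, typ, n := protowire.ConsumeTag(data)
		if n < 0 {
			return nil, protowire.ParseError(n)
		}
		data = data[n:]

		instr, known := schema[num]
		if !known || instr.typ != typ {
			// Unknown or mismatched field: skip it, like any proto parser would.
			n = protowire.ConsumeFieldValue(num, typ, data)
			if n < 0 {
				return nil, protowire.ParseError(n)
			}
			data = data[n:]
			continue
		}

		switch typ {
		case protowire.VarintType:
			v, m := protowire.ConsumeVarint(data)
			if m < 0 {
				return nil, protowire.ParseError(m)
			}
			out[instr.name] = v
			data = data[m:]
		case protowire.BytesType:
			v, m := protowire.ConsumeBytes(data)
			if m < 0 {
				return nil, protowire.ParseError(m)
			}
			out[instr.name] = string(v)
			data = data[m:]
		default:
			return nil, fmt.Errorf("wire type %v not handled in this sketch", typ)
		}
	}
	return out, nil
}
```

Usage would look something like `interpret(compiledSchema{1: {name: "id", typ: protowire.VarintType}, 2: {name: "name", typ: protowire.BytesType}}, wire)`, where `wire` is a serialized message; the real thing is of course far more elaborate, but the shape of the trick is the same.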
Last time I benchmarked msgpack and protobuf against each other, the difference was nearly flat for my use case: JSON was 2-3x slower, but msgpack and protobuf were near equal. Might be different now after this release, exciting :)
Keep in mind that some of the performance bottlenecks the author is talking about and optimizing for show up in large-scale uses and possibly not in benchmarks. In particular, the "parsing too many different types blows away your instruction cache" issue will only show up if you are actually parsing lots of types, otherwise UPB is not necessary.
> hyperpb is a brand new library, written in the most cursed Go imaginable
This made me LOL.
https://github.com/bufbuild/hyperpb-go
OK reading this, I have imposter syndrome again.
Where are the benchmarks comparing hyperpb to other proto parsers?
Fantastic post.
Please do one on your analysis and optimization workflow and tooling!
Maybe if performance is really an issue, protobufs shouldn't be used?
There are also flatbuffers, capnp, and pure C++ serializers (zpp, yas...), but protobufs are definitely convenient!
If execution performance is really that much of an issue, you're not going to choose Go anyway, rendering this project moot.
What is a real issue is marketing performance. As it turns out, everyone loves a story about something going fast. So it pays to create these kinds of optimizations for the sake of the story, used to sell the product to the developer who makes choices based on what they think feels good.
If performance isn’t an issue, why not JSON?
Protobuf tooling isn't great, but JSON tooling is far worse. Developer experience matters too.
(Technically protobuf supports JSON encoding, but it's additional work to enable, again diminishing the developer experience.)
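For what it's worth, on the Go side the JSON mapping is pretty close to a one-liner via the protojson package; whether that counts as "additional work" probably depends on your stack. A minimal sketch, using a well-known type so it stays self-contained:

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/timestamppb"
)

func main() {
	// Any generated message works here; a well-known type keeps the
	// example self-contained.
	msg := timestamppb.Now()

	// Proto -> JSON using the canonical proto3 JSON mapping.
	data, err := protojson.Marshal(msg)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data)) // e.g. "2025-01-01T12:00:00Z"

	// JSON -> proto.
	var back timestamppb.Timestamp
	if err := protojson.Unmarshal(data, &back); err != nil {
		panic(err)
	}
}
```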
Performance is always an issue when the alternative is 2+ orders of magnitude slower.
Yeah, I wish Go binary encoding/decoding were faster and leaner, maybe even with zero memory copies.
I would love to benchmark it against the old gogo protobuf. One thing that really slows down PB is its insistence on using pointers everywhere. Gogo protobuf used value types where possible, significantly reducing the GC pressure.
It feels like arenas are just an attempt to bring that back.
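The GC-pressure point is easy to picture with a sketch. These types are invented for illustration (not actual generated code from protobuf-go or gogo), but they show the shape difference: the pointer-heavy layout means several heap objects per message for the GC to trace, while the value layout is one contiguous allocation, which is roughly the effect gogo's nullable=false option had and part of what arena allocation tries to recover:

```go
package shapes

// Pointer-heavy shape, similar in spirit to what protobuf-go generates
// for message-typed and optional fields: every nested field is a
// separate heap object the GC has to trace.
type EventPtr struct {
	Header *HeaderPtr
	Body   *BodyPtr
}

type HeaderPtr struct {
	ID        *uint64
	Timestamp *int64
}

type BodyPtr struct {
	Payload []byte
}

// Value-type shape, similar in spirit to gogo's nullable=false output:
// one EventVal is one contiguous allocation, so building millions of
// them produces far fewer objects for the GC to track.
type EventVal struct {
	Header HeaderVal
	Body   BodyVal
}

type HeaderVal struct {
	ID        uint64
	Timestamp int64
}

type BodyVal struct {
	Payload []byte
}
```

A `testing.B` benchmark that constructs each shape with `b.ReportAllocs()` turned on makes the difference visible directly in the allocs/op column.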
Not enough programmers talk to QA, and so are oblivious to testing.
The benchmarking here is... pointless. Here's my story (of writing a Protobuf parser) to illustrate why. I already wrote about it before, but it keeps happening, so, maybe repeating it is not such a bad thing.
So... the official (coming from Google) Protobuf parser for Python is ridiculously bad. It generates Python code for the parser. Not only is this unnecessary, it's also executed poorly. So, I decided to roll my own. Also, I decided to make it multi-threaded (I wrote it in C, with Python bindings).
Then I tried to benchmark it against the official implementation. I don't trust the quality of the Python language implementation, so that came under the spotlight first. For example, I chose Python's Enum type to represent Protobuf enums. That turned out to be a horrible decision, because instantiating Python enums is extremely resource-intensive. I had to make a few similar adjustments, searching for less resource-hungry Python built-in data structures.
And yet I was still far behind on my benchmarks. So, I tried to understand what the Google's C++ code was doing. And then I realized that Google's parser in the "first pass" only extracted the top-level messages, not even trying to parse the hierarchical structure. This structure was only extracted on-demand. So, essentially, I was competing against memcpy()...
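That "competing against memcpy()" behaviour is easy to sketch. Here is a hedged Go version of the general idea (not Google's actual C++ code; the names are mine): one cheap pass records where each top-level field's bytes live, and the structure-building parse only happens later, per field, if something actually reads it:

```go
package lazyscan

import "google.golang.org/protobuf/encoding/protowire"

// rawField records where a field's bytes live without interpreting
// them; sub-messages stay unparsed until requested.
type rawField struct {
	typ protowire.Type
	raw []byte // raw wire bytes of the value (length prefix included), undecoded
}

// scanTopLevel does the cheap first pass: one walk over the buffer,
// slicing out each top-level field's bytes (no copies, no recursion).
func scanTopLevel(data []byte) (map[protowire.Number][]rawField, error) {
	fields := make(map[protowire.Number][]rawField)
	for len(data) > 0 {
		num, typ, n := protowire.ConsumeTag(data)
		if n < 0 {
			return nil, protowire.ParseError(n)
		}
		data = data[n:]

		// ConsumeFieldValue tells us how long the value is without
		// building any structure for it.
		n = protowire.ConsumeFieldValue(num, typ, data)
		if n < 0 {
			return nil, protowire.ParseError(n)
		}
		fields[num] = append(fields[num], rawField{typ: typ, raw: data[:n]})
		data = data[n:]
	}
	return fields, nil
}
```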
But here's another catch: most real-world applications will not want the "intermediate" Protobuf structure made up of general-purpose data structures such as dictionaries or enums. They will want it parsed immediately into domain-specific data structures. So a parser that offers this kind of functionality might be slower at parsing into generic data structures, but will win in real-world benchmarks.
Of course, the percentage of useful payload also matters. Typically, applications underutilize the messages they exchange, but to what degree differs and depends entirely on application design. Similarly, mapping into domain-specific objects will depend on the sort of information the application wants to exchange. Protobuf is bad for columnar/tabular data but somewhat better for object-graph data; it is bad for large messages with duplicated content but somewhat better for short messages with little duplication.
All of this, and much more, allows for a lot of variability in benchmark design, and would let me, or anyone writing a parser, skew the numbers in whatever direction they want... to amuse the general public.