Abusing C to implement JSON parsing with struct methods

118 points by ingve 4 months ago

_0ffh 4 months ago

I am sorry, I don't quite understand where the "abuse" comes in. Attaching function pointers to structs? I've done that in production twenty years ago (and others have probably done it in the 70s) and I see nothing wrong with it. C gives us function pointers for a reason, and it's not to not use them... to me, at least, they're actually one of the premier features of C! (GCC's goto *ptr on the other hand, now there's some potential for getting a little bit wild...)

uecker 4 months ago

I assume it is because nowadays there are so many people running around claiming you need C++ or Rust or other more complicated languages to build abstractions, that people do not understand anymore that you can do this in C just fine.
- usrnm 4 months ago
  
  I don't think I've ever seen a large C codebase that wasn't using structs with methods. It isn't by any measure a secret or lost art.
  - uecker 4 months ago
    
    And yet, here we are reading about it in the news.
- userbinator 4 months ago
  
  The earliest C++ compilers compiled to C, and I believe that's still possible to do even for the latest versions of the C++ standard.
  - nextaccountic 4 months ago
    
    It's always possible to compile a turing complete language into another turing complete language, so this statement doesn't carry much meaning.
    The interesting thing with the original C++ (then called "C with classes") compiler, Cfront, which compiled to C, was that it was a 1:1 transpiler that offered only local transformations with no global analysis (you can transform one function at a time without looking at the rest of the code). The structure of the generated C code closely matched the C++ source code. This is something that's not always possible when compiling a language into another; it only happens when a language offers essentially just syntactic sugar over another, like CoffeeScript and JavaScript.
    I guess that if you compile exceptions using setjmp/longjmp and expand all templates (I don't think you can compile templates into C macros), and also expand method overloading and type inference, then all other features from modern C++ are only local transformations. Well, maybe with the exception of coroutines from C++20 (which maybe can be compiled into setjmp/longjmp as well, if it can interact well with exceptions? But the purpose of coroutines is to compile into a state machine, which is a non-local transformation).
    The original Cfront compiler didn't even implement exceptions (they wanted to, but according to Wikipedia the code was a mess [0], so it was abandoned), but another, later C++ compiler [1] compiled into C and implemented exceptions and other things.
    [0] https://en.wikipedia.org/wiki/Cfront
    [1] https://en.wikipedia.org/wiki/Comeau_C/C%2B%2B
alberth 4 months ago

People sometimes use structs when a class has only public members.
And I believe the STL is also implemented that way as well.
fsckboy 4 months ago

>I am sorry, I don't quite understand where the "abuse" comes in. Attaching function pointers to structs?
i had the same reaction, then i realized
(and yes, i know the following opinion is a minority oldschool boomer opinion, no need to downvote me to hell to make sure i know)
JSON itself is the abuse of C/unix, like XML before it.

hyperhello 4 months ago

Great site, but the JSON looks like it's embedded in C as values but is really just a macro that expands it into a string and then parses it. So you can't put any local variables into the JSON.

In my opinion, C really needs some officially supported reflection for the names of enums and such things. People have reinvented the wheel so many times and it never quite gets there.

immibis 4 months ago

That would be a C++ feature. C is very "batteries not included". In particular, it won't generate entire functions or data blocks that you didn't write or include from a library.
xnacly 4 months ago

Hi, author here, I just added the JSON macro to omit the quotation escaping around object keys and strings:
without JSON macro:
char* j = "{ \"key\": [\"value1\", \"value2\"]}";
with JSON macro:
char *j = JSON({ "key": ["value1", "value2"]});
But yes of course, due to the stringify no C values can be embedded - i wouldnt even know how one would solve this with a macro, but maybe someone has an idea and can comment on that.
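For readers wondering what such a macro might look like, here is a minimal sketch. The definition below is an assumption about its shape, not the article's actual code, and `example_document` is a hypothetical helper. It shows why no C values can be embedded: the `#` operator turns the argument tokens into text during preprocessing, so an identifier such as a local variable name would simply appear as its spelling in the string.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical definition: stringify everything passed to the macro.
 * It has to be variadic because commas inside [] or {} are not protected
 * by parentheses and would otherwise split the argument list. */
#define JSON(...) #__VA_ARGS__

/* Returns the stringified document. String literals inside the argument
 * are escaped automatically by the # operator. */
static const char *example_document(void) {
    const char *doc = JSON({ "key": ["value1", "value2"] });
    assert(strlen(doc) > 0);
    return doc;
}
```

Embedding runtime values would need a different approach, for example a printf-style builder, since the preprocessor never sees variable contents.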
DeathArrow 4 months ago

>In my opinion, C really needs some officially supported reflection for the names of enums and such things.
C is good as it is and someone has the chance to hold all its features in his memory. If you need or want to complicate things, there's C++ or Rust.
- pjmlp 4 months ago
  
  I very much doubt that as any pub quiz on C will validate.
  People think they know C, but in reality they have never read the ISO C document and aren't aware of the differences between the ISO C standard library and POSIX, of how each compiler handles the ill formed no diagnostic required, implementation defined and UB parts of the standard, or, to top it off, of compiler specific extensions.
  - virgilp 4 months ago
    
    You do not need to know how the compiler handles the UB parts of the standard; in fact you should NEVER rely on it, as it can change without warning. That's effectively "incorrect code", by definition.
    
    pjmlp 4 months ago
    
    And yet that is what many folks happen to do, because "performance all the things!".
    
    uecker 4 months ago
    
    They will also do this in C++ or Rust. The only difference is that in Rust they will pretend it is ok because they wrapped it in "unsafe", so if they mess up it is nobody's fault because it cannot possibly be a problem with Rust or the Rust programmer, so it was unavoidable fate.
    
    pjmlp 4 months ago
    
    Like in any language that has adopted unsafe concept since ESPOL/NEWP did it in 1961, at least Rust has the advantage we can find those code blocks without the help of a static analysis tool.
    Nowadays C++ has inherited the same mentality as many performance minded C developers, which is a bit sad, given that during the 1990's it felt we had a better security first mentality, especially with the vendor specific frameworks that were shipped alongside the compilers.
    Regarding static analysis tooling it is kind of sad that many developers still think they know better than the language authors themselves, as per Dennis' own words,
    > Although the first edition of K&R described most of the rules that brought C's type structure to its present form, many programs written in the older, more relaxed style persisted, and so did compilers that tolerated it. To encourage people to pay more attention to the official language rules, to detect legal but suspicious constructions, and to help find interface mismatches undetectable with simple mechanisms for separate compilation, Steve Johnson adapted his pcc compiler to produce lint [Johnson 79b], which scanned a set of files and remarked on dubious constructions.
    -- http://cm.bell-labs.co/who/dmr/chist.html
    And to come back to the original point, the worst part regardless of the language is that most tricks are done due to cargo cult and hearsay, without reaching to a profiler a single time.
    
    chlorion 4 months ago
    
    What are you even talking about?
    I've never seen anyone claim that doing UB in Rust is okay because it's wrapped in an unsafe block?
    If anything I've seen the exact opposite of this: cases of UB in libraries are considered a bug almost always, vs in C or C++ where "it's the user's fault for doing the thing wrong".
    I notice you spreading an awful lot of bs about rust lately, not sure what the deal is but it's pretty childish and lame, not to mention objectively wrong.
    
    uecker 4 months ago
    
    Programmers will generally not put UB into code. They will create unsafe code that risks having UB to get optimal speed. This is also what Rust programmers often use unsafe for from the code I have been looking at.
  - uecker 4 months ago
    
    Some even don't know that there is no such thing as "ill formed no diagnostic required" in C.

boricj 4 months ago

I'm working on a C++ library at work that lets you define a data model (a tree of objects, arrays and values) and run visitors through it. The data model is both effectively a JSON schema in source code form and a binding to the underlying data/application through callbacks and whatnots.

I can make it ingest or emit whatever data format I want by writing a deserializer or serializer visitor for it (either directly or through a library), I can perform various pipeline-like transformations by chaining visitors together. The core is header-only, templated, constexpr and suitable for usage on resource-constrained systems.

I know that there are equivalents in managed or interpreted languages (like Java's Jackson library), but I haven't managed to find anything quite like it elsewhere on the Internet for compiled, unmanaged languages. Maybe I haven't looked hard enough or I just don't know what to search for, which is too bad because it handily beats writing serialization/deserialization code by hand.

nly 4 months ago

Here's something like that I wrote in C++ in 2017 for JSON using Boost.Fusion + simple function overloading of a single 'from_json' function for handling different types. It works for nested objects, it's runtime type checked, and all numeric conversions are checked for loss while being forgiving.
https://gist.github.com/nlyan/045fbe075b4e51d83be0cf4513fecd...
The DEFINE_JSON macro is a tiny wrapper around BOOST_FUSION_DEFINE_STRUCT so the whole parsing routine effectively gets unrolled at compile time.
https://www.boost.org/doc/libs/1_87_0/libs/fusion/doc/html/f...
The code predates broad availability of std::optional (so uses boost::optional), [[unlikely]] and the existence of Boost.JSON, so if i were using this technique today that's what I'd use, but at the time I used the taocpp JSON library (which is still actively maintained 8 years later)
https://github.com/taocpp/json
Here's an article talking about the technique from a CppCon 2014 talk - "Implementing Wire Protocols with Boost Fusion --Thomas Rodgers"
https://isocpp.org/blog/2015/01/cppcon-2014-implementing-wir...
- boricj 4 months ago
  
  It's fairly different than the approach inside the CppCon 2014 talk because the data model itself is not actually strongly tied to the underlying types of the data.
  That lack of strong integration does result in a bit more boilerplate code, but it allows more flexibility. In principle, the data model and the actual data can be quite different as long as you can bridge the two together with the boilerplate code. It's more like projecting a JSON schema over a bunch of various getters and setters and running visitors through them, with some helpers to automatically grok STL-like containers.
  - nly 4 months ago
    
    Yeah, an extra layer of type indirection for expressing various implicit encodings is useful for e.g. dates
    My point overall is that there is tonnes of prior art for compile time reflection libraries in c++ for this kind of thing. Boost has MPL, Fusion and Hana for starters. All of these have visitor patterns and algorithms out of the box.
    You originally said you hadn't found anything quite like it for unmanaged languages...but people have been doing this for decades at this point.
    In general I'm no longer a fan of the "encode the schema directly in C++" approach if the data i'm interfacing with actually has a schema.
danhau 4 months ago

What slightly annoys me about JSON is that the order of object properties is defined to be irrelevant.
{ "hello": 123, "world": "foo" }
is the same as
{ "world": "foo", "hello": 123 }
If these two were semantically different, writing a deserializer would be easier and more efficient, since you can simply expect the next tokens to represent the currently visited class member, or error.
Otherwise you need to construct an object tree and look up its properties by name / hash.
- nly 4 months ago
  
  Apache Avro's C++ deserializer for the JSON serialization used to expect exactly that: keys to be in schema order.
  Trust me you don't want this. Usually the reason you want to use JSON in the first place is you want to support third party data access.
  Even if you constrained the field order you'd still have to deal with absent fields.
- inbx0 4 months ago
  
  Does the JSON spec actually say that those objects should be "equal", or does it just leave that detail to implementations?
  In JavaScript at least, those two are not exactly "the same", in the sense that you can observe the difference if you want to. If you parse those JSON strings and then iterate the keys (e.g. with Object.keys), the ordering will be different.
  - reichstein 4 months ago
    
    The JSON spec only defines the JSON text format. It doesn't say what the text means. There are obvious interpretations, but every program that reads or writes JSON can decide what it does with it.
    On the other hand, the thing that makes JSON actually useful is the interoperability: JSON written by one program, on one platform, can be read by another program on another platform. Those programs have to agree on a protocol, what the JSON text must satisfy and what it means. It's usually not considered valuable to require object properties to be in a specific order, so they don't. But they could.
- jasonthorsness 4 months ago
  
  Notably this is one of the main differences between JSON and the MongoDB BSON format. Though some client libraries just treat it as JSON and don’t preserve the order.
lmm 4 months ago

> I know that there are equivalents in managed or interpreted languages (like Java's Jackson library), but I haven't managed to find anything quite like it elsewhere on the Internet for compiled, unmanaged languages.
Like Rust's Frunk? Or like what Zig or D let you do with "compile time reflection"?
- boricj 4 months ago
  
  It's definitely different than Frunk, the library is not a general-purpose functional toolkit. One could certainly implement it with Zig's compile-time reflection with ease (don't know much about D). Actually, it's superficially similar to refl-cpp's serialization example [1], but with far less templating magic underneath due to the restricted scope.
  [1] https://github.com/veselink1/refl-cpp/blob/master/examples/e...
unrealhoang 4 months ago

Isn’t rust’s serde what you are looking for?
https://serde.rs/data-model.html

teo_zero 4 months ago

Analyzing the json_value struct, I don't get why values, object_keys and length are fields outside the union. I expected something like:

  struct json_value {
    enum json_type type;
    union {
      bool boolean;
      char *string;
      double number;
      struct {
        char **keys;
        struct json_value *values;
        size_t length;
      } object;
    } value;
  };

Of course I would also use anonymous structs and unions to simplify

  json->value.object.length

to

  json->length
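A minimal sketch of that anonymous-member layout, assuming a C11 or later compiler. The enumerator names and the `json_member_count` helper are hypothetical and only illustrate the shorter member access; they are not taken from the article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type tags for illustration. */
enum json_type { JSON_BOOLEAN, JSON_STRING, JSON_NUMBER, JSON_OBJECT };

struct json_value {
    enum json_type type;
    union {                 /* anonymous union (C11): members are accessed directly */
        bool boolean;
        char *string;
        double number;
        struct {            /* anonymous struct (C11) */
            char **keys;
            struct json_value *values;
            size_t length;
        };
    };
};

/* Returns the number of members of an object, or 0 for any other type. */
static size_t json_member_count(const struct json_value *json) {
    assert(json != NULL);
    return json->type == JSON_OBJECT ? json->length : 0;
}
```

The trade-off is that all member names now share one namespace inside `struct json_value`, so names such as `length` cannot be reused by another variant.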

pjmlp 4 months ago

Also known as how to manually do the work C++ compilers do for you.

tempodox 4 months ago

Manually doing the work others have automated still helps you understand.
- pjmlp 4 months ago
  
  There is a difference between understanding and coding a full application like it's the 1990s.

uecker 4 months ago

I haven't looked at this in detail, but I do not see an "abuse" here. This is just regular C code.

flohofwoe 4 months ago

It looks a lot like the kind of 'object oriented C' that was all the rage in the 90s though ;)
- curt15 4 months ago
  
  GObject still works pretty much this way.

anonymousiam 4 months ago

I've used this trick (unions within structs) for decades. You can parse very quickly, but your code will not necessarily be portable. It's a perfect solution if your application will only be run on the hardware you're developing on.

cantrecallmypwd 4 months ago

Temporary, in-memory representation of an AST generally doesn't need to be serialized or passed outside the process in this use case.
atiedebee 4 months ago

Why are unions within structs not portable?
- cantrecallmypwd 4 months ago
  
  Undefined padding and alignment.
  - atiedebee 4 months ago
    
    That doesn't really matter for sum types as used in this blog, does it? As long as you're not serialising the struct or accessing it through pointers of different types it ought to work anywhere.
  - Keyframe 4 months ago
    
    you can do explicit padding and you can force alignment; latter might be compiler specific until C11 at least. You can always check for the struct layout as an assert.. it's doable.
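As a rough illustration of explicit padding plus a compile-time layout check, assuming C11 (`static_assert` from `<assert.h>`, `alignof` from `<stdalign.h>`): the `wire_header` struct below is a made-up example, not something from the article.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct whose layout should be identical on every target. */
struct wire_header {
    uint8_t  tag;
    uint8_t  pad[3];   /* explicit padding instead of compiler-chosen padding */
    uint32_t length;
};

/* The build fails on any platform where the layout differs from the intent. */
static_assert(offsetof(struct wire_header, length) == 4, "length must sit at offset 4");
static_assert(sizeof(struct wire_header) == 8, "unexpected trailing padding");
static_assert(alignof(struct wire_header) >= alignof(uint32_t), "alignment too weak");
```

This only pins down size, offsets and alignment; byte order is a separate concern if the bytes ever leave the process.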

flohofwoe 4 months ago

Would be good to get some words about performance. Looking at the code there are a lot of granular memory allocations, which IME with other JSON libraries which work like this is the number one performance killer and can add up to seconds when parsing JSON files with hundreds of thousands of nodes.

Always going through a 'virtual method table' also can't be great for performance (not so much because of the indirection, but because it acts as an optimization barrier for the compiler).

Performance doesn't matter much of course when the code is only ever used to load tiny JSON files.

xnacly 4 months ago

Hi, author here, definitely, allocating elements while parsing is slow, especially calling realloc for each new element encountered, a growing array backed by an arena would be my go to if i had to overengineer this one.
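A small sketch of the "growing array" half of that idea, under the assumption that plain `realloc` with capacity doubling is acceptable. It swaps the arena backing mentioned above for `realloc` to stay short. `double_vec` and `double_vec_push` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable array of doubles. */
struct double_vec {
    double *items;
    size_t  len;
    size_t  cap;
};

/* Appends one element, doubling the capacity when the array is full, so
 * n pushes trigger O(log n) reallocations instead of n.
 * Returns 0 on success, -1 if the allocation failed. */
static int double_vec_push(struct double_vec *v, double x) {
    assert(v != NULL);
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 8;
        double *p = realloc(v->items, new_cap * sizeof *p);
        if (p == NULL) return -1;   /* the old buffer is still valid */
        v->items = p;
        v->cap = new_cap;
    }
    v->items[v->len++] = x;
    return 0;
}
```

Starting from a zero-initialized `struct double_vec`, the caller pushes elements and eventually calls `free(v.items)` once.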

variadix 4 months ago

Unless I’m missing something, I don’t really see the point of putting the functions in the struct instance since it doesn’t seem like you need to use any form of polymorphism/function overriding.

If you have normal, non-virtual instance methods you are better off making it a normal function with some class centric naming pattern, e.g. JSON_Parse(JSON* self, const char* json).

If you want to emulate virtual functions from C++ you are better off storing a single pointer to a statically allocated vtable of function pointers (1 instance per class), that way your instance structs are smaller at the cost of an extra indirection per virtual call. You could also go the Rust route and use a large pointer type (pointer to object + pointer to vtable) for polymorphic objects that avoids a longer dependency chain at the cost of larger object pointers.
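The single-vtable-pointer layout described above could be sketched like this. The `shape` types are a made-up example chosen only because two implementations make the dispatch visible; nothing here comes from the article's code.

```c
#include <assert.h>
#include <stddef.h>

struct shape;

/* One table of function pointers per "class", shared by all instances. */
struct shape_vtable {
    double (*area)(const struct shape *self);
};

struct shape {
    const struct shape_vtable *vt;  /* a single pointer per instance */
    double w, h;
};

static double rect_area(const struct shape *s) { return s->w * s->h; }
static double tri_area(const struct shape *s)  { return 0.5 * s->w * s->h; }

/* Statically allocated vtables, one instance per class. */
static const struct shape_vtable RECT_VT = { rect_area };
static const struct shape_vtable TRI_VT  = { tri_area };

/* Virtual call: one extra indirection through the vtable. */
static double shape_area(const struct shape *s) {
    assert(s != NULL && s->vt != NULL);
    return s->vt->area(s);
}
```

An instance is created as `struct shape r = { &RECT_VT, 2.0, 3.0 };` and stays the same size no matter how many methods the vtable gains.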

userbinator 4 months ago

Once we hit the end we allocate a temporary string, copy the chars containing the number from the input string and terminate the string with \0. strtod is used to convert this string to a double.

No need to allocate, just use strtod on the original string.
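A minimal sketch of that suggestion, assuming the input is a null-terminated string. `parse_number` is a hypothetical helper; `strtod` reports where it stopped through its second argument, so no temporary copy is needed. Note that `strtod` also accepts forms JSON does not allow (leading whitespace, a leading `+`, hex floats, `inf`, `nan`) and is locale dependent, so a strict parser would still validate the span.

```c
#include <assert.h>
#include <stdlib.h>

/* Parses a number starting at *cursor and advances *cursor past it.
 * Returns 0 on success, -1 if no conversion could be performed. */
static int parse_number(const char **cursor, double *out) {
    assert(cursor != NULL && *cursor != NULL && out != NULL);
    char *end;
    double value = strtod(*cursor, &end);
    if (end == *cursor) return -1;  /* nothing was consumed */
    *out = value;
    *cursor = end;                  /* continue parsing right after the number */
    return 0;
}
```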

osmsucks 4 months ago

I'm not really seeing what's noteworthy about this article.

DeathArrow 4 months ago

>Instead of using by itself functions: attach functions to a struct and use these as methods

He could evolve this further: use SOLID and design patterns.

If the goal is to bastardize C experience, why stop at objects?

pwdisswordfishz 4 months ago

> CFLAGS := -std=c23

CFLAGS is for optional flags that don't change semantics or whether (correct) code compiles at all…

MathMonkeyMan 4 months ago

CFLAGS are flags for the C compiler. It could totally change the interpretation of the semantics. C89 != C23.

flykespice 4 months ago

Meta, but I really appreciate how minimal, static, yet stylish the website design is, especially the code snippets.

bt1a 4 months ago

looks great on mobile, too. it's exceptional at organizing the content's contrast ratios and spacing, my only nitpick is slightly too much color
xnacly 4 months ago

Thanks, i tried to go for a minimalist cyberpunk inspired vibe - thus far i like it :)
- oguz-ismail 4 months ago
  
  What's with the lowercase I's though?
  - johnisgood 4 months ago
    
    The first letter in a couple of sentences is also lowercase, so probably typos.
    In some places he uses "I" (correctly), in others he does not, so it is inconsistent, and most likely just typos.
  - xnacly 4 months ago
    
    I am not a native english speaker so I often forget to uppercase the 'I's

jbirer 4 months ago

This is giving me GObject and Gdk flashbacks. If anyone wants to traumatize themselves, compile GTK2 and work with the aforementioned.

That being said, it's still a very clever way to implement this.

robinsonb5 4 months ago

I'm seeing reasonably clean use of function pointers without the usual tangle of opaque macros. This kind of pattern easily turns into a mess if you get too clever with it, but I quite like what I'm seeing here.
(I suspect autotools will have traumatised anyone long before they manage to get GTK2 to compile on a modern system!)

tempodox 4 months ago

But how do you know whether a JSON string is a pointer to static data (like in the example with the `main` function) or was dynamically allocated? The same goes for arrays and objects.

threeducks 4 months ago

You could tag dynamically allocated objects with some magic bytes. To then check whether a string is dynamically allocated, you can simply compare the first few bytes against the magic bytes. This is a probabilistic method, but if you use enough magic bytes, it becomes basically impossible to go wrong.
See e.g. https://github.com/99991/dynamic/blob/a423a04061ee44bad0720f... for an example (incidentally also a C JSON parser, but with automatic garbage collection).
xnacly 4 months ago

Hi, author here, i use the 'type' field of the json_value struct, strings, object keys, object values and array members are allocated, i use this info for freeing the memory
- UncleEntity 4 months ago
  
  So, umm, abusing the C switch statement to implement polymorphism?
  I kid, I kid...
  I am curious about how much this really saves over just using C++ classes and something like the curiously recurring template pattern while turning off runtime type checking.

McUsr 4 months ago

I look forward to the next blog post implementing the actual parser.

montyanderson 4 months ago

years ago i made a lib that tries to avoid the duplication in e.g. json->is_eof(json)

https://github.com/montyanderson/foop

jeffrallen 4 months ago

Please don't write new code in nonsafe languages.

ingen0s 4 months ago

this is still the best read of the year for me

eska 4 months ago

I don't think the use of structs to implement vtables in C is abuse, but in this case it's rather pointless. This idiom is used to have dynamic dispatch (e.g. a plugin system), an abstraction (e.g. same interface for file I/O and memory I/O in unit tests), or for ABI-stable vtables in libraries and C++ wrappers.

Instead of `json_parse(&parser_state)` your code calls `parser_state.parse(&parser_state)` in main. Instead of `json_eof(parser_state)` your code calls `parser_state->eof(parser_state)` in the implementation. This is more to type, more irregular (switching from . to ->), and has worse performance. This interface is also not flexible enough to implement a parser for other file formats such as XML. So it doesn't make things easier, and you're not actually making use of dynamic dispatch, so I don't see the point in using this technique.

In general I can not recommend this article for learning about C. I am being lenient about the things the author mentions explicitly, although I find even that mostly hand-waving away too much even for a prototype or short article. For example I have to say it is a bit of a head scratcher for me to hear that JSON is difficult to parse, since the ease of parsing is why JSON won to begin with. The linked article about JSON being a minefield is about underdefined semantics, not parser complexity.

The biggest mistake, and this is sadly hand-waved away as "over-engineering", is the large amount of mallocs and frees, and necessity to track stack vs heap allocations. In general this style of C reminds me of what my CS professors taught us before they showed us Java GC and C++ RAII as our saviors. C has its issues, but this is far from professional grade C code, and is what C looks like when written by programmers who come from higher level languages with less control over memory allocation (or STL-using C++ programmers where RAII mixes allocation and initialization). A tiny bump allocator would simplify all the following code enormously. For example the freeing function is very complex and can be avoided completely. Whether memory is managed by the stack or heap doesn't need to be tracked anymore. It becomes possible to write code that never allocates at all. Consider this article for comparison: https://www.rfleury.com/p/untangling-lifetimes-the-arena-all...
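For readers who have not seen one, a tiny bump allocator can be sketched in a few lines. This is a simplified illustration with hypothetical names (`arena`, `arena_alloc`, `arena_reset`), assuming the caller supplies a buffer that is itself suitably aligned; it is not the design from the linked article.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-capacity arena over a caller-provided buffer. */
struct arena {
    unsigned char *base;  /* assumed to be aligned for max_align_t */
    size_t cap;
    size_t used;
};

/* Returns size bytes aligned for any object type, or NULL when exhausted. */
static void *arena_alloc(struct arena *a, size_t size) {
    assert(a != NULL);
    size_t align = _Alignof(max_align_t);
    size_t start = (a->used + align - 1) / align * align;  /* round up */
    if (start > a->cap || size > a->cap - start) return NULL;
    a->used = start + size;
    return a->base + start;
}

/* Releases every allocation at once; no per-object free is needed. */
static void arena_reset(struct arena *a) {
    assert(a != NULL);
    a->used = 0;
}
```

With this shape, all nodes of one parsed document live in one arena and are released by a single `arena_reset`, regardless of whether the buffer sits on the stack, in static storage, or on the heap.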

The json struct mixes 2 responsibilities: providing data and parsing it. If we assume as in the article that all data is available from the start, then there's a lot of complexity in the parser that could've been avoided, since the parser is written as if it doesn't know where the EOF is and doesn't handle that error case correctly (especially in non-DEBUG build where ASSERT would be disabled).

Returning EOF in-band in cur is bad practice also in C. I would avoid this whole function by providing higher level parsing functions like peek and expect which would provide or check for certain tokens (this is for non-optimized easy to write parsers). The check for null/false/true atoms should've been a code smell that this should be easier to implement somehow.
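One possible shape for such helpers, as a sketch only: the `parser` struct and the `parser_peek` / `parser_expect` names are assumptions, not the article's API. End of input is reported through the return value rather than through a special character.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parser state with an explicit end pointer. */
struct parser {
    const char *cur;
    const char *end;  /* one past the last input byte */
};

/* Stores the next byte in *out without consuming it.
 * Returns false at end of input, so EOF is reported out of band. */
static bool parser_peek(const struct parser *p, char *out) {
    assert(p != NULL && out != NULL);
    if (p->cur == p->end) return false;
    *out = *p->cur;
    return true;
}

/* Consumes lit if the input continues with it; otherwise leaves p untouched. */
static bool parser_expect(struct parser *p, const char *lit) {
    assert(p != NULL && lit != NULL);
    size_t n = strlen(lit);
    if ((size_t)(p->end - p->cur) < n || memcmp(p->cur, lit, n) != 0) return false;
    p->cur += n;
    return true;
}
```

With these, the null/false/true atoms reduce to three `parser_expect` calls.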

In modern C null-terminated strings are mostly avoided, and string slices are used instead (begin pointer and size, or begin and end pointers). There are variables in the code called "slice", but they're C-strings. I'd suggest to the author to rewrite the code with slices and compare just how much more readable and efficient it becomes.
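A pointer-plus-length slice might look like the following sketch. The `str_slice` type and its helpers are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning view into a buffer; no terminator required. */
struct str_slice {
    const char *ptr;
    size_t len;
};

/* Returns the sub-slice [begin, end) without copying any bytes. */
static struct str_slice slice_sub(struct str_slice s, size_t begin, size_t end) {
    assert(begin <= end && end <= s.len);
    return (struct str_slice){ s.ptr + begin, end - begin };
}

/* Compares a slice against a C string literal. */
static bool slice_eq(struct str_slice a, const char *lit) {
    size_t n = strlen(lit);
    return a.len == n && memcmp(a.ptr, lit, n) == 0;
}
```

Because a slice carries its own length, tokens can point straight into the input buffer and nothing needs to be copied or terminated with `\0`.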

The whole json_value struct can also be avoided by using X-macros to define the structs. It's the way how one implements a visitor pattern (with serializer and deserializer functions) without language support for reflection.
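To make the X-macro idea concrete, here is a small sketch that generates both a struct and its serializer from one field list. The `point` struct, the `POINT_FIELDS` list and `point_to_json` are invented for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Single source of truth: X(c_type, field_name, printf_format). */
#define POINT_FIELDS(X) \
    X(int, x, "%d")     \
    X(int, y, "%d")

/* Expansion 1: the struct definition. */
struct point {
#define X(type, name, fmt) type name;
    POINT_FIELDS(X)
#undef X
};

/* Expansion 2: a serializer that visits every field in declaration order.
 * Returns the number of characters written, or -1 if buf is too small. */
static int point_to_json(const struct point *p, char *buf, size_t cap) {
    assert(p != NULL && buf != NULL && cap > 0);
    size_t off = 0;
    const char *sep = "{";
#define X(type, name, fmt)                                              \
    if (off < cap) {                                                    \
        int n = snprintf(buf + off, cap - off, "%s\"" #name "\":" fmt,  \
                         sep, p->name);                                 \
        if (n < 0) return -1;                                           \
        off += (size_t)n;                                               \
        sep = ",";                                                      \
    }
    POINT_FIELDS(X)
#undef X
    if (off < cap) {
        int n = snprintf(buf + off, cap - off, "}");
        if (n < 0) return -1;
        off += (size_t)n;
    }
    return off < cap ? (int)off : -1;
}
```

Adding a field to `POINT_FIELDS` updates the struct and the serializer together; a deserializer would be a third expansion of the same list.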