A Lightning Fast JSON Parser Library
Written by Nikos Vaggalis   
Thursday, 14 October 2021

simdjson is a C++ library that can parse JSON documents very fast. Version 1. 0 has been just released. How does it compare?

Does parsing 3 gigabytes of JSON per second sound fast enough?
This library achieves it. In last year's benchmark against the fastest standard compliant C++ JSON parsers, RapidJSON and sajson, smidjson by far outperformed them. It can parse 4x faster than RapidJSON and 25x faster than Modern C++.

This efficiency is mainly achieved due to the library under the hood using SIMD instructions, which excel at data level parallelism by fitting operations many times over per instruction, even under a single core.

Parsing fast is not only applicable to huge dumps of JSON, which the related benchmarks are applied upon. The same speed increase is experienced when parsing millions of small JSON documents per second. That aside, it can also minify JSON files by stripping spaces, tabs, newlines, and carriage returns therefore saving great amounts of space.

Version 1. 0 offers a new, "On Demand", frontend in addition to the standard DOM-based one, which is flagged as the default from now on.

What's the difference, you ask?

In the DOM-based approach, the document is parsed entirely and materialized in an in-memory construction. The On Demand approach feels like a DOM approach, but it sidesteps the construction of the DOM tree. It is entirely lazy: it decodes only the parts of the document that you access.

And what is the actual advantage in that?

With On Demand, if you open a file containing 1000 numbers and you need just one of these numbers, only one number is parsed. If you need to put the numbers into your own data structure, they are materialized there directly, without being first written to a temporary tree.

Thus we expect that the simdjson On Demand might often provide superior performance, when you do not need to materialize a DOM tree

Other than that, release 1. 0. 0 adds several other key features:

  • In big data analytics, it is common to serialize large sets of records as multiple JSON documents separated by while spaces. You can now get the benefits of On Demand while parsing almost infinitely long streams of JSON records. At each step, you have access to the current document, but a secondary thread indexes the following block. You can thus access enormous files while using a small amount of memory and achieve record-breaking speeds.

  • In some cases, JSON documents contain numbers embedded within strings (e. g. , "3. 1416"). You can access these numbers directly using methods such as get_double_in_string().

  • Given an On Demand instance (value, array, object, etc. ), you can now convert it to a JSON string using the to_json_string method which returns a string view in the original document for unbeatable speeds.

  • The On Demand front-end now supports the JSON Pointer specification. You can request a specific value using a JSON Pointer within a large document.

  • Arrays in On Demand now have a count_elements() method. Objects have a count_fields() method. Arrays and objects have a reset method for when you need to iterate through them more than once. Document instances now have a rewind method in case you need to process the same document multiple times.

As a simple example this is how you use the library to parse under the DOM based frontend:

You might think that since it is a C++ lib that only devs writing in C++ are benefited. This is not true as there's already bindings for other languages like Go, Ruby, Python and more, while there's even a port for PostgreSQL too in pg_simdjson.



Therefore, no matter what language you are writing code in, you can still leverage simdjson's advantages.


More Information

simdjson: Parsing gigabytes of JSON per second

Related Articles

Google JavaScript Engine Speeds JSON Parsing

Emacs 27.1 Adds Native JSON Parsing

Chrome To Support Simd.js 


To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.


Notepad++ Twentieth Anniversary

An updated version of Notepad++ is available, on its 20th anniversary. The text editor first saw light in November 2003 when it was released on SourceForge.

PeerDB Brings Real Time Streaming To PostgreSQL

PeerDB is an ETL/ELT tool built for PostgreSQL. It makes all tasks that require streaming data from PostgreSQL to third party counterparts as effortless as it gets.

More News




or email your comment to: comments@i-programmer.info

Last Updated ( Saturday, 16 October 2021 )