simd-lite/simd-json

SIMD JSON for Rust  

Rust port of extremely fast simdjson JSON parser with serde compatibility.


readme (for real!)

simdjson version

Currently tracking version 0.2.x of simdjson upstream (work in progress, feedback welcome!).

CPU target

To take advantage of simd-json, your system needs to be SIMD capable. This means it must be compiled with native CPU support and the relevant target features enabled. Projects that depend on simd-json must be configured with native CPU support as well. See the cargo config in this repository for an example of how to configure this in your project.

simd-json supports AVX2, SSE4.2 and NEON.

Unless the allow-non-simd feature is enabled for your simd-json dependency in your Cargo.toml, simd-json will fail to compile when no supported SIMD instruction set is available. This prevents unexpected slowness in fallback mode, which can be hard to understand and hard to debug.
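As a rough illustration of what compiling with native CPU support changes, the sketch below uses the standard cfg!(target_feature = ...) macro to report which of the three instruction sets the current build was compiled for. The function name compiled_simd_target and the labels it returns are made up for this example and are not part of simd-json.

```rust
/// Reports which SIMD instruction set this binary was compiled for.
/// The function name and the returned labels are hypothetical and
/// exist only for this example.
fn compiled_simd_target() -> &'static str {
    // cfg! is evaluated at compile time, so the answer depends on the
    // target features enabled for the build, not on the machine that
    // eventually runs the binary.
    if cfg!(target_feature = "avx2") {
        "avx2"
    } else if cfg!(target_feature = "sse4.2") {
        "sse4.2"
    } else if cfg!(target_feature = "neon") {
        "neon"
    } else {
        "none"
    }
}

fn main() {
    println!("compiled SIMD target: {}", compiled_simd_target());
}
```

Building with RUSTFLAGS="-C target-cpu=native" is one common way to enable the features your CPU supports, which should change what this function reports on a capable x86 machine.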

allocator

For best performance we highly suggest using mimalloc or jemalloc instead of the system allocator used by default. Another recent allocator that works well (but that we have yet to test in a production setting) is snmalloc.

serde

simd-json is compatible with serde and serde-json. The Value types provided implement serializers and deserializers. In addition, simd-json implements the Deserializer trait for the parser, so it can deserialize anything that implements the serde Deserialize trait. Note that serde provides both a Deserializer and a Deserialize trait.

The serde support is contained in the serde_impl feature, which is part of the default feature set of simd-json but can be disabled.

known-key

The known-key feature changes the hash mechanism for the DOM representation of the underlying JSON object from ahash to fxhash. The ahash hasher is faster at hashing and provides protection against DoS attacks that force multiple keys into a single hash bucket. The fxhash hasher, on the other hand, produces repeatable hashing results, which in turn allows memoizing hashes for well-known keys and saving time on lookups. In workloads that frequently access a few well-known keys, this can be a performance advantage.

The known-key feature is optional and disabled by default and should be explicitly configured.
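To make the trade-off concrete, the following sketch contrasts a hasher with a fixed seed against a randomly seeded one, using only the standard library. It does not use ahash or fxhash: DefaultHasher::new() stands in for a repeatable hasher and RandomState for a seeded one, and the helper names fixed_hash and seeded_hash are hypothetical.

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::hash::{BuildHasher, Hash, Hasher};

/// Hashes a key with a fixed seed (stand-in for a repeatable hasher).
fn fixed_hash(key: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Hashes a key with the random seed carried by `state`
/// (stand-in for a DoS-resistant hasher).
fn seeded_hash(state: &RandomState, key: &str) -> u64 {
    let mut hasher = state.build_hasher();
    key.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // With a fixed seed, the hash of a well-known key can be computed
    // once and reused for every later lookup.
    let memoized = fixed_hash("id");
    assert_eq!(memoized, fixed_hash("id"));
    println!("fixed:    {:#018x}", memoized);

    // With random seeds, the value is only meaningful for one state,
    // so it cannot be precomputed and shared across maps.
    let a = RandomState::new();
    let b = RandomState::new();
    println!("seeded a: {:#018x}", seeded_hash(&a, "id"));
    println!("seeded b: {:#018x}", seeded_hash(&b, "id"));
}
```

The two seeded values will normally differ from each other, which is what makes precomputing them impractical.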

serializing

simd-json is not capable of serializing JSON data as there would be very little gain in re-implementing it. For serialization, we typically rely on serde-json.

For DOM values we provide convenience methods for serialization.

For struct values we defer to external serde-compatible serialization mechanisms.

unsafe

simd-json uses a lot of unsafe code.

There are a few reasons for this:

  • SIMD intrinsics are inherently unsafe. These uses of unsafe are inescapable in a library such as simd-json.
  • We work around some performance bottlenecks imposed by safe Rust. These uses are avoidable, but only at a cost to performance. This is a more considered path in simd-json.

simd-json goes through extra scrutiny for unsafe code. These steps are:

  • Unit tests - to test 'the obvious' cases, edge cases, and regression cases
  • Structural constructive property based testing - We generate random valid JSON objects to exercise the full simd-json codebase stochastically. Floats are currently excluded since slightly different parsing algorithms lead to slightly different results here. In short "is simd-json correct".
  • Data-oriented property based testing of string-like data - to assert that sequences of legal printable characters don't panic or crash the parser (they may, and often do, return an error, since they are not valid JSON)
  • Destructive Property based testing - make sure that no illegal byte sequences crash the parser in any way
  • Fuzzing (using American Fuzzy Lop - afl) - fuzz based on upstream simd pass/fail cases

This doesn't ensure complete safety, nor is it a bulletproof guarantee, but it does go a long way toward asserting that the library is production quality and fit for purpose for practical industrial applications.

Other interesting things

There are also bindings for upstream simdjson available here

License

simd-json itself is licensed under either of

  • Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
  • MIT license (LICENSE-MIT)

at your option.

However, it ports a lot of code from simdjson, so their work and copyright on that should be respected alongside.

The serde integration is based on their example and serde-json, so again, their copyright should be respected as well.

Issues

Collection of the latest Issues

Licenser

Comments: 0

This is a follow-up to #213 (details are included there). In short:

mockalloc allows checking for allocation leaks and other errors.

#213 includes some more details on possible complications around threading and selective tests.

whyCPPgofast

perf
Comments: 6

Hi, I was benchmarking this on a very simple, small JSON document.

My use case is: parse a small JSON document as fast as possible, just ONCE.

The results for me were (1 parse):

serde_json = 3 microseconds
simd_json = 10 microseconds

I was wondering whether it's normal for serde_json to be faster on smaller JSON documents, or am I getting incorrect results?

amanjeev

perf
Comments: 1

Summary

#188 as an exercise showed that the feature to work with newline-delimited JSON (NDJSON) is not implemented in this crate.

Why

This feature is helpful when you have a large number of records, each of which is a small JSON object on its own line. This is often the case with large JSON files, and looping over them and calling simd-json on each line is not going to help. As @Licenser notes in this comment:

Ja the lines are fairly short too the advantages are a lot smaller (sometimes detrimental) as there is an initial cost to pay for filling the registers, doing multiple runs etc. can overshadow the performance gain for very small payloads.

@Licenser also adds

NDJSON would be incredibly cool (especially if we manage to realize in a streaming fashion / as an iterator)

What

Upstream simdjson has this feature called parse_many. Porting that to this crate is the first step.

!!!NEEDS MORE DETAILS!!!

Kogia-sima

Comments: 0

Description

I'd like to serialize a BTreeMap that has integers as keys. However, when I pass the map to to_string or to_string_pretty, simd_json outputs invalid JSON data.

output:

Expected behavior

I want the JSON output like this:

Possible solution

serde_json crate defines a new struct named MapKeySerializer to properly render the map keys.
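The idea behind such a key serializer can be sketched without serde at all: JSON object keys must be strings, so integer keys are written inside quotes. The sketch below is only an illustration of that rule, not the MapKeySerializer from serde_json and not simd-json code. The function name map_to_json and the choice of i32 values are assumptions made for the example.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// Renders a map with integer keys as a JSON object.
/// Hypothetical helper for illustration only.
fn map_to_json(map: &BTreeMap<i32, i32>) -> String {
    let mut out = String::from("{");
    for (i, (key, value)) in map.iter().enumerate() {
        if i > 0 {
            out.push(',');
        }
        // JSON object keys must be strings, so the integer key is
        // wrapped in quotes while the value stays a bare number.
        write!(out, "\"{}\":{}", key, value).expect("writing to a String cannot fail");
    }
    out.push('}');
    out
}

fn main() {
    let mut map = BTreeMap::new();
    map.insert(1, 10);
    map.insert(2, 20);
    println!("{}", map_to_json(&map));
}
```

A BTreeMap iterates in key order, so the keys appear sorted in the output.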

Environment

Software   Version
OS         Lubuntu 20.04.2 LTS
rustc      rustc stable 1.48.0 (7eac88abb 2020-11-16)
simd-json  0.3.23

CJP10

Comments: 0

The current structure is all over the place compared to the upstream repo. We should move to align ourselves with their structure. Imo the key part is the split of higher-level impls from lower-level impls (see /include and /src). I'm not too keen on the naming, but I believe the structure will help the project's structure.

This should help us in the process of tracking upstream changes and actually porting the changes.

zeylahellyer

Comments: 0

I have a weird boolean field in an API. If the field is not present in a payload, then its real value ("meaningful value") is false, while if the field is present then its value is always null, meaning its real value is true.

To deal with this I've written a deserializer and visitor. But, a null value is treated differently in simd-json than in serde_json.

Take a sample payload like this:

And take this visitor for a Deserializer, which will deserialize this as true:

Tests result in:

Fixing simd-json can be done by changing visit_none to visit_unit:

But then that breaks serde_json. So to support both, I need both implementations, both of which are incompatible with the other. Should simd-json go through visit_none instead of/in addition to visit_unit?

Full test suite:

rxdn

Comments: 1

serde_json provides a RawValue type (with the raw_value feature flag) that just stores the raw json bytes so you can pass around a json string without having to parse it.

When simply serializing a RawValue using simd_json, it outputs an unexpected result:

Code to reproduce:

And when deserializing to a RawValue, simd-json produces an error:

Code to reproduce:

It'd be nice if simd-json supported RawValue.

athre0z

enhancement
Comments: 3

In my experience, most large JSON files where using SIMD decoding would make sense come in a newline separated form. Oftentimes they are additionally stored in a compressed form and only stream-decompressed for parsing, e.g. using unix piping such as lz4 -d < big.json | myapp, allowing for decompression to occur on a second CPU core and parsing in a way that is both memory and disk IO efficient.

Unfortunately, this kind of parsing is not at all straightforward to do with simd-json. The usual no-copy BufRead::lines() workflow is killed by the fact that Lines yields immutable &strs while simd-json requires mutable ones. I couldn't find any documentation on why this is the case, but I assume that simd-json temporarily patches bytes for some of the SIMD magic to work. Using BufRead::read_line results in unnecessary copying of the line and manual \n suffix stripping, being both cumbersome and slower than just using serde-json (in my absolutely non-scientific test run).

I feel like it would be great if this lib could also provide a SIMD accelerated lines_mut, which would increase this library's usability immensely.

It is also very much possible that there is an obvious way to make this work which I just failed to see.
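For reference, the read_line-style workaround described above can be sketched with the standard library alone. The helper for_each_line_mut and the stand-in parse_in_place are hypothetical names. A real caller would pass the mutable slice to whichever parser it uses. This still copies each line into the buffer, as the issue points out. It only avoids allocating a new buffer per line.

```rust
use std::io::{self, BufRead, Cursor};

/// Stand-in for a parser that needs a mutable byte slice (hypothetical).
fn parse_in_place(line: &mut [u8]) -> usize {
    line.len()
}

/// Calls `f` with each non-empty line as a mutable slice, reusing one buffer.
/// Returns the number of lines handed to `f`.
fn for_each_line_mut<R: BufRead>(
    mut reader: R,
    mut f: impl FnMut(&mut [u8]),
) -> io::Result<usize> {
    let mut buf = Vec::new();
    let mut count = 0;
    loop {
        buf.clear();
        // read_until copies bytes up to and including the newline.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break; // end of input
        }
        // Strip the trailing "\n" or "\r\n" by hand.
        if buf.last() == Some(&b'\n') {
            buf.pop();
        }
        if buf.last() == Some(&b'\r') {
            buf.pop();
        }
        if buf.is_empty() {
            continue; // skip blank lines
        }
        f(buf.as_mut_slice());
        count += 1;
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    let input = Cursor::new(b"{\"a\":1}\n{\"b\":2}\n".to_vec());
    let lines = for_each_line_mut(input, |line| {
        println!("{} bytes", parse_in_place(line));
    })?;
    println!("{} lines", lines);
    Ok(())
}
```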

Licenser

enhancement
Comments: 3

It might be worth looking into making more performant versions of the serialize and deserialize macros. Serde's implementation factors in different formats and, because of that, comes with different tradeoffs. If we focus on JSON, it is likely that we can be significantly faster, at least matching DOM object serialization speeds and improving deserialization speed further.

Licenser

easy
Comments: 0

I ran the following little script to see which SIMD functions are used in the sse4.2 target:

rg '_mm' src/sse42 | sed -e 's/^.*://' -e 's/^[ ]*//' -e $'s/_mm/\\\n_mm/g' | grep '_mm' | sed -e 's/(.*//' | sort -u

I found that _mm_testz_si128 is the only function used that requires SSE4.1; everything else is part of the 3.x branch. We could possibly relax the test to SSE4.1 or even add an ssse3 target that provides a fallback for that one function.

@sunnygleason since you wrote the SSE42 implementation what are your thoughts?

sunnygleason

enhancement
Comments: 9

One idea we should consider soonish is compiling everything for the target architecture regardless of CPU features and doing runtime detection & dispatch (as implemented in upstream, hopefully easier in Rust?).

There might be other goodies in upstream as well...
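The detection half of this idea can be sketched with the standard library's feature detection macros. The function name detect_simd and the returned labels are hypothetical. A real dispatcher would use the result to select between separately compiled implementations, which is not shown here.

```rust
/// Picks the best SIMD instruction set available on the running CPU.
/// The function name and labels are hypothetical, for illustration only.
fn detect_simd() -> &'static str {
    // On x86 targets, check for the wider instruction set first.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if is_x86_feature_detected!("sse4.2") {
            return "sse4.2";
        }
    }
    // On 64-bit ARM, NEON is checked through the aarch64 macro.
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return "neon";
        }
    }
    // Nothing suitable was found: use the non-SIMD fallback.
    "fallback"
}

fn main() {
    println!("runtime SIMD support: {}", detect_simd());
}
```

Unlike a compile-time check, the result here depends on the machine the binary runs on, so one build could serve CPUs with different capabilities.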

tkaitchuck

enhancement
Comments: 2

Right now the crate provides a great way to parse json. But a lot of the simd tricks are much more general. For example parsing a float or an integer from a string, locating unescaped quotes, etc. may be valuable generally.

If these could be made into sub-crates that are individually pushed to crates.io they could be more easily reused in other contexts.

If they can be made no_std, perhaps SIMD parsing of ints/floats could even be added to the standard library instead of the .parse() method on string.

poonai

enhancement
Comments: 15

Does simdjson have flattened JSON access? (similar to https://github.com/pikkr/pikkr)

Will there be any performance improvement if I use flattened JSON access?


Added by @Licenser as an issue description

The Tape struct should be queryable via a simplified version of JSONpath (section 3.2 in the paper linked below).

To achieve this we need at minimum:

  • a parser that takes a query string and turns it into a digestible format
  • a function that takes said format and applies it to a Tape
  • support for .<field> to query an object field
  • support for [<index>] to query array indexes
  • support for nesting those two
  • sufficient tests to cover the code (sufficient here is defined as 'does not drop crate coverage' or better)

Additional JSONpath operators are welcome but optional.
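The first requirement, a parser that turns a query string into a digestible format, could look roughly like the sketch below. It covers only .<field>, [<index>] and nesting of the two. The Segment enum and parse_query function are hypothetical names, and applying the parsed segments to a Tape is not shown.

```rust
/// One step of a parsed query. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
enum Segment {
    /// `.name` selects an object field.
    Field(String),
    /// `[n]` selects an array index.
    Index(usize),
}

/// Parses queries such as `.users[0].name` into a list of segments.
fn parse_query(query: &str) -> Result<Vec<Segment>, String> {
    let mut segments = Vec::new();
    let mut rest = query;
    while !rest.is_empty() {
        if let Some(after_dot) = rest.strip_prefix('.') {
            // A field name runs until the next '.' or '[' (or the end).
            let end = after_dot
                .find(|c: char| c == '.' || c == '[')
                .unwrap_or(after_dot.len());
            if end == 0 {
                return Err("empty field name".to_string());
            }
            segments.push(Segment::Field(after_dot[..end].to_string()));
            rest = &after_dot[end..];
        } else if let Some(after_bracket) = rest.strip_prefix('[') {
            // An index is the decimal number before the closing ']'.
            let end = after_bracket
                .find(']')
                .ok_or_else(|| "missing ']'".to_string())?;
            let index = after_bracket[..end]
                .parse::<usize>()
                .map_err(|e| format!("bad index: {}", e))?;
            segments.push(Segment::Index(index));
            rest = &after_bracket[end + 1..];
        } else {
            return Err(format!("unexpected input at '{}'", rest));
        }
    }
    Ok(segments)
}

fn main() {
    println!("{:?}", parse_query(".users[0].name"));
}
```

Keeping the parsed form as a flat list of segments means the function that applies it can walk the list once, one step per segment.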

Licenser

perf
Comments: 17

It would be nice to be able to use the tape with the functions that are provided or with the Value trait. For read-only situations this might be significantly faster. The current problem is around ownership (hello rust ...).

Information - Updated Feb 14, 2022

Stars: 619
Forks: 35
Issues: 34

Serde is a framework for serializing and deserializing Rust data structures efficiently and generically

Rust's greatest JSON weapon is Serde, with over 4.4K stars on GitHub and a massive developer community. In BRC's opinion, this is a core Rust library that every developer should learn.

A Rust implementation of JsonPath that also provides a similar API interface for WebAssembly and JavaScript

JSON-E Rust data-struct parameter crate for lightweight embedded content with objects and much more

What makes JSON-e unique is its extensive documentation and ease of use

A Rust JSON5 serializer and deserializer which speaks Serde

Deserialize a JSON5 string with from_str

Rust JSON Parser Benchmark

Download and Generate JSON Data

Read JSON values quickly - Rust JSON Parser

AJSON get json value with specified path, such as project

Rust actix json request example

Send a json request to actix, and parse it

json_typegen - Rust types from JSON samples

json_typegen is a collection of tools for generating types from

Rust JSON parsing benchmarks

This project aims to provide benchmarks to show how various JSON-parsing libraries in the Rust programming language perform at various JSON-parsing tasks

A tiny command line tool written in rust to print json data as a formatted...

A tiny command line tool written in rust to print json data as a formatted table

Rust RPC client for Bitcoin Core JSON-RPC

rust-jsonrpc and makes it easier to talk to the Bitcoin JSON-RPC interface
