Parsing newline separated JSON is cumbersome #124

Closed
athre0z opened this issue Apr 15, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@athre0z
athre0z commented Apr 15, 2020

In my experience, most large JSON files where using SIMD decoding would make sense come in a newline separated form. Oftentimes they are additionally stored in a compressed form and only stream-decompressed for parsing, e.g. using unix piping such as lz4 -d < big.json | myapp, allowing for decompression to occur on a second CPU core and parsing in a way that is both memory and disk IO efficient.

Unfortunately, this kind of parsing is not at all straightforward to do with simd-json. The usual no-copy BufRead::lines() workflow is killed by the fact that Lines yields immutable &strs while simd-json requires mutable ones. I couldn't find any documentation on why this is the case, but I assume that simd-json temporarily patches bytes for some of the SIMD magic to work. Using BufRead::read_line results in unnecessary copying of the line and manual \n suffix stripping, which is both cumbersome and slower than just using serde-json (in my absolutely non-scientific test run).
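
For reference, the read_line-style workaround described above can be sketched roughly as follows, using only the standard library. The helper name `for_each_line_mut` is hypothetical and not part of simd-json. The `handle` closure stands in for any parser that wants a `&mut [u8]` (for example a call to simd-json's `to_borrowed_value`, assuming it takes `&mut [u8]` in the version you use). This still copies each line into a buffer once; it only avoids a fresh allocation per line.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: calls `handle` once per non-empty line with a
/// mutable view of that line's bytes and returns the number of lines handled.
/// A single buffer is reused, so no new allocation happens per line.
fn for_each_line_mut<R, F>(mut reader: R, mut handle: F) -> io::Result<usize>
where
    R: BufRead,
    F: FnMut(&mut [u8]),
{
    let mut buf: Vec<u8> = Vec::new();
    let mut count = 0;
    loop {
        buf.clear(); // drops the old contents but keeps the capacity
        if reader.read_until(b'\n', &mut buf)? == 0 {
            return Ok(count); // end of input
        }
        // Strip a trailing "\n" or "\r\n" by hand.
        let mut end = buf.len();
        if end > 0 && buf[end - 1] == b'\n' {
            end -= 1;
        }
        if end > 0 && buf[end - 1] == b'\r' {
            end -= 1;
        }
        if end == 0 {
            continue; // skip blank lines
        }
        // The parser is free to overwrite these bytes.
        handle(&mut buf[..end]);
        count += 1;
    }
}
```

With stdin as the source this would be called as `for_each_line_mut(std::io::stdin().lock(), |line| { /* parse line */ })`.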

I feel like it would be great if this lib could also provide a SIMD-accelerated lines_mut, which would increase this library's usability immensely.

It is also very much possible that there is an obvious way to make this work which I just failed to see.

@Licenser
Member

Hi!

First, a bit of explanation: the reason we use &mut [u8] or &mut str is that we do a form of in situ parsing. Instead of allocating memory for strings, we just re-use the existing buffer. There are a few ways around this, but none of them are pleasant, thanks to Rust's borrow checker.

That said, with 0.3, simdjson (upstream) has implemented a very fast option for parsing newline-separated JSON, but we haven't had a chance to look at it yet :)
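
To illustrate why in situ parsing needs a mutable buffer, here is a minimal sketch of unescaping a JSON string body in place. It is not simd-json's actual code: the function name `unescape_in_place` is made up for this example, and it only handles a few single-character escapes (no `\uXXXX`). The decoded text is never longer than the escaped text, so it can be written over the input and returned as a slice of the same buffer.

```rust
/// Illustrative only: decodes the bytes between the quotes of a JSON string
/// in place and returns the decoded prefix of the same buffer.
/// Returns None on an unsupported or truncated escape, or on invalid UTF-8.
fn unescape_in_place(buf: &mut [u8]) -> Option<&str> {
    let mut read = 0; // next byte to inspect
    let mut write = 0; // next position to fill; never ahead of `read`
    while read < buf.len() {
        let byte = if buf[read] == b'\\' {
            read += 1;
            match *buf.get(read)? {
                b'"' => b'"',
                b'\\' => b'\\',
                b'/' => b'/',
                b'n' => b'\n',
                b't' => b'\t',
                b'r' => b'\r',
                _ => return None, // \uXXXX and others are left out here
            }
        } else {
            buf[read]
        };
        // Overwrite the input: this is why the caller must hand over `&mut`.
        buf[write] = byte;
        write += 1;
        read += 1;
    }
    std::str::from_utf8(&buf[..write]).ok()
}
```

The returned `&str` borrows from the input buffer, which is where the friction with the borrow checker comes from: the buffer cannot be reused for the next line while any parsed value still points into it.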

@Licenser Licenser added the enhancement New feature or request label Apr 16, 2020
@lemire

lemire commented Apr 17, 2020

@Licenser If I can be reassuring, the JSON stream parser (that's what we call it) is conceptually simple and involves few lines of code. Porting the idea of it would not be a lot of work. It is also subject to parallelization, which is cool.
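
As a rough illustration of the parallelization point, the sketch below splits one mutable buffer into per-line slices and hands them to scoped threads. This is a deliberately simplified stand-in and not how simdjson's stream parser is implemented. The name `parse_lines_parallel` and the `parse` callback are assumptions made for this example. Splitting at raw newline bytes is safe for newline-separated JSON because JSON strings cannot contain unescaped newlines; `\r` is not handled here.

```rust
use std::thread;

/// Illustrative only: splits `buf` at newline bytes into disjoint mutable
/// slices, runs `parse` on each slice across up to `workers` threads, and
/// returns the results in input order. Requires Rust 1.63+ for thread::scope.
fn parse_lines_parallel<T, F>(buf: &mut [u8], workers: usize, parse: F) -> Vec<T>
where
    T: Send,
    F: Fn(&mut [u8]) -> T + Sync,
{
    // split_mut yields non-overlapping &mut [u8] slices, so each thread
    // can mutate its own lines without any locking.
    let mut lines: Vec<&mut [u8]> = buf
        .split_mut(|b| *b == b'\n')
        .filter(|line| !line.is_empty())
        .collect();
    if lines.is_empty() {
        return Vec::new();
    }
    let workers = workers.max(1);
    let per_worker = (lines.len() + workers - 1) / workers;
    let parse = &parse;

    thread::scope(|scope| {
        // Spawn every worker first, then join them in order.
        let handles: Vec<_> = lines
            .chunks_mut(per_worker)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter_mut()
                        .map(|line| parse(&mut **line))
                        .collect::<Vec<T>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker panicked"))
            .collect()
    })
}
```

This needs the whole input in memory, so for the piped, stream-decompressed case from the original post it would have to be applied batch by batch.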

@Licenser
Member

Ja, I'm not worried :) the simdjson code is beautiful so it is always a pleasure to port :D just juggling the usual 10000 things to find the time 😂

@Licenser
Member

Not that this is done, but #194 is a nicer ticket name for this, so I'll combine the two into that.

3 participants