Parsing newline separated JSON is cumbersome #124

athre0z · 2020-04-15T22:25:14Z

In my experience, most large JSON files where using SIMD decoding would make sense come in a newline separated form. Oftentimes they are additionally stored in a compressed form and only stream-decompressed for parsing, e.g. using unix piping such as lz4 -d < big.json | myapp, allowing for decompression to occur on a second CPU core and parsing in a way that is both memory and disk IO efficient.

Unfortunately, this kind of parsing is not at all straightforward to do with simd-json. The usual no-copy BufRead::lines() workflow is killed by the fact that Lines yields immutable &strs while simd-json requires mutable ones. I couldn't find any documentation on why this is the case, but I assume that simd-json temporarily patches bytes for some of the SIMD magic to work. Using BufRead::read_line results in unnecessary copying of the line and manual \n suffix stripping, which is both cumbersome and slower than just using serde-json (in my absolutely non-scientific test run).
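For reference, the read-into-a-buffer workaround described above can be sketched roughly as follows. This is only an illustration using the standard library: the helper name for_each_line_mut is hypothetical and not part of simd-json, and the closure stands in for whatever parser call accepts a mutable byte slice. It reuses one buffer across lines, so it avoids a per-line allocation but still copies each line once.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper (not part of simd-json): calls `f` once per
/// non-empty line with a mutable view of that line's bytes, with the
/// trailing `\n` or `\r\n` removed.
pub fn for_each_line_mut<R, F>(mut reader: R, mut f: F) -> io::Result<()>
where
    R: BufRead,
    F: FnMut(&mut [u8]),
{
    // One buffer reused for every line: a copy per line, but no
    // fresh allocation once the buffer has grown large enough.
    let mut buf: Vec<u8> = Vec::new();
    loop {
        buf.clear();
        // `read_until` appends up to and including the delimiter;
        // a return value of 0 means end of input.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            return Ok(());
        }
        if buf.last() == Some(&b'\n') {
            buf.pop();
        }
        if buf.last() == Some(&b'\r') {
            buf.pop();
        }
        if buf.is_empty() {
            continue; // skip blank lines
        }
        // The parser (assumed to take `&mut [u8]`) would be called here.
        f(&mut buf[..]);
    }
}
```

With input piped in as in the lz4 example, the reader would be a locked stdin handle, and the closure would hand each line to the parser.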

I feel like it would be great if this lib could also provide a SIMD-accelerated lines_mut, which would increase this library's usability immensely.

It is also very much possible that there is an obvious way to make this work which I just failed to see.


Licenser · 2020-04-15T22:32:11Z

Hi!

First, a bit of explanation: the reason we use &mut [u8] or str is that we use a form of in situ parsing. Instead of allocating memory for strings, we just reuse the existing buffer. There are a few ways around this, but none of them are pleasant thanks to Rust's borrow checker.

That said, with 0.3 simdjson (upstream) has implemented a very fast option for parsing newline-separated JSON, but we haven't had a chance to look at it yet :)
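To illustrate the general idea of in situ parsing (this is a simplified sketch, not simd-json's actual code, and the function name unescape_in_place is made up for the example): a decoded JSON string is never longer than its escaped form, so the decoded bytes can be written back over the input, which is why the buffer has to be mutable. The sketch handles only single-character escapes and assumes it is given the bytes between the quotes.

```rust
/// Hypothetical illustration of in situ string decoding.
/// `buf` holds the bytes between the quotes of a JSON string.
/// Escapes are decoded by overwriting `buf` itself, and the decoded
/// prefix is returned. Returns `None` for a trailing backslash or an
/// escape this sketch does not handle (for example `\uXXXX`).
pub fn unescape_in_place(buf: &mut [u8]) -> Option<&mut [u8]> {
    let mut read = 0; // next byte to examine
    let mut write = 0; // next position to store a decoded byte
    while read < buf.len() {
        let byte = buf[read];
        read += 1;
        let decoded = if byte == b'\\' {
            let esc = *buf.get(read)?;
            read += 1;
            match esc {
                b'"' | b'\\' | b'/' => esc,
                b'n' => b'\n',
                b't' => b'\t',
                b'r' => b'\r',
                b'b' => 0x08,
                b'f' => 0x0C,
                _ => return None,
            }
        } else {
            byte
        };
        // `write` never overtakes `read`, so unread input is never clobbered.
        buf[write] = decoded;
        write += 1;
    }
    Some(&mut buf[..write])
}
```

The returned slice borrows from the input buffer, so no string is allocated; the cost is that the caller's buffer is modified and stays mutably borrowed for as long as the result is in use.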

lemire · 2020-04-17T00:14:45Z

@Licenser If I can be reassuring, the JSON stream parser (that's how we call it) is conceptually simple and involves few lines of code. Porting the idea of it would not be a lot of work. It is also subject to parallelization, which is cool.

Licenser · 2020-04-17T08:37:29Z

Ja, I'm not worried :) the simdjson code is beautiful so it is always a pleasure to port :D just juggling the usual 10000 things to find the time 😂

Licenser · 2022-10-21T12:14:03Z

Not that this is done, but #194 is a nicer ticket name for this, so I'll combine the two into that.

Licenser added the enhancement New feature or request label Apr 16, 2020

Licenser closed this as completed Oct 21, 2022

Licenser mentioned this issue Oct 21, 2022

Add support for newlin-delimited JSON (NDJSON) #194

Closed
