In my experience, most large JSON files where using SIMD decoding would make sense come in a newline separated form. Oftentimes they are additionally stored in a compressed form and only stream-decompressed for parsing, e.g. using unix piping such as lz4 -d < big.json | myapp, allowing for decompression to occur on a second CPU core and parsing in a way that is both memory and disk IO efficient.
Unfortunately, this kind of parsing is not at all straightforward to do with simd-json. The usual no-copy BufRead::lines() workflow is killed by the fact that Lines yields immutable &strs while simd-json requires mutable ones. I couldn't find any documentation on why this is the case, but I assume that simd-json temporarily patches bytes for some of the SIMD magic to work. Using BufRead::read_line results in unnecessary copying of the line and manual \n suffix stripping, which is both cumbersome and slower than just using serde-json (in my absolutely non-scientific test run).
I feel like it would be great if this lib could also provide a SIMD-accelerated lines_mut, which would increase this library's usability immensely.
It is also very much possible that there is an obvious way to make this work which I just failed to see.
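For illustration, here is a minimal sketch of the kind of workaround described above, using only the standard library. It reads each line into one reused `Vec<u8>` with `BufRead::read_until`, strips the line ending, and hands a `&mut [u8]` to a callback. The helper names `for_each_line_mut` and `count_records` are hypothetical and not part of simd-json. Each line is still copied once from the reader's internal buffer, so this avoids per-line allocation but is not zero-copy.

```rust
use std::io::{self, BufRead};

/// Calls `handle` once per non-empty line with a mutable view of its bytes.
/// Hypothetical helper for illustration; not part of simd-json.
fn for_each_line_mut<R, F>(mut reader: R, mut handle: F) -> io::Result<()>
where
    R: BufRead,
    F: FnMut(&mut [u8]),
{
    // One buffer reused for every line, so there is no per-line allocation.
    let mut buf: Vec<u8> = Vec::new();
    loop {
        buf.clear();
        // read_until appends bytes up to and including b'\n', or up to EOF.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            return Ok(()); // end of input
        }
        // Strip a trailing "\n" or "\r\n" by shrinking the slice bounds.
        let mut end = buf.len();
        if end > 0 && buf[end - 1] == b'\n' {
            end -= 1;
            if end > 0 && buf[end - 1] == b'\r' {
                end -= 1;
            }
        }
        if end == 0 {
            continue; // skip blank lines
        }
        // A parser that requires `&mut [u8]` would be invoked from `handle`.
        handle(&mut buf[..end]);
    }
}

/// Example use: count the records in a newline-delimited input.
fn count_records<R: BufRead>(reader: R) -> io::Result<usize> {
    let mut count = 0;
    for_each_line_mut(reader, |_line| count += 1)?;
    Ok(count)
}
```

In a real program the callback would pass the slice to the parser, for example a function such as `simd_json::to_borrowed_value`, which as far as I know accepts `&mut [u8]`. Any value that borrows from the slice has to be dropped before the next iteration, because the buffer is overwritten.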
First, a bit of explanation: the reason we use &mut [u8] or &mut str is that we do a form of in situ parsing. Instead of allocating memory for strings, we just re-use the existing buffer. There are a few ways around this, but none of them are pleasant, thanks to Rust's borrow checker.
That said, with 0.3, simdjson (upstream) has implemented a very fast option for parsing newline-separated JSON, but we haven't had a chance to look at this yet :)
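To make the in situ idea concrete, here is a small sketch of why such a parser needs a mutable buffer. It decodes the escapes of a JSON string body by overwriting the input bytes and returns a `&str` that points into the same buffer. This is an illustration under simplifying assumptions, not simd-json's actual code: the function name `unescape_in_place` is hypothetical, only a few single-character escapes are handled, and `\uXXXX` is rejected.

```rust
/// Decodes a JSON string body in place and returns the decoded text,
/// borrowed from the same buffer. Returns None on an unsupported or
/// truncated escape, or if the result is not valid UTF-8.
/// Hypothetical helper for illustration only.
fn unescape_in_place(buf: &mut [u8]) -> Option<&str> {
    let mut read = 0; // next byte to examine
    let mut write = 0; // next position to store a decoded byte
    while read < buf.len() {
        let byte = if buf[read] == b'\\' {
            read += 1;
            match *buf.get(read)? {
                b'"' => b'"',
                b'\\' => b'\\',
                b'/' => b'/',
                b'n' => b'\n',
                b't' => b'\t',
                b'r' => b'\r',
                _ => return None, // e.g. \uXXXX is not handled in this sketch
            }
        } else {
            buf[read]
        };
        // `write` never overtakes `read`, so overwriting is safe.
        buf[write] = byte;
        read += 1;
        write += 1;
    }
    std::str::from_utf8(&buf[..write]).ok()
}
```

Because the decoded text is never longer than the escaped input, no extra allocation is needed. The cost is that the input is modified and stays borrowed for as long as the returned string is in use.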
@Licenser If I can be reassuring, the JSON stream parser (that's what we call it) is conceptually simple and involves few lines of code. Porting the idea of it would not be a lot of work. It is also subject to parallelization, which is cool.