Description
Overview
I am trying to parse a ~2.5GB JSON data file containing a list of lists of data (think Array of Array of Structs). Using the recommended approach of model_validate_json(f.read())
results in the OS SIGKILL-ing the process due to it running out of memory. In comparison, Python's json
module parses it effortlessly.
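For context, the data is shaped roughly as an array of arrays of structs. The titanium_data package is not included here, but a minimal sketch of a pydantic model for that shape might look like the following (field names are purely illustrative, not the real schema):

from pydantic import BaseModel, RootModel

class Record(BaseModel):
    # illustrative fields only; the real titanium_data.Data schema is not shown in this issue
    timestamp: float
    value: float

class Data(RootModel[list[list[Record]]]):
    pass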
For a bit more detail, I profiled the code with memray using the snippets below, and I am attaching the HTML flame graph files as TXT for ease of use (and because GitHub doesn't allow HTML files as attachments but does allow PPTX..).
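For reference, the flame graphs were generated with memray roughly along these lines (the exact invocation may have differed slightly):

memray run -o test-pydantic.py.bin test-pydantic.py
memray flamegraph test-pydantic.py.bin   # writes the HTML flame graph attached below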
I wasn't able to dig deeper into the issue due to lack of time, but it is possible that it is related to #843; I could be very wrong, hence the new issue.
Vanilla json
import json
with open("dataset.json") as f:
data = json.load(f)
memray-flamegraph-test-json.py.107113.html.txt
This approach uses about 8.8G of memory: roughly 6G for parsing and the rest for the string data buffer.
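As a quick sanity check independent of memray, the peak RSS can also be printed at the end of each script (assuming Linux, where ru_maxrss is reported in KiB):

import resource

peak_gib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024 ** 2)
print(f"peak RSS: {peak_gib:.1f} GiB")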
Pydantic recommended API
from titanium_data import Data
with open("dataset.json") as f:
data = Data.model_validate_json(f)
memray-flamegraph-test-pydantic.py.131233.html.txt
This gets SIGKILLed by the OS after consuming ~23G to parse the 2.5GB file.
Pydantic second approach
This uses the "non-recommended" approach from pydantic/pydantic#7323
import json
from titanium_data import Data
with open("./data/trajectories/scenario1/dataset.json") as f:
data = json.load(f)
data = Data.model_validate(data)
memray-flamegraph-test-pydantic2.py.130581.html.txt
Interestingly enough, this method successfully parses the dataset, and does so much faster than the direct model_validate_json approach.
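I don't have exact timings attached, but the difference is easy to reproduce on a smaller slice of the data with something like the sketch below (on the full 2.5GB file the second path gets killed before finishing):

import json
import time
from titanium_data import Data

with open("dataset.json") as f:
    raw = f.read()

start = time.perf_counter()
data = Data.model_validate(json.loads(raw))
print(f"json.loads + model_validate: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
data = Data.model_validate_json(raw)  # OOMs on the full 2.5GB file
print(f"model_validate_json:         {time.perf_counter() - start:.1f}s")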
System Information
uname -srvmo
Linux 5.15.0-84-generic #93~20.04.1-Ubuntu SMP Wed Sep 6 16:15:40 UTC 2023 x86_64 GNU/Linux
Pydantic versions:
pydantic==2.3.0
-e git+https://github.com/pydantic/pydantic-core@c086caec1a200417f19850244282c06b5d4d1650#egg=pydantic_core (equivalent to pydantic-core==2.6.3)