Note: This PR introduces a breaking change without a migration script, but the Pubky app and Nexus shouldn't be affected much. That said, please try it on staging and let me know if anything breaks.
What does this do?
Before this PR, we loaded the entire blob of data into memory for both upload (PUT) and download (GET) requests. That was fine while the size limit was very small (16 KB), but since we raised it to 100 MB, holding that much data in memory is no longer a good idea. Even worse, a client streaming in (or reading) a blob can be very slow, causing issues in the Tokio runtime or, worse still, holding an LMDB write transaction open for a long time.
In this PR, we change things so that incoming data is first buffered to a temporary file; only once that is done do we open a write transaction and load the data from the file system into LMDB.
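The two-phase write described above can be sketched as follows. This is a minimal std-only illustration, not the PR's actual code: the function names (`spill_to_temp`, `commit_from_temp`) and paths are hypothetical, and a plain file read stands in for the LMDB write transaction.

```rust
use std::fs::File;
use std::io::{Read, Write};

// Phase 1: drain the (possibly slow) request body into a temp file.
// A slow client only delays this file write, never a database lock.
fn spill_to_temp(mut body: impl Read, tmp_path: &str) -> std::io::Result<u64> {
    let mut tmp = File::create(tmp_path)?;
    let mut buf = [0u8; 8192];
    let mut total = 0u64;
    loop {
        let n = body.read(&mut buf)?;
        if n == 0 {
            break;
        }
        tmp.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}

// Phase 2: in the real server this runs inside a single LMDB write
// transaction, which now stays open only for this fast local read.
fn commit_from_temp(tmp_path: &str) -> std::io::Result<Vec<u8>> {
    let mut data = Vec::new();
    File::open(tmp_path)?.read_to_end(&mut data)?;
    Ok(data)
}

fn main() -> std::io::Result<()> {
    let body: &[u8] = b"hello blob";
    let written = spill_to_temp(body, "/tmp/blob_sketch.tmp")?;
    let stored = commit_from_temp("/tmp/blob_sketch.tmp")?;
    assert_eq!(written, stored.len() as u64);
    Ok(())
}
```

The point of the split is that the duration of the write transaction depends only on local disk speed, not on the client's upload speed.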
Secondly, this PR breaks the blob into chunks sized so that two of them fit in each OS page; this way LMDB doesn't waste pages on unaligned data.
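The chunk-size arithmetic might look roughly like this. The constants here are assumptions for illustration (a 4096-byte page and a guessed per-node overhead), not the values the PR actually uses:

```rust
// Assumed OS/LMDB page size.
const PAGE_SIZE: usize = 4096;
// Assumed per-chunk cost (LMDB node header + key bytes); illustrative only.
const NODE_OVERHEAD: usize = 48;

// Pick a chunk size so two chunks (plus their overhead) fill one page.
const CHUNK_SIZE: usize = PAGE_SIZE / 2 - NODE_OVERHEAD; // 2000 bytes here

/// Number of chunks needed for a blob of `len` bytes (ceiling division).
fn chunk_count(len: usize) -> usize {
    if len == 0 {
        0
    } else {
        (len + CHUNK_SIZE - 1) / CHUNK_SIZE
    }
}

fn main() {
    assert_eq!(chunk_count(0), 0);
    assert_eq!(chunk_count(CHUNK_SIZE), 1);
    assert_eq!(chunk_count(CHUNK_SIZE + 1), 2);
}
```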
We now have new structs that make both writing an entry and reading an entry's content iterable.
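To show the shape of such an iterable read (not the PR's actual structs), here is a sketch where a `BTreeMap` keyed by `(timestamp, chunk_index)` stands in for the LMDB chunks table. Reading an entry's content becomes iteration over a key range, one chunk at a time, so no full-blob allocation is needed:

```rust
use std::collections::BTreeMap;

// Stand-in for the LMDB chunks table: key = (timestamp, chunk_index).
type ChunkTable = BTreeMap<(u64, u32), Vec<u8>>;

// Yield the chunks of one entry, in index order, without concatenating
// them into a single buffer.
fn read_entry(table: &ChunkTable, timestamp: u64) -> impl Iterator<Item = &Vec<u8>> {
    table
        .range((timestamp, 0)..(timestamp + 1, 0))
        .map(|(_, chunk)| chunk)
}

fn main() {
    let mut table = ChunkTable::new();
    table.insert((42, 0), b"hello ".to_vec());
    table.insert((42, 1), b"world".to_vec());
    table.insert((99, 0), b"other entry".to_vec());

    // A caller can stream chunks; collecting here is just for the demo.
    let content: Vec<u8> = read_entry(&table, 42).flatten().copied().collect();
    assert_eq!(content, b"hello world");
}
```

Because the keys sort by timestamp and then chunk index, a range scan is all that is needed to stream an entry in order.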
Finally, we return the content hash as an ETag, among other headers. More work on headers will come in other PRs, including work on content_type sniffing.

What does this PR break?
Previously, blobs were saved as a single entry in the blobs table, identified by their hash and tracked with a reference count. This PR gives up on that deduplication by hash and instead uses the timestamp + chunk index as the keys of chunks. That saves space on keys, but it means two people uploading the same content will have it stored twice. I think that is OK and we shouldn't worry about deduplication: it is rarely useful and adds too much complexity, so I would rather sacrifice disk space for simplicity for now.
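A plausible encoding of the new chunk keys is sketched below. The exact layout is an assumption: an 8-byte big-endian timestamp followed by a 4-byte big-endian chunk index, so that under LMDB's default lexicographic key comparison all chunks of one entry sort contiguously and in order:

```rust
// Build a 12-byte chunk key: big-endian timestamp then big-endian index.
// Big-endian matters: it makes byte-wise ordering match numeric ordering.
fn chunk_key(timestamp: u64, index: u32) -> [u8; 12] {
    let mut key = [0u8; 12];
    key[..8].copy_from_slice(&timestamp.to_be_bytes());
    key[8..].copy_from_slice(&index.to_be_bytes());
    key
}

fn main() {
    // Chunks of the same entry sort by index...
    assert!(chunk_key(42, 0) < chunk_key(42, 1));
    // ...and every chunk of an earlier entry sorts before a later entry.
    assert!(chunk_key(42, 999) < chunk_key(43, 0));
}
```

Compared to a 32-byte content hash, a 12-byte key per chunk is the "saves space on keys" trade-off mentioned above.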
What is this PR NOT?
This PR does NOT add range queries; you can't (yet) request a frame in a video or anything like that, but it certainly makes that easier in the future.