
Feat/non blocking writes #44

Merged
merged 6 commits into from
Oct 16, 2024
Conversation


@Nuhvi Nuhvi commented Oct 16, 2024

Note: This PR causes a breaking change without a migration script, but Pubky app and Nexus shouldn't be affected much. That said, please try it on staging and let me know if anything broke.

What does this do?

Before this PR, we loaded the entire blob of data into memory, both for upload (PUT) and download (GET) requests. That was fine while the data was kept very small (16 KB), but since we raised that limit to 100 MB, it is no longer a good idea to hold that much data in memory. Even worse, a client streaming blobs in (or reading them out) can be very slow, causing issues in the Tokio runtime, or, worse, blocking LMDB with a write transaction for a long time.

In this PR, we change things so that we buffer incoming data to a temporary file first; once that is done, we start a write transaction and load the data from the file system into LMDB.

Secondly, this PR breaks the blob into chunks sized so that two of them fit in each OS page; this way LMDB doesn't waste pages on unaligned data.

We now have new structs that make writing an Entry, as well as reading an entry's content, iterable.

Finally, we return the content hash as an `ETag`, among other headers. More work on headers will come in other PRs, including work on `content_type` sniffing.

What does this PR break?

Previously, blobs were saved as a single entry in the blobs table, identified by their hash and carrying a reference count. Now we give up on this deduplication by hash, and just use the timestamp + chunk index as the key of each chunk. That saves space on keys, but means that the same content uploaded by two people will be stored twice... I think that is OK, and we shouldn't worry about deduplication: it is rarely useful and adds too much complexity. I would rather sacrifice disk space for simplicity for now.

What is this PR NOT?

This PR does NOT add range queries; you can't (yet) request a frame in a video or anything like that, but it certainly makes that easier in the future.


Nuhvi commented Oct 16, 2024

Seems to work ... send it

@Nuhvi Nuhvi merged commit eb5beea into main Oct 16, 2024
1 check passed
@Nuhvi Nuhvi deleted the feat/non-blocking-writes branch November 15, 2024 12:32