
Limit clean task memory usage #370

Closed
wants to merge 3 commits into from

Conversation

seanses (Contributor) commented May 16, 2024

Fix CLI-190

Given that

  1. Git and PyXet send data to xet-core in different block sizes to clean,
  2. PyXet can clean multiple files in parallel, and
  3. different files clean at different speeds,

it's difficult to tune memory usage for each clean task individually. Instead, clean tasks share a common memory limit (8 GB) and grab resources as fast as they can for efficiency.
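
A minimal sketch of the shared-limit idea, with hypothetical names and a Tokio semaphore (one permit per MiB of budget) standing in for whatever mechanism this PR actually implements:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical shared budget: one permit per MiB, 8 GiB in total.
const MEMORY_LIMIT_MIB: usize = 8 * 1024;

async fn clean_block(budget: Arc<Semaphore>, block: Vec<u8>) {
    // Reserve enough of the shared budget to hold this block; the task
    // waits here if other clean tasks are already using up the limit.
    let mib = ((block.len() + (1 << 20) - 1) >> 20).max(1) as u32;
    let _permits = budget
        .acquire_many_owned(mib)
        .await
        .expect("semaphore closed");

    // ... chunk, dedupe, and upload `block` here ...

    // Dropping `_permits` returns this block's share of the budget to
    // the other clean tasks.
}

#[tokio::main]
async fn main() {
    let budget = Arc::new(Semaphore::new(MEMORY_LIMIT_MIB));
    // Git and PyXet callers may clean several files in parallel with
    // different block sizes; every task draws from the same budget.
    let handles: Vec<_> = (0..2)
        .map(|_| tokio::spawn(clean_block(budget.clone(), vec![0u8; 16 << 20])))
        .collect();
    for h in handles {
        h.await.unwrap();
    }
}
```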

Tested locally by uploading two files of 40 GB each; memory usage stayed below 8 GB:

(.env) di@di-mbp ~/xetdata/pyxet/python/pyxet % xet cp ~/tt/zz* xet://localhost:3000:seanses-local/empty/main
copy /Users/di/tt/zz0, /Users/di/tt/zzx... to xet://localhost:3000:seanses-local/empty/main: (2 / 2), 80 GiB | 1.19 GiB/s, done.

@seanses seanses changed the title global memory limit Limit clean task memory usage May 16, 2024
@seanses seanses marked this pull request as draft May 16, 2024 17:17
@seanses seanses force-pushed the di/cap-memory-usage branch from 3272a41 to 9d827c6 on May 16, 2024 17:28
@seanses seanses force-pushed the di/cap-memory-usage branch from 9d827c6 to b53353d on May 20, 2024 19:22
@seanses seanses marked this pull request as ready for review May 20, 2024 19:37
@seanses seanses requested review from hoytak and ylow May 20, 2024 19:37
hoytak commented May 29, 2024

So... this seems really complicated, and I'm worried there will be side effects if it's not used correctly. Since we're using fixed-size buffers, I think we could get around this by simply enforcing a maximum buffer size globally, then calculating the size of the queue so that the larger size limits are respected with respect to the max buffer size.
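
For concreteness only (hypothetical names and numbers, not code from this PR), that calculation could look roughly like:

```rust
// Hypothetical helper: derive a bounded-queue depth from a global memory
// limit, a globally enforced max buffer size, and the expected number of
// parallel clean tasks.
fn queue_depth(limit_bytes: usize, max_buffer_bytes: usize, parallel_tasks: usize) -> usize {
    (limit_bytes / (max_buffer_bytes * parallel_tasks)).max(1)
}

fn main() {
    // e.g. an 8 GiB limit, 16 MiB buffers, and 4 parallel clean tasks
    // give each task a queue of 128 slots.
    println!("{}", queue_depth(8 << 30, 16 << 20, 4));
}
```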

seanses (Contributor, Author) commented May 29, 2024

> So... this seems really complicated, and I'm worried there will be side effects if it's not used correctly.

The underlying logic is actually very simple: it treats the multiple buffers as a single bounded buffer, accounts only for data entering and leaving that whole, and ignores internal data moves because they don't change total memory usage.
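
Roughly, in sketch form (hypothetical names, not the code in this PR):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical tracker for the combined buffers of all clean tasks.
// Only the boundary of the whole pipeline touches it.
struct SharedBudget {
    in_use_bytes: AtomicUsize,
}

impl SharedBudget {
    // A block enters the pipeline (read from Git or PyXet): charge it.
    // A limit-enforcing version would wait here until the total fits
    // under the shared cap.
    fn on_enter(&self, bytes: usize) {
        self.in_use_bytes.fetch_add(bytes, Ordering::SeqCst);
    }

    // A block leaves the pipeline (uploaded and dropped): credit it back.
    fn on_exit(&self, bytes: usize) {
        self.in_use_bytes.fetch_sub(bytes, Ordering::SeqCst);
    }

    // Handing a block from one internal queue to another doesn't change
    // total memory in use, so no call is made at all.
}
```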

> Since we're using fixed-size buffers, I think we could get around this by simply enforcing a maximum buffer size globally, then calculating the size of the queue so that the larger size limits are respected with respect to the max buffer size.

I'm not following this suggestion. Is this calculation done once to configure each buffer size before clean tasks start, or is it done dynamically during cleaning?

seanses (Contributor, Author) commented Jun 24, 2024

After offline discussion, closing in favor of #399 due to unforeseen risks.

@seanses seanses closed this Jun 24, 2024