
refactor data processing #419

Open: wants to merge 5 commits into base: main

@seanses seanses commented Aug 29, 2024

This PR refactors data processing to:

  1. Change the clean API to a buffer-based API that is not an async iterator (the async next can be dropped as we drop async in the underlying crates). Usage:
let pft = PointerFileTranslatorV3::new(config).await;

/* ----------- Clean file 1 (can safely spawn into another thread) ----------- */
let cleaner = pft.start_clean(4096 /*buffer size*/, Some(path1)).await?;
while let Some(data) = read_file(&mut reader1) {
    cleaner.add_bytes(data).await?;
}
let cleaned_result = cleaner.result().await;

/* ----------- Clean file 2 (can safely spawn into another thread)  ----------- */
let cleaner = pft.start_clean(4096 /*buffer size*/, Some(path2)).await?;
while let Some(data) = read_file(&mut reader2) {
    cleaner.add_bytes(data).await?;
}
let cleaned_result = cleaner.result().await;

/* ----------- Finish ----------- */
pft.finalize_cleaning().await

For example, see

let cleaner = self.start_clean(4096, Some(path)).await?;

  2. Drop the XetConfig dependency. For now, some helper functions map XetConfig to the new configurations (see

    pub async fn translator_config_from(
    ). These exist only to test the correctness of the new data processing logic with the existing test setup.

  3. Make the repo salt optional for dedup.
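
To make the shape of the buffer-based clean API and the optional salt from the list above concrete, here is a minimal, synchronous sketch. It is not the code in this PR: `SketchCleaner`, `key_for`, and `clean_all` are hypothetical names, the std `DefaultHasher` stands in for whatever hashing the real dedup uses, and treating the salt as an optional prefix mixed into each chunk key is only one assumed reading of "optional salt".

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical stand-in for the handle returned by `start_clean`.
/// It buffers incoming bytes and records one key per full buffer.
struct SketchCleaner {
    buffer_size: usize,
    /// Assumption: `None` means "hash chunks without any repo salt".
    salt: Option<[u8; 32]>,
    pending: Vec<u8>,
    chunk_keys: Vec<u64>,
}

impl SketchCleaner {
    fn new(buffer_size: usize, salt: Option<[u8; 32]>) -> Self {
        assert!(buffer_size > 0, "buffer size must be non-zero");
        Self { buffer_size, salt, pending: Vec::new(), chunk_keys: Vec::new() }
    }

    /// Hash one chunk, mixing in the salt only when one is configured.
    fn key_for(&self, chunk: &[u8]) -> u64 {
        let mut hasher = DefaultHasher::new();
        if let Some(salt) = &self.salt {
            hasher.write(salt);
        }
        hasher.write(chunk);
        hasher.finish()
    }

    /// Accept bytes in arbitrary slices; emit a key for every full buffer.
    fn add_bytes(&mut self, data: &[u8]) {
        self.pending.extend_from_slice(data);
        while self.pending.len() >= self.buffer_size {
            // `split_off` keeps the first `buffer_size` bytes in `pending`
            // and returns the remainder, which becomes the new `pending`.
            let rest = self.pending.split_off(self.buffer_size);
            let full = std::mem::replace(&mut self.pending, rest);
            let key = self.key_for(&full);
            self.chunk_keys.push(key);
        }
    }

    /// Flush any partial buffer and return all chunk keys.
    fn result(mut self) -> Vec<u64> {
        if !self.pending.is_empty() {
            let tail = std::mem::take(&mut self.pending);
            let key = self.key_for(&tail);
            self.chunk_keys.push(key);
        }
        self.chunk_keys
    }
}

/// Feed a sequence of byte slices through a fresh cleaner.
fn clean_all(parts: &[&[u8]], buffer_size: usize, salt: Option<[u8; 32]>) -> Vec<u64> {
    let mut cleaner = SketchCleaner::new(buffer_size, salt);
    for part in parts {
        cleaner.add_bytes(part);
    }
    cleaner.result()
}
```

In this sketch the result depends only on the byte stream and the salt, not on how the caller slices the input, which is the property that lets a caller push whatever `read_file` returns straight into `add_bytes`.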

All integration tests pass.
Same clean speed as before.
