
Convert labkey.get_data and quilt.get_data to be single download_file calls to allow for mapped requests #68


Open
evamaxfield opened this issue Feb 5, 2020 · 0 comments
Labels
enhancement New feature or request

Comments

@evamaxfield

Currently, dask workers have only two cores and four threads by default. Because quilt uses a ThreadPoolExecutor to download files, running this function inside a single dask worker severely limits how quickly files can be downloaded.

If we restructure {loader}.get_data so that it returns a list of partial functions, one per file, instead of doing the fetching itself, we can map those partials across the dask workers. Each worker can then take a file to download, rather than a single worker downloading everything.

In pseudo-code, with a lot of metadata handling removed:

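from functools import partial
from typing import Callable, List

import quilt3
from prefect import Flow, task  # assumption: @task / Flow / .map below are Prefect's

# Lives in the quilt loader module, i.e. quilt.get_file_fetchers;
# labkey.get_file_fetchers would mirror it.
def get_file_fetchers(save_dir, protein_list, n_fovs, overwrite):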
    package = quilt3.Package.browse("aics/pipeline_integrated_cell", "s3://allencell")
    fetchers = []
    for protein in protein_list:
        for i in range(n_fovs):
            fetchers.append(
                # illustrative only: the real package is not indexed like this;
                # quilt3's PackageEntry.fetch takes the target path as `dest`
                partial(package[protein][i].fetch, dest=save_dir / package[protein][i].name)
            )

    return fetchers

@task
def run_fetcher(fetcher):
    return fetcher()

@task
def get_save_load_data_functions(
    save_dir=None,
    protein_list=None,
    n_fovs=100,
    overwrite=False,
    dataset="quilt"
) -> List[Callable]:
    if dataset == "quilt":
        return quilt.get_file_fetchers(save_dir, protein_list, n_fovs, overwrite)
    return labkey.get_file_fetchers(save_dir, protein_list, n_fovs, overwrite)

with Flow("fetch_data") as flow:  # Prefect flows need a name; "fetch_data" is a placeholder
    data_fetchers = get_save_load_data_functions(**kwargs)
    save_paths = run_fetcher.map(data_fetchers)
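
The same pattern can be tried without quilt, labkey, or a workflow engine. The sketch below is only an illustration under stated assumptions: `fake_fetch` is a hypothetical stand-in for a single-file download, `fetch_all` is a helper name made up here, and a standard-library executor stands in for the dask workers. The point it shows is that the loader only builds zero-argument callables, and whatever runs them decides how to spread them out.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from functools import partial
from pathlib import Path
from typing import Callable, List
import tempfile


def fake_fetch(name: str, dest: Path) -> Path:
    """Hypothetical stand-in for downloading one file to `dest`."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(f"contents of {name}")
    return dest


def get_file_fetchers(save_dir: Path, names: List[str]) -> List[Callable[[], Path]]:
    """Build one zero-argument callable per file; nothing is downloaded here."""
    return [partial(fake_fetch, name, save_dir / name) for name in names]


def run_fetcher(fetcher: Callable[[], Path]) -> Path:
    """Run a single fetcher; this is the unit of work that gets mapped."""
    return fetcher()


def fetch_all(executor: Executor, fetchers: List[Callable[[], Path]]) -> List[Path]:
    """Map the fetchers over the executor; results come back in input order."""
    return list(executor.map(run_fetcher, fetchers))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fetchers = get_file_fetchers(Path(tmp), ["a.tiff", "b.tiff", "c.tiff"])
        with ThreadPoolExecutor(max_workers=2) as pool:
            for path in fetch_all(pool, fetchers):
                print(path.name)
```

With dask.distributed, `Client.map(run_fetcher, fetchers)` followed by `Client.gather` should play the role of `fetch_all`. The fetchers then need to be picklable, which a `partial` over a module-level function is; whether a bound quilt `fetch` method pickles cleanly would need to be checked.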
@evamaxfield evamaxfield added the enhancement New feature or request label Feb 5, 2020
@evamaxfield evamaxfield self-assigned this Feb 5, 2020
@evamaxfield evamaxfield removed their assignment Dec 12, 2020