
Get S3 metadata for genomic files #628

Open · 3 tasks
gsantia opened this issue Aug 11, 2021 · 2 comments


gsantia commented Aug 11, 2021

A common step during the ingest process is to take the file paths of a study's genomic files (which are stored on S3) and grab their associated metadata. The user then puts this data into a format that the ingest library can recognize, and it is used as a data source during the extract stage.
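As an illustration only (these helper names are hypothetical, not part of the ingest library; the boto3 call assumes valid S3 creds), the grab-metadata step might look like:

```python
from urllib.parse import urlparse

def split_s3_url(url):
    """Split an s3://bucket/key URL into its (bucket, key) parts."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {url}")
    return parsed.netloc, parsed.path.lstrip("/")

def fetch_s3_metadata(url):
    """Fetch basic object metadata for one S3 URL (requires boto3 and creds)."""
    import boto3  # deferred so the URL helper works without boto3 installed
    bucket, key = split_s3_url(url)
    head = boto3.client("s3").head_object(Bucket=bucket, Key=key)
    # Keep only the fields an extract config would typically want.
    return {
        "file_path": url,
        "size": head["ContentLength"],
        "etag": head["ETag"].strip('"'),
        "last_modified": head["LastModified"].isoformat(),
    }
```

The returned dicts could then be written out as a tabular data source for the extract stage.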

It seems this is generally done with personal scripts that analysts have developed. We could make the ingest library perform these steps automatically (with the appropriate S3 creds provided) when genomic files are detected. This probably should not be the default behavior but an option that can be turned on, to maintain backward compatibility with previous ingest packages.

Rough ideas:

  • Set up the ingest library to work with S3 creds (this may already be done, since I know S3 file paths can be provided as input)
  • Recognize genomic file entities before ingestion and trigger the S3 scraping process
  • Supplement the ingested genomic file data with the S3 metadata and ingest it all together
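The second and third steps above could hang together roughly like this (a sketch under assumed column names such as `FILE_PATH`; the "starts with s3://" detection heuristic is illustrative, not the library's actual behavior):

```python
def find_genomic_file_rows(records, path_column="FILE_PATH"):
    """Return rows whose file path points at S3, i.e. candidates for scraping."""
    return [r for r in records if str(r.get(path_column, "")).startswith("s3://")]

def supplement_with_metadata(records, metadata_by_path, path_column="FILE_PATH"):
    """Merge previously fetched S3 metadata into the ingested genomic file rows."""
    return [
        {**r, **metadata_by_path.get(r.get(path_column), {})}
        for r in records
    ]
```

Detection runs before ingestion to decide which rows trigger the scrape; the merge then folds the scraped fields back in so everything is ingested together.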
@fiendish commented:

We could join all of the GF url fields together into one long list and then fetch their cloud metadata in parallel at some point after the clinical data are extracted. Fetching metadata in parallel for a given list of S3 paths should be added to d3b-utils, adjacent to the bucket scraping function.
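A sketch of that join-then-fetch idea (the `fetch_one` callable is injected so the helper works with any metadata getter; function names and the worker count are illustrative, not d3b-utils API):

```python
from concurrent.futures import ThreadPoolExecutor

def flatten_urls(url_lists):
    """Join per-entity lists of GF urls into one flat, de-duplicated list."""
    seen, flat = set(), []
    for urls in url_lists:
        for url in urls:
            if url not in seen:
                seen.add(url)
                flat.append(url)
    return flat

def fetch_all_metadata(urls, fetch_one, max_workers=16):
    """Fetch metadata for every url in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, urls))
```

Threads are a reasonable fit here because the per-object HEAD requests are I/O-bound.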

@fiendish commented:

> Fetching metadata in parallel for a given list of S3 paths should be added to d3b-utils, adjacent to the bucket scraping function.

This is done now.


No branches or pull requests

3 participants