
Get S3 metadata for genomic files #628

Open · 3 tasks
gsantia opened this issue Aug 11, 2021 · 2 comments


gsantia commented Aug 11, 2021

A common step during the ingest process is to take the file paths of a study's genomic files (which are stored on S3) and grab their associated metadata. The user then puts this data into a format that the ingest library can recognize, and it is used as a data source during the extract stage.
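As an illustration only (these helper names are hypothetical, not part of the ingest library; the boto3 call assumes valid S3 creds), the grab-metadata step might look like:

```python
from urllib.parse import urlparse

def split_s3_url(url):
    """Split an s3://bucket/key URL into its (bucket, key) parts."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {url}")
    return parsed.netloc, parsed.path.lstrip("/")

def fetch_s3_metadata(url):
    """Fetch basic object metadata for one S3 URL (requires boto3 and creds)."""
    import boto3  # deferred so the URL helper works without boto3 installed
    bucket, key = split_s3_url(url)
    head = boto3.client("s3").head_object(Bucket=bucket, Key=key)
    # Keep only the fields an extract config would typically want.
    return {
        "file_path": url,
        "size": head["ContentLength"],
        "etag": head["ETag"].strip('"'),
        "last_modified": head["LastModified"].isoformat(),
    }
```

The returned dicts could then be written out as a tabular data source for the extract stage.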

It seems this is generally done with personal scripts that analysts have developed. We could make the ingest library perform these steps automatically (with the appropriate S3 creds provided) when genomic files are detected. This probably should not be the default behavior but an option that can be turned on, to maintain backward compatibility with previous ingest packages.

Rough ideas:

  • Set up the ingest library to work with S3 creds (this may already be done, since I know S3 file paths can be provided as input)
  • Recognize genomic file entities before ingestion and trigger the S3 scraping process
  • Supplement the ingested genomic file data with the S3 metadata and ingest it all together
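The second and third steps above could hang together roughly like this (a sketch under assumed column names such as `FILE_PATH`; the "starts with s3://" detection heuristic is illustrative, not the library's actual behavior):

```python
def find_genomic_file_rows(records, path_column="FILE_PATH"):
    """Return rows whose file path points at S3, i.e. candidates for scraping."""
    return [r for r in records if str(r.get(path_column, "")).startswith("s3://")]

def supplement_with_metadata(records, metadata_by_path, path_column="FILE_PATH"):
    """Merge previously fetched S3 metadata into the ingested genomic file rows."""
    return [
        {**r, **metadata_by_path.get(r.get(path_column), {})}
        for r in records
    ]
```

Detection runs before ingestion to decide which rows trigger the scrape; the merge then folds the scraped fields back in so everything is ingested together.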
@fiendish commented:

We could join all of the GF url fields together into one long list and then fetch their cloud metadata in parallel at some point after the clinical data are extracted. Fetching metadata in parallel for a given list of S3 paths should be added to d3b-utils, adjacent to the bucket scraping function.
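A sketch of that join-then-fetch idea (the `fetch_one` callable is injected so the helper works with any metadata getter; function names and the worker count are illustrative, not d3b-utils API):

```python
from concurrent.futures import ThreadPoolExecutor

def flatten_urls(url_lists):
    """Join per-entity lists of GF urls into one flat, de-duplicated list."""
    seen, flat = set(), []
    for urls in url_lists:
        for url in urls:
            if url not in seen:
                seen.add(url)
                flat.append(url)
    return flat

def fetch_all_metadata(urls, fetch_one, max_workers=16):
    """Fetch metadata for every url in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, urls))
```

Threads are a reasonable fit here because the per-object HEAD requests are I/O-bound.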

@fiendish commented:

> Fetching metadata in parallel for a given list of S3 paths should be added to d3b-utils, adjacent to the bucket scraping function.

This is done now.


No branches or pull requests

3 participants