A common step during the ingest process is to take the paths of the study's genomic files (which are stored on S3) and grab their associated metadata. The user then puts this metadata into a format that the ingest library can recognize, and it is used as a data source during the extract stage.
It seems this is generally done with personal scripts that analysts have developed. We could make the ingest library perform these steps automatically (with the appropriate S3 creds provided) when genomic files are detected. This probably should not be the default behavior, but an option that can be turned on, to maintain backward compatibility with previous ingest packages.
Rough ideas:
Set up the ingest library to work with S3 creds (this may already be done, as I know S3 file paths can be provided as input)
Recognize genomic file entities before ingestion and trigger the S3 scraping process
Supplement ingested genomic file data with S3 metadata and ingest it all together
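To make the rough ideas above a bit more concrete, here is a minimal sketch of the opt-in "supplement" step. Everything named here is an assumption for illustration and not part of the ingest library: the `url` field name, the `enabled` flag, and the `fetch_metadata` callable (which in practice might wrap boto3's `head_object`). The lookup is injected so the sketch runs without S3 creds.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def supplement_genomic_files(records, fetch_metadata, enabled=False):
    """Return copies of records, adding S3 metadata where a record has an S3 URL.

    records: list of dicts; "url" is a hypothetical field name.
    fetch_metadata: callable (bucket, key) -> dict of metadata.
    enabled: opt-in flag, off by default to keep existing packages working.
    """
    if not enabled:
        # Default behavior: pass records through unchanged.
        return [dict(record) for record in records]

    supplemented = []
    for record in records:
        merged = dict(record)
        url = record.get("url", "")
        if url.startswith("s3://"):
            bucket, key = split_s3_url(url)
            merged.update(fetch_metadata(bucket, key))
        supplemented.append(merged)
    return supplemented
```

Keeping the flag off by default means existing ingest packages would see no change unless they turn it on.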
We could join all of the GF URL fields together into one long list and then fetch their cloud metadata in parallel at some point after the clinical data are extracted. A function that fetches metadata in parallel for a given list of S3 paths should be added to d3b-utils, adjacent to the bucket scraping function.
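A rough sketch of what the parallel fetch could look like, using a thread pool from the standard library. The function names and the shape of the returned dicts are hypothetical and not an existing d3b-utils API. The only real external call is boto3's `head_object`, imported lazily so the rest of the sketch runs without boto3 or creds.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def make_boto3_fetcher():
    """Build a fetch function backed by one shared boto3 S3 client."""
    import boto3  # lazy import: only needed when actually talking to S3

    client = boto3.client("s3")  # one client, reused by every worker thread

    def fetch_one(url):
        parsed = urlparse(url)
        response = client.head_object(
            Bucket=parsed.netloc, Key=parsed.path.lstrip("/")
        )
        return {
            "size": response["ContentLength"],
            "etag": response["ETag"],
            "last_modified": response["LastModified"],
        }

    return fetch_one


def fetch_metadata_parallel(urls, fetch_one, max_workers=8):
    """Map each unique URL to {"metadata": ..., "error": ...}.

    A failure on one URL is recorded for that URL and does not
    abort the rest of the batch.
    """
    unique_urls = list(dict.fromkeys(urls))  # de-duplicate, keep order

    def safe_fetch(url):
        try:
            return {"metadata": fetch_one(url), "error": None}
        except Exception as exc:
            return {"metadata": None, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_fetch, unique_urls))
    return dict(zip(unique_urls, results))
```

With creds configured, usage would be something like `fetch_metadata_parallel(all_gf_urls, make_boto3_fetcher())`. Threads fit here because each lookup is a small network call that spends most of its time waiting.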