Fail AnnData ingest if expected raw data is missing (SCP-5956) #388
Background
When an AnnData file is accepted for ingest, the study owner currently indicates whether raw count data can be found in the .raw slot of the AnnData object. AnnData tutorials now suggest storing raw count data in adata.layers['counts'] instead of .raw (note that the researcher can choose any string as the key, including "raw"). In all cases, we should reject the AnnData object if no data is found at the indicated raw_location. This update to raw count extraction allows checks in both the adata.raw slot and adata.layers[].
Note that this is a breaking change in raw count extraction: an --extract "['raw_counts']" job will fail if a --raw-location parameter is not also provided.
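For reviewers, the intended check can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `has_raw_counts` and `validate_raw_location` are hypothetical, and `SimpleNamespace` objects stand in for real AnnData objects so the snippet runs without anndata installed. It assumes only that an unset `adata.raw` is `None` and that `adata.layers` supports key membership tests.

```python
from types import SimpleNamespace


def has_raw_counts(adata, raw_location):
    """Return True if raw counts exist at raw_location (hypothetical helper).

    ".raw" refers to the adata.raw slot; any other string is treated as a
    key into adata.layers.
    """
    if raw_location == ".raw":
        return getattr(adata, "raw", None) is not None
    return raw_location in adata.layers


def validate_raw_location(adata, raw_location):
    """Raise ValueError if raw count extraction cannot proceed (hypothetical)."""
    if not raw_location:
        # Mirrors the breaking change: a raw location must always be given.
        raise ValueError("raw count extraction requires a raw location")
    if not has_raw_counts(adata, raw_location):
        raise ValueError(f"no raw count data found at {raw_location!r}")


# Stand-ins for AnnData objects (assumption: only .raw and .layers matter here).
raw_slot_only = SimpleNamespace(raw=object(), layers={})
layers_only = SimpleNamespace(raw=None, layers={"counts": [[1, 0], [0, 2]]})

validate_raw_location(raw_slot_only, ".raw")    # passes
validate_raw_location(layers_only, "counts")    # passes

try:
    validate_raw_location(layers_only, ".raw")  # data lives in layers, not .raw
except ValueError as err:
    print(f"rejected: {err}")
```

The three calls at the end correspond to the three manual test cases below: a valid .raw extraction, a valid layers extraction, and a rejected job where .raw is specified but the counts are stored in a layer.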
Manual testing
Run the following ingest pipeline job that successfully runs an adata.raw raw counts extraction:
python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/trimmed_compliant_pbmc3K.h5ad --extract "['raw_counts']" --raw-location ".raw"
Confirm that the job output indicates job success.
Run the following ingest pipeline job that successfully runs an adata.layers['counts'] raw counts extraction:
python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/compliant_liver_layers_counts.h5ad --extract "['raw_counts']" --raw-location "counts"
Then run this invalid job, which fails because .raw is specified but the raw count data exists elsewhere:
python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/compliant_liver_layers_counts.h5ad --extract "['raw_counts']" --raw-location ".raw"
Confirm that the job output reports the missing data and indicates job failure.