Fail AnnData ingest if expected raw data is missing (SCP-5956) #388

jlchang · 2025-03-20T20:49:59Z

Background

When an AnnData file is accepted for ingest, the study owner currently indicates whether raw count data can be found in the .raw slot of the AnnData object. AnnData tutorials now suggest storing raw count data in adata.layers['counts'] instead of .raw (note that the researcher can choose any string as the key, including "raw"). In all cases, we should reject the AnnData object if no data is found at the indicated raw_location. This update to raw count extraction supports checks in both the adata.raw slot and adata.layers[].
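For illustration only, here is a minimal sketch of the kind of check described above. It is not the code in this PR. The helper name `get_raw_counts` is hypothetical; it treats ".raw" as the adata.raw slot and any other string as a key into adata.layers. It is duck-typed so it can be run against a stand-in object without AnnData installed (on a real AnnData object, adata.raw is None when unset and adata.layers behaves like a mapping). The .raw error text mirrors the message shown under Manual testing; the layers message is invented.

```python
from types import SimpleNamespace


def get_raw_counts(adata, raw_location):
    """Return the raw count matrix at raw_location, or raise ValueError.

    Hypothetical helper: ".raw" means the adata.raw slot; any other
    string is treated as a key into adata.layers.
    """
    if not raw_location:
        raise ValueError("raw_location is required to extract raw counts")
    if raw_location == ".raw":
        if adata.raw is None:
            raise ValueError("No data found in .raw slot")
        return adata.raw.X
    if raw_location not in adata.layers:
        # Message wording is an assumption, not taken from the pipeline
        raise ValueError(f"No data found at adata.layers['{raw_location}']")
    return adata.layers[raw_location]


# Stand-in for an AnnData object whose raw counts live in layers['counts']
fake_adata = SimpleNamespace(raw=None, layers={"counts": [[1, 0], [0, 2]]})

# Succeeds: data exists at the indicated layer key
counts = get_raw_counts(fake_adata, "counts")

# Fails: .raw was indicated but the raw counts live elsewhere
try:
    get_raw_counts(fake_adata, ".raw")
    raw_slot_error = None
except ValueError as err:
    raw_slot_error = str(err)
```

The two calls at the end correspond to the valid layers job and the invalid .raw job in the manual testing steps.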

Note that this is a breaking change in raw count extraction: an --extract ['raw_counts'] job will fail unless a --raw-location parameter is also provided.
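One way to picture the new requirement is the sketch below. It assumes a reduced, hypothetical argument parser (`parse_anndata_args`) and is not the pipeline's actual parser; where the real pipeline performs this check, and the exact error it reports, are not shown in this PR description.

```python
import argparse
import ast


def parse_anndata_args(argv):
    """Parse a reduced, hypothetical subset of the ingest_anndata arguments."""
    parser = argparse.ArgumentParser(prog="ingest_anndata_sketch")
    # --extract arrives as a string such as "['raw_counts']"
    parser.add_argument("--extract", type=ast.literal_eval, default=[])
    parser.add_argument("--raw-location", default=None)
    args = parser.parse_args(argv)
    if "raw_counts" in args.extract and not args.raw_location:
        # parser.error prints usage to stderr and exits with status 2
        parser.error("--raw-location is required when extracting raw_counts")
    return args


# Valid: raw count extraction with a raw location
ok = parse_anndata_args(["--extract", "['raw_counts']", "--raw-location", ".raw"])

# Invalid: raw count extraction without a raw location
try:
    parse_anndata_args(["--extract", "['raw_counts']"])
    exit_code = 0
except SystemExit as exc:
    exit_code = exc.code
```

Failing at argument-parsing time is a design choice made for this sketch; the same condition could equally be checked later, when the extraction job starts.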

Manual testing

Set up for ingest testing as per usual (from ingest directory, run source ../scripts/setup-mongo-dev.sh)
Run the following ingest pipeline job, which should successfully extract raw counts from the .raw slot:
python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/trimmed_compliant_pbmc3K.h5ad --extract "['raw_counts']" --raw-location ".raw"
Confirm job output result has info indicating successful extraction:

Extracted 1 DataArray for 5dd5ae25421aa910a723a337:h5ad_frag.matrix.raw.mtx.gz Cells

and indicates job success

action: ingest_anndata
status: success
functionName: extract_from_anndata

Run the following ingest pipeline job, which should successfully extract raw counts from adata.layers['counts']:
python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/compliant_liver_layers_counts.h5ad --extract "['raw_counts']" --raw-location "counts"
Then run this invalid job, which should fail because .raw is specified but the raw count data exists elsewhere:
python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/compliant_liver_layers_counts.h5ad --extract "['raw_counts']" --raw-location ".raw"
Confirm job output result has info indicating missing data:

No data found in .raw slot

and indicates job failure

action: ingest_anndata
status: failure
functionName: extract_from_anndata

eweitz

Code looks good! Thanks for the quick explainer at Friday standup.

bistline · 2025-03-26T20:01:37Z

ingest/ingest_pipeline.py

-# Differential expression analysis (h5ad matrix, raw count in adata.layers['counts'])
-python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression  --raw-location 'counts' --annotation-name cell_type__ontology_label --de-type rest  --annotation-type group --annotation-scope study --annotation-file ../tests/data/anndata/compliant_liver_h5ad_frag.metadata.tsv.gz --cluster-file ../tests/data/anndata/compliant_liver_h5ad_frag.cluster.X_umap.tsv.gz --cluster-name umap --matrix-file-path ../tests/data/anndata/compliant_liver_layers_counts.h5ad  --matrix-file-type h5ad --study-accession SCPdev --differential-expression
+# Differential expression analysis (h5ad matrix)
+python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name louvain --annotation-type group --annotation-scope study --matrix-file-path ../tests/data/anndata/trimmed_compliant_pbmc3K.h5ad --matrix-file-type h5ad --annotation-file ../tests/data/anndata/h5ad_frag.metadata.tsv --cluster-file ../tests/data/anndata/h5ad_frag.cluster.X_umap.tsv --cluster-name umap --study-accession SCPdev --differential-expression


As discussed, we still want --raw-location in the example commands as it is required now.

jlchang · 2025-03-26

Thanks for the catch! Fixed in 15e2ec8.

jlchang added 3 commits March 20, 2025 12:30

add raw_location check

d9064d8

add tests for raw_location validation

f2780e9

fix passing of invalid raw_location

fd9391e

jlchang requested review from bistline and eweitz March 20, 2025 20:50

eweitz approved these changes Mar 24, 2025

bistline approved these changes Mar 26, 2025

restore DE examples for raw_location

15e2ec8

jlchang merged commit 9c81680 into development Mar 27, 2025
4 checks passed

bistline deleted the jlc_fail_missing_raw branch April 15, 2025 16:43
