Fail AnnData ingest if expected raw data is missing (SCP-5956) #388

Merged
merged 4 commits into from
Mar 27, 2025

Contributor

@jlchang jlchang commented Mar 20, 2025

Background

When an AnnData file is accepted for ingest, the study owner currently indicates whether raw count data can be found in the .raw slot of the AnnData object. AnnData tutorials now suggest storing raw count data in adata.layers['counts'] instead of .raw (note that the researcher can choose any string as the key, including "raw"). In all cases, we should reject the AnnData object if no data is found at the indicated raw_location. This update to raw count extraction allows checks in both the adata.raw slot and adata.layers[].
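The check described above could be sketched roughly as follows. This is not the code from this PR: `get_raw_counts` is a hypothetical helper name, the `adata.layers` error message is illustrative, and a `SimpleNamespace` stands in for a real AnnData object so the sketch runs without the anndata package. It relies only on the `raw`, `raw.X`, and `layers` attributes that AnnData objects expose.

```python
from types import SimpleNamespace


def get_raw_counts(adata, raw_location):
    """Return the raw count matrix at raw_location, or raise ValueError.

    Hypothetical helper: ".raw" means the adata.raw slot; any other
    string is treated as a key into adata.layers.
    """
    if raw_location == ".raw":
        if adata.raw is None:
            raise ValueError("No data found in .raw slot")
        return adata.raw.X
    if raw_location not in adata.layers:
        # Illustrative message; the pipeline's actual wording may differ
        raise ValueError(f"No data found in adata.layers['{raw_location}']")
    return adata.layers[raw_location]


# Stand-in for an AnnData object whose raw counts live in layers["counts"]
adata = SimpleNamespace(raw=None, layers={"counts": [[1, 0], [0, 3]]})
counts = get_raw_counts(adata, "counts")
```

Calling `get_raw_counts(adata, ".raw")` on the same stand-in raises `ValueError`, which mirrors the rejection case where `.raw` is specified but the data lives in a layer.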

Note that this is a breaking change in raw count extraction: an --extract ['raw_counts'] job will fail if a --raw-location parameter is not also provided.
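One way to picture the new requirement is an argument check like the sketch below. It is an assumption about the shape of the validation, not the pipeline's actual parser: `parse_args`, the program name, and the error wording are hypothetical, and only the `--extract` and `--raw-location` flag names come from the commands in this PR.

```python
import argparse
import ast


def parse_args(argv):
    """Parse a reduced set of flags and enforce the raw_counts rule."""
    parser = argparse.ArgumentParser(prog="ingest_anndata_sketch")
    # --extract is passed as a Python-style list literal, e.g. "['raw_counts']"
    parser.add_argument("--extract", type=ast.literal_eval, default=[])
    parser.add_argument("--raw-location", default=None)
    args = parser.parse_args(argv)
    if "raw_counts" in args.extract and args.raw_location is None:
        # parser.error prints a usage message and exits with status 2
        parser.error("--raw-location is required when extracting raw_counts")
    return args


args = parse_args(["--extract", "['raw_counts']", "--raw-location", ".raw"])
```

With this sketch, `parse_args(["--extract", "['raw_counts']"])` exits with an error instead of starting a job that cannot locate raw counts.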

Manual testing

  1. Set up for ingest testing as per usual (from ingest directory, run source ../scripts/setup-mongo-dev.sh)
  2. Run the following ingest pipeline job that successfully runs a .raw slot raw counts extraction:
    python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/trimmed_compliant_pbmc3K.h5ad --extract "['raw_counts']" --raw-location ".raw"
  3. Confirm job output result has info indicating successful extraction:

Extracted 1 DataArray for 5dd5ae25421aa910a723a337:h5ad_frag.matrix.raw.mtx.gz Cells

and indicates job success

action: ingest_anndata
status: success
functionName: extract_from_anndata

  4. Run the following ingest pipeline job that successfully runs an adata.layers['counts'] raw counts extraction:
    python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/compliant_liver_layers_counts.h5ad --extract "['raw_counts']" --raw-location "counts"

  5. And this invalid job that fails when .raw is specified but the raw count data exists elsewhere:
    python ingest_pipeline.py --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/compliant_liver_layers_counts.h5ad --extract "['raw_counts']" --raw-location ".raw"

  6. Confirm job output result has info indicating missing data:

No data found in .raw slot

and indicates job failure

action: ingest_anndata
status: failure
functionName: extract_from_anndata
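If it helps to script the confirmation steps, the "key: value" lines shown above could be collected with a small helper like this. It is only a sketch: `parse_job_result` is a hypothetical name, and the sample text is copied from the expected output in this description, so real job output may contain additional lines or a different layout.

```python
def parse_job_result(output):
    """Collect "key: value" lines from job output into a dict."""
    result = {}
    for line in output.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that are not "key: value" pairs
            result[key.strip()] = value.strip()
    return result


# Sample text taken from the expected failure output above
output = """No data found in .raw slot
action: ingest_anndata
status: failure
functionName: extract_from_anndata"""

result = parse_job_result(output)
assert result["status"] == "failure"
```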

@jlchang jlchang requested review from bistline and eweitz March 20, 2025 20:50
Member

@eweitz eweitz left a comment

Code looks good! Thanks for the quick explainer at Friday standup.

# Differential expression analysis (h5ad matrix, raw count in adata.layers['counts'])
python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --raw-location 'counts' --annotation-name cell_type__ontology_label --de-type rest --annotation-type group --annotation-scope study --annotation-file ../tests/data/anndata/compliant_liver_h5ad_frag.metadata.tsv.gz --cluster-file ../tests/data/anndata/compliant_liver_h5ad_frag.cluster.X_umap.tsv.gz --cluster-name umap --matrix-file-path ../tests/data/anndata/compliant_liver_layers_counts.h5ad --matrix-file-type h5ad --study-accession SCPdev --differential-expression
# Differential expression analysis (h5ad matrix)
python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name louvain --annotation-type group --annotation-scope study --matrix-file-path ../tests/data/anndata/trimmed_compliant_pbmc3K.h5ad --matrix-file-type h5ad --annotation-file ../tests/data/anndata/h5ad_frag.metadata.tsv --cluster-file ../tests/data/anndata/h5ad_frag.cluster.X_umap.tsv --cluster-name umap --study-accession SCPdev --differential-expression
Contributor

As discussed, we still want --raw-location in the example commands as it is required now.

Contributor Author

Thanks for the catch! Fixed in 15e2ec8.

@jlchang jlchang merged commit 9c81680 into development Mar 27, 2025
4 checks passed