
fix: atac fragment processing suggestion #1284

Merged 9 commits into main on Mar 6, 2025
Conversation

@Bento007 (Contributor) commented Mar 4, 2025

Reason for Change

Changes

  • Silence dask distributed log messages to make output easier to understand.
  • Reformat error messages to make them easier to find in the output.
  • Check whether the anndata is ATAC before any other checks.
  • Perform anndata checks before fragment checks to speed up the error feedback loop.
  • Check that read support is not <= 0.
  • Check for chromosomes that do not match the organism.
  • Return the disallowed organisms in the error response.
  • Catch the pandas error when converting to parquet.
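The first bullet (silencing dask distributed log messages) could be done with standard Python logging; a minimal sketch, where the helper name is hypothetical but the logger names are the ones dask's distributed scheduler actually uses:

```python
import logging

def silence_distributed_logs(level: int = logging.ERROR) -> None:
    # Hypothetical helper: raise the threshold on dask.distributed's
    # chatty loggers so validation output is easier to read.
    for name in ("distributed", "distributed.worker", "distributed.scheduler"):
        logging.getLogger(name).setLevel(level)

silence_distributed_logs()
```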

Testing

  • Either list QA steps or reasoning you feel QA is unnecessary
  • Reminder For CLI changes: upon merge, contact Lattice for final sign-off. Do not release a new cellxgene-schema
    version to PyPI without explicit QA + sign-off from Lattice on all functional CLI changes. They may install the package
    version at HEAD of main with
pip install git+https://github.com/chanzuckerberg/single-cell-curation/@main#subdirectory=cellxgene_schema_cli

Notes for Reviewer

Bento007 added 2 commits March 4, 2025 14:45
- fix dask warning.
- run fast anndata tests first.

codecov bot commented Mar 4, 2025

Codecov Report

Attention: Patch coverage is 72.00000% with 21 lines in your changes missing coverage. Please review.

Project coverage is 89.33%. Comparing base (f63c0bc) to head (9d390e6).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1284      +/-   ##
==========================================
- Coverage   89.70%   89.33%   -0.37%     
==========================================
  Files          20       21       +1     
  Lines        2341     2373      +32     
==========================================
+ Hits         2100     2120      +20     
- Misses        241      253      +12     
Components Coverage Δ
cellxgene_schema_cli 89.98% <72.00%> (-0.52%) ⬇️
migration_assistant 91.26% <ø> (ø)
schema_bump_dry_run_genes 79.74% <ø> (ø)
schema_bump_dry_run_ontologies 99.53% <ø> (ø)

organism_ontology_term_ids = ad.io.read_elem(f["obs"])["organism_ontology_term_id"].unique().astype(str)
if organism_ontology_term_ids.size > 1:
    error_message = (
        "Anndata.obs.organism_ontology_term_id must have a unique value. Found the following values:\n"
Contributor

nit: if curators are fine with it, np, but this error message reads a little strangely to me. How about 'must have exactly 1 unique value."?

@@ -143,7 +143,7 @@ def check_anndata_requires_fragment(anndata_file: str) -> bool:
"""
onto_parser = OntologyParser()
@nayib-jose-gloria (Contributor) commented Mar 5, 2025

nit: pin to a schema_version? anndata validation is doing so, we should do this to avoid potential mismatches

@ejmolinelli (Contributor) commented Mar 5, 2025

could you import the existing instance from validate.py? It would keep the versioned instances in sync.

Contributor Author

Moved ONTOLOGY_PARSER to its own file to share across modules.
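A minimal sketch of that refactor, assuming a new shared module (the module and accessor names are illustrative, and the stand-in class replaces the real OntologyParser from cellxgene-ontology-guide):

```python
# ontology_parser.py -- hypothetical shared module
from functools import lru_cache

class OntologyParser:
    # Stand-in for the real parser; shown only to illustrate sharing one
    # pinned instance across validate.py and the ATAC fragment checks.
    def __init__(self, schema_version: str = "illustrative-version"):
        self.schema_version = schema_version

@lru_cache(maxsize=1)
def get_ontology_parser() -> OntologyParser:
    # Every caller gets the same cached instance, keeping the pinned
    # schema version in sync across modules.
    return OntologyParser()
```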

@@ -19,7 +19,7 @@

from .utils import is_ontological_descendant_of

- logger = logging.getLogger(__name__)
+ logger = logging.getLogger("cellxgene-schema")

# TODO: these chromosome tables should be calculated from the fasta file?
Contributor

Note: spoke to Trent about this in an IRL sync. Agreed that this issue should be tracked as a fast-follow for atac-seq validation, as this table should be aligned to the GENCODE version we're using for each pertinent species.

Contributor

definitely agree here! It's just waiting to get out of sync.

Contributor Author

def validate_anndata(anndata_file: str) -> list[str]:
    errors = [validate_anndata_organism_ontology_term_id(anndata_file), validate_anndata_is_primary_data(anndata_file)]
    return report_errors("Errors found in Anndata file", errors)
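The ordering this enables — cheap anndata checks first, fragment validation skipped on failure — could be sketched as follows (the stubs stand in for the real validators, which are not shown here):

```python
def validate_anndata(anndata_file: str) -> list[str]:
    # Stub standing in for the checks above (organism term, is_primary_data).
    return [] if anndata_file.endswith(".h5ad") else ["file is not an h5ad"]

def validate_fragment(fragment_file: str) -> list[str]:
    # Stub standing in for the slower fragment-file checks.
    return []

def process(anndata_file: str, fragment_file: str) -> list[str]:
    # Fail fast: skip fragment validation when the anndata checks fail,
    # keeping the error feedback loop quick.
    errors = validate_anndata(anndata_file)
    if errors:
        return ["Errors found in Anndata file. Skipping fragment validation."] + errors
    return validate_fragment(fragment_file)
```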
Contributor

Maybe add a warning / note that because the anndata failed these basic checks, we could not validate the fragment-based rules (to account for someone seeing these, fixing them, then being surprised when they get new errors)

Contributor Author

Changed to "Errors found in Anndata file. Skipping fragment validation."

try:
    fragment_required = check_anndata_requires_fragment(h5ad_file)
    if fragment_required:
        logger.info("Andata requires an ATAC fragment file.")
Contributor

super nit - Andata typo?

try:
    fragment_required = check_anndata_requires_fragment(h5ad_file)
    if fragment_required:
        logger.info("Andata requires an ATAC fragment file.")
Contributor

Suggested change:
- logger.info("Andata requires an ATAC fragment file.")
+ logger.info("Anndata requires an ATAC fragment file.")

if fragment_required:
    logger.info("Andata requires an ATAC fragment file.")
else:
    logger.info("Andata does not require an ATAC fragment file.")
Contributor

same typo

    else:
        logger.info("Andata does not require an ATAC fragment file.")
except Exception as e:
    report_errors("Andata does not support ATAC fragment files for the follow reason", [str(e)])
Contributor

same typo

Contributor

also follow reason --> following reason

# convert the fragment to a parquet file for faster processing
try:
    parquet_file = convert_to_parquet(fragment_file, tempdir)
except Exception:
Collaborator

Is it possible to propagate the Exception up to the error message?

It looks like for the int columns in the fragment file, I will get a pandas parse exception with different dtypes.

For the chromosome column (col 1), there's this exception if there's a value that's not part of the expected categories:
pyarrow.lib.ArrowInvalid: No non-null segments were available for field 'chromosome'; couldn't infer type

And for the barcode column, I think any given value is coerced to a string, so then I get the validation error 'Barcodes don't match anndata.obs.index'

If it's too much to have more specific error messages for why the conversion failed, then maybe a message like : "Error converting fragment to parquet, check that dtypes for fragment file columns are consistent/match the schema"

Contributor Author

These errors should appear that way now

Collaborator

I'm still getting just the 'Error converting fragment to parquet.' error message, or this raised exception for the first column: pyarrow.lib.ArrowInvalid: No non-null segments were available for field 'chromosome'; couldn't infer type

Ideally, if there's a parsing error at this step, it would be best to know which column or even which row failed so we could go find and correct the issue. With the current messages, I can very roughly know where to start to try and find the issue because I purposefully introduced it, but in a normal curation workflow, I'd be very lost as to what went wrong and how to fix it.

Contributor Author

What type of bad values are you thinking you will see?

Contributor Author

pyarrow.lib.ArrowInvalid: No non-null segments were available for field 'chromosome'; couldn't infer type

Doesn't this tell you what column had the error?

Collaborator

That's a good question, I'm testing against different random data types that go against what is expected for each column, but it's hard to say what's actually out there in the wild as we get real submissions. And also if we curate and concatenate fragments from various donors/samples into one file, do we accidentally introduce an error?

We can wait and see what we come across, but at the very least, the error message should state something like "parsing error, check that columns match schema definition" if it's too hard to get more specific than that

@brian-mott (Collaborator) commented Mar 6, 2025

> Doesn't this tell you what column had the error?

Yes, that at least mentions the right column

Contributor Author

Fixed the error message to be:

Error Parsing the fragment file. Check that columns match schema definition. Error:

with the message from the pandas or pyarrow error appended to the end.
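That behavior — a fixed prefix with the underlying parser error appended — can be sketched with a generic wrapper (the wrapper itself is hypothetical; in the PR the real convert_to_parquet call is what gets wrapped):

```python
def wrap_parse_error(convert):
    # Re-raise any conversion failure with the fixed prefix and the
    # original pandas/pyarrow message appended to the end.
    def inner(*args, **kwargs):
        try:
            return convert(*args, **kwargs)
        except Exception as e:
            raise ValueError(
                "Error Parsing the fragment file. Check that columns match "
                f"schema definition. Error: {e}"
            ) from e
    return inner
```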


def report_errors(header: str, errors: list[str]) -> list[str]:
    if any(errors):
        errors = [f"{i}: {e}" for i, e in enumerate(errors) if e is not None]
Collaborator

Enumerate is a nice touch but I'd say not needed as long as each error is its own string/prints on a new line. Makes it easier to just check against the error instead of an error string with the enumeration plus the error message.

@nayib-jose-gloria (Contributor) left a comment

left some more notes, but largely looks good. thanks!

@Bento007 Bento007 merged commit 50c98d3 into main Mar 6, 2025
11 of 14 checks passed
@Bento007 Bento007 deleted the tsmith/atac-feedback branch March 6, 2025 22:26
4 participants