
fix: atac fragment processing suggestion #1284

Merged 9 commits into main on Mar 6, 2025
Conversation

@Bento007 (Contributor) commented Mar 4, 2025

Reason for Change

Changes

  • Silence dask distributed log messages to make output easier to understand.
  • Reformat error messages to make them easier to find in the output.
  • Check whether the anndata is ATAC before any other checks.
  • Perform anndata checks before fragment checks to speed up the error feedback loop.
  • Check that read support is not <= 0.
  • Check for chromosomes that do not match the organism.
  • Return the disallowed organisms in the error response.
  • Catch the pandas error when converting to parquet.
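The first bullet (silencing dask distributed log messages) could be done with standard Python logging; a minimal sketch, where the helper name is hypothetical but the logger names are the ones dask's distributed scheduler actually uses:

```python
import logging

def silence_distributed_logs(level: int = logging.ERROR) -> None:
    # Hypothetical helper: raise the threshold on dask.distributed's
    # chatty loggers so validation output is easier to read.
    for name in ("distributed", "distributed.worker", "distributed.scheduler"):
        logging.getLogger(name).setLevel(level)

silence_distributed_logs()
```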

Testing

  • Either list QA steps or reasoning you feel QA is unnecessary
  • Reminder For CLI changes: upon merge, contact Lattice for final sign-off. Do not release a new cellxgene-schema
    version to PyPI without explicit QA + sign-off from Lattice on all functional CLI changes. They may install the package
    version at HEAD of main with
pip install git+https://github.com/chanzuckerberg/single-cell-curation/@main#subdirectory=cellxgene_schema_cli

Notes for Reviewer

Bento007 added 2 commits March 4, 2025 14:45
- fix dask warning.
- run fast anndata tests first.

codecov bot commented Mar 4, 2025

Codecov Report

Attention: Patch coverage is 72.00000% with 21 lines in your changes missing coverage. Please review.

Project coverage is 89.33%. Comparing base (f63c0bc) to head (9d390e6).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1284      +/-   ##
==========================================
- Coverage   89.70%   89.33%   -0.37%     
==========================================
  Files          20       21       +1     
  Lines        2341     2373      +32     
==========================================
+ Hits         2100     2120      +20     
- Misses        241      253      +12     
Components Coverage Δ
cellxgene_schema_cli 89.98% <72.00%> (-0.52%) ⬇️
migration_assistant 91.26% <ø> (ø)
schema_bump_dry_run_genes 79.74% <ø> (ø)
schema_bump_dry_run_ontologies 99.53% <ø> (ø)

organism_ontology_term_ids = ad.io.read_elem(f["obs"])["organism_ontology_term_id"].unique().astype(str)
if organism_ontology_term_ids.size > 1:
    error_message = (
        "Anndata.obs.organism_ontology_term_id must have a unique value. Found the following values:\n"
Contributor

nit: if curators are fine with it, np, but this error message reads a little strangely to me. How about 'must have exactly 1 unique value."?

@@ -143,7 +143,7 @@ def check_anndata_requires_fragment(anndata_file: str) -> bool:
"""
onto_parser = OntologyParser()
@nayib-jose-gloria (Contributor) commented Mar 5, 2025

nit: pin to a schema_version? anndata validation is doing so, we should do this to avoid potential mismatches

@ejmolinelli (Contributor) commented Mar 5, 2025

could you import the existing instance from validate.py? It would keep the versioned instances in sync.

Contributor Author

Moved ONTOLOGY_PARSER to its own file to share across modules.
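A minimal sketch of that refactor, assuming a new shared module (the module and accessor names are illustrative, and the stand-in class replaces the real OntologyParser from cellxgene-ontology-guide):

```python
# ontology_parser.py -- hypothetical shared module
from functools import lru_cache

class OntologyParser:
    # Stand-in for the real parser; shown only to illustrate sharing one
    # pinned instance across validate.py and the ATAC fragment checks.
    def __init__(self, schema_version: str = "illustrative-version"):
        self.schema_version = schema_version

@lru_cache(maxsize=1)
def get_ontology_parser() -> OntologyParser:
    # Every caller gets the same cached instance, keeping the pinned
    # schema version in sync across modules.
    return OntologyParser()
```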

@@ -19,7 +19,7 @@

from .utils import is_ontological_descendant_of

- logger = logging.getLogger(__name__)
+ logger = logging.getLogger("cellxgene-schema")

# TODO: these chromosome tables should be calculated from the fasta file?
Contributor

Note: spoke to Trent about this in an IRL sync. Agreed that this issue should be tracked as a fast-follow for atac-seq validation, as this table should be aligned to the GENCODE version we're using for each pertinent species.

Contributor

definitely agree here! It's just waiting to get out of sync.

Contributor Author

def validate_anndata(anndata_file: str) -> list[str]:
    errors = [validate_anndata_organism_ontology_term_id(anndata_file), validate_anndata_is_primary_data(anndata_file)]
    return report_errors("Errors found in Anndata file", errors)
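The ordering this enables — cheap anndata checks first, fragment validation skipped on failure — could be sketched as follows (the stubs stand in for the real validators, which are not shown here):

```python
def validate_anndata(anndata_file: str) -> list[str]:
    # Stub standing in for the checks above (organism term, is_primary_data).
    return [] if anndata_file.endswith(".h5ad") else ["file is not an h5ad"]

def validate_fragment(fragment_file: str) -> list[str]:
    # Stub standing in for the slower fragment-file checks.
    return []

def process(anndata_file: str, fragment_file: str) -> list[str]:
    # Fail fast: skip fragment validation when the anndata checks fail,
    # keeping the error feedback loop quick.
    errors = validate_anndata(anndata_file)
    if errors:
        return ["Errors found in Anndata file. Skipping fragment validation."] + errors
    return validate_fragment(fragment_file)
```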
Contributor

Maybe add a warning / note that because the anndata failed these basic checks, we could not validate the fragment-based rules (to account for someone seeing these, fixing them, then being surprised when they get new errors)

Contributor Author

Changed to "Errors found in Anndata file. Skipping fragment validation."

try:
    fragment_required = check_anndata_requires_fragment(h5ad_file)
    if fragment_required:
        logger.info("Andata requires an ATAC fragment file.")
Contributor

super nit - Andata typo?

try:
    fragment_required = check_anndata_requires_fragment(h5ad_file)
    if fragment_required:
        logger.info("Andata requires an ATAC fragment file.")
Contributor

Suggested change:
- logger.info("Andata requires an ATAC fragment file.")
+ logger.info("Anndata requires an ATAC fragment file.")

if fragment_required:
    logger.info("Andata requires an ATAC fragment file.")
else:
    logger.info("Andata does not require an ATAC fragment file.")
Contributor

same typo

    else:
        logger.info("Andata does not require an ATAC fragment file.")
except Exception as e:
    report_errors("Andata does not support ATAC fragment files for the follow reason", [str(e)])
Contributor

same typo

Contributor

also follow reason --> following reason

# convert the fragment to a parquet file for faster processing
try:
    parquet_file = convert_to_parquet(fragment_file, tempdir)
except Exception:
Collaborator

Is it possible to propagate the Exception up to the error message?

It looks like for the int columns in the fragment file, I will get a pandas parse exception with different dtypes.

For the chromosome column (col 1), there's this exception if there's a value that's not part of the expected categories:
pyarrow.lib.ArrowInvalid: No non-null segments were available for field 'chromosome'; couldn't infer type

And for the barcode column, I think any given value is coerced to a string, so then I get the validation error 'Barcodes don't match anndata.obs.index'

If it's too much to have more specific error messages for why the conversion failed, then maybe a message like : "Error converting fragment to parquet, check that dtypes for fragment file columns are consistent/match the schema"

Contributor Author

These errors should appear that way now

Collaborator

I'm still getting just the 'Error converting fragment to parquet.' error message, or this raised exception for the first column: pyarrow.lib.ArrowInvalid: No non-null segments were available for field 'chromosome'; couldn't infer type

Ideally, if there's a parsing error at this step, it would be best to know which column or even which row failed so we could go find and correct the issue. With the current messages, I can very roughly know where to start to try and find the issue because I purposefully introduced it, but in a normal curation workflow, I'd be very lost as to what went wrong and how to fix it.

Contributor Author

What type of bad values are you thinking you will see?

Contributor Author

pyarrow.lib.ArrowInvalid: No non-null segments were available for field 'chromosome'; couldn't infer type

Doesn't this tell you what column had the error?

Collaborator

That's a good question, I'm testing against different random data types that go against what is expected for each column, but it's hard to say what's actually out there in the wild as we get real submissions. And also if we curate and concatenate fragments from various donors/samples into one file, do we accidentally introduce an error?

We can wait and see what we come across, but at the very least, the error message should state something like "parsing error, check that columns match schema definition" if it's too hard to get more specific than that

@brian-mott (Collaborator) commented Mar 6, 2025

> Doesn't this tell you what column had the error?

Yes, that at least mentions the right column

Contributor Author

Fixed the error message to be:

Error Parsing the fragment file. Check that columns match schema definition. Error:

with the message from the pandas or pyarrow error appended to the end.
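That behavior — a fixed prefix with the underlying parser error appended — can be sketched with a generic wrapper (the wrapper itself is hypothetical; in the PR the real convert_to_parquet call is what gets wrapped):

```python
def wrap_parse_error(convert):
    # Re-raise any conversion failure with the fixed prefix and the
    # original pandas/pyarrow message appended to the end.
    def inner(*args, **kwargs):
        try:
            return convert(*args, **kwargs)
        except Exception as e:
            raise ValueError(
                "Error Parsing the fragment file. Check that columns match "
                f"schema definition. Error: {e}"
            ) from e
    return inner
```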


def report_errors(header: str, errors: list[str]) -> list[str]:
    if any(errors):
        errors = [f"{i}: {e}" for i, e in enumerate(errors) if e is not None]
Collaborator

Enumerate is a nice touch but I'd say not needed as long as each error is its own string/prints on a new line. Makes it easier to just check against the error instead of an error string with the enumeration plus the error message.

@nayib-jose-gloria (Contributor) left a comment

left some more notes, but largely looks good. thanks!

@Bento007 Bento007 merged commit 50c98d3 into main Mar 6, 2025
11 of 14 checks passed
@Bento007 Bento007 deleted the tsmith/atac-feedback branch March 6, 2025 22:26
4 participants