Maintain author-supplied row ordering when appropriate #381

jlchang · 2025-01-28T14:02:49Z

Background

Outreach from a new study owner indicated their author DE results did not present as expected. Upon inspection, we learned that author DE results have been sorted by gene name. This is likely a side effect of presenting SCP-computed DE results using the sort order from Scanpy (which is sorted by Z-score) rather than re-sorting Scanpy results by the significance metric (it is common to find DE results where many of the genes have identical significance metric values [see SCP1671 DE results for CSN1S1_macrophages, LC1/2 or eosinophils). The Z-score information, ie. reflected in the original row ordering of the scanpy output, provides an additional layer of granularity for sorting). By removing default sorting by significance metric in the UI, we uncovered the default sorting (by gene name) of author DE data files.

Implementation details to note

Input row order significance detection

To determine if the author supplied row ordering is "significant" we assume biologically significant row order will have high correlation with the significance metric. Thus, we can use a spearman rho test to compare the relative order of the significance metric values when sorted by value order compared to sorting by input file row order. We chose a corrrelation threshold of 0.95 having tested several author DE files with 10,000 random permutations and only seeing a maximum correlation by random chance of 0.65 (95% percentile correlations were ~0.25).

Input row order persistence

The original author DE code creates files sorted by gene name with an unnamed index column (derived from the individual DE comparison) before the gene column.

For files where the input row order is deemed significant, the row number from the original file is the index in the newly generated file.

Sorting by significance metric where input row order has no value

For files where the input row order is assessed to have NO meaning (in the Demo study data, the genes use the same fixed order for every comparison), presenting the data sorted by gene name is not as usefu as presenting the data sorted by the significance metric. For consistency with the input-row order case, the index column for the significance-value-sorted file will also reflect the row number from the original DE file.

Manual testing

Set up for ingest testing as per usual (from ingest directory, run source ../scripts/setup-mongo-dev.sh)
run the following author DE ingest

python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 ingest_differential_expression --annotation-name General_Celltype --annotation-type group --annotation-scope study --cluster-name cluster_umap_txt --study-accession SCPdev --ingest-differential-expression --differential-expression-file ../tests/data/author_de/pval_lfc_pvaladj_seurat-like.tsv --method wilcoxon --gene-header genes --group-header group --size-metric avg_log2FC --significance-metric p_val_adj --comparison-group-header comparison_group

Confirm the top 10 lines of cluster_umap_txt--General_Celltype--B_cells--study--wilcoxon.tsv look like

	gene	avg_log2FC	p_val_adj	pct.2	pct.1	p_val	cluster
5	CLDN4	0.07032361297	0.007025241667	0.3627032424	0.6372967576	0.007025241667	1
11	CHRDL2	0.05399463925	0.0156040131	0.195191055	0.804808945	0.0156040131	4
0	PRLR	0.06029729764	0.05007578039	0.1001712149	0.8998287851	0.05007578039	1
2	STAT5A	0.0009241678384	0.1084025724	0.1621845379	0.8378154621	0.1084025724	4
35	PDCD1	0.045612242	0.1646225321	0.5064751238	0.4935248762	0.1646225321	10
33	EGFR	0.03486721322	0.1795388672	0.7998459957	0.2001540043	0.1795388672	1
9	LALBA	0.05503699554	0.4597031772	0.4101330993	0.5898669007	0.4597031772	10
1	CD68	0.006366543978	0.5267130168	0.388745595	0.611254405	0.5267130168	4
4	XDH	0.08113321809	0.533315441	0.8355921958	0.1644078042	0.533315441	5

[optional] test to confirm retention of original sort order

obtain SCP2729 DE file all_clusters_3KL_and_WT.csv
Run

python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 ingest_differential_expression --annotation-name identity --annotation-type group --annotation-scope study --cluster-name "3KL vs WT" --study-accession SCPdev --ingest-differential-expression --differential-expression-file all_clusters_3KL_and_WT.csv --method wilcoxon --gene-header gene --group-header cluster --size-metric avg_log2FC --significance-metric p_val_adj --comparison-group-header None

Confirm first 10 rows look like

	gene	avg_log2FC	p_val_adj	pct.2	pct.1	p_val	Unnamed: 0
4480	Vcam1	1.5609976662718	7.06846692990118e-217	0.04	0.674	5.2138872389918e-221	Vcam1.5
4481	Mrc1	1.92420455654343	6.22815061509542e-193	0.144	0.97	4.59404780932022e-197	Mrc1.5
4482	Stab1	1.4459749528187	6.094727027880639e-178	0.115	0.869	4.49563105988097e-182	Stab1.5
4483	Cd163	1.64873974238492	8.59263323687761e-176	0.075	0.733	6.3381524207992995e-180	Cd163.5
4484	F13a1	1.53882950196416	1.96877868179237e-166	0.084	0.754	1.45222297100566e-170	F13a1.6
4485	Csf1r	1.5618781564145	1.90565743646718e-165	0.165	0.949	1.40566307919686e-169	Csf1r.6
4486	Ifi207	1.27328746189283	2.69343733049381e-165	0.124	0.856	1.98675026222159e-169	Ifi207.6
4487	Dab2	1.34508938220505	6.53374651080585e-164	0.119	0.847	4.81946338482396e-168	Dab2.5
4488	Pf4	1.80331562775723	6.96639520732423e-163	0.143	0.907	5.13859645004369e-167	Pf4.6

also commiting linting changes

bistline

Looks good - manual tests pass. Thanks for the useful background on assessing the significance of the original order!

codecov · 2025-01-28T15:15:18Z

Codecov Report

Attention: Patch coverage is 89.74359% with 4 lines in your changes missing coverage. Please review.

Project coverage is 74.50%. Comparing base (655d7d1) to head (c6872ca).
Report is 11 commits behind head on development.

Files with missing lines	Patch %	Lines
ingest/author_de.py	89.74%	4 Missing ⚠️

Additional details and impacted files

@@               Coverage Diff               @@
##           development     #381      +/-   ##
===============================================
- Coverage        75.94%   74.50%   -1.45%     
===============================================
  Files               30       30              
  Lines             4469     4491      +22     
===============================================
- Hits              3394     3346      -48     
- Misses            1075     1145      +70

Files with missing lines	Coverage Δ
ingest/author_de.py	`94.83% <89.74%> (+0.45%)`	⬆️

... and 1 file with indirect coverage changes

keeps convention readable in Github does not affect metadata validation

eweitz

Code looks generally code! Thanks for resolving this complex issue.

I note a likely lurking bug. But it seems like a trivial fix and would only impact performance, and not by much in absolute terms, so I don't consider it blocking.

eweitz · 2025-01-28T18:04:57Z

ingest/validation/ontologies/version.txt

+<<<<<<< HEAD
 1737653567 # validation cache key
+=======
+1738072997 # validation cache key
+>>>>>>> 3262903a4d13e2ecacd8cac9b39f34f2449bc119


Committing the diff seems like it could cause a problem in CI, and perhaps cause clients to refetch ontologies on every upload UI page load.

Suggested change

<<<<<<< HEAD

1737653567 # validation cache key

=======

1738072997 # validation cache key

>>>>>>> 3262903a4d13e2ecacd8cac9b39f34f2449bc119

<<<<<<< HEAD

1737653567 # validation cache key

=======

1738072997 # validation cache key

>>>>>>> 3262903a4d13e2ecacd8cac9b39f34f2449bc119

Ah, thanks for catching that. The rest of the ontology files were compressed so they couldn't be merged and I missed that this file needed a merge conflict resolution. Fixed in c6872ca

jlchang added 5 commits January 24, 2025 09:05

added input index as 'order' column

24867b9

also commiting linting changes

retain input sort order if order correlated with sig metric

3729a2e

no need to explicitly sort order column

5beebdc

fix tests - row index changes in results)

3f22453

add assert to test order_not_significant

306463c

jlchang requested review from bistline and eweitz January 28, 2025 14:02

Update minified ontologies via GitHub Actions

3262903

bistline approved these changes Jan 28, 2025

View reviewed changes

jlchang added 2 commits January 28, 2025 09:50

Merge remote-tracking branch 'origin' into jlc_keep_de_sorted

89bd441

accept new minified ontologies

4b1d620

add attribute_description to has_electrophys/morphology

fa4b19a

keeps convention readable in Github does not affect metadata validation

eweitz approved these changes Jan 28, 2025

View reviewed changes

jlchang added the build failure: false positive label Jan 28, 2025

clean up unresolved merge conflict

c6872ca

jlchang merged commit 9185c71 into development Jan 28, 2025
3 of 4 checks passed

jlchang deleted the jlc_keep_de_sorted branch January 28, 2025 18:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Maintain author-supplied row ordering when appropriate #381

Maintain author-supplied row ordering when appropriate #381

jlchang commented Jan 28, 2025 •

edited

Loading

bistline left a comment

codecov bot commented Jan 28, 2025 •

edited

Loading

eweitz left a comment

eweitz Jan 28, 2025

jlchang Jan 28, 2025

Maintain author-supplied row ordering when appropriate #381

Maintain author-supplied row ordering when appropriate #381

Conversation

jlchang commented Jan 28, 2025 • edited Loading

Background

Implementation details to note

Input row order significance detection

Input row order persistence

Sorting by significance metric where input row order has no value

Manual testing

[optional] test to confirm retention of original sort order

bistline left a comment

Choose a reason for hiding this comment

codecov bot commented Jan 28, 2025 • edited Loading

Codecov Report

eweitz left a comment

Choose a reason for hiding this comment

eweitz Jan 28, 2025

Choose a reason for hiding this comment

jlchang Jan 28, 2025

Choose a reason for hiding this comment

jlchang commented Jan 28, 2025 •

edited

Loading

codecov bot commented Jan 28, 2025 •

edited

Loading