Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maintain author-supplied row ordering when appropriate #381

Merged
merged 10 commits into from
Jan 28, 2025

Conversation

jlchang
Copy link
Collaborator

@jlchang jlchang commented Jan 28, 2025

Background

Outreach from a new study owner indicated their author DE results did not present as expected. Upon inspection, we learned that author DE results have been sorted by gene name. This is likely a side effect of presenting SCP-computed DE results using the sort order from Scanpy (which is sorted by Z-score) rather than re-sorting Scanpy results by the significance metric (it is common to find DE results where many of the genes have identical significance metric values [see SCP1671 DE results for CSN1S1_macrophages, LC1/2 or eosinophils). The Z-score information, ie. reflected in the original row ordering of the scanpy output, provides an additional layer of granularity for sorting). By removing default sorting by significance metric in the UI, we uncovered the default sorting (by gene name) of author DE data files.

Implementation details to note

Input row order significance detection

To determine if the author supplied row ordering is "significant" we assume biologically significant row order will have high correlation with the significance metric. Thus, we can use a spearman rho test to compare the relative order of the significance metric values when sorted by value order compared to sorting by input file row order. We chose a corrrelation threshold of 0.95 having tested several author DE files with 10,000 random permutations and only seeing a maximum correlation by random chance of 0.65 (95% percentile correlations were ~0.25).

Input row order persistence

The original author DE code creates files sorted by gene name with an unnamed index column (derived from the individual DE comparison) before the gene column.

For files where the input row order is deemed significant, the row number from the original file is the index in the newly generated file.

Sorting by significance metric where input row order has no value

For files where the input row order is assessed to have NO meaning (in the Demo study data, the genes use the same fixed order for every comparison), presenting the data sorted by gene name is not as usefu as presenting the data sorted by the significance metric. For consistency with the input-row order case, the index column for the significance-value-sorted file will also reflect the row number from the original DE file.

Manual testing

  1. Set up for ingest testing as per usual (from ingest directory, run source ../scripts/setup-mongo-dev.sh)
  2. run the following author DE ingest
python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 ingest_differential_expression --annotation-name General_Celltype --annotation-type group --annotation-scope study --cluster-name cluster_umap_txt --study-accession SCPdev --ingest-differential-expression --differential-expression-file ../tests/data/author_de/pval_lfc_pvaladj_seurat-like.tsv --method wilcoxon --gene-header genes --group-header group --size-metric avg_log2FC --significance-metric p_val_adj --comparison-group-header comparison_group

  1. Confirm the top 10 lines of cluster_umap_txt--General_Celltype--B_cells--study--wilcoxon.tsv look like
	gene	avg_log2FC	p_val_adj	pct.2	pct.1	p_val	cluster
5	CLDN4	0.07032361297	0.007025241667	0.3627032424	0.6372967576	0.007025241667	1
11	CHRDL2	0.05399463925	0.0156040131	0.195191055	0.804808945	0.0156040131	4
0	PRLR	0.06029729764	0.05007578039	0.1001712149	0.8998287851	0.05007578039	1
2	STAT5A	0.0009241678384	0.1084025724	0.1621845379	0.8378154621	0.1084025724	4
35	PDCD1	0.045612242	0.1646225321	0.5064751238	0.4935248762	0.1646225321	10
33	EGFR	0.03486721322	0.1795388672	0.7998459957	0.2001540043	0.1795388672	1
9	LALBA	0.05503699554	0.4597031772	0.4101330993	0.5898669007	0.4597031772	10
1	CD68	0.006366543978	0.5267130168	0.388745595	0.611254405	0.5267130168	4
4	XDH	0.08113321809	0.533315441	0.8355921958	0.1644078042	0.533315441	5

[optional] test to confirm retention of original sort order

  1. obtain SCP2729 DE file all_clusters_3KL_and_WT.csv
  2. Run
python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 ingest_differential_expression --annotation-name identity --annotation-type group --annotation-scope study --cluster-name "3KL vs WT" --study-accession SCPdev --ingest-differential-expression --differential-expression-file all_clusters_3KL_and_WT.csv --method wilcoxon --gene-header gene --group-header cluster --size-metric avg_log2FC --significance-metric p_val_adj --comparison-group-header None
  1. Confirm first 10 rows look like
	gene	avg_log2FC	p_val_adj	pct.2	pct.1	p_val	Unnamed: 0
4480	Vcam1	1.5609976662718	7.06846692990118e-217	0.04	0.674	5.2138872389918e-221	Vcam1.5
4481	Mrc1	1.92420455654343	6.22815061509542e-193	0.144	0.97	4.59404780932022e-197	Mrc1.5
4482	Stab1	1.4459749528187	6.094727027880639e-178	0.115	0.869	4.49563105988097e-182	Stab1.5
4483	Cd163	1.64873974238492	8.59263323687761e-176	0.075	0.733	6.3381524207992995e-180	Cd163.5
4484	F13a1	1.53882950196416	1.96877868179237e-166	0.084	0.754	1.45222297100566e-170	F13a1.6
4485	Csf1r	1.5618781564145	1.90565743646718e-165	0.165	0.949	1.40566307919686e-169	Csf1r.6
4486	Ifi207	1.27328746189283	2.69343733049381e-165	0.124	0.856	1.98675026222159e-169	Ifi207.6
4487	Dab2	1.34508938220505	6.53374651080585e-164	0.119	0.847	4.81946338482396e-168	Dab2.5
4488	Pf4	1.80331562775723	6.96639520732423e-163	0.143	0.907	5.13859645004369e-167	Pf4.6

@jlchang jlchang requested review from bistline and eweitz January 28, 2025 14:02
Copy link
Contributor

@bistline bistline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - manual tests pass. Thanks for the useful background on assessing the significance of the original order!

Copy link

codecov bot commented Jan 28, 2025

Codecov Report

Attention: Patch coverage is 89.74359% with 4 lines in your changes missing coverage. Please review.

Project coverage is 74.50%. Comparing base (655d7d1) to head (c6872ca).
Report is 11 commits behind head on development.

Files with missing lines Patch % Lines
ingest/author_de.py 89.74% 4 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@               Coverage Diff               @@
##           development     #381      +/-   ##
===============================================
- Coverage        75.94%   74.50%   -1.45%     
===============================================
  Files               30       30              
  Lines             4469     4491      +22     
===============================================
- Hits              3394     3346      -48     
- Misses            1075     1145      +70     
Files with missing lines Coverage Δ
ingest/author_de.py 94.83% <89.74%> (+0.45%) ⬆️

... and 1 file with indirect coverage changes

keeps convention readable in Github

does not affect metadata validation
Copy link
Member

@eweitz eweitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks generally code! Thanks for resolving this complex issue.

I note a likely lurking bug. But it seems like a trivial fix and would only impact performance, and not by much in absolute terms, so I don't consider it blocking.

Comment on lines 1 to 5
<<<<<<< HEAD
1737653567 # validation cache key
=======
1738072997 # validation cache key
>>>>>>> 3262903a4d13e2ecacd8cac9b39f34f2449bc119
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Committing the diff seems like it could cause a problem in CI, and perhaps cause clients to refetch ontologies on every upload UI page load.

Suggested change
<<<<<<< HEAD
1737653567 # validation cache key
=======
1738072997 # validation cache key
>>>>>>> 3262903a4d13e2ecacd8cac9b39f34f2449bc119
<<<<<<< HEAD
1737653567 # validation cache key
=======
1738072997 # validation cache key
>>>>>>> 3262903a4d13e2ecacd8cac9b39f34f2449bc119

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for catching that. The rest of the ontology files were compressed so they couldn't be merged and I missed that this file needed a merge conflict resolution. Fixed in c6872ca

@jlchang jlchang merged commit 9185c71 into development Jan 28, 2025
3 of 4 checks passed
@jlchang jlchang deleted the jlc_keep_de_sorted branch January 28, 2025 18:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants