Maintain author-supplied row ordering when appropriate #381
Looks good - manual tests pass. Thanks for the useful background on assessing the significance of the original order!
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## development #381 +/- ##
===============================================
- Coverage 75.94% 74.50% -1.45%
===============================================
Files 30 30
Lines 4469 4491 +22
===============================================
- Hits 3394 3346 -48
- Misses 1075 1145 +70
keeps convention readable in Github does not affect metadata validation
Code looks generally good! Thanks for resolving this complex issue.
I note a likely lurking bug. But it seems like a trivial fix and would only impact performance, and not by much in absolute terms, so I don't consider it blocking.
<<<<<<< HEAD
1737653567 # validation cache key
=======
1738072997 # validation cache key
>>>>>>> 3262903a4d13e2ecacd8cac9b39f34f2449bc119
Committing the diff seems like it could cause a problem in CI, and perhaps cause clients to refetch ontologies on every upload UI page load.
Ah, thanks for catching that. The rest of the ontology files were compressed, so they couldn't be merged, and I missed that this file needed a merge conflict resolution. Fixed in c6872ca
Background
Outreach from a new study owner indicated that their author DE results did not present as expected. Upon inspection, we learned that author DE results have been sorted by gene name. This is likely a side effect of presenting SCP-computed DE results using the sort order from Scanpy (which is sorted by Z-score) rather than re-sorting Scanpy results by the significance metric. It is common to find DE results where many of the genes have identical significance metric values (see SCP1671 DE results for CSN1S1_macrophages, LC1/2, or eosinophils). The Z-score information, i.e. as reflected in the original row ordering of the Scanpy output, provides an additional layer of granularity for sorting. By removing default sorting by significance metric in the UI, we uncovered the default sorting (by gene name) of author DE data files.
Implementation details to note
Input row order significance detection
To determine whether the author-supplied row ordering is "significant", we assume that a biologically significant row order will correlate highly with the significance metric. Thus, we can use Spearman's rho to compare the order of the significance metric values when sorted by value against their order in the input file. We chose a correlation threshold of 0.95 after testing several author DE files with 10,000 random permutations and seeing a maximum chance correlation of only 0.65 (95th-percentile correlations were ~0.25).
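The check described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in this PR: the function names (`spearman_rho`, `row_order_is_significant`, `max_permuted_rho`) are hypothetical, and it assumes a metric where smaller values are more significant (e.g. an adjusted p-value) with the most significant rows listed first.

```python
import random


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def row_order_is_significant(metric_values, threshold=0.95):
    """Compare input row position against the significance metric values.

    Assumption: smaller metric values are more significant, so a meaningful
    input order yields a strongly positive rho.
    """
    row_numbers = list(range(len(metric_values)))
    return spearman_rho(row_numbers, metric_values) >= threshold


def max_permuted_rho(metric_values, n_permutations=1000, seed=0):
    """Largest rho seen across random row orders, as a chance baseline."""
    rng = random.Random(seed)
    row_numbers = list(range(len(metric_values)))
    shuffled = list(metric_values)
    best = -1.0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        best = max(best, spearman_rho(row_numbers, shuffled))
    return best
```

For a metric where larger values are more significant, the sign of rho would need to be flipped (or its absolute value taken) before comparing against the threshold.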
Input row order persistence
The original author DE code creates files sorted by gene name with an unnamed index column (derived from the individual DE comparison) before the gene column.
For files where the input row order is deemed significant, the row number from the original file is the index in the newly generated file.
Sorting by significance metric where input row order has no value
For files where the input row order is assessed to have NO meaning (in the Demo study data, the genes use the same fixed order for every comparison), presenting the data sorted by gene name is not as useful as presenting the data sorted by the significance metric. For consistency with the input-row-order case, the index column for the significance-value-sorted file will also reflect the row number from the original DE file.
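Both output cases can be illustrated with a small sketch. It is an assumption-laden example rather than the PR's implementation: `index_rows` and `to_tsv` are hypothetical helpers, row numbers are assumed to be zero-based, and the metric is assumed to sort ascending (smaller is more significant).

```python
def index_rows(input_rows, metric, order_is_significant):
    """Pair each row with its row number from the original file.

    input_rows: list of dicts in original file order.
    If the input order is not significant, re-sort by the metric; the
    index still carries the original row number. Python's sort is stable,
    so rows with tied metric values keep their input order.
    """
    indexed = list(enumerate(input_rows))  # zero-based numbering (assumption)
    if not order_is_significant:
        indexed = sorted(indexed, key=lambda pair: float(pair[1][metric]))
    return indexed


def to_tsv(indexed_rows, columns):
    """Render rows as TSV with an unnamed index column before the others."""
    lines = ["\t".join([""] + columns)]
    for row_number, row in indexed_rows:
        lines.append("\t".join([str(row_number)] + [str(row[c]) for c in columns]))
    return "\n".join(lines) + "\n"


rows = [
    {"gene": "B", "pval": "0.5"},
    {"gene": "A", "pval": "0.01"},
    {"gene": "C", "pval": "0.2"},
]
kept = to_tsv(index_rows(rows, "pval", order_is_significant=True), ["gene", "pval"])
resorted = to_tsv(index_rows(rows, "pval", order_is_significant=False), ["gene", "pval"])
```

In `kept`, the rows stay in input order with index values 0, 1, 2; in `resorted`, the rows appear as A, C, B with index values 1, 2, 0, so the original position remains recoverable in either file.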
Manual testing
source ../scripts/setup-mongo-dev.sh
)cluster_umap_txt--General_Celltype--B_cells--study--wilcoxon.tsv
look like
[optional] test to confirm retention of original sort order
all_clusters_3KL_and_WT.csv