|
| 1 | + |
| 2 | +# Code for "Conditional prediction of consecutive tumor evolution using cancer progression models: What genotype comes next?" |
| 3 | +FIXME: give URL of paper in bioRxiv |
| 4 | + |
| 5 | + |
| 6 | +<!-- We are addressing the performance of Cancer Progression Models (CPMs) as --> |
| 7 | +<!-- tools to predict cancer evolution. Even if their accuracy when evaluating --> |
| 8 | +<!-- complete evolutionary trajectories can be limited by a series of factors, --> |
| 9 | +<!-- they could still be useful when it comes to reconstructing subsets of said --> |
| 10 | +<!-- trajectories. --> |
| 11 | + |
| 12 | +<!-- Thus, the question we are focusing on is the following: suppose that, at a --> |
| 13 | +<!-- certain time, we have been able to identify the most abundant genotype in --> |
| 14 | +<!-- a tumor, say _g<sub>n</sub>_ (where the subscript indicates _n_ mutated --> |
| 15 | +<!-- driver genes). Which genotype with _n_+1 mutated drivers is expected to --> |
| 16 | +<!-- arise next in the trajectory? --> |
| 17 | + |
| 18 | +<!-- Note that the sequence of most abundant genotypes (path of maximums, POM) --> |
| 19 | +<!-- doesn't necessarily have to match the sequence of ancestors of the --> |
| 20 | +<!-- ultimately fixated genotype (line of descent, LOD). A better defined --> |
| 21 | +<!-- version of our question would be: given an observed genotype with _n_ --> |
| 22 | +<!-- mutations in the POM, what is the probability for each of all possible --> |
| 23 | +<!-- genotypes with _n_+1 mutations to be in the LOD? --> |
| 24 | + |
| 25 | +<!-- To tackle this question, we are using a collection of simulated datasets --> |
| 26 | +<!-- generated using the [OncoSimulR --> |
| 27 | +<!-- package](https://bioconductor.org/packages/release/bioc/html/OncoSimulR.html). We --> |
| 28 | +<!-- are studying the predictions of various CPMs and comparing them to the --> |
| 29 | +<!-- actual evolutionary trajectories in the simulations. --> |
| 30 | + |
| 31 | +## Main scripts and directories |
| 32 | + |
| 33 | +Here we provide an overview of the main scripts and directories. Other |
| 34 | +files and directories will be mentioned, as needed, below. Where deemed of |
| 35 | +used for readers, we also provide .Rout files. RData and rds files are |
| 36 | +provided when their size permits. |
| 37 | + |
| 38 | +Executing the following scripts in the indicated order and with default |
| 39 | +options creates a ./data directory within the current one where transition |
| 40 | +matrices extracted from both the OncoSimulR simulations and the CPMs are |
| 41 | +stored. It also generates a table.RData object containing comparisons |
| 42 | +across said matrices. |
| 43 | + |
| 44 | +**oncoFunctions.R** - Contains functions called by many other scripts and some tests. |
| 45 | + |
| 46 | +**next-genotype_transitionMatrix-sim.R** - Extracts transition matrices directly from simulations. By default, matrices are stored in the ./sim directory. |
| 47 | + |
| 48 | +**next-genotype_transitionMatrix-cpm.R** - Extracts transition matrices from the outputs of the CPMs. By default, matrices are stored in the ./cpm directory. |
| 49 | + |
| 50 | +**next-genotype_transitionMatrix-timeDiscretizedCPMs.R** - Works just like the previos script, but for Schill's MHN and the time-discretized versions of CBN_ot and MCCBN. Output matrices are also stored under the ./cpm directory by default. |
| 51 | + |
| 52 | +**structData.R** - Takes the matrices under the ./sim and ./cpm directories (by default, but different directories can be provided) and groups them into nested lists stored in .RData objects and saved in the ./data directory (by default). |
| 53 | + |
| 54 | +**makeTable.R** - Reads the content of the ./data directory (by default) and creates the table.rds file containing a data frame with the comparative statistics across transition matrices. |
| 55 | + |
| 56 | +**oncoPlots.R** - Sourcing this file provides a set of functions that can |
| 57 | +be used to produce plots taking the table generated by makeTable as |
| 58 | +input. Not used. |
| 59 | + |
| 60 | +**merge-additional-info.R** - Run after |
| 61 | +makeTable.R. Merges additional info and outputs table-replicates.rds |
| 62 | +(approx. 20 GB file). |
| 63 | + |
| 64 | +**average-and-array-statistics.R** - Run after |
| 65 | +merge-additional-info.R. Computes per-array |
| 66 | +statistics (with different weighting schemes) and averages over |
| 67 | +replicates, genotypes, and number of mutations, appropriately. |
| 68 | + |
| 69 | +**makeTableObservedFreqs.R** - Can be run at any time, but necessarily |
| 70 | +before merge-additional-info.R. Returns observed frequencies of genotypes |
| 71 | +in actual samples used. |
| 72 | + |
| 73 | +**makeTableRanksFitness.R** - Can be run at any time, but necessarily |
| 74 | +before merge-additional-info.R. Returns fitness ranks of genotypes in |
| 75 | +fitness landscapes. |
| 76 | + |
| 77 | +**makeTableLocalMax.R** - Can be run at any time, but necessarily |
| 78 | +before merge-additional-info.R. Returns frequency with which each genotype |
| 79 | +is a local maximum in simulations. |
| 80 | + |
| 81 | +**makeTableWeights.R** - Can be run at any time, but necessarily |
| 82 | +before merge-additional-info.R. Returns the true frequency with which a |
| 83 | +genotype is the most abundant genotype during the simulations, as well as |
| 84 | +the frequency of each genotype in the complete 20000 samples. These two |
| 85 | +are the two sets of weights used. |
| 86 | + |
| 87 | +**makeTable-to-diff-wrt-null-2.R** - Computes difference in JS between the |
| 88 | +null model and the method output. |
| 89 | + |
| 90 | +**paired-wide-data.R** - Created wide (or paired-wide) data frames, so finding |
| 91 | +out best method or performing paired tests is easier. |
| 92 | + |
| 93 | +**create_rank_data.R** - Rank methods and obtain the best method in each |
| 94 | +case. Requires wide data. Comparisons made at the number mutations, array, |
| 95 | +and single genotype (per replicate) case. |
| 96 | + |
| 97 | +**create_wide_g_long_g.R** - Create minimal, concise and with factors, |
| 98 | +data sets for the glmertree models. |
| 99 | + |
| 100 | +**data_for_weighted_glmertree_plots.R** - (under plots-glms) Create data |
| 101 | +set to produce plots from lmertree fits. This produces |
| 102 | +""data_for_weighted_glmertree_plots.RData". |
| 103 | + |
| 104 | +**merge-compsfixNulls-wide_g.R** and **explore-reduce-method-comp.R**: |
| 105 | +comparisons among methods: merging with wide\_g data for lmertree |
| 106 | +fits. For some tables below. (E.g., in file when_good_conditions_0677.R) |
| 107 | + |
| 108 | +**method-comp-biol-data.R** Comparison of predictions between methods on |
| 109 | +the biological data, under /run-biol-examples. |
| 110 | + |
| 111 | +**./biol-data-plots** Directory with code for plots of the analyses of the |
| 112 | +cancer data sets. |
| 113 | + |
| 114 | + |
| 115 | +**plot-glms** Directory with code for fitting the lmertrees and plotting |
| 116 | +the results. Additionally, also scripts for producing tables in |
| 117 | +Supplementary material of performance when similar CE and TD |
| 118 | +(when_good_conditions_0677.R) |
| 119 | + |
| 120 | +**run-biol-examples** Code and data to run the CPM analyses and obtain the |
| 121 | +transition matrices on the cancer data sets. |
| 122 | + |
| 123 | +**biol-data-plots** Plots in manuscript and supplementary material for the |
| 124 | +output from the analyses on the cancer data sets. |
| 125 | + |
| 126 | +**Schill-MHN** Code from Schill et al., for MHN plus additional code for |
| 127 | +us to obtain transition matrices. Since there is no license in the |
| 128 | +original repository (https://github.com/RudiSchill/MHN) we have not |
| 129 | +provided it. The code should be left under Shcill-MHN/MHN (without |
| 130 | +creating a new subdirectory) |
| 131 | + |
| 132 | + |
| 133 | +**CBN-with-bug-fix-and-likelihood** A directory that contains the |
| 134 | +compressed directory of H-CBN with a couple of bug fixes and enhancements |
| 135 | +by RDU, as explained in the Supplementary Material. The original code from |
| 136 | +https://bsse.ethz.ch/cbg/software/ct-cbn.html is under the GNU GPL (v. 3), |
| 137 | +the same used here. The authors of the code are Niko |
| 138 | +Beerenwinkel, Moritz Gerstung, and Seth Sullivant. |
| 139 | + |
| 140 | +**Fitnesslandscape-characteristics** Code for plots of fitness landscapes |
| 141 | +characteristics. One of them is used in the Supplementary Material |
| 142 | +(gamma-rsign-obs-peaks-FL.pdf) |
| 143 | + |
| 144 | + |
| 145 | +## Pipeline |
| 146 | + |
| 147 | +00. The time discretized CBN and MCCBN transition matrices as well as MHNs |
| 148 | + are needed for analyses below. (Before the next-genotype-... are |
| 149 | + run). Run Time-discretized-CBN-MCCBN/time-discretized-CBN.R and |
| 150 | + Schill-MHN/run-Schill-MHN-trans-mat.R |
| 151 | + |
| 152 | +0. At any time, but before running merge-additional-info.R, run each of |
| 153 | + makeTableObservedFreqs.R, makeTableRanksFitness.R, makeTableLocalMax.R, |
| 154 | + and makeTableWeights.R. |
| 155 | + |
| 156 | + Output: |
| 157 | + |
| 158 | + - makeTableObservedFreqs.R: outObsFreqs.RData (~ 95 MB, but fast script) |
| 159 | + |
| 160 | + - makeTableRanksFitness.R: outRanksFitness.RData. (~ 43 MB, but |
| 161 | + fast script) |
| 162 | + |
| 163 | + - makeTableWeights.R: weightsAll.RData [plus intermediate RData: |
| 164 | + outfrerqsSampled and outFrerqsTrue]. (~ 17 MB; takes about 6 h) |
| 165 | + |
| 166 | + - makeTableLocalMax.R: outLocalMax.RData (~ 560 KB; takes about 7 h) |
| 167 | + |
| 168 | + |
| 169 | +1. Run all three scripts next-genotype_transitionMatrix-* |
| 170 | + |
| 171 | +2. Run structData |
| 172 | + |
| 173 | +3. Run makeTable.R. The -scratch flag is needed the first time this script |
| 174 | + runs, then once the temporary file makeTable_tmp.rds file has been |
| 175 | + created, formatting changes can be made and implemented by running |
| 176 | + makeTable with no flags. If any upstream changes are made, a new |
| 177 | + -scratch run will be needed to incorporate them. |
| 178 | + |
| 179 | +4. Run merge-additional-file-info.R. Input: |
| 180 | + pre-table-replicates.RData (plus fl_sampl.RData and the output from |
| 181 | + each of the files run in step 0). Output: table-replicates.rds. |
| 182 | + |
| 183 | +5. Run average-and-array-statistics.R. Input: |
| 184 | + table-replicates.rds. Output: FIXME: zz, spell out. |
| 185 | + |
| 186 | +6. At this point, the generated tables can be used for downstream analyses and plots. |
| 187 | + |
| 188 | +7. Run makeTable-to-diff-wrt-null-2.R: Input: |
| 189 | + data_no_any_over_replicate.rds, array_statistics_over_replicate.rds, |
| 190 | + data_no_any.rds, num_mut_statistics_over_replicate.rds. Output: |
| 191 | + genotype_diff_null.RData, array_diff_null.RData, |
| 192 | + genotype_diff_null_no_average.RData, num_mut_diff_null.RData. |
| 193 | + |
| 194 | +8. Run paired-data.R. Input: genotype_diff_null.RData, |
| 195 | + array_diff_null.RData, genotype_diff_null_no_average.RData, |
| 196 | + num_mut_diff_null.RData. Output: wide_array.RData, wide_genotype.RData, |
| 197 | + wide_genotype_no_average.RData, wide_num_mut.RData. |
| 198 | + |
| 199 | +9. Run create_rank_data.R, then create wide_g_long_g.R. Note that we |
| 200 | + eliminate cases with sampledProp == 0 (create_wide_g_long_g.R), then |
| 201 | + those with sampledProp < 1e-3 (fit-lmtree scripts and |
| 202 | + modelse-beta-regression.R), as well as the last genotype ---7 |
| 203 | + mutations or 10 mutations, for landscapes with 7 and 10 loci, |
| 204 | + respectively--- (fit-lmtree scripts and modelse-beta-regression.R). |
| 205 | + |
| 206 | +10. Run the lmertree fits. |
| 207 | + In subdirectory plots-glms/fits-Weighted-01-sqrt. A launch.sh will launch the |
| 208 | + lmertree fits. Once done, plotting with plotting-lmertree-fits-weighted-01-sqrt.R |
| 209 | + |
| 210 | + |
| 211 | +11. Analyze the cancer data sets. Under ./run-biol-examples |
| 212 | + |
| 213 | + - Launch analyses with CPMs. See details under README.txt. File |
| 214 | + launch-run-rerun.sh launches all R processes. |
| 215 | + |
| 216 | + - Once analyses with CPMs are completed, compare output: |
| 217 | + method-comp-biol-data.R |
| 218 | + |
| 219 | + |
| 220 | +12. Run the two files under ./biol-data-plots to generate the plots for |
| 221 | +the cancer data sets (main manuscript and supplementary material). |
| 222 | + |
| 223 | + |
| 224 | +## Contact |
| 225 | + |
| 226 | +For questions or comments, please contact |
| 227 | +[J. Diaz-Colunga ](mailto:[email protected]) or [R. Diaz-Uriarte ](mailto:[email protected]). |
| 228 | + |
| 229 | +<!-- This repository is currently maintained by [Juan --> |
| 230 | +<!-- Díaz-Colunga](jdiazc9.github.io). For questions or comments, please --> |
| 231 | +<!-- contact by [mail](mailto:[email protected]) or [mail](mailto:[email protected]). --> |
| 232 | + |
| 233 | + |
| 234 | + |
0 commit comments