Skip to content

Commit 12b53b8

Browse files
committed
initial commit
0 parents  commit 12b53b8

File tree

97 files changed

+47821
-0
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

97 files changed

+47821
-0
lines changed

FitnessLandscapes-characteristics/fitness-landscapes-characts.R

Lines changed: 1213 additions & 0 deletions
Large diffs are not rendered by default.
194 KB
Binary file not shown.

LICENSE

Lines changed: 674 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 234 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,234 @@
1+
2+
# Code for "Conditional prediction of consecutive tumor evolution using cancer progression models: What genotype comes next?"
3+
FIXME: give URL of paper in bioRxiv
4+
5+
6+
<!-- We are addressing the performance of Cancer Progression Models (CPMs) as -->
7+
<!-- tools to predict cancer evolution. Even if their accuracy when evaluating -->
8+
<!-- complete evolutionary trajectories can be limited by a series of factors, -->
9+
<!-- they could still be useful when it comes to reconstructing subsets of said -->
10+
<!-- trajectories. -->
11+
12+
<!-- Thus, the question we are focusing on is the following: suppose that, at a -->
13+
<!-- certain time, we have been able to identify the most abundant genotype in -->
14+
<!-- a tumor, say _g<sub>n</sub>_ (where the subscript indicates _n_ mutated -->
15+
<!-- driver genes). Which genotype with _n_+1 mutated drivers is expected to -->
16+
<!-- arise next in the trajectory? -->
17+
18+
<!-- Note that the sequence of most abundant genotypes (path of maximums, POM) -->
19+
<!-- doesn't necessarily have to match the sequence of ancestors of the -->
20+
<!-- ultimately fixated genotype (line of descent, LOD). A better defined -->
21+
<!-- version of our question would be: given an observed genotype with _n_ -->
22+
<!-- mutations in the POM, what is the probability for each of all possible -->
23+
<!-- genotypes with _n_+1 mutations to be in the LOD? -->
24+
25+
<!-- To tackle this question, we are using a collection of simulated datasets -->
26+
<!-- generated using the [OncoSimulR -->
27+
<!-- package](https://bioconductor.org/packages/release/bioc/html/OncoSimulR.html). We -->
28+
<!-- are studying the predictions of various CPMs and comparing them to the -->
29+
<!-- actual evolutionary trajectories in the simulations. -->
30+
31+
## Main scripts and directories
32+
33+
Here we provide an overview of the main scripts and directories. Other
34+
files and directories will be mentioned, as needed, below. Where deemed of
35+
used for readers, we also provide .Rout files. RData and rds files are
36+
provided when their size permits.
37+
38+
Executing the following scripts in the indicated order and with default
39+
options creates a ./data directory within the current one where transition
40+
matrices extracted from both the OncoSimulR simulations and the CPMs are
41+
stored. It also generates a table.RData object containing comparisons
42+
across said matrices.
43+
44+
**oncoFunctions.R** - Contains functions called by many other scripts and some tests.
45+
46+
**next-genotype_transitionMatrix-sim.R** - Extracts transition matrices directly from simulations. By default, matrices are stored in the ./sim directory.
47+
48+
**next-genotype_transitionMatrix-cpm.R** - Extracts transition matrices from the outputs of the CPMs. By default, matrices are stored in the ./cpm directory.
49+
50+
**next-genotype_transitionMatrix-timeDiscretizedCPMs.R** - Works just like the previos script, but for Schill's MHN and the time-discretized versions of CBN_ot and MCCBN. Output matrices are also stored under the ./cpm directory by default.
51+
52+
**structData.R** - Takes the matrices under the ./sim and ./cpm directories (by default, but different directories can be provided) and groups them into nested lists stored in .RData objects and saved in the ./data directory (by default).
53+
54+
**makeTable.R** - Reads the content of the ./data directory (by default) and creates the table.rds file containing a data frame with the comparative statistics across transition matrices.
55+
56+
**oncoPlots.R** - Sourcing this file provides a set of functions that can
57+
be used to produce plots taking the table generated by makeTable as
58+
input. Not used.
59+
60+
**merge-additional-info.R** - Run after
61+
makeTable.R. Merges additional info and outputs table-replicates.rds
62+
(approx. 20 GB file).
63+
64+
**average-and-array-statistics.R** - Run after
65+
merge-additional-info.R. Computes per-array
66+
statistics (with different weighting schemes) and averages over
67+
replicates, genotypes, and number of mutations, appropriately.
68+
69+
**makeTableObservedFreqs.R** - Can be run at any time, but necessarily
70+
before merge-additional-info.R. Returns observed frequencies of genotypes
71+
in actual samples used.
72+
73+
**makeTableRanksFitness.R** - Can be run at any time, but necessarily
74+
before merge-additional-info.R. Returns fitness ranks of genotypes in
75+
fitness landscapes.
76+
77+
**makeTableLocalMax.R** - Can be run at any time, but necessarily
78+
before merge-additional-info.R. Returns frequency with which each genotype
79+
is a local maximum in simulations.
80+
81+
**makeTableWeights.R** - Can be run at any time, but necessarily
82+
before merge-additional-info.R. Returns the true frequency with which a
83+
genotype is the most abundant genotype during the simulations, as well as
84+
the frequency of each genotype in the complete 20000 samples. These two
85+
are the two sets of weights used.
86+
87+
**makeTable-to-diff-wrt-null-2.R** - Computes difference in JS between the
88+
null model and the method output.
89+
90+
**paired-wide-data.R** - Created wide (or paired-wide) data frames, so finding
91+
out best method or performing paired tests is easier.
92+
93+
**create_rank_data.R** - Rank methods and obtain the best method in each
94+
case. Requires wide data. Comparisons made at the number mutations, array,
95+
and single genotype (per replicate) case.
96+
97+
**create_wide_g_long_g.R** - Create minimal, concise and with factors,
98+
data sets for the glmertree models.
99+
100+
**data_for_weighted_glmertree_plots.R** - (under plots-glms) Create data
101+
set to produce plots from lmertree fits. This produces
102+
""data_for_weighted_glmertree_plots.RData".
103+
104+
**merge-compsfixNulls-wide_g.R** and **explore-reduce-method-comp.R**:
105+
comparisons among methods: merging with wide\_g data for lmertree
106+
fits. For some tables below. (E.g., in file when_good_conditions_0677.R)
107+
108+
**method-comp-biol-data.R** Comparison of predictions between methods on
109+
the biological data, under /run-biol-examples.
110+
111+
**./biol-data-plots** Directory with code for plots of the analyses of the
112+
cancer data sets.
113+
114+
115+
**plot-glms** Directory with code for fitting the lmertrees and plotting
116+
the results. Additionally, also scripts for producing tables in
117+
Supplementary material of performance when similar CE and TD
118+
(when_good_conditions_0677.R)
119+
120+
**run-biol-examples** Code and data to run the CPM analyses and obtain the
121+
transition matrices on the cancer data sets.
122+
123+
**biol-data-plots** Plots in manuscript and supplementary material for the
124+
output from the analyses on the cancer data sets.
125+
126+
**Schill-MHN** Code from Schill et al., for MHN plus additional code for
127+
us to obtain transition matrices. Since there is no license in the
128+
original repository (https://github.com/RudiSchill/MHN) we have not
129+
provided it. The code should be left under Shcill-MHN/MHN (without
130+
creating a new subdirectory)
131+
132+
133+
**CBN-with-bug-fix-and-likelihood** A directory that contains the
134+
compressed directory of H-CBN with a couple of bug fixes and enhancements
135+
by RDU, as explained in the Supplementary Material. The original code from
136+
https://bsse.ethz.ch/cbg/software/ct-cbn.html is under the GNU GPL (v. 3),
137+
the same used here. The authors of the code are Niko
138+
Beerenwinkel, Moritz Gerstung, and Seth Sullivant.
139+
140+
**Fitnesslandscape-characteristics** Code for plots of fitness landscapes
141+
characteristics. One of them is used in the Supplementary Material
142+
(gamma-rsign-obs-peaks-FL.pdf)
143+
144+
145+
## Pipeline
146+
147+
00. The time discretized CBN and MCCBN transition matrices as well as MHNs
148+
are needed for analyses below. (Before the next-genotype-... are
149+
run). Run Time-discretized-CBN-MCCBN/time-discretized-CBN.R and
150+
Schill-MHN/run-Schill-MHN-trans-mat.R
151+
152+
0. At any time, but before running merge-additional-info.R, run each of
153+
makeTableObservedFreqs.R, makeTableRanksFitness.R, makeTableLocalMax.R,
154+
and makeTableWeights.R.
155+
156+
Output:
157+
158+
- makeTableObservedFreqs.R: outObsFreqs.RData (~ 95 MB, but fast script)
159+
160+
- makeTableRanksFitness.R: outRanksFitness.RData. (~ 43 MB, but
161+
fast script)
162+
163+
- makeTableWeights.R: weightsAll.RData [plus intermediate RData:
164+
outfrerqsSampled and outFrerqsTrue]. (~ 17 MB; takes about 6 h)
165+
166+
- makeTableLocalMax.R: outLocalMax.RData (~ 560 KB; takes about 7 h)
167+
168+
169+
1. Run all three scripts next-genotype_transitionMatrix-*
170+
171+
2. Run structData
172+
173+
3. Run makeTable.R. The -scratch flag is needed the first time this script
174+
runs, then once the temporary file makeTable_tmp.rds file has been
175+
created, formatting changes can be made and implemented by running
176+
makeTable with no flags. If any upstream changes are made, a new
177+
-scratch run will be needed to incorporate them.
178+
179+
4. Run merge-additional-file-info.R. Input:
180+
pre-table-replicates.RData (plus fl_sampl.RData and the output from
181+
each of the files run in step 0). Output: table-replicates.rds.
182+
183+
5. Run average-and-array-statistics.R. Input:
184+
table-replicates.rds. Output: FIXME: zz, spell out.
185+
186+
6. At this point, the generated tables can be used for downstream analyses and plots.
187+
188+
7. Run makeTable-to-diff-wrt-null-2.R: Input:
189+
data_no_any_over_replicate.rds, array_statistics_over_replicate.rds,
190+
data_no_any.rds, num_mut_statistics_over_replicate.rds. Output:
191+
genotype_diff_null.RData, array_diff_null.RData,
192+
genotype_diff_null_no_average.RData, num_mut_diff_null.RData.
193+
194+
8. Run paired-data.R. Input: genotype_diff_null.RData,
195+
array_diff_null.RData, genotype_diff_null_no_average.RData,
196+
num_mut_diff_null.RData. Output: wide_array.RData, wide_genotype.RData,
197+
wide_genotype_no_average.RData, wide_num_mut.RData.
198+
199+
9. Run create_rank_data.R, then create wide_g_long_g.R. Note that we
200+
eliminate cases with sampledProp == 0 (create_wide_g_long_g.R), then
201+
those with sampledProp < 1e-3 (fit-lmtree scripts and
202+
modelse-beta-regression.R), as well as the last genotype ---7
203+
mutations or 10 mutations, for landscapes with 7 and 10 loci,
204+
respectively--- (fit-lmtree scripts and modelse-beta-regression.R).
205+
206+
10. Run the lmertree fits.
207+
In subdirectory plots-glms/fits-Weighted-01-sqrt. A launch.sh will launch the
208+
lmertree fits. Once done, plotting with plotting-lmertree-fits-weighted-01-sqrt.R
209+
210+
211+
11. Analyze the cancer data sets. Under ./run-biol-examples
212+
213+
- Launch analyses with CPMs. See details under README.txt. File
214+
launch-run-rerun.sh launches all R processes.
215+
216+
- Once analyses with CPMs are completed, compare output:
217+
method-comp-biol-data.R
218+
219+
220+
12. Run the two files under ./biol-data-plots to generate the plots for
221+
the cancer data sets (main manuscript and supplementary material).
222+
223+
224+
## Contact
225+
226+
For questions or comments, please contact
227+
[J. Diaz-Colunga](mailto:[email protected]) or [R. Diaz-Uriarte](mailto:[email protected]).
228+
229+
<!-- This repository is currently maintained by [Juan -->
230+
<!-- Díaz-Colunga](jdiazc9.github.io). For questions or comments, please -->
231+
<!-- contact by [mail](mailto:[email protected]) or [mail](mailto:[email protected]). -->
232+
233+
234+

0 commit comments

Comments
 (0)