This directory contains the code to run analysis of
demographic inference using multiple programs, with
identical resulting data from any demographic model
found in stdpopsim
as input to each.
This provides an example of how standardized
population simulations may be used to compare methods
softwares and selection assumptions.
The present analyses tests how demographic inference is influenced by selection (e.g, linked selection).
The Snakemake workflow includes the necessary pipeline for simulation, analyses, and plotting in an efficient manner. Simply choose your parameters in a config file, and let snakemake handle the rest. (For large runs, the use of a cluster is highly encouraged)
The analysis includes three programs for predicting effective population
size through time(n_t
):
msmc,
stairwayplot, and
smc++.
There are four target rules that can be executed with the given parameters:
compound_msmc
,
compound_smcpp
,
compound_stairwayplot
,
or you can run all three on the same simulated data with rule all
.
To run an analysis, create a directory (wherever you want)
where all results, and intermediate
files will be stored. Next, create and place a file named config.json
in it.
The json file must contain key : value combos described below. An example
might look like this:
{
"seed" : 12345,
"num_samples_per_population" : [20, 0, 0],
"num_sampled_genomes_msmc" : "2,8",
"num_msmc_iterations" : 20,
"replicates" : 1,
"species" : "HomSap",
"model" : "OutOfAfricaArchaicAdmixture_5R19",
"genetic_map" : "HapMapII_GRCh37",
"chrm_list" : "chr22",
"mask_file" : "masks/HapmapII_GRCh37.mask.bed",
"dfe_id": "Gamma_K17",
"annotation_id":"ensembl_havana_104_exons,exons",
"slim_scaling_factor" : 1,
""
}
Once you have created a directory which contains the config file simply run snakemake from this directory (n_t), and point it to your analysis run directory, like so
snakemake -j 40 --config config="/projects/kernlab/jgallowa/homo_sapiens_Gutenkunst_0"
where -j
is the number cores available to run jobs in parallel, and
--config
points to the directory that contains the config file.
Our workflow can also be run on a cluster. To do so, it requires
the setup of a .json
configuration file that lets snakemake
know about your cluster. We have provided an example of
such a file in cluster_talapas.json
that is for use with a
University of Oregon SLURM cluster.
{
"__default__" :
{
"time" : "02:30:00",
"n" : 1,
"partition" : "kern,preempt",
"mem" : "16G",
"cores" : "4",
},
"run_msmc" :
{
"time" : "05:00:00",
"mem" : "32G",
"cores" : "4",
}
}
At a minimum, you should edit the names of the partition to match those on your own HPC. The workflow can then be launched with the call
snakemake -j 999 --config config="/projects/kernlab/jgallowa/homo_sapiens_Gutenkunst_0" --cluster-config cluster_talapas.json --cluster "sbatch -p {cluster.partition} -n {cluster.n} -t {cluster.time} --mem-per-cpu {cluster.mem} -c {cluster.cores}"
(it may prove useful to simply put this command in a bash script)
and jobs will be automatically farmed out to the cluster
The current final output is a plot comparing stairwayplot and smc++ estimates of homo_sapiens_Gutenkunst/all_estimated_Ne_{selection}.png
,
with selection being 'none', 'all' or applied to the 'exons'.
seed
: <class 'int'>
This sets the seed such that any analysis configuration
and run can be replicated exactly.
num_sampled_genomes_per_population
: <class 'list'>
This is the haploid number
of genomes to simulate for an analysis run per population.
replicates
: <class 'int'>
The number of replicate simulations to run and
analyze.
num_sampled_genomes_msmc
: <class 'str'>
This variable allows the user to run msmc independently on multiple
sample sizes (to be specified as a comma-separated string of sample sizes).
For each of these sample sizes,
replicates
number of analyses will be run with msmc
.
num_msmc_iterations
: <class 'int'>
The number of iterations for each msmc
run
output_dir
: <class 'str'>
This where all the intermediate files and results
will be stored. This includes all plots produced, as well as all input/output files
from the simulations and software runs organized into subdirectories by
replicate seed.
species
: <class 'module'>
from stdpopsim
such as stdpopsim.homo_sapiens
.
Chromosomes, demographics, and recombination maps should be derived from this.
model
: <class childclass Model>
This is the class that defines the demography associated with a species. All models
should inherit from <class Model>
genetic_map
: <class childclass GeneticMap>
This will define the genetic map
used for simulations.
chrm_list
: <class 'str'>
A string of the chromosome names you would like to simulate,
separated by commas. All chromosomes simulated will be fed
as a single input into each analysis by the inference programs, for each replicate.
Set to "all" to simulate all chromosomes for the genome.
mask_file
: <class 'str'>
A string indicating the path to a bed formatted file with
to be masked out of the analysis.
dfe_id
: <class 'str'>
A string indicating what Distribution of Fitness Effects to use.
See the catalog to check all the possible DFEs.
annotation_id
: <class 'str'>
,<class 'str'>
A string indicating where the DFE should be applied followed by the name of the data.
If applied to the 'exons', then an annotation name needs to be given see here for humans. If set to 'none',
a neutral model is applied. And if set to 'all' then selection is insert in the entire contig under subject.
annotation_id
: <class 'int'>