Skip to content

Files

This branch is up to date with popsim-consortium/analysis2:main.

workflows

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
Nov 12, 2024
Feb 2, 2023
Jul 18, 2024
Apr 4, 2023
Feb 16, 2022
Apr 4, 2023
Oct 2, 2024
Nov 12, 2024
Nov 12, 2024
May 13, 2024
Jul 18, 2024
Jul 18, 2024
Jul 18, 2024
Oct 2, 2024
Jul 18, 2024
Apr 16, 2024
Oct 10, 2023
Oct 2, 2024
Oct 2, 2024
Nov 12, 2024
Oct 2, 2024

Readme for N(t) with selection example

This directory contains the code to run analysis of demographic inference using multiple programs, with identical resulting data from any demographic model found in stdpopsim as input to each. This provides an example of how standardized population simulations may be used to compare methods softwares and selection assumptions.

The present analyses tests how demographic inference is influenced by selection (e.g, linked selection).

The Snakemake workflow includes the necessary pipeline for simulation, analyses, and plotting in an efficient manner. Simply choose your parameters in a config file, and let snakemake handle the rest. (For large runs, the use of a cluster is highly encouraged)

Workflow

The analysis includes three programs for predicting effective population size through time(n_t): msmc, stairwayplot, and smc++. There are four target rules that can be executed with the given parameters: compound_msmc, compound_smcpp, compound_stairwayplot, or you can run all three on the same simulated data with rule all.

To run an analysis, create a directory (wherever you want) where all results, and intermediate files will be stored. Next, create and place a file named config.json in it. The json file must contain key : value combos described below. An example might look like this:

{
    "seed" : 12345,
    "num_samples_per_population" : [20, 0, 0],
    "num_sampled_genomes_msmc" : "2,8",
    "num_msmc_iterations" : 20,
    "replicates" : 1, 
    "species" : "HomSap",
    "model" : "OutOfAfricaArchaicAdmixture_5R19",
    "genetic_map" : "HapMapII_GRCh37",
    "chrm_list" : "chr22",
    "mask_file" : "masks/HapmapII_GRCh37.mask.bed",
    "dfe_id": "Gamma_K17",
    "annotation_id":"ensembl_havana_104_exons,exons", 
    "slim_scaling_factor" : 1,
    ""
}

Once you have created a directory which contains the config file simply run snakemake from this directory (n_t), and point it to your analysis run directory, like so

snakemake -j 40 --config config="/projects/kernlab/jgallowa/homo_sapiens_Gutenkunst_0"

where -j is the number cores available to run jobs in parallel, and --config points to the directory that contains the config file.

Cluster environments

Our workflow can also be run on a cluster. To do so, it requires the setup of a .json configuration file that lets snakemake know about your cluster. We have provided an example of such a file in cluster_talapas.json that is for use with a University of Oregon SLURM cluster.

{
    "__default__" :
    {
        "time" : "02:30:00",
        "n" : 1,
        "partition" : "kern,preempt",
        "mem" : "16G",
        "cores" : "4",
    },
    "run_msmc" :
    {
        "time" : "05:00:00",
        "mem" : "32G",
        "cores" : "4",
    }
    
}

At a minimum, you should edit the names of the partition to match those on your own HPC. The workflow can then be launched with the call

snakemake -j 999 --config config="/projects/kernlab/jgallowa/homo_sapiens_Gutenkunst_0" --cluster-config cluster_talapas.json --cluster "sbatch -p {cluster.partition} -n {cluster.n} -t {cluster.time} --mem-per-cpu {cluster.mem} -c {cluster.cores}"

(it may prove useful to simply put this command in a bash script)

and jobs will be automatically farmed out to the cluster

Final output

The current final output is a plot comparing stairwayplot and smc++ estimates of N ( t ) , i.e., homo_sapiens_Gutenkunst/all_estimated_Ne_{selection}.png , with selection being 'none', 'all' or applied to the 'exons'.

Config file Parameter Description

seed : <class 'int'> This sets the seed such that any analysis configuration and run can be replicated exactly.

num_sampled_genomes_per_population : <class 'list'> This is the haploid number of genomes to simulate for an analysis run per population.

replicates : <class 'int'> The number of replicate simulations to run and analyze.

num_sampled_genomes_msmc : <class 'str'> This variable allows the user to run msmc independently on multiple sample sizes (to be specified as a comma-separated string of sample sizes). For each of these sample sizes, replicates number of analyses will be run with msmc.

num_msmc_iterations : <class 'int'> The number of iterations for each msmc run

output_dir : <class 'str'> This where all the intermediate files and results will be stored. This includes all plots produced, as well as all input/output files from the simulations and software runs organized into subdirectories by replicate seed.

species : <class 'module'> from stdpopsim such as stdpopsim.homo_sapiens. Chromosomes, demographics, and recombination maps should be derived from this.

model : <class childclass Model> This is the class that defines the demography associated with a species. All models should inherit from <class Model>

genetic_map : <class childclass GeneticMap> This will define the genetic map used for simulations.

chrm_list : <class 'str'> A string of the chromosome names you would like to simulate, separated by commas. All chromosomes simulated will be fed as a single input into each analysis by the inference programs, for each replicate. Set to "all" to simulate all chromosomes for the genome.

mask_file : <class 'str'> A string indicating the path to a bed formatted file with to be masked out of the analysis.

dfe_id: <class 'str'> A string indicating what Distribution of Fitness Effects to use. See the catalog to check all the possible DFEs.

annotation_id : <class 'str'>,<class 'str'> A string indicating where the DFE should be applied followed by the name of the data. If applied to the 'exons', then an annotation name needs to be given see here for humans. If set to 'none', a neutral model is applied. And if set to 'all' then selection is insert in the entire contig under subject.

annotation_id : <class 'int'>