workflows

various updates to sweep pipeline (popsim-consortium#116 )

Nov 12, 2024

8ad0c0a · Nov 12, 2024

This branch is up to date with popsim-consortium/analysis2:main.

Name	Name	Last commit message	Last commit date
parent directory ..
config	config	various updates to sweep pipeline (popsim-consortium#116 )	Nov 12, 2024
graphs	graphs	Demo multpop fix (popsim-consortium#85 )	Feb 2, 2023
masks	masks	edits related to production pipeline (popsim-consortium#114 )	Jul 18, 2024
test	test	using updated genetic map; cleaned up some unused parameters	Apr 4, 2023
.genetic_map_downloaded	.genetic_map_downloaded	Merge analysis flow (popsim-consortium#40 )	Feb 16, 2022
README.md	README.md	using updated genetic map; cleaned up some unused parameters	Apr 4, 2023
dfe.snake	dfe.snake	Summary stat fix (popsim-consortium#120 )	Oct 2, 2024
diploshic.snake	diploshic.snake	various updates to sweep pipeline (popsim-consortium#116 )	Nov 12, 2024
diploshic_params.yaml	diploshic_params.yaml	various updates to sweep pipeline (popsim-consortium#116 )	Nov 12, 2024
gone.py	gone.py	updates o get PhoSin and HomSap working (popsim-consortium#112 )	May 13, 2024
masks.py	masks.py	edits related to production pipeline (popsim-consortium#114 )	Jul 18, 2024
msmc.py	msmc.py	edits related to production pipeline (popsim-consortium#114 )	Jul 18, 2024
n_t.snake	n_t.snake	edits related to production pipeline (popsim-consortium#114 )	Jul 18, 2024
plots.py	plots.py	Summary stat fix (popsim-consortium#120 )	Oct 2, 2024
simulation.snake	simulation.snake	edits related to production pipeline (popsim-consortium#114 )	Jul 18, 2024
smc.py	smc.py	edits to get tiny config working on talapas (popsim-consortium#110 )	Apr 16, 2024
stairway.py	stairway.py	update stairway.py	Oct 10, 2023
summary_stats.sh	summary_stats.sh	Summary stat fix (popsim-consortium#120 )	Oct 2, 2024
summary_stats.snake	summary_stats.snake	Summary stat fix (popsim-consortium#120 )	Oct 2, 2024
sweep_simulate.snake	sweep_simulate.snake	various updates to sweep pipeline (popsim-consortium#116 )	Nov 12, 2024
ts2fs.py	ts2fs.py	Summary stat fix (popsim-consortium#120 )	Oct 2, 2024

README.md

Readme for N(t) with selection example

This directory contains the code to run analysis of demographic inference using multiple programs, with identical resulting data from any demographic model found in stdpopsim as input to each. This provides an example of how standardized population simulations may be used to compare methods softwares and selection assumptions.

The present analyses tests how demographic inference is influenced by selection (e.g, linked selection).

The Snakemake workflow includes the necessary pipeline for simulation, analyses, and plotting in an efficient manner. Simply choose your parameters in a config file, and let snakemake handle the rest. (For large runs, the use of a cluster is highly encouraged)

Workflow

The analysis includes three programs for predicting effective population size through time(n_t): msmc, stairwayplot, and smc++. There are four target rules that can be executed with the given parameters: compound_msmc, compound_smcpp, compound_stairwayplot, or you can run all three on the same simulated data with rule all.

To run an analysis, create a directory (wherever you want) where all results, and intermediate files will be stored. Next, create and place a file named config.json in it. The json file must contain key : value combos described below. An example might look like this:

{
    "seed" : 12345,
    "num_samples_per_population" : [20, 0, 0],
    "num_sampled_genomes_msmc" : "2,8",
    "num_msmc_iterations" : 20,
    "replicates" : 1, 
    "species" : "HomSap",
    "model" : "OutOfAfricaArchaicAdmixture_5R19",
    "genetic_map" : "HapMapII_GRCh37",
    "chrm_list" : "chr22",
    "mask_file" : "masks/HapmapII_GRCh37.mask.bed",
    "dfe_id": "Gamma_K17",
    "annotation_id":"ensembl_havana_104_exons,exons", 
    "slim_scaling_factor" : 1,
    ""
}

Once you have created a directory which contains the config file simply run snakemake from this directory (n_t), and point it to your analysis run directory, like so

snakemake -j 40 --config config="/projects/kernlab/jgallowa/homo_sapiens_Gutenkunst_0"

where -j is the number cores available to run jobs in parallel, and --config points to the directory that contains the config file.

Cluster environments

Our workflow can also be run on a cluster. To do so, it requires the setup of a .json configuration file that lets snakemake know about your cluster. We have provided an example of such a file in cluster_talapas.json that is for use with a University of Oregon SLURM cluster.

{
    "__default__" :
    {
        "time" : "02:30:00",
        "n" : 1,
        "partition" : "kern,preempt",
        "mem" : "16G",
        "cores" : "4",
    },
    "run_msmc" :
    {
        "time" : "05:00:00",
        "mem" : "32G",
        "cores" : "4",
    }
    
}

At a minimum, you should edit the names of the partition to match those on your own HPC. The workflow can then be launched with the call

snakemake -j 999 --config config="/projects/kernlab/jgallowa/homo_sapiens_Gutenkunst_0" --cluster-config cluster_talapas.json --cluster "sbatch -p {cluster.partition} -n {cluster.n} -t {cluster.time} --mem-per-cpu {cluster.mem} -c {cluster.cores}"

(it may prove useful to simply put this command in a bash script)

and jobs will be automatically farmed out to the cluster

Final output

The current final output is a plot comparing stairwayplot and smc++ estimates of $N (t)$ , i.e., homo_sapiens_Gutenkunst/all_estimated_Ne_{selection}.png , with selection being 'none', 'all' or applied to the 'exons'.

Config file Parameter Description

seed : <class 'int'> This sets the seed such that any analysis configuration and run can be replicated exactly.

num_sampled_genomes_per_population : <class 'list'> This is the haploid number of genomes to simulate for an analysis run per population.

replicates : <class 'int'> The number of replicate simulations to run and analyze.

num_sampled_genomes_msmc : <class 'str'> This variable allows the user to run msmc independently on multiple sample sizes (to be specified as a comma-separated string of sample sizes). For each of these sample sizes, replicates number of analyses will be run with msmc.

num_msmc_iterations : <class 'int'> The number of iterations for each msmc run

output_dir : <class 'str'> This where all the intermediate files and results will be stored. This includes all plots produced, as well as all input/output files from the simulations and software runs organized into subdirectories by replicate seed.

species : <class 'module'> from stdpopsim such as stdpopsim.homo_sapiens. Chromosomes, demographics, and recombination maps should be derived from this.

model : <class childclass Model> This is the class that defines the demography associated with a species. All models should inherit from <class Model>

genetic_map : <class childclass GeneticMap> This will define the genetic map used for simulations.

chrm_list : <class 'str'> A string of the chromosome names you would like to simulate, separated by commas. All chromosomes simulated will be fed as a single input into each analysis by the inference programs, for each replicate. Set to "all" to simulate all chromosomes for the genome.

mask_file : <class 'str'> A string indicating the path to a bed formatted file with to be masked out of the analysis.

dfe_id: <class 'str'> A string indicating what Distribution of Fitness Effects to use. See the catalog to check all the possible DFEs.

annotation_id : <class 'str'>,<class 'str'> A string indicating where the DFE should be applied followed by the name of the data. If applied to the 'exons', then an annotation name needs to be given see here for humans. If set to 'none', a neutral model is applied. And if set to 'all' then selection is insert in the entire contig under subject.

annotation_id : <class 'int'>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Files

workflows

workflows

README.md

Readme for N(t) with selection example

Workflow

Cluster environments

Final output

Config file Parameter Description

Files

workflows

Directory actions

More options

Directory actions

More options

Latest commit

History

workflows

Folders and files

parent directory

README.md

Readme for N(t) with selection example

Workflow

Cluster environments

Final output

Config file Parameter Description