Commit 4c49ba4

committed
update to v0.2
1 parent 16464a5 commit 4c49ba4

208 files changed

+305963
-74062
lines changed

.gitignore

+13-1
@@ -2,7 +2,8 @@
 modify_savage.done
 Rplots.pdf
 *.log
-data/assembly/*
+data/assembly
+data/snp
 Ray*
 *.nohup
 *utect2.smk
@@ -11,3 +12,14 @@ rm_contamination.smk
 ncbi_ref
 *haploflow*
 */*.ncbi*.yaml
+logs
+eval_haplo_assembler.smk
+support
+old
+libs/not_used
+libs/PEHaplo
+libs/PredictHaplo
+libs/virgena
+libs/vicuna.zip
+*.bak
+*.old

LICENSE

-201
This file was deleted.

README.md

+12-12
@@ -1,9 +1,9 @@
 ## QuasiModo - Quasispecies Metric Determination on Omics
-> #### Strain-level assembly and SNP calling benchmarking based on sequencing data of mixed strain samples for HCMV
+> #### Strain-level assembly and variant calling benchmarking based on sequencing data of mixed strain samples for HCMV

-This repository contains the scripts and pipeline that reproduces the results of the HCMV benchmarking study. In this study we evaluated genome assemblers and variant callers on 6 in vitro generated, mixed strain HCMV sequence samples, each consisting of two lab strains in different abundance ratios. This tool can also be used to evaluate assemblies and SNP calling results on other similar datasets.
+This repository contains the scripts and pipeline that reproduces the results of the HCMV benchmarking study. In this study we evaluated genome assemblers and variant callers on 10 in vitro generated, mixed strain HCMV sequence samples, each consisting of two lab strains in different abundance ratios. This tool can also be used to evaluate assemblies and variant calling results on other similar datasets.

-In this benchmarking study: variants callers `BCFtools` (v1.9), `VarScan` (v2.4.3), `Freebayes` (v1.2.0), `LoFreq` (v2.1.3.1), `CLC Genomics Workbench` (v11.0.1) were evaluated. For the assembly benchmarking, `ABySS` (v2.1.4), `megahit` (v1.1.3) , `IDBA` (v1.1.3), `SPAdes` (v3.12.0), `Ray` (v2.3.1), `tadpole` (v37.99) were assessed. The haplotype reconstruction program `Savage` (v0.4.0) was also evaluated.
+In this benchmarking study: variants callers `BCFtools` (v1.9), `VarScan` (v2.4.3), `Freebayes` (v1.2.0), `LoFreq` (v2.1.3.1), `CLC Genomics Workbench` (v11.0.1) were evaluated. For the assembly benchmarking, `ABySS` (v2.1.4), `megahit` (v1.1.3) , `IDBA` (v1.1.3), `SPAdes` (v3.12.0), `Ray` (v2.3.1), `Tadpole` (v37.99) were assessed. The haplotype reconstruction program `Savage` (v0.4.0) was also evaluated.

 ### Prerequirements

@@ -43,7 +43,7 @@ TA-1-10 ../HCMV_benchmark_output/data/seqs/reads/TA-1-10.qc.r1.fq.gz ../HCMV_ben
 Please modify the paths to the sequencing files which you have downloaded accordingly. In this example, the `<your project path>` is `../HCMV_benchmark_output` and the reads are in the `../HCMV_benchmark_output/data/seqs/reads`.


-#### ! Due to the high computational and time cost, by default this program do not run the whole benchmark for HCMV dataset from scratch (based on reads), instead it benchmarks the SNP call and assembly based on the VCF files and scaffolds provided within this program under `data` directory.
+#### ! Due to the high computational and time cost, by default this program do not run the whole benchmark for HCMV dataset from scratch (based on reads), instead it benchmarks the variant call and assembly based on the VCF files and scaffolds provided within this program under `data` directory.

 ### Adapt the configuration file
 All the paths must be either relative path to the parent directory of `config` folder or absolute path.
@@ -78,11 +78,11 @@ Options:

 Commands:
 hcmv Benchmarking for HCMV dataset
-snpeval SNP calling benchmark for customized dataset
+vareval Variant calling benchmark for customized dataset
 asmeval Assembly benchmark for customized dataset
 ```

-This program consists of three subcommands: `hcmv`, `snpeval`, `asmeval`. The first one is used for the benchmarking on our HCMV datasets. And the other two are for the SNP call and assembly evaluation on customized datasets.
+This program consists of three subcommands: `hcmv`, `vareval`, `asmeval`. The first one is used for the benchmarking on our HCMV datasets. And the other two are for the variant call and assembly evaluation on customized datasets.

 The argumentrs and options in the `hcmv` command:
 ```
@@ -99,7 +99,7 @@ Options:
 -t, --threads INTEGER The number of threads to use. [default: 2]
 -d, --dryrun Print the details without run the pipeline.
 [default: False]
--e, --evaluation [all|snpcall|assembly]
+-e, --evaluation [all|variantcall|assembly]
 The evaluation to run. [required]
 -s, --slow Run the evaluation based on reads, which is
 very slow. By default, the evaluation will
@@ -125,7 +125,7 @@ If you expect to the benchmarking based on the reads, you need to specify the `-
 #### Assess variant callers and analyze the mutation context of identified variants

 ```shell
-python3 run_benchmark.py hcmv -e snpcall -t 10 -c ~/miniconda3/envs
+python3 run_benchmark.py hcmv -e variantcall -t 10 -c ~/miniconda3/envs
 ```
 If you wish to the benchmarking based on the reads, you need to specify the `--slow` or `-s` option which allows you to generate the variant calling results from reads.

@@ -266,11 +266,11 @@ python3 run_benchmark.py asmeval -t 10 -c ~/miniconda3/envs \
 ```

 #### Assess variant callers
-The arguments and options of `snpeval` command:
+The arguments and options of `vareval` command:
 ```
-Usage: run_benchmark.py snpeval [OPTIONS]
+Usage: run_benchmark.py vareval [OPTIONS]

-SNP calling benchmark for customized dataset
+Variant calling benchmark for customized dataset

 Options:
 -o, --outpath PATH The directory where to put the results and figures.
@@ -295,7 +295,7 @@ Options:

 - Run the benchmarking
 ```shell
-python3 run_benchmark.py snpeval -t 10 -c ~/miniconda3/envs \
+python3 run_benchmark.py vareval -t 10 -c ~/miniconda3/envs \
 -v "<comma-separated list of VCF files>" \
 -r "<comma-separated list of reference genomes>" \
 -o <output directory>

config/conda_env.yaml

+2
@@ -25,6 +25,8 @@ dependencies:
 - mummer=3.23
 - pip=19.1.1
 - pandas=0.24.2
+- rtg-tools=3.11
+- tabix=0.2.6
 - r-tidyverse=1.2.1
 - r-cowplot=0.9.4
 - r-reshape2=1.4.3

config/conda_iva.yaml

+7
@@ -0,0 +1,7 @@
+name: hcmv_benchmark_iva
+channels:
+- bioconda
+- conda-forge
+- defaults
+dependencies:
+- iva=1.0.9

config/config.yaml

+5-2
@@ -3,6 +3,9 @@ MerlinRef: ref/Merlin.BAC.fa
 TB40ERef: ref/TB40E.GFP.fa
 AD169Ref: ref/AD169.BAC.fa
 PhixRef: ref/Phix.fa
-outpath: ../fastmode_output_final
-threads: 2
+outpath: ../revision_output_3
+threads: 20
 runOnReads: false
+rmHumanEcoli: true
+HumanRefBWAIdx: /net/sgi/viral_genomics/MHH/human_genome/hg19.genome.bwa
+EcoliRefBWAIdx: ref/Ecoli.NC_000913.fa

data/assembly.tar.gz

-78.2 MB
Binary file not shown.

data/snp.tar.gz

9.69 MB
Binary file not shown.
