Skip to content

Commit

Permalink
Create README.md
Browse files Browse the repository at this point in the history
  • Loading branch information
nicolo-tellini authored Apr 2, 2024
1 parent e37f432 commit 273488c
Showing 1 changed file with 268 additions and 0 deletions.
268 changes: 268 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,268 @@
<p align="center">
<img src="https://github.com/nicolo-tellini/intropipeline/blob/loaded/img/intropipeline.logo.png" alt="intropipeline logo" width="450" height="300"/>
</p>

raw data produced in Tellini et al., can be downloaded [here](http://134.59.51.17:5000/sharing/keXp2tuto)

:construction: this page is under construction (31/07/2023)

:rocket: A v.2 with several improvements in stability, speed and memory consumption is close to being released.

# intropipeline

[![Licence](https://img.shields.io/github/license/nicolo-tellini/intropipeline?style=plastic)](https://github.com/nicolo-tellini/intropipeline/blob/main/LICENSE)
[![Release](https://img.shields.io/github/v/release/nicolo-tellini/intropipeline?style=plastic)](https://github.com/nicolo-tellini/intropipeline/releases/tag/v.1.0.0)
[![release date](https://img.shields.io/github/release-date/nicolo-tellini/intropipeline?color=violet&style=plastic)](https://github.com/nicolo-tellini/intropipeline/releases/tag/v.1.0.0)
[![commit](https://img.shields.io/github/last-commit/nicolo-tellini/intropipeline?color=yellow&style=plastic)](https://github.com/nicolo-tellini/intropipeline/graphs/commit-activity)

An automated computational framework for detecting *Saccharomyces paradoxus* introgressions in *Saccharomyces cerevisiae* strains from paired-end illumina sequencing.

<p align="center">
<img src="https://github.com/nicolo-tellini/introspect/blob/loaded/img/logo2.png" alt="Sublime's custom image"/>
</p>

## Description

v1. is described in Tellini et al 20xx for detecting *S.par* introgressions in *S.cer* strains. <br>

v2. contains the following implementations and changes:
- ```minimap2``` replaced ```bwa mem``` almost halving the running time (see [Heng Li 2018, Bioinformatics](https://academic.oup.com/bioinformatics/article/34/18/3094/4994778?login=true)) achieving comparable results; <br>

sample: ERR3010122 <br>

threads: 2 <br>

Architecture: x86_64 <br>

CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz <br>


| script | Elapsed Time | Maximum resident set size (GB) |
| ------------- | ------------- | ------------- |
| bwa mem + samtools (v1) | 6:21 (m:ss) | 1.3 |
| minimap2 + samtools (v2) | 3:36 (m:ss) | 1.3 |

- improved the reproducibility of the mapping by implementing the standard samtools workflow according to [samtools' guideline](http://www.htslib.org/workflow/fastq.html)
- improved the roboustness of the mapping by appending the name of the strain to a checkpoint (cps) file (```./cps/cps.txt```). The strains which names are stored in ```./cps/cps.txt``` will not be mapped again.
- introduced ```data.table```, ```lapply``` and custom function for large file manipulation for reducing runtime and RAM load.
example:

sample: ERR3010122 <br>

threads: 2 <br>

Architecture: x86_64 <br>

CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz <br>

| script | Elapsed Time (s) | Maximum resident set size (GB) |
| ------------- | ------------- | ------------- |
| parser_marker.r (v1) | 0:17 s | 0.8 |
| parser_marker.r (v2) | 0:06 s | 0.5 |
| clrs.r (v1) | 0:49 s | 1.9 |
| clrs.r (v2) | 0:17 s | 0.7 |

- introduced the variables ```nSamples``` and ```nThreads``` inside ```runner.sh```. The first variable controls the number of samples to run in paralell and the second the per-samples number of threads. ```nSamples``` guarantees a contant number of samples running in parallel; as soon as the count drop of one sample an other will start to run. The definition of these variables affect the scripts ```minimap2.sh``` (which replaces ```bwa.sh```), ```bcftools_markers.sh``` (which replaces ```samtools_marker.sh```) and ```freec.sh```;
- corrected an error that prevented the detection of the CNVs;
- Added a new approach for merging markers in blocks:

In v1 the markers are (1) genotyped, (2) filtered and (3) joined as long as they are consecutive and carry the same information. In v2 this does not change.

In v2 the markers are (1) ranked, (2) genotyped, (3) filtered, (4) joined as long as they are consecutive in the **ranking** and carry the same information. v1 did not use the ranking.
Inevitably, this results in a more fragmented signal but provides a more realistic and faithful representation of the introgression reflecting regions where the genotyping was either discordant or failed.
The ranking also represents the strategy that allowed the speedup of ```clrs.r``` (the script that generates the blocks).

<p align="center">
<img src="https://github.com/nicolo-tellini/intropipeline/blob/loaded/img/mrkstrategy.png" alt="Sublime's custom image"/>
</p>

## Download

:octocat: :

```sh
git clone --recursive https://github.com/nicolo-tellini/intropipeline.git
```

## Content

:open_file_folder: :

```{bash}
.
├── rep
│ ├── Ann
│ └── Asm
├── runner.sh
├── scr
└── seq
5 directories 1 file
```

- ```rep``` : repository with assemblies, annotations and pre-computed marker table,</br>
- ```runner.sh``` : the script you edit and run,</br>
- ```scr``` : scripts,</br>
- ```seq``` : put the FASTQs files here,</br>

### Before starting

``` gzip -d ./rep/mrktab.gz ```

``` gzip -d ./rep/Asm/*gz```

### About the fastqs

Move the FASTQs inside ```./seq/```

Paired-end FASTQs data **must** be gziped and suffixed with **.R1.fastq.gz** and **.R2.fastq.gz**.

### Default

```./scr/bwa.sh``` uses 2 thread for sample (n.samples = 2).

```./scr/samtools_markers.sh``` uses 1 thread for sample (n.samples = 4).

```./scr/gem.sh``` uses 2 threads.

```./scr/freec.sh``` uses 4 threads.

these values can be changed editing the scripts.

### How to run

Edit ```runner.sh``` :page_with_curl:

```{bash}
#!/bin/bash
#####################
### user settings ###
#####################
## S. paradoxus reference assembly
ref2Label="CBS432" ## choose the Spar assembly you think better fit the origin of your samples
## short labels (used to name file)
ref2="EU" ## choose a short name for Spar
# STEP 1
fastqQC="yes" ## fastqc control (required) ("yes","no" or "-" the last is skip)
# STEP 2
shortReadMapping="yes" ## ("yes","no")
# STEP 3
mrkgeno="yes" ## ("yes","no")
# STEP 4
cnv="yes" ## ("yes","no")
# STEP 5
intro="yes" ## ("yes","no")
#####################
### settings' end ###
#####################
```

Run ```runner.sh``` :runner:

```{bash}
nohup bash runner.sh &
```
## The result

The results concerning the introgressions are stored in ```./int```

Ex.

An Alpechin strain:

<p align="center">
<img src="https://github.com/nicolo-tellini/introspect/blob/loaded/img/example-result.png" alt="res"/>
</p>


## How to interprer the result

Blue-Red plots provides an overview of potential introgressed DNA across the genome.
The interpretation of the results is a process that require the integration of different data the pipeline produces.

<p align="center">
<img src="https://github.com/nicolo-tellini/introspect/blob/loaded/img/res1.png" alt="Sublime's custom image"/>
</p>

:exclamation: Reminder: blocks are defined as consecutive markers besring the same genomic info (Homo S.cer, Homo S.par, Het).

<br />

How are markers distributed inside the *S.par* block?

A couple of possible scenarious:

**Case 1**: abundant markers suporting the block
<p align="center">
<img src="https://github.com/nicolo-tellini/introspect/blob/loaded/img/res2.png" alt="Sublime's custom image"/>
</p>

:exclamation: Note: Only a few markers in the figure above are represented in the cartoon;


**Case 2**: *not* so abundant markers suporting the block
<p align="center">
<img src="https://github.com/nicolo-tellini/introspect/blob/loaded/img/res3.png" alt="Sublime's custom image"/>
</p>

:exclamation: Note: you should *not* exclude the possibility that a large events is supported by a low number of markers as in the example.

The number of markers supporting the blocks, the marker density and the info concerning the genotype are stored in ```int``` and ```int/AllSegments```.

## Dependencies

### Softwares

* [FastQC](https://github.com/s-andrews/FastQC/releases/tag/v0.11.9) v. 0.11.9
* [bwa](https://github.com/lh3/bwa/releases/tag/v0.7.17) v. 0.7.17-r1198-dirty
* [samtools](https://github.com/samtools/samtools/releases/tag/1.9) v. 1.9
* [GEM](https://sourceforge.net/projects/gemlibrary/files/gem-library/Binary%20pre-release%203/) v. 1.315 (beta)
!! The GEM version used for the analyses is 1.759 (not available anymore).
* [Control-FREEC](https://github.com/BoevaLab/FREEC/releases/tag/v11.6) v. 11.6; makeGraph.R script was renamed makeplotcnv.R; A copy of all the scripts in [FREEC/scripts/](https://github.com/BoevaLab/FREEC) is in scr. Nevertheless freec has to be installed
* A copy of [sambamba](https://github.com/biod/sambamba/releases/tag/v0.6.5) v. 0.6.5 is provided with the pipeline (no installation required)

### R libraries

* [data.table](https://rdocumentation.org/packages/data.table/versions/1.14.2) v. 1.14.2
* [ggplot2](https://github.com/tidyverse/ggplot2/releases/tag/v3.3.5) v. 3.3.5
* [vcfR](https://github.com/knausb/vcfR/releases/tag/v1.12.0) v. 1.12.0
* [scales](https://cran.r-project.org/src/contrib/Archive/scales/) v. 1.1.1
* [rtracklayer](http://www.bioconductor.org/packages/3.11/bioc/html/rtracklayer.html) v. 1.48.0
* [seqinr](https://cran.r-project.org/src/contrib/Archive/seqinr/) v. 4.2-8

## Find out more

*S.cer* consensus assembly [link method paper]

Marker definition [link method paper]

## Citations

## Release history

* v1.0.0 released on 2023

## TO-DO list

### short-term updates
- rename columns names ClrS

### long-term updates

- more extensive mapping
- migration to data.table
- adopt ranking strategy
- integration genomics annotations
- traffic light bash (control over the number of samples and threads)
- switch CNVs

0 comments on commit 273488c

Please sign in to comment.