Skip to content

split, merge, reformulate assembly_visualization.md #7

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: dev
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 40 additions & 0 deletions docs/source/technical/Saga_graphical_window_loggin.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# Using a graphical window in SAGA
It is possible to use graphical interface on Saga.
Login to saga using the `-Y` option


_**PRACTICAL EXERCISE:**_

```bash
ssh -Y <username>@saga.sigma2.no
```

Do a simple test that to check that it is working

```bash
xeyes
```

type `ctrl+C` to kill xeyes


You can read more on [SAGA manual: X11 forwarding](https://documentation.sigma2.no/getting_started/create_ssh_keys.html#x11-forwarding)


## Concrete example
Below is an example where we are using IGV, which is required for the tutorial:[Visualizing assembly and associated reads pileup with IGV](../tutorials/Visualisation_assembly_reads_pileup_IGV.md)

Ask for ressources on Saga to use IGV (do not forget `--x11`)

```bash
# Example of how to use the ressources interactively
srun --account=nn9305k --mem-per-cpu=16G --cpus-per-task=1 --qos=devel \
--time=0:30:00 --x11 --pty bash -i
conda activate igv
# launching IGV
igv
```

A window with IVG should open.

You will see several error messages that it tries to load a genome file, ignore those (it is because there are some server data that try to be loaded through internet, but we have no internet access from the compute nodes. Anyway we do not need those data to work with.)
Binary file added docs/source/tutorials/IGV.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
67 changes: 67 additions & 0 deletions docs/source/tutorials/Visualisation_assembly_reads_pileup_IGV.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
# Visualizing assembly and associated reads pileup with IGV

The [IGV: Integrative Genomics Viewer](http://software.broadinstitute.org/software/igv/home) software allows to visually explore and scroll through sequences, that may or not be split in contigs, and may or not be associated with reads pileup. It can eg. allow you to detect specific positions were variants are erroneous (eg. at the end of contigs where reads are poorly mapped) and visualize eg. locations where coverage is not homogeneous (ie. repeated regions that are not proprely assembled).

NB: Convention: means `<change_me>`

## 1. Access to IGV

- You can access [IGV online](https://igv.org/app/). If you use this solution **be aware that data might be accessible or become publicly aware as they are uploaded to the website**.
- You can install IGV on your own PC, this requires that your files are on the same machine (Please see <software.vetinst.no> for installation).
- You can use saga with a graphical user interface. Please see the techinical: [Using a graphical window in SAGA](../technical/Saga_graphical_interface.md). **For Mac and Linux users only**. Be aware that you can have a Virtual Machine (VM) installed on a Windows PC (eg. on Virtual Box), or can request a Bioinformatic computer with Linux installed.

## 2. Example: using IGV with reads mapped to its own assembly

We have assembled reads and annotated using Bifrost pipeline. The data for the practical are found in `/cluster/projects/nn9305k/tutorial/20210412_mapping_visualization/visualisation_files`

Files description:
- the assembly (polished with Pilon, **output provided by Prokka annoation** ) is: `MiSeq_Ecoli_MG1655_50x.fna` (fasta format)
- the genome annotation is the file (.gff) provided by `Prokka` : `MiSeq_Ecoli_MG1655_50x.gff`
- the final `.bam` file: `final_all_merged.bam` that contained the mapping of the PE and Unpaired reads to the assembly (used as reference)
- A `.bed` file (or gap track): `MiSeq_Ecoli_MG1655.bed` that allows easily identification of scaffolds limits (gaps).

NB: you can see you to generate those files see the tutorial: [Mapping reads against a reference/assembly](./Visualisation_assembly_reads_pileup_IGV.md) and in the [Creating gap tracks](./creating_file_visualize_gaps.md)

If you work on your pc you want to transfer those files to a local folder. If in need, please look at the tutorial regarding [file transfer](./transfering_files.md)

Now you can load those files in [IGV](https://software.broadinstitute.org/software/igv/)

![IGV](./IGV.png)


_**PRACTICAL EXERCISE:**_

1. Create a `genome file` this allows associating tracks to the assembly : `Genomes > create.genome file`. Use the menu to select your assembly file `.fasta`and the annotation-gene file: `.gff`.
A track corresponds to a row of information that is displayed in IGV (one track per type of information. Information provided to IGV by loading one file: eg. the mapping file, the genome/assembly file, the file representing gaps between scaffolds).

NB: You need to fill the `unique identifier` AND `descriptive name` fileds.

2. Locate and Load your `mapped reads` and the `gap file` using: `file > load from file`
- if you use SAGA here: you need to navigate throught the file structure: cluster>projects>nn9305k>active> and your own folder

3. To be able to easily re-open (without re-importing everything you can do): `file > save session`
- be carefull where you save in the navigator when using SAGA

Now you are ready to navigate and explore your assembly.

**Try to find a gap.**

NB: To zoom while staying centered on the gap: click above menu (position within the scaffold - at the gap - top track) then click with the mouse at the gap position on the gap track (until appropriate zoom is obtained).

You can look here for [Options and interpretation](http://software.broadinstitute.org/software/igv/PopupMenus#AlignmentTrack),
and here: [PE orientations](http://software.broadinstitute.org/software/igv/interpreting_pair_orientations).

**Have a look at:**
- coverage
- gaps positions
- some strange scaffolds?
- PE orientations: in detail how the reads map to your assembly (you will need to zoom a lot)
- are some PE reads miss-oriented? reported as having abnormal insert sizes?

## Troubleshooting

To be able to link the positions in the assembly to the annotations, the scaffold names between assembly and annotations files have to be consistent when provided as input file into IGV.

This is why we used the assembly that prokka outputs in the present example (this is the sequence is the same as .fasta assembly polished by pilon). Indeed, `Prokka` transforms scaffold names from the assembly used as input. If scaffold names in the annotation file and sequence file are not consistent, IGV will not manage to link the information of both files.

If you encounter such case, and do not have access to the assembly provided by prokka, the solution is to use an alias file. The alias file will provide to IGV the correspondance between contigs of your sequence/assembly and the annotation file. This is [described here](https://software.broadinstitute.org/software/igv/LoadData/#aliasfile)
28 changes: 28 additions & 0 deletions docs/source/tutorials/creating_file_visualize_gaps.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
# Creating a file to visualize gaps in genomes

When you attempt to visualize genome assemblies with software such as IGV, beeing able to easily locate starting/and ending scaffold locations is very usefull. This, because those positions are potentially problematic regions of an assembly. A gap file (refered to as `.bed` file) provides the locations where scaffolds are interupted. Those file also also to rapidely navigate through the different scaffolds when provided to visualization softwares.


We use a little python script from [sequencetools repository](https://github.com/lexnederbragt/sequencetools) to create `bed`files. NB: If you use this script in an article you need to proprely aknowledge Lex Nederbragt.

For ease we copied this script and placed it in the SAGA folder: `/cluster/projects/nn9305k/tutorial/20210412_mapping_visualization/scripts`. The script is now called `scaffoldgap2bed_py3.py`.


To generate a `.bed` file with this script, Biopython needs to be accessible.

- On Saga: we can use biopython that is installed in the bifrost environment

<div style="background-color:silver">

_**PRACTICAL EXERCISE:**_

```bash
path_script="path"
conda activate bifrost
python scaffoldgap2bed_py3.py -i <assembly.fna> > <gap_file>.bed
```
</div>

</br>

You can now use this file eg. in the tutorial [Visualizing assembly and associated reads pileup with IGV](./Visualisation_assembly_reads_pileup_IGV.md)
184 changes: 184 additions & 0 deletions docs/source/tutorials/mapping_reads_reference_assembly.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,184 @@
# Mapping reads against a reference/assembly

## 1. What is read-mapping?

Read mapping is "placing" your reads at a locus with a matching sequence in a longer sequence.
This longer sequence can either be the isolate assembly, or an assembly of a related isolate (eg. a reference isolate for the group of isolate under study).

You could think of read mapping as a simulation of local hybridization of your reads to the genome, at a loci where it fits well enough to hybridize. Usually a few mismatches are allowed (think about the consequences), even if your reads have been "hard-clipped". Hard-clipping means that the ends of the reads are not part of the mapping and that those parts have been removed (or trimmed: eg. adapter removal with Trimmomatic). On the other hand, "Soft-clipping" allows to perform read mapping while ignoring (downweighting) the mismatches present at the endings of reads. This allows to map reads in the presence of adapters.


A portion of the reads may map to several loci in your genome (ie. in repeated regions).

Reads can be mapped as paired or single. If paired is used, then the matching regions are defined by the insert size and the length of each read


![https://commons.wikimedia.org/wiki/File:Mapping_Reads.png](https://upload.wikimedia.org/wikipedia/commons/2/2e/Mapping_Reads.png)


## 2. Reads mapping purposes

Read mapping techniques are used for several purposes.
Some examples:

1) To evaluate the quality of an assembly (or to compare different methods used to assemble your reads). Read mapping can help identifying problematic areas in your assembly process. Indeed, statistics are necessary but might not be sufficient to evaluate the quality of your assembly.

To assess the quality of an assembly we can eg. to look at:
- the coverage regularity (ex: some repeated regions might have increased coverage)
- the coverage at the beginning and end of scaffolds - is it good enough?
- are there positions where pileup of reads show polymorphism?
- ...

2) Assembly polishing software such as `pilon` and `reapr` use mapped reads to identify potential positions where the assemblies should be improved. It is good to visualize what information they are using.

Assembly polishing softwares can eg. use variation in coverage, wrong read pairs orientation, discrepancy between expected insert size and actual insert size obtained from read mapping (ie. longer than expected) to improve assembly.

3) To identify SNPs: some methods use reads mapping against a reference genome to identify and type variants/SNPs (e.g. [snippy](https://github.com/tseemann/snippy))

## 3. Practical: mapping reads to the assembly obtained from the same reads.

We have assembled reads using `Bifrost pipeline`. The data for today's practical are found on SAGA in `/cluster/projects/nn9305k/tutorial/20210412_mapping_visualization/mapping_practical`

Description of the files in this folder:
- the raw_reads: `MiSeq_Ecoli_MG1655_50x_R{1,2}.fastq.gz`
- the trimmed-reads: `MiSeq_Ecoli_MG1655_50x_{R1,R2,S}_concat_stripped_trimmed.fq.gz`
- the assembly (polished with Pilon, output during prokka annoation): `MiSeq_Ecoli_MG1655_50x.fna` (fasta format)

Mapping can be done either with raw reads or trimmed reads. If you are interested to see if there are positions with SNPs, it is easier to use trimmed-reads (avoiding the visualization noise due to the presence of adapters)

To map reads, we use the [bwa mem] option from [bwa tools] software.

NB: Convention: means `<change_me>`


_**PRACTICAL EXERCISE:**_

In your `home` directory make a directory for today's work
and a folder called `mapping` where you will **copy** the input files

```bash
cd <home>
mkdir mapping
cd mapping

rsync -rauPW /cluster/projects/nn9305k/tutorial/20210412_mapping_visualization/mapping_practical/* .
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refer to a page with rsync

Copy link
Contributor Author

@evezeyl evezeyl Aug 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This page needs to be made. Removed this part - made a link (commented above) so has to be written

```


You can look at the file content e.g. (one reads-file, the assembly file):
```bash
head MiSeq_Ecoli_MG1655_50x.fna
gunzip -cd MiSeq_Ecoli_MG1655_50x_S_concat_stripped_trimmed.fq.gz | head
```

The mapping tool called `bwa tools` which is installed in the **Bifrost pipeline** conda environment.

Here are options for performing the mapping:
1. You can use the raw_reads for mapping
2. You can use the trimmed reads for mapping
3. You can perform 2 mapping rounds with raw_reads and trimmed reads (2 times). This might be interesting if you want to look at the effect of trimming or not on the mapping visually.

For the tutorial, you will chose one solution 1 or 2. (Solution 3 is if you are curious and want to repeat the exercise)

You will ask for interactive ressources in saga and activate the `bifrost` environment. (We use the queue for testing purposes: devel)

```bash
srun --account=nn9305k --mem-per-cpu=4G --cpus-per-task=2 --qos=devel --time=0:30:00 --pty bash -i
conda activate bifrost
```

To map the reads to an assembly/reference we need to:
1) Index the reference: we are in this case using the assembly as reference

```bash
bwa index <reference.fna>
```

2) We map the reads (attribute the position of the reads according to the index) and sort the mapped reads by index position

NB: Here `-` means that the output of the pipe `|` is used as input in samtools. And `\` is used to indicate that your code (instructions) continues on the next line

- for paired reads (PE) or paired trimmed reads:

```bash
bwa mem -t 4 <reference.fna> <in1:read1.fq.gz> <in2:read2.fq.gz> \
| samtools sort -o <out:PE_mapped_sorted.bam> -
```

- If you used trimmed reads (follow below) you also have unpaired reads (called S here) that we can map as such:

```bash
bwa mem -t 4 <reference.fna> <in:S_reads.fq.gz> \
| samtools sort -o <out:S_mapped_sorted.bam> -
```

- You can merge merged the paired trimmed reads and S reads as such:

```bash
samtools merge <out:all_merged.bam> <in1:S_mapped_sorted.bam> <in2:PE_mapped_sorted.bam>
```

- Now we need to be sure merged reads are still sorted by index: we resort

```bash
samtools sort -o <out:final_all_merged.bam> <in:all_merged.bam>
```

**OPTIONAL**

NB: Some software like `Pilon` need an updated index after merging bam files). You do that by running:

```bash
samtools index <final_all_merged.bam>
```

## 4. Understanding the mapping file format: the [sam/bam file format]

`.bam` files are in a compressed binary format. We need to transform the `.bam` (to a `.sam` file) to be able to see how mapped-reads are represented in the file.

Also have a look at [Samtools article] and at [Samtools manual]


_**PRACTICAL EXERCISE:**_

To decompress: chose f.eks. `PE_mapped_sorted.bam` that we did in the first step:

```bash
samtools view -h -o <out.sam> <in.bam>
```
Look at your `.sam` file using:

```bash
less <filename.sam>
```

## 5. Looking at how the reads maps against the reference/assembly with [Samtools] `tview` module

- looking at the reads pileup with SAMtools
- use `-p <position>` if you want to see a specific position
- type `?` to view the navigation help while samtools is running
- type `q` to quit


_**PRACTICAL EXERCISE:**_

```bash
samtools tview <final_all_merged.bam> --reference <assembly>
```

## Going further
Suggested next lesson: [Visualizing assembly and associated reads pileup with IGV](./Visualisation_assembly_reads_pileup_IGV.md)

You can also look here at the [Uio course] for more details or if you want to do things slightly differently. We reuse some of their [scripts](https://inf-biox121.readthedocs.io/en/2017/Assembly/practicals/Sources.html).

[Uio course]:https://inf-biox121.readthedocs.io/en/2017/Assembly/practicals/03_Mapping_reads_to_an_assembly.html

[bwa mem]:http://bio-bwa.sourceforge.net/bwa.shtml

[bwa tools]:http://bio-bwa.sourceforge.net/

[Samtools article]:https://academic.oup.com/bioinformatics/article/25/16/2078/204688

[Samtools manual]:http://www.htslib.org/doc/samtools.html

[Samtools](http://www.htslib.org/doc/samtools.html)
15 changes: 15 additions & 0 deletions docs/source/tutorials/transfering_files.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
# Transfering files between server and PC or to another server ?

## scp


## rsync


```bash
# transfering from SAGA
rsync -rauPW <user_name>:<source_folder> <destination_folder>
```


## filezilla