-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
0 parents
commit f629659
Showing
5 changed files
with
286 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
._* | ||
.Rproj.user | ||
.RData | ||
.Ruserdata | ||
.DS_Store | ||
.Rhistory | ||
.nfs* | ||
diamond |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
Version: 1.0 | ||
|
||
RestoreWorkspace: Default | ||
SaveWorkspace: Default | ||
AlwaysSaveHistory: Default | ||
|
||
EnableCodeIndexing: Yes | ||
UseSpacesForTab: Yes | ||
NumSpacesForTab: 2 | ||
Encoding: UTF-8 | ||
|
||
RnwWeave: Sweave | ||
LaTeX: pdfLaTeX |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
--- | ||
title: "DiaMet" | ||
output: | ||
md_document: | ||
variant: gfm | ||
--- | ||
|
||
<!-- README.md is generated from README.Rmd. Please edit that file --> | ||
|
||
```{r, include = FALSE} | ||
knitr::opts_chunk$set( | ||
collapse = TRUE, | ||
message = FALSE, | ||
warning = FALSE, | ||
comment = "#>" | ||
) | ||
``` | ||
|
||
# DiaMet | ||
This repository aims to analyze those metagenomic Illumina reads that were unable to | ||
be classified by [VirMet](https://github.com/medvir/VirMet) (`undetermined_reads.fastq.gz`) | ||
by aligning them on protein level using [DIAMOND](https://github.com/bbuchfink/diamond)/BLASTx. | ||
|
||
`diamet.py` analyzes all reads from `undetermined_reads.fastq.gz` | ||
|
||
- as single reads, and | ||
|
||
- as contigs created by *de novo* assembly using [megahit](https://github.com/voutcn/megahit). | ||
|
||
### How to run | ||
|
||
1. Enter timavo. | ||
|
||
`ssh timavo` | ||
|
||
2. Move into the directory of the sample whose undetermined reads you want to analyze. | ||
|
||
`cd /analysis/VirMet/<run>/<sample>/` | ||
|
||
3. Run the python script. | ||
|
||
`python <path to script>/diamet.py` | ||
|
||
### Input | ||
|
||
To run `diamet.py`, you need: | ||
|
||
- the `diamond` unix executable file which can be found [here](https://github.com/bbuchfink/diamond); | ||
|
||
- [megahit](https://github.com/voutcn/megahit) installed on the server; | ||
|
||
- a protein database (defined in the code; we are using *swissprot*); | ||
|
||
- `undetermined_reads.fastq.gz`, which should be in the current working directory. | ||
|
||
### Output | ||
|
||
`diamet.py` will output the following files: | ||
|
||
- `undetermined_reads_diamet.pdf` which plots taxonomic classification distribution of all hits; | ||
|
||
- `undetermined_reads_diamet.tsv` which lists all hits and their Query Seq - id (qseqid), | ||
Query sequence length (qlen), Alignment length (length), Unique Subject Scientific Name (sscinames), | ||
and Unique Subject Super Kingdom (sskingdoms); | ||
|
||
- `undetermined_reads_diamet_viral.csv` which lists only the viral hits and their counts; | ||
|
||
- `undetermined_contigs_diamet.tsv` which lists all hits of the contigs and their | ||
Query Seq - id (qseqid), Query sequence length (qlen),Alignment length (length), | ||
Unique Subject Scientific Name (sscinames), and Unique Subject Super Kingdom (sskingdoms). |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
<!-- README.md is generated from README.Rmd. Please edit that file --> | ||
|
||
# DiaMet | ||
|
||
This repository aims to analyze those metagenomic Illumina reads that | ||
were unable to be classified by | ||
[VirMet](https://github.com/medvir/VirMet) | ||
(`undetermined_reads.fastq.gz`) by aligning them on protein level using | ||
[DIAMOND](https://github.com/bbuchfink/diamond)/BLASTx. | ||
|
||
`diamet.py` analyzes all reads from `undetermined_reads.fastq.gz` | ||
|
||
- as single reads, and | ||
|
||
- as contigs created by *de novo* assembly using | ||
[megahit](https://github.com/voutcn/megahit). | ||
|
||
### How to run | ||
|
||
1. Enter timavo. | ||
|
||
`ssh timavo` | ||
|
||
2. Move into the directory of the sample whose undetermined reads you | ||
want to analyze. | ||
|
||
`cd /analysis/VirMet/<run>/<sample>/` | ||
|
||
3. Run the python script. | ||
|
||
`python <path to script>/diamet.py` | ||
|
||
### Input | ||
|
||
To run `diamet.py`, you need: | ||
|
||
- the `diamond` unix executable file which can be found | ||
[here](https://github.com/bbuchfink/diamond); | ||
|
||
- [megahit](https://github.com/voutcn/megahit) installed on the server; | ||
|
||
- a protein database (defined in the code; we are using *swissprot*); | ||
|
||
- `undetermined_reads.fastq.gz`, which should be in the current working | ||
directory. | ||
|
||
### Output | ||
|
||
`diamet.py` will output the following files: | ||
|
||
- `undetermined_reads_diamet.pdf` which plots taxonomic classification | ||
distribution of all hits; | ||
|
||
- `undetermined_reads_diamet.tsv` which lists all hits and their Query | ||
Seq - id (qseqid), Query sequence length (qlen), Alignment length | ||
(length), Unique Subject Scientific Name (sscinames), and Unique | ||
Subject Super Kingdom (sskingdoms); | ||
|
||
- `undetermined_reads_diamet_viral.csv` which lists only the viral hits | ||
and their counts; | ||
|
||
- `undetermined_contigs_diamet.tsv` which lists all hits of the contigs | ||
and their Query Seq - id (qseqid), Query sequence length | ||
(qlen),Alignment length (length), Unique Subject Scientific Name | ||
(sscinames), and Unique Subject Super Kingdom (sskingdoms). |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,130 @@ | ||
import os | ||
import subprocess | ||
import pandas as pd | ||
import matplotlib.pyplot as plt | ||
from matplotlib.backends.backend_pdf import PdfPages | ||
from matplotlib.patches import Patch | ||
|
||
def run_megahit_and_diamond(): | ||
fastq_file = "undetermined_reads.fastq.gz" | ||
if not os.path.isfile(fastq_file): | ||
print(f"Error: {fastq_file} not found in the working directory.") | ||
exit(1) | ||
|
||
# Run Megahit | ||
megahit_command = f"megahit -r {fastq_file} -o contigs" | ||
subprocess.run(megahit_command, shell=True, check=True) | ||
|
||
# Run Diamond blastx for contigs | ||
output_file = "undetermined_contigs_diamet.tsv" | ||
diamond_command = "/rv_home/iapich/DiaMet/diamond blastx -d /data/diamond/swissprot.dmnd " \ | ||
f"-q contigs/final.contigs.fa -o {output_file} -f 6 qseqid qlen length sscinames sskingdoms" | ||
subprocess.run(diamond_command, shell=True, check=True) | ||
|
||
if os.path.isfile(output_file): | ||
df = pd.read_csv(output_file, sep='\t', header=None, | ||
names=['qseqid', 'qlen', 'length', 'sscinames', 'sskingdoms']) | ||
|
||
df_filtered = df.drop_duplicates(subset='qseqid', keep='first') | ||
df_filtered.to_csv(output_file, sep='\t', header=None, index=False) | ||
|
||
print(f"Duplicates removed from {output_file}.") | ||
|
||
contigs_directory = "contigs" | ||
if os.path.isdir(contigs_directory): | ||
subprocess.run(f"rm -r {contigs_directory}", shell=True) | ||
print(f"Removed {contigs_directory}.") | ||
else: | ||
print(f"Error: {contigs_directory} not found.") | ||
|
||
else: | ||
print(f"Error: {output_file} not found.") | ||
exit(1) | ||
|
||
print("Megahit and Diamond blastx execution completed.") | ||
|
||
def run_diamond_blastx(): | ||
command = "/rv_home/iapich/DiaMet/diamond blastx -d /data/diamond/swissprot.dmnd -q undetermined_reads.fastq.gz -o undetermined_reads_diamet.tsv -f 6 qseqid qlen length sscinames sskingdoms --ultra-sensitive" | ||
os.system(command) | ||
|
||
def remove_duplicate_rows(file_path): | ||
df = pd.read_csv(file_path, sep='\t', header=None) | ||
df = df.drop_duplicates(subset=0, keep='first') | ||
df.to_csv(file_path, sep='\t', index=False, header=False) | ||
|
||
def custom_colors(entries): | ||
color_dict = { | ||
'Eukaryota': (20 / 255, 54 / 255, 66 / 255), | ||
'Bacteria': (16 / 255, 138 / 255, 140 / 255), | ||
'Viruses': (168 / 255, 31 / 255, 27 / 255), | ||
'Archaea': (236 / 255, 153 / 255, 41 / 255) | ||
} | ||
return [color_dict[entry] for entry in entries] | ||
|
||
def plot_column_5(file_path, total_sequences, pdf_pages): | ||
df = pd.read_csv(file_path, sep='\t', header=None) | ||
counts = df[4].value_counts() | ||
|
||
desired_order = ['Eukaryota', 'Bacteria', 'Archaea', 'Viruses'] | ||
|
||
unique_entries = sorted(counts.index.tolist(), key=lambda x: desired_order.index(x)) | ||
|
||
colors = custom_colors(unique_entries) | ||
|
||
plt.figure(figsize=(8, 5)) | ||
|
||
plt.gca().set_axisbelow(True) | ||
plt.grid(which='major', axis='y', linestyle='-', linewidth=1.2, alpha=0.5) | ||
plt.grid(which='minor', axis='y', linestyle='-', linewidth=0.5, alpha=0.2) | ||
|
||
bars = plt.bar(range(len(unique_entries)), counts[unique_entries], color=colors) | ||
|
||
plt.ylabel('Count', fontsize=12) | ||
plt.yscale('log') | ||
plt.title('Taxonomic classification of undetermined reads', fontsize=14, fontweight='bold', loc='left', y=1.05) | ||
|
||
percentage_classified = len(df[0].unique()) / total_sequences * 100 | ||
subtitle = f'{percentage_classified:.2f}% of undetermined reads could be classified on protein level' | ||
plt.suptitle(subtitle, y=0.92, x=0.125, fontsize=10, ha='left') | ||
|
||
legend_labels = [entry for entry in unique_entries] | ||
legend_colors = custom_colors(unique_entries) | ||
legend_patches = [Patch(color=color, label=label) for color, label in zip(legend_colors, legend_labels)] | ||
|
||
legend = plt.legend(handles=legend_patches, loc='center left', bbox_to_anchor=(1, 0.5), fontsize='small') | ||
|
||
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False) | ||
plt.tick_params(axis='y', which='both', left=False, right=False, labelleft=True) | ||
|
||
plt.gca().spines['top'].set_visible(False) | ||
plt.gca().spines['right'].set_visible(False) | ||
plt.gca().spines['bottom'].set_visible(True) | ||
plt.gca().spines['left'].set_visible(False) | ||
|
||
plt.ylim(1, counts.max() * 1.2) | ||
|
||
pdf_pages.savefig(bbox_inches='tight') | ||
|
||
df[df[4] == 'Viruses'][[3]].value_counts().reset_index().to_csv('undetermined_reads_diamet_viral.csv', index=False, header=['virus_species', 'count']) | ||
|
||
if __name__ == "__main__": | ||
pdf_path = 'undetermined_reads_diamet.pdf' | ||
|
||
with PdfPages(pdf_path) as pdf_pages: | ||
run_megahit_and_diamond() | ||
|
||
run_diamond_blastx() | ||
|
||
result_file = "undetermined_reads_diamet.tsv" | ||
remove_duplicate_rows(result_file) | ||
|
||
os.system("seqkit fq2fa undetermined_reads.fastq.gz > undetermined_reads.fasta") | ||
|
||
with open('undetermined_reads.fasta') as fasta_file: | ||
total_sequences = sum(1 for line in fasta_file if line.startswith('>')) | ||
|
||
os.remove('undetermined_reads.fasta') | ||
|
||
plot_column_5(result_file, total_sequences, pdf_pages) | ||
|
||
print(f"Script executed successfully.") |