Skip to content

Commit 6a0b0b2

Browse files
authored
Merge pull request #97 from EBI-Metagenomics/dev
Dev
2 parents 853487f + 663c311 commit 6a0b0b2

37 files changed

+1148
-348
lines changed

README.md

Lines changed: 33 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -7,36 +7,36 @@ Gurbich TA, Almeida A, Beracochea M, Burdett T, Burgin J, Cochrane G, Raj S, Ric
77
Detailed information about existing MGnify catalogues: https://docs.mgnify.org/src/docs/genome-viewer.html
88

99
### Tools used in the pipeline
10-
| Tool/Database | Version | Purpose |
11-
|----------------------------------|------------------|----------- |
12-
| CheckM | 1.1.3 | Determining genome quality |
13-
| dRep | 3.2.2 | Genome clustering |
14-
| Mash | 2.3 | Sketch for the catalogue; placement of genomes into clusters (update only); strain tree |
15-
| GUNC | 1.0.3 | Quality control |
16-
| GUNC DB | 2.0.4 | Database for GUNC |
17-
| GTDB-Tk | 2.3.0 | Assigning taxonomy; generating alignments |
18-
| GTDB | r214 | Database for GTDB-Tk |
19-
| Prokka | 1.14.6 | Protein annotation |
20-
| IQ-TREE 2 | 2.2.0.3 | Generating a phylogenetic tree |
21-
| Kraken 2 | 2.1.2 | Generating a kraken database |
22-
| Bracken | 2.6.2 | Generating a bracken database |
23-
| MMseqs2 | 13.45111 | Generating a protein catalogue |
24-
| eggNOG-mapper | 2.1.11 | Protein annotation (eggNOG, KEGG, COG, CAZy) |
25-
| eggNOG DB | 5.0 | Database for eggNOG-mapper |
26-
| Diamond | 2.0.11 | Protein annotation (eggNOG) |
27-
| InterProScan | 5.62-94.0 | Protein annotation (InterPro, Pfam) |
28-
| CRISPRCasFinder | 4.3.2 | Annotation of CRISPR arrays |
29-
| AMRFinderPlus | 3.11.4 | Antimicrobial resistance gene annotation; virulence factors, biocide, heat, acid, and metal resistance gene annotation |
30-
| AMRFinderPlus DB | 3.11 2023-02-23.1 | Database for AMRFinderPlus |
31-
| SanntiS | 0.9.3.2 | Biosynthetic gene cluster annotation |
32-
| Infernal | 1.1.4 | RNA predictions |
33-
| tRNAscan-SE | 2.0.9 | tRNA predictions |
34-
| Rfam | 14.9 | Identification of SSU/LSU rRNA and other ncRNAs |
35-
| Panaroo | 1.3.2 | Pan-genome computation |
36-
| Seqtk | 1.3 | Generating a gene catalogue |
37-
| VIRify | 2.0.0 | Viral sequence annotation |
38-
| [Mobilome annotation pipeline](https://github.com/EBI-Metagenomics/mobilome-annotation-pipeline) | 2.0.0-rc.1 | Mobilome annotation |
39-
| samtools | 1.15 | FASTA indexing |
10+
| Tool/Database | Version | Purpose |
11+
|---------------------------------------------------------|-------------------|----------- |
12+
| CheckM2 | 1.0.1 | Determining genome quality |
13+
| dRep | 3.2.2 | Genome clustering |
14+
| Mash | 2.3 | Sketch for the catalogue; placement of genomes into clusters (update only); strain tree |
15+
| GUNC | 1.0.3 | Quality control |
16+
| GUNC DB | 2.0.4 | Database for GUNC |
17+
| GTDB-Tk | 2.3.0 | Assigning taxonomy; generating alignments |
18+
| GTDB | r214 | Database for GTDB-Tk |
19+
| Prokka | 1.14.6 | Protein annotation |
20+
| IQ-TREE 2 | 2.2.0.3 | Generating a phylogenetic tree |
21+
| Kraken 2 | 2.1.2 | Generating a kraken database |
22+
| Bracken | 2.6.2 | Generating a bracken database |
23+
| MMseqs2 | 13.45111 | Generating a protein catalogue |
24+
| eggNOG-mapper | 2.1.11 | Protein annotation (eggNOG, KEGG, COG, CAZy) |
25+
| eggNOG DB | 5.0.2 | Database for eggNOG-mapper |
26+
| Diamond | 2.0.11 | Protein annotation (eggNOG) |
27+
| InterProScan | 5.62-94.0 | Protein annotation (InterPro, Pfam) |
28+
| CRISPRCasFinder | 4.3.2 | Annotation of CRISPR arrays |
29+
| AMRFinderPlus | 3.11.4 | Antimicrobial resistance gene annotation; virulence factors, biocide, heat, acid, and metal resistance gene annotation |
30+
| AMRFinderPlus DB | 3.11 2023-02-23.1 | Database for AMRFinderPlus |
31+
| SanntiS | 0.9.3.2 | Biosynthetic gene cluster annotation |
32+
| Infernal | 1.1.4 | RNA predictions |
33+
| tRNAscan-SE | 2.0.9 | tRNA predictions |
34+
| Rfam | 14.9 | Identification of SSU/LSU rRNA and other ncRNAs |
35+
| Panaroo | 1.3.2 | Pan-genome computation |
36+
| Seqtk | 1.3 | Generating a gene catalogue |
37+
| VIRify | 2.0.1 | Viral sequence annotation |
38+
| [Mobilome annotation pipeline](https://github.com/EBI-Metagenomics/mobilome-annotation-pipeline) | 2.0.1 | Mobilome annotation |
39+
| samtools | 1.15 | FASTA indexing |
4040

4141
## Setup
4242

@@ -52,12 +52,13 @@ Requirements:
5252
The pipeline needs the following reference databases and configuration files (roughtly ~150G):
5353

5454
- ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/gunc_db_2.0.4.dmnd.gz
55-
- ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/eggnog_db.tgz
55+
- ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/eggnog_db_5.0.2.tgz
5656
- ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/rfam_14.9/
5757
- ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/kegg_classes.tsv
5858
- ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/continent_countries.csv
5959
- https://data.ace.uq.edu.au/public/gtdb/data/releases/release214/214.0/auxillary_files/gtdbtk_r214_data.tar.gz
6060
- ftp://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/3.11/2023-02-23.1
61+
- https://zenodo.org/records/4626519/files/uniref100.KO.v1.dmnd.gz
6162

6263
### Containers
6364

@@ -100,6 +101,7 @@ nextflow run EBI-Metagenomics/genomes-pipeline -c <custom.config> -profile <prof
100101
--ena_genomes_checkm=<path to genomes quality data> \
101102
--mgyg_start=0 \
102103
--mgyg_end=10 \
104+
--preassigned_accessions=<path to file with preassigned accessions if using>
103105
--catalogue_name=zebrafish-faecal \
104106
--catalogue_version="1.0" \
105107
--ftp_name="zebrafish-faecal" \

bin/add_extensions_to_checkm.py

Lines changed: 77 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,77 @@
1+
#!/usr/bin/env python3
2+
3+
# This file is part of MGnify genome analysis pipeline.
4+
#
5+
# MGnify genome analysis pipeline is free software: you can redistribute it and/or modify
6+
# it under the terms of the GNU General Public License as published by
7+
# the Free Software Foundation, either version 3 of the License, or
8+
# (at your option) any later version.
9+
10+
# MGnify genome analysis pipeline is distributed in the hope that it will be useful,
11+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
12+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13+
# GNU General Public License for more details.
14+
15+
# You should have received a copy of the GNU General Public License
16+
# along with MGnify genome analysis pipeline. If not, see <https://www.gnu.org/licenses/>.
17+
18+
19+
import argparse
20+
import os
21+
22+
23+
def main(fasta_file_directory, checkm_directory):
24+
fasta_dict = load_file_list(fasta_file_directory)
25+
checkm_path = os.path.join(checkm_directory, "quality_report.tsv")
26+
assert os.path.isfile(checkm_path), "CheckM2 input doesn't exist"
27+
contents = ""
28+
with open(checkm_path, "r") as file_in:
29+
for line in file_in:
30+
if "Completeness" in line:
31+
contents += line
32+
else:
33+
genome_name = line.split("\t")[0]
34+
genome_with_ext = fasta_dict[genome_name]
35+
line = line.replace(genome_name, genome_with_ext)
36+
contents += line
37+
with open(checkm_path, "w") as file_out:
38+
file_out.write(contents)
39+
40+
41+
def load_file_list(fasta_file_directory):
42+
fasta_dict = dict()
43+
file_list = os.listdir(fasta_file_directory)
44+
for file in file_list:
45+
name = file.rsplit(".", 1)[0]
46+
fasta_dict[name] = file
47+
return fasta_dict
48+
49+
50+
def parse_args():
51+
parser = argparse.ArgumentParser(
52+
description=(
53+
"The script processes CheckM2 output to put the genome file extensions back in."
54+
)
55+
)
56+
parser.add_argument(
57+
"-d",
58+
dest="fasta_file_directory",
59+
required=True,
60+
help="Input directory containing FASTA files",
61+
)
62+
parser.add_argument(
63+
"-i",
64+
dest="checkm_directory",
65+
help=(
66+
"Folder containing output of checkm2"
67+
),
68+
)
69+
return parser.parse_args()
70+
71+
72+
if __name__ == "__main__":
73+
args = parse_args()
74+
main(
75+
args.fasta_file_directory,
76+
args.checkm_directory,
77+
)

bin/annotate_gff.py

Lines changed: 34 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -25,14 +25,21 @@ def get_iprs(ipr_annot):
2525
for line in f:
2626
cols = line.strip().split("\t")
2727
protein = cols[0]
28+
try:
29+
evalue = float(cols[8])
30+
except ValueError:
31+
continue
32+
if evalue > 1e-10:
33+
continue
2834
if protein not in iprs:
2935
iprs[protein] = [set(), set()]
3036
if cols[3] == "Pfam":
3137
pfam = cols[4]
3238
iprs[protein][0].add(pfam)
3339
if len(cols) > 12:
3440
ipr = cols[11]
35-
iprs[protein][1].add(ipr)
41+
if not ipr == "-":
42+
iprs[protein][1].add(ipr)
3643
return iprs
3744

3845

@@ -45,26 +52,37 @@ def get_eggnog(eggnog_annot):
4552
if line.startswith("#"):
4653
eggnog_fields = get_eggnog_fields(line)
4754
else:
55+
try:
56+
evalue = float(cols[2])
57+
except ValueError:
58+
continue
59+
if evalue > 1e-10:
60+
continue
4861
protein = cols[0]
4962
eggnog = [cols[1]]
5063
try:
5164
cog = cols[eggnog_fields["cog_func"]]
52-
cog = cog.split()
65+
cog = list(cog)
5366
if len(cog) > 1:
5467
cog = ["R"]
5568
except Exception:
5669
cog = ["NA"]
5770
kegg = cols[eggnog_fields["KEGG_ko"]].split(",")
58-
eggnogs[protein] = [eggnog, cog, kegg]
71+
go = cols[eggnog_fields["GOs"]]
72+
eggnogs[protein] = [eggnog, cog, kegg, go]
5973
return eggnogs
6074

6175

6276
def get_eggnog_fields(line):
6377
cols = line.strip().split("\t")
78+
try:
79+
index_of_go = cols.index("GOs")
80+
except ValueError:
81+
sys.exit("Cannot find the GO terms column.")
6482
if cols[8] == "KEGG_ko" and cols[15] == "CAZy":
65-
eggnog_fields = {"KEGG_ko": 8, "cog_func": 20}
83+
eggnog_fields = {"KEGG_ko": 8, "cog_func": 20, "GOs": index_of_go}
6684
elif cols[11] == "KEGG_ko" and cols[18] == "CAZy":
67-
eggnog_fields = {"KEGG_ko": 11, "cog_func": 6}
85+
eggnog_fields = {"KEGG_ko": 11, "cog_func": 6, "GOs": index_of_go}
6886
else:
6987
sys.exit("Cannot parse eggNOG - unexpected field order or naming")
7088
return eggnog_fields
@@ -224,6 +242,8 @@ def add_gff(in_gff, eggnog_file, ipr_file, sanntis_file, amr_file):
224242
added_annot[protein]["COG"] = a
225243
elif pos == 3:
226244
added_annot[protein]["KEGG"] = a
245+
elif pos == 4:
246+
added_annot[protein]["Ontology_term"] = a
227247
except Exception:
228248
pass
229249
try:
@@ -257,7 +277,8 @@ def add_gff(in_gff, eggnog_file, ipr_file, sanntis_file, amr_file):
257277
if a == "AMR":
258278
cols[8] = "{};{}".format(cols[8], value)
259279
else:
260-
cols[8] = "{};{}={}".format(cols[8], a, value)
280+
if value != "-":
281+
cols[8] = "{};{}={}".format(cols[8], a, value)
261282
line = "\t".join(cols)
262283
out_gff.append(line)
263284
return out_gff
@@ -376,15 +397,15 @@ def add_ncrnas_and_crispr_to_gff(gff_outfile, ncrnas, crispr_annotations, res):
376397
help="GFF input file",
377398
)
378399
parser.add_argument(
379-
"-e",
380-
dest="eggnong",
381-
help="eggnog annontations for the clutser repo",
400+
"-i",
401+
dest="ips",
402+
help="InterproScan annotations results for the cluster rep",
382403
required=True,
383404
)
384405
parser.add_argument(
385-
"-i",
386-
dest="ips",
387-
help="InterproScan annontations results for the cluster rep",
406+
"-e",
407+
dest="eggnog",
408+
help="eggnog annotations for the cluster repo",
388409
required=True,
389410
)
390411
parser.add_argument(
@@ -414,7 +435,7 @@ def add_ncrnas_and_crispr_to_gff(gff_outfile, ncrnas, crispr_annotations, res):
414435

415436
extended_gff = add_gff(
416437
in_gff=gff,
417-
eggnog_file=args.eggnong,
438+
eggnog_file=args.eggnog,
418439
ipr_file=args.ips,
419440
sanntis_file=args.sanntis,
420441
amr_file=args.amr,

bin/change_extensions.py

Lines changed: 53 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
#!/usr/bin/env python3
2+
3+
# This file is part of MGnify genome analysis pipeline.
4+
#
5+
# MGnify genome analysis pipeline is free software: you can redistribute it and/or modify
6+
# it under the terms of the GNU General Public License as published by
7+
# the Free Software Foundation, either version 3 of the License, or
8+
# (at your option) any later version.
9+
10+
# MGnify genome analysis pipeline is distributed in the hope that it will be useful,
11+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
12+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13+
# GNU General Public License for more details.
14+
15+
# You should have received a copy of the GNU General Public License
16+
# along with MGnify genome analysis pipeline. If not, see <https://www.gnu.org/licenses/>.
17+
18+
19+
import os
20+
import argparse
21+
22+
23+
def change_file_extensions(directory_path):
24+
for filename in os.listdir(directory_path):
25+
file_path = os.path.join(directory_path, filename)
26+
27+
# Check if the path is a file (not a directory)
28+
if os.path.isfile(file_path):
29+
# Split the file name and extension
30+
file_name, file_extension = os.path.splitext(filename)
31+
32+
# Check if the current extension is not "fa" and rename
33+
if file_extension != '.fa':
34+
new_file_name = file_name + '.fa'
35+
new_file_path = os.path.join(directory_path, new_file_name)
36+
os.rename(file_path, new_file_path)
37+
38+
39+
def main():
40+
parser = argparse.ArgumentParser(description='The script changes extensions of genomes in the '
41+
'NCBI folder to .fa.')
42+
parser.add_argument('-i', dest='input_folder', required=True, help='Input folder name where genomes are located.')
43+
44+
args = parser.parse_args()
45+
input_folder = args.input_folder
46+
47+
assert os.path.isdir(input_folder), f"Error: The input folder '{input_folder}' does not exist."
48+
49+
change_file_extensions(input_folder)
50+
51+
52+
if __name__ == '__main__':
53+
main()

bin/checkm2csv.py

Lines changed: 30 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,6 @@
1616
# along with MGnify genome analysis pipeline. If not, see <https://www.gnu.org/licenses/>.
1717

1818

19-
import sys
2019
import argparse
2120

2221
if __name__ == "__main__":
@@ -27,24 +26,39 @@
2726
"-i",
2827
"--input",
2928
dest="input",
30-
help="checkm_results.tab: checkm output log",
29+
help="checkm_results.tab (for CheckM) or quality_report.tsv (for CheckM2)",
3130
required=True,
3231
)
32+
parser.add_argument(
33+
"--checkm2",
34+
action='store_true',
35+
help="Use flag if input is produced by CheckM2; default: False",
36+
default=False,
37+
)
3338

3439
args = parser.parse_args()
35-
36-
print("genome,completeness,contamination,strain_heterogeneity")
40+
41+
if args.checkm2:
42+
print("genome,completeness,contamination")
43+
else:
44+
print("genome,completeness,contamination,strain_heterogeneity")
3745

3846
with open(args.input, "r") as f:
39-
next(f)
40-
for line in f:
41-
if "INFO:" in line:
42-
continue
43-
if "Completeness" in line and "Contamination" in line:
44-
continue
45-
cols = line.strip("\n").split("\t")
46-
genome = cols[0]
47-
complet = cols[-3]
48-
cont = cols[-2]
49-
strain = cols[-1]
50-
print("%s.fa,%s,%s,%s" % (genome, complet, cont, strain))
47+
if args.checkm2:
48+
next(f)
49+
for line in f:
50+
genome, complet, cont = line.split("\t")[:3]
51+
print("{},{},{}".format(genome, complet, cont))
52+
else:
53+
next(f)
54+
for line in f:
55+
if "INFO:" in line:
56+
continue
57+
if "Completeness" in line and "Contamination" in line:
58+
continue
59+
cols = line.strip("\n").split("\t")
60+
genome = cols[0]
61+
complet = cols[-3]
62+
cont = cols[-2]
63+
strain = cols[-1]
64+
print("{},{},{},{}".format(genome, complet, cont, strain))

0 commit comments

Comments
 (0)