Updated Annotation tools using branch of this repo
Youpu-Chen committed Jan 13, 2023
0 parents commit e2896f8
Showing 27 changed files with 985 additions and 0 deletions.
20 changes: 20 additions & 0 deletions LICENSE
@@ -0,0 +1,20 @@
The MIT License (MIT)

Copyright (c) 2022 Youpu-Chen

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
48 changes: 48 additions & 0 deletions README.md
@@ -0,0 +1,48 @@
# plastidUtilis

`plastidUtilis` is a collection of utility modules for the manual annotation of plastid genomes.



## How to get it?

```shell
# pip install biopython
pip install plastidUtilis
```



## How to use it?

Note: in this version the toolkit is only compatible with TBtools, so take care when working with other formats.

```shell
# main utilities
python -m plastidUtilis.Sort -f <Geseq_out_seq> -o <name_of_output> --header <Input_header>
python -m plastidUtilis.AbnormalDetect -a <input_fasta> -o <output_filename>
python -m plastidUtilis.BLAST -d <tRNA_database> -t <sort_tRNA> -o <output.fasta>
python -m plastidUtilis.Table2GBK -t <self-made table file> -o <.gb>
python -m plastidUtilis.Filter -f <input_fasta> -i <minimum_length> -I <maximum_length> -o <output filename>
python -m plastidUtilis.Longest -f <input.fasta> -d <delimiter> -o <output.fasta>

# side utilities
python -m plastidUtilis.Translate -i <sorted_CDS> -o <output_protein_sequence>
python -m plastidUtilis.Stats -i <sorted_CDS> # tabular output is planned for a later version
python -m plastidUtilis.SequenceAppend -f <input_assembly> -i <input_abnormal_bed> -n <the_number_of_extending_bp> -o <output>
```



# Pipeline



# License

MIT License
Copyright (c) 2022 Zihao Huang, Youpu-Chen.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Binary file added dist/plastidUtilis-0.0.1-py3-none-any.whl
Binary file not shown.
Binary file added dist/plastidUtilis-0.0.1.tar.gz
Binary file not shown.
Binary file added dist/plastidUtilis-0.0.2-py3-none-any.whl
Binary file not shown.
Binary file added dist/plastidUtilis-0.0.2.tar.gz
Binary file not shown.
Binary file added dist/plastidUtilis-0.0.3-py3-none-any.whl
Binary file not shown.
Binary file added dist/plastidUtilis-0.0.3.tar.gz
Binary file not shown.
Binary file added dist/plastidUtilis-0.0.4-py3-none-any.whl
Binary file not shown.
Binary file added dist/plastidUtilis-0.0.4.tar.gz
Binary file not shown.
Binary file added dist/plastidUtilis-0.0.5-py3-none-any.whl
Binary file not shown.
Binary file added dist/plastidUtilis-0.0.5.tar.gz
Binary file not shown.
35 changes: 35 additions & 0 deletions pyproject.toml
@@ -0,0 +1,35 @@
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "plastidUtilis"
version = "0.0.5"
authors = [
{ name="Youpu Chen", email="[email protected]" }
]
description = "chloroplast/mitochondrion annotation toolkits"
readme = "README.md"
requires-python = ">=3.8"
dependencies = [
"biopython"
]
classifiers = [ # Optional
    # How mature is this project? Common values are
    # 3 - Alpha
    # 4 - Beta
    # 5 - Production/Stable
    'Development Status :: 3 - Alpha',

    # Indicate who your project is intended for
    'Intended Audience :: Information Technology',
    'Topic :: Scientific/Engineering :: Bio-Informatics',

    # Pick your license as you wish
    'License :: OSI Approved :: MIT License',

    # Specify the Python versions you support here. In particular, ensure
    # that you indicate whether you support Python 2, Python 3 or both.
    'Programming Language :: Python :: 3',
    'Programming Language :: Python :: 3.8',
]
65 changes: 65 additions & 0 deletions src/plastidUtilis.egg-info/PKG-INFO
@@ -0,0 +1,65 @@
Metadata-Version: 2.1
Name: plastidUtilis
Version: 0.0.5
Summary: chloroplast/mitochondrion annotation toolkits
Author-email: Youpu Chen <[email protected]>
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Information Technology
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE

# plastidUtilis

`plastidUtilis` is a collection of utility modules for the manual annotation of plastid genomes.



## How to get it?

```shell
# pip install biopython
pip install plastidUtilis
```



## How to use it?

Note: in this version the toolkit is only compatible with TBtools, so take care when working with other formats.

```shell
# main utilities
python -m plastidUtilis.Sort -f <Geseq_out_seq> -o <name_of_output> --header <Input_header>
python -m plastidUtilis.AbnormalDetect -a <input_fasta> -o <output_filename>
python -m plastidUtilis.BLAST -d <tRNA_database> -t <sort_tRNA> -o <output.fasta>
python -m plastidUtilis.Table2GBK -t <self-made table file> -o <.gb>
python -m plastidUtilis.Filter -f <input_fasta> -i <minimum_length> -I <maximum_length> -o <output filename>
python -m plastidUtilis.Longest -f <input.fasta> -d <delimiter> -o <output.fasta>

# side utilities
python -m plastidUtilis.Translate -i <sorted_CDS> -o <output_protein_sequence>
python -m plastidUtilis.Stats -i <sorted_CDS> # tabular output is planned for a later version
python -m plastidUtilis.SequenceAppend -f <input_assembly> -i <input_abnormal_bed> -n <the_number_of_extending_bp> -o <output>
```



# Pipeline





# License

MIT License
Copyright (c) 2022 Zihao Huang, Youpu-Chen.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
17 changes: 17 additions & 0 deletions src/plastidUtilis.egg-info/SOURCES.txt
@@ -0,0 +1,17 @@
LICENSE
README.md
pyproject.toml
src/plastidUtilis/AbnormalDetect.py
src/plastidUtilis/BLAST.py
src/plastidUtilis/Filter.py
src/plastidUtilis/Longest.py
src/plastidUtilis/SequenceAppend.py
src/plastidUtilis/Sort.py
src/plastidUtilis/Stats.py
src/plastidUtilis/Table2GBK.py
src/plastidUtilis/Translate.py
src/plastidUtilis.egg-info/PKG-INFO
src/plastidUtilis.egg-info/SOURCES.txt
src/plastidUtilis.egg-info/dependency_links.txt
src/plastidUtilis.egg-info/requires.txt
src/plastidUtilis.egg-info/top_level.txt
1 change: 1 addition & 0 deletions src/plastidUtilis.egg-info/dependency_links.txt
@@ -0,0 +1 @@

1 change: 1 addition & 0 deletions src/plastidUtilis.egg-info/requires.txt
@@ -0,0 +1 @@
biopython
1 change: 1 addition & 0 deletions src/plastidUtilis.egg-info/top_level.txt
@@ -0,0 +1 @@
plastidUtilis
58 changes: 58 additions & 0 deletions src/plastidUtilis/AbnormalDetect.py
@@ -0,0 +1,58 @@
'''Detect abnormal CDS sequences annotated by GeSeq.
Note: the input sequences must be protein (amino acid) sequences.
'''
import optparse


def AbnormalDetect(input, output):
    aa_dict = {}
    aa_file = input
    out_file = output
    # Read the protein FASTA into {header: [sequence lines]}
    with open(aa_file) as f:
        for line in f:
            if line.startswith(">"):
                aa_header = line.strip()
                aa_dict[aa_header] = []
            else:
                aa_dict[aa_header].append(line.strip())

    # Join the sequence lines of each record into one string
    for k, v in aa_dict.items():
        aa_dict[k] = "".join(v)

    normal_dict = {}
    abnormal_dict = {}

    for k, v in aa_dict.items():
        # The gene name is the first token of the header, without ">"
        name = k.split()[0][1:]
        # normal: exactly one stop codon (*), located at the end of the sequence
        if v.endswith("*") and len(v.split("*")) == 2:
            normal_dict[name] = v
        else:
            # abnormal: missing terminal stop or internal stop codon(s)
            abnormal_dict[name] = v

    with open(out_file, "w") as out:
        for k in abnormal_dict:
            if k in normal_dict:
                # a normal copy of the same gene also exists
                out.write(k + "\t" + "copy_abnormal" + "\n")
            else:
                out.write(k + "\n")


def main():
    usage = """python -m plastidUtilis.AbnormalDetect -a <input_fasta> -o <output_filename>
--Joe"""
    parser = optparse.OptionParser(usage)
    parser.add_option("-a", dest="aa", help="input protein fasta",
                      metavar="FILE", action="store", type="string")
    parser.add_option("-o", dest="output", help="output filename",
                      metavar="OUT", action="store", type="string")
    (options, args) = parser.parse_args()

    AbnormalDetect(options.aa, options.output)


if __name__ == "__main__":
    main()
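
The normal/abnormal rule in `AbnormalDetect` can be looked at in isolation. The sketch below is not part of this commit: `classify_protein` is a hypothetical helper name, and it assumes that translated sequences mark stop codons with `*`, as the module does.

```python
def classify_protein(seq: str) -> str:
    """Return "normal" when the only stop codon (*) is the last residue."""
    # Splitting on "*" gives exactly two pieces (body and "") when there is
    # a single stop codon and it sits at the very end of the sequence.
    if seq.endswith("*") and len(seq.split("*")) == 2:
        return "normal"
    return "abnormal"


if __name__ == "__main__":
    for example in ("MKT*", "MK*T*", "MKT"):
        print(example, classify_protein(example))
```

A sequence with an internal stop (`MK*T*`) or without a terminal stop (`MKT`) falls into the abnormal group, which is what the module writes to its output file.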
91 changes: 91 additions & 0 deletions src/plastidUtilis/BLAST.py
@@ -0,0 +1,91 @@
'''Run a locally installed BLAST to align sequences with preset parameters.'''
import optparse
import os


def RunBLAST(INPUT, BLASTDATABASE, OUTPUT):
    trn_dict = {}
    # Read the tRNA FASTA; sequence lines of duplicated headers go to "trash"
    with open(INPUT) as f:
        trn_dict["trash"] = []
        for line in f:
            if line.startswith(">"):
                # Tabs in the header become "@" so the whole header survives as the BLAST query id
                seqname = line.strip().replace("\t", "@")[1:]
                if seqname not in trn_dict:
                    trn_dict[seqname] = []
                else:
                    seqname = "trash"
            else:
                trn_dict[seqname].append(line.strip())

    del trn_dict["trash"]
    with open("temp_tRNA.fasta", "w") as t:
        for k, v in trn_dict.items():
            t.write(">" + k + "\n")
            t.write("".join(v) + "\n")

    print(len(trn_dict))
    os.system("makeblastdb -in %s -dbtype nucl -out %s" % (BLASTDATABASE, BLASTDATABASE))
    # Fixed: the original referenced the undefined name database_file here
    os.system(
        "blastn -query temp_tRNA.fasta -db %s -out blast.txt -perc_identity 90 -max_target_seqs 1 -subject_besthit -num_threads 6 -outfmt '6 qseqid qlen sseqid slen pident length score' " % BLASTDATABASE)
    os.remove("temp_tRNA.fasta")

    blast_dict = {}
    with open("blast.txt") as b:
        for line in b:
            qseqid, qlen, sseqid, slen, pident, length, score = line.split()[:7]
            # length constraint: query and subject lengths differ by less than 10%
            if 0.9 < int(qlen) / int(slen) < 1.1:
                # alignment length constraint; pident is a percentage (0-100),
                # so the threshold is 90 rather than the original 0.90
                if int(length) / int(slen) > 0.9 and float(pident) >= 90:
                    blast_dict[qseqid] = sseqid.split("-")[1] + "-" + sseqid.split("-")[2][:3]

    with open(OUTPUT, "w") as out:
        for k, v in blast_dict.items():
            # Header fields: old id, chromosome, start, end, strand, length
            old_id, chrom, start, end, chain, length = k.split("@")[:6]
            new_id = v
            # A span longer than the sequence length indicates an intron
            if int(end) - int(start) + 1 == int(length):
                info = [new_id, chrom, start, end, chain, length]
            else:
                info = [new_id, chrom, start, end, chain, length, "intron"]
            out.write(">" + "\t".join(info) + "\n")
            out.write("".join(trn_dict[k]) + "\n")


def main():
    usage = """python -m plastidUtilis.BLAST -d <tRNA_database> -t <sort_tRNA> -o <output.fasta>
--Joe"""
    parser = optparse.OptionParser(usage)
    parser.add_option("-d", dest="tRNA_database", help="tRNA blast database",
                      metavar="FILE", action="store", type="string")
    parser.add_option("-t", dest="sort_tRNA", help="sort_tRNA.fasta",
                      metavar="FILE", action="store", type="string")
    parser.add_option("-o", dest="output", help="output file name",
                      metavar="OUT", action="store", type="string")
    (options, args) = parser.parse_args()

    database_file = options.tRNA_database
    tRNA_file = options.sort_tRNA
    out_file = options.output

    RunBLAST(tRNA_file, database_file, out_file)


if __name__ == "__main__":
    main()
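
The hit filtering in `RunBLAST` can be tested without running BLAST. The sketch below is an illustration rather than code from this commit: `passes_filters` and its parameter names are hypothetical, the example hit line is made up, and it assumes identity is expressed as a percentage (0-100), as in BLAST tabular output.

```python
def passes_filters(line, min_ratio=0.9, max_ratio=1.1,
                   min_coverage=0.9, min_identity=90.0):
    """Apply length, coverage and identity thresholds to one tabular BLAST hit.

    Expects the columns: qseqid qlen sseqid slen pident length score.
    """
    qseqid, qlen, sseqid, slen, pident, length, score = line.split()
    length_ratio = int(qlen) / int(slen)   # query length relative to subject length
    coverage = int(length) / int(slen)     # aligned fraction of the subject
    return (min_ratio < length_ratio < max_ratio
            and coverage > min_coverage
            and float(pident) >= min_identity)  # pident is a percentage (0-100)


if __name__ == "__main__":
    # Made-up hit line, for illustration only
    hit = "trnA@chr1@10@81@+@72\t72\tdb-trnA-UGC\t72\t98.61\t72\t130"
    print(passes_filters(hit))
```

Keeping the thresholds as parameters makes it easy to check how many hits are lost when a cutoff is tightened.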
44 changes: 44 additions & 0 deletions src/plastidUtilis/Filter.py
@@ -0,0 +1,44 @@
'''
Description: filter out tRNA sequences in the given .fa file whose length is outside the chosen range
Note: set "-i 64" in the general case.
'''
import optparse

usage = """python -m plastidUtilis.Filter -f <input.fasta> -i <minimum_length> -I <maximum_length> -o <output filename>
--Joe"""

parser = optparse.OptionParser(usage)
parser.add_option("-f", dest="input_file", help="fasta",
                  metavar="FILE", action="store", type="string")
parser.add_option("-i", dest="min", help="minimum length",
                  metavar="STRING", action="store", type="string")
parser.add_option("-I", dest="max", help="maximum length",
                  metavar="STRING", action="store", type="string")
parser.add_option("-o", dest="output", help="output file",
                  metavar="FILE", action="store", type="string")
(options, args) = parser.parse_args()

# Read the FASTA; sequence lines of duplicated headers are collected under "dump" and discarded
fa_dict = {}
fa_dict["dump"] = []
with open(options.input_file) as f:
    for line in f:
        if line.startswith(">"):
            header = line.strip()
            if header not in fa_dict:
                fa_dict[header] = []
            else:
                header = "dump"
        else:
            fa_dict[header].append(line.strip())
del fa_dict["dump"]

# Keep only sequences strictly between the minimum and maximum length
with open(options.output, "w") as out:
    for k, v in fa_dict.items():
        fa_dict[k] = "".join(v)
        if int(options.min) < len(fa_dict[k]) < int(options.max):
            out.write(k + "\n")
            out.write(fa_dict[k] + "\n")
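
The same filtering logic can be written as importable functions, which makes it testable without command-line arguments. This is a minimal sketch under that assumption, not the module's actual API: `read_fasta` and `filter_by_length` are hypothetical names, and only the standard library is used.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}, keeping the first copy of a repeated header."""
    chunks: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A repeated header sets current to None so its sequence lines are skipped
            current = line if line not in chunks else None
            if current is not None:
                chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {header: "".join(parts) for header, parts in chunks.items()}


def filter_by_length(records: Dict[str, str], min_len: int, max_len: int) -> Dict[str, str]:
    """Keep records whose length is strictly between min_len and max_len."""
    return {h: s for h, s in records.items() if min_len < len(s) < max_len}


if __name__ == "__main__":
    demo = [">a", "ACGT", "AC", ">b", "AC", ">a", "TTTTTTTT"]
    print(filter_by_length(read_fasta(demo), 3, 10))
```

Both bounds are exclusive, matching the comparison used in `Filter.py`.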


