-
Notifications
You must be signed in to change notification settings - Fork 39
Tutorial
Here we have a specimen called DM01
, with its fastq files DM01_1.fastq.gz
and DM01_2.fastq.gz
(which are called raw data):
$ mkdir -p /share/pool/guanliang/test/DM01
$ cd /share/pool/guanliang/test/DM01
$ ls -hrtl
-rw-rw-r-- 1 gmeng 5.6G Jun 29 00:50 DM01_2.fastq.gz
-rw-rw-r-- 1 gmeng 4.9G Jun 29 00:50 DM01_1.fastq.gz
-rw-rw-r-- 1 gmeng 416 Jun 29 01:03 all.sh
The raw data
-
Can be paired-end (PE) or single-end (SE)
- SE data does not work with the
--assembler spades
option. - If you are using PE data but your read 1 and read 2 files are not properly paired, the assemblers and BWA will complain because you are going to use the wrong pair-end information for scaffolding, etc. To fix this problem, you can: (1) use some tools (see here) to fix your read 1 and read 2 files before using MitoZ. (2) treat your data as SE data by using only the
-fq1
but not the-fq2
option in MitoZ. See this for more information.
- SE data does not work with the
-
Generated typically by Illumina or BGISEQ
-
Typically from whole-genome shotgun (WGS) sequencing of the total DNA of a single species, but I suppose long-range PCR + NGS data is also great.
-
Target-enrichment data or transcriptome data might also work, but the success rate might not be as high as DNA data.
-
Fastq data of pooled multiple species might also work since MitoZ will output any sequences with >= 5 protein-coding genes (PCGs)
-
You do NOT need to trim and filter the raw data by yourself, just provide MitoZ with them directly (you can do it of course, but it may be just wasting your time).
-
In MitoZ ≥3.5, we use
--data_size_for_mt_assembly <float1>,<float2>
format. The float1 means the size (Gbp) of raw data to be subsampled, while the float2 means the size of clean data should be >= float2 Gbp, otherwise MitoZ will STOP running. When only float1 is set, float2 is assumed to be 0. (1) Set float1 to be0
if you want to use ALL raw data; (2) Set0,0
if you want to use ALL raw data and do NOT interrupt MitoZ even if you got very little clean data. -
If you want to skip the filtering step, set
--skip_filter --data_size_for_mt_assembly <float1>,<float2>
at the same time, MitoZ will subsample<float2>
Gbp of the input clean fastq data (<float1>
is useless here).
-
Have a look at the command we are going to use:
$ cat all.sh
source activate /share/pool/guanliang/soft/mitoz3.6
mitoz all \
--outprefix DM01 \
--thread_number 16 \
--clade Chordata \
--genetic_code 2 \
--species_name "Homo sapiens" \
--fq1 DM01_1.fastq.gz \
--fq2 DM01_2.fastq.gz \
--fastq_read_length 151 \
--data_size_for_mt_assembly 3,0 \
--assembler megahit \
--kmers_megahit 59 79 99 \
--memory 50 \
--requiring_taxa Chordata
# WARNING: it is better NOT to use the default kmers of `--kmers_megahit`!
# Instead, try larger kmers FIRST!
# If they do not work,
# then try smaller Kmers, e.g. `--kmers_megahit 43 71 99`
# Small kmers can lead to wrong results, e.g. quite a long mitogenome but low sequencing coverage in some regions.
-
Change the values of
--clade
and--genetic_code
according to your sample. -
You can set some values for
--species_name
, or just do not set this option. This option is only used when generating a GenBank file. -
If your
DM01_1.fastq.gz
andDM01_2.fastq.gz
are not in the current directory (i.e./share/pool/guanliang/test/DM01
), you need to specify the absolute paths or relative paths for them, e.g.--fq1 /path/to/DM01_1.fastq.gz --fq2 /path/to/DM01_2.fastq.gz
. -
Set the read length of your fastq data with
--fastq_read_length
, here my read length is151
. -
Tell MitoZ to use only 3G bp of raw fastq data for mitogenome assembly:
--data_size_for_mt_assembly 3,0
, or--data_size_for_mt_assembly 3
. The default value is2,0
. -
Here we use Megahit for assembly:
--assembler megahit
. -
We set kmers for Megahit:
--kmers_megahit 59 79 99 119 141
, which can avoid possible problems here: https://github.com/linzhi2013/MitoZ/wiki/Known-issues#8-megahit-gets-very-long-sequences and https://github.com/linzhi2013/MitoZ/issues/172#issuecomment-1608575634 . The default value is--kmers_megahit 21 29 39 59 79 99 119 141
, which can cause wrong mitogenomes sometimes, therefore, I would recommend trying larger kmers first, and if they do not work, you can try smaller kmers, say--kmers_megahit 39 59 79 99 119 141
. More kmers will take longer running time. I will update the default Kmer values in the next release. Tips: the kmer values should be smaller than your maximum read length. -
If you are going to use Spades for the assembly, make sure that your fq1 and fq2 files are properly paired. Otherwise, the Spades will not work. See here for more details and solutions.
-
And we ask Megahit to use a maximum of 50 GB RAM:
--memory 50
. However, if your server does not have enough RAM, you might get some errors like https://github.com/linzhi2013/MitoZ/issues/174#issuecomment-1544089254. -
Finally, set a required taxonomic name (e.g. order, class, family, genus) for filtering out non-target-taxon mitochondrial sequences:
--requiring_taxa Chordata
. Here, my specimen belongs toChordata
.
Because I used the SGE cluster to run the job, I submitted the job with:
$ qsub -cwd -l vf=50g -q small.q,medium.q,large.q -pe smp 16 -N DM01 all.sh
The above command will print out a JOBID, with which you can check how much RAM MitoZ is using from time to time (only if you want):
$ qstat -j JOBID
However, if you are going to run MitoZ locally, do:
$ nohup sh all.sh >m.log 2>m.err &
You can then use the top
or htop
(https://anaconda.org/conda-forge/htop) commands to check how much RAM MitoZ is using from time to time.
And wait for the running to finish.
Why do we check how much RAM MitoZ is using? Just to make sure if your servers have enough resources for it, and to know how many specimens you can run at the same time (especially if you run MitoZ locally). See also https://github.com/linzhi2013/MitoZ/wiki/Installation#4-about-the-thread-number-and-data-size.
When the run finished, I got:
$ ls -lhrt
total 11G
-rw-rw-r-- 1 gmeng 5.6G Jun 29 00:50 DM01_2.fastq.gz
-rw-rw-r-- 1 gmeng 4.9G Jun 29 00:50 DM01_1.fastq.gz
-rw-rw-r-- 1 gmeng 416 Jun 29 01:03 all.sh
drwxrwxr-x 2 gmeng 282 Jun 29 02:33 clean_data
drwxrwxr-x 4 gmeng 60 Jun 29 05:49 mt_assembly
drwxrwxr-x 4 gmeng 119 Jun 29 05:54 mt_annotation
-rw-r--r-- 1 gmeng 6.5K Jun 29 06:02 all.sh.o563317
-rw-rw-r-- 1 gmeng 45K Jun 29 06:02 mitoz.log
drwxrwxr-x 4 gmeng 91 Jun 29 06:02 DM01.result
-rw-r--r-- 1 gmeng 3.3M Jun 29 06:02 all.sh.e563317
The final resulting files are in DM01.result
. See the 5. The final resulting files section and sections after it below for more details.
I will explain these files step by step.
The clean_data directory:
$ ls -lhrt clean_data/
total 20G
-rw-rw-r-- 1 gmeng gmeng 137K Jun 29 02:28 DM01.QC.json
-rw-rw-r-- 1 gmeng gmeng 483K Jun 29 02:28 DM01.QC.html
-rw-rw-r-- 1 gmeng gmeng 4.7G Jun 29 02:28 DM01.clean_R1.fq.gz
-rw-rw-r-- 1 gmeng gmeng 5.3G Jun 29 02:28 DM01.clean_R2.fq.gz
-rw-rw-r-- 1 gmeng gmeng 0 Jun 29 02:28 FastpTask.o.log
-rw-rw-r-- 1 gmeng gmeng 0 Jun 29 02:28 FastpTask.e.log
-rw-rw-r-- 1 gmeng gmeng 0 Jun 29 02:28 FastpTask.done
-rw-rw-r-- 1 gmeng gmeng 4.5G Jun 29 02:33 sampling_3.0GB.clean_R1.fq.gz
-rw-rw-r-- 1 gmeng gmeng 5.2G Jun 29 02:37 sampling_3.0GB.clean_R2.fq.gz
In the all.sh
script, we provided MitoZ with the raw data files DM01_1.fastq.gz
and DM01_2.fastq.gz
, and MitoZ used the fastp
program to clean them, the clean data files are DM01.clean_R1.fq.gz
and DM01.clean_R2.fq.gz
.
You can have a look at the filter report files DM01.QC.json
and DM01.QC.html
if you like.
In the all.sh
script, we set --data_size_for_mt_assembly 3
, so MitoZ extracted 3G bp data for us automatically from the clean data files, they are sampling_3.0GB.clean_R1.fq.gz
and sampling_3.0GB.clean_R2.fq.gz
, which were used for subsequent mitogenome assembly.
The mt_assembly directory:
$ ls -lhrt mt_assembly
total 8.0K
drwxrwxr-x 3 gmeng gmeng 4.0K Jun 29 05:50 megahit
drwxrwxr-x 2 gmeng gmeng 4.0K Jun 29 05:50 DM01.megahit.result
In the all.sh
script, we asked MitoZ to use Megahit to perform assembly (--assembler megahit
), so MitoZ created a subdirectory called megahit
here.
Temporary files for getting the mitochondrial genomes are in the mt_assembly/megahit/
directory:
$ ls -lhrt mt_assembly/megahit/
total 74M
drwxrwxr-x 3 gmeng gmeng 195 Jun 29 05:37 megahit_out
-rw-rw-r-- 1 gmeng gmeng 102K Jun 29 05:40 DM01.megahit.hmmout
-rw-rw-r-- 1 gmeng gmeng 6.6K Jun 29 05:40 DM01.megahit.hmmtblout
-rw-rw-r-- 1 gmeng gmeng 5.5K Jun 29 05:40 DM01.megahit.hmmtblout.besthit
-rw-rw-r-- 1 gmeng gmeng 1.8K Jun 29 05:40 DM01.megahit.hmmtblout.besthit.sim
-rw-rw-r-- 1 gmeng gmeng 272K Jun 29 05:40 DM01.megahit.hmmtblout.besthit.sim.fa
-rw-rw-r-- 1 gmeng gmeng 20K Jun 29 05:40 DM01.megahit.hmmtblout.besthit.sim.fa.ndb
-rw-rw-r-- 1 gmeng gmeng 80 Jun 29 05:40 DM01.megahit.hmmtblout.besthit.sim.fa.nto
-rw-rw-r-- 1 gmeng gmeng 16K Jun 29 05:40 DM01.megahit.hmmtblout.besthit.sim.fa.ntf
-rw-rw-r-- 1 gmeng gmeng 236 Jun 29 05:40 DM01.megahit.hmmtblout.besthit.sim.fa.not
-rw-rw-r-- 1 gmeng gmeng 15K Jun 29 05:43 DM01.megahit.hmmtblout.besthit.sim.fa.solar.genewise.gff.cds.position.cds
-rw-rw-r-- 1 gmeng gmeng 6.7K Jun 29 05:43 DM01.megahit.hmmtblout.besthit.sim.fa.solar.genewise.gff.pep
-rw-rw-r-- 1 gmeng gmeng 2.4K Jun 29 05:43 DM01.megahit.hmmtblout.besthit.sim.fa.solar.genewise.gff.cds.position.cds.taxa
-rw-rw-r-- 1 gmeng gmeng 1.5K Jun 29 05:43 DM01.megahit.hmmtblout.besthit.sim.fa.solar.genewise.gff.cds.position.cds.taxa.sorted.uniq
-rw-rw-r-- 1 gmeng gmeng 14K Jun 29 05:43 DM01.megahit.hmmtblout.besthit.sim.fa.solar.genewise.gff.cds.position.cds.filtered
-rw-rw-r-- 1 gmeng gmeng 23K Jun 29 05:43 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fa
-rw-rw-r-- 1 gmeng gmeng 884 Jun 29 05:43 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa
-rw-rw-r-- 1 gmeng gmeng 73M Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fa.bwa.bam
-rw-rw-r-- 1 gmeng gmeng 490K Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fa.depth
-rw-rw-r-- 1 gmeng gmeng 23K Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fa.abun.fa
-rw-rw-r-- 1 gmeng gmeng 97 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance
-rw-rw-r-- 1 gmeng gmeng 787 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.high_abundance_10.0X
-rw-rw-r-- 1 gmeng gmeng 548 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance.fasta
-rw-rw-r-- 1 gmeng gmeng 407 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like
-rw-rw-r-- 1 gmeng gmeng 463 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted
-rw-rw-r-- 1 gmeng gmeng 219 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.dotted
-rw-rw-r-- 1 gmeng gmeng 388 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked
-rw-rw-r-- 1 gmeng gmeng 166 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked.stat
-rw-rw-r-- 1 gmeng gmeng 17K Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked.fa
-rw-rw-r-- 1 gmeng gmeng 5.9K Jun 29 05:50 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked.fa
-rw-rw-r-- 1 gmeng gmeng 75 Jun 29 05:50 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked
-rw-rw-r-- 1 gmeng gmeng 17K Jun 29 05:50 DM01.megahit.mitogenome.fa
-rw-rw-r-- 1 gmeng gmeng 191 Jun 29 05:50 DM01.megahit.overlap_information
-rw-rw-r-- 1 gmeng gmeng 17K Jun 29 05:50 DM01.megahit.start2end_for-circular-mt-only
In this directory, somes of the most useful files could be mt_assembly/megahit/DM01.megahit.hmmtblout.besthit.sim
and mt_assembly/megahit/DM01.megahit.hmmtblout.besthit.sim.fa
, which are the initial resulting files output by HMMER, and can be used for checking cross-contamination in your big projects and search for other mitogenomes if your dataset contains mitogenomes of multiple species. See https://github.com/linzhi2013/MitoZ/wiki/Some-important-intermediate-files.
Other subsequent sequence files: DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fa
, DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fa.abun.fa
, DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked.fa
and DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked.fa
.
Meanwhile, MitoZ collected the import resulting files to the DM01.megahit.result
directory. Have a look at it:
$ ls -lhrt mt_assembly/DM01.megahit.result/
total 64K
-rw-rw-r-- 1 gmeng gmeng 97 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance
-rw-rw-r-- 1 gmeng gmeng 548 Jun 29 05:49 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance.fasta
-rw-rw-r-- 1 gmeng gmeng 75 Jun 29 05:50 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked
-rw-rw-r-- 1 gmeng gmeng 5.9K Jun 29 05:50 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked.fa
-rw-rw-r-- 1 gmeng gmeng 17K Jun 29 05:50 DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked.fa
-rw-rw-r-- 1 gmeng gmeng 17K Jun 29 05:50 DM01.megahit.mitogenome.fa
-rw-rw-r-- 1 gmeng gmeng 191 Jun 29 05:50 DM01.megahit.overlap_information
MitoZ uses the HMMER program to fish out the candidate mitochondrial sequences, and then further filter them by the taxonomic assignment, sequences belonging to non-target-taxon are discarded (--requiring_taxa Chordata
). Next, MitoZ separates the remained sequences by the abundance information, low-coverage sequences are output to DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance
and DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance.fasta
, in case the users want to search for some missing genes from them.
$ cat mt_assembly/DM01.megahit.result/DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.low_abundance
k141_626356 COX3 658 784 506 378 506 - abun=1.0000
k141_626356 ND3 1 307 308 1 506 - abun=1.0000
For the high-coverage sequences, MitoZ further picks the sequences based on the completeness of protein-coding genes (PCGs). So we got the files DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked.fa
vs. DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked
and DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked.fa
.
$ cat mt_assembly/DM01.megahit.result/DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.Not-picked
>k141_402622 abun=10.9365 length=5830 score=3.835
ND5 3319 3805 20 588 + 4
Although this sequence (k141_402622
) has high coverage of 10.9365
, but there better sequences:
$ cat mt_assembly/megahit/DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked
>k141_250555 abun=1110.0000 length=16594 score=19535.356
ND3 1 309 1 308 - 4
COX3 379 1163 1 784 - 4
ATP6 1163 1846 1 683 - 4
ATP8 1837 2004 1 160 - 4
COX2 2086 2769 1 683 - 4
COX1 2927 4482 1 1542 - 4
ND2 4885 5928 1 1034 - 4
ND1 6139 7113 11 971 - 4
CYTB 10887 12027 1 1138 - 4
ND6 12099 12620 1 509 + 4
ND5 12615 14454 12 1828 - 4
ND4 14674 16054 1 1370 - 4
ND4L 16048 16344 2 297 - 4
The digit 4
of the last column indicates this gene is complete compared to a defined length for the gene (For me, the definition file is /share/pool/guanliang/soft/mitoz3.4/lib/python3.8/site-packages/mitoz/profiles/CDS_HMM/Chordata_CDS_length_list
)
Finally, MitoZ checks if the picked mitogenome (DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked.fa
) is complete:
$ grep '>' mt_assembly/DM01.megahit.result/DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fasta-like.sorted.picked.fa
>k141_250555 abun=1110.0000 length=16594 score=19535.356
If there are overlapping between the 5'-end and 3'-end of the sequence, MitoZ will remove the overlapping region, generating two final resulting files for the assembly step:
$ ls -lhrt mt_assembly/DM01.megahit.result/DM01.megahit.overlap_information mt_assembly/DM01.megahit.result/DM01.megahit.mitogenome.fa
-rw-rw-r-- 1 gmeng 17K Jun 29 05:50 mt_assembly/DM01.megahit.result/DM01.megahit.mitogenome.fa
-rw-rw-r-- 1 gmeng 191 Jun 29 05:50 mt_assembly/DM01.megahit.result/DM01.megahit.overlap_information
The DM01.megahit.mitogenome.fa
file is the mitochondrial genome file, which will be used for subsequent analysis.
$ grep '>' mt_assembly/DM01.megahit.result/DM01.megahit.mitogenome.fa
>k141_250555 topology=circular
The DM01.megahit.overlap_information
shows the overlapping information:
$ cat mt_assembly/DM01.megahit.result/DM01.megahit.overlap_information
>k141_250555 overlap between 5' and 3' are 142bp
AAGCCCGAGGGTTAGAAGTACTAGAATAGCTGATGCTCAAAAGAATGTTGTTACAGGAGAAGGGAGTTGGTCTCCTCAAGGGAGAGGAAGTAATAAGGCAATTTCTAGGTCAAACAAGAGAAATAGGATCGCAACTAGGAA
Because the overlapping region is quite long (142bp
), and the sequence content is not simple repeats, we have high confidence in saying the mitogenome is complete. See also https://github.com/linzhi2013/MitoZ/wiki/Known-issues#1-circularity-problem-and-its-solution and https://github.com/linzhi2013/MitoZ/wiki/Known-issues#2-simple-repeats-found-in-the-overlapping-region.
Besides, you can have a look at the sequencing depth along with the sequences:
$ head mt_assembly/megahit/DM01.megahit.hmmtblout.besthit.sim.filtered-by-taxa.fa.depth
k141_402622 1 38
k141_402622 2 38
k141_402622 3 38
k141_402622 4 39
k141_402622 5 39
k141_402622 6 40
k141_402622 7 40
k141_402622 8 40
k141_402622 9 40
k141_402622 10 40
The header's meaning: seqid position_of_the_sequence coverage
.
If there are some very high or very low coverage, which may indicate there are Ns or repeats and need further check. The completeness will be further verified in the final visualization step, from the SVG/PNG files you can see the coverage distribution clearly (see later section).
The mt_annotation directory:
$ ls -hlrt mt_annotation/
total 8.0K
drwxrwxr-x 3 gmeng gmeng 4.0K Jun 29 05:54 tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa
drwxrwxr-x 2 gmeng gmeng 4.0K Jun 29 06:02 DM01.DM01.megahit.mitogenome.fa.result
Let's have a look at the mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa
first:
$ ls -lhrt mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa
total 404K
-rw-rw-r-- 1 gmeng 17K Jun 29 05:50 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa
-rw-rw-r-- 1 gmeng 20 Jun 29 05:50 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.not
-rw-rw-r-- 1 gmeng 20K Jun 29 05:50 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.ndb
-rw-rw-r-- 1 gmeng 8 Jun 29 05:50 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.nto
-rw-rw-r-- 1 gmeng 16K Jun 29 05:50 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.ntf
-rw-rw-r-- 1 gmeng 13K Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.solar.genewise.gff.cds.position.cds
-rw-rw-r-- 1 gmeng 5.4K Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.solar.genewise.gff.pep
-rw-rw-r-- 1 gmeng 503 Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.position
-rw-rw-r-- 1 gmeng 1.5K Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.solar.genewise.gff.cds.position.cds.taxa
-rw-rw-r-- 1 gmeng 29 Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.most_related_species.txt
-rw-rw-r-- 1 gmeng 503 Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.position.sorted
-rw-rw-r-- 1 gmeng 388 Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.position.sorted.revised
-rw-rw-r-- 1 gmeng 378 Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.position.sorted.revised.filtered
-rw-rw-r-- 1 gmeng 1.5K Jun 29 05:51 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.ft
-rw-rw-r-- 1 gmeng 1.4K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.trna
-rw-rw-r-- 1 gmeng 1.6K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.trna.ft
-rw-rw-r-- 1 gmeng 1.4K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.s-rRNA.tbl
-rw-rw-r-- 1 gmeng 11K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.s-rRNA.out
-rw-rw-r-- 1 gmeng 95 Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.s-rRNA.ft
-rw-rw-r-- 1 gmeng 1.5K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.l-rRNA.tbl
-rw-rw-r-- 1 gmeng 17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.l-rRNA.out
-rw-rw-r-- 1 gmeng 95 Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.l-rRNA.ft
-rw-rw-r-- 1 gmeng 77 Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.control_region.ft
-rw-rw-r-- 1 gmeng 3.2K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.tbl
-rw-rw-r-- 1 gmeng 17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.fsa
-rw-rw-r-- 1 gmeng 79K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.sqn
-rw-rw-r-- 1 gmeng 36K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf
-rw-rw-r-- 1 gmeng 494 Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.val
-rw-rw-r-- 1 gmeng 79 Jun 29 05:54 errorsummary.val
-rw-rw-r-- 1 gmeng 4.9K Jun 29 05:54 summary.txt
-rw-rw-r-- 1 gmeng 17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.gene.fasta
-rw-rw-r-- 1 gmeng 12K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds.fasta
-rw-rw-r-- 1 gmeng 2.6K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.trna.fasta
-rw-rw-r-- 1 gmeng 2.7K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.rrna.fasta
-rw-rw-r-- 1 gmeng 17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.fasta
-rw-rw-r-- 1 gmeng 4.3K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds_translation.fasta
drwxrwxr-x 2 gmeng 4.0K Jun 29 06:02 mt_visualization
The input mitogenome file is renamed as mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa
.
The initial PCG annotation result file:
$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.solar.genewise.gff.cds.position.cds`
>gi_NC_004387_ATP8_Homo_sapiens_55_aa_Ref1:55aa [mRNA] locus=k141_250555:1840:2004:-
ATGCCCCAGTTAAATCCCGCCCCCTGATTTGCTATTATAATCTTTTCATG
ACTAATCTTTTTAACAGTTATTCCACCTAAAGTCTTAGCTCACCACTTTC
CTAATGAACCTACCCCCCAAAACGTAAAAAAATCAAAATCAGAAACCTGA
ACCTGACCATGACTG
>gi_NC_004387_ND2_Homo_sapiens_350_aa_Ref1:347aa [mRNA] locus=k141_250555:4891:5928:-
ATGAATCCTTACGTCCTCTCAGCCTTACTAATAGGCCTAGGCCTTGGAAC
CACAGTTACATTTGCTAGCTCCCACTGGTTATTAGCCTGGATAGGCCTTG
AAATAAACACACTGGCCATCCTCCCTCTCATAGCACAACACCACCACCCC
CGAGCCATTGAAGCTACAACCAAATACTTCTTAATTCAAGCATCCGCTGC
CGCTACCATCTTGTTTGCCAGCTCAACAAATGCCTGACTGTCCGGACAAT
GAGAAATTCAACAAATTACTCACCCCCTTCCCATTGTTATAATCACCATT
...
The CDS feature table:
$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.ft
>Feature k141_250555
309 >1 gene
gene ND3
309 >1 CDS
product NADH dehydrogenase subunit 3
transl_table 2
1163 379 gene
gene COX3
1163 379 CDS
product cytochrome c oxidase subunit III
transl_table 2
note TAA stop codon is completed by the addition of 3' Aresidues to the mRNA
...
The tRNA feature table:
$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.trna.ft
>Feature k141_250555
379 310 gene
gene trnG(ucc)
379 310 tRNA
product tRNA-Gly
2078 2006 gene
gene trnK(uuu)
...
The s-rRNA feature table:
$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.s-rRNA.ft
>Feature k141_250555
9891 8945 gene
gene s-rRNA
9891 8945 rRNA
product 12S ribosomal RNA
The l-rRNA feature table:
$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.l-rRNA.ft
>Feature k141_250555
8871 7188 gene
gene l-rRNA
8871 7188 rRNA
product 16S ribosomal RNA
The control region feature table:
$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.control_region.ft
>Feature k141_250555
9961 10745 misc_feature
note putative control region
The combined feature table:
$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.tbl
>Feature k141_250555
309 >1 gene
gene ND3
309 >1 CDS
product NADH dehydrogenase subunit 3
transl_table 2
which will be copied to mt_annotation/DM01.DM01.megahit.mitogenome.fa.result/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.tbl
.
Some information about the mitogenome:
$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.val
ERROR: valid [SEQ_FEAT.NoStop] Missing stop codon FEATURE: CDS: cytochrome c oxidase subunit III [lcl|k141_250555:c1163-379] [lcl|k141_250555: raw, dna len= 16453] -> [lcl|k141_250555_2]
ERROR: valid [SEQ_FEAT.NoStop] Missing stop codon FEATURE: CDS: cytochrome b [lcl|k141_250555:c12027-10887] [lcl|k141_250555: raw, dna len= 16453] -> [lcl|k141_250555_9]
WARNING: valid [SEQ_INST.CompleteCircleProblem] Circular topology without complete flag set BIOSEQ: lcl|k141_250555: raw, dna len= 16453
and
$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/errorsummary.val
2 ERROR: SEQ_FEAT.NoStop
1 WARNING: SEQ_INST.CompleteCircleProblem
The most important files generated at the annotation step are copied to the directory:
$ ls -lhrt mt_annotation/DM01.DM01.megahit.mitogenome.fa.result/
total 5.7M
-rw-rw-r-- 1 gmeng gmeng 29 Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.most_related_species.txt
-rw-rw-r-- 1 gmeng gmeng 3.2K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.tbl
-rw-rw-r-- 1 gmeng gmeng 79 Jun 29 05:54 errorsummary.val
-rw-rw-r-- 1 gmeng gmeng 36K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf
-rw-rw-r-- 1 gmeng gmeng 12K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds.fasta
-rw-rw-r-- 1 gmeng gmeng 4.3K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds_translation.fasta
-rw-rw-r-- 1 gmeng gmeng 17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.fasta
-rw-rw-r-- 1 gmeng gmeng 17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.gene.fasta
-rw-rw-r-- 1 gmeng gmeng 2.7K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.rrna.fasta
-rw-rw-r-- 1 gmeng gmeng 2.6K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.trna.fasta
-rw-rw-r-- 1 gmeng gmeng 4.9K Jun 29 05:54 summary.txt
The GenBank file generated at this step will be used for visualization.
$ head mt_annotation/DM01.DM01.megahit.mitogenome.fa.result/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf
LOCUS k141_250555 16453 bp DNA circular 29-JUN-2022
DEFINITION topology=circular.
ACCESSION
VERSION
KEYWORDS .
In the sub-sub-sub directory called mt_visualization:
$ ls -lhrt mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/
total 27M
-rw-rw-r-- 1 gmeng 36K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.sortedByLength
-rw-rw-r-- 1 gmeng 1.2K Jun 29 05:54 visual_config.txt
-rw-rw-r-- 1 gmeng 1.4K Jun 29 05:54 circos.map.sh
-rw-rw-r-- 1 gmeng 34 Jun 29 05:54 circos.strand.text.txt
-rw-rw-r-- 1 gmeng 17K Jun 29 05:54 circos.mito.fa
-rw-rw-r-- 1 gmeng 36 Jun 29 05:54 circos.karyotype.txt
-rw-rw-r-- 1 gmeng 828 Jun 29 05:54 circos.gene.text.txt
-rw-rw-r-- 1 gmeng 5.4K Jun 29 05:54 circos.features.txt
-rw-rw-r-- 1 gmeng 13M Jun 29 06:02 circos.bam
-rw-rw-r-- 1 gmeng 8.3M Jun 29 06:02 circos.sorted.bam
-rw-rw-r-- 1 gmeng 231K Jun 29 06:02 circos.dep
-rw-rw-r-- 1 gmeng 316K Jun 29 06:02 circos.depth.txt
-rw-rw-r-- 1 gmeng 3.7K Jun 29 06:02 circos.conf
-rw-rw-r-- 1 gmeng 342K Jun 29 06:02 circos.png
-rw-rw-r-- 1 gmeng 4.9M Jun 29 06:02 circos.svg
A fasta mitogenome file (called circos.mito.fa
) was extracted from mt_annotation/DM01.DM01.megahit.mitogenome.fa.result/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf
from last step.
The subsampled 3G bp clean data were then mapped against this mitogenome: circos.map.sh
.
Finally, MitoZ creates a circos.conf
file the Circos program to run with:
$ cat mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.conf
<<include etc/colors_fonts_patterns.conf>>
#-----------------image------------------
<image>
###<<include etc/image.conf>>
dir = /share/pool/guanliang/test/DM01/mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization
file = circos.png
png = yes
svg = yes
...
The most important result files in this step are:
$ ls -lhrt mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/{circos.depth.txt,circos.png,circos.svg}
-rw-rw-r-- 1 gmeng gmeng 316K Jun 29 06:02 mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.depth.txt
-rw-rw-r-- 1 gmeng gmeng 342K Jun 29 06:02 mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.png
-rw-rw-r-- 1 gmeng gmeng 4.9M Jun 29 06:02 mt_annotation/tmp_DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.svg
They have been copied to the mt_annotation/DM01.DM01.megahit.mitogenome.fa.result/
directory. Therefore, now we have annotation and visualization result files together:
$ ls -rlht mt_annotation/DM01.DM01.megahit.mitogenome.fa.result/
total 5.7M
-rw-rw-r-- 1 gmeng 29 Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.most_related_species.txt
-rw-rw-r-- 1 gmeng 3.2K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.tbl
-rw-rw-r-- 1 gmeng 79 Jun 29 05:54 errorsummary.val
-rw-rw-r-- 1 gmeng 36K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf
-rw-rw-r-- 1 gmeng 12K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds.fasta
-rw-rw-r-- 1 gmeng 4.3K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds_translation.fasta
-rw-rw-r-- 1 gmeng 17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.fasta
-rw-rw-r-- 1 gmeng 17K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.gene.fasta
-rw-rw-r-- 1 gmeng 2.7K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.rrna.fasta
-rw-rw-r-- 1 gmeng 2.6K Jun 29 05:54 DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.trna.fasta
-rw-rw-r-- 1 gmeng 4.9K Jun 29 05:54 summary.txt
-rw-rw-r-- 1 gmeng 342K Jun 29 06:02 circos.png
-rw-rw-r-- 1 gmeng 4.9M Jun 29 06:02 circos.svg
-rw-rw-r-- 1 gmeng 316K Jun 29 06:02 circos.depth.txt
Users normally look at these files only.
$ ls -lhrt DM01.result/
total 8.0K
drwxrwxr-x 2 gmeng gmeng 4.0K Jun 29 06:02 DM01.megahit.result
drwxrwxr-x 2 gmeng gmeng 4.0K Jun 29 06:02 DM01.DM01.megahit.mitogenome.fa.result
The DM01.result/DM01.megahit.result
directory is copied directly from the 2. The mitogenome assembly step. At least you should check the DM01.result/DM01.megahit.result/DM01.megahit.overlap_information
file and see if the overlapping is good enough (Please refer to 2. The mitogenome assembly step)
The DM01.result/DM01.DM01.megahit.mitogenome.fa.result
directory is copied directly from the 3. The annotation step and 4. The visualization step.
The summary file:
$ cat summary.txt
#Seq_id Length(bp) Circularity Closely_related_species
k141_250555 16453 yes Homo sapiens
#Seq_id Start End Length(bp) Direction Type Gene_name Gene_prodcut Total_freq_occurred
--------------------------------------------------------------------------------------------------------------------------
k141_250555 <1 309 309 - CDS ND3 NADH dehydrogenase subunit 3 1
k141_250555 310 379 70 - tRNA trnG(ucc) tRNA-Gly 1
k141_250555 379 1163 785 - CDS COX3 cytochrome c oxidase subunit III 1
k141_250555 1163 1846 684 - CDS ATP6 ATP synthase F0 subunit 6 1
k141_250555 1837 2004 168 - CDS ATP8 ATP synthase F0 subunit 8 1
k141_250555 2006 2078 73 - tRNA trnK(uuu) tRNA-Lys 1
k141_250555 2065 2769 705 - CDS COX2 cytochrome c oxidase subunit II 1
k141_250555 2774 2843 70 - tRNA trnD(guc) tRNA-Asp 1
k141_250555 2853 2924 72 + tRNA trnS(uga) tRNA-Ser 2
k141_250555 2926 4482 1557 - CDS COX1 cytochrome c oxidase subunit I 1
k141_250555 4484 4552 69 + tRNA trnY(gua) tRNA-Tyr 1
k141_250555 4553 4619 67 + tRNA trnC(gca) tRNA-Cys 1
k141_250555 4655 4727 73 + tRNA trnN(guu) tRNA-Asn 1
k141_250555 4729 4797 69 + tRNA trnA(ugc) tRNA-Ala 1
k141_250555 4799 4870 72 - tRNA trnW(uca) tRNA-Trp 1
k141_250555 4879 5928 1050 - CDS ND2 NADH dehydrogenase subunit 2 1
k141_250555 5929 5998 70 - tRNA trnM(cau) tRNA-Met 1
k141_250555 5998 6068 71 + tRNA trnQ(uug) tRNA-Gln 1
k141_250555 6068 6135 68 - tRNA trnI(gau) tRNA-Ile 1
k141_250555 6139 7113 975 - CDS ND1 NADH dehydrogenase subunit 1 1
k141_250555 7114 7187 74 - tRNA trnL(uaa) tRNA-Leu 2
k141_250555 7188 8871 1684 - rRNA l-rRNA 16S ribosomal RNA 1
k141_250555 8873 8944 72 - tRNA trnV(uac) tRNA-Val 1
k141_250555 8945 9891 947 - rRNA s-rRNA 12S ribosomal RNA 1
k141_250555 9892 9960 69 - tRNA trnF(gaa) tRNA-Phe 1
k141_250555 10746 10815 70 + tRNA trnP(ugg) tRNA-Pro 1
k141_250555 10815 10886 72 - tRNA trnT(ugu) tRNA-Thr 1
k141_250555 10887 12027 1141 - CDS CYTB cytochrome b 1
k141_250555 12031 12098 68 + tRNA trnE(uuc) tRNA-Glu 1
k141_250555 12099 12620 522 + CDS ND6 NADH dehydrogenase subunit 6 1
k141_250555 12616 14454 1839 - CDS ND5 NADH dehydrogenase subunit 5 1
k141_250555 14455 14527 73 - tRNA trnL(uag) tRNA-Leu 2
k141_250555 14534 14603 70 - tRNA trnS(gcu) tRNA-Ser 2
k141_250555 14604 14673 70 - tRNA trnH(gug) tRNA-His 1
k141_250555 14669 16054 1386 - CDS ND4 NADH dehydrogenase subunit 4 1
k141_250555 16048 16344 297 - CDS ND4L NADH dehydrogenase subunit 4L 1
k141_250555 16345 16413 69 - tRNA trnR(ucg) tRNA-Arg 1
Protein coding genes totally found: 13
tRNA genes totally found: 22
rRNA genes totally found: 2
--------------------------------------
Genes totally found: 37
The Genbank file:
$ head DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf
LOCUS k141_250555 16453 bp DNA circular 29-JUN-2022
DEFINITION topology=circular.
ACCESSION
VERSION
KEYWORDS .
SOURCE mitochondrion Homo sapiens
ORGANISM Homo sapiens
Unclassified.
REFERENCE 1 (bases 1 to 16453)
AUTHORS MENG,G.
The mitogenome in fasta format: DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.fasta
.
The most closely related species in MitoZ's PCG annotation database:
$ cat DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.most_related_species.txt
k141_250555 Homo sapiens
The combined feature tables (including PCGs, tRNAs, rRNA genes and control region if there is):
$ head DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.tbl
>Feature k141_250555
309 >1 gene
gene ND3
309 >1 CDS
product NADH dehydrogenase subunit 3
transl_table 2
1163 379 gene
gene COX3
1163 379 CDS
product cytochrome c oxidase subunit III
The summary file generated by tbl2asn
:
$ cat errorsummary.val
2 ERROR: SEQ_FEAT.NoStop
1 WARNING: SEQ_INST.CompleteCircleProblem
You should check if there are some warnings or errors in this file.
For example, internal stop codons for PCGs. If there are internal stop codons, check the DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds.fasta
and DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds_translation.fasta
files.
Internal stop codons can be caused by:
(1) Ns in the PCGs;
(2) assembly error;
(3) wrong genetic code table for translation (--genetic_code 2
);
(4) nuclear mitochondrial DNA segments (NUMTs). Have a look at this paper: Paul Hebert, Dan Bock, Sean Prosser. Interrogating 1000 Insect Genomes for NUMTs: A Risk Assessment for Species Scans. Authorea. July 27, 2022. https://doi.org/10.22541/au.165893766.64488370/v1.
Check the sequencing depth around the problematic regions: circos.depth.txt
and circos.png
/circos.svg
files.
The sequence coverage distribution file:
$ head circos.depth.txt
mt1 1 1 1084
mt1 2 2 1092
mt1 3 3 1102
mt1 4 4 1113
mt1 5 5 1129
mt1 6 6 1139
The visualized mitogenome is in circos.svg
and circos.png
:
See Meanings of the different tracks on the Circos plot
Warning: You should check the coverage distribution of the mitogenome, extreme high or low (e.g 0X) regions could be unreliable, especially when using the megahit or spades assemblers! Larger kmers may fix the wrong assembly problem. See https://github.com/linzhi2013/MitoZ/issues/172#issuecomment-1600990445 for more details.
The CDS sequences in fasta format for the PCGs:
$ grep '>' DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds.fasta
>k141_250555;ND3;len=309;[<0:309](-)
>k141_250555;COX3;len=785;[378:1163](-)
>k141_250555;ATP6;len=684;[1162:1846](-)
>k141_250555;ATP8;len=168;[1836:2004](-)
>k141_250555;COX2;len=705;[2064:2769](-)
>k141_250555;COX1;len=1557;[2925:4482](-)
>k141_250555;ND2;len=1050;[4878:5928](-)
>k141_250555;ND1;len=975;[6138:7113](-)
>k141_250555;CYTB;len=1141;[10886:12027](-)
>k141_250555;ND6;len=522;[12098:12620](+)
>k141_250555;ND5;len=1839;[12615:14454](-)
>k141_250555;ND4;len=1386;[14668:16054](-)
>k141_250555;ND4L;len=297;[16047:16344](-)
and the corresponding protein sequences of CDS in fasta format:
$ grep '>' DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.cds_translation.fasta
>k141_250555;ND3;len=103;[<0:309](-)
>k141_250555;COX3;len=261;[378:1163](-)
>k141_250555;ATP6;len=227;[1162:1846](-)
>k141_250555;ATP8;len=55;[1836:2004](-)
>k141_250555;COX2;len=234;[2064:2769](-)
>k141_250555;COX1;len=518;[2925:4482](-)
>k141_250555;ND2;len=349;[4878:5928](-)
>k141_250555;ND1;len=324;[6138:7113](-)
>k141_250555;CYTB;len=380;[10886:12027](-)
>k141_250555;ND6;len=173;[12098:12620](+)
>k141_250555;ND5;len=612;[12615:14454](-)
>k141_250555;ND4;len=461;[14668:16054](-)
>k141_250555;ND4L;len=98;[16047:16344](-)
All the gene (PCGs, tRNA, rRNA) sequences:
$ grep '>' DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.gene.fasta
>k141_250555;ND3;len=309;[<0:309](-)
>k141_250555;trnG(ucc);len=70;[309:379](-)
>k141_250555;COX3;len=785;[378:1163](-)
>k141_250555;ATP6;len=684;[1162:1846](-)
>k141_250555;ATP8;len=168;[1836:2004](-)
>k141_250555;trnK(uuu);len=73;[2005:2078](-)
>k141_250555;COX2;len=705;[2064:2769](-)
>k141_250555;trnD(guc);len=70;[2773:2843](-)
>k141_250555;trnS(uga);len=72;[2852:2924](+)
>k141_250555;COX1;len=1557;[2925:4482](-)
>k141_250555;trnY(gua);len=69;[4483:4552](+)
The rRNA sequences:
$ grep '>' DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.rrna.fasta
>k141_250555;l-rRNA;len=1684;[7187:8871](-)
>k141_250555;s-rRNA;len=947;[8944:9891](-)
The tRNA sequences:
$ grep '>' DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.trna.fasta
>k141_250555;trnG(ucc);len=70;[309:379](-)
>k141_250555;trnK(uuu);len=73;[2005:2078](-)
>k141_250555;trnD(guc);len=70;[2773:2843](-)
>k141_250555;trnS(uga);len=72;[2852:2924](+)
>k141_250555;trnY(gua);len=69;[4483:4552](+)
>k141_250555;trnC(gca);len=67;[4552:4619](+)
>k141_250555;trnN(guu);len=73;[4654:4727](+)
...
You MUST check if a gene is complete in length, especially when your mitogenome is complete (circular). This is because the assembler just randomly chooses a breakpoint, and this could lead to broken genes (i.e., a gene is broken into two parts, one part is at the 5'-end of the fasta file, the second part is at the 3'-end of the fasta file.). Depending on the relative length of the two parts, MitoZ may fail to annotate either (or both?) of them!
The quickest way is to check the summary.txt
file and see if there are any >
or <
symbols in the Start
and End
columns.
Have a look at the summary.txt
file:
$ cat summary.txt
#Seq_id Length(bp) Circularity Closely_related_species
k141_250555 16453 yes Homo sapiens
#Seq_id Start End Length(bp) Direction Type Gene_name Gene_prodcut Total_freq_occurred
--------------------------------------------------------------------------------------------------------------------------
k141_250555 <1 309 309 - CDS ND3 NADH dehydrogenase subunit 3 1
k141_250555 310 379 70 - tRNA trnG(ucc) tRNA-Gly 1
k141_250555 379 1163 785 - CDS COX3 cytochrome c oxidase subunit III 1
The ND3 gene is not complete, but the mitogenome is complete ('Circularity yes'), this happens since the assembler just "randomly" determines the breakpoint.
In this case, we need to change the breakpoint and re-annotate the mitogenome. See also https://github.com/linzhi2013/MitoZ/wiki/Known-issues#3-breakpoint-and-incomplete-genes.
Here I will make the breakpoint as the beginning of the ND1 gene because between 6135
and 6139
, there are no genes:
k141_250555 6068 6135 68 - tRNA trnI(gau) tRNA-Ile 1
k141_250555 6139 7113 975 - CDS ND1 NADH dehydrogenase subunit 1 1
$ mdkir -p /share/pool/guanliang/test/DM01/moved_breakpoint
$ cd /share/pool/guanliang/test/DM01/moved_breakpoint
$ mitoz-tools gbfiletool sort -h # >=MitoZ v3.5
$ mitoz-tools gbfiletool sort -i ../DM01.result/DM01.DM01.megahit.mitogenome.fa.result/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf \
-o DM01.moved-breakpoint.gbf \
-g ND1
# the 'genbank_file_tool.py' script can be found in
# https://github.com/linzhi2013/MitoZ/tree/master/version_2.4-alpha/useful_scripts
# I will add it into the mitoz-tools command later
$ wget https://raw.githubusercontent.com/linzhi2013/MitoZ/master/version_2.4-alpha/useful_scripts/genbank_file_tool.py
$ python3 genbank_file_tool.py sort \
-i ../DM01.result/DM01.DM01.megahit.mitogenome.fa.result/DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf \
-o DM01.moved-breakpoint.gbf \
-g ND1
- Tips: You can also do this by hand with the
vim
command or any text editor by working on theDM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.gbf.fasta
file directly.
$ ls -lhrt
total 36K
-rw-rw-r-- 1 gmeng gmeng 35K Jun 29 13:59 DM01.moved-breakpoint.gbf
And then extract the re-arranged mitogenome to a fasta file:
$ mitoz-tools gbseqextractor -f DM01.moved-breakpoint.gbf -prefix DM01.moved-breakpoint -types wholeseq
$ ll
total 56K
-rw-rw-r-- 1 gmeng 35K Jun 29 13:59 DM01.moved-breakpoint.gbf
-rw-rw-r-- 1 gmeng 17K Jun 29 14:03 DM01.moved-breakpoint.fasta
Because we know the mitogenome is circular, we need to add this information (topology=circular
) to the header of the DM01.moved-breakpoint.fasta
file (you can do this via vim
command or any text editor):
>k141_250555 topology=circular
Finally, we re-annotate the new mitogenome DM01.moved-breakpoint.fasta
.
mitoz annotate \
--thread_number 16 \
--outprefix DM01 \
--fastafiles DM01.moved-breakpoint.fasta \
--fq1 ../clean_data/sampling_5.0GB.clean_R1.fq.gz \
--fq2 ../clean_data/sampling_5.0GB.clean_R2.fq.gz \
--species_name "Homo sapiens" \
--genetic_code 2 \
--clade Chordata
The final result will be in DM01.result
under the moved_breakpoint
directory.
The summary.txt
file does not have any >
or <
symbols in the Start
and End
columns, like this:
#Seq_id Length(bp) Circularity Closely_related_species
k59_737328 16329 yes Human
#Seq_id Start End Length(bp) Direction Type Gene_name Gene_prodcut Total_freq_occurred
--------------------------------------------------------------------------------------------------------------------------
k59_737328 12 216 205 - CDS ATP8 ATP synthase F0 subunit 8 1
k59_737328 216 280 65 - tRNA trnK(uuu)
...
...
k59_737328 14919 15704 786 - CDS COX3 cytochrome c oxidase subunit III 1
k59_737328 15703 16330 628 - CDS ATP6 ATP synthase F0 subunit 6 1
But if you do a multiple-sequence alignment (MSA) (see Grouping the same genes of multiple samples together), you can see that the ATP6 gene of this sample is too short (on the 5'-end or 3'-end of the MSA), for example:
Here, the ATP6 could not be correctly annotated due to the wrong breakpoint of the mitogenome. You should use a new breakpoint and then reannotate the mitogenome with MitoZ.
-
Further, confirm if the resulting mitogenome belongs to your target species (MitoZ may not be able to exclude contamination of closely related species, depending on whether they are both present in MitoZ's database) or if there are assembly errors.
-
e.g. by blasting the COX1 gene or CYTB gene or whole mitogenome to the NCBI NT database (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome), or the BOLD database (http://www.boldsystems.org/index.php/IDS_OpenIdEngine)
-
If you are going to use the BOLD database (NB: BOLD database only has COX1 barcodes for animals), actually you can use the
mitoz-tools bold_identification
command directly. -
Eye-checking of the alignments of same gene sequences of multiple (closely related) species (e.g. with some reliable ones from NCBI) using MEGA (https://www.megasoftware.net/) on both DNA and protein levels, respectively. You can obtain such an alignment within the MEGA software using MAFFT/ClustW/MUSCLE after loading your gene fasta file to MEGA. Different species should at least have high similarity on the protein level generally.
-
-
If you have multiple samples and you want to combine the same genes of different samples into single files, you can check https://github.com/linzhi2013/MitoZ/wiki/The--mitoz-tools--group_seq_by_gene--command.
-
Use the gene sequences for your analysis.
-
Upload the mitogenome to GenBank (https://github.com/linzhi2013/MitoZ/wiki/Upload-to-GenBank).
Often, we have several samples, and we want to group the same gene sequences of different samples into the same file before going to subsequent analysis (e.g. multiple sequence alignments and phylogenetic tree construction).
To this end, please refer to mitoz-tools group_seq_by_gene
command.
Please check the mitoz-tools
command and here for more information.
About:
Commands:
- The -all- subcommand
- The -filter- subcommand
- The -assemble- subcommand
- The -findmitoscaf- subcommand
- The -annotate- subcommand
- The -visualize- subcommand
Usages:
- Installation
- Tutorial
- Extending MitoZ-s database
- Batch processing of many samples
- Known issues
- FAQ
- Some important intermediate files
- Upload to GenBank
MitoZ-tools:
- Overview: The -mitoz tools- command
- The -mitoz-tools--group_seq_by_gene- command
- The -mitoz tools bold_identification- command
- The -mitoz tools circle_check- command
- The -mitoz tools gbfiletool- command
- The -mitoz tools gbseqextractor- command
- The -mitoz tools msaconverter- command
- The -mitoz tools taxonomy_ranks- command