Hello,
Following #6, I checked the output of MashMap using blastn and have some questions about the alignments. I think I should open a new issue to understand the output of MashMap.
I want to remove possible contaminants from PacBio data generated from several insects (whole organisms were used for DNA extraction, so there may be some bacterial DNA). I downloaded all the archaeal, bacterial, fungal, protozoan, and viral sequences and merged them into a contamination library. I also included the insect mitochondrial sequences in the library.
Then I used MashMap with different parameters and got several outputs:
# run1, with default parameters
$path2mashmap -r $contaminants -q third_all.fasta -o mashmap.out
# run2, with -s 2500 --pi 80
$path2mashmap -t 8 -r $contaminants -q third_all.fasta -s 2500 --pi 80 -o mashmap2.out
# run3, with -s 500 --pi 85
$path2mashmap -t 8 -r $contaminants -q third_all.fasta -s 500 --pi 85 -o mashmap3.out
The outputs of the three runs varied:
# the input contains 6633142 sequences
$ grep -c '>' third_all.fasta
6633142
# run1
cut -f 1 -d ' ' mashmap.out |sort|uniq|wc -l
463569
# run2
cut -f 1 -d ' ' mashmap2.out |sort|uniq|wc -l
2821004
# run3
cut -f 1 -d ' ' mashmap3.out |sort|uniq|wc -l
6189307
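To put these counts in proportion to the input, the same steps can be wrapped in a small helper. This is only a sketch: the function name `mapped_fraction` is made up for illustration, and it assumes the MashMap output is space-delimited with the query name in column 1, as in the commands above.

```shell
# Hypothetical helper: report how many input reads have at least one mapping.
# $1: query FASTA file
# $2: MashMap output file (space-delimited, query name in column 1)
mapped_fraction() {
    # Count FASTA header lines to get the total number of reads.
    total=$(grep -c '^>' "$1")
    # Count distinct query names that appear in the mapping output.
    mapped=$(cut -f 1 -d ' ' "$2" | sort -u | wc -l)
    # Print "mapped/total (percent%)".
    awk -v m="$mapped" -v t="$total" \
        'BEGIN { printf "%d/%d (%.1f%%)\n", m, t, 100 * m / t }'
}

# Example call (file names taken from the runs above):
#   mapped_fraction third_all.fasta mashmap.out
```

Calling it once per output file gives the mapped fraction for each run in a directly comparable form.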
As can be seen, the share of input sequences mapped to the contaminant library grew from about 7% in run1 to about 43% in run2 and about 93% in run3, where nearly all the sequences were mapped. That really shocked me.
Then I used blastn to check the 10 mappings with the highest identity and the 10 with the lowest identity from the first run.
$ sort -t ' ' -k10,10nr mashmap.out|head
m161111_191517_42256_c101055312550000001823247601061715_s1_p0/135507/26123_34791 8668 0 8667 + AV_I75 15629 300 8661 94.3979
m161111_191517_42256_c101055312550000001823247601061715_s1_p0/118488/0_10466 10466 0 4999 + AV_I338 16706 10618 15617 93.7729
m161111_062618_42256_c101055312550000001823247601061713_s1_p0/84256/10406_16318 5912 0 5911 + AV_I336 15276 533 6284 93.5688
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/108916/11216_16430 5214 0 5213 - kraken:taxid|163164|NC_002978.6 1267782 837214 842416 93.4375
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/29550/13832_24025 10193 0 10192 - AV_I75 15629 5633 15425 93.3745
m161123_002016_42256_c101049952550000001823247601061782_s1_p0/107348/0_13397 13397 0 4999 + kraken:taxid|1633785|NZ_CP011148.1 1267840 408697 413696 93.326
m161118_115530_42256_c101055932550000001823247601061713_s1_p0/96135/27165_36074 8909 0 8908 - kraken:taxid|615|NZ_CP021984.1 5241555 4731262 4740479 93.2709
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/37699/0_7490 7490 0 7489 + AV_I337 13969 4110 11232 93.2274
m161118_115530_42256_c101055932550000001823247601061713_s1_p0/148198/23006_28097 5091 0 5090 - AV_I336 15276 2279 7323 93.1338
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/86966/0_10197 10197 5197 10196 - kraken:taxid|1633785|NZ_CP011148.1 1267840 642883 647882 93.0843
$ sort -t ' ' -k10,10n mashmap.out|head > mashmap.lowest
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/100026/37064_49118 12054 7054 12053 + kraken:taxid|336810|NZ_CP021172.1 218034 75901 80900 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/100308/48537_58156 9619 4619 9618 - kraken:taxid|880070|NC_015914.1 6221273 4110216 4115215 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/103161/48125_53357 5232 232 5231 + kraken:taxid|336810|NZ_CP021172.1 218034 75901 80900 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/104812/38481_50150 11669 0 4999 - kraken:taxid|877455|NC_015216.1 2583753 1814585 1819584 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/104851/0_12783 12783 7783 12782 - kraken:taxid|76857|NZ_CP022123.1 2521394 1534616 1539615 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/104909/21773_32004 10231 0 4999 - kraken:taxid|134821|NZ_CP021988.1 722452 74550 79549 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/105012/0_20923 20923 0 4999 + kraken:taxid|552509|NC_033778.1 111453 67843 72842 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/105790/29860_45799 15939 0 4999 - kraken:taxid|1360|NZ_CP025500.1 2346663 699615 704614 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/105963/29235_38484 9249 4249 9248 - kraken:taxid|29430|NZ_CP018260.1 3267348 1015045 1020044 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/106112/0_20765 20765 0 4999 + kraken:taxid|2017483|NZ_CP022315.1 4071214 1765868 1770867 80.8247
The highest-identity mappings looked fine. There were some differences between the hits reported by blastn and MashMap, but that may be because they used different databases. The lowest-identity ones were problematic, though. For most of them, blastn with default parameters reported 'No significant similarity found'. And when I unchecked the 'Low complexity regions' filter, the alignments were unreliable. This may have something to do with low-complexity regions or repeats.
So my questions are:
- If blastn cannot find any significant alignment, why can MashMap?
- In my case, what percent identity threshold do you think is appropriate for removing contaminants?
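To make the threshold question concrete, one option is to count how many distinct reads would be flagged at each candidate cutoff. The sketch below assumes the estimated identity is in column 10 of the MashMap output, as in the listings above. The function name `reads_at_identity` is hypothetical, and the cutoffs in the example are arbitrary illustrations, not recommendations.

```shell
# Hypothetical helper: count distinct reads with a mapping at or above a cutoff.
# $1: MashMap output file (whitespace-delimited, identity in column 10)
# $2: minimum percent identity
reads_at_identity() {
    awk -v t="$2" '
        # Count each query name once, the first time it passes the cutoff.
        ($10 + 0) >= (t + 0) && !seen[$1]++ { n++ }
        END { print n + 0 }
    ' "$1"
}

# Example call over a few arbitrary cutoffs:
#   for t in 80 85 90; do
#       echo "$t $(reads_at_identity mashmap.out "$t")"
#   done
```

Comparing these counts with the blastn checks above may help show where the mappings stop being trustworthy.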
Thank you! Sorry if I missed something.
Bests,
Yiwei Niu