chrismit edited this page Feb 12, 2014 · 2 revisions

fastadigest.py digests a given FASTA file with an enzyme that you specify.

To digest a protein FASTA file with trypsin (trypsin is the default enzyme, but it is specified explicitly here):

python fastadigest.py --enzyme trypsin --file /home/chris/ref/refseq62.faa --out /home/chris/ref/refseq62_trypsin.fasta

This outputs something like:

>gi|17536613|ref|NP_494333.1| Protein SUP-9 [Caenorhabditis elegans] Pep:1
YNMSNADYEILEATIVK
>gi|17536613|ref|NP_494333.1| Protein SUP-9 [Caenorhabditis elegans] Pep:2
FSGAFYFATTVITTIGYGHSTPMTDAGK
>gi|17536613|ref|NP_494333.1| Protein SUP-9 [Caenorhabditis elegans] Pep:3
VFCMLYALAGIPLGLIMFQSIGER
>gi|17536613|ref|NP_494333.1| Protein SUP-9 [Caenorhabditis elegans] Pep:4
MNTFAAK
>gi|17536613|ref|NP_494333.1| Protein SUP-9 [Caenorhabditis elegans] Pep:5
FLTMNTEDER
>gi|17536613|ref|NP_494333.1| Protein SUP-9 [Caenorhabditis elegans] Pep:6
DEQEAILAAQGLVR
>gi|17536613|ref|NP_494333.1| Protein SUP-9 [Caenorhabditis elegans] Pep:7
VGDPTADDDFGR
>gi|17536613|ref|NP_494333.1| Protein SUP-9 [Caenorhabditis elegans] Pep:8
LPLSDNVSLASCSCYQLPDEK
>gi|17536613|ref|NP_494333.1| Protein SUP-9 [Caenorhabditis elegans] Pep:9
HTEPHGGPPTFSGMTTRPK
>gi|17535787|ref|NP_495499.1| Protein SRD-59 [Caenorhabditis elegans] Pep:1
SPATLDGLK
>gi|17535787|ref|NP_495499.1| Protein SRD-59 [Caenorhabditis elegans] Pep:2
IFLYNTSCVQIALITFAFLSQHR
>gi|17535787|ref|NP_495499.1| Protein SRD-59 [Caenorhabditis elegans] Pep:3
...

Here, Pep:# is the number of the tryptic peptide within the digested protein it came from.
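To make the Pep:# numbering concrete, here is a minimal sketch of a tryptic digest. It assumes the common rule of cleaving after K or R unless the next residue is P, which is consistent with the sample output above (for example, HTEPHGGPPTFSGMTTRPK is not cut at RP), but it is not taken from fastadigest.py itself. The function names `tryptic_digest` and `format_records` are hypothetical, and the tool may apply extra filters (such as a minimum peptide length) that this sketch does not.

```python
def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P (assumed rule)."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Keep whatever is left after the last cleavage site.
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def format_records(header, sequence):
    """Return FASTA lines, numbering peptides from 1 (hypothetical helper)."""
    lines = []
    for number, peptide in enumerate(tryptic_digest(sequence), start=1):
        lines.append(">%s Pep:%d" % (header, number))
        lines.append(peptide)
    return lines


if __name__ == "__main__":
    # Toy sequence made by joining three peptides from the sample output.
    toy = "MNTFAAKFLTMNTEDERDEQEAILAAQGLVR"
    print("\n".join(format_records("toy protein", toy)))
```

Running it on the toy sequence gives back the three peptides it was built from, numbered Pep:1 to Pep:3.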

To digest a whole-genome nucleotide FASTA file:

python fastadigest.py --enzyme trypsin --file /home/chris/ref/human/hg19.fa --out /home/chris/ref/hg19_trypsin.fasta --type nt --frame 6 --genome

This would output:

>chr1 F:+1 Start:10147 End:10182
PLTLTLTLTLT
>chr1 F:+1 Start:10231 End:10263
PLTLTLNPKP
>chr1 F:+1 Start:10288 End:10329
PQPQPQPQPQPQP
>chr1 F:+1 Start:10330 End:10359
PLTLTLTLP
>chr1 F:+1 Start:10390 End:10446
PLTPNPNPNPNPNPNPNP
>chr1 F:+1 Start:10474 End:10512
YPQPARPPGSDLR

Here, Start and End are the coordinates of each peptide, which ends either at a stop codon or at a tryptic cleavage site.
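A nucleotide input has to be translated before it can be digested. The following is a minimal sketch of six-frame translation with the standard genetic code, plus a helper that maps a peptide in a forward frame back to nucleotide coordinates. All names here (`six_frame_translations`, `forward_coordinates`, and so on) are hypothetical, and the 1-based inclusive convention that excludes the stop codon is an assumption. The tool's own convention is not documented here: in the sample above, peptides that end at a stop span one codon more than their length, so the stop codon may be counted, but that is a guess.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered with T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Reverse-complement an uppercase DNA string; other letters pass through."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, offset):
    """Translate from the given offset; codons not in the table become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(offset, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna, offset)
        frames["-%d" % (offset + 1)] = translate(reverse, offset)
    return frames


def forward_coordinates(offset, aa_start, aa_length):
    """1-based inclusive nucleotide span of a forward-frame peptide.

    Assumed convention: the stop codon is not included in the span.
    """
    start = offset + 3 * aa_start + 1
    end = offset + 3 * (aa_start + aa_length)
    return start, end


if __name__ == "__main__":
    frames = six_frame_translations("ATGAAATAG")
    print(frames["+1"])                  # MK*
    print(forward_coordinates(0, 0, 2))  # (1, 6)
```

Each translated frame can then be split at `*` and digested peptide by peptide.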

To digest a RefSeq nucleotide FASTA file:

python fastadigest.py --enzyme trypsin --file /home/chris/ref/human/refseq62.fa --out /home/chris/ref/refseq62_nt_trypsin.fasta --type nt --frame 3

This would output:

>NR_024540 gene=WASH7P F:+1 Orf:1 Pep:1
AEAAAGASGR
>NR_024540 gene=WASH7P F:+1 Orf:1 Pep:2
HHDSCEDAALPGR
>NR_024540 gene=WASH7P F:+1 Orf:1 Pep:3
ALHPARPAAR
>NR_024540 gene=WASH7P F:+1 Orf:1 Pep:4
GGRPADGGCPAVPAEGLWR
>NR_024540 gene=WASH7P F:+1 Orf:1 Pep:5
HLQQVEQSR
>NR_024540 gene=WASH7P F:+1 Orf:1 Pep:6
SQVQAIGEK
>NR_024540 gene=WASH7P F:+1 Orf:1 Pep:7
VSLAQAK
>NR_024540 gene=WASH7P F:+1 Orf:1 Pep:8
LQEYGSIFTGAQDPGLQR
>NR_024540 gene=WASH7P F:+1 Orf:1 Pep:9
HRPLDER
>NR_024540 gene=WASH7P F:+1 Orf:2 Pep:1
YVFLDPLAGAVTK
>NR_024540 gene=WASH7P F:+1 Orf:2 Pep:2
THVMLGAETEEK
>NR_024540 gene=WASH7P F:+1 Orf:2 Pep:3
LFDAPLSISK
>NR_024540 gene=WASH7P F:+1 Orf:2 Pep:4

Here, Orf is the number of the open reading frame within the given frame, and Pep:# is the number of the tryptic peptide within that ORF.
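The Orf and Pep numbering can be sketched as follows. This assumes that an ORF is simply the stretch between two stop codons in a translated frame and that peptides are numbered from 1 within each ORF. How fastadigest.py actually defines an ORF (for instance, whether it requires a start codon) is not stated here, so treat this as an illustration only. `digest_frame` and `tryptic_digest` are hypothetical names, and the header and sequence in the example are made up from peptides in the sample output.

```python
def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P (assumed rule)."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def digest_frame(header, frame_label, protein):
    """Split a translated frame at stops (*) and digest each ORF.

    Returns (fasta_header, peptide) pairs labelled with Orf and Pep numbers.
    """
    records = []
    orf_number = 0
    for orf in protein.split("*"):
        if not orf:
            continue  # consecutive stops leave empty pieces; skip them
        orf_number += 1
        for pep_number, peptide in enumerate(tryptic_digest(orf), start=1):
            fasta_header = ">%s F:%s Orf:%d Pep:%d" % (
                header, frame_label, orf_number, pep_number)
            records.append((fasta_header, peptide))
    return records


if __name__ == "__main__":
    # Made-up frame: two ORFs separated by one stop codon.
    frame = "AEAAAGASGRHHDSCEDAALPGR*YVFLDPLAGAVTK"
    for fasta_header, peptide in digest_frame("NR_DEMO gene=DEMO", "+1", frame):
        print(fasta_header)
        print(peptide)
```

The first ORF yields Pep:1 and Pep:2, and the numbering restarts at Pep:1 in the second ORF, matching the pattern in the output above.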
