chrismit edited this page Apr 1, 2015 · 8 revisions

This script lets you annotate a delimited file (such as the output of Proteome Discoverer, MaxQuant, etc.) with a given reference annotation.

Example usage:

In the simplest usage, the header of each matching protein sequence is appended to each row.

python proteinInference.py -f /home/chris/ref/celegans/c_elegans.PRJNA13758.WS239.protein.fa -t /home/chris/Celegans_wt_peptides.csv --peptide_out /home/chris/Celegans_wt_peptides_summary.csv --protein_out /home/chris/Celegans_wt_protein_summary.csv
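To make the idea concrete, here is a minimal sketch of what annotating peptides against a reference could look like. It is not the code in proteinInference.py: the function names `read_fasta` and `annotate`, the toy sequences, and the choice of plain substring matching (after upper-casing the peptide) are all assumptions made for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def annotate(peptides, fasta_lines):
    """Map each peptide to the headers of all proteins that contain it."""
    proteins = list(read_fasta(fasta_lines))
    return {
        pep: [hdr for hdr, seq in proteins if pep.upper() in seq]
        for pep in peptides
    }


# Toy reference (hypothetical data, not from the C. elegans file).
fasta = [
    ">protA example one", "MKTAYIAKQR", "QISFVKSHFS",
    ">protB example two", "MSHFSRQLEE",
]
hits = annotate(["SHFS", "AKQRQ", "WWWW"], fasta)
```

Note that a peptide can match more than one protein, which is why each peptide maps to a list of headers.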

Our reference file, however, looks like this:

>2L52.1 CE32090 WBGene00007063 Zinc finger, C2H2 type status:Partially_confirmed UniProt:A4F336 protein_id:CCD61130.1
MSMVRNVSNQSEKLEILSCKWVGCLKSTEVFKTVEKLLDHVTADHIPEVIVNDDGSEEVV
CQWDCCEMGASRGNLQKKKEWMENHFKTRHVRKAKIFKCLIEDCPVVKSSSQEIETHLRI
SHPINPKKERLKEFKSSTDHIEPTQANRVWTIVNGEVQWKTPPRVKKKTVIYYDDGPRYV
FPTGCARCNYDSDESELESDEFWSATEMSDNEEVYVNFRGMNCISTGKSASMVPSKRRNW
PKRVKKRLSTQRNNQKTIRPPELNKNNIEIKDMNSNNLEERNREECIQPVSVEKNILHFE
KFKSNQICIVRENNKFREGTRRRRKNSGESEDLKIHENFTEKRRPIRSCKQNISFYEMDG
DIEEFEVFFDTPTKSKKVLLDIYSAKKMPKIEVEDSLVNKFHSKRPSRACRVLGSMEEVP
FDVEIGY

>2RSSE.1 CE32785 WBGene00007064 status:Partially_confirmed UniProt:A4F337 protein_id:CCD61138.1

That is a lot of text to add to every row of our file. We can optionally pass a regex argument to capture only the pieces of the header we want. Suppose we just want the protein_id:

python proteinInference.py -f /home/chris/ref/celegans/c_elegans.PRJNA13758.WS239.protein.fa -t /home/chris/Celegans_wt_peptides.csv --peptide_out /home/chris/Celegans_wt_peptides_summary.csv --protein_out /home/chris/Celegans_wt_protein_summary.csv --regex 'protein_id:([^\s]+)'
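If you want to check a pattern before running the script, you can try it on a header with Python's `re` module. This is only a sketch: it assumes the header is handed to the regex without the leading `>`, and that the pattern is applied with search semantics, neither of which is confirmed here.

```python
import re

# Header of the first record shown above, without the leading ">"
# (an assumption about how the script passes headers to the regex).
header = (
    "2L52.1 CE32090 WBGene00007063 Zinc finger, C2H2 type "
    "status:Partially_confirmed UniProt:A4F336 protein_id:CCD61130.1"
)

# The capture group keeps everything after "protein_id:" up to the next whitespace.
match = re.search(r"protein_id:([^\s]+)", header)
protein_id = match.group(1) if match else None
print(protein_id)
```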

Or we want the first identifier after the > and the protein id:

python proteinInference.py -f /home/chris/ref/celegans/c_elegans.PRJNA13758.WS239.protein.fa -t /home/chris/Celegans_wt_peptides.csv --peptide_out /home/chris/Celegans_wt_peptides_summary.csv --protein_out /home/chris/Celegans_wt_protein_summary.csv --regex '^([^\s]+).+?protein_id:([^\s]+)'

This will output entries such as:

yVVLTGNQELKPLTAK ... C01G6.1a CAA84633.1;C01G6.1b CAA84642.1

Here C01G6.1a is the identifier after the >, and CAA84633.1 is its protein_id. The second entry, C01G6.1b CAA84642.1, corresponds to another protein in the reference that also matches the peptide.
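The output format above suggests that the captured groups of one header are joined with a space and that multiple matching proteins are joined with a semicolon. The sketch below reproduces that layout; `format_matches` is a hypothetical helper, not a function from the script, and the two headers are the reference entries shown earlier, treated as if both matched the same peptide purely to show the formatting.

```python
import re

# Two capture groups: the first token of the header and the protein_id value.
pattern = re.compile(r"^([^\s]+).+?protein_id:([^\s]+)")

headers = [
    "2L52.1 CE32090 WBGene00007063 Zinc finger, C2H2 type "
    "status:Partially_confirmed UniProt:A4F336 protein_id:CCD61130.1",
    "2RSSE.1 CE32785 WBGene00007064 status:Partially_confirmed "
    "UniProt:A4F337 protein_id:CCD61138.1",
]


def format_matches(headers, pattern):
    """Join the groups of each header with a space, and the headers with ';'."""
    parts = []
    for header in headers:
        match = pattern.search(header)
        if match:
            parts.append(" ".join(match.groups()))
    return ";".join(parts)


annotation = format_matches(headers, pattern)
print(annotation)
```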

By default the program assumes the first column contains your peptide sequences. If it does not, pass the -c argument either the index of the peptide column or, if your file has headers, the column's name:

python proteinInference.py -c Peptide ...
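One way to picture "index or name" is the small sketch below. `resolve_column` is a hypothetical helper, the CSV content is made up, and the 0-based indexing is an assumption; check the script's help output for how -c actually counts columns.

```python
import csv
import io


def resolve_column(header_row, column):
    """Return a 0-based column position from either an index or a header name."""
    try:
        return int(column)  # the value looks like a number: treat it as an index
    except ValueError:
        return header_row.index(column)  # otherwise look the name up in the header


# Made-up input file with a header row.
text = "Peptide,Intensity\nAAK,10\nCCR,20\n"
rows = list(csv.reader(io.StringIO(text)))

idx = resolve_column(rows[0], "Peptide")
peptides = [row[idx] for row in rows[1:]]
print(peptides)
```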

Now suppose we want to group by something like gene. We have a FASTA file which looks like this:

>gi|31791016|ref|NP_853648.1| NP_853648.1$KRTAP21-2$keratin-associated protein 21-2 [Homo sapiens]
MCCNYYRNCCGGCGYGSGWSSGCGYGCGYGCGYGSGCRYGSGYGTGCGYGCGYGSGCGYGCGYSSSCCGYRPLCYRRCYSSCY
>gi|121582655|ref|NP_653299.3| NP_653299.3$ANKRD35$ankyrin repeat domain-containing protein 35 [Homo sapiens]
MKRIFSCSSTQV...
>gi|305410858|ref|NP_056532.3| NP_056532.3$CD207$C-type lectin domain family 4 member K [Homo sapiens]

We can group based on gene by:

python pythomics/scripts/proteinInference.py -f /home/chris/ref/Protein/RefSeq62/Human_Refseq52_with_genes.fasta -t /home/chris/01BR1_PAGE_IOB_peptides_subset.csv -d , --regex \\$\(.+?\)\\$ --protein_out /home/chris/01BR1_PAGE_IOB_protein_subset2.csv --peptide_out /home/chris/Individualome/Proteomics/01BR1_PAGE_IOB_peptides_out2.csv --unique --ibaq

This will calculate the iBAQ values at the gene level.
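After shell unescaping, the regex in the command above should reach Python as `\$(.+?)\$`, which captures the text between the two dollar signs in each header. The sketch below shows that extraction and a simple grouping of accessions by gene. It is an illustration only: the grouping dictionary is an assumption about what "group by gene" means, and the iBAQ calculation itself is not shown.

```python
import re
from collections import defaultdict

# Captures the gene symbol sitting between the two "$" characters.
gene_pattern = re.compile(r"\$(.+?)\$")

headers = [
    "gi|31791016|ref|NP_853648.1| NP_853648.1$KRTAP21-2$keratin-associated protein 21-2 [Homo sapiens]",
    "gi|121582655|ref|NP_653299.3| NP_653299.3$ANKRD35$ankyrin repeat domain-containing protein 35 [Homo sapiens]",
    "gi|305410858|ref|NP_056532.3| NP_056532.3$CD207$C-type lectin domain family 4 member K [Homo sapiens]",
]

# Collect the first header token (the gi accession string) under each gene symbol.
by_gene = defaultdict(list)
for header in headers:
    match = gene_pattern.search(header)
    if match:
        by_gene[match.group(1)].append(header.split()[0])

for gene in sorted(by_gene):
    print(gene, by_gene[gene])
```

Several protein isoforms that share a gene symbol would end up under the same key, which is the point of grouping at the gene level.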
