-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
0 parents
commit 49613d6
Showing
11 changed files
with
16,672 additions
and
0 deletions.
There are no files selected for viewing
Empty file.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
# Drift paper | ||
|
||
1. What does drift look like for Pfam families? | ||
2. What does sequence generation look like for Pfam/HMMER HMMS | ||
3. What does sequence generation look like for ESM on a single-seq and MSA basis | ||
4. Is drift of representation-ness correlated to performance in predictive tasks, namely CASP | ||
|
||
## 1. Calculating drift in Pfam families | ||
|
||
PFAM RELEASE 36 @ 8 March 2024 | ||
|
||
a. Get the Pfam-A.full.uniprot sequences and Pfam-A.hmm model sdataset from Interpro | ||
b. extract all pfam sequences to fasta and annotation with family ID: ~/bin/prep_pfam_fasta.py | ||
c. Convert to PSI-BLAST DB | ||
d. Using Hmmemit output the consensus sequence for each family: | ||
~/Applications/hmmer-3.3.2/src/hmmemit -c ~/Data/pfam/Pfam-A.hmm > pfam_consensus_reps.fa | ||
relabel_pfam_consensus_seqs.py ~/Data/pfam/Pfam-A.hmm ~/Data/pfam/pfam_consensus_reps.fa > pfam_consensus_reps_labelled.fa | ||
e. Using RaxML build the distance matrix over all Pfam domains, using mafft and raxml | ||
calculate_pfam_distances.py | ||
|
||
This gives us an all-against-all evoltionary distance matrix but built on an MAFFT MSA that may not be meaningful | ||
|
||
f. Perform an all-against-all psiblast of the reps. Extract bits scores for a similarity matrix | ||
Scale/Normalise to between 0 and one and invert for a distance matrix | ||
Use Diamond to generate all the distances | ||
~/Applications/Diamond/diamond makedb --in ./pfam_consensus_reps_labelled.fa --db pfam_consensus_reps_labelled.diamon | ||
|
||
|
||
g. Psiblast Consensus seq against pfam blast db | ||
h. record drift | ||
|
||
## Cluster analysis | ||
|
||
a. Take the distance matrices and cluster them. | ||
b. project with t-sne | ||
c. Are the clusters meaningful? | ||
|
||
## 2. For |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
|
||
|
||
This is RAxML version 8.2.12 released by Alexandros Stamatakis on May 2018. | ||
|
||
With greatly appreciated code contributions by: | ||
Andre Aberer (HITS) | ||
Simon Berger (HITS) | ||
Alexey Kozlov (HITS) | ||
Kassian Kobert (HITS) | ||
David Dao (KIT and HITS) | ||
Sarah Lutteropp (KIT and HITS) | ||
Nick Pattengale (Sandia) | ||
Wayne Pfeiffer (SDSC) | ||
Akifumi S. Tanabe (NRIFS) | ||
Charlie Taylor (UF) | ||
|
||
|
||
Alignment has 16538 distinct alignment patterns | ||
|
||
Proportion of gaps and completely undetermined characters in this alignment: 99.08% | ||
|
||
RAxML Computation of pairwise distances | ||
|
||
Using 1 distinct models/data partitions with joint branch length optimization | ||
|
||
|
||
All free model parameters will be estimated by RAxML | ||
GAMMA model of rate heterogeneity, ML estimate of alpha-parameter | ||
|
||
GAMMA Model parameters will be estimated up to an accuracy of 0.1000000000 Log Likelihood units | ||
|
||
Partition: 0 | ||
Alignment Patterns: 16538 | ||
Name: No Name Provided | ||
DataType: AA | ||
Substitution Matrix: BLOSUM62 | ||
Using fixed base frequencies | ||
|
||
|
||
|
||
|
||
RAxML was called as follows: | ||
|
||
/home/dbuchan/Applications/standard-RAxML/raxmlHPC-PTHREADS-AVX -T 4 -s reps.afa -n reps.dist -m PROTGAMMABLOSUM62 -N2 -p 123 -f x | ||
|
||
|
Large diffs are not rendered by default.
Oops, something went wrong.
16,365 changes: 16,365 additions & 0 deletions
16,365
pfam_distance_matrix/pfam_reps_nw_distances.csv
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
Traceback (most recent call last): | ||
File "/home/dbuchan/Projects/profile_drift/pfam_reps_nw.py", line 75, in <module> | ||
run_nw(rep_seqs) | ||
File "/home/dbuchan/Projects/profile_drift/pfam_reps_nw.py", line 51, in run_nw | ||
execute_process(nw_args) | ||
File "/home/dbuchan/Projects/profile_drift/pfam_reps_nw.py", line 14, in execute_process | ||
output, err = p.communicate() | ||
File "/usr/lib64/python3.9/subprocess.py", line 1134, in communicate | ||
stdout, stderr = self._communicate(input, endtime, timeout) | ||
File "/usr/lib64/python3.9/subprocess.py", line 1995, in _communicate | ||
ready = selector.select(timeout) | ||
File "/usr/lib64/python3.9/selectors.py", line 416, in select | ||
fd_event_list = self._selector.poll(timeout) | ||
KeyboardInterrupt |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
>1-cysPrx_C-consensus|PF10417 | ||
ALQFVDKHGVVTPANWRPGDKVIVPPPATQEEAVKRYLEG |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
>Reovirus_Mu2-consensus|PF07781 | ||
MITIVVIATERFSWTDTNLLNSVDYRLSEAIKLRERFAVYAPAYLRDQLEELAASLTASELLDALQDINVLVKSVYLLLPKKKRLAQYLLDDPSANIWHIPVTVLRATVASKFKISDVYNYLVDHVSPSAELATVIRRVAAGEVVYTRTDKVLGAPLLLAAPAKYYAGYLSSTQLKGVYPSSKEPERFKKKEFCLTILPSLASPKTFLLDVDAERDASIPLSVLWPQLRSDAVKSKRLLPPVALLRRVVDPALKPEWSADVDAAFRALRLSRPSSAKKSASHDVSKVPVVDIICLLESEPVDDGEKAPATKLTIHAVPVDLLTVLDIEVGKEYPLRQESGMFVPWLLLALLLSDDVTLSGTRRSVKLETAHAAARPFVNIKIERCVSARVVDVRADIADYVNAVCLTLPKGSFKSTIIDTLPSLFLELSILERAAVVDSDELGDSLEPTFEQRFLEILETLAPDSLDDAIKVVLASALIADDDALTLLLDAFNELYKEVLTPAQRKRLPLLTQQGRVLVFAHSDYELLAANVPIQVLRGKIPIDHEVNLLARKNRVGGTALQLLLDYLYRMQASPLAPKPAGSLYRQLFGPWLRVEADSEQLIKLKLIAEAPAKVLRAAGWTIDGDIPLIVSVLRAYVSVRAVIEELLERRLDSRALVVISADQLVLVEYAPPLKLVSLPASFLLPATYEVRLVSPQRVLLTGSNVSYTSGLEFAEVDDPLVVTSTG |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,91 @@ | ||
''' | ||
We calculate the NW alignment with MAFFT then the distance | ||
calculation with RAxML. | ||
python3 ./calculate_pfam_distances.py ~/Data/pfam/pfam_consensus_reps_labelled.fa | ||
''' | ||
|
||
import subprocess | ||
from subprocess import Popen, PIPE | ||
import sys | ||
import os | ||
import numpy as np | ||
from collections import defaultdict | ||
from time import time | ||
from multiprocessing import Pool, Array | ||
import itertools | ||
|
||
|
||
def select_string(seqlist): | ||
length = 1000000 | ||
ave = sum(map(len, seqlist)) / len(seqlist) | ||
selected = '' | ||
for seq in seqlist: | ||
if abs(len(seq)-ave) < length: | ||
length = abs(len(seq)-ave) | ||
selected = seq | ||
return(seq) | ||
|
||
def execute_process(executable_args, stdout_location=None): | ||
try: | ||
print(' '.join(executable_args)) | ||
p = Popen(executable_args, stdin=PIPE, stdout=PIPE, stderr=PIPE) | ||
output, err = p.communicate() | ||
code = p.returncode | ||
except Exception as e: | ||
print(str(e)) | ||
sys.exit(1) | ||
if code != 0: | ||
print("Non Zero Exit status: "+str(code)) | ||
print("Command:"+' '.join(executable_args)) | ||
if code != 255: | ||
raise OSError("Non Zero Exit status: "+str(code)+output.decode('utf-8')) | ||
if stdout_location: | ||
fhalign = open(stdout_location, "w") | ||
fhalign.write(output.decode("utf-8") ) | ||
fhalign.close() | ||
|
||
def process_distances(rep_file): | ||
rep_seqs = [] | ||
|
||
align_file = 'reps.afa' | ||
dist_file = 'reps.dist' | ||
mafft_args = ['/usr/local/bin/mafft', | ||
rep_file] | ||
execute_process(mafft_args, align_file) | ||
# run raxml | ||
|
||
raxml_args = ['/home/dbuchan/Applications/standard-RAxML/raxmlHPC-PTHREADS-AVX', | ||
'-T', | ||
'4', | ||
'-s', | ||
align_file, | ||
'-n', | ||
dist_file, | ||
'-m', | ||
'PROTGAMMABLOSUM62', | ||
'-N' | ||
'2', | ||
'-p', | ||
'123', | ||
'-f', | ||
'x'] | ||
execute_process(raxml_args) | ||
|
||
try: | ||
os.remove(align_file+".reduced") | ||
except: | ||
pass | ||
try: | ||
os.remove(f'RAxML_info.rep.dist') | ||
except: | ||
pass | ||
try: | ||
os.remove(f'RAxML_parsimonyTree.rep.dist.RUN.0') | ||
except: | ||
pass | ||
|
||
|
||
input_file = sys.argv[1] | ||
|
||
process_distances(input_file) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
import sys | ||
from subprocess import Popen, PIPE | ||
import os | ||
import csv | ||
|
||
# USAGE | ||
# python pfam_reps_blast.py ~/Data/pfam/pfam_consensus_reps.fa | ||
# | ||
|
||
def execute_process(executable_args, stdout_location=None): | ||
try: | ||
# print(' '.join(executable_args)) | ||
p = Popen(executable_args, stdin=PIPE, stdout=PIPE, stderr=PIPE) | ||
output, err = p.communicate() | ||
code = p.returncode | ||
except Exception as e: | ||
print(str(e)) | ||
sys.exit(1) | ||
if code != 0: | ||
print("Non Zero Exit status: "+str(code)) | ||
print("Command:"+' '.join(executable_args)) | ||
if code != 255: | ||
raise OSError("Non Zero Exit status: "+str(code)+output.decode('utf-8')) | ||
if stdout_location: | ||
fstdout = open(stdout_location, "w") | ||
fstdout.write(output.decode("utf-8") ) | ||
fstdout.close() | ||
|
||
|
||
def run_nw(rep_seqs): | ||
for seqa in rep_seqs.keys(): | ||
for seqb in rep_seqs.keys(): | ||
if seqa == seqb: | ||
continue | ||
fhOut = open('tmpa.fa', "w") | ||
fhOut.write(f'{seqa}\n') | ||
fhOut.write(f'{rep_seqs[seqa]}') | ||
fhOut.close() | ||
fhOut = open('tmpb.fa', "w") | ||
fhOut.write(f'{seqb}\n') | ||
fhOut.write(f'{rep_seqs[seqb]}') | ||
fhOut.close() | ||
nw_args = [ | ||
'/home/dbuchan/Applications/EMBOSS-6.4.0/emboss/needle', | ||
'tmpa.fa', | ||
'tmpb.fa', | ||
'out.needle', | ||
'-gapopen=10', | ||
'-gapextend=0' | ||
] | ||
execute_process(nw_args) | ||
os.remove('tmpa.fa') | ||
os.remove('tmpb.fa') | ||
with open('out.needle', "r") as fhIn: | ||
for line in fhIn: | ||
if line.startswith('# Score:'): | ||
print(f'{seqa[1:]},{seqb[1:]},{line.rstrip()[9:]}') | ||
os.remove(f'out.needle') | ||
# exit() | ||
|
||
|
||
reps_file = sys.argv[1] | ||
rep_seqs = {} | ||
fasta_name = None | ||
seq = None | ||
with open(reps_file, "r") as fhIn: | ||
for line in fhIn: | ||
if line.startswith(">"): | ||
if fasta_name: | ||
rep_seqs[fasta_name] = seq | ||
fasta_name = line.rstrip() | ||
seq = '' | ||
else: | ||
seq = seq+line.rstrip() | ||
run_nw(rep_seqs) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
import sys | ||
|
||
''' | ||
usage: | ||
relabel_pfam_consensus_seqs.py ~/Data/pfam/Pfam-A.hmm ~/Data/pfam/pfam_consensus_reps.fa > pfam_consensus_reps_labelled.fa | ||
''' | ||
|
||
def read_pfam_hmm(input_file): | ||
names = dict() | ||
current_name = '' | ||
curren_acc = '' | ||
with open(input_file, "r") as fhIn: | ||
for line in fhIn: | ||
if line.startswith("NAME "): | ||
current_name = line.rstrip()[6:] | ||
if line.startswith("ACC "): | ||
current_acc= line.rstrip()[6:13] | ||
names[current_name] = current_acc | ||
return names | ||
|
||
def annotate_fasta(names, fasta_file): | ||
|
||
print_ctl = False | ||
with open(fasta_file, "r") as fhIn: | ||
for line in fhIn: | ||
if line.startswith(">"): | ||
current_id = line[1:-11] | ||
if current_id in names: | ||
print_ctl = True | ||
print(f'{line[:-1]}|{names[current_id]}') | ||
else: | ||
print_ctl = False | ||
else: | ||
if print_ctl: | ||
print(line.rstrip()) | ||
|
||
pfam_names = read_pfam_hmm(sys.argv[1]) | ||
annotate_fasta(pfam_names, sys.argv[2]) |