Skip to content

Commit

Permalink
new profile drift project
Browse files Browse the repository at this point in the history
  • Loading branch information
DanBuchan committed Mar 11, 2024
0 parents commit 49613d6
Show file tree
Hide file tree
Showing 11 changed files with 16,672 additions and 0 deletions.
Empty file added .gitignore
Empty file.
38 changes: 38 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# Drift paper

1. What does drift look like for Pfam families?
2. What does sequence generation look like for Pfam/HMMER HMMS
3. What does sequence generation look like for ESM on a single-seq and MSA basis
4. Is drift of representation-ness correlated to performance in predictive tasks, namely CASP

## 1. Calculating drift in Pfam families

PFAM RELEASE 36 @ 8 March 2024

a. Get the Pfam-A.full.uniprot sequences and Pfam-A.hmm model sdataset from Interpro
b. extract all pfam sequences to fasta and annotation with family ID: ~/bin/prep_pfam_fasta.py
c. Convert to PSI-BLAST DB
d. Using Hmmemit output the consensus sequence for each family:
~/Applications/hmmer-3.3.2/src/hmmemit -c ~/Data/pfam/Pfam-A.hmm > pfam_consensus_reps.fa
relabel_pfam_consensus_seqs.py ~/Data/pfam/Pfam-A.hmm ~/Data/pfam/pfam_consensus_reps.fa > pfam_consensus_reps_labelled.fa
e. Using RaxML build the distance matrix over all Pfam domains, using mafft and raxml
calculate_pfam_distances.py

This gives us an all-against-all evoltionary distance matrix but built on an MAFFT MSA that may not be meaningful

f. Perform an all-against-all psiblast of the reps. Extract bits scores for a similarity matrix
Scale/Normalise to between 0 and one and invert for a distance matrix
Use Diamond to generate all the distances
~/Applications/Diamond/diamond makedb --in ./pfam_consensus_reps_labelled.fa --db pfam_consensus_reps_labelled.diamon


g. Psiblast Consensus seq against pfam blast db
h. record drift

## Cluster analysis

a. Take the distance matrices and cluster them.
b. project with t-sne
c. Are the clusters meaningful?

## 2. For
46 changes: 46 additions & 0 deletions pfam_distance_matrix/RAxML_info.reps.dist
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@


This is RAxML version 8.2.12 released by Alexandros Stamatakis on May 2018.

With greatly appreciated code contributions by:
Andre Aberer (HITS)
Simon Berger (HITS)
Alexey Kozlov (HITS)
Kassian Kobert (HITS)
David Dao (KIT and HITS)
Sarah Lutteropp (KIT and HITS)
Nick Pattengale (Sandia)
Wayne Pfeiffer (SDSC)
Akifumi S. Tanabe (NRIFS)
Charlie Taylor (UF)


Alignment has 16538 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 99.08%

RAxML Computation of pairwise distances

Using 1 distinct models/data partitions with joint branch length optimization


All free model parameters will be estimated by RAxML
GAMMA model of rate heterogeneity, ML estimate of alpha-parameter

GAMMA Model parameters will be estimated up to an accuracy of 0.1000000000 Log Likelihood units

Partition: 0
Alignment Patterns: 16538
Name: No Name Provided
DataType: AA
Substitution Matrix: BLOSUM62
Using fixed base frequencies




RAxML was called as follows:

/home/dbuchan/Applications/standard-RAxML/raxmlHPC-PTHREADS-AVX -T 4 -s reps.afa -n reps.dist -m PROTGAMMABLOSUM62 -N2 -p 123 -f x


1 change: 1 addition & 0 deletions pfam_distance_matrix/RAxML_parsimonyTree.reps.dist.RUN.0

Large diffs are not rendered by default.

16,365 changes: 16,365 additions & 0 deletions pfam_distance_matrix/pfam_reps_nw_distances.csv

Large diffs are not rendered by default.

14 changes: 14 additions & 0 deletions pfam_distance_matrix/reps.err
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
Traceback (most recent call last):
File "/home/dbuchan/Projects/profile_drift/pfam_reps_nw.py", line 75, in <module>
run_nw(rep_seqs)
File "/home/dbuchan/Projects/profile_drift/pfam_reps_nw.py", line 51, in run_nw
execute_process(nw_args)
File "/home/dbuchan/Projects/profile_drift/pfam_reps_nw.py", line 14, in execute_process
output, err = p.communicate()
File "/usr/lib64/python3.9/subprocess.py", line 1134, in communicate
stdout, stderr = self._communicate(input, endtime, timeout)
File "/usr/lib64/python3.9/subprocess.py", line 1995, in _communicate
ready = selector.select(timeout)
File "/usr/lib64/python3.9/selectors.py", line 416, in select
fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
2 changes: 2 additions & 0 deletions pfam_distance_matrix/tmpa.fa
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
>1-cysPrx_C-consensus|PF10417
ALQFVDKHGVVTPANWRPGDKVIVPPPATQEEAVKRYLEG
2 changes: 2 additions & 0 deletions pfam_distance_matrix/tmpb.fa
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
>Reovirus_Mu2-consensus|PF07781
MITIVVIATERFSWTDTNLLNSVDYRLSEAIKLRERFAVYAPAYLRDQLEELAASLTASELLDALQDINVLVKSVYLLLPKKKRLAQYLLDDPSANIWHIPVTVLRATVASKFKISDVYNYLVDHVSPSAELATVIRRVAAGEVVYTRTDKVLGAPLLLAAPAKYYAGYLSSTQLKGVYPSSKEPERFKKKEFCLTILPSLASPKTFLLDVDAERDASIPLSVLWPQLRSDAVKSKRLLPPVALLRRVVDPALKPEWSADVDAAFRALRLSRPSSAKKSASHDVSKVPVVDIICLLESEPVDDGEKAPATKLTIHAVPVDLLTVLDIEVGKEYPLRQESGMFVPWLLLALLLSDDVTLSGTRRSVKLETAHAAARPFVNIKIERCVSARVVDVRADIADYVNAVCLTLPKGSFKSTIIDTLPSLFLELSILERAAVVDSDELGDSLEPTFEQRFLEILETLAPDSLDDAIKVVLASALIADDDALTLLLDAFNELYKEVLTPAQRKRLPLLTQQGRVLVFAHSDYELLAANVPIQVLRGKIPIDHEVNLLARKNRVGGTALQLLLDYLYRMQASPLAPKPAGSLYRQLFGPWLRVEADSEQLIKLKLIAEAPAKVLRAAGWTIDGDIPLIVSVLRAYVSVRAVIEELLERRLDSRALVVISADQLVLVEYAPPLKLVSLPASFLLPATYEVRLVSPQRVLLTGSNVSYTSGLEFAEVDDPLVVTSTG
91 changes: 91 additions & 0 deletions scripts/calculate_pfam_distances.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
'''
We calculate the NW alignment with MAFFT then the distance
calculation with RAxML.
python3 ./calculate_pfam_distances.py ~/Data/pfam/pfam_consensus_reps_labelled.fa
'''

import subprocess
from subprocess import Popen, PIPE
import sys
import os
import numpy as np
from collections import defaultdict
from time import time
from multiprocessing import Pool, Array
import itertools


def select_string(seqlist):
length = 1000000
ave = sum(map(len, seqlist)) / len(seqlist)
selected = ''
for seq in seqlist:
if abs(len(seq)-ave) < length:
length = abs(len(seq)-ave)
selected = seq
return(seq)

def execute_process(executable_args, stdout_location=None):
try:
print(' '.join(executable_args))
p = Popen(executable_args, stdin=PIPE, stdout=PIPE, stderr=PIPE)
output, err = p.communicate()
code = p.returncode
except Exception as e:
print(str(e))
sys.exit(1)
if code != 0:
print("Non Zero Exit status: "+str(code))
print("Command:"+' '.join(executable_args))
if code != 255:
raise OSError("Non Zero Exit status: "+str(code)+output.decode('utf-8'))
if stdout_location:
fhalign = open(stdout_location, "w")
fhalign.write(output.decode("utf-8") )
fhalign.close()

def process_distances(rep_file):
rep_seqs = []

align_file = 'reps.afa'
dist_file = 'reps.dist'
mafft_args = ['/usr/local/bin/mafft',
rep_file]
execute_process(mafft_args, align_file)
# run raxml

raxml_args = ['/home/dbuchan/Applications/standard-RAxML/raxmlHPC-PTHREADS-AVX',
'-T',
'4',
'-s',
align_file,
'-n',
dist_file,
'-m',
'PROTGAMMABLOSUM62',
'-N'
'2',
'-p',
'123',
'-f',
'x']
execute_process(raxml_args)

try:
os.remove(align_file+".reduced")
except:
pass
try:
os.remove(f'RAxML_info.rep.dist')
except:
pass
try:
os.remove(f'RAxML_parsimonyTree.rep.dist.RUN.0')
except:
pass


input_file = sys.argv[1]

process_distances(input_file)
75 changes: 75 additions & 0 deletions scripts/pfam_reps_nw.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
import sys
from subprocess import Popen, PIPE
import os
import csv

# USAGE
# python pfam_reps_blast.py ~/Data/pfam/pfam_consensus_reps.fa
#

def execute_process(executable_args, stdout_location=None):
try:
# print(' '.join(executable_args))
p = Popen(executable_args, stdin=PIPE, stdout=PIPE, stderr=PIPE)
output, err = p.communicate()
code = p.returncode
except Exception as e:
print(str(e))
sys.exit(1)
if code != 0:
print("Non Zero Exit status: "+str(code))
print("Command:"+' '.join(executable_args))
if code != 255:
raise OSError("Non Zero Exit status: "+str(code)+output.decode('utf-8'))
if stdout_location:
fstdout = open(stdout_location, "w")
fstdout.write(output.decode("utf-8") )
fstdout.close()


def run_nw(rep_seqs):
for seqa in rep_seqs.keys():
for seqb in rep_seqs.keys():
if seqa == seqb:
continue
fhOut = open('tmpa.fa', "w")
fhOut.write(f'{seqa}\n')
fhOut.write(f'{rep_seqs[seqa]}')
fhOut.close()
fhOut = open('tmpb.fa', "w")
fhOut.write(f'{seqb}\n')
fhOut.write(f'{rep_seqs[seqb]}')
fhOut.close()
nw_args = [
'/home/dbuchan/Applications/EMBOSS-6.4.0/emboss/needle',
'tmpa.fa',
'tmpb.fa',
'out.needle',
'-gapopen=10',
'-gapextend=0'
]
execute_process(nw_args)
os.remove('tmpa.fa')
os.remove('tmpb.fa')
with open('out.needle', "r") as fhIn:
for line in fhIn:
if line.startswith('# Score:'):
print(f'{seqa[1:]},{seqb[1:]},{line.rstrip()[9:]}')
os.remove(f'out.needle')
# exit()


reps_file = sys.argv[1]
rep_seqs = {}
fasta_name = None
seq = None
with open(reps_file, "r") as fhIn:
for line in fhIn:
if line.startswith(">"):
if fasta_name:
rep_seqs[fasta_name] = seq
fasta_name = line.rstrip()
seq = ''
else:
seq = seq+line.rstrip()
run_nw(rep_seqs)
38 changes: 38 additions & 0 deletions scripts/relabel_pfam_consensus_seqs.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
import sys

'''
usage:
relabel_pfam_consensus_seqs.py ~/Data/pfam/Pfam-A.hmm ~/Data/pfam/pfam_consensus_reps.fa > pfam_consensus_reps_labelled.fa
'''

def read_pfam_hmm(input_file):
names = dict()
current_name = ''
curren_acc = ''
with open(input_file, "r") as fhIn:
for line in fhIn:
if line.startswith("NAME "):
current_name = line.rstrip()[6:]
if line.startswith("ACC "):
current_acc= line.rstrip()[6:13]
names[current_name] = current_acc
return names

def annotate_fasta(names, fasta_file):

print_ctl = False
with open(fasta_file, "r") as fhIn:
for line in fhIn:
if line.startswith(">"):
current_id = line[1:-11]
if current_id in names:
print_ctl = True
print(f'{line[:-1]}|{names[current_id]}')
else:
print_ctl = False
else:
if print_ctl:
print(line.rstrip())

pfam_names = read_pfam_hmm(sys.argv[1])
annotate_fasta(pfam_names, sys.argv[2])

0 comments on commit 49613d6

Please sign in to comment.