
Commit

count it 5
DanBuchan committed Nov 1, 2024
1 parent 6971244 commit 1c13cf6
Showing 2 changed files with 61 additions and 9 deletions.
15 changes: 6 additions & 9 deletions README.md
@@ -111,6 +111,12 @@ c. Are the clusters meaningful?
outputs : results_data/hmm_closest/drift_summary.csv and results_data/prottrans_closest/drift_summary.csv

## Calculate contamination at iteration 5

> count_contamination_at_it_five.py
Totals the number of out-of-family sequences at iteration 5 of the PSI-BLAST search.

# 4. AlphaFold pLDDT experiment


@@ -146,12 +152,3 @@ read the merzio summary and work out, per class, how often the CATH class changes
hmm_drift_summary.R
prottrans_drift_summary.R
summarise_models.R

## ANALYSIS OF DRIFT QUESTIONS

1. Analyse *_blast_summary.csv to find out which families show drift and what kinds of drift occur

   1. How many and which families drift?
   2. What kinds of drift pattern are there?
   3. Do drifting families share the same folds or not? If not, what happens if we build AlphaFold models with the MSAs at each iteration (see above)?
   4. Do drifting families correlate with "hard" targets for AlphaFold/DMPfold, i.e. regions of low pLDDT or just the poor models in CASP15?
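Question 2 in the list above can be prototyped with a small classifier over per-iteration out-of-family counts; a minimal sketch, where the function name, input shape, and thresholds are illustrative assumptions rather than anything taken from the repo (the category labels simply echo the set file names used elsewhere in it):

```python
def classify_drift(out_of_family_counts):
    """Label a family's drift pattern from its per-iteration
    out-of-family hit counts, e.g. {1: 0, 2: 3, ..., 5: 40}.

    The input shape and cut-offs here are illustrative assumptions;
    the labels mirror the set files used in this repo.
    """
    iterations = sorted(out_of_family_counts)
    first = out_of_family_counts[iterations[0]]
    last = out_of_family_counts[iterations[-1]]
    if last == 0 and first > 0:
        return "contaminants_purified_out"
    if last > first:
        return "contaminants_grew"
    if last == first:
        return "non_drift"
    return "complex"


print(classify_drift({1: 5, 2: 8, 3: 20, 4: 31, 5: 40}))  # contaminants_grew
```

A real classifier would look at the full trajectory rather than just the endpoints, but endpoint comparison is enough to separate the grew / purified / stable cases.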
55 changes: 55 additions & 0 deletions scripts/psiblast_drift/count_contamination_at_it_five.py
@@ -0,0 +1,55 @@
import csv
import glob

#
# python count_contamination_at_it_five.py /home/dbuchan/Projects/profile_drift/results_data/drift/pfam_rep_psiblast_iteration_summaries
#

def read_summaries(path):
    # Build {family: {iteration: {hit_category: count}}} from every
    # per-family blast summary csv under path
    summaries = {}
    for file in glob.glob(f'{path}*'):
        with open(file, "r", encoding="utf-8") as fh:
            next(fh)  # skip the header line
            summaryreader = csv.reader(fh, delimiter=',')
            for row in summaryreader:
                iteration = int(row[0])
                family = row[1]
                if family not in summaries:
                    summaries[family] = {}
                if iteration not in summaries[family]:
                    summaries[family][iteration] = {}
                summaries[family][iteration][row[2]] = int(row[3])
    return summaries

def read_list(path, file):
    # Read one pfam identifier per line from path/file
    pfam_list = []
    with open(f'{path}/{file}', "r", encoding="utf-8") as fh:
        for line in fh:
            pfam_list.append(line.rstrip())
    return pfam_list


summaries_location = "/home/dbuchan/Projects/profile_drift/results_data/drift/pfam_rep_psiblast_iteration_summaries/"
drift_summaries = read_summaries(summaries_location)

query_purified = "set_where_the_query_was_purified_out.txt"
contam_purified = "set_where_contaminants_are_purified_out.txt"
contam_grew = "set_where_contaminants_grew.txt"
contam_complex = "set_with_complex_contamination_behaviours.txt"
insig = "insignificant_drifts.txt"
non_drift = "non_drift_list.txt"

files = [insig, non_drift, query_purified, contam_purified, contam_grew, contam_complex]

# Report the per-category hit counts at iteration 5 for each family in each set
for file in files:
    pfam_set = read_list(summaries_location, file)
    for pfam in pfam_set:
        if pfam in drift_summaries and 5 in drift_summaries[pfam]:
            print(pfam, drift_summaries[pfam][5])
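For orientation, the nested dict that read_summaries() builds can be exercised on a toy entry to get the iteration-5 contamination ratio the script is after; the family id and category names below are made-up placeholders, since the real summary CSVs define their own:

```python
# Toy illustration of the nested structure read_summaries() builds:
# {family: {iteration: {hit_category: count}}}
summaries = {
    "PF00001": {
        5: {"in_family": 120, "out_of_family": 30},
    },
}

# Ratio of out-of-family hits at iteration 5 (the category names are
# placeholders for illustration, not taken from the repo's CSVs)
counts = summaries["PF00001"][5]
ratio = counts["out_of_family"] / sum(counts.values())
print(f"{ratio:.2f}")  # 0.20
```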
