Dna joao #9

Open
wants to merge 149 commits into
base: main
149 commits
6e7ad4f
first commit
JoaoNunoAbreu Feb 23, 2022
a7f6dc2
add guides
JoaoNunoAbreu Feb 24, 2022
ebe389c
.DS_Store banished!
JoaoNunoAbreu Feb 24, 2022
7210890
add a few DNA descriptors
JoaoNunoAbreu Feb 24, 2022
28e5c32
update requirements
JoaoNunoAbreu Feb 24, 2022
95294ef
add more descriptors and a get all function
JoaoNunoAbreu Feb 24, 2022
c1b0127
fix get_kmer
JoaoNunoAbreu Feb 25, 2022
5513b4d
add binary and accumulated nucleotide frequency descriptors
JoaoNunoAbreu Feb 25, 2022
61385f2
add binary and accumulated nucleotide frequency descriptors
JoaoNunoAbreu Feb 25, 2022
9cdcf42
add enhanced nucleic acid composition descriptor
JoaoNunoAbreu Feb 25, 2022
1e107ba
add header to descriptors_dna.py
JoaoNunoAbreu Feb 25, 2022
52e9650
add k spaced nucleic acid pairs descriptor
JoaoNunoAbreu Mar 3, 2022
44cdc80
add get_PseDNC descriptor
JoaoNunoAbreu Mar 5, 2022
2c13eb9
add pseudo_k_composition descriptor
JoaoNunoAbreu Mar 8, 2022
d629ffb
added all Autocorrelation descriptors
JoaoNunoAbreu Mar 10, 2022
bf44863
fix docs
JoaoNunoAbreu Mar 10, 2022
77b49f5
add where code comes from
JoaoNunoAbreu Mar 10, 2022
4ce1a9c
add dna deep ml template code
JoaoNunoAbreu Mar 11, 2022
3054c56
cifar10 template code
JoaoNunoAbreu Mar 11, 2022
f3ec4ed
delete pickle file
JoaoNunoAbreu Mar 15, 2022
c7ffea4
add essential genes dataset
JoaoNunoAbreu Mar 17, 2022
db7fa90
update
JoaoNunoAbreu Mar 24, 2022
f34b3e0
add enhancer dataset
JoaoNunoAbreu Mar 25, 2022
25e9689
update
JoaoNunoAbreu Mar 29, 2022
2e65b5f
added essential genes fasta
JoaoNunoAbreu Mar 29, 2022
4990c72
update
JoaoNunoAbreu Apr 4, 2022
40477b0
update
Apr 5, 2022
83ec412
update
Apr 5, 2022
4563d54
model with 62% acc for enhancers
Apr 6, 2022
0ac4e3d
test update
JoaoNunoAbreu Apr 6, 2022
89cda1e
clean cdoe
JoaoNunoAbreu Apr 6, 2022
5308df9
update
JoaoNunoAbreu Apr 6, 2022
a8dc599
with conda env now
JoaoNunoAbreu Apr 8, 2022
f5deb5c
remove unnecessary print
JoaoNunoAbreu Apr 8, 2022
8f565a4
requirements file for conda env
JoaoNunoAbreu Apr 8, 2022
864cf99
one hot encoding attempt
JoaoNunoAbreu Apr 15, 2022
04969b2
change descriptors values to percentage
JoaoNunoAbreu Apr 19, 2022
c50d6f5
delete ugly print
JoaoNunoAbreu Apr 19, 2022
8b9fa9c
good acc and mcc with a new dataset
JoaoNunoAbreu Apr 21, 2022
5690917
remove enhancers and clean validate_descriptors
JoaoNunoAbreu Apr 27, 2022
1a9a7bc
with more comments and models
JoaoNunoAbreu Apr 28, 2022
d7904e5
fix linear svm feature importance
JoaoNunoAbreu May 2, 2022
d51edfc
testing DL models
JoaoNunoAbreu May 18, 2022
584ec1a
update testing
JoaoNunoAbreu May 19, 2022
04c9e7e
with more comments
JoaoNunoAbreu May 19, 2022
81095c2
update
JoaoNunoAbreu May 20, 2022
433e108
update
JoaoNunoAbreu May 20, 2022
055f1bf
with good acc and mcc
JoaoNunoAbreu May 21, 2022
518483f
with early stopping
JoaoNunoAbreu May 21, 2022
51c39b6
with early stopping and lr scheduler
JoaoNunoAbreu May 21, 2022
54e174e
clean up code
JoaoNunoAbreu Jun 2, 2022
b0e5aba
creating module
JoaoNunoAbreu Jun 2, 2022
dab678c
just to not lose progress
JoaoNunoAbreu Jun 7, 2022
34b5322
just not to lose progress
JoaoNunoAbreu Jun 7, 2022
1c98191
add readme
JoaoNunoAbreu Jun 7, 2022
5362909
update
JoaoNunoAbreu Jun 7, 2022
b821558
update
JoaoNunoAbreu Jun 7, 2022
437c3dc
update
JoaoNunoAbreu Jun 7, 2022
fd6a0a7
updating
JoaoNunoAbreu Jun 8, 2022
420bf1b
update
JoaoNunoAbreu Jun 10, 2022
f4d7e42
update
JoaoNunoAbreu Jun 13, 2022
5f918b1
update
JoaoNunoAbreu Jun 13, 2022
72ff9af
update
JoaoNunoAbreu Jun 13, 2022
cfe8c93
update
JoaoNunoAbreu Jun 14, 2022
8f7aeba
features calculated and stored in pickle obj
JoaoNunoAbreu Jun 15, 2022
6da15a3
removed old DeepHE file
JoaoNunoAbreu Jun 15, 2022
a893677
starting model with descriptors
JoaoNunoAbreu Jun 15, 2022
acec716
testing models
JoaoNunoAbreu Jun 15, 2022
4f42b12
removed extra descriptors and fixed accumulated
JoaoNunoAbreu Jun 16, 2022
22c1764
training with descriptors but guessing negatives
JoaoNunoAbreu Jun 17, 2022
e1a118b
features + DHE model in Primer and prepare_data.py
JoaoNunoAbreu Jun 17, 2022
36fced1
fix bug 0 MCC even when the classes were weighted
JoaoNunoAbreu Jun 17, 2022
27f646f
created hyperparameter tuning script
JoaoNunoAbreu Jun 18, 2022
7268392
added more docs
JoaoNunoAbreu Jun 18, 2022
8422653
primer with hyperparameter tuning
JoaoNunoAbreu Jun 18, 2022
42aaf24
hyperparam tuning in DHE, primer and testing
JoaoNunoAbreu Jun 19, 2022
c33db3a
chemical enc, deepHE stats
JoaoNunoAbreu Jun 21, 2022
2fa291d
rnn-lstm model. primer works for every model
JoaoNunoAbreu Jun 22, 2022
82eb382
update
JoaoNunoAbreu Jul 5, 2022
6172078
update
JoaoNunoAbreu Jul 6, 2022
a654c9d
with cnn-lstm model
JoaoNunoAbreu Jul 8, 2022
da9f74a
cnn-lstm works for primer and h3
JoaoNunoAbreu Jul 8, 2022
9d9905e
add bidirectional option for LSTM and CNN-LSTM
JoaoNunoAbreu Jul 11, 2022
a4dee74
add daniel buckle thesis model
JoaoNunoAbreu Jul 11, 2022
0005399
implemented kmer_one_hot encoding
JoaoNunoAbreu Jul 28, 2022
de594a5
clean up of repo
JoaoNunoAbreu Jul 28, 2022
030533f
clean up of repo
JoaoNunoAbreu Jul 28, 2022
45b131e
update readme
JoaoNunoAbreu Jul 28, 2022
e91dba5
removed unnecessary files
JoaoNunoAbreu Jul 28, 2022
e195cb5
update README
JoaoNunoAbreu Jul 28, 2022
d540a2c
primer added again
JoaoNunoAbreu Jul 28, 2022
70d24f1
fix PseDNC and PseKNC
JoaoNunoAbreu Aug 2, 2022
1b3093f
descriptors better commented
JoaoNunoAbreu Aug 2, 2022
2483b13
remove unnecessary code
JoaoNunoAbreu Aug 2, 2022
f8aeae4
calculation list descriptors, read_sequence module
JoaoNunoAbreu Aug 2, 2022
1b04065
quick start notebook for descriptors
JoaoNunoAbreu Aug 2, 2022
b25c714
update readme
JoaoNunoAbreu Aug 2, 2022
4632d6a
update text
JoaoNunoAbreu Aug 2, 2022
8a02aec
made quick start smaller
JoaoNunoAbreu Aug 2, 2022
4ff3722
fix table
JoaoNunoAbreu Aug 2, 2022
0d19414
fix table
JoaoNunoAbreu Aug 2, 2022
47fb670
fix table
JoaoNunoAbreu Aug 2, 2022
e7192ee
fix table 4
JoaoNunoAbreu Aug 2, 2022
b6c3d71
asdasd
JoaoNunoAbreu Aug 2, 2022
1af3db5
fix table 5
JoaoNunoAbreu Aug 2, 2022
1d6c4ce
prettify
JoaoNunoAbreu Aug 2, 2022
383a79e
prettify
JoaoNunoAbreu Aug 2, 2022
5bcb8c0
grammar update on quick start and gru model
JoaoNunoAbreu Aug 2, 2022
8a2a107
allows train without tuning
JoaoNunoAbreu Aug 2, 2022
da72bd8
calculates descriptors when they havent been
JoaoNunoAbreu Aug 3, 2022
678f154
update texts
JoaoNunoAbreu Aug 3, 2022
a5f641f
refactored to 1 big folder instead of 2
JoaoNunoAbreu Aug 3, 2022
ebf2a00
joined utils files and fixed imports
JoaoNunoAbreu Aug 3, 2022
4be425d
testing.py brough back (?) starting quick start DL
JoaoNunoAbreu Aug 4, 2022
3b20ea3
quick start DL until hyper tuning
JoaoNunoAbreu Aug 4, 2022
c17e4d8
renamed testing to deep_ml
JoaoNunoAbreu Aug 4, 2022
f5c86bb
with config file
JoaoNunoAbreu Aug 5, 2022
3a08296
change read me
JoaoNunoAbreu Aug 8, 2022
8c27dab
commit test
JoaoNunoAbreu Aug 9, 2022
a0d8097
update
JoaoNunoAbreu Aug 10, 2022
2fd321a
quick start dl done
JoaoNunoAbreu Aug 17, 2022
bd94bfd
with num layers in MLP as hyperparameter
JoaoNunoAbreu Aug 25, 2022
a551e54
fc layer of CNN similar to MLP's
JoaoNunoAbreu Sep 5, 2022
a9585e0
added bi_gru, mlp_half and cnn_half models
JoaoNunoAbreu Sep 7, 2022
65ae1e2
Delete Primer.ipynb
JoaoNunoAbreu Sep 8, 2022
8432f45
update
JoaoNunoAbreu Sep 14, 2022
2dbc374
update
JoaoNunoAbreu Sep 23, 2022
79e350d
Merge branch 'dna-joao' of https://github.com/BioSystemsUM/propythia …
JoaoNunoAbreu Sep 23, 2022
b6f2e29
final version of models i hope
JoaoNunoAbreu Oct 6, 2022
af2aa87
remove dynamic num_layers
JoaoNunoAbreu Oct 6, 2022
3a62372
com hyper tuning reproducible
JoaoNunoAbreu Oct 10, 2022
79f1f06
reproducible results with new scheduler
JoaoNunoAbreu Oct 12, 2022
d3c98dc
refactoring of a lot of things
JoaoNunoAbreu Oct 18, 2022
930b2ff
fix typo on quicker start
JoaoNunoAbreu Oct 18, 2022
66182d4
with SMOTE and cutting length
JoaoNunoAbreu Oct 19, 2022
6223911
with choice of reading from pickle
JoaoNunoAbreu Oct 19, 2022
21ee887
with more and pretty metrics
JoaoNunoAbreu Oct 19, 2022
5864af5
remove unnecessary parameter from oversample
JoaoNunoAbreu Oct 19, 2022
477b27a
smote in training
JoaoNunoAbreu Oct 19, 2022
2c9a28b
fix
JoaoNunoAbreu Oct 19, 2022
f5c683a
martelada
JoaoNunoAbreu Oct 19, 2022
c409faf
Revert "smote in training"
JoaoNunoAbreu Oct 20, 2022
9dea950
revert to last week
JoaoNunoAbreu Oct 20, 2022
119c40b
working so far
JoaoNunoAbreu Oct 20, 2022
ac0b0f9
still working with prepare_and_train
JoaoNunoAbreu Oct 20, 2022
c8c2e4f
update
JoaoNunoAbreu Oct 22, 2022
2aff896
fix last update
JoaoNunoAbreu Oct 22, 2022
3b39e5e
gg
JoaoNunoAbreu Oct 27, 2022
9f65d24
Rename requirements to requirements_dna
marta-seq Jul 2, 2023
5 changes: 4 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,4 +1,7 @@
build/*
__pycache__/*
.idea/*
dist/*
dist/*
venv/
.DS_Store
__pycache__
Binary file modified docs/_guides/propythia_descriptors_2021.pdf
Binary file not shown.
Binary file modified docs/_guides/propythia_user_guide_2021.pdf
Binary file not shown.
File renamed without changes.
7 changes: 7 additions & 0 deletions src/propythia/DNA/.gitignore
@@ -0,0 +1,7 @@
__pycache__/
.ipynb_checkpoints/
.mypy_cache/
.vscode/
datasets/
src_old
backup/
21 changes: 21 additions & 0 deletions src/propythia/DNA/README.md
@@ -0,0 +1,21 @@
# Note

## Machine Learning Part

* `data` is where the physicochemical indices are stored, which are used to calculate some descriptors.
* `descriptors.py` is the file that contains the calculation of all descriptors for a given sequence.
* `calculate_features.py` is a script that calculates all descriptors for an entire dataset (with the help of `descriptors.py`) and creates a dataframe with all the descriptors.
* `notebooks/quick-start-ML.ipynb` is a notebook that explains how to perform every step of the developed modules. It covers reading and validating data, calculating descriptors from sequences, processing the descriptors, and using the processed descriptors to train ML models (already implemented in ProPythia).

## Deep Learning Part

* `deep_ml.py` runs a combination of set hyperparameters or performs hyperparameter tuning for the given model, feature mode, and data directory.
* `outputs` is a directory where the output of the hyperparameter tuning is stored. Only the filtered results, with the score of each model, are stored in the directory.
* `src` is a directory where the source code of the entire DL pipeline is stored.
* `essential_genes` is a directory where all the information about the essential genes is stored, since building the dataset required a lot of data preprocessing.
* `config.json` is a file that contains the configuration of the entire DL pipeline.

## Both Parts

* `utils.py` is a file that contains some useful functions.
* `read_sequence.py` is the file that contains functions to read and validate DNA sequences. They can be read from a *CSV* file, a *FASTA* file, or from a single string.
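
The README says that `read_sequence.py` reads and validates DNA sequences from a CSV file, a FASTA file, or a single string, but that file is not part of this diff. As a rough illustration of what such reading and validation can involve, here is a minimal standard-library sketch. The names `is_valid_dna` and `parse_fasta` are hypothetical and are not the `ReadDNA` API; the choice to accept only the four unambiguous bases is also an assumption.

```python
from typing import Dict

# Assumption: only the four unambiguous bases count as valid DNA.
VALID_NUCLEOTIDES = set("ACGT")


def is_valid_dna(sequence: str) -> bool:
    """Return True if the sequence is non-empty and contains only A, C, G, T."""
    seq = sequence.strip().upper()
    return bool(seq) and set(seq) <= VALID_NUCLEOTIDES


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # sequence lines may wrap, so concatenate them under the last header
            records[header] += line.upper()
    return records


records = parse_fasta(">seq1\nACGT\nacgt\n>seq2\nACGN\n")
valid = {name: seq for name, seq in records.items() if is_valid_dna(seq)}
```

Here `valid` keeps only `seq1`, because `seq2` contains the ambiguous base `N`. Whether the real module rejects or tolerates ambiguous bases cannot be told from this diff.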
133 changes: 133 additions & 0 deletions src/propythia/DNA/calculate_features.py
@@ -0,0 +1,133 @@
import pandas as pd
from typing import List
from descriptors import DNADescriptor

def _calculate_descriptors(data: pd.DataFrame, descriptor_list: List) -> pd.DataFrame:
    """
    From a dataset of sequences and labels, this function calculates the descriptors and returns a dataframe with them.
    The user can also specify which descriptors to calculate.
    """
    list_feature = []
    count = 0
    for seq in data['sequence']:
        res = {'sequence': seq}
        dna = DNADescriptor(seq)
        features = dna.get_descriptors(descriptor_list)
        res.update(features)
        list_feature.append(res)

        # print progress every 100 sequences
        if count % 100 == 0:
            print(count, '/', len(data))

        count += 1
    print("Done!")
    df = pd.DataFrame(list_feature)
    return df


def _process_lists(fps_x, field):
    """
    A helper function to normalize lists.
    """
    l = fps_x[field].to_list()
    new_df = pd.DataFrame(l)
    new_df.columns = [str(field) + "_" + str(i) for i in new_df.columns]
    fps_x.drop(field, axis=1, inplace=True)
    return new_df


def _process_lists_of_lists(fps_x, field):
    """
    A helper function to normalize lists of lists.
    """
    l = fps_x[field].to_list()
    new_df = pd.DataFrame(l)
    new_df.columns = [str(field) + "_" + str(i) for i in new_df.columns]
    empty_val = {} if field == "enhanced_nucleic_acid_composition" else []
    small_processed = []
    for f in new_df.columns:
        col = [empty_val if i is None else i for i in new_df[f].to_list()]
        sub = pd.DataFrame(col)
        sub.columns = [str(f) + "_" + str(i) for i in sub.columns]
        small_processed.append(sub)
    fps_x.drop(field, axis=1, inplace=True)
    return small_processed



def normalization(fps_x, descriptor_list):
    """
    Because the model cannot process data in dictionaries and lists, the descriptors that produce these forms must still be normalized.

    To normalize the data, dicts and lists need to "explode" into more columns.

    E.g. dicts:

    | descriptor_hello |
    | ---------------- |
    | {'a': 1, 'b': 2} |

    will be transformed into:

    | descriptor_hello_a | descriptor_hello_b |
    | ------------------ | ------------------ |
    | 1                  | 2                  |

    E.g. lists:

    | descriptor_hello |
    | ---------------- |
    | [1, 2, 3]        |

    will be transformed into:

    | descriptor_hello_0 | descriptor_hello_1 | descriptor_hello_2 |
    | ------------------ | ------------------ | ------------------ |
    | 1                  | 2                  | 3                  |
    """
    lists = ["nucleic_acid_composition", "dinucleotide_composition", "trinucleotide_composition",
             "k_spaced_nucleic_acid_pairs", "kmer", "PseDNC", "PseKNC", "DAC", "DCC", "DACC", "TAC", "TCC", "TACC"]
    lists_of_lists = [
        "accumulated_nucleotide_frequency"
    ]

    # update to be normalized lists with only columns the user wants
    if descriptor_list != []:
        lists = [l for l in lists if l in descriptor_list]
        lists_of_lists = [l for l in lists_of_lists if l in descriptor_list]

    small_processed = []
    for i in lists:
        new_df = _process_lists(fps_x, i)
        small_processed.append(new_df)

    for i in lists_of_lists:
        smaller_processed = _process_lists_of_lists(fps_x, i)
        small_processed += smaller_processed

    new_fps_x = pd.concat([fps_x, *small_processed], axis=1)
    return new_fps_x


def calculate_and_normalize(data: pd.DataFrame, descriptor_list: list = None) -> tuple:
    """
    This function calculates the descriptors and normalizes the data all at once from a dataframe of sequences and labels. The user can also specify which descriptors to calculate.
    Returns a tuple (fps_x, fps_y); fps_y is None when the data has no 'label' column.
    """
    # avoid a shared mutable default argument; downstream code expects a list
    if descriptor_list is None:
        descriptor_list = []
    features = _calculate_descriptors(data, descriptor_list)
    if 'label' in data:
        fps_y = data['label']
    else:
        fps_y = None
    fps_x = features.loc[:, features.columns != 'label']
    fps_x = fps_x.loc[:, fps_x.columns != 'sequence']
    fps_x = normalization(fps_x, descriptor_list)
    return fps_x, fps_y

if __name__ == "__main__":
    from read_sequence import ReadDNA
    reader = ReadDNA()
    filename = 'datasets/primer/dataset.csv'
    data = reader.read_csv(filename=filename, with_labels=True)
    fps_x, fps_y = calculate_and_normalize(data)
    print(fps_x)
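
The docstring of `normalization` describes "exploding" dict-valued and list-valued descriptor columns into one column per key or position. The sketch below reproduces that transformation on a toy dataframe. It is an illustration only: `explode_column` and the column name `descriptor_list` are hypothetical, and unlike `_process_lists` it returns a new dataframe instead of dropping the column in place.

```python
import pandas as pd


def explode_column(df: pd.DataFrame, field: str) -> pd.DataFrame:
    """Replace a dict- or list-valued column with one column per key or position."""
    # pd.DataFrame turns a list of dicts into named columns and a list of lists
    # into positional columns (0, 1, 2, ...).
    expanded = pd.DataFrame(df[field].to_list(), index=df.index)
    expanded.columns = [f"{field}_{c}" for c in expanded.columns]
    # Return a new frame so the caller's dataframe is left untouched.
    return pd.concat([df.drop(columns=[field]), expanded], axis=1)


df = pd.DataFrame({
    "descriptor_hello": [{"a": 1, "b": 2}, {"a": 3, "b": 4}],
    "descriptor_list": [[1, 2, 3], [4, 5, 6]],
})
out = explode_column(explode_column(df, "descriptor_hello"), "descriptor_list")
print(list(out.columns))
```

Returning a copy avoids mutating a dataframe that may itself be a slice of another one, which is the situation in `calculate_and_normalize`, where `fps_x` is built with `.loc` before being passed on.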
34 changes: 34 additions & 0 deletions src/propythia/DNA/config.json
@@ -0,0 +1,34 @@
{
    "combination":{
        "model_label": "bi_lstm",
        "mode": "chemical",
        "data_dir": "essential_genes_100k_cut",
        "class_weights": [1.0, 1.0]
    },
    "do_tuning": true,
    "fixed_vals":{
        "epochs": 500,
        "optimizer_label": "adam",
        "loss_function": "cross_entropy",
        "patience": 2,
        "output_size": 2,
        "cpus_per_trial": 2,
        "gpus_per_trial": 2,
        "num_samples": 5,
        "kmer_one_hot": 2
    },
    "hyperparameters": {
        "hidden_size": 32,
        "lr": 1e-3,
        "batch_size": 32,
        "dropout": 0.35,
        "num_layers": 1
    },
    "hyperparameter_search_space": {
        "hidden_size": [32, 64, 128],
        "lr": [1e-4, 1e-3, 1e-2],
        "batch_size": [16, 32, 64],
        "dropout": [0.2, 0.3, 0.4, 0.5],
        "num_layers": [1, 2, 3]
    }
}
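
The config separates a fixed `hyperparameters` set, used when `do_tuning` is false, from a `hyperparameter_search_space` explored when it is true, with `num_samples` trials. The tuning code itself (`src/hyperparameter_tuning.py`) is not shown in this diff, so the following is only a sketch of how those three keys could interact, using plain random sampling from the standard library. The function `candidate_trials` and the seeding are assumptions, not the PR's implementation.

```python
import json
import random
from typing import Any, Dict, List

# A trimmed copy of the config above, inlined so the sketch is self-contained.
CONFIG_TEXT = """
{
    "do_tuning": true,
    "fixed_vals": {"num_samples": 5},
    "hyperparameters": {
        "hidden_size": 32, "lr": 1e-3, "batch_size": 32,
        "dropout": 0.35, "num_layers": 1
    },
    "hyperparameter_search_space": {
        "hidden_size": [32, 64, 128],
        "lr": [1e-4, 1e-3, 1e-2],
        "batch_size": [16, 32, 64],
        "dropout": [0.2, 0.3, 0.4, 0.5],
        "num_layers": [1, 2, 3]
    }
}
"""


def candidate_trials(config: Dict[str, Any], seed: int = 0) -> List[Dict[str, Any]]:
    """Return the hyperparameter sets to try (hypothetical helper)."""
    if not config["do_tuning"]:
        # No tuning: a single run with the fixed hyperparameters.
        return [dict(config["hyperparameters"])]
    rng = random.Random(seed)  # seeded so repeated runs draw the same trials
    space = config["hyperparameter_search_space"]
    num_samples = config["fixed_vals"]["num_samples"]
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(num_samples)
    ]


config = json.loads(CONFIG_TEXT)
trials = candidate_trials(config)
```

Note that the fixed `dropout` of 0.35 is not one of the search-space values, so a tuned run can never reproduce the fixed combination exactly.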
Binary file added src/propythia/DNA/data/mmc3.data
Binary file not shown.
Binary file added src/propythia/DNA/data/mmc4.data
Binary file not shown.
47 changes: 47 additions & 0 deletions src/propythia/DNA/deep_ml.py
@@ -0,0 +1,47 @@
"""
########################################################################
Runs a combination of hyperparameters or performs hyperparameter tuning
for the given model, feature mode, and data directory.
########################################################################
"""

import torch
import os
from src.prepare_data import prepare_data
from src.test import test
from src.hyperparameter_tuning import hyperparameter_tuning
from src.train import traindata
from utils import print_metrics, read_config

os.environ["CUDA_VISIBLE_DEVICES"] = '1,2,3,4,5'
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

def perform(config):
    if config['do_tuning']:
        hyperparameter_tuning(device, config)
    else:
        model_label = config['combination']['model_label']
        mode = config['combination']['mode']
        data_dir = config['combination']['data_dir']
        class_weights = config['combination']['class_weights']
        batch_size = config['hyperparameters']['batch_size']
        kmer_one_hot = config['fixed_vals']['kmer_one_hot']
        hyperparameters = config['hyperparameters']

        trainloader, testloader, validloader, input_size, sequence_length = prepare_data(
            data_dir=data_dir,
            mode=mode,
            batch_size=batch_size,
            k=kmer_one_hot,
        )

        # train the model
        model = traindata(hyperparameters, device, config, trainloader, validloader, input_size, sequence_length)

        # test the model
        metrics = test(device, model, testloader)
        print_metrics(model_label, mode, data_dir, kmer_one_hot, class_weights, metrics)

if __name__ == '__main__':
    config = read_config(device)
    perform(config)
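
The script ends by passing `metrics` from `test` to `print_metrics`, and several commit messages in this PR report accuracy and MCC. What `metrics` actually contains is not visible in this diff. As a hedged illustration, here is how accuracy and the Matthews correlation coefficient could be computed for binary labels with the standard library alone; `binary_metrics` is a hypothetical helper, not the PR's code.

```python
import math
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy and MCC for 0/1 labels (illustrative helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    # By convention it is reported as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "mcc": mcc}


mixed = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
all_negative = binary_metrics([1, 1, 0, 0], [0, 0, 0, 0])
```

The second call shows why MCC is worth reporting next to accuracy: a model that predicts only the negative class still reaches 0.5 accuracy on this balanced toy set, but its MCC is 0.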