IFT6390_Text-Generation

Final-year project for IFT 6390: Fundamentals of Machine Learning.

Team members:
Kuruba Vijaya Lakshmi
Saadaoui Houda
Binulal Narayanan

In this project, we use different data-cleaning strategies, imputation techniques, and machine learning models to classify protein sequences as antimicrobial peptides (AMP) or non-AMP.

Approach

Exploratory data analysis
The data analysis revealed two challenges:

Imbalanced dataset: non-AMP samples far outnumber AMP samples.
Sequence lengths: the distribution of sequence lengths differs substantially between the two classes.
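
The two checks above can be sketched as a per-class summary of sample counts and sequence lengths. This is a minimal illustration, not the notebook's code: the `(sequence, label)` layout, the label convention (1 = AMP, 0 = non-AMP), and the toy sequences are all assumptions.

```python
from statistics import mean, median

# Hypothetical toy data: (sequence, label) pairs, 1 = AMP, 0 = non-AMP.
samples = [
    ("KW" * 10, 1),
    ("GL" * 15, 1),
    ("MA" * 20, 0),
    ("MS" * 60, 0),
    ("MK" * 150, 0),
]

def summarize(samples):
    """Return per-class sample counts and sequence-length statistics."""
    lengths = {}
    for seq, label in samples:
        lengths.setdefault(label, []).append(len(seq))
    return {
        label: {
            "count": len(ls),
            "min": min(ls),
            "max": max(ls),
            "mean": mean(ls),
            "median": median(ls),
        }
        for label, ls in lengths.items()
    }

summary = summarize(samples)
for label, stats in sorted(summary.items()):
    print(label, stats)
```

Comparing `count` across labels exposes the class imbalance, and comparing the length statistics exposes the difference in sequence-length distributions.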

We tried four dataset options to address the imbalance in sequence lengths:

Original dataset
Dataset 1: Non-AMP sequences are filtered so that samples whose sequence lengths do not occur among the AMP sequences are dropped.
Dataset 2: Sequences longer than 125 are dropped, since the 95th percentile of AMP sequence length is 125.
Dataset 3: Only the Amplify dataset is used, since its sample counts and sequence lengths are balanced.
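
The length-based filters for Dataset 1 and Dataset 2 could look roughly like the sketch below. The function names, the `(sequence, label)` layout, the nearest-rank percentile definition, and the toy data are assumptions made for illustration; the cutoff of 125 quoted above comes from the project's own data, whereas here the cutoff is recomputed from whatever samples are passed in.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def build_dataset1(samples):
    """Keep all AMPs; keep a non-AMP only if its length also occurs among AMPs."""
    amp_lengths = {len(seq) for seq, label in samples if label == 1}
    return [(seq, label) for seq, label in samples
            if label == 1 or len(seq) in amp_lengths]

def build_dataset2(samples, q=95):
    """Drop every sequence longer than the q-th percentile of AMP lengths."""
    cutoff = percentile([len(seq) for seq, label in samples if label == 1], q)
    return [(seq, label) for seq, label in samples if len(seq) <= cutoff]

# Hypothetical toy data: label 1 = AMP, 0 = non-AMP.
toy = [
    ("A" * 10, 1), ("A" * 20, 1), ("A" * 30, 1),
    ("C" * 20, 0), ("C" * 25, 0), ("C" * 200, 0),
]
dataset1 = build_dataset1(toy)
dataset2 = build_dataset2(toy)
print(len(dataset1), len(dataset2))
```

In this sketch the Dataset 1 filter is strict about exact length matches, while the Dataset 2 filter only removes the long tail, so it typically keeps more non-AMP samples.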

To address the class imbalance, we tried oversampling and undersampling techniques.
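
The README does not say which resampling variants were used, so the following is only a sketch of the simplest ones: random oversampling (duplicating minority samples) and random undersampling (discarding majority samples). The function names and toy data are hypothetical, and both classes are assumed to be non-empty.

```python
import random

def split_by_size(samples):
    """Return (minority, majority) lists for binary labels 0 and 1."""
    pos = [s for s in samples if s[1] == 1]
    neg = [s for s in samples if s[1] == 0]
    return (pos, neg) if len(pos) < len(neg) else (neg, pos)

def random_oversample(samples, seed=0):
    """Duplicate randomly chosen minority samples until the classes match."""
    rng = random.Random(seed)
    minority, majority = split_by_size(samples)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def random_undersample(samples, seed=0):
    """Keep a random subset of the majority class the size of the minority."""
    rng = random.Random(seed)
    minority, majority = split_by_size(samples)
    return minority + rng.sample(majority, len(minority))

# Hypothetical toy data: 2 AMP samples (label 1), 6 non-AMP samples (label 0).
toy_imbalanced = [("AMP%d" % i, 1) for i in range(2)] + \
                 [("NON%d" % i, 0) for i in range(6)]
oversampled = random_oversample(toy_imbalanced)
undersampled = random_undersample(toy_imbalanced)
print(len(oversampled), len(undersampled))
```

Resampling is usually applied to the training split only, so that evaluation still reflects the original class distribution.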
Models:

1. Random forest classifier + oversampling + Dataset 1
2. Random forest classifier + oversampling + Dataset 2
3. Random forest classifier + undersampling + Dataset 1
4. SVM classifier + no sampling + cost-penalised + original dataset
5. Neural network + Dataset 3
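
For the cost-penalised SVM, the README does not state how the misclassification costs were set. One common choice, shown here purely as an assumption, is to weight each class inversely to its frequency; this is the same heuristic scikit-learn applies when `class_weight="balanced"` is passed to a classifier such as `sklearn.svm.SVC`. The label counts below are made up.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Hypothetical label vector: 90 non-AMP (0) and 10 AMP (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, an error on the rare AMP class costs nine times as much as an error on the non-AMP class, which lets the model train on the original dataset without resampling.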

Results

This repository has 2 Colab files:

IFT_6390_Final_Project_1_.ipynb: contains model options 1 to 5.
IFT6390_Final_Project_1_Supp.ipynb: contains a different data strategy (data splits at different ratios) modelled with random forest.
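
Splitting the data at different ratios might be sketched as below. Whether the supplementary notebook stratifies by class is not stated, so stratification here is an assumption, as are the function name, the ratio, and the toy data.

```python
import random

def stratified_split(samples, test_ratio, seed=0):
    """Split (sequence, label) pairs so each class keeps the same test ratio."""
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)
    train, test = [], []
    for group in by_label.values():
        group = group[:]          # copy so the caller's list is not shuffled
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Hypothetical toy data: 10 AMP samples and 40 non-AMP samples.
toy_split = [("AMP%d" % i, 1) for i in range(10)] + \
            [("NON%d" % i, 0) for i in range(40)]
train, test = stratified_split(toy_split, test_ratio=0.2)
print(len(train), len(test))
```

Calling `stratified_split` with several values of `test_ratio` gives one train/test pair per ratio, each preserving the class proportions of the full dataset.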


vijayakuruba/IFT6390_Classification_of_protien_sequences
