MAIS 202 Final Project: Time Series Prediction Of SARS-CoV-2 Spike Glycoprotein Amino Acid Mutations With LSTM Networks

Final project files for the MAIS 202 Fall 2020 cohort. This project sought to predict whether next-generation amino acid mutations would occur in the spike glycoprotein of SARS-CoV-2. Two approaches were taken, and a dual stacked LSTM neural network architecture was employed.

1) Multilabel binary classification

Input a list of embedded protein representations of amino acid sequences tracing the estimated evolutionary path of SARS-CoV-2 over the past 10 months, and return a binary vector labelling whether each site would mutate (1 = mutation, 0 = no mutation).
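As a rough illustration of the target format described above, the sketch below builds a per-site binary label vector by comparing two aligned sequences. The function name `mutation_labels` and the toy fragments are assumptions made for this example; they are not taken from `main.ipynb`, which may construct its labels differently.

```python
from typing import List


def mutation_labels(parent: str, child: str) -> List[int]:
    """Return 1 at every site where child differs from parent, else 0."""
    if len(parent) != len(child):
        raise ValueError("sequences must be aligned to the same length")
    return [int(a != b) for a, b in zip(parent, child)]


# Toy aligned fragments (made up for illustration, not real spike sequences).
parent = "MFVFLVLA"
child = "MFVYLVLV"
labels = mutation_labels(parent, child)
print(labels)
```

Each position in the resulting vector corresponds to one amino acid site, which matches the 1 = mutation, 0 = no mutation convention.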

2) Single site binary classification

Input a list of single site embedded representations, also tracing the estimated evolutionary path, and return a binary value indicating whether a mutation at that specific site is predicted to occur. This process can also be iterated across all amino acid sites to obtain a prediction for each site.
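The per-site iteration mentioned above could look roughly like the following sketch. It works on raw residue letters rather than embedded representations, and `changed_last_step` is a hypothetical stand-in for a trained single site model; both are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Sequence


def site_history(path: Sequence[str], site: int) -> List[str]:
    """Residues observed at one 1-based site along the evolutionary path."""
    return [seq[site - 1] for seq in path]


def predict_all_sites(
    path: Sequence[str],
    predict_site: Callable[[List[str]], int],
) -> Dict[int, int]:
    """Apply a single site predictor to every site of the aligned sequences."""
    length = len(path[0])
    return {
        site: predict_site(site_history(path, site))
        for site in range(1, length + 1)
    }


def changed_last_step(history: List[str]) -> int:
    """Hypothetical placeholder model: 1 if the last two residues differ."""
    return int(history[-1] != history[-2])


# Toy evolutionary path of aligned fragments (made up for illustration).
path = ["MFVA", "MFVA", "MFVV"]
predictions = predict_all_sites(path, changed_last_step)
print(predictions)
```

In practice the placeholder would be replaced by a call to the trained single site classifier, with one prediction collected per site.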

Despite my best efforts, neither approach can currently predict mutations in the SARS-CoV-2 spike glycoprotein amino acid sequence with confidence. An example illustrates this. Site 222 of the glycoprotein amino acid sequence was selected for single site mutation prediction because it carried the most abundant mutation. The model classified whether the site would mutate with an accuracy of ~56%, but this value is inconclusive because it is nearly identical to the class distribution of the data. That is, the model could have returned the same binary value every time and still obtained a similar accuracy. The multilabel binary classification results were likewise inconclusive. Further details on the possible reasons for these results are included in the final project poster.
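The comparison against the data distribution can be made concrete with a majority-class baseline. The sketch below uses made-up labels with a 56/44 split, chosen only to mirror the ~56% figure; the real label counts and which class was in the majority are not stated here, so treat the numbers as illustrative.

```python
from collections import Counter
from typing import Sequence


def majority_baseline_accuracy(labels: Sequence[int]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of positions where prediction and label agree."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: 56 of one class, 44 of the other (illustrative only).
y_true = [1] * 56 + [0] * 44
constant_pred = [1] * len(y_true)  # a "model" that always outputs 1

baseline = majority_baseline_accuracy(y_true)
constant_acc = accuracy(y_true, constant_pred)
print(baseline, constant_acc)
```

A classifier is only informative if its accuracy clearly exceeds this baseline, which is why an accuracy close to the class split cannot be distinguished from a constant prediction.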

mais_final_project.pdf contains the final project poster

main.ipynb contains all the code needed to run this project: the preprocessing steps, model assembly, and model training. All models were implemented using the Keras framework.

amino_acid contains a previous version of the preprocessing stage. It is not needed for running main.ipynb.
