Data Science Intern At Info Origin Inc.

May 2024 - August 2024

Repository for my work as a Data Science Intern at Info Origin Inc.

Projects

BBC News Articles Classification using Google's NNLM & Custom Neural Network

Developed a custom neural network architecture from scratch for BBC News Articles Classification.
Used Google's NNLM model for text embeddings.
Defined training and testing PyTorch Datasets and DataLoaders.
Observed model behavior across various batch sizes, epochs, and learning rates.
Optimized model hyperparameters with Bayesian Optimization.
Highest Accuracy - 96.4%.
Notebook
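
A minimal sketch of the Dataset/DataLoader setup with a small classifier head, assuming the text embeddings are already computed. This is not the notebook's code: random tensors stand in for the NNLM embeddings, and the embedding width (128), the number of classes (5), the layer sizes, and the names `EmbeddedNewsDataset` and `NewsClassifier` are all illustrative assumptions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

EMBED_DIM = 128   # assumed embedding width, not taken from the notebook
NUM_CLASSES = 5   # assumed number of news categories


class EmbeddedNewsDataset(Dataset):
    """Pairs each precomputed article embedding with its integer class label."""

    def __init__(self, embeddings: torch.Tensor, labels: torch.Tensor):
        assert len(embeddings) == len(labels)
        self.embeddings = embeddings
        self.labels = labels

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int):
        return self.embeddings[idx], self.labels[idx]


class NewsClassifier(nn.Module):
    """Small feed-forward head on top of fixed embeddings (hypothetical layout)."""

    def __init__(self, embed_dim=EMBED_DIM, hidden_dim=64, num_classes=NUM_CLASSES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)  # raw logits; CrossEntropyLoss applies softmax internally


torch.manual_seed(0)
# Random stand-ins for real embeddings and labels.
train_ds = EmbeddedNewsDataset(
    torch.randn(32, EMBED_DIM), torch.randint(0, NUM_CLASSES, (32,))
)
train_loader = DataLoader(train_ds, batch_size=8, shuffle=True)

model = NewsClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One pass over the data; a real run would loop over several epochs.
model.train()
for features, targets in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(features), targets)
    loss.backward()
    optimizer.step()
```

Batch size, learning rate, and epoch count are exactly the knobs varied in the experiments above, so in a sketch like this they would be passed in as parameters rather than hard-coded.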

Named Entity Recognition for Job Descriptions

Annotated Job Descriptions with custom entities using Doccano.
Trained a custom spaCy NER model to recognize entities like Education, Role, Tools & Tech, etc.
Developed a Streamlit app for real-time entity recognition.
Integrated displaCy to visualize the recognized entities in the text. The app could potentially be integrated into HR tools for efficient job-description parsing.
Project Files
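
A minimal sketch of turning Doccano annotations into character-offset spans that a spaCy NER pipeline can be trained on. It assumes a JSONL export in which each line has a `text` field and a `label` field holding `[start, end, label]` triples; field names differ between Doccano versions, so `label_key` is configurable. The function name `doccano_to_spans`, the sample sentence, and the label names `ROLE` and `TOOLS_TECH` are hypothetical and not taken from the project files.

```python
import json


def doccano_to_spans(jsonl_text, label_key="label"):
    """Convert Doccano JSONL lines into (text, {"entities": [...]}) tuples."""
    examples = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        text = record["text"]
        entities = []
        for start, end, label in record.get(label_key, []):
            # Keep only spans that lie inside the text and are non-empty.
            if 0 <= start < end <= len(text):
                entities.append((start, end, label))
        entities.sort()
        examples.append((text, {"entities": entities}))
    return examples


# Hypothetical one-line export used only for illustration.
sample = (
    '{"text": "Hiring a Data Scientist skilled in Python.", '
    '"label": [[9, 23, "ROLE"], [35, 41, "TOOLS_TECH"]]}'
)

examples = doccano_to_spans(sample)
for text, annotations in examples:
    for start, end, label in annotations["entities"]:
        print(label, "->", text[start:end])
```

Checking that `text[start:end]` reproduces the annotated phrase is a cheap way to catch offset errors before training.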

BBC News Articles Classification - RoBERTa with Enriched Vocabulary Layer

Developed a custom RoBERTa model architecture with an added encoded vocabulary layer.
Conducted EDA, exploring the class distribution and text length distribution.
Observed most frequent words in articles by class before and after stop word removal.
Preprocessed articles with stemming and lemmatization.
Tokenized articles using RoBERTa tokenizer.
Accuracy - 98%.
Notebook
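
The word-frequency step of the EDA can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's code: the stop-word list is a tiny hand-written stand-in, the two articles are invented, and `top_words_by_class` is a hypothetical helper name.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "for", "on"}


def top_words_by_class(articles, n=3, stop_words=frozenset()):
    """Return the n most frequent words per class label.

    `articles` is an iterable of (label, text) pairs.
    """
    counts = defaultdict(Counter)
    for label, text in articles:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts[label].update(t for t in tokens if t not in stop_words)
    return {label: counter.most_common(n) for label, counter in counts.items()}


# Invented example articles.
articles = [
    ("sport", "The team won the match and the fans cheered the team."),
    ("tech", "The new phone is a phone for the fans of the brand."),
]

before = top_words_by_class(articles, n=1)
after = top_words_by_class(articles, n=1, stop_words=STOP_WORDS)
print(before)  # {'sport': [('the', 4)], 'tech': [('the', 3)]}
print(after)   # {'sport': [('team', 2)], 'tech': [('phone', 2)]}
```

Before stop-word removal every class is dominated by the same function words; afterwards the top words become class-specific, which is the contrast the comparison is meant to show.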

Fine-tuning LLMs for Sentiment Analysis on SST-5

Fine-tuned several language models, including DeBERTa, RoBERTa, ERNIE, DistilBERT, BERT, and GPT-2, for sentiment analysis on the SST-5 dataset.
Loaded the SST-5 dataset using the datasets library.
Tokenized all examples in the dataset using the tokenizer that corresponds to each model.
Implemented Bayesian Optimization with scikit-optimize.
Due to computational resource limitations, I could only fully run optimization for DistilBERT and ERNIE.
Achieved the highest accuracy of 52.71% with the optimized ERNIE model (highest-to-date: 59.8%).
Notebook

KunalSachdev2005/Data_Science_Intern_at_Info_Origin
