Scientific Paper Subject Prediction

A machine learning approach to predict the subjects of scientific papers

Overview

This project aims to develop a machine learning approach to predict the subjects of scientific papers using the Cora dataset. The dataset consists of 2708 scientific publications classified into one of seven classes: Case_Based, Genetic_Algorithms, Neural_Networks, Probabilistic_Methods, Reinforcement_Learning, Rule_Learning, and Theory. Each paper is described by a 0/1-valued word vector indicating the absence/presence of corresponding words from the dictionary.

Dataset

The Cora dataset includes the following files:

cora.content: Descriptions of papers in the format <paper_id> <word_attributes>+ <class_label>.
cora.cites: Citation graph of the corpus, with one citation link (a pair of paper IDs) per line.
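
As a rough illustration of the two file formats, the sketch below parses a single line from each file. The helper names parse_content_line and parse_cites_line are hypothetical (they are not part of the notebook), the sample lines are shortened for illustration, and whitespace-separated fields are assumed.

```python
from typing import List, Tuple


def parse_content_line(line: str) -> Tuple[str, List[int], str]:
    """Split one cora.content line into (paper_id, word_vector, class_label)."""
    fields = line.split()  # assumption: fields are separated by tabs or spaces
    if len(fields) < 3:
        raise ValueError(f"expected id, attributes and label, got: {line!r}")
    # First field is the paper ID, last field is the label, the rest are 0/1 flags.
    paper_id, *attributes, label = fields
    return paper_id, [int(a) for a in attributes], label


def parse_cites_line(line: str) -> Tuple[str, str]:
    """Split one cora.cites line into its pair of paper IDs."""
    first_id, second_id = line.split()
    return first_id, second_id


# Illustrative lines only; real rows carry a much longer word vector.
sample_content = "31336\t0\t1\t0\t1\tNeural_Networks"
sample_cites = "35\t1033"

paper_id, word_vector, label = parse_content_line(sample_content)
edge = parse_cites_line(sample_cites)
print(paper_id, word_vector, label)
print(edge)
```

Paper IDs are kept as strings so that they can be matched between the two files without any numeric conversion.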

Code Description

The provided Jupyter notebook Scientific_Papers_Classification.ipynb contains the following functions:

load_data(): Loads the Cora dataset from the provided files and returns features representing word attributes (X) and class labels (y).
load_citation_graph(): Loads the citation graph from the .cites file and returns a list of citation information.
split_dataset(X, y): Splits the dataset into train and test sets using Stratified K-Fold cross-validation.
train_and_predict(X_train, y_train, X_test): Trains a Multinomial Naive Bayes classifier using the training set (X_train, y_train) and makes predictions on the test set (X_test).
save_predictions(predictions, test_indices, y_true): Saves the predictions to a tab-separated values (TSV) file.
evaluate_accuracy(y_true, predictions): Calculates the accuracy of the predictions.
Main Execution: Loads the data, performs cross-validation, trains the model, evaluates its performance, saves predictions, and calculates the mean accuracy across all folds.
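
The workflow described above (stratified folds, a Multinomial Naive Bayes model per fold, and a mean accuracy at the end) could look roughly like the sketch below. This is not the notebook's code: the function name cross_validate, the number of folds, the random seed, and the synthetic data are all assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB


def cross_validate(X, y, n_splits=5, seed=0):
    """Return (per-fold accuracies, out-of-fold predictions).

    n_splits and seed are illustrative defaults, not values from the notebook.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    predictions = np.empty(len(y), dtype=object)
    accuracies = []
    for train_idx, test_idx in skf.split(X, y):
        model = MultinomialNB()  # a fresh model for every fold
        model.fit(X[train_idx], y[train_idx])
        fold_pred = model.predict(X[test_idx])
        predictions[test_idx] = fold_pred  # each paper is predicted exactly once
        accuracies.append(accuracy_score(y[test_idx], fold_pred))
    return accuracies, predictions


# Synthetic stand-in for the 0/1 word vectors: 70 "papers", 20 "words", 7 classes.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(70, 20))
class_names = np.array([f"class_{i}" for i in range(7)])
y_demo = np.repeat(class_names, 10)

accuracies, predictions = cross_validate(X_demo, y_demo)
print("mean accuracy:", float(np.mean(accuracies)))
```

Because the demo features are random, the printed accuracy says nothing about performance on Cora; the point is only the shape of the loop. Stratification keeps the class proportions roughly equal across folds, which matters when some subjects are rarer than others.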

Files Included

Scientific_Papers_Classification.ipynb: Jupyter notebook containing the code for data loading, preprocessing, model development, evaluation, and prediction.
predictions.tsv: Tab-separated values file containing the predicted subjects for each paper.
cora.content and cora.cites: Original dataset files.
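
For readers who want to produce or consume a file like predictions.tsv, here is a minimal sketch using Python's csv module. The two-column layout (paper ID, predicted label) is an assumption, since the exact columns are not specified here, and the helper names and sample rows are hypothetical.

```python
import csv
import io


def write_predictions_tsv(rows, handle):
    """Write (paper_id, predicted_label) pairs as tab-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


def read_predictions_tsv(handle):
    """Read tab-separated lines back into a list of tuples."""
    return [tuple(row) for row in csv.reader(handle, delimiter="\t")]


# Illustrative rows; an in-memory buffer stands in for the real file.
demo_rows = [("31336", "Neural_Networks"), ("1061127", "Rule_Learning")]
buffer = io.StringIO()
write_predictions_tsv(demo_rows, buffer)

buffer.seek(0)
round_trip = read_predictions_tsv(buffer)
print(round_trip)
```

To write to disk instead, the same helper can be passed a handle from open("predictions.tsv", "w", newline="").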

Dependencies

numpy
pandas
scikit-learn

Install dependencies using the following command:

pip install numpy pandas scikit-learn

Author

Jayalaxmi Botsa
