Email Spam Filter using SVM

Many email services today provide spam filters that classify emails as spam or non-spam with high accuracy. In this project we use SVMs to build our own spam filter: we train a classifier to decide whether a given email is spam or non-spam. To do so, we first need to convert each email into a feature vector. The dataset included for this project is based on a subset of the SpamAssassin Public Corpus. We use only the body of each email (excluding the email headers).

We have implemented the following email preprocessing and normalization steps:

Lower-casing: The entire email is converted into lower case, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).
Stripping HTML: Many emails come with HTML formatting; all HTML tags are removed so that only the content remains.
Normalizing URLs: All URLs are replaced with the text “httpaddr”.
Normalizing Email Addresses: All email addresses are replaced with the text “emailaddr”.
Normalizing Numbers: All numbers are replaced with the text “number”.
Normalizing Dollars: All dollar signs ($) are replaced with the text “dollar”.
Word Stemming: Words are reduced to their stemmed form. For example, “discount”, “discounts”, “discounted” and “discounting” are all replaced with “discount”. Sometimes the stemmer strips additional characters from the end, so “include”, “includes”, “included”, and “including” are all replaced with “includ”.
Removal of non-words: Non-words and punctuation are removed. Every run of white space (tabs, newlines, spaces) is collapsed to a single space character.
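
The normalization steps above can be sketched in a few lines of Python. This is an illustrative approximation, not the repository's processEmail.m: the function name `preprocess_email`, the regular expressions, and the order of the substitutions are assumptions, and word stemming is left out because the Python standard library has no Porter stemmer.

```python
import re

def preprocess_email(body):
    """Normalize an email body and return its list of word tokens.

    Sketch only: stemming (the Word Stemming step) is not applied here.
    """
    text = body.lower()                                     # lower-casing
    text = re.sub(r"<[^<>]+>", " ", text)                   # strip HTML tags
    text = re.sub(r"(http|https)://\S*", "httpaddr", text)  # normalize URLs
    text = re.sub(r"\S+@\S+", "emailaddr", text)            # normalize email addresses
    text = re.sub(r"[0-9]+", "number", text)                # normalize numbers
    text = re.sub(r"[$]+", "dollar", text)                  # normalize dollar signs
    text = re.sub(r"[^a-z0-9]+", " ", text)                 # drop punctuation and non-words
    return text.split()                                     # split on collapsed white space

sample = "<b>Buy NOW</b> for $100 at http://example.com or mail me@example.com!"
tokens = preprocess_email(sample)
print(tokens)
```

Note that with these substitutions an amount such as "$100" becomes the single token "dollarnumber", because the dollar sign and the number are adjacent in the text.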

We load a preprocessed training dataset that is used to train an SVM classifier. spamTrain.mat contains 4000 training examples of spam and non-spam email, while spamTest.mat contains 1000 test examples. Each original email was processed using the processEmail and emailFeatures functions and converted into a feature vector.
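
As a rough illustration of how a processed email can become a vector, the Python sketch below marks which vocabulary words occur in the email. The binary presence encoding, the function name `email_features`, and the five-word vocabulary are assumptions made for this example; the repository's emailFeatures.m and vocab.txt define the actual mapping.

```python
def email_features(tokens, vocab):
    """Return a binary vector: entry i is 1 if vocab[i] occurs in the email."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

# Hypothetical miniature vocabulary; the real one is read from vocab.txt.
vocab = ["buy", "discount", "httpaddr", "meet", "number"]
tokens = ["buy", "number", "discount", "at", "httpaddr", "buy"]

features = email_features(tokens, vocab)
print(features)
```

Words that are not in the vocabulary (here "at") are ignored, and repeated words (here "buy") still contribute a single 1.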

After loading the dataset, we train an SVM to classify emails as spam or non-spam. Once training completes, you should see that the classifier gets a training accuracy of about 99.8% and a test accuracy of about 98.5%.
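
To make the train-then-evaluate loop concrete, here is a small Python sketch of a linear SVM fitted by batch subgradient descent on the regularized hinge loss, followed by an accuracy calculation. It is a stand-in for illustration and is not the algorithm in the repository's svmTrain.m; the function names, the hyperparameters (`lam`, `lr`, `epochs`), and the six-example toy dataset are all assumptions, so the accuracy it prints says nothing about the figures quoted above.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the L2-regularized hinge loss."""
    n, d = len(X), len(X[0])
    signs = [1 if label == 1 else -1 for label in y]  # map {0, 1} to {-1, +1}
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]               # gradient of the regularizer
        grad_b = 0.0
        for xi, si in zip(X, signs):
            margin = si * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                            # example violates the margin
                for j in range(d):
                    grad_w[j] -= si * xi[j] / n
                grad_b -= si / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return 1 (spam) if the decision value is positive, else 0 (non-spam)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

def accuracy(w, b, X, y):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)

# Toy binary feature vectors (hypothetical): features 0-1 suggest spam, 2-3 non-spam.
X_train = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0],
           [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1]]
y_train = [1, 1, 1, 0, 0, 0]

w, b = train_linear_svm(X_train, y_train)
print("training accuracy:", accuracy(w, b, X_train, y_train))
```

On the real data the same pattern applies: fit on the 4000 training vectors, then report accuracy separately on the training set and on the 1000 held-out test vectors.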

shivangbajaj/Email-Spam-Filter-using-SVM