SPAM classification

Date:

Key Words: R, Logistic Regression, GBM, tf-idf
​​
This project classifies email as SPAM or NOT SPAM; it was an in-class competition for the statistical learning course. The data set for the analysis can be found at the UCI ML Repo; it contains normalized tf-idf values for several important words that can discriminate between SPAM and NOT SPAM email. As part of data exploration, we i) selected variables based on the feature density plots, ii) removed highly correlated variables based on the VIF measure, and iii) identified significant variables and their interactions by interpreting odds ratios from logistic regression. For the binary classification objective, several machine learning models were used with the best set of variables and their interactions: GLM, ElasticNet, GBM, and radial SVM. Our feature selection, with GLM and GBM models as classifiers, ranked top on the competition leaderboard.
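To make step iii) concrete, here is a minimal sketch of reading odds ratios off a fitted logistic regression. It is an illustration only, not the project's code: the project was done in R, while this sketch uses plain Python with no external packages. The feature names ("free", "the"), the synthetic data, and the helper fit_logistic are all hypothetical. The model is fitted by simple batch gradient ascent, not by the iteratively reweighted least squares that R's glm uses.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=1.0, n_iter=2000):
    """Fit logistic regression by batch gradient ascent on the mean
    log-likelihood. Returns [intercept, coef_1, ..., coef_k].
    (Hypothetical helper written for this illustration.)"""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for row, label in zip(X, y):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], row))
            err = label - sigmoid(z)  # residual on the probability scale
            grad[0] += err
            for j, v in enumerate(row):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Synthetic stand-in for normalized tf-idf features (assumed, not real data):
# "free" drives the spam probability, "the" is pure noise.
rng = random.Random(0)
names = ["free", "the"]
X, y = [], []
for _ in range(300):
    free, the = rng.random(), rng.random()
    p_spam = sigmoid(-2.0 + 4.0 * free)
    X.append([free, the])
    y.append(1 if rng.random() < p_spam else 0)

beta = fit_logistic(X, y)

# exp(coefficient) is the multiplicative change in the odds of SPAM
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = {name: math.exp(b) for name, b in zip(names, beta[1:])}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio per unit increase = {ratio:.2f}")
```

An odds ratio well above 1 marks a word whose higher tf-idf raises the odds of SPAM, and a ratio near 1 suggests the word carries little signal. In R, the same quantity would typically be read from exp(coef(model)) on a glm fit with family = binomial.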
Links: Git Repo