Faulty Water pump prediction
Date:
Key words: R, h2o , XGboost ,Sparklyr
This Project is a competition on DRIVENDATA, to predict which pumps are functional, which need some repairs, and which don’t work at all. Predicting one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which water points will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.
Data cleaning was as performed such as
i) Removed some of the variables as they are constant or no variation
ii) Some variables are similar to others such as latitude and longitude variables conveys location information ,so other redundant location variables were removed
iii) Some categorical variables have too many levels and such variables were discarded
iv) created new variables to account for age of pump based on installation date etc.,
After feature engineering ,Classification model was trained using Xgboost and Random Forest using h2o, sparklyr packages and interpreted significant variables for classifying faulty water based on information gain.
Links : Git Repo