Insurance Claims Loss Prediction

Date:

Keywords: R, h2o, XGBoost, LASSO, PCA, K-Means Clustering

Insurance is one of the prominent domains where statistical learning and machine learning play a crucial role. With statistical learning, we can automate the prediction of claim cost and claim severity. This automation supports a worry-free customer experience and offers insight into better ways to predict claim severity.
For this group project, our task was to predict the numeric target variable “loss” from anonymized predictor variables: 116 categorical and 14 continuous. With over 180,000 observations and 130 predictors, the data called for a great deal of feature selection and feature engineering.
We first explored the data using statistical tests and correlation analysis to better understand the problem. We then engineered features using dimensionality reduction and clustering, and selected features using LASSO and Random Forest. Finally, we modeled “loss” with XGBoost, GBM, and LASSO on the selected important features.
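As a rough illustration of the dimensionality-reduction step, the sketch below runs PCA on a few continuous columns with base R's `prcomp` and keeps enough components to cover a chosen share of the variance. The simulated data, the column names `cont1` to `cont4`, and the 90% threshold are all assumptions made for this example; they are not taken from the project.

```r
# Minimal PCA sketch on simulated continuous predictors (hypothetical data).
set.seed(1)
n <- 200
base <- rnorm(n)
cont <- data.frame(
  cont1 = base + rnorm(n, sd = 0.1),  # cont1 and cont2 are strongly correlated
  cont2 = base + rnorm(n, sd = 0.1),
  cont3 = rnorm(n),
  cont4 = rnorm(n)
)

# Center and scale so every variable contributes on the same footing
pca <- prcomp(cont, center = TRUE, scale. = TRUE)

# Share of total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Keep the smallest number of components reaching 90% (assumed threshold)
k <- which(cum_var >= 0.90)[1]
scores <- pca$x[, seq_len(k), drop = FALSE]  # reduced features for modeling
```

The columns of `scores` can then stand in for the original continuous variables in a downstream model.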
Jay’s role in this project was to design the analysis pipeline, perform feature engineering using PCA (on the numeric data), K-Means clustering with Gower distance, and correlation analysis, and model “loss” using the XGBoost algorithm.
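Gower distance is what lets categorical and continuous variables share one dissimilarity measure: each numeric column contributes a range-normalized absolute difference, each categorical column contributes 0 or 1 for match or mismatch, and the contributions are averaged. The base-R sketch below shows that calculation on a toy data frame. The function `gower_distance`, the toy columns, and the equal weighting of variables are assumptions for illustration, and the summary does not say how the project fed these distances into K-Means. In practice, `cluster::daisy(df, metric = "gower")` computes the same kind of matrix.

```r
# Gower distance for a mixed-type data frame (illustrative helper, equal weights).
gower_distance <- function(df) {
  n <- nrow(df)
  p <- ncol(df)
  total <- matrix(0, n, n)
  for (j in seq_len(p)) {
    x <- df[[j]]
    if (is.numeric(x)) {
      rng <- max(x) - min(x)
      # Range-normalized absolute difference; a constant column contributes 0
      d <- if (rng > 0) abs(outer(x, x, "-")) / rng else matrix(0, n, n)
    } else {
      # Simple matching: 0 if the levels agree, 1 otherwise
      x <- as.character(x)
      d <- 1 * outer(x, x, "!=")
    }
    total <- total + d
  }
  total / p  # average over variables, so every entry lies in [0, 1]
}

# Toy data: two categorical columns and one continuous column (hypothetical)
toy <- data.frame(
  cat1  = c("A", "B", "A", "B"),
  cat2  = c("X", "X", "Y", "Y"),
  cont1 = c(0.0, 0.5, 1.0, 0.25),
  stringsAsFactors = TRUE
)
d <- gower_distance(toy)
```

For rows 1 and 2 of the toy data, `cat1` differs (1), `cat2` matches (0), and `cont1` differs by 0.5 of its range, so the distance is (1 + 0 + 0.5) / 3 = 0.5.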
Links: Project Report and Git Repo