Abstract:
The rising incidence of fraudulent claims is a major concern for insurance companies. In this work, we develop two machine learning models, Extreme Gradient Boosting (XGBoost) and Random Forest, to detect fraud in auto insurance claims data. The models classify each claim as fraudulent or non-fraudulent. Several data pre-processing techniques are used to clean and explore the data and to extract relevant features. The effectiveness of the algorithms is assessed using a confusion matrix and the performance metrics precision, recall, and F1 score. To handle class imbalance, we also apply two oversampling-based data augmentation techniques, the Synthetic Minority Oversampling Technique (SMOTE) and Random Oversampling (ROS), and compare the results of the models before and after the data is balanced. The comparison shows that the XGBoost model is more effective than the Random Forest model at detecting fraud on the imbalanced data, whereas the Random Forest model is effective at predicting fraudulent claims once the data augmentation techniques are applied.
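As an illustration of two ideas named above, the following minimal Python sketch shows how random oversampling can balance a binary label set and how precision, recall, and F1 score follow from confusion-matrix counts. It is not the implementation used in this work: the function names `random_oversample` and `precision_recall_f1`, the toy data, and the convention that label 1 marks a fraudulent claim are all assumptions made for the example.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class (with replacement)
    until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class from
    true-positive, false-positive, and false-negative counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (hypothetical): 8 non-fraudulent claims (0) and 2 fraudulent ones (1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
print(precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

In a sketch like this, oversampling would normally be applied to the training split only, so that duplicated minority rows do not leak into the data used for evaluation.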