This project illustrates point prediction in supervised binary classification, applied to measuring credit risk in a financial application. The goal is to apply several machine learning models to estimate the creditworthiness of an applicant. The project also uses PCA to reduce the dimensionality of the feature set in order to make training the supervised binary classifiers more efficient.
Dataset
The dataset comprises loan data from Lending Club, a peer-to-peer lending company.
Main goals
The project is designed to achieve the following key objectives:
Data inspection and visualization: Perform thorough inspection and visualization of the data to uncover initial insights and patterns.
Data loading and preprocessing: Load and preprocess historical loan data to ensure data integrity. Standardize continuous features and transform categorical features via one-hot encoding.
Principal component analysis: Apply PCA to reduce the dimensionality of the transformed dataset, retaining the principal components that together capture 95% of the variance in the data.
Fit and assessment: Split the data into an estimation (training) set and a test set. Fit logistic regression, CART (decision tree), and gradient boosting classifiers on the estimation set.
Cross-validation and hyperparameter tuning: Apply five-fold cross-validation and hyperparameter tuning to improve each model's predictive performance.
Performance metrics: Compare performance metrics such as the confusion matrix, ROC curve, and AUC across the above binary classifiers to assess goodness of fit.
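The objectives above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the project's actual code: the data is synthetic, the column names (`loan_amnt`, `int_rate`, `annual_inc`, `home_ownership`, `purpose`, `default`) are hypothetical stand-ins for Lending Club fields, and the hyperparameter grids are arbitrary small examples.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "loan_amnt": rng.normal(15000, 5000, n),
    "int_rate": rng.normal(12, 4, n),
    "annual_inc": rng.lognormal(11, 0.5, n),
    "home_ownership": rng.choice(["RENT", "OWN", "MORTGAGE"], n),
    "purpose": rng.choice(["car", "debt_consolidation", "other"], n),
})
# Tie the default flag loosely to the interest rate so the models have signal.
p_default = 1 / (1 + np.exp(-0.3 * (df["int_rate"].to_numpy() - 12)))
df["default"] = (rng.random(n) < p_default).astype(int)

numeric = ["loan_amnt", "int_rate", "annual_inc"]
categorical = ["home_ownership", "purpose"]
X, y = df[numeric + categorical], df["default"]

# Estimation (training) set and held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize continuous features, one-hot encode categorical ones.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0.0,  # force dense output for the PCA step
)

# Classifier plus an illustrative (assumed) grid for each model.
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000),
                            {"clf__C": [0.1, 1.0, 10.0]}),
    "cart": (DecisionTreeClassifier(random_state=0),
             {"clf__max_depth": [2, 4, 6]}),
    "gradient_boosting": (GradientBoostingClassifier(random_state=0),
                          {"clf__n_estimators": [50, 100]}),
}

results = {}
for name, (clf, grid) in candidates.items():
    pipe = Pipeline([
        ("prep", preprocess),
        # A float n_components keeps enough components to explain
        # at least that fraction of the variance.
        ("pca", PCA(n_components=0.95, svd_solver="full")),
        ("clf", clf),
    ])
    # Five-fold cross-validated grid search, scored by ROC AUC.
    search = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc")
    search.fit(X_train, y_train)
    proba = search.predict_proba(X_test)[:, 1]
    results[name] = {
        "best_params": search.best_params_,
        "n_components": search.best_estimator_.named_steps["pca"].n_components_,
        "auc": roc_auc_score(y_test, proba),
        "confusion": confusion_matrix(y_test, search.predict(X_test)),
    }
    print(name, results[name]["best_params"], round(results[name]["auc"], 3))
```

Placing the preprocessing and PCA steps inside the `Pipeline` means they are refit on each training fold during cross-validation, so no information from the validation folds leaks into the scaling or the principal components.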
Conclusion
The project addresses several challenges in estimating credit risk from financial data. PCA plays a pivotal role in improving training efficiency and accuracy. This kind of analysis is especially important when evaluating a borrower's credit risk, where better estimates help reduce loan losses.