In this post, we are going to perform Exploratory Data Analysis to understand how data is used to minimize the risk of losing money while lending to customers.
We are going to have a look at the dataset for one of the financial institution called LandingClub.
Background
LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California.
It enables borrowers to create unsecured personal loans between $1,000 and $40,000. The standard loan period is three years. Investors can search and browse the loan listings on LendingClub website and select loans that they want to invest in based on the information supplied about the borrower, amount of loan, loan grade, and loan purpose. Investors make money from interest. LendingClub makes money by charging borrowers an origination fee and investors a service fee.
Business Understanding
When a loan application is received by the company, the company must make a decision to approve the loan based on the profile of the applicant. Two kinds of risks are linked to the choice of the bank
- If the applicant is inclined to repay the loan, then not approving the loan will result in a business loss to the company
- If the applicant is unlikely to repay the loan, i.e. if the applicant is inclined to default, then approving the loan may result in a financial loss to the company.
The raw data includes information on previous credit candidates and whether or not they have 'defaulted.' The goal is to define patterns that show whether an individual is likely to default, which can be used to take measures such as denying the loan, decreasing credit quantity, lending (to dangerous candidates) at a greater interest rate, etc.
If the firm approves the loan, there are three possible scenarios outlined below:
- Fully paid: Applicant has fully paid the loan (the principal and the interest rate)
- Current: Applicant is paying the instalments, i.e. the credit holding is not yet finished. The labelling of these applicants is not' default.'
- Charged-off: The applicant has not paid the instalments for a lengthy period of time in due time, i.e. he/she has defaulted on the loan
Objective
To understand the driving factors behind loan default
Code
We are using Python for performing EDA
Raw Data
You can download the raw data from the below link.
Link -
loan.zip
Importing the libraries
Import the below libraries which are required for this analysis.
Reading the CSV
We can easily create a dataframe using below from a csv file.
Cleaning the Data
Data cleaning is the first and one of the most important step.
The raw data may contain some invalid data, empty/null records, corrupted data etc. We need to clean these inconsistencies based on the need, so that there is no noise in our data analysis.
Below code lets us know, what is the percentage of the Null/NAN values in each column and rows with all null values.
Deleting the null values
The columns which have all the values as null can be removed as they won't impact our analysis.
We can also remove the other columns which don't have much data. If for the columns more than around 70% ~80% data is null, we can remove them.
But these decisions also depend on business needs and those should be considered while taking any decision of removing any data.
Removing Outliers
Outliers are data points not belonging to a given population. They are an abnormal observation that lies far away from other values.
We can identify outliers in many ways.
Box Plot
Distribution Plot
There are many ways we can remove the outliers-
- Deleting observations
- Imputing Values
- Transforming and binning values
In this case, we can see that we have some outliers between -2000 and -3000. Hence we can remove these. Also based on business knowledge, the FICO cannot be below zero, so we can remove all the data below zero.
After removing outliers we can see that the data has a uniform spread.
Bivariate Analysis
Bivariate analysis means the analysis of bivariate data. It is one of the simplest forms of statistical analysis, used to find out if there is a relationship between two sets of values. It usually involves the variables X and Y.
- Univariate analysis is the analysis of one (“uni”) variable.
- Bivariate analysis is the analysis of exactly two variables.
- Multivariate analysis is the analysis of more than two variables.
Correlation
The term "correlation" refers to a mutual relationship between the two variables. If two variables have a high correlation, we can keep only one as they will have same impact on the target value. This will also help us in reducing the dimensionality of the dataset.
Heatmap
Heatmap provides the visual representation of the correlation, which helps us analyze the relationships better.
Based on the correlation, we can infer that below columns have high correlation
- funded_amnt,funded_amnt_inv,loan_amnt,total_pymnt,total_pymnt_inv,total_rec_prncp
- out_prncp,out_prncp_inv
- pub_rec,pub_rec_bankruptcies
- open_acc,total_acc
columns which are inversely proportional
- credit_used and revol_util
We can choose one of the columns among each group as they will have same impact on the target column i.e. loan_status
Impact of the columns on the target
We need to identify the impact of each column on the target. There many ways to visualize this.
Pearson correlation coefficient
It is a measure of the linear correlation between two variables X and Y. According to the Cauchy–Schwarz inequality it has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
Using Bar Chart
Based on the above observations we can identify the below variables which impact for the loan being defaulted.
Columns' Relation with loan being default
1. int_rate- direct
2. annual_inc-inverse
3. mths_since_last_delinq - direct
4. pub_rec_bankruptcies- direct
5. delinq_2yrs - direct
6. dti - direct
7. fico - direct
8. credit_history -inverse
9. credit_used -inverse
10. payment_hist - direct
11. revol_util -direct
Target Variables
Let's now just emphasize on the data where clients have been charged-off i.e. they were unable to pay the loan.
Values which are impacting the loan being charged-off
1. Grade B and C
2. 10+ Years Employee length
3. Rent and Mortgage - Home Ownership
4. Not Verified- Verification status
5. Purpose-Debt Consolidation
6. Term- 36 months
By consolidating all the information above, we can conclude that the below are the indicators for loan being default.
Nice article.
ReplyDelete