Performing Exploratory Data Analysis (EDA) using Python to predict Loan Defaulters

By Sudhanshu Shukla September 23, 2019

Performing Exploratory Data Analysis (EDA) using Python to predict Loan Defaulters

In this post, we are going to perform Exploratory Data Analysis to understand how data is used to minimize the risk of losing money while lending to customers.

We are going to have a look at the dataset for one of the financial institution called LandingClub.

Background

LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California.

It enables borrowers to create unsecured personal loans between $1,000 and $40,000. The standard loan period is three years. Investors can search and browse the loan listings on LendingClub website and select loans that they want to invest in based on the information supplied about the borrower, amount of loan, loan grade, and loan purpose. Investors make money from interest. LendingClub makes money by charging borrowers an origination fee and investors a service fee.

Business Understanding

When a loan application is received by the company, the company must make a decision to approve the loan based on the profile of the applicant. Two kinds of risks are linked to the choice of the bank

If the applicant is inclined to repay the loan, then not approving the loan will result in a business loss to the company
If the applicant is unlikely to repay the loan, i.e. if the applicant is inclined to default, then approving the loan may result in a financial loss to the company.

The raw data includes information on previous credit candidates and whether or not they have 'defaulted.' The goal is to define patterns that show whether an individual is likely to default, which can be used to take measures such as denying the loan, decreasing credit quantity, lending (to dangerous candidates) at a greater interest rate, etc.

If the firm approves the loan, there are three possible scenarios outlined below:

Fully paid: Applicant has fully paid the loan (the principal and the interest rate)
Current: Applicant is paying the instalments, i.e. the credit holding is not yet finished. The labelling of these applicants is not' default.'
Charged-off: The applicant has not paid the instalments for a lengthy period of time in due time, i.e. he/she has defaulted on the loan

Objective

To understand the driving factors behind loan default

Code

We are using Python for performing EDA

Raw Data

You can download the raw data from the below link.

Link - loan.zip

Importing the libraries

Import the below libraries which are required for this analysis.

Reading the CSV

We can easily create a dataframe using below from a csv file.

Cleaning the Data

Data cleaning is the first and one of the most important step.

The raw data may contain some invalid data, empty/null records, corrupted data etc. We need to clean these inconsistencies based on the need, so that there is no noise in our data analysis.

Below code lets us know, what is the percentage of the Null/NAN values in each column and rows with all null values.

Deleting the null values

The columns which have all the values as null can be removed as they won't impact our analysis.

We can also remove the other columns which don't have much data. If for the columns more than around 70% ~80% data is null, we can remove them.
But these decisions also depend on business needs and those should be considered while taking any decision of removing any data.

Removing Outliers

Outliers are data points not belonging to a given population. They are an abnormal observation that lies far away from other values.

We can identify outliers in many ways.

Box Plot

Distribution Plot

There are many ways we can remove the outliers-

Deleting observations
Imputing Values
Transforming and binning values

In this case, we can see that we have some outliers between -2000 and -3000. Hence we can remove these. Also based on business knowledge, the FICO cannot be below zero, so we can remove all the data below zero.

After removing outliers we can see that the data has a uniform spread.

Bivariate Analysis

Bivariate analysis means the analysis of bivariate data. It is one of the simplest forms of statistical analysis, used to find out if there is a relationship between two sets of values. It usually involves the variables X and Y.

Univariate analysis is the analysis of one (“uni”) variable.
Bivariate analysis is the analysis of exactly two variables.
Multivariate analysis is the analysis of more than two variables.

Correlation

The term "correlation" refers to a mutual relationship between the two variables. If two variables have a high correlation, we can keep only one as they will have same impact on the target value. This will also help us in reducing the dimensionality of the dataset.

Heatmap

Heatmap provides the visual representation of the correlation, which helps us analyze the relationships better.

Based on the correlation, we can infer that below columns have high correlation

funded_amnt,funded_amnt_inv,loan_amnt,total_pymnt,total_pymnt_inv,total_rec_prncp
out_prncp,out_prncp_inv
pub_rec,pub_rec_bankruptcies
open_acc,total_acc

columns which are inversely proportional

credit_used and revol_util

We can choose one of the columns among each group as they will have same impact on the target column i.e. loan_status

Impact of the columns on the target

We need to identify the impact of each column on the target. There many ways to visualize this.

Pearson correlation coefficient

It is a measure of the linear correlation between two variables X and Y. According to the Cauchy–Schwarz inequality it has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.

Using Bar Chart

Based on the above observations we can identify the below variables which impact for the loan being defaulted.

Quantitative Variables -

Columns' Relation with loan being default

1. int_rate- direct
2. annual_inc-inverse
3. mths_since_last_delinq - direct
4. pub_rec_bankruptcies- direct
5. delinq_2yrs - direct
6. dti - direct
7. fico - direct
8. credit_history -inverse
9. credit_used -inverse
10. payment_hist - direct
11. revol_util -direct

Target Variables

Let's now just emphasize on the data where clients have been charged-off i.e. they were unable to pay the loan.

Values which are impacting the loan being charged-off

1. Grade B and C
2. 10+ Years Employee length
3. Rent and Mortgage - Home Ownership
4. Not Verified- Verification status
5. Purpose-Debt Consolidation
6. Term- 36 months

Conclusion

By consolidating all the information above, we can conclude that the below are the indicators for loan being default.

Comments

Elka BrandSeptember 23, 2019 at 12:48 AM
Nice article.
ReplyDelete
Replies

Add comment

Search This Blog

Naive Data

Featured

Churn Score : Predict the customer churn using Machine Learning