
Performing Exploratory Data Analysis (EDA) using Python to predict Loan Defaulters



In this post, we are going to perform Exploratory Data Analysis (EDA) to understand how data can be used to minimize the risk of losing money while lending to customers.

We are going to have a look at the dataset of a financial institution called LendingClub.

Background

LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California.

It enables borrowers to obtain unsecured personal loans between $1,000 and $40,000. The standard loan period is three years. Investors can search and browse the loan listings on the LendingClub website and select loans they want to invest in based on the information supplied about the borrower, the loan amount, loan grade, and loan purpose. Investors make money from interest. LendingClub makes money by charging borrowers an origination fee and investors a service fee.


Business Understanding

When the company receives a loan application, it must decide whether to approve the loan based on the applicant's profile. Two kinds of risk are associated with this decision:
  • If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company.
  • If the applicant is unlikely to repay the loan, i.e. is likely to default, then approving the loan may result in a financial loss to the company.
The raw data includes information on previous loan applicants and whether or not they 'defaulted'. The goal is to identify patterns that show whether an applicant is likely to default, which can be used to take measures such as denying the loan, reducing the loan amount, or lending to risky applicants at a higher interest rate.

If the firm approves the loan, there are three possible scenarios outlined below:
  • Fully paid: the applicant has fully paid the loan (the principal and the interest).
  • Current: the applicant is still paying the instalments, i.e. the tenure of the loan is not yet complete. These applicants are not labelled as 'default'.
  • Charged-off: the applicant has not paid the instalments due for a long period of time, i.e. he/she has defaulted on the loan.

Objective

To understand the driving factors behind loan default

Code

We are using Python to perform the EDA.

Raw Data


You can download the raw data from the link below.

Link - loan.zip

Importing the libraries

Import the libraries below, which are required for this analysis.
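A typical set of imports for this kind of analysis might look like the following sketch (pandas and numpy for data handling, matplotlib and seaborn for plotting):

    import numpy as np                 # numerical operations
    import pandas as pd                # dataframes and CSV handling
    import matplotlib.pyplot as plt    # plotting
    import seaborn as sns              # statistical visualizations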


Reading the CSV

We can easily create a DataFrame from a CSV file, as shown below.
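A minimal sketch, assuming the file extracted from loan.zip is named loan.csv:

    # Read the raw CSV into a DataFrame
    # ('loan.csv' is the assumed name of the extracted file)
    loan = pd.read_csv('loan.csv', low_memory=False)
    print(loan.shape)   # (rows, columns)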


Cleaning the Data

Data cleaning is the first and one of the most important steps.
The raw data may contain invalid data, empty/null records, corrupted data, etc. We need to clean up these inconsistencies as needed, so that there is no noise in our analysis.

The code below shows the percentage of null/NaN values in each column, as well as the rows in which all values are null.
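A sketch of how this can be computed with pandas, assuming the DataFrame is named loan:

    # Percentage of null/NaN values in each column
    null_pct = (loan.isnull().mean() * 100).round(2)
    print(null_pct.sort_values(ascending=False))

    # Number of rows in which every value is null
    print(loan.isnull().all(axis=1).sum())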


Deleting the null values


Columns in which all the values are null can be removed, as they won't impact our analysis.

We can also remove other columns that don't have much data: if more than around 70%-80% of a column's values are null, we can drop it.
These decisions also depend on business needs, which should be considered before removing any data. A minimal sketch of both steps is shown below.
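The sketch below uses a 70% threshold, which is an adjustable assumption:

    # Drop columns in which every value is null
    loan = loan.dropna(axis=1, how='all')

    # Drop columns where more than ~70% of the values are null;
    # the exact threshold is a judgment call driven by business needs
    null_fraction = loan.isnull().mean()
    loan = loan.drop(columns=null_fraction[null_fraction > 0.70].index)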


Removing Outliers


Outliers are data points that do not belong to a given population. They are abnormal observations that lie far away from the other values.

We can identify outliers in many ways.

Box Plot
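For example, a box plot of annual_inc (annual income, a column in the LendingClub dataset) makes extreme values easy to spot; this is an illustrative sketch, not necessarily the exact plot used here:

    # Box plot of annual income to spot extreme values
    sns.boxplot(x=loan['annual_inc'])
    plt.title('Annual Income - Box Plot')
    plt.show()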




Distribution Plot
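Similarly, a distribution plot shows how far the tail of a column stretches; a sketch using the same assumed column:

    # Distribution plot (histogram + density) of annual income
    sns.histplot(loan['annual_inc'], kde=True)
    plt.title('Annual Income - Distribution')
    plt.show()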




There are many ways we can remove outliers:

  • Deleting observations
  • Imputing Values
  • Transforming and binning values
In this case, we can see some outliers between -2000 and -3000, so we can remove them. Also, based on business knowledge, a FICO score cannot be below zero, so we can remove all records with values below zero, as sketched below.
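A sketch of the filtering described above, assuming the derived fico column used later in this analysis:

    # A FICO score can never be negative, so drop records below zero
    # (this also removes the outliers seen between -2000 and -3000)
    loan = loan[loan['fico'] >= 0]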



After removing the outliers, we can see that the data has a uniform spread.


Bivariate Analysis


Bivariate analysis means the analysis of bivariate data. It is one of the simplest forms of statistical analysis, used to find out if there is a relationship between two sets of values. It usually involves the variables X and Y.
  • Univariate analysis is the analysis of one (“uni”) variable.
  • Bivariate analysis is the analysis of exactly two variables.
  • Multivariate analysis is the analysis of more than two variables.

Correlation

The term "correlation" refers to a mutual relationship between the two variables. If two variables have a high correlation, we can keep only one as they will have same impact on the target value. This will also help us in reducing the dimensionality of the dataset.



Heatmap


A heatmap provides a visual representation of the correlation matrix, which helps us analyze the relationships better.
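A sketch of plotting the matrix computed above as a heatmap:

    # Heatmap of the correlation matrix
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr, cmap='coolwarm', center=0)
    plt.title('Correlation Heatmap')
    plt.show()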



Based on the correlation, we can infer that the columns below are highly correlated:
  • funded_amnt,funded_amnt_inv,loan_amnt,total_pymnt,total_pymnt_inv,total_rec_prncp
  • out_prncp,out_prncp_inv
  • pub_rec,pub_rec_bankruptcies
  • open_acc,total_acc

Columns that are inversely correlated:
  • credit_used and revol_util
We can choose one column from each group, as the columns in a group will have the same impact on the target column, i.e. loan_status.

Impact of the columns on the target


We need to identify the impact of each column on the target. There are many ways to visualize this.

Pearson correlation coefficient


It is a measure of the linear correlation between two variables X and Y. By the Cauchy–Schwarz inequality, it has a value between +1 and −1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
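A sketch of computing each numeric column's Pearson coefficient against the target, assuming a hypothetical 0/1 flag is_default derived from loan_status:

    # Hypothetical binary target: 1 if the loan was charged off
    loan['is_default'] = (loan['loan_status'] == 'Charged Off').astype(int)

    # Pearson correlation of every numeric column with the target
    pearson = (loan.select_dtypes(include=[np.number])
                   .corr(method='pearson')['is_default'])
    print(pearson.sort_values(ascending=False))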



Using Bar Chart
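A sketch that plots the coefficients computed above as a horizontal bar chart:

    # Bar chart of each column's correlation with the target
    pearson.drop('is_default').sort_values().plot(kind='barh', figsize=(8, 10))
    plt.xlabel('Pearson correlation with is_default')
    plt.tight_layout()
    plt.show()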



Based on the above observations, we can identify the variables below as having an impact on whether a loan is defaulted.


Quantitative Variables

Each column's relation with the loan being defaulted:

1. int_rate - direct
2. annual_inc - inverse
3. mths_since_last_delinq - direct
4. pub_rec_bankruptcies - direct
5. delinq_2yrs - direct
6. dti - direct
7. fico - direct
8. credit_history - inverse
9. credit_used - inverse
10. payment_hist - direct
11. revol_util - direct

Target Variables


Let's now focus on the data where clients have been charged off, i.e. they were unable to pay back the loan.
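A sketch of filtering those records, assuming loan_status uses the value 'Charged Off':

    # Keep only the records where the loan was charged off
    charged_off = loan[loan['loan_status'] == 'Charged Off']
    print(charged_off.shape)

    # Example: distribution of grades among charged-off loans
    print(charged_off['grade'].value_counts(normalize=True))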



Values that impact the loan being charged off:


1. Grade - B and C
2. Employment length - 10+ years
3. Home ownership - Rent and Mortgage
4. Verification status - Not Verified
5. Purpose - Debt Consolidation
6. Term - 36 months

Conclusion

By consolidating all the information above, we can conclude that the following are the indicators of a loan being defaulted.




