Prediction of Customer Churn in a Bank Using Machine Learning

Wuraolaifeoluwa
11 min read · May 21, 2021

Churn is the measure of how many customers stop using a product. This can be measured based on actual usage or failure to renew (when the product is sold using a subscription model). Often evaluated for a specific period of time, there can be a monthly, quarterly, or annual churn rate.
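As a quick illustration of the definition above, the churn rate for a period can be computed as the number of customers lost divided by the number of customers at the start of that period. This is only a minimal sketch; the function name and the figures are made up for illustration.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of the customers present at the start of a period who left during it."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Hypothetical quarter: 2,000 customers at the start, 150 of them left.
quarterly = churn_rate(2000, 150)
print(f"Quarterly churn rate: {quarterly:.1%}")
```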

A bank is a financial institution licensed to receive deposits and make loans. Banks may also provide financial services such as wealth management and currency exchange. There are several different kinds of banks, including retail banks, commercial or corporate banks, and investment banks.

When new customers start buying or using a bank’s products, each new user contributes to the growth rate of those products. In due course, some of those customers will stop using the product or end their subscription; this could be because they switched to a competitor, no longer need the bank’s services, are unhappy with their experience, or can no longer afford the cost. The customers who stop using the bank’s products make up the “churn” for a given period, which can be measured monthly, quarterly, or annually.

It is well known that signing up a new client is much more expensive than keeping an existing one, and that long-term customers generate more profit. Customer retention therefore increases profitability. Many competitive companies have noticed that retaining existing customers is key to survival within their industry, which is why churn management matters to organizations such as banks.

Since customers are the most valuable assets of most banking institutions, it is advantageous for banks to know what leads a client to leave. Churn prevention allows companies to develop loyalty programs and retention campaigns to keep as many customers as possible.

I began this analysis with three goals. First, to discover key insights from the bank’s customer database and study customer demographics (gender, age, and location). Second, to understand each customer’s relationship with the bank’s products and their financial history (credit score, estimated salary, balance, tenure, credit card possession, etc.). Lastly, to understand how these demographic and financial variables affect the customer churn rate.

In this article, I will be performing analysis and developing a prediction model for bank customer churn.

METHODOLOGY

I used an approach based on CRISP-DM to build the bank customer churn prediction model, organized into five phases:

1. Data collection

2. Data understanding

3. Data preprocessing

4. Modelling and Evaluation

5. Deployment

Data collection

The data used in this article for the analysis and predictive modelling of bank customer churn was sourced from Kaggle.

Data Understanding

Data understanding focuses on the identification and analysis of the data that can help us accomplish our project goals. Understanding the data involves various operations such as data loading, data description, data quality, data visualization, etc.

The first step is to import all the necessary libraries needed for analysis and modelling.

# load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.metrics import roc_auc_score
# note: plot_roc_curve was removed in scikit-learn 1.2; newer versions use RocCurveDisplay
from sklearn.metrics import plot_roc_curve
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score, classification_report
import os, sys
import warnings
warnings.filterwarnings('ignore')

Since the data is in CSV format, I used the pd.read_csv() function to read it.

# load dataset
data = pd.read_csv('/content/drive/MyDrive/churn.csv')

To view the first 5 rows of the data, we call the .head() function on the DataFrame. Here is an overview of what the dataset looks like.

Overview of the data

To get the statistical overview of the data, I used .describe()

Statistical description of the data
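For readers who want to try these two calls without the original file, here is a minimal sketch on a tiny stand-in DataFrame. The column names follow the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the churn data; the values are invented for illustration.
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699, 850, 645],
    "Age": [42, 41, 42, 39, 43, 44],
    "Balance": [0.0, 83807.86, 159660.80, 0.0, 125510.82, 113755.78],
    "Exited": [1, 0, 1, 0, 0, 1],
})

first_rows = toy.head()    # first 5 rows by default
summary = toy.describe()   # count, mean, std, min, quartiles and max per numeric column

print(first_rows)
print(summary.loc[["mean", "min", "max"]])
```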

The .info() method prints a concise summary of a DataFrame, including column names, non-null counts, and data types. The output below shows basic information about the data.

#information about the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   RowNumber        10000 non-null  int64
 1   CustomerId       10000 non-null  int64
 2   Surname          10000 non-null  object
 3   CreditScore      10000 non-null  int64
 4   Geography        10000 non-null  object
 5   Gender           10000 non-null  object
 6   Age              10000 non-null  int64
 7   Tenure           10000 non-null  int64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64
 10  HasCrCard        10000 non-null  int64
 11  IsActiveMember   10000 non-null  int64
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB

From the above, there are 10000 observations and 14 variables in the data set, with no missing values. Since nothing is missing, let’s perform some basic visualization to understand how the data is distributed.

From the visualization above, the number of customers that exited the bank is lower compared to the number of customers that didn’t leave the bank. Let’s visualize the relationship between the target variable (exited) and the categorical and numerical variables.
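The class balance described above can also be checked numerically. The sketch below counts each class of a target column with value_counts(); the labels are invented, not taken from the real dataset.

```python
import pandas as pd

# Invented labels: 1 = exited, 0 = stayed.
exited = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0], name="Exited")

counts = exited.value_counts()                # absolute count per class
shares = exited.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

On the real data, the same two calls on the Exited column would show how much smaller the exited class is.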

Gender distribution

From the visualization above, female customers left the bank more often than male customers. Now let’s analyse the distribution of ‘Geography’ against the target variable (Exited).

Geography distribution

From the visualization above, the average loss of customers is highest in Germany, followed by France, and lowest in Spain. Now let’s analyse the distribution of ‘NumOfProducts’ against the target variable (Exited).

Distribution of NumOfProducts

From the visualization above, customers who hold more than 2 products have a high churn rate, but let’s not forget that our data is imbalanced. All of the customers (60 people) who held 4 products left the bank. There might be something unexplained in the data here. One possible explanation is that the bank used to offer more products than it does now, and long-tenured customers benefited from products or services that are no longer available.
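The per-category comparisons read off the charts above can also be computed directly. Because Exited is coded 0/1, its mean within a group is that group's churn rate. The helper name churn_by and the toy rows below are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Invented rows using the dataset's column names.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male"],
    "NumOfProducts": [1, 1, 2, 2, 3, 1, 4, 2],
    "Exited": [1, 0, 0, 0, 1, 0, 1, 0],
})


def churn_by(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Customer count and share of exited customers for each value of `column`."""
    return df.groupby(column)["Exited"].agg(customers="count", churn_rate="mean")


by_gender = churn_by(toy, "Gender")
by_products = churn_by(toy, "NumOfProducts")

print(by_gender)
print(by_products)
```

Showing the customer count next to the rate helps flag groups, such as customers with 4 products, where a rate is based on very few rows.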

Going further, let’s analyse the relationship between ‘Age’ and the target variable (Exited).

Age Distribution chart

From the visualization above, exited customers are older, on average, than those still active. This makes some sense, as clients who have left must have been with the bank for some time, while younger clients have not yet had the reason or the opportunity to leave. The bank should look out for middle-aged clients who might be looking for alternatives. Finally, let’s analyse the distribution of ‘IsActiveMember’ against the target variable (Exited).

Distribution of active members

From the visualization above, customers who do not actively use the bank leave more often than active members. Low activity is a sign of weak attachment to the bank.

The analysis and visualization above show that the dataset is imbalanced: the number of data points differs between the classes. For example, the number of exited customers is lower than the number of customers that didn’t exit the bank.

Data preprocessing

This stage refers to data preparation or munging. It prepares the final data for modelling and involves data cleaning, feature engineering, feature scaling, data formatting, etc. Firstly, I dropped the “RowNumber”, “CustomerId”, and “Surname” columns because they are not needed in this analysis, i.e., they don’t have any effect on the problem to be solved. The code snippet below shows how the variables are dropped:

data.drop(["RowNumber", "CustomerId", "Surname"], axis=1, inplace=True)

To detect outliers in the dataset, I plotted boxplots using the seaborn library. The chart below shows the result.

Boxplot chart to detect outliers

From the above visualization, there are outliers in columns such as “CreditScore”, “Age”, and “NumOfProducts”. To remove them, I wrote a function that filters out the outliers using pandas.

def outlier_removal(data, column):
    # interquartile range of the column
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    # keep only rows strictly inside the 1.5 * IQR fences
    point_low = q1 - 1.5 * iqr
    point_high = q3 + 1.5 * iqr
    cleaned_data = data.loc[(data[column] > point_low) & (data[column] < point_high)]
    return cleaned_data
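The article does not show how the function was called, so the following is only a sketch of one way it might be applied: one column at a time, feeding the result of each call into the next. The toy ages and the loop over columns are assumptions for illustration.

```python
import pandas as pd


def outlier_removal(data, column):
    """Keep rows whose `column` value lies strictly inside the 1.5 * IQR fences."""
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    point_low = q1 - 1.5 * iqr
    point_high = q3 + 1.5 * iqr
    return data.loc[(data[column] > point_low) & (data[column] < point_high)]


# Invented ages with one extreme value.
toy = pd.DataFrame({"Age": [30, 32, 35, 37, 38, 40, 41, 43, 45, 92]})

# Hypothetical usage: filter column by column, reusing the already-filtered frame.
cleaned = toy
for col in ["Age"]:
    cleaned = outlier_removal(cleaned, col)

print(len(toy), "rows before,", len(cleaned), "rows after")
```

Note that each call drops whole rows, so filtering several columns in sequence can shrink the dataset noticeably.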

Feature engineering was used to convert categorical data into numerical data so that the dataset is ready for modelling, which also creates more features. Since the “Geography” column is categorical, let’s one-hot encode it with pandas (pd.get_dummies), which creates one new feature per country. We also create a function to convert the categorical values in “Gender” to numbers: male = 0 and female = 1.

# since geography is a categorical data lets one-hot encode it by using pd.get_dummies
data_cleaned = pd.get_dummies(data_cleaned, columns=['Geography'])

# since gender is a categorical data lets label encode it as female = 1 and male = 0
def func(data_cleaned):
    d = []
    for m in data_cleaned:
        if m == 'Female':
            d.append(1)
        else:
            d.append(0)
    return d

data_cleaned['Gender'] = func(data_cleaned['Gender'])
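To see what these two encodings produce, here is a small self-contained sketch on invented rows. It uses a vectorized comparison for Gender instead of the loop; that is an alternative I am suggesting here, not the code used in the project, and it gives the same 0/1 result.

```python
import pandas as pd

# Invented rows using the dataset's column names.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# One-hot encode Geography: one indicator column per country.
encoded = pd.get_dummies(toy, columns=["Geography"])

# Label-encode Gender: Female = 1, anything else = 0.
encoded["Gender"] = (encoded["Gender"] == "Female").astype(int)

print(encoded)
```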

Now let’s check the correlation matrix of our variables.

Correlation chart of the variables

From the above, Age has the strongest correlation with Exited (0.35). This is a moderate positive relationship: as the age of the customer increases, the likelihood of losing the customer increases. Exited and Balance have a weak positive correlation (0.12), and Exited and NumOfProducts have a weak negative correlation (-0.11).
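The ranking read off the correlation chart can be reproduced in a few lines. The sketch below computes Pearson correlations with DataFrame.corr() and sorts the features by the absolute strength of their correlation with Exited. The values are invented, so the numbers will not match the ones reported above.

```python
import pandas as pd

# Invented rows using the dataset's column names.
toy = pd.DataFrame({
    "Age": [25, 31, 38, 44, 52, 58, 36, 61],
    "Balance": [0.0, 52000.0, 0.0, 98000.0, 120500.0, 76000.0, 110000.0, 0.0],
    "NumOfProducts": [2, 1, 2, 1, 1, 3, 2, 1],
    "Exited": [0, 0, 0, 1, 1, 1, 0, 1],
})

corr = toy.corr()  # Pearson correlation between every pair of columns

# Correlation of each feature with the target, strongest (by absolute value) first.
with_target = corr["Exited"].drop("Exited").sort_values(key=abs, ascending=False)
print(with_target)
```

Sorting by absolute value matters because a strong negative correlation is as informative as a strong positive one.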

I performed feature scaling (standardization) on some features using scikit-learn’s StandardScaler, which rescales each feature to have mean = 0 and standard deviation = 1. Scaling improved the performance of my Logistic Regression, Support Vector Classifier, and KNN models. This is expected, since these algorithms rely on gradient-based optimization or on distances between points, both of which are sensitive to feature scale.
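A minimal sketch of standardization with StandardScaler follows. It fits the scaler on the training rows only and reuses those statistics for the test rows, which is a common practice to avoid leaking test information; the article does not say whether the project did it this way. The two columns and their values are invented stand-ins.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented feature matrix: the columns stand in for CreditScore and Balance.
x_train = np.array([
    [600.0, 0.0],
    [650.0, 50000.0],
    [700.0, 100000.0],
    [750.0, 150000.0],
])
x_test = np.array([[675.0, 75000.0]])

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learn mean and std from the training data
x_test_scaled = scaler.transform(x_test)        # reuse the training statistics

print(x_train_scaled.mean(axis=0), x_train_scaled.std(axis=0))
print(x_test_scaled)
```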

Data Modelling and Evaluation

This involves building and developing various models based on several different modelling techniques. In this stage, we determine the algorithm to use for predictive modelling and evaluate which models give the best performance.

Depending on the modelling approach, we need to split the data into training and test sets using scikit-learn’s train_test_split function. Since this is a classification problem (exited or not exited), I used classification models such as Logistic Regression, Support Vector Classifier, KNN, Random Forest, XGBoost, CatBoost, Gradient Boosting, and LightGBM to make predictions.
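Here is a minimal sketch of the split on synthetic data. The 30% test size and the stratify=y option are assumptions, since the article does not state the settings used; stratifying keeps the share of exited customers the same in both sets, which is helpful when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 4 features, 20% positive class.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 200 + [0] * 800)

# test_size and stratify are illustrative choices, not the article's settings.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(x_train.shape, x_test.shape)
print(y_train.mean(), y_test.mean())
```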

After hyper-parameter tuning and cross-validation, the best results came from the CatBoost classifier, followed by the Random Forest and XGB classifiers. The chart below shows how each classifier performed.
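The tuning code is not shown in the article, so the following is only a sketch of how hyper-parameter tuning with cross-validation is commonly done using GridSearchCV. The synthetic data, the small parameter grid, the 3 folds, and the roc_auc scoring are all assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the churn data (about 20% positives).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Hypothetical grid: every combination is evaluated with 3-fold cross-validation.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, search.best_estimator_ holds a model refit on all the data passed to fit() with the best parameter combination.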

Model performance

The code and accuracy scores for two of the best-performing models (LightGBM and Random Forest) are shown below. First, LightGBM:

lgbm_model = LGBMClassifier(silent=0, learning_rate=0.09, max_delta_step=2, n_estimators=100, boosting_type='gbdt', max_depth=10, eval_metric="logloss", gamma=3, base_score=0.5)
lgbm_model.fit(x_train, y_train)
y_pred = lgbm_model.predict(x_test)
print(classification_report(y_test, y_pred, digits=2))
print("Accuracy score of LightGBM: ", accuracy_score(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.88      0.96      0.92      2277
           1       0.76      0.47      0.58       578

    accuracy                           0.86      2855
   macro avg       0.82      0.72      0.75      2855
weighted avg       0.85      0.86      0.85      2855

Accuracy score of tuned LightGBM model: 0.8626970227670753

Now let’s view Random Forest:

rand = RandomForestClassifier(random_state=42)
rand.fit(x_train, y_train)
pred = rand.predict(x_test)
print(classification_report(y_test, pred, digits=2))
print("Accuracy_score of RandForest: ",accuracy_score(y_test, pred))
              precision    recall  f1-score   support

           0       0.87      0.97      0.92      2277
           1       0.77      0.45      0.57       578

    accuracy                           0.86      2855
   macro avg       0.82      0.71      0.74      2855
weighted avg       0.85      0.86      0.85      2855

Accuracy score of Random Forest model: 0.8619964973730297

With imbalanced datasets, accuracy might not be the best metric for evaluating model performance, since it doesn’t distinguish between the numbers of correctly classified examples of each class. Metrics such as recall, F1 score, and the area under the ROC (Receiver Operating Characteristic) curve are better suited to an imbalanced dataset.
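A small made-up example shows why accuracy can mislead here. Suppose 20% of customers churn and a "model" simply predicts that nobody leaves: it scores 80% accuracy while catching no churners at all. The labels below are invented for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Invented labels: 80 customers stayed (0), 20 exited (1).
y_true = [0] * 80 + [1] * 20

# A trivial "model" that always predicts the majority class.
y_majority = [0] * 100

acc = accuracy_score(y_true, y_majority)
rec = recall_score(y_true, y_majority)
# zero_division=0 avoids a warning: precision is undefined with no positive predictions.
f1 = f1_score(y_true, y_majority, zero_division=0)

print(f"accuracy={acc:.2f} recall={rec:.2f} f1={f1:.2f}")
```

This is also why the recall of class 1 in the classification reports above (0.47 and 0.45) deserves as much attention as the 0.86 accuracy.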

Therefore, I evaluated each classifier using the ROC curve. KNN, SVM, and Logistic Regression performed poorly compared to CatBoost, XGB, Gradient Boost, LightGBM, and Random Forest. I obtained the best result from the Random Forest, with an accuracy of 86.2% and an area under the ROC curve (AUC) of 0.95; a larger AUC is usually better. As for the other models, even though their accuracy hovers around 80%, their AUC is not as good as the Random Forest’s, so they shouldn’t be used in a real-world scenario for this project. Below is the ROC plot of the Random Forest classifier.

Random Forest roc_auc plot
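For reference, here is a sketch of how the area under the ROC curve can be computed with roc_auc_score. The key detail is that it takes the predicted probability of the positive class rather than the hard 0/1 predictions. The data is synthetic and the split settings are assumptions, so the resulting score has nothing to do with the 0.95 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data.
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=0
)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(random_state=42).fit(x_train, y_train)

# Column 1 of predict_proba is the probability of the positive class (exited).
churn_proba = model.predict_proba(x_test)[:, 1]
auc = roc_auc_score(y_test, churn_proba)

print(round(auc, 3))
```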

Evaluation determines whether to proceed to deployment, iterate further, or initiate new projects. Since our random forest model performed well, we move to the model deployment stage.

Model Deployment

A model is not particularly useful unless clients can access its results. Deployment is the stage at which the model goes into production, i.e., the stage in which clients can access the model’s results. It involves a series of operations ranging from deployment planning to maintenance planning and final report production. I built the application with Streamlit, a popular open-source framework and a simple way of building web applications around machine learning models, and deployed the Streamlit app to Heroku. To see my model in the production stage, check churn customer predictor.
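To give an idea of what such an app can look like, here is a rough sketch, not the code of the deployed app. The feature order, the input ranges, the model.pkl file name, and the helper build_feature_row are all hypothetical; the helper repeats the encoding described earlier (Female = 1, one-hot Geography) for a single customer.

```python
import pandas as pd

# Hypothetical feature order; it must match the columns the model was trained on.
FEATURES = [
    "CreditScore", "Gender", "Age", "Tenure", "Balance", "NumOfProducts",
    "HasCrCard", "IsActiveMember", "EstimatedSalary",
    "Geography_France", "Geography_Germany", "Geography_Spain",
]


def build_feature_row(raw: dict) -> pd.DataFrame:
    """Encode one customer's raw inputs like the training data."""
    row = {name: raw[name] for name in FEATURES if name in raw}
    row["Gender"] = 1 if raw["Gender"] == "Female" else 0
    for country in ("France", "Germany", "Spain"):
        row[f"Geography_{country}"] = int(raw["Geography"] == country)
    return pd.DataFrame([row], columns=FEATURES)


def main():
    import pickle
    import streamlit as st  # imported lazily so the helper above works without Streamlit

    st.title("Bank customer churn predictor")
    raw = {
        "CreditScore": st.number_input("Credit score", min_value=300, max_value=900, value=650),
        "Gender": st.selectbox("Gender", ["Female", "Male"]),
        "Geography": st.selectbox("Country", ["France", "Germany", "Spain"]),
        "Age": st.number_input("Age", min_value=18, max_value=100, value=40),
        "Tenure": st.number_input("Tenure (years)", min_value=0, max_value=10, value=5),
        "Balance": st.number_input("Balance", min_value=0.0, value=0.0),
        "NumOfProducts": st.number_input("Number of products", min_value=1, max_value=4, value=1),
        "HasCrCard": int(st.checkbox("Has credit card")),
        "IsActiveMember": int(st.checkbox("Is active member")),
        "EstimatedSalary": st.number_input("Estimated salary", min_value=0.0, value=50000.0),
    }
    if st.button("Predict"):
        with open("model.pkl", "rb") as f:  # hypothetical path to a pickled, fitted model
            model = pickle.load(f)
        # Any scaling applied during training would have to be repeated here.
        proba = model.predict_proba(build_feature_row(raw))[0, 1]
        st.write(f"Estimated churn probability: {proba:.2f}")
```

In a real app file, main() would be called at the bottom of the script and the app started with `streamlit run app.py`. Keeping the encoding in a plain function makes it testable without launching the web interface.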

Conclusion

  • Our aim in this project was to develop a churn prediction model using machine learning algorithms.
  • There were 14 variables and 10000 observations in the data set and there were no missing values.

The following conclusions came from the analysis on the features:

  • Most customers who held 3 or 4 products stopped working with the bank. All customers with 4 products left.
  • Customers between the ages of 40 and 65 were more likely to quit the bank.
  • Those who had a credit score below 450 had high abandonment rates.
  • Predictions were made with a total of 8 classification models. The highest score was obtained with the LightGBM method.
  • Accuracy scores and ROC metric were calculated for each model and results were displayed.

For the complete exploratory, predictive analysis and how I deployed my model, click here to view all the codes on Github. Thanks for reading!!

Kindly connect with me on LinkedIn and Twitter.

Gratitude

This is my final project as a mentee of She Code Africa in the Data Science Track. Many thanks to my mentor Steven Kolawole for the guidance and encouragement in making this a success.

Resources

https://jfin-swufe.springeropen.com/articles/10.1186/s40854-016-0029-6

https://www.productplan.com/glossary/churn/#:~:text=Churn%20is%20the%20measure%20of,quarterly%2C%20or%20annual%20churn%20rate.

https://towardsdatascience.com/various-ways-to-evaluate-a-machine-learning-models-performance-230449055f15
