Prediction of Customer Churn in a Bank Using Machine Learning

May 21, 2021

Churn is the measure of how many customers stop using a product. This can be measured based on actual usage or failure to renew (when the product is sold using a subscription model). Often evaluated for a specific period of time, there can be a monthly, quarterly, or annual churn rate.

A bank is a financial institution licensed to receive deposits and make loans. Banks may also provide financial services such as wealth management and currency exchange. There are several kinds of banks, including retail banks, commercial or corporate banks, and investment banks.

When new customers start buying or using a bank’s product, each new user contributes to that product’s growth rate. In due course, some of those customers will stop using the product or end their subscription. This could be because they switched to a competitor, no longer need the bank’s services, are unhappy with their user experience, or can no longer afford the cost. The customers who stop using the bank’s products are the “churn” for a given period, which can be reported as a monthly, quarterly, or annual churn rate.
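As a small illustration of the definition above, the churn rate for a period can be computed as the number of customers lost divided by the number of customers at the start of that period. The function name and the figures below are hypothetical, chosen only to make the arithmetic concrete.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of the starting customer base that left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Hypothetical figures: 2,000 customers at the start of the quarter, 150 left.
quarterly_rate = churn_rate(2000, 150)
print(f"Quarterly churn rate: {quarterly_rate:.1%}")
```

The same function applies to monthly or annual figures; only the period covered by the two counts changes.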

It is widely held that acquiring a new client costs much more than keeping an existing one, and that long-term customers generate more profit. Customer retention therefore increases profitability. Many companies in competitive industries have found that retaining existing customers is key to survival, which is why churn management matters to organizations such as banks.

Since customers are the most valuable assets of most banking institutions, it is advantageous for banks to know what leads a client to leave. Churn prevention allows companies to develop loyalty programs and retention campaigns to keep as many customers as possible.

I began this analysis with three goals. The first was to discover key insights from the bank’s customer database and study customer demographics (gender, age, and location). The second was to understand the bank’s products and each customer’s financial history (credit score, estimated salary, balance, tenure, credit card possession, etc.). The last was to see how these demographic and financial variables affect the customer churn rate.

In this article, I will be performing analysis and developing a prediction model for bank customer churn.

METHODOLOGY

I followed the CRISP-DM methodology to build a bank customer churn prediction model. The standard CRISP-DM process has six phases; I condensed it into the following five:

1. Data collection

2. Data understanding

3. Data preprocessing

4. Modelling and Evaluation

5. Deployment

Data collection

The data used in this article for the analysis and predictive modelling of bank customer churn was sourced from Kaggle.

Data Understanding

Data understanding focuses on the identification and analysis of the data that can help us accomplish our project goals. Understanding the data involves various operations such as data loading, data description, data quality, data visualization, etc.

The first step is to import all the necessary libraries needed for analysis and modelling.

# load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.metrics import roc_auc_score
# RocCurveDisplay replaces plot_roc_curve, which was removed in scikit-learn 1.2
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score, classification_report
import os, sys
import warnings
warnings.filterwarnings('ignore')

Since the data is in CSV format, I used pd.read_csv() to read it.

# load dataset
data = pd.read_csv('/content/drive/MyDrive/churn.csv')

To view the first 5 rows of the data, we use the .head() function. Calling .head() on the DataFrame gives an overview of what the dataset looks like.

To get a statistical overview of the data, I used .describe().

The .info() function prints a concise summary of a DataFrame: the column names, non-null counts, and data types. The output below shows this basic information about the data.

# information about the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB

From the above, there are 10000 observations and 14 variables in the data set and there were no missing values. Since there are no missing values let’s perform basic visualization to understand how the data is distributed.
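Besides reading the non-null counts from .info(), missing values can be counted per column directly. The sketch below uses a small made-up DataFrame (not the Kaggle data) with one deliberately missing entry, so that the check has something to report.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the churn data; the values are made up for illustration.
toy = pd.DataFrame({
    "CreditScore": [619, 608, np.nan, 699],
    "Age": [42, 41, 42, 39],
    "Exited": [1, 0, 1, 0],
})

# Count missing entries in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
print("Any missing values?", bool(missing_per_column.any()))
```

On the real dataset, every count would be zero, which matches the .info() output above.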

From the visualization above, the number of customers that exited the bank is lower compared to the number of customers that didn’t leave the bank. Let’s visualize the relationship between the target variable (exited) and the categorical and numerical variables.
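The class balance described above can also be checked numerically with value_counts. This is a minimal sketch on a made-up Exited column; the 80/20 split here is an assumption for illustration, not the proportion in the real data.

```python
import pandas as pd

# Made-up target column: 1 = exited, 0 = stayed.
exited = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="Exited")

counts = exited.value_counts()                # absolute counts per class
shares = exited.value_counts(normalize=True)  # proportion per class

print(counts)
print(shares)
```

A large gap between the two shares is the first hint that accuracy alone will be a weak metric later on.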

From the visualization above, Female customers left the bank more often compared to the Male customers. Now let’s analyse the distribution of ‘Geography’ and the target variable (Exited).

From the visualization above, the churn rate is highest in Germany, followed by France, and lowest in Spain. Now let’s analyse the distribution of ‘NumOfProducts’ and the target variable (Exited).
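The comparisons by gender and geography can be expressed as a group-wise mean of the target, since the mean of a 0/1 column is the share of customers who exited. The data below is made up, so the resulting rates do not reflect the real dataset.

```python
import pandas as pd

# Made-up rows using the same column names as the churn dataset.
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain", "Germany", "France", "Spain"],
    "Gender": ["Female", "Female", "Male", "Male", "Male", "Female"],
    "Exited": [0, 1, 0, 1, 0, 1],
})

# Mean of a binary target per group = churn rate of that group.
churn_by_geo = toy.groupby("Geography")["Exited"].mean()
churn_by_gender = toy.groupby("Gender")["Exited"].mean()

print(churn_by_geo)
print(churn_by_gender)
```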

From the visualization above, customers who bought more than 2 products have a high churn rate, but let’s not forget that our data is imbalanced. All of the customers (60 people) who bought 4 products left the bank. I believe there might be something unexplained in the data here. Perhaps the bank used to offer more products than it does now, and long-tenured customers benefited from products or services that are no longer available.
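A cross-tabulation is one way to check an observation like this, because it shows both the raw counts and the share of exits for each number of products. The rows below are invented for illustration only.

```python
import pandas as pd

# Made-up data: number of products held and whether the customer exited.
toy = pd.DataFrame({
    "NumOfProducts": [1, 1, 2, 2, 3, 4, 4, 1],
    "Exited":        [0, 1, 0, 0, 1, 1, 1, 0],
})

# Raw counts of stayed (0) and exited (1) per product count.
table = pd.crosstab(toy["NumOfProducts"], toy["Exited"])

# Row-normalised version: share of customers who stayed or exited per group.
churn_share = pd.crosstab(toy["NumOfProducts"], toy["Exited"], normalize="index")

print(table)
print(churn_share)
```

Looking at the counts next to the shares matters here: a 100% churn rate based on 60 customers is weaker evidence than the same rate over thousands.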

Going further, let’s analyse the relationship between ‘Age’ and the target variable (Exited).

From the visualization above, exited customers are older, on average, than those still active. This makes some sense: clients who have left must have been with the bank for some time, while younger clients have not yet had a reason or an opportunity to leave. The bank should look out for middle-aged clients who might be looking for alternatives. Finally, let’s analyse the distribution of ‘IsActiveMember’ against the target variable (Exited).

From the visualization above, customers who do not actively use the bank leave more often. Inactivity is a sign of weak attachment to the bank.

The analysis and visualization above show that the dataset is imbalanced: the number of data points differs between the classes. Here, the number of exited customers is lower than the number of customers who stayed.

Data preprocessing

This stage refers to data preparation or munging. It prepares the final data for modelling and involves data cleaning, feature engineering, feature scaling, data formatting, etc. Firstly, I dropped the “RowNumber”, “CustomerId”, and “Surname” columns because they are not needed in this analysis, i.e. they have no effect on the problem to be solved. The code snippet below shows how the variables are dropped:

data.drop(["RowNumber", "CustomerId", "Surname"], axis=1, inplace=True)

To detect outliers in the dataset, I drew boxplots with the seaborn library. The image below shows the outliers.

From the above visualization, there are outliers in columns such as “CreditScore”, “Age”, and “NumOfProducts”. To clean them, I wrote a function that uses pandas to drop rows falling outside 1.5 times the interquartile range (IQR) of a column.

def outlier_removal(data, column):
    # keep only rows within 1.5 * IQR of the first and third quartiles
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    point_low = q1 - 1.5 * iqr
    point_high = q3 + 1.5 * iqr
    cleaned_data = data.loc[(data[column] > point_low) & (data[column] < point_high)]
    return cleaned_data
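The article does not show how the function was called, so the following is only a sketch of one plausible usage on a made-up Age column. The function body is repeated so the example runs on its own; the toy values are assumptions.

```python
import pandas as pd


def outlier_removal(data, column):
    """Keep only rows whose value lies within 1.5 * IQR of the quartiles."""
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    point_low = q1 - 1.5 * iqr
    point_high = q3 + 1.5 * iqr
    return data.loc[(data[column] > point_low) & (data[column] < point_high)]


# Made-up ages with one extreme value (92).
toy = pd.DataFrame({"Age": [30, 32, 35, 36, 38, 40, 41, 92]})

cleaned = outlier_removal(toy, "Age")
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Because the function returns a filtered copy, it could be applied to several columns in turn by feeding each result into the next call. Note that each pass recomputes the quartiles on the already-filtered data.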

Feature engineering was used to convert categorical data into numerical data ready for modelling, which also creates more features in the dataset. Since the “Geography” column is categorical, let’s one-hot encode it with pandas (pd.get_dummies) to create one feature per country. We also create a function to convert the categorical data in “Gender” to numerical data: male = 0 and female = 1.

# since Geography is categorical, one-hot encode it using pd.get_dummies
data_cleaned = pd.get_dummies(data_cleaned, columns=['Geography'])

# since Gender is categorical, label encode it as female = 1 and male = 0
def func(data_cleaned):
    d = []
    for m in data_cleaned:
        if m == 'Female':
            d.append(1)
        else:
            d.append(0)
    return d

data_cleaned['Gender'] = func(data_cleaned['Gender'])
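For comparison, the same two encodings can be written more compactly with pandas alone. This is an alternative sketch on made-up rows, not the code used in the project; it replaces the loop with Series.map.

```python
import pandas as pd

# Made-up rows with the two categorical columns from the dataset.
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# One-hot encode Geography: one new column per country.
encoded = pd.get_dummies(toy, columns=["Geography"])

# Label encode Gender with an explicit mapping.
encoded["Gender"] = encoded["Gender"].map({"Female": 1, "Male": 0})

print(encoded)
```

One behavioural difference is worth knowing: the loop version encodes anything that is not 'Female' as 0, while map produces NaN for unexpected values, which makes dirty categories easier to spot.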

Now let’s check the correlation matrix of our variables.

From the above, Age has the strongest correlation with Exited (0.35), a moderate positive relationship: as the age of the customer increases, the likelihood of losing the customer tends to increase. Exited and Balance have a weak positive correlation (0.12), and Exited and NumOfProducts have a weak negative correlation (-0.11).
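The correlations with the target can be read from a heatmap, or extracted as a sorted column of the correlation matrix. The sketch below uses invented values, so its coefficients will not match the 0.35, 0.12, and -0.11 reported for the real data.

```python
import pandas as pd

# Made-up numeric features and target.
toy = pd.DataFrame({
    "Age":     [25, 30, 35, 40, 45, 50, 55, 60],
    "Balance": [0, 50000, 0, 120000, 80000, 0, 100000, 90000],
    "Exited":  [0, 0, 0, 1, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()["Exited"]
    .drop("Exited")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so a weak coefficient does not rule out a non-linear effect such as the one seen for NumOfProducts.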

I performed feature scaling (standardization) on some features using the sklearn library (StandardScaler), which rescales each feature to have mean = 0 and standard deviation = 1. Scaling improved the performance of my Logistic Regression, Support Vector Classifier, and KNN models. This is expected, since these algorithms rely on gradient-based optimization or on distances between points and are therefore sensitive to feature scale.
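A minimal sketch of standardization with StandardScaler follows. The two columns stand in for features such as credit score and age, and the values are assumptions. The scaler is fitted on the training rows only and then reused on the test rows, a common practice for avoiding leakage from the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices: columns could be CreditScore and Age.
x_train = np.array([[600.0, 30.0], [700.0, 40.0], [800.0, 50.0]])
x_test = np.array([[650.0, 35.0]])

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learn mean/std from training data
x_test_scaled = scaler.transform(x_test)        # reuse the training statistics

print(x_train_scaled.mean(axis=0))  # approximately 0 for each column
print(x_train_scaled.std(axis=0))   # approximately 1 for each column
```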

Data Modelling and Evaluation

This involves building and developing various models based on several different modelling techniques. In this stage, we determine the algorithm to use for predictive modelling and evaluate which models give the best performance.

Depending on the model selection and approach, we need to split the data into training and test sets using the sklearn train_test_split function. Since this is a classification problem (exited or not exited), I used classification models such as Logistic Regression, Support Vector Classifier, KNN, Random Forest, XGBoost, CatBoost, Gradient Boost, and LightGBM to make predictions.
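The split itself might look like the sketch below. The test_size of 0.3, the random_state, and the use of stratify are assumptions, since the article does not state the settings used. Stratifying keeps the share of exited customers similar in both sets, which helps with an imbalanced target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 customers, 3 features, 20% exited.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 20 + [0] * 80)

x_train, x_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,     # assumed hold-out fraction
    stratify=y,        # preserve the class ratio in both splits
    random_state=42,   # make the split reproducible
)

print("Train churn share:", y_train.mean())
print("Test churn share:", y_test.mean())
```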

After hyper-parameter tuning and cross-validation, the highest score was obtained with the CatBoost Classifier, followed by the Random Forest Classifier and the XGB Classifier. Check below to see how each classifier performed.
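Hyper-parameter tuning and cross-validation can be combined with GridSearchCV and cross_val_score. The sketch below runs on synthetic data; the parameter grid, the 3-fold setting, and the ROC AUC scoring are illustrative assumptions, not the grid used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic, imbalanced binary problem standing in for the churn data.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Hypothetical grid: every combination is evaluated with 3-fold CV.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X, y)

# Cross-validated score of the best configuration found.
cv_scores = cross_val_score(search.best_estimator_, X, y, cv=3, scoring="roc_auc")

print("Best parameters:", search.best_params_)
print("Mean CV ROC AUC:", cv_scores.mean())
```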

Below are the code and accuracy scores for two of the best-performing models, LightGBM and Random Forest. Firstly, let’s view LightGBM:

# note: eval_metric, gamma, and base_score look like XGBoost-style arguments
# and may be ignored by LightGBM
lgbm_model = LGBMClassifier(silent=0, learning_rate=0.09, max_delta_step=2, n_estimators=100, boosting_type='gbdt', max_depth=10, eval_metric="logloss", gamma=3, base_score=0.5)
lgbm_model.fit(x_train, y_train)
y_pred = lgbm_model.predict(x_test)
print(classification_report(y_test, y_pred, digits=2))
print("Accuracy score of tuned LightGBM model: ", accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.96      0.92      2277
           1       0.76      0.47      0.58       578

    accuracy                           0.86      2855
   macro avg       0.82      0.72      0.75      2855
weighted avg       0.85      0.86      0.85      2855

Accuracy score of tuned LightGBM model:  0.8626970227670753

Now let’s view Random Forest:

rand = RandomForestClassifier(random_state=42)
rand.fit(x_train, y_train)
pred = rand.predict(x_test)
print(classification_report(y_test, pred, digits=2))
print("Accuracy score of Random Forest model: ", accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       0.87      0.97      0.92      2277
           1       0.77      0.45      0.57       578

    accuracy                           0.86      2855
   macro avg       0.82      0.71      0.74      2855
weighted avg       0.85      0.86      0.85      2855

Accuracy score of Random Forest model:  0.8619964973730297

In the context of imbalanced datasets, accuracy might not be the best metric to evaluate model performance, since it doesn’t distinguish between the numbers of correctly classified examples in each class. Better metrics for an imbalanced dataset include recall, F1 score, and the area under the ROC (Receiver Operating Characteristic) curve (AUC).
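A tiny made-up example shows why accuracy can mislead on imbalanced data. A model that predicts "stayed" for every customer scores 80% accuracy when 80% of customers stayed, yet it finds none of the customers who left.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Made-up labels: 8 customers stayed (0), 2 exited (1).
y_true = [0] * 8 + [1] * 2
# A trivial "model" that always predicts the majority class.
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive predictions exist.
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}, recall: {recall:.2f}, F1: {f1:.2f}")
```

The same pattern is visible in the reports above: both models reach about 86% accuracy while recall for the exited class stays below 0.5.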

Therefore, I used the ROC metric to assess the output quality of each classifier. KNN, SVM, and Logistic Regression achieved lower scores than CatBoost, XGB, Gradient Boost, LightGBM, and Random Forest. I obtained the best score from the Random Forest, with an accuracy of 86.2% and an area under the ROC curve (AUC) of 0.95 (a larger AUC is usually better). As for the other models, even though their accuracy hovers around 80%, their AUC is not as good as the Random Forest’s, so they shouldn’t be used for a real-world scenario in this project. Below is the ROC plot of the Random Forest Classifier.
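The AUC is computed from predicted probabilities rather than hard class labels. The sketch below shows one way to obtain it on held-out data with roc_auc_score and roc_curve. It uses synthetic data and default Random Forest settings, so the resulting value is unrelated to the 0.95 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(x_train, y_train)

# Probability of the positive class (exited) for each test customer.
churn_proba = model.predict_proba(x_test)[:, 1]

auc = roc_auc_score(y_test, churn_proba)
fpr, tpr, thresholds = roc_curve(y_test, churn_proba)  # points of the ROC curve

print(f"Test ROC AUC: {auc:.3f}")
```

Evaluating on the test split matters: an AUC computed on the training rows is usually optimistic, especially for tree ensembles.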

Evaluation determines whether to proceed to deployment, iterate further, or initiate new projects. Since our random forest model performed well, we move to the model deployment stage.

Model Deployment

A model is not particularly useful unless clients can access its results. Deployment is the stage at which the model goes into production, i.e. the stage in which clients can access the model’s results. It involves a series of operations ranging from a deployment plan to maintenance planning and final report production. I deployed my model using Streamlit, a popular open-source framework that offers a simple way to build web applications around machine learning models, and hosted the Streamlit app on Heroku. To see my model in production, check churn customer predictor.
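A web app usually loads a trained model from disk rather than retraining it on every request. The article does not show this step, so the sketch below is an assumption about one common approach: saving and reloading the model with joblib. The file name and the synthetic data are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared churn features and target.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; an app would load this file at start-up.
    model_path = os.path.join(tmp_dir, "churn_model.joblib")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Restored model matches original:", same_predictions)
```

Any scaler or encoder fitted during preprocessing would need to be saved alongside the model, so that the app transforms user input the same way the training data was transformed.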

Conclusion

Our aim in this project was to develop a churn prediction model using machine learning algorithms.
There were 14 variables and 10000 observations in the data set and there were no missing values.

The following conclusions came from the analysis on the features:

Most customers who held 3 or 4 products stopped working with the bank. All customers who held 4 products left.
Customers between the ages of 40 and 65 were more likely to quit the bank.
Those who had a credit score below 450 had high abandonment rates.
Predictions were made with a total of 8 classification models. The highest accuracy score was obtained with the LightGBM method.
Accuracy scores and ROC metric were calculated for each model and results were displayed.

For the complete exploratory and predictive analysis and how I deployed my model, click here to view all the code on GitHub. Thanks for reading!

Kindly connect with me on LinkedIn and Twitter.

Gratitude

This is my final project as a mentee of She Code Africa in the Data Science Track. Many thanks to my mentor Steven Kolawole for the guidance and encouragement in making this a success.

Resources

https://jfin-swufe.springeropen.com/articles/10.1186/s40854-016-0029-6

https://www.productplan.com/glossary/churn/#:~:text=Churn%20is%20the%20measure%20of,quarterly%2C%20or%20annual%20churn%20rate.

https://towardsdatascience.com/various-ways-to-evaluate-a-machine-learning-models-performance-230449055f15