Marketing Analytics - Predicting Customer Churn in Python (DataCamp course)
Churn is when a customer stops doing business or ends a relationship with a company. It’s a common problem across a variety of industries, from telecommunications to cable TV to SaaS, and a company that can predict churn can take proactive action to retain valuable customers and get ahead of the competition.
This course will provide you with a roadmap to create your own customer churn models. You'll learn how to explore and visualize your data, prepare it for modeling, make predictions using machine learning, and communicate important, actionable insights to stakeholders. By the end of the course, you'll be comfortable using the pandas library for data analysis and the scikit-learn library for machine learning.
Exploratory Data Analysis
Begin exploring the Telco Churn Dataset using pandas to compute summary statistics and Seaborn to create attractive visualizations.
- 1.1 Defining and Exploring customer churn
- 1.2 Grouping and Summarizing data
- 1.3 Exploring Data with Visualization
- Feature distributions
- Customer service calls and churn
import pandas as pd
import numpy as np
import warnings
pd.set_option('display.expand_frame_repr', False)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)
telco = pd.read_csv("Churn.csv")
display(telco.head(10))
# Show churners and non-churners
print(telco['Churn'].value_counts())
# Group 'CustServ_Calls' and 'Vmail_Message' by 'Churn' and compute the mean
print(telco[['CustServ_Calls', 'Vmail_Message', 'Churn']].groupby(['Churn']).mean())
# Adapt your code to compute the standard deviation
print(telco[['CustServ_Calls', 'Vmail_Message', 'Churn']].groupby(['Churn']).std())
# Count the number of churners and non-churners by State
print(telco.groupby('State')['Churn'].value_counts())
display(telco.describe())
import matplotlib.pyplot as plt
import seaborn as sns
# Visualize the distribution of 'Account_Length'
# Old Seaborn API (deprecated): sns.distplot(telco['Account_Length'])
""" Seaborn New Version:
stat="density" - show density instead of count in the histogram
alpha=0.4 - for the same transparency
kde=True - add a kernel density plot
kde_kws={"cut": 3} - extend the kernel density plot beyond the histogram limit
"""
sns.histplot(telco['Account_Length'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})
plt.show()  # The distribution looks approximately normal
# differences in account length box plot
sns.boxplot(x='Churn',
            y='Account_Length',
            data=telco,
            sym='')  # sym='' removes outliers
plt.show()
# Adding 3rd variable
sns.boxplot(x='Churn',
            y='Account_Length',
            data=telco,
            hue='Intl_Plan')
plt.show()
Exploring feature distributions
Previously, we saw that the 'Account_Length' feature was normally distributed. Let's now visualize the distributions of the following features using seaborn's distribution plot:
- 'Day_Mins'
- 'Eve_Mins'
- 'Night_Mins'
- 'Intl_Mins'
To create a feature's distribution plot, pass it in as an argument to sns.distplot().
# g = sns.PairGrid(telco[['Day_Mins','Eve_Mins','Night_Mins','Intl_Mins']])
fig = plt.figure(figsize=(8, 6), dpi=100)
fig.subplots_adjust(hspace=0.4, wspace=0.4)
# Visualize the distribution of 'Day_Mins'
ax = fig.add_subplot(2, 2, 1)
sns.distplot(telco['Day_Mins'])
# visualize the distribution of 'Eve_Mins'
ax = fig.add_subplot(2, 2, 2)
sns.distplot(telco['Eve_Mins'])
# visualize the distribution of 'Night_Mins'
ax = fig.add_subplot(2, 2, 3)
sns.distplot(telco['Night_Mins'])
# visualize the distribution of 'Intl_Mins'
ax = fig.add_subplot(2, 2, 4)
sns.distplot(telco['Intl_Mins'])
plt.show()
""" Display with the new version of Seaborn, using histplot()
All of these features appear to be well approximated by the normal distribution.
"""
fig = plt.figure(figsize=(8, 6), dpi=100)
fig.subplots_adjust(hspace=0.4, wspace=0.4)
# Visualize the distribution of 'Day_Mins'
ax = fig.add_subplot(2, 2, 1)
sns.histplot(telco['Day_Mins'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})
# visualize the distribution of 'Eve_Mins'
ax = fig.add_subplot(2, 2, 2)
sns.histplot(telco['Eve_Mins'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})
# visualize the distribution of 'Night_Mins'
ax = fig.add_subplot(2, 2, 3)
sns.histplot(telco['Night_Mins'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})
# visualize the distribution of 'Intl_Mins'
ax = fig.add_subplot(2, 2, 4)
sns.histplot(telco['Intl_Mins'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})
plt.show()
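The four near-identical subplot blocks above can be collapsed into a loop over the column names. A minimal equivalent sketch using the newer histplot API:
# Same four panels, written as a loop
fig, axes = plt.subplots(2, 2, figsize=(8, 6), dpi=100)
fig.subplots_adjust(hspace=0.4, wspace=0.4)
for ax, col in zip(axes.flat, ['Day_Mins', 'Eve_Mins', 'Night_Mins', 'Intl_Mins']):
    sns.histplot(telco[col], stat="density", alpha=0.4, kde=True,
                 kde_kws={"cut": 3}, ax=ax)
plt.show()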
Customer service calls and churn
You've already seen that there's not much of a difference in account lengths between churners and non-churners, but that there is a difference in the number of customer service calls made by churners.
Let's now visualize this difference using a box plot and incorporate other features of interest - do customers who have international plans make more customer service calls? Or do they tend to churn more? How about voicemail plans? Let's find out!
Recall the syntax for creating a box plot using seaborn:
sns.boxplot(x="X-axis variable",
            y="Y-axis variable",
            data=DataFrame)
If you want to remove outliers, you can specify the additional parameter sym="", and you can add a third variable using hue.
import matplotlib.pyplot as plt
import seaborn as sns
# Add "Intl_Plan" as a third variable
sns.boxplot(x='Churn',
            y='CustServ_Calls',
            data=telco,
            sym="",
            hue="Intl_Plan")
# Display the plot
plt.show()
It looks like customers who churn make more customer service calls than those who do not. Among churners, those without an international plan make more customer service calls.
Preprocessing for Churn Modeling
Having explored your data, it's now time to preprocess it and get it ready for machine learning. Learn the why, what, and how of preprocessing, including feature selection and feature engineering.
- 2.1 Data Preprocessing
- Identifying features to convert
- Encoding binary features
- One hot encoding
- Feature Scaling
- 2.2 Feature Selection and Engineering
- Dropping unnecessary features
- Engineering a new column
print(telco.dtypes.sort_values(), len(telco.dtypes))
""" there are some object datatype: Churn, State, Intl_Plan, Vmail_Plan, Phone, while the rest are numeric. """
import time
######### Encoding binary features
print(telco['Intl_Plan'].head())
""" since this column only includes two string values ("yes" and "no"), we can encode it as a binary feature."""
df_1, df_2 = telco['Intl_Plan'].copy(), telco['Intl_Plan'].copy()
# Method 1: replace
t1 = time.time()
for _ in range(1000):
    df_1.copy().replace({'no': 0, 'yes': 1})
print("Method 1 took {}s".format(time.time() - t1))
# Method 2: LabelEncoder (much faster when iterating over a large number of features)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
t2 = time.time()
for _ in range(1000):
    # Equivalent but slower: LabelEncoder().fit_transform(df_2.copy()) re-instantiates each time
    le.fit_transform(df_2.copy())
print("Method 2 took {}s".format(time.time() - t2))
# Replace 'no' with 0 and 'yes' with 1 in 'Intl_Plan'
telco['Intl_Plan'] = le.fit_transform(telco['Intl_Plan'])
print(telco['Intl_Plan'].head())
# Replace 'no' with 0 and 'yes' with 1 in 'Vmail_Plan'
telco['Vmail_Plan'] = le.fit_transform(telco["Vmail_Plan"])
# Replace 'no' with 0 and 'yes' with 1 in 'Churn'
test_le = LabelEncoder()
telco['Churn'] = test_le.fit_transform(telco['Churn'])
# inverse_transform maps 0/1 back to the original labels; applying it to
# 'Vmail_Plan' works only because both columns share the same 'no'/'yes' classes
inverted = test_le.inverse_transform(telco['Vmail_Plan'])
inverted
"""
telco['State'].head(4)
0 KS
1 OH
2 NJ
3 OH
Name: State, dtype: objec
it would be a bad idea to encode them as int number for each stage --> recomend to use One Hot Encoding
pandas has a get_dummies() function which automatically applies one hot encoding over the selected feature.
"""
# Import pandas
import pandas as pd
# Perform one hot encoding on 'State'
telco_state = pd.get_dummies(telco['State'])
display(telco_state.head())
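To actually feed the encoded states to a model, you would typically concatenate the dummy columns back onto the DataFrame and drop the raw 'State' column. A minimal sketch, not used in the exercises below (the telco_encoded name and the drop_first=True choice, which avoids perfectly collinear dummies, are illustrative):
# Attach the dummy columns and drop the original 'State' column
telco_encoded = pd.concat([telco.drop('State', axis=1),
                           pd.get_dummies(telco['State'], prefix='State', drop_first=True)],
                          axis=1)
print(telco_encoded.shape)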
""" We can see that originally the features are not on the same scale.
For ex: Intl_Calls range from 0 to 20, while for example, Night_Mins is from 23 to 395
"""
"""Standardization
- Centers the distribution around the mean
- Calculates the number of std away from the mean each oint is"""
print(telco[['Intl_Calls','Night_Mins']].describe())
from sklearn.preprocessing import StandardScaler
telco_scaled = StandardScaler().fit_transform(telco[['Intl_Calls', 'Night_Mins']])
# Add column names back for readability
telco[['Intl_Calls','Night_Mins']] = pd.DataFrame(telco_scaled, columns=["Intl_Calls", "Night_Mins"])
print(telco[['Intl_Calls','Night_Mins']].describe())
""" Both features are now on the same scale. In practice, you'll need to carefully ensure this is the case for all features of interest. """
Feature Selection and Engineering
Dropping unnecessary features
- Unique identifiers such as phone numbers, Social Security numbers, and account numbers should be dropped.
Example:
telco.drop(['Soc_Sec', 'Tax_ID'], axis = 1)
Dropping correlated features
- Highly correlated features can be dropped
- They provide no additional information to the model
Some features such as 'Area_Code' and 'Phone' are not useful when it comes to predicting customer churn, and they need to be dropped prior to modeling.
The easiest way to do so in Python is using the .drop() method of pandas DataFrames, just as you saw in the video, where 'Soc_Sec' and 'Tax_ID' were dropped:
telco.drop(['Soc_Sec', 'Tax_ID'], axis=1)
Here, axis=1 indicates that you want to drop 'Soc_Sec' and 'Tax_ID' from the columns.
telco.corr()
plt.figure(figsize=(10, 10), dpi=128)
ax = sns.heatmap(telco.corr(), annot=True, fmt='.1f')
ax.xaxis.set_ticks_position('top')
plt.xticks(rotation=45, ha='left', rotation_mode='anchor')
plt.show()
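The heatmap is good for eyeballing correlations; to pull out the strongest pairs programmatically, one option is the sketch below (the 0.9 cutoff is an arbitrary choice):
# List feature pairs whose absolute correlation exceeds a threshold
corr = telco.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # upper triangle: each pair appears once
pairs = corr.where(mask).stack()
print(pairs[pairs > 0.9])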
telco = telco.drop(['Area_Code', 'Phone'], axis=1)
# Verify dropped features
print(telco.columns)
print(telco.dtypes.sort_values(), len(telco.dtypes))
# Create a cost-per-minute feature (charge divided by minutes)
telco['Day_Cost'] = telco['Day_Charge'] / telco['Day_Mins']
# Create the new feature: average minutes per night call
telco['Avg_Night_Calls'] = telco['Night_Mins'] / telco['Night_Calls']
print(telco.dtypes.sort_values(), len(telco.dtypes))
Churn Prediction
With your data preprocessed and ready for machine learning, it's time to predict churn! Learn how to build supervised learning machine models in Python using scikit-learn.
- 3.1 Making Predictions
- Predicting whether a new customer will churn
- Training another sklearn model
- 3.2 Evaluating Model Performance
- Creating training and test sets
- Checking each set's length
- Computing accuracy
- 3.3 Model Metrics
- Confusion Matrix
- Varying training set size
- Computing precision and recall
- 3.4 Other Metrics
- ROC curve
- AUC (Area Under the Curve)
- Precision and Recall Curve
- F1 Score
features = ['Account_Length', 'Vmail_Message', 'Day_Mins', 'Eve_Mins', 'Night_Mins', 'Intl_Mins', 'CustServ_Calls', 'Intl_Plan', 'Vmail_Plan',
'Day_Calls', 'Day_Charge', 'Eve_Calls', 'Eve_Charge','Night_Calls', 'Night_Charge', 'Intl_Calls', 'Intl_Charge']
from sklearn.model_selection import train_test_split
X, y = telco[features], telco['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
# this customer value was introduced in the example:
new_customer = X_test[(X_test.Account_Length == 91) & (X_test.Vmail_Message == 23)]
prediction = svc.predict(new_customer)
print(prediction)
print(svc.score(X_test, y_test))
Training another scikit-learn model
All sklearn models have .fit() and .predict() methods like the ones you used in the previous exercise for the SVC model. This consistent API allows you to easily try many different models to see which one gives the best performance. To get more comfortable with the sklearn API, in this exercise you'll fit a DecisionTreeClassifier instead.
from sklearn.tree import DecisionTreeClassifier
# Instantiate the classifier
clf = DecisionTreeClassifier()
# Fit the classifier
clf.fit(X_train,y_train)
# this customer value was introduced in the example:
new_customer = X_test[(X_test.Account_Length == 91) & (X_test.Vmail_Message == 23)]
# Predict the label of new_customer
print(clf.predict(new_customer))
print(clf.score(X_test, y_test))
from sklearn.model_selection import train_test_split
# we select the below features to reproduce the results in the example
features = ['Account_Length', 'Vmail_Message', 'Day_Mins', 'Eve_Mins', 'Night_Mins', 'Intl_Mins', 'CustServ_Calls', 'Intl_Plan', 'Vmail_Plan', 'Day_Calls', 'Day_Charge', 'Eve_Calls', 'Eve_Charge','Night_Calls', 'Night_Charge', 'Intl_Calls', 'Intl_Charge']
# Create feature variable
X = telco[features]  # Don't just use telco.drop('Churn', axis=1): other object columns remain
# Create target variable
y = telco['Churn']
# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# check each set length
print(len(X_train), len(X_test), len(y_train), len(y_test))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
print(X_train.dtypes, len(X_train.dtypes))
from sklearn.ensemble import RandomForestClassifier
# Instantiate the classifier
clf = RandomForestClassifier()
# Fit to the training data
clf.fit(X_train, y_train)
# Compute accuracy
print(clf.score(X_test, y_test))
Model Metrics
Confusion Matrix
Computing precision and recall
The sklearn.metrics submodule has many functions that allow you to easily calculate interesting metrics. So far, you've calculated precision and recall by hand - this is important while you develop your intuition for both these metrics.
In practice, once you do, you can leverage the precision_score and recall_score functions that automatically compute precision and recall, respectively. Both work similarly to other functions in sklearn.metrics - they accept 2 arguments: the first is the actual labels (y_test), and the second is the predicted labels (y_pred).
Let's now try a training size of 90%.
from sklearn.model_selection import train_test_split
# Create training and testing sets, try a training size of 90%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
# Instantiate the classifier
clf = RandomForestClassifier()
# Fit to the training data
clf.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = clf.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Import precision_score
from sklearn.metrics import precision_score, recall_score, f1_score
# Print the precision
print(precision_score(y_test, y_pred))
# Print the recall
print(recall_score(y_test, y_pred))
# Print the F1 score
print(f1_score(y_test, y_pred))
"""
[[True Negative (TN), False Positive(FP)],
[False Negative (FN), True Positive(TP)]]
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
"""
print("Presition: {}".format(43/(43+2)))
print("Recall: {}".format(43/(43+8)))
ROC curve
Let's now create an ROC curve for our random forest classifier. The first step is to calculate the predicted probabilities output by the classifier for each label using its .predict_proba() method. Then, you can use the roc_curve function from sklearn.metrics to compute the false positive rate and true positive rate, which you can then plot using matplotlib.
A RandomForestClassifier has already been fit to the data (the course used a 70% training set; above, we trained on 90%) and is available as clf.
y_pred_prob = clf.predict_proba(X_test)[:, 1]
# Import roc_curve
from sklearn.metrics import roc_curve
# Calculate the ROC metrics for the RandomForestClassifier fit above
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.figure(figsize=(10,10), dpi=80)
# Plot the ROC curve
plt.plot(fpr, tpr)
# Add labels and diagonal line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot([0, 1], [0, 1], "k--")
plt.show()
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_test, y_pred_prob)
print("AUC: {}".format(auc))
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
# Calculate the precision-recall metrics for the model
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
plt.figure(figsize=(10,10), dpi=80)
# Plot the precision-recall curve
plt.plot(recall,precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
# show the plot
plt.show()
F1 score
As you've discovered, there's a tradeoff between precision and recall. Both are important metrics, and depending on how the business is trying to model churn, you may want to focus on optimizing one over the other. Often, stakeholders are interested in a single metric that can quantify model performance. The AUC is one metric you can use in these cases, and another is the F1 score, which is calculated as below:
2 * (precision * recall) / (precision + recall)
The advantage of the F1 score is that it incorporates both precision and recall into a single metric, and a high F1 score is a sign of a well-performing model, even in situations where you might have imbalanced classes. In scikit-learn, you can compute the F1 score using the f1_score function.
clf = RandomForestClassifier()
# Fit to the training data
clf.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = clf.predict(X_test)
# Import f1_score
from sklearn.metrics import f1_score
# Print the F1 score
print(f1_score(y_test, y_pred))
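As a quick check of the formula above, the same value can be recomputed from precision_score and recall_score (it should match f1_score up to floating-point rounding):
# Verify: F1 = 2 * (precision * recall) / (precision + recall)
p = precision_score(y_test, y_pred)
r = recall_score(y_test, y_pred)
print(2 * (p * r) / (p + r))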
Model Tuning
Learn how to improve the performance of your models using hyperparameter tuning and gain a better understanding of the drivers of customer churn that you can take back to the business.
- 4.1 Tuning your model:
- Tuning the number of features
- Tuning other hyperparameters
- Randomized Search
- 4.2 Feature Importance
- Visualize feature importances
- Improving the plot
- Interpreting feature importances
- 4.3 Adding New Features
- Does model performance improve?
- Computing other metrics
""" Grid Search in sklearn """
from sklearn.model_selection import GridSearchCV
import numpy as np
# Grid parameters
# Note: max_features='auto' is deprecated in newer scikit-learn releases; use 'sqrt' there
param_grid = {'n_estimators': np.arange(10, 51),          # number of trees in the forest
              'max_features': ['auto', 'sqrt', 'log2']}   # features considered at each split
# Instantiate the grid search model
clf_cv = GridSearchCV(RandomForestClassifier(), param_grid)
clf_cv.fit(X_train, y_train)
clf_cv.best_params_
clf_cv.best_score_
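Beyond best_params_ and best_score_, the full search results live in cv_results_; a quick sketch for inspecting the top-scoring settings:
# Rank all tried parameter combinations by mean cross-validated score
results = pd.DataFrame(clf_cv.cv_results_)
print(results[['params', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())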
Tuning other hyperparameters
The power of GridSearchCV really comes into play when you're tuning multiple hyperparameters, as then the algorithm tries out all possible combinations of hyperparameters to identify the best combination. Here, you'll tune the following random forest hyperparameters:
Hyperparameter | Purpose
---|---
criterion | Quality of split
max_features | Number of features for best split
max_depth | Max depth of tree
bootstrap | Whether bootstrap samples are used
The hyperparameter grid has been specified for you, along with a random forest classifier called clf.
other_param_grid = {'max_depth': [3, None],
                    'max_features': [1, 3, 10],
                    'bootstrap': [True, False],
                    'criterion': ['gini', 'entropy']}
other_grid_search = GridSearchCV(RandomForestClassifier(), other_param_grid)
other_grid_search.fit(X_train, y_train)
print(other_grid_search.best_params_)
print(other_grid_search.best_score_)
Randomized Search
# Call GridSearchCV
grid_search = GridSearchCV(clf, param_grid)
# Fit the model
grid_search.fit(X, y)
In the above chunk of code from the previous exercise, you may have noticed that the first line of code did not take much time to run, while the call to .fit() took several seconds to execute.
This is because .fit() is what actually performs the grid search, and in our case it was a grid with many different combinations. As the hyperparameter grid gets larger, grid search becomes slower. To solve this problem, instead of trying out every single combination of values, we could randomly jump around the grid and try different combinations. There's a small chance we may miss the best combination, but we would save a lot of time, or be able to tune more hyperparameters in the same amount of time.
In scikit-learn, you can do this using RandomizedSearchCV. It has the same API as GridSearchCV, except that you need to specify a parameter distribution that it can sample from instead of specific hyperparameter values. Let's try it out now! The parameter distribution is set up below, along with a random forest classifier called rf.
from sklearn.model_selection import RandomizedSearchCV
# Create the hyperparameter grid
param_dist = {"max_depth": [3, None],
"max_features": [1, 5, 10],
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}
rf = RandomForestClassifier()
# Call RandomizedSearchCV (by default it samples n_iter=10 parameter settings)
random_search = RandomizedSearchCV(rf, param_dist)
# Fit the model
random_search.fit(X_train, y_train)
# Print best parameters
print(random_search.best_params_)
print(random_search.best_score_)
random_search.best_estimator_.feature_importances_
Visualizing feature importances
Your random forest classifier from earlier exercises has been fit to the telco data and is available to you as clf. Let's visualize the feature importances and get a sense for what the drivers of churn are, using matplotlib's barh to create a horizontal bar plot of feature importances.
importances = clf.feature_importances_
plt.figure(figsize=(8,4), dpi=80)
# Create plot
plt.barh(range(X_train.shape[1]), importances)
plt.show()
Improving the above plot
To make the plot more readable:
- Re-order the bars in ascending order.
- Add labels to the plot that correspond to the feature names.
To do this, we'll take advantage of NumPy indexing. The .argsort() method sorts an array and returns the indices. We'll use these indices to achieve both goals.
sorted_index = np.argsort(importances)
# create labels
labels = X_train.columns.values[sorted_index]
# clear the current plot
plt.clf()
plt.figure(figsize=(8,6), dpi=80)
# Horizontal bar plot (plt.barh): importances in ascending order, labeled by feature name
plt.barh(range(X_train.shape[1]), importances[sorted_index], tick_label=labels)
plt.show()
from sklearn.metrics import f1_score
# Instantiate the classifier
clf = RandomForestClassifier()
# Fit to the data
clf.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = clf.predict(X_test)
# Print the F1 score
print(f1_score(y_test, y_pred))  # argument order: (y_true, y_pred)