Marketing Analytics - Predicting Customer Churn in Python (DataCamp course)
Churn is when a customer stops doing business or ends a relationship with a company. It’s a common problem across a variety of industries, from telecommunications to cable TV to SaaS, and a company that can predict churn can take proactive action to retain valuable customers and get ahead of the competition.
This course will provide you with a roadmap to create your own customer churn models. You'll learn how to explore and visualize your data, prepare it for modeling, make predictions using machine learning, and communicate important, actionable insights to stakeholders. By the end of the course, you'll be comfortable using the pandas library for data analysis and the scikit-learn library for machine learning.
Exploratory Data Analysis
Begin exploring the Telco Churn Dataset using pandas to compute summary statistics and Seaborn to create attractive visualizations.
- 1.1 Defining and Exploring customer churn
- 1.2 Grouping and Summarizing data
- 1.3 Exploring Data with Visualization
- Feature distributions
- Customer service calls and churn
import pandas as pd
import numpy as np
import warnings
pd.set_option('display.expand_frame_repr', False)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)
telco = pd.read_csv("Churn.csv")
display(telco.head(10))
# Show churners and non-churners
print(telco['Churn'].value_counts())
# Group 'CustServ_Calls' and 'Vmail_Message' by 'Churn' and compute the mean
print(telco[['CustServ_Calls', 'Vmail_Message', 'Churn']].groupby(['Churn']).mean())
# Adapt your code to compute the standard deviation
print(telco[['CustServ_Calls', 'Vmail_Message', 'Churn']].groupby(['Churn']).std())
# Count the number of churners and non-churners by State
print(telco.groupby('State')['Churn'].value_counts())
display(telco.describe())
import matplotlib.pyplot as plt
import seaborn as sns
# Visualize the distribution of 'Account_Length'
# Old Seaborn API (deprecated): sns.distplot(telco['Account_Length'])
""" Seaborn New Version:
stat="density" - show density instead of count in the histogram
alpha=0.4 - for the same transparency
kde=True - add a kernel density plot
kde_kws={"cut": 3} - extend the kernel density plot beyond the histogram limit
"""
sns.histplot(telco['Account_Length'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})
plt.show()  # The distribution looks approximately normal
# differences in account length box plot
sns.boxplot(x='Churn',
            y='Account_Length',
            data=telco,
            sym='')  # sym='' removes outliers
plt.show()
# Adding 3rd variable
sns.boxplot(x='Churn',
            y='Account_Length',
            data=telco,
            hue='Intl_Plan')
plt.show()
Exploring feature distributions
Previously, we saw that the 'Account_Length' feature was normally distributed. Let's now visualize the distributions of the following features using seaborn's distribution plot:
- 'Day_Mins'
- 'Eve_Mins'
- 'Night_Mins'
- 'Intl_Mins'
To create a feature's distribution plot, pass it in as an argument to sns.distplot().
# g = sns.PairGrid(telco[['Day_Mins','Eve_Mins','Night_Mins','Intl_Mins']])
fig = plt.figure(figsize=(8, 6), dpi=100)
fig.subplots_adjust(hspace=0.4, wspace=0.4)
# Visualize the distribution of 'Day_Mins'
ax = fig.add_subplot(2, 2, 1)
sns.distplot(telco['Day_Mins'])
# visualize the distribution of 'Eve_Mins'
ax = fig.add_subplot(2, 2, 2)
sns.distplot(telco['Eve_Mins'])
# visualize the distribution of 'Night_Mins'
ax = fig.add_subplot(2, 2, 3)
sns.distplot(telco['Night_Mins'])
# visualize the distribution of 'Intl_Mins'
ax = fig.add_subplot(2, 2, 4)
sns.distplot(telco['Intl_Mins'])
plt.show()
""" Display with the new version of Seaborn, using histplot()
All of these features appear to be well approximated by the normal distribution.
"""
fig = plt.figure(figsize=(8, 6), dpi=100)
fig.subplots_adjust(hspace=0.4, wspace=0.4)
# Visualize the distribution of 'Day_Mins'
ax = fig.add_subplot(2, 2, 1)
sns.histplot(telco['Day_Mins'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})
# visualize the distribution of 'Eve_Mins'
ax = fig.add_subplot(2, 2, 2)
sns.histplot(telco['Eve_Mins'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})
# visualize the distribution of 'Night_Mins'
ax = fig.add_subplot(2, 2, 3)
sns.histplot(telco['Night_Mins'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})
# visualize the distribution of 'Intl_Mins'
ax = fig.add_subplot(2, 2, 4)
sns.histplot(telco['Intl_Mins'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3})
plt.show()
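The four near-identical subplot blocks above can be collapsed into a loop over the column names. A minimal equivalent sketch using the newer histplot API:
# Same four panels, written as a loop
fig, axes = plt.subplots(2, 2, figsize=(8, 6), dpi=100)
fig.subplots_adjust(hspace=0.4, wspace=0.4)
for ax, col in zip(axes.flat, ['Day_Mins', 'Eve_Mins', 'Night_Mins', 'Intl_Mins']):
    sns.histplot(telco[col], stat="density", alpha=0.4, kde=True,
                 kde_kws={"cut": 3}, ax=ax)
plt.show()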
Customer service calls and churn
You've already seen that there's not much of a difference in account lengths between churners and non-churners, but that there is a difference in the number of customer service calls made by churners.
Let's now visualize this difference using a box plot and incorporate other features of interest - do customers who have international plans make more customer service calls? Or do they tend to churn more? How about voicemail plans? Let's find out!
Recall the syntax for creating a box plot using seaborn:
sns.boxplot(x="X-axis variable",
            y="Y-axis variable",
            data=DataFrame)
If you want to remove outliers, you can specify the additional parameter sym="", and you can add a third variable using hue.
import matplotlib.pyplot as plt
import seaborn as sns
# Add "Intl_Plan" as a third variable
sns.boxplot(x='Churn',
            y='CustServ_Calls',
            data=telco,
            sym="",
            hue="Intl_Plan")
# Display the plot
plt.show()
It looks like customers who churn make more customer service calls than those who do not. Among churners, those without an international plan make more customer service calls.
Preprocessing for Churn Modeling
Having explored your data, it's now time to preprocess it and get it ready for machine learning. Learn the why, what, and how of preprocessing, including feature selection and feature engineering.
- 2.1 Data Preprocessing
- Identifying features to convert
- Encoding binary features
- One hot encoding
- Feature Scaling
- 2.2 Feature Selection and Engineering
- Dropping unnecessary features
- Engineering a new column
print(telco.dtypes.sort_values(), len(telco.dtypes))
""" there are some object datatype: Churn, State, Intl_Plan, Vmail_Plan, Phone, while the rest are numeric. """
import time
######### Encoding binary features
print(telco['Intl_Plan'].head())
""" since this column only includes two string values ("yes" and "no"), we can encode it as a binary feature."""
df_1, df_2 = telco['Intl_Plan'].copy(), telco['Intl_Plan'].copy()
# Method 1: replace
t1 = time.time()
for _ in range(1000):
    df_1.copy().replace({'no': 0, 'yes': 1})
print("Method 1 took {}s".format(time.time() - t1))
# Method 2: LabelEncoder (much faster when iterating over a large number of features)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
t2 = time.time()
for _ in range(1000):
    # Equivalent but slower: LabelEncoder().fit_transform(df_2.copy()) re-instantiates each time
    le.fit_transform(df_2.copy())
print("Method 2 took {}s".format(time.time() - t2))
# Replace 'no' with 0 and 'yes' with 1 in 'Intl_Plan'
telco['Intl_Plan'] = le.fit_transform(telco['Intl_Plan'])
print(telco['Intl_Plan'].head())
# Replace 'no' with 0 and 'yes' with 1 in 'Vmail_Plan'
telco['Vmail_Plan'] = le.fit_transform(telco["Vmail_Plan"])
# Replace 'no' with 0 and 'yes' with 1 in 'Churn'
test_le = LabelEncoder()
telco['Churn'] = test_le.fit_transform(telco['Churn'])
# inverse_transform maps 0/1 back to the original labels; applying it to
# 'Vmail_Plan' works only because both columns share the same 'no'/'yes' classes
inverted = test_le.inverse_transform(telco['Vmail_Plan'])
inverted
"""
telco['State'].head(4)
0 KS
1 OH
2 NJ
3 OH
Name: State, dtype: objec
it would be a bad idea to encode them as int number for each stage --> recomend to use One Hot Encoding
pandas has a get_dummies() function which automatically applies one hot encoding over the selected feature.
"""
# Import pandas
import pandas as pd
# Perform one hot encoding on 'State'
telco_state = pd.get_dummies(telco['State'])
display(telco_state.head())
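To actually feed the encoded states to a model, you would typically concatenate the dummy columns back onto the DataFrame and drop the raw 'State' column. A minimal sketch, not used in the exercises below (the telco_encoded name and the drop_first=True choice, which avoids perfectly collinear dummies, are illustrative):
# Attach the dummy columns and drop the original 'State' column
telco_encoded = pd.concat([telco.drop('State', axis=1),
                           pd.get_dummies(telco['State'], prefix='State', drop_first=True)],
                          axis=1)
print(telco_encoded.shape)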
""" We can see that originally the features are not on the same scale.
For ex: Intl_Calls range from 0 to 20, while for example, Night_Mins is from 23 to 395
"""
"""Standardization
- Centers the distribution around the mean
- Calculates the number of std away from the mean each oint is"""
print(telco[['Intl_Calls','Night_Mins']].describe())
from sklearn.preprocessing import StandardScaler
telco_scaled = StandardScaler().fit_transform(telco[['Intl_Calls', 'Night_Mins']])
# Add column names back for readability
telco[['Intl_Calls','Night_Mins']] = pd.DataFrame(telco_scaled, columns=["Intl_Calls", "Night_Mins"])
print(telco[['Intl_Calls','Night_Mins']].describe())
""" Both features are now on the same scale. In practice, you'll need to carefully ensure this is the case for all features of interest. """
Feature Selection and Engineering
Dropping unnecessary features
- Unique identifiers such as phone numbers, Social Security numbers, and account numbers should be dropped.
Example:
telco.drop(['Soc_Sec', 'Tax_ID'], axis = 1)
Dropping correlated features
- Highly correlated features can be dropped
- They provide no additional information to the model
Some features such as 'Area_Code' and 'Phone' are not useful when it comes to predicting customer churn, and they need to be dropped prior to modeling.
The easiest way to do so in Python is using the .drop() method of pandas DataFrames, just as you saw in the video, where 'Soc_Sec' and 'Tax_ID' were dropped:
telco.drop(['Soc_Sec', 'Tax_ID'], axis=1)
Here, axis=1 indicates that you want to drop 'Soc_Sec' and 'Tax_ID' from the columns.
telco.corr()
plt.figure(figsize=(10, 10), dpi=128)
ax = sns.heatmap(telco.corr(), annot=True, fmt='.1f')
ax.xaxis.set_ticks_position('top')
plt.xticks(rotation=45, ha='left', rotation_mode='anchor')
plt.show()
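The heatmap is good for eyeballing correlations; to pull out the strongest pairs programmatically, one option is the sketch below (the 0.9 cutoff is an arbitrary choice):
# List feature pairs whose absolute correlation exceeds a threshold
corr = telco.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # upper triangle: each pair appears once
pairs = corr.where(mask).stack()
print(pairs[pairs > 0.9])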
telco = telco.drop(['Area_Code', 'Phone'], axis=1)
# Verify dropped features
print(telco.columns)
print(telco.dtypes.sort_values(), len(telco.dtypes))
# Create a cost-per-minute feature (charge divided by minutes)
telco['Day_Cost'] = telco['Day_Charge'] / telco['Day_Mins']
# Create the new feature: average minutes per night call
telco['Avg_Night_Calls'] = telco['Night_Mins'] / telco['Night_Calls']
print(telco.dtypes.sort_values(), len(telco.dtypes))
Churn Prediction
With your data preprocessed and ready for machine learning, it's time to predict churn! Learn how to build supervised learning machine models in Python using scikit-learn.
- 3.1 Making Predictions
- Predicting whether a new customer will churn
- Training another sklearn model
- 3.2 Evaluating Model Performance
- Creating training and test sets
- Checking each set's length
- Computing accuracy
- 3.3 Model Metrics
- Confusion Matrix
- Varying training set size
- Computing precision and recall
- 3.4 Other Metrics
- ROC curve
- AUC (Area Under the Curve)
- Precision and Recall Curve
- F1 Score
features = ['Account_Length', 'Vmail_Message', 'Day_Mins', 'Eve_Mins', 'Night_Mins', 'Intl_Mins', 'CustServ_Calls', 'Intl_Plan', 'Vmail_Plan',
'Day_Calls', 'Day_Charge', 'Eve_Calls', 'Eve_Charge','Night_Calls', 'Night_Charge', 'Intl_Calls', 'Intl_Charge']
from sklearn.model_selection import train_test_split
X, y = telco[features], telco['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
# this customer value was introduced in the example:
new_customer = X_test[(X_test.Account_Length == 91) & (X_test.Vmail_Message == 23)]
prediction = svc.predict(new_customer)
print(prediction)
print(svc.score(X_test, y_test))
Training another scikit-learn model
All sklearn models have .fit() and .predict() methods like the ones you used in the previous exercise for the SVC model. This consistent API allows you to easily try many different models to see which one gives the best performance. To get more comfortable with the sklearn API, in this exercise you'll fit a DecisionTreeClassifier instead.
from sklearn.tree import DecisionTreeClassifier
# Instantiate the classifier
clf = DecisionTreeClassifier()
# Fit the classifier
clf.fit(X_train,y_train)
# this customer value was introduced in the example:
new_customer = X_test[(X_test.Account_Length == 91) & (X_test.Vmail_Message == 23)]
# Predict the label of new_customer
print(clf.predict(new_customer))
print(clf.score(X_test, y_test))
from sklearn.model_selection import train_test_split
# we select the below features to reproduce the results in the example
features = ['Account_Length', 'Vmail_Message', 'Day_Mins', 'Eve_Mins', 'Night_Mins', 'Intl_Mins', 'CustServ_Calls', 'Intl_Plan', 'Vmail_Plan', 'Day_Calls', 'Day_Charge', 'Eve_Calls', 'Eve_Charge','Night_Calls', 'Night_Charge', 'Intl_Calls', 'Intl_Charge']
# Create feature variable
X = telco[features]  # Don't just use telco.drop('Churn', axis=1): other object columns remain
# Create target variable
y = telco['Churn']
# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# check each set length
print(len(X_train), len(X_test), len(y_train), len(y_test))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
print(X_train.dtypes, len(X_train.dtypes))
from sklearn.ensemble import RandomForestClassifier
# Instantiate the classifier
clf = RandomForestClassifier()
# Fit to the training data
clf.fit(X_train, y_train)
# Compute accuracy
print(clf.score(X_test, y_test))
Model Metrics
Confusion Matrix
Computing precision and recall
The sklearn.metrics submodule has many functions that allow you to easily calculate interesting metrics. So far, you've calculated precision and recall by hand - this is important while you develop your intuition for both these metrics.
In practice, once you do, you can leverage the precision_score and recall_score functions that automatically compute precision and recall, respectively. Both work similarly to other functions in sklearn.metrics - they accept 2 arguments: the first is the actual labels (y_test), and the second is the predicted labels (y_pred).
Let's now try a training size of 90%.
from sklearn.model_selection import train_test_split
# Create training and testing sets, try a training size of 90%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
# Instantiate the classifier
clf = RandomForestClassifier()
# Fit to the training data
clf.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = clf.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Import precision_score
from sklearn.metrics import precision_score, recall_score, f1_score
# Print the precision
print(precision_score(y_test, y_pred))
# Print the recall
print(recall_score(y_test, y_pred))
# Print the F1 score
print(f1_score(y_test, y_pred))
"""
[[True Negative (TN), False Positive(FP)],
[False Negative (FN), True Positive(TP)]]
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
"""
print("Presition: {}".format(43/(43+2)))
print("Recall: {}".format(43/(43+8)))
ROC curve
Let's now create an ROC curve for our random forest classifier. The first step is to calculate the predicted probabilities output by the classifier for each label using its .predict_proba() method. Then, you can use the roc_curve function from sklearn.metrics to compute the false positive rate and true positive rate, which you can then plot using matplotlib.
A RandomForestClassifier has already been fit to the data (the course used a 70% training set; above, we trained on 90%) and is available as clf.
y_pred_prob = clf.predict_proba(X_test)[:, 1]
# Import roc_curve
from sklearn.metrics import roc_curve
# Calculate the ROC metrics for the RandomForestClassifier fit above
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.figure(figsize=(10,10), dpi=80)
# Plot the ROC curve
plt.plot(fpr, tpr)
# Add labels and diagonal line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot([0, 1], [0, 1], "k--")
plt.show()
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_test, y_pred_prob)
print("AUC: {}".format(auc))
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
# Calculate the precision-recall metrics for the model
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
plt.figure(figsize=(10,10), dpi=80)
# Plot the precision-recall curve
plt.plot(recall,precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
# show the plot
plt.show()
F1 score
As you've discovered, there's a tradeoff between precision and recall. Both are important metrics, and depending on how the business is trying to model churn, you may want to focus on optimizing one over the other. Often, stakeholders are interested in a single metric that can quantify model performance. The AUC is one metric you can use in these cases, and another is the F1 score, which is calculated as below:
2 * (precision * recall) / (precision + recall)
The advantage of the F1 score is that it incorporates both precision and recall into a single metric, and a high F1 score is a sign of a well-performing model, even in situations where you might have imbalanced classes. In scikit-learn, you can compute the F1 score using the f1_score function.
clf = RandomForestClassifier()
# Fit to the training data
clf.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = clf.predict(X_test)
# Import f1_score
from sklearn.metrics import f1_score
# Print the F1 score
print(f1_score(y_test, y_pred))
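As a quick check of the formula above, the same value can be recomputed from precision_score and recall_score (it should match f1_score up to floating-point rounding):
# Verify: F1 = 2 * (precision * recall) / (precision + recall)
p = precision_score(y_test, y_pred)
r = recall_score(y_test, y_pred)
print(2 * (p * r) / (p + r))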
Model Tuning
Learn how to improve the performance of your models using hyperparameter tuning and gain a better understanding of the drivers of customer churn that you can take back to the business.
- 4.1 Tuning your model:
- Tuning the number of features
- Tuning other hyperparameters
- Randomized Search
- 4.2 Feature Importance
- Visualize feature importances
- Improving the plot
- Interpreting feature importances
- 4.3 Adding New Features
- Does model performance improve?
- Computing other metrics
""" Grid Search in sklearn """
from sklearn.model_selection import GridSearchCV
import numpy as np
# Grid parameters
# Note: max_features='auto' is deprecated in newer scikit-learn releases; use 'sqrt' there
param_grid = {'n_estimators': np.arange(10, 51),          # number of trees in the forest
              'max_features': ['auto', 'sqrt', 'log2']}   # features considered at each split
# Instantiate the grid search model
clf_cv = GridSearchCV(RandomForestClassifier(), param_grid)
clf_cv.fit(X_train, y_train)
clf_cv.best_params_
clf_cv.best_score_
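Beyond best_params_ and best_score_, the full search results live in cv_results_; a quick sketch for inspecting the top-scoring settings:
# Rank all tried parameter combinations by mean cross-validated score
results = pd.DataFrame(clf_cv.cv_results_)
print(results[['params', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())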
Tuning other hyperparameters
The power of GridSearchCV really comes into play when you're tuning multiple hyperparameters, as then the algorithm tries out all possible combinations of hyperparameters to identify the best combination. Here, you'll tune the following random forest hyperparameters:
Hyperparameter | Purpose
---|---
criterion | Quality of split
max_features | Number of features for best split
max_depth | Max depth of tree
bootstrap | Whether bootstrap samples are used
The hyperparameter grid has been specified for you, along with a random forest classifier called clf.
other_param_grid = {'max_depth': [3, None],
                    'max_features': [1, 3, 10],
                    'bootstrap': [True, False],
                    'criterion': ['gini', 'entropy']}
other_grid_search = GridSearchCV(RandomForestClassifier(), other_param_grid)
other_grid_search.fit(X_train, y_train)
print(other_grid_search.best_params_)
print(other_grid_search.best_score_)
Randomized Search
# Call GridSearchCV
grid_search = GridSearchCV(clf, param_grid)
# Fit the model
grid_search.fit(X, y)
In the above chunk of code from the previous exercise, you may have noticed that the first line of code did not take much time to run, while the call to .fit() took several seconds to execute.
This is because .fit() is what actually performs the grid search, and in our case it was a grid with many different combinations. As the hyperparameter grid gets larger, grid search becomes slower. To solve this problem, instead of trying out every single combination of values, we could randomly jump around the grid and try different combinations. There's a small chance we may miss the best combination, but we would save a lot of time, or be able to tune more hyperparameters in the same amount of time.
In scikit-learn, you can do this using RandomizedSearchCV. It has the same API as GridSearchCV, except that you need to specify a parameter distribution that it can sample from instead of specific hyperparameter values. Let's try it out now! The parameter distribution is set up below, along with a random forest classifier called rf.
from sklearn.model_selection import RandomizedSearchCV
# Create the hyperparameter grid
param_dist = {"max_depth": [3, None],
"max_features": [1, 5, 10],
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}
rf = RandomForestClassifier()
# Call RandomizedSearchCV (by default it samples n_iter=10 parameter settings)
random_search = RandomizedSearchCV(rf, param_dist)
# Fit the model
random_search.fit(X_train, y_train)
# Print best parameters
print(random_search.best_params_)
print(random_search.best_score_)
random_search.best_estimator_.feature_importances_
Visualizing feature importances
Your random forest classifier from earlier exercises has been fit to the telco data and is available to you as clf. Let's visualize the feature importances and get a sense for what the drivers of churn are, using matplotlib's barh to create a horizontal bar plot of feature importances.
importances = clf.feature_importances_
plt.figure(figsize=(8,4), dpi=80)
# Create plot
plt.barh(range(X_train.shape[1]), importances)
plt.show()
Improving the above plot
To make the plot more readable:
- Re-order the bars in ascending order.
- Add labels to the plot that correspond to the feature names.
To do this, we'll take advantage of NumPy indexing. The .argsort() method sorts an array and returns the indices. We'll use these indices to achieve both goals.
sorted_index = np.argsort(importances)
# create labels
labels = X_train.columns.values[sorted_index]
# clear the current plot
plt.clf()
plt.figure(figsize=(8,6), dpi=80)
# Horizontal bar plot (plt.barh): importances in ascending order, labeled by feature name
plt.barh(range(X_train.shape[1]), importances[sorted_index], tick_label=labels)
plt.show()
from sklearn.metrics import f1_score
# Instantiate the classifier
clf = RandomForestClassifier()
# Fit to the data
clf.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = clf.predict(X_test)
# Print the F1 score
print(f1_score(y_test, y_pred))  # argument order: (y_true, y_pred)