This course will provide you with a roadmap to create your own customer churn models. You’ll learn how to explore and visualize your data, prepare it for modeling, make predictions using machine learning, and communicate important, actionable insights to stakeholders. By the end of the course, you’ll be comfortable using the pandas library for data analysis and the scikit-learn library for machine learning.

Exploratory Data Analysis

Begin exploring the Telco Churn Dataset using pandas to compute summary statistics and Seaborn to create attractive visualizations.

  • 1.1 Defining and Exploring Customer Churn
  • 1.2 Grouping and Summarizing Data
  • 1.3 Exploring Data with Visualization
    • Feature distributions
    • Customer service calls and churn

Defining and Exploring the Dataset

import pandas as pd
import numpy as np
import warnings

pd.set_option('display.expand_frame_repr', False)

warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)

telco = pd.read_csv("Churn.csv")


display(telco.head(10))
Account_Length Vmail_Message Day_Mins Eve_Mins Night_Mins Intl_Mins CustServ_Calls Churn Intl_Plan Vmail_Plan Day_Calls Day_Charge Eve_Calls Eve_Charge Night_Calls Night_Charge Intl_Calls Intl_Charge State Area_Code Phone
0 128 25 265.1 197.4 244.7 10.0 1 no no yes 110 45.07 99 16.78 91 11.01 3 2.70 KS 415 382-4657
1 107 26 161.6 195.5 254.4 13.7 1 no no yes 123 27.47 103 16.62 103 11.45 3 3.70 OH 415 371-7191
2 137 0 243.4 121.2 162.6 12.2 0 no no no 114 41.38 110 10.30 104 7.32 5 3.29 NJ 415 358-1921
3 84 0 299.4 61.9 196.9 6.6 2 no yes no 71 50.90 88 5.26 89 8.86 7 1.78 OH 408 375-9999
4 75 0 166.7 148.3 186.9 10.1 3 no yes no 113 28.34 122 12.61 121 8.41 3 2.73 OK 415 330-6626
5 118 0 223.4 220.6 203.9 6.3 0 no yes no 98 37.98 101 18.75 118 9.18 6 1.70 AL 510 391-8027
6 121 24 218.2 348.5 212.6 7.5 3 no no yes 88 37.09 108 29.62 118 9.57 7 2.03 MA 510 355-9993
7 147 0 157.0 103.1 211.8 7.1 0 no yes no 79 26.69 94 8.76 96 9.53 6 1.92 MO 415 329-9001
8 117 0 184.5 351.6 215.8 8.7 1 no no no 97 31.37 80 29.89 90 9.71 4 2.35 LA 408 335-4719
9 141 37 258.6 222.0 326.4 11.2 0 no yes yes 84 43.96 111 18.87 97 14.69 5 3.02 WV 415 330-8173

Grouping and Summarizing Data

Visualizing the distribution of some features (e.g., Account Length)

It is important to understand how your variables are distributed.

# show churners and non-churners
print(telco['Churn'].value_counts())

# Group 'CustServ_Calls' and 'Vmail_Message' by 'Churn' and compute the mean
print(telco[['CustServ_Calls', 'Vmail_Message', 'Churn']].groupby(['Churn']).mean())
# Adapt your code to compute the standard deviation
print(telco[['CustServ_Calls', 'Vmail_Message', 'Churn']].groupby(['Churn']).std())

# Count the number of churners and non-churners by State
print(telco.groupby('State')['Churn'].value_counts())
no     2850
yes     483
Name: Churn, dtype: int64
       CustServ_Calls  Vmail_Message
Churn                               
no           1.449825       8.604561
yes          2.229814       5.115942
       CustServ_Calls  Vmail_Message
Churn                               
no           1.163883      13.913125
yes          1.853275      11.860138
State  Churn
AK     no       49
       yes       3
AL     no       72
       yes       8
AR     no       44
       yes      11
AZ     no       60
       yes       4
CA     no       25
       yes       9
CO     no       57
       yes       9
CT     no       62
       yes      12
DC     no       49
       yes       5
DE     no       52
       yes       9
FL     no       55
       yes       8
GA     no       46
       yes       8
HI     no       50
       yes       3
IA     no       41
       yes       3
ID     no       64
       yes       9
IL     no       53
       yes       5
IN     no       62
       yes       9
KS     no       57
       yes      13
KY     no       51
       yes       8
LA     no       47
       yes       4
MA     no       54
       yes      11
MD     no       53
       yes      17
ME     no       49
       yes      13
MI     no       57
       yes      16
MN     no       69
       yes      15
MO     no       56
       yes       7
MS     no       51
       yes      14
MT     no       54
       yes      14
NC     no       57
       yes      11
ND     no       56
       yes       6
NE     no       56
       yes       5
NH     no       47
       yes       9
NJ     no       50
       yes      18
NM     no       56
       yes       6
NV     no       52
       yes      14
NY     no       68
       yes      15
OH     no       68
       yes      10
OK     no       52
       yes       9
OR     no       67
       yes      11
PA     no       37
       yes       8
RI     no       59
       yes       6
SC     no       46
       yes      14
SD     no       52
       yes       8
TN     no       48
       yes       5
TX     no       54
       yes      18
UT     no       62
       yes      10
VA     no       72
       yes       5
VT     no       65
       yes       8
WA     no       52
       yes      14
WI     no       71
       yes       7
WV     no       96
       yes      10
WY     no       68
       yes       9
Name: Churn, dtype: int64
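The long State/Churn listing above is hard to scan. One possible refinement (a sketch on a hypothetical toy frame, not the course's code) is to pivot the counts into a state-by-label table with .unstack():

```python
import pandas as pd

# toy frame standing in for the Telco data (hypothetical values)
telco = pd.DataFrame({'State': ['KS', 'KS', 'OH', 'OH', 'OH'],
                      'Churn': ['no', 'yes', 'no', 'no', 'yes']})

# one row per state, one column per churn label, zeros where a combination is absent
state_counts = telco.groupby('State')['Churn'].value_counts().unstack(fill_value=0)
print(state_counts)
```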
display(telco.describe())
Account_Length Vmail_Message Day_Mins Eve_Mins Night_Mins Intl_Mins CustServ_Calls Day_Calls Day_Charge Eve_Calls Eve_Charge Night_Calls Night_Charge Intl_Calls Intl_Charge Area_Code
count 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000 3333.000000
mean 101.064806 8.099010 179.775098 200.980348 200.872037 10.237294 1.562856 100.435644 30.562307 100.114311 17.083540 100.107711 9.039325 4.479448 2.764581 437.182418
std 39.822106 13.688365 54.467389 50.713844 50.573847 2.791840 1.315491 20.069084 9.259435 19.922625 4.310668 19.568609 2.275873 2.461214 0.753773 42.371290
min 1.000000 0.000000 0.000000 0.000000 23.200000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 33.000000 1.040000 0.000000 0.000000 408.000000
25% 74.000000 0.000000 143.700000 166.600000 167.000000 8.500000 1.000000 87.000000 24.430000 87.000000 14.160000 87.000000 7.520000 3.000000 2.300000 408.000000
50% 101.000000 0.000000 179.400000 201.400000 201.200000 10.300000 1.000000 101.000000 30.500000 100.000000 17.120000 100.000000 9.050000 4.000000 2.780000 415.000000
75% 127.000000 20.000000 216.400000 235.300000 235.300000 12.100000 2.000000 114.000000 36.790000 114.000000 20.000000 113.000000 10.590000 6.000000 3.270000 510.000000
max 243.000000 51.000000 350.800000 363.700000 395.000000 20.000000 9.000000 165.000000 59.640000 170.000000 30.910000 175.000000 17.770000 20.000000 5.400000 510.000000
import matplotlib.pyplot as plt
import seaborn as sns

# visualize the distribution of 'Account_Length'
# old seaborn API (deprecated): sns.distplot(telco['Account_Length'])
""" Seaborn New Version:
stat="density" - show density instead of count in the histogram
alpha=0.4 - for the same transparency
kde=True - add a kernel density plot
kde_kws={"cut": 3} - extend the kernel density plot beyond the histogram limit
"""
sns.histplot(telco['Account_Length'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3}) 

plt.show() # the distribution looks approximately normal

# differences in account length box plot
sns.boxplot(x ='Churn',
            y = 'Account_Length', 
            data = telco,
            sym ='') # remove outliers
plt.show()

# Adding 3rd variable

sns.boxplot(x ='Churn',
            y = 'Account_Length',
            data = telco,
            hue = 'Intl_Plan')
plt.show()

Exploring feature distributions

Previously, we saw that the 'Account_Length' feature was normally distributed. Let's now visualize the distributions of the following features using seaborn's distribution plot:

  • 'Day_Mins'
  • 'Eve_Mins'
  • 'Night_Mins'
  • 'Intl_Mins'

To create a feature's distribution plot, pass it in as an argument to sns.distplot() (deprecated in newer seaborn; an equivalent histplot() version appears below).

# g = sns.PairGrid(telco[['Day_Mins','Eve_Mins','Night_Mins','Intl_Mins']])

fig = plt.figure(figsize=(8, 6), dpi=100)
fig.subplots_adjust(hspace=0.4, wspace=0.4)
# Visualize the distribution of 'Day_Mins'
ax = fig.add_subplot(2, 2, 1)
sns.distplot(telco['Day_Mins'])

# visualize the distribution of 'Eve_Mins'
ax = fig.add_subplot(2, 2, 2)
sns.distplot(telco['Eve_Mins'])

# visualize the distribution of 'Night_Mins'
ax = fig.add_subplot(2, 2, 3)
sns.distplot(telco['Night_Mins'])

# visualize the distribution of 'Intl_Mins'
ax = fig.add_subplot(2, 2, 4)
sns.distplot(telco['Intl_Mins'])
plt.show()
""" Display with the new version of Seaborn, using histplot()
All of these features appear to be well approximated by the normal distribution.

"""
fig = plt.figure(figsize=(8, 6), dpi=100)
fig.subplots_adjust(hspace=0.4, wspace=0.4)
# Visualize the distribution of 'Day_Mins'
ax = fig.add_subplot(2, 2, 1)
sns.histplot(telco['Day_Mins'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3}) 

# visualize the distribution of 'Eve_Mins'
ax = fig.add_subplot(2, 2, 2)
sns.histplot(telco['Eve_Mins'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3}) 

# visualize the distribution of 'Night_Mins'
ax = fig.add_subplot(2, 2, 3)
sns.histplot(telco['Night_Mins'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3}) 

# visualize the distribution of 'Intl_Mins'
ax = fig.add_subplot(2, 2, 4)
sns.histplot(telco['Intl_Mins'], stat="density", alpha=0.4, kde=True, kde_kws={"cut": 3}) 
plt.show()

Customer service calls and churn

You've already seen that there's not much of a difference in account lengths between churners and non-churners, but that there is a difference in the number of customer service calls made by churners.

Let's now visualize this difference using a box plot and incorporate other features of interest - do customers who have international plans make more customer service calls? Or do they tend to churn more? How about voicemail plans? Let's find out!

Recall the syntax for creating a box plot using seaborn:

sns.boxplot(x = "X-axis variable",
            y = "Y-axis variable",
            data = DataFrame)

If you want to remove outliers, you can specify the additional parameter sym="", and you can add a third variable using hue.

import matplotlib.pyplot as plt
import seaborn as sns

# Add "Intl_Plan" as a third variable
sns.boxplot(x = 'Churn',
            y = 'CustServ_Calls',
            data = telco,
            sym = "",
            hue = "Intl_Plan")

# Display the plot
plt.show()

It looks like customers who churn make more customer service calls than those who don't. Among churners, those without an international plan make more customer service calls.

Preprocessing for Churn Modeling

Having explored your data, it's now time to preprocess it and get it ready for machine learning. Learn the why, what, and how of preprocessing, including feature selection and feature engineering.

  • 2.1 Data Preprocessing
    • Identifying features to convert
    • Encoding binary features
    • One hot encoding
    • Feature Scaling
  • 2.2 Feature Selection and Engineering
    • Dropping unnecessary features
    • Engineering a new column

Data Preprocessing

Identify features to convert

print(telco.dtypes.sort_values(), len(telco.dtypes))
""" there are some object datatype: Churn, State, Intl_Plan, Vmail_Plan, Phone, while the rest are numeric.  """ 
Account_Length      int64
Intl_Calls          int64
Night_Calls         int64
Eve_Calls           int64
Area_Code           int64
CustServ_Calls      int64
Day_Calls           int64
Vmail_Message       int64
Night_Mins        float64
Day_Charge        float64
Eve_Mins          float64
Eve_Charge        float64
Day_Mins          float64
Night_Charge      float64
Intl_Charge       float64
Intl_Mins         float64
Churn              object
Intl_Plan          object
Vmail_Plan         object
State              object
Phone              object
dtype: object 21

Encoding binary features

import time
######### Encoding binary features
print(telco['Intl_Plan'].head())
""" since this column only includes two string values ("yes" and  "no"), we can encode it as a binary feature."""
df_1, df_2 = telco['Intl_Plan'].copy(), telco['Intl_Plan'].copy()

# Method 1: replace
t1 = time.time()
for _ in range(1000):
    df_1.copy().replace({'no':0, 'yes':1})
print("Method 1 took {}s".format(time.time()-t1))

# Method 2: LabelEncoder (much faster when iterating over a large number of features)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

t2 = time.time()
for _ in range(1000):
    # LabelEncoder().fit_transform(df_2.copy())
    le.fit_transform(df_2.copy())
print("Method 2 took {}s".format(time.time()-t2))

# Replace 'no' with 0 and 'yes' with 1 in 'Intl_Plan'
telco["Intl_Plan"] = le.fit_transform(telco["Intl_Plan"])
print(telco['Intl_Plan'].head())
# Replace 'no' with 0 and 'yes' with 1 in 'Vmail_Plan'
telco['Vmail_Plan'] =  le.fit_transform(telco["Vmail_Plan"])

# Replace 'no' with 0 and 'yes' with 1 in 'Churn'
test_le = LabelEncoder()
le_fitted = test_le.fit_transform(telco["Churn"])
telco["Churn"] = test_le.fit_transform(telco["Churn"])
0     no
1     no
2     no
3    yes
4    yes
Name: Intl_Plan, dtype: object
Method 1 took 1.096604585647583s
Method 2 took 0.45363402366638184s
0    0
1    0
2    0
3    1
4    1
Name: Intl_Plan, dtype: int32
# test_le was fitted on 'Churn', but both columns share the same {'no', 'yes'} classes
inverted = test_le.inverse_transform(telco['Vmail_Plan'])
inverted
array(['yes', 'yes', 'no', ..., 'no', 'no', 'yes'], dtype=object)
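The reason the 'no'/'yes' mapping comes out as 0/1 is that LabelEncoder sorts the unique values before assigning integers. A minimal sketch on a toy Series:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(['no', 'yes', 'no'])

le = LabelEncoder()
encoded = le.fit_transform(s).tolist()

# classes_ is sorted alphabetically, so 'no' maps to 0 and 'yes' to 1
print(list(le.classes_))  # ['no', 'yes']
print(encoded)            # [0, 1, 0]
```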

One hot encoding

"""
telco['State'].head(4)
0    KS
1    OH
2    NJ
3    OH
Name: State, dtype: objec

it would be a bad idea to encode them as int number for each stage --> recomend to use One Hot Encoding
pandas has a get_dummies() function which automatically applies one hot encoding over the selected feature.
"""
# Import pandas
import pandas as pd

# Perform one hot encoding on 'State'
telco_state = pd.get_dummies(telco['State'])

display(telco_state.head())
AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO ... MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 51 columns
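A typical next step (not shown in the course code) is to join the dummy columns back onto the frame and drop the original 'State' column. A sketch on a hypothetical toy frame:

```python
import pandas as pd

# toy frame (hypothetical rows) standing in for telco
telco = pd.DataFrame({'State': ['KS', 'OH', 'NJ'],
                      'Day_Mins': [265.1, 161.6, 243.4]})

telco_state = pd.get_dummies(telco['State'])

# join the encoded columns back and drop the original 'State' column
telco = pd.concat([telco.drop('State', axis=1), telco_state], axis=1)
print(telco.columns.tolist())  # ['Day_Mins', 'KS', 'NJ', 'OH']
```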

Feature Scaling

""" We can see that originally the features are not on the same scale.
For ex: Intl_Calls range from 0 to 20, while for example, Night_Mins is from 23 to 395
"""
"""Standardization 
- Centers the distribution around the mean 
- Calculates the number of std away from the mean each oint is"""
print(telco[['Intl_Calls','Night_Mins']].describe())

from sklearn.preprocessing import StandardScaler
telco_scaled= StandardScaler().fit_transform(telco[['Intl_Calls','Night_Mins']])
# Add column names back for readability
telco[['Intl_Calls','Night_Mins']] = pd.DataFrame(telco_scaled, columns=["Intl_Calls", "Night_Mins"])

print(telco[['Intl_Calls','Night_Mins']].describe())
""" Both features are now on the same scale. In practice, you'll need to carefully ensure this is the case for all features of interest. """
        Intl_Calls   Night_Mins
count  3333.000000  3333.000000
mean      4.479448   200.872037
std       2.461214    50.573847
min       0.000000    23.200000
25%       3.000000   167.000000
50%       4.000000   201.200000
75%       6.000000   235.300000
max      20.000000   395.000000
         Intl_Calls    Night_Mins
count  3.333000e+03  3.333000e+03
mean  -8.527366e-18  7.887813e-17
std    1.000150e+00  1.000150e+00
min   -1.820289e+00 -3.513648e+00
25%   -6.011951e-01 -6.698545e-01
50%   -1.948306e-01  6.485803e-03
75%    6.178983e-01  6.808485e-01
max    6.307001e+00  3.839081e+00
" Both features are now on the same scale. In practice, you'll need to carefully ensure this is the case for all features of interest. "

Feature Selection and Engineering

Dropping unnecessary features

  • Unique identifiers such as phone numbers, Social Security numbers, and account numbers should be dropped. Example: telco.drop(['Soc_Sec', 'Tax_ID'], axis=1)

Dropping correlated features

  • Highly correlated features can be dropped
  • They provide no additional information to the model

Some features, such as 'Area_Code' and 'Phone', are not useful for predicting customer churn and need to be dropped prior to modeling. The easiest way to do so in Python is with the .drop() method of pandas DataFrames, just as you saw in the video, where 'Soc_Sec' and 'Tax_ID' were dropped:

telco.drop(['Soc_Sec', 'Tax_ID'], axis=1)

Here, axis=1 indicates that you want to drop 'Soc_Sec' and 'Tax_ID' from the columns.

telco.corr()
Account_Length Vmail_Message Day_Mins Eve_Mins Night_Mins Intl_Mins CustServ_Calls Churn Intl_Plan Vmail_Plan Day_Calls Day_Charge Eve_Calls Eve_Charge Night_Calls Night_Charge Intl_Calls Intl_Charge Area_Code
Account_Length 1.000000 -0.004628 0.006216 -0.006757 -0.008955 0.009514 -0.003796 0.016541 0.024735 0.002918 0.038470 0.006214 0.019260 -0.006745 -0.013176 -0.008960 0.020661 0.009546 -0.012463
Vmail_Message -0.004628 1.000000 0.000778 0.017562 0.007681 0.002856 -0.013263 -0.089728 0.008745 0.956927 -0.009548 0.000776 -0.005864 0.017578 0.007123 0.007663 0.013957 0.002884 -0.001994
Day_Mins 0.006216 0.000778 1.000000 0.007043 0.004323 -0.010155 -0.013423 0.205151 0.049396 -0.001684 0.006750 1.000000 0.015769 0.007029 0.022972 0.004300 0.008033 -0.010092 -0.008264
Eve_Mins -0.006757 0.017562 0.007043 1.000000 -0.012584 -0.011035 -0.012985 0.092796 0.019100 0.021545 -0.021451 0.007050 -0.011430 1.000000 0.007586 -0.012593 0.002541 -0.011067 0.003580
Night_Mins -0.008955 0.007681 0.004323 -0.012584 1.000000 -0.015207 -0.009288 0.035493 -0.028905 0.006079 0.022938 0.004324 -0.002093 -0.012592 0.011204 0.999999 -0.012353 -0.015180 -0.005825
Intl_Mins 0.009514 0.002856 -0.010155 -0.011035 -0.015207 1.000000 -0.009640 0.068239 0.045871 -0.001318 0.021565 -0.010157 0.008703 -0.011043 -0.013605 -0.015214 0.032304 0.999993 -0.018288
CustServ_Calls -0.003796 -0.013263 -0.013423 -0.012985 -0.009288 -0.009640 1.000000 0.208750 -0.024522 -0.017824 -0.018942 -0.013427 0.002423 -0.012987 -0.012802 -0.009277 -0.017561 -0.009675 0.027572
Churn 0.016541 -0.089728 0.205151 0.092796 0.035493 0.068239 0.208750 1.000000 0.259852 -0.102148 0.018459 0.205151 0.009233 0.092786 0.006141 0.035496 -0.052844 0.068259 0.006174
Intl_Plan 0.024735 0.008745 0.049396 0.019100 -0.028905 0.045871 -0.024522 0.259852 1.000000 0.006006 0.003755 0.049398 0.006114 0.019106 0.012451 -0.028913 0.017366 0.045780 0.048551
Vmail_Plan 0.002918 0.956927 -0.001684 0.021545 0.006079 -0.001318 -0.017824 -0.102148 0.006006 1.000000 -0.011086 -0.001686 -0.006444 0.021559 0.015553 0.006064 0.007618 -0.001276 -0.000747
Day_Calls 0.038470 -0.009548 0.006750 -0.021451 0.022938 0.021565 -0.018942 0.018459 0.003755 -0.011086 1.000000 0.006753 0.006462 -0.021449 -0.019557 0.022927 0.004574 0.021666 -0.009646
Day_Charge 0.006214 0.000776 1.000000 0.007050 0.004324 -0.010157 -0.013427 0.205151 0.049398 -0.001686 0.006753 1.000000 0.015769 0.007036 0.022972 0.004301 0.008032 -0.010094 -0.008264
Eve_Calls 0.019260 -0.005864 0.015769 -0.011430 -0.002093 0.008703 0.002423 0.009233 0.006114 -0.006444 0.006462 0.015769 1.000000 -0.011423 0.007710 -0.002056 0.017434 0.008674 -0.011886
Eve_Charge -0.006745 0.017578 0.007029 1.000000 -0.012592 -0.011043 -0.012987 0.092786 0.019106 0.021559 -0.021449 0.007036 -0.011423 1.000000 0.007596 -0.012601 0.002541 -0.011074 0.003607
Night_Calls -0.013176 0.007123 0.022972 0.007586 0.011204 -0.013605 -0.012802 0.006141 0.012451 0.015553 -0.019557 0.022972 0.007710 0.007596 1.000000 0.011188 0.000305 -0.013630 0.016522
Night_Charge -0.008960 0.007663 0.004300 -0.012593 0.999999 -0.015214 -0.009277 0.035496 -0.028913 0.006064 0.022927 0.004301 -0.002056 -0.012601 0.011188 1.000000 -0.012329 -0.015186 -0.005845
Intl_Calls 0.020661 0.013957 0.008033 0.002541 -0.012353 0.032304 -0.017561 -0.052844 0.017366 0.007618 0.004574 0.008032 0.017434 0.002541 0.000305 -0.012329 1.000000 0.032372 -0.024179
Intl_Charge 0.009546 0.002884 -0.010092 -0.011067 -0.015180 0.999993 -0.009675 0.068259 0.045780 -0.001276 0.021666 -0.010094 0.008674 -0.011074 -0.013630 -0.015186 0.032372 1.000000 -0.018395
Area_Code -0.012463 -0.001994 -0.008264 0.003580 -0.005825 -0.018288 0.027572 0.006174 0.048551 -0.000747 -0.009646 -0.008264 -0.011886 0.003607 0.016522 -0.005845 -0.024179 -0.018395 1.000000
plt.figure(figsize=(10, 10), dpi=128)

# note: on pandas >= 2.0, use telco.corr(numeric_only=True), since 'State' is still an object column
ax = sns.heatmap(telco.corr(), annot=True, fmt='.1f')
ax.xaxis.set_ticks_position('top')
plt.xticks(rotation=45, ha='left', rotation_mode='anchor')

sns.set(rc={'figure.figsize': (100, 100)})  # <-- changes the default size of seaborn plots
plt.show()
telco = telco.drop(['Area_Code', 'Phone'], axis=1)

# Verify dropped features
print(telco.columns)
print(telco.dtypes.sort_values(), len(telco.dtypes))
Index(['Account_Length', 'Vmail_Message', 'Day_Mins', 'Eve_Mins', 'Night_Mins', 'Intl_Mins', 'CustServ_Calls', 'Churn', 'Intl_Plan', 'Vmail_Plan', 'Day_Calls', 'Day_Charge', 'Eve_Calls', 'Eve_Charge', 'Night_Calls', 'Night_Charge', 'Intl_Calls', 'Intl_Charge', 'State'], dtype='object')
Vmail_Plan          int32
Intl_Plan           int32
Churn               int32
Night_Calls         int64
Eve_Calls           int64
Day_Calls           int64
Account_Length      int64
Vmail_Message       int64
CustServ_Calls      int64
Intl_Mins         float64
Night_Mins        float64
Intl_Charge       float64
Eve_Mins          float64
Day_Charge        float64
Day_Mins          float64
Eve_Charge        float64
Night_Charge      float64
Intl_Calls        float64
State              object
dtype: object 19

Feature engineering

Leveraging domain knowledge to engineer new features is an essential part of modeling. In this exercise, you'll create a new feature containing the average length of the night calls made by each customer.

  • Creating a new feature for the model
# Example engineered feature
telco['Day_Cost'] = telco['Day_Mins'] / telco['Day_Charge']
# Create the new feature: average minutes per night call
telco['Avg_Night_Calls'] = telco['Night_Mins'] / telco['Night_Calls']
print(telco.dtypes.sort_values(), len(telco.dtypes))
Vmail_Plan           int32
Intl_Plan            int32
Churn                int32
Account_Length       int64
Night_Calls          int64
Eve_Calls            int64
Day_Calls            int64
Vmail_Message        int64
CustServ_Calls       int64
Intl_Mins          float64
Night_Mins         float64
Eve_Mins           float64
Day_Cost           float64
Day_Charge         float64
Day_Mins           float64
Eve_Charge         float64
Night_Charge       float64
Intl_Calls         float64
Intl_Charge        float64
Avg_Night_Calls    float64
State               object
dtype: object 21
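Ratio features like Avg_Night_Calls produce inf when the denominator is zero. A hedged safeguard (precautionary only here, since the real data's minimum Night_Calls is 33) is to map zero counts to NaN first:

```python
import numpy as np
import pandas as pd

# toy rows (hypothetical values) standing in for the real data
df = pd.DataFrame({'Night_Mins': [244.7, 254.4, 162.6],
                   'Night_Calls': [91, 0, 104]})

# replacing zero counts with NaN yields NaN instead of inf for that row
df['Avg_Night_Calls'] = df['Night_Mins'] / df['Night_Calls'].replace(0, np.nan)
print(df['Avg_Night_Calls'].tolist())
```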

Churn Prediction

With your data preprocessed and ready for machine learning, it's time to predict churn! Learn how to build supervised learning machine models in Python using scikit-learn.

  • 3.1 Making Predictions
    • Predicting whether a new customer will churn
    • Training another sklearn model
  • 3.2 Evaluating Model Performance
    • Creating training and test sets
    • Checking each set's length
    • Computing accuracy
  • 3.3 Model Metrics
    • Confusion matrix
    • Varying training set size
    • Computing precision and recall
  • 3.4 Other Metrics
    • ROC curve
    • AUC (Area Under the Curve)
    • Precision and Recall Curve
    • F1 Score

Making Predictions

Predicting whether a new customer will churn

features = ['Account_Length', 'Vmail_Message', 'Day_Mins', 'Eve_Mins', 'Night_Mins', 'Intl_Mins', 'CustServ_Calls', 'Intl_Plan', 'Vmail_Plan',
            'Day_Calls', 'Day_Charge', 'Eve_Calls', 'Eve_Charge','Night_Calls', 'Night_Charge', 'Intl_Calls', 'Intl_Charge']


from sklearn.model_selection import train_test_split
X, y = telco[features], telco['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

from sklearn.svm import SVC

svc = SVC()
svc.fit(X_train, y_train)


# this customer value was introduced in the example:
new_customer = X_test[(X_test.Account_Length == 91) & (X_test.Vmail_Message == 23)]

prediction = svc.predict(new_customer)
print(prediction)

print(svc.score(X_test, y_test))
[0]
0.8590704647676162

Training another scikit-learn model

All sklearn models have .fit() and .predict() methods like the ones you used in the previous exercise. This consistency allows you to easily try many different models to see which one gives you the best performance. To get more confident with the sklearn API, in this exercise you'll fit a DecisionTreeClassifier.

from sklearn.tree import DecisionTreeClassifier

# Instantiate the classifier
clf = DecisionTreeClassifier()

# Fit the classifier
clf.fit(X_train,y_train)

# this customer value was introduced in the example:
new_customer = X_test[(X_test.Account_Length == 91) & (X_test.Vmail_Message == 23)]

# Predict the label of new_customer
print(clf.predict(new_customer))

print(clf.score(X_test, y_test))
[0]
0.9205397301349325

Evaluating Model Performance

Creating training and test sets

Check each set's length

from sklearn.model_selection import train_test_split

# we select the below features to reproduce the results in the example
features = ['Account_Length', 'Vmail_Message', 'Day_Mins', 'Eve_Mins', 'Night_Mins', 'Intl_Mins', 'CustServ_Calls', 'Intl_Plan', 'Vmail_Plan', 'Day_Calls', 'Day_Charge', 'Eve_Calls', 'Eve_Charge','Night_Calls', 'Night_Charge', 'Intl_Calls', 'Intl_Charge']

# Create feature variable
X = telco[features]  # don't use telco.drop('Churn', axis=1) alone; other object columns would remain
# Create target variable
y = telco['Churn']

# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# check each set length
print(len(X_train), len(X_test), len(y_train), len(y_test))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
2333 1000 2333 1000
(2333, 17) (1000, 17) (2333,) (1000,)

Computing Accuracy

Use a RandomForestClassifier, an ensemble of decision trees that generally outperforms a single decision tree.

print(X_train.dtypes, len(X_train.dtypes))
Account_Length      int64
Vmail_Message       int64
Day_Mins          float64
Eve_Mins          float64
Night_Mins        float64
Intl_Mins         float64
CustServ_Calls      int64
Intl_Plan           int32
Vmail_Plan          int32
Day_Calls           int64
Day_Charge        float64
Eve_Calls           int64
Eve_Charge        float64
Night_Calls         int64
Night_Charge      float64
Intl_Calls        float64
Intl_Charge       float64
dtype: object 17
from sklearn.ensemble import RandomForestClassifier

# Instantiate the classifier
clf = RandomForestClassifier()

# Fit to the training data
clf.fit(X_train, y_train)

# Compute accuracy
print(clf.score(X_test, y_test))
0.957

Model Metrics

Confusion Matrix

Computing precision and recall

The sklearn.metrics submodule has many functions that allow you to easily calculate interesting metrics. So far, you've calculated precision and recall by hand - this is important while you develop your intuition for both these metrics.

In practice, once you do, you can leverage the precision_score and recall_score functions that automatically compute precision and recall, respectively. Both work similarly to other functions in sklearn.metrics - they accept 2 arguments: the first is the actual labels (y_test), and the second is the predicted labels (y_pred).

Let's now try a training size of 90%.

from sklearn.model_selection import train_test_split

# Create training and testing sets,  try a training size of 90%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Instantiate the classifier
clf = RandomForestClassifier()

# Fit to the training data
clf.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = clf.predict(X_test)


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Import precision_score
from sklearn.metrics import precision_score, recall_score, f1_score

# Print the precision
print(precision_score(y_test, y_pred))
# Print the recall
print(recall_score(y_test, y_pred))
# Print the F1 score
print(f1_score(y_test, y_pred))
[[273   5]
 [ 14  42]]
0.8936170212765957
0.75
0.8155339805825244
""" 
[[True Negative (TN), False Positive(FP)],
 [False Negative (FN), True Positive(TP)]]

Precision   = TP / (TP + FP)
Recall      = TP / (TP + FN)
"""
print("Presition: {}".format(43/(43+2)))
print("Recall: {}".format(43/(43+8)))
Presition: 0.9555555555555556
Recall: 0.8431372549019608
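The four counts can also be pulled out of the matrix idiomatically with .ravel(), using the confusion matrix printed above:

```python
import numpy as np

# confusion matrix from the run above; sklearn's convention: rows = actual, columns = predicted
cm = np.array([[273, 5],
               [14, 42]])

tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)  # 0.8936170212765957 0.75
```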

Other Model Metrics

  • 3.4.1 ROC curve
  • 3.4.2 AUC (Area Under the Curve) -> the bigger the area, the better the model.
  • 3.4.3 Precision and Recall Curve
  • 3.4.4 F1 Score

ROC curve

Let's now create an ROC curve for our random forest classifier. The first step is to calculate the predicted probabilities output by the classifier for each label using its .predict_proba() method. Then, you can use the roc_curve function from sklearn.metrics to compute the false positive rate and true positive rate, which you can then plot using matplotlib.

A RandomForestClassifier with a training set size of 90% (the split created above) has been fit to the data and is available in your workspace as clf.

y_pred_prob = clf.predict_proba(X_test)[:, 1]


# Import roc_curve
from sklearn.metrics import roc_curve

# Calculate the roc metrics for the model RandomForestClassifier (right above)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

plt.figure(figsize=(10,10), dpi=80)

# Plot the ROC curve
plt.plot(fpr, tpr)

# Add labels and diagonal line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot([0, 1], [0, 1], "k--")
plt.show()

Area under the Curve

A perfect model will have AUC = 1, while a model no better than random guessing has AUC = 0.5.

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_test, y_pred_prob)
print("AUC: {}".format(auc))
AUC: 0.9405511305241522

Precision-Recall Curve

Another way to evaluate model performance is using a precision-recall curve, which shows the tradeoff between precision and recall for different thresholds.

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score

# Calculate the precision-recall metrics for the model
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)

plt.figure(figsize=(10,10), dpi=80)

# Plot the precision-recall curve
plt.plot(recall, precision)



plt.xlabel('Recall')
plt.ylabel('Precision')

# show the plot
plt.show()

F1 score

As you've discovered, there's a tradeoff between precision and recall. Both are important metrics, and depending on how the business is trying to model churn, you may want to focus on optimizing one over the other. Often, stakeholders are interested in a single metric that can quantify model performance. The AUC is one metric you can use in these cases, and another is the F1 score, which is calculated as below:

2 * (precision * recall) / (precision + recall)

The advantage of the F1 score is that it incorporates both precision and recall into a single metric, and a high F1 score is a sign of a well-performing model, even in situations where you might have imbalanced classes. In scikit-learn, you can compute the F1 score using the f1_score function.
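As a quick sanity check, plugging the precision and recall from the earlier confusion-matrix run into this formula reproduces the f1_score value printed there:

```python
# precision and recall taken from the earlier confusion-matrix run
precision, recall = 0.8936170212765957, 0.75

f1 = 2 * (precision * recall) / (precision + recall)
print(f1)  # ~0.8155, matching the f1_score output for that run
```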

clf = RandomForestClassifier()

# Fit to the training data
clf.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = clf.predict(X_test)

# Import f1_score
from sklearn.metrics import f1_score

# Print the F1 score
print(f1_score(y_test, y_pred))
0.854368932038835
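You can verify the formula above by computing precision and recall directly on a toy prediction and plugging them in:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

# f1_score matches 2 * (precision * recall) / (precision + recall)
print(f1_score(y_true, y_pred))
print(2 * (p * r) / (p + r))
```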

Model Tuning

Learn how to improve the performance of your models using hyperparameter tuning and gain a better understanding of the drivers of customer churn that you can take back to the business.

  • 4.1 Tuning your model:
    • Tuning the number of features
    • Tuning other hyperparameters
    • Randomized Search
  • 4.2 Feature Importance
    • Visualize feature importances
    • Improving the plot
    • Interpreting feature importances
  • 4.3 Adding New Features
    • Does model performance improve?
    • Computing other metrics

Tuning

Tuning the number of trees + features

""" Grid Search in sklearn """
from sklearn.model_selection import GridSearchCV
import numpy as np

# grid parameters
# grid parameters
param_grid = {  'n_estimators': np.arange(10, 51),          # n_estimators: controls the number of trees in the forest.
                'max_features': ['auto', 'sqrt', 'log2'],   # max_features: controls the number of features to consider when looking for the best split.
                                                            # (note: 'auto' was removed in scikit-learn 1.3; use 'sqrt' on recent versions)
            }

# Instantiate the grid search model
clf_cv = GridSearchCV(RandomForestClassifier(), param_grid)

clf_cv.fit(X_train, y_train)
clf_cv.best_params_
{'max_features': 'auto', 'n_estimators': 43}
clf_cv.best_score_
0.9549849749582637
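That score takes a while to produce because GridSearchCV fits one model per parameter combination per cross-validation fold (5 folds by default). You can count the candidate combinations with sklearn's ParameterGrid:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {'n_estimators': np.arange(10, 51),
              'max_features': ['auto', 'sqrt', 'log2']}

n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)       # 41 * 3 = 123 candidate combinations
print(n_candidates * 5)   # x 5 CV folds = 615 model fits
```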

Tuning other hyperparameters

The power of GridSearchCV really comes into play when you're tuning multiple hyperparameters, as then the algorithm tries out all possible combinations of hyperparameters to identify the best combination. Here, you'll tune the following random forest hyperparameters:

| Hyperparameter | Purpose |
|----------------|---------|
| criterion | Quality of the split |
| max_features | Number of features to consider for the best split |
| max_depth | Maximum depth of the tree |
| bootstrap | Whether bootstrap samples are used |

The hyperparameter grid has been specified for you, along with a random forest classifier called clf.

other_param_grid = {'max_depth': [3, None],
                    'max_features': [1,3,10],
                    'bootstrap': [True, False],
                    'criterion': ['gini', 'entropy'],
                    } 

other_grid_search = GridSearchCV(RandomForestClassifier(), other_param_grid)

other_grid_search.fit(X_train, y_train)
GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'bootstrap': [True, False],
                         'criterion': ['gini', 'entropy'],
                         'max_depth': [3, None], 'max_features': [1, 3, 10]})
print(other_grid_search.best_params_)
print(other_grid_search.best_score_)
{'bootstrap': True, 'criterion': 'entropy', 'max_depth': None, 'max_features': 10}
0.9549860879243184
# Call GridSearchCV
grid_search = GridSearchCV(clf, param_grid)

# Fit the model
grid_search.fit(X, y)

In the above chunk of code from the previous exercise, you may have noticed that the first line of code did not take much time to run, while the call to .fit() took several seconds to execute.

This is because .fit() is what actually performs the grid search, and in our case it was a grid with many different combinations. As the hyperparameter grid gets larger, grid search becomes slower. To get around this, instead of trying out every single combination of values, we could randomly sample combinations from the grid. There's a small chance we'll miss the best combination, but we'd save a lot of time, or be able to tune more hyperparameters in the same amount of time.

In scikit-learn, you can do this using RandomizedSearchCV. It has the same API as GridSearchCV, except that you need to specify a parameter distribution that it can sample from instead of specific hyperparameter values. Let's try it out now! The parameter distribution has been set up for you, along with a random forest classifier called clf.

from sklearn.model_selection import RandomizedSearchCV


# Create the hyperparameter grid
param_dist = {"max_depth": [3, None],
              "max_features": [1, 5, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

rf = RandomForestClassifier()
# Call RandomizedSearchCV
random_search = RandomizedSearchCV(rf, param_dist)

# Fit the model
random_search.fit(X_train, y_train)

# Print best parameters
print(random_search.best_params_)
{'max_features': 5, 'max_depth': None, 'criterion': 'entropy', 'bootstrap': False}
print(random_search.best_score_)
0.9563199777406789
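Because RandomizedSearchCV samples candidates rather than enumerating them, you can also pass scipy distributions instead of fixed lists, and cap the number of sampled candidates with n_iter. A minimal sketch on synthetic data (the dataset and settings here are illustrative, not from the course):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# n_estimators is sampled from a discrete distribution over [10, 50]
param_dist = {"n_estimators": randint(10, 51),
              "max_depth": [3, None]}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)

print(search.best_params_)
```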

Feature Importances

  • Scores representing how much each feature contributes to a prediction
  • Which features matter most to the model

random_search.best_estimator_.feature_importances_
array([0.02768986, 0.02973571, 0.1284745 , 0.06796968, 0.04463708,
       0.04414193, 0.13070052, 0.08343848, 0.01765399, 0.02959724,
       0.13125914, 0.02546774, 0.06996309, 0.0267514 , 0.04178584,
       0.05953731, 0.0411965 ])

Visualizing feature importances

Your random forest classifier from earlier exercises has been fit to the telco data and is available to you as clf. Let's visualize the feature importances and get a sense for what the drivers of churn are, using matplotlib's barh to create a horizontal bar plot of feature importances.

importances = clf.feature_importances_

plt.figure(figsize=(8,4), dpi=80)

# Create plot
plt.barh(range(X_train.shape[1]), importances)
plt.show()

Improving the above plot

To make the plot more readable:

- Re-order the bars in ascending order.
- Add labels to the plot that correspond to the feature names.

To do this, we'll take advantage of NumPy indexing. The .argsort() method returns the indices that would sort an array in ascending order. We'll use these indices to achieve both goals.
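A quick toy example of what .argsort() returns:

```python
import numpy as np

importances = np.array([0.2, 0.05, 0.75])

# Indices that would sort the array in ascending order
order = np.argsort(importances)
print(order)               # [1 0 2]
print(importances[order])  # sorted values: 0.05, 0.2, 0.75
```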

sorted_index = np.argsort(importances)

# create labels
labels = X_train.columns.values[sorted_index]

plt.figure(figsize=(8,6), dpi=80)

# Horizontal bar plot with bars in ascending order and feature names as tick labels
plt.barh(range(X_train.shape[1]), importances[sorted_index], tick_label=labels)
plt.show()

Adding new features

  • Does model performance improve?
  • Computing other metrics

Does model performance improve?

6 new features have been added to the telco DataFrame:

- Region_Code
- Cost_Call
- Total_Charge
- Total_Minutes
- Total_Calls
- Min_Call
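The course doesn't show how these features were built, but given the column names in the dataset, a plausible sketch is the following; the exact definitions are an assumption (Region_Code, presumably derived from State or Area_Code, is omitted):

```python
import pandas as pd

# Two sample rows with the telco charge/minute/call columns
telco = pd.DataFrame({
    "Day_Charge": [45.07, 27.47], "Eve_Charge": [16.78, 16.62],
    "Night_Charge": [11.01, 11.45], "Intl_Charge": [2.70, 3.70],
    "Day_Mins": [265.1, 161.6], "Eve_Mins": [197.4, 195.5],
    "Night_Mins": [244.7, 254.4], "Intl_Mins": [10.0, 13.7],
    "Day_Calls": [110, 123], "Eve_Calls": [99, 103],
    "Night_Calls": [91, 103], "Intl_Calls": [3, 3],
})

# Assumed definitions: totals across the day/evening/night/international columns
telco["Total_Charge"] = telco[["Day_Charge", "Eve_Charge", "Night_Charge", "Intl_Charge"]].sum(axis=1)
telco["Total_Minutes"] = telco[["Day_Mins", "Eve_Mins", "Night_Mins", "Intl_Mins"]].sum(axis=1)
telco["Total_Calls"] = telco[["Day_Calls", "Eve_Calls", "Night_Calls", "Intl_Calls"]].sum(axis=1)

# Assumed per-call ratios
telco["Cost_Call"] = telco["Total_Charge"] / telco["Total_Calls"]
telco["Min_Call"] = telco["Total_Minutes"] / telco["Total_Calls"]

print(telco[["Total_Charge", "Total_Minutes", "Total_Calls", "Cost_Call", "Min_Call"]])
```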

Computing other metrics

In addition to accuracy, let's also compute the F1 score of this new model to get a better picture of model performance.

A 70-30 train-test split has already been done for you, and all necessary modules have been imported.

from sklearn.metrics import f1_score

# Instantiate the classifier
clf = RandomForestClassifier()

# Fit to the data
clf.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = clf.predict(X_test)

# Print the F1 score
print(f1_score(y_test, y_pred))
0.8282828282828282
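If you want several metrics at once, scikit-learn's classification_report prints precision, recall, and F1 per class (toy labels here for illustration):

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

# One row of precision/recall/F1 per class, plus averages
print(classification_report(y_true, y_pred, target_names=["no churn", "churn"]))
```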