Linear regression and logistic regression are two of the most widely used statistical models. They act like master keys, unlocking the secrets hidden in your data. In this course, you’ll gain the skills you need to fit simple linear and logistic regressions. Through hands-on exercises, you’ll explore the relationships between variables in real-world datasets, including motor insurance claims, Taiwan house prices, fish sizes, and more. By the end of this course, you’ll know how to make predictions from your data, quantify model performance, and diagnose problems with model fit.

import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.formula.api import ols

plt.rcParams['figure.figsize'] = [10, 7.5]

pd.set_option('display.expand_frame_repr', False)

warnings.filterwarnings("ignore")

Simple Linear Regression Modeling

You’ll learn the basics of this popular statistical model, what regression is, and how linear and logistic regressions differ. You’ll then learn how to fit simple linear regression models with numeric and categorical explanatory variables, and how to describe the relationship between the response and explanatory variables using model coefficients.

A tale of two variables

  • Jargon:

    • Response variable (a.k.a. dependent variable)
      • The variable that you want to predict.
    • Explanatory variables (a.k.a. independent variables)
      • The variables that explain how the response variable will change.
  • Linear regression and logistic regression:

    • Linear regression
      • The response variable is numeric.
    • Logistic regression
      • The response variable is logical (binary).
    • Simple linear/logistic regression
      • There is only one explanatory variable.
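
As a minimal sketch of the difference (using a tiny synthetic DataFrame invented here purely for illustration), both model types use the same formula interface from statsmodels: ols() for a numeric response and logit() for a binary one.

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols, logit

# Hypothetical data: one explanatory variable, one numeric and one binary response
rng = np.random.default_rng(1)
sim = pd.DataFrame({"x": rng.normal(size=100)})
sim["y_numeric"] = 2 + 3 * sim["x"] + rng.normal(size=100)
sim["y_binary"] = (sim["x"] + rng.normal(size=100) > 0).astype(int)

# Simple linear regression: numeric response, one explanatory variable
print(ols("y_numeric ~ x", data=sim).fit().params)

# Simple logistic regression: binary response, one explanatory variable
print(logit("y_binary ~ x", data=sim).fit().params)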

Which one is the response variable?


Visualizing two numeric variables

Before you can run any statistical models, it's usually a good idea to visualize your dataset. Here, you'll look at the relationship between house price per area and the number of nearby convenience stores using the Taiwan real estate dataset.

One challenge in this dataset is that the number of convenience stores contains integer data, causing points to overlap. To solve this, you will make the points transparent.

taiwan_real_estate is available as a pandas DataFrame.

Instructions:

  • Import the seaborn package, aliased as sns.
  • Using taiwan_real_estate, draw a scatter plot of "price_twd_msq" (y-axis) versus "n_convenience" (x-axis).
  • Draw a trend line calculated using linear regression. Omit the confidence interval ribbon. Note: The scatter_kws argument, pre-filled in the exercise, makes the data points 50% transparent.
taiwan_real_estate = pd.read_csv('./datasets/taiwan_real_estate2.csv')
display(taiwan_real_estate.head())
display(taiwan_real_estate.info())
dist_to_mrt_m n_convenience house_age_years price_twd_msq
0 84.87882 10 30 to 45 11.467474
1 306.59470 9 15 to 30 12.768533
2 561.98450 5 0 to 15 14.311649
3 561.98450 5 0 to 15 16.580938
4 390.56840 5 0 to 15 13.040847
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   dist_to_mrt_m    414 non-null    float64
 1   n_convenience    414 non-null    int64  
 2   house_age_years  414 non-null    object 
 3   price_twd_msq    414 non-null    float64
dtypes: float64(2), int64(1), object(1)
memory usage: 13.1+ KB
None
import seaborn as sns

# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

# Draw the scatter plot
sns.scatterplot(x="n_convenience",
                y="price_twd_msq",
                data=taiwan_real_estate)

# Draw a trend line on the scatter plot of price_twd_msq vs. n_convenience
sns.regplot(x="n_convenience",
            y="price_twd_msq",
            data=taiwan_real_estate,
            ci=None,
            scatter_kws={'alpha': 0.5},
            line_kws={"color": "red"})

# Show the plot
plt.show()

Fitting a linear regression


Estimate the intercept

Estimate the slope


Linear regression with ols()

While sns.regplot() can display a linear regression trend line, it doesn't give you access to the intercept and slope, or let you work with the model results as variables. That means that sometimes you'll need to run a linear regression yourself.

Time to run your first model!

taiwan_real_estate is available. TWD is an abbreviation for Taiwan dollars.

In addition, for this exercise and the remainder of the course, the following packages will be imported and aliased if necessary: matplotlib.pyplot as plt, seaborn as sns, and pandas as pd.
Instructions:

  • Import the ols() function from the statsmodels.formula.api package.
  • Run a linear regression with price_twd_msq as the response variable, n_convenience as the explanatory variable, and taiwan_real_estate as the dataset. Name it mdl_price_vs_conv.
  • Fit the model.
  • Print the parameters of the fitted model.
# Import the ols function
from statsmodels.formula.api import ols

# Create the model object
mdl_price_vs_conv = ols("price_twd_msq ~ n_convenience", data=taiwan_real_estate)

# Fit the model
mdl_price_vs_conv = mdl_price_vs_conv.fit()

# Print the parameters of the fitted model
print(mdl_price_vs_conv.params)
Intercept        8.224237
n_convenience    0.798080
dtype: float64

Question: The model had an Intercept coefficient of 8.2242. What does this mean?

Possible Answers:

  • On average, houses had a price of 8.2242 TWD per square meter.

  • On average, a house with zero convenience stores nearby had a price of 8.2242 TWD per square meter.

  • The minimum house price was 8.2242 TWD per square meter.

  • The minimum house price with zero convenience stores nearby was 8.2242 TWD per square meter.

  • The intercept tells you nothing about house prices.


Question: The model had an n_convenience coefficient of 0.7981. What does this mean?

Possible Answers:

  • If you increase the number of nearby convenience stores by one, then the expected increase in house price is 0.7981 TWD per square meter.

  • If you increase the house price by 0.7981 TWD per square meter, then the expected increase in the number of nearby convenience stores is one.

  • If you increase the number of nearby convenience stores by 0.7981, then the expected increase in house price is one TWD per square meter.

  • If you increase the house price by one TWD per square meter, then the expected increase in the number of nearby convenience stores is 0.7981.

  • The n_convenience coefficient tells you nothing about house prices.

The intercept is positive, so a house with no convenience stores nearby still has a positive price. The coefficient for convenience stores is also positive, so as the number of nearby convenience stores increases, so does the price of the house.

Categorical explanatory variables

Fish dataset:

  • Each row represents one fish.
  • There are 128 rows in the dataset.
  • There are 4 species of fish:
    • Common Bream
    • European Perch
    • Northern Pike
    • Common Roach
import matplotlib.pyplot as plt
import seaborn as sns

fish = pd.read_csv("./datasets/fish.csv")
sns.displot(data=fish,
            x="mass_g",
            col="species",
            col_wrap=2,
            bins=9)
plt.show()
summary_stats = fish.groupby("species")["mass_g"].mean()
print(summary_stats)
species
Bream    617.828571
Perch    382.239286
Pike     718.705882
Roach    152.050000
Name: mass_g, dtype: float64
from statsmodels.formula.api import ols
mdl_mass_vs_species = ols("mass_g ~ species", data=fish).fit()
print(mdl_mass_vs_species.params)
Intercept           617.828571
species[T.Perch]   -235.589286
species[T.Pike]     100.877311
species[T.Roach]   -465.778571
dtype: float64
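
To connect these coefficients back to the group means (a quick check using the model and summary statistics above), the intercept is the mean mass of the first category alphabetically (Bream), and each species coefficient is the difference between that species' mean and the Bream mean:

coeffs = mdl_mass_vs_species.params
print(coeffs["Intercept"])                               # Bream mean mass
print(coeffs["Intercept"] + coeffs["species[T.Perch]"])  # Perch mean mass
print(coeffs["Intercept"] + coeffs["species[T.Roach]"])  # Roach mean mass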


Visualizing numeric vs. categorical

If the explanatory variable is categorical, the scatter plot that you used before to visualize the data doesn't make sense. Instead, a good option is to draw a histogram for each category.

The Taiwan real estate dataset has a categorical variable in the form of the age of each house. The ages have been split into 3 groups: 0 to 15 years, 15 to 30 years, and 30 to 45 years.

Instructions:

  • Using taiwan_real_estate, plot a histogram of price_twd_msq with 10 bins. Split the plot by house_age_years to give 3 panels.
plt.rcParams['figure.figsize'] = [10, 10]
sns.displot(data=taiwan_real_estate,
            x="price_twd_msq",
            col="house_age_years",
            bins=10)

# Show the plot
plt.show()

Calculating means by category

A good way to explore categorical variables further is to calculate summary statistics for each category. For example, you can calculate the mean and median of your response variable, grouped by a categorical variable. This lets you compare the categories in more detail.

Here, you'll look at grouped means for the house prices in the Taiwan real estate dataset. This will help you understand the output of a linear regression with a categorical variable.

mean_price_by_age = taiwan_real_estate.groupby("house_age_years")["price_twd_msq"].mean()

# Print the result
print(mean_price_by_age)
house_age_years
0 to 15     12.637471
15 to 30     9.876743
30 to 45    11.393264
Name: price_twd_msq, dtype: float64

Linear regression with a categorical explanatory variable

Great job calculating those grouped means! As mentioned in the last video, the means of each category will also be the coefficients of a linear regression model with one categorical variable. You'll prove that in this exercise.

To run a linear regression model with categorical explanatory variables, you can use the same code as with numeric explanatory variables. The coefficients returned by the model are different, however. Here you'll run a linear regression on the Taiwan real estate dataset.

mdl_price_vs_age = ols("price_twd_msq ~ house_age_years", data=taiwan_real_estate).fit()

# Print the parameters of the fitted model
print(mdl_price_vs_age.params)
Intercept                      12.637471
house_age_years[T.15 to 30]    -2.760728
house_age_years[T.30 to 45]    -1.244207
dtype: float64
mdl_price_vs_age0 = ols("price_twd_msq ~ house_age_years + 0", data=taiwan_real_estate).fit()

# Print the parameters of the fitted model
print(mdl_price_vs_age0.params)
house_age_years[0 to 15]     12.637471
house_age_years[15 to 30]     9.876743
house_age_years[30 to 45]    11.393264
dtype: float64

The coefficients of the model are just the means of each category you calculated previously.

Predictions and model objects

In this chapter, you’ll discover how to use linear regression models to make predictions on Taiwanese house prices and Facebook advert clicks. You’ll also grow your regression skills as you get hands-on with model objects, understand the concept of "regression to the mean", and learn how to transform variables in a dataset.

Making predictions

bream = fish[fish["species"] == "Bream"]
print(bream.head())
  species  mass_g  length_cm
0   Bream   242.0       23.2
1   Bream   290.0       24.0
2   Bream   340.0       23.9
3   Bream   363.0       26.3
4   Bream   430.0       26.5
plt.rcParams['figure.figsize'] = [7.5, 5]
sns.regplot(x="length_cm",
            y="mass_g",
            data=bream,
            ci=None)
plt.show()
mdl_mass_vs_length = ols("mass_g ~ length_cm", data=bream).fit()
print(mdl_mass_vs_length.params)
Intercept   -1035.347565
length_cm      54.549981
dtype: float64
# What value would the response variable have for lengths of 20 to 40 cm?
explanatory_data = pd.DataFrame({"length_cm": np.arange(20, 41)})
explanatory_data[:5]
length_cm
0 20
1 21
2 22
3 23
4 24
display(mdl_mass_vs_length.predict(explanatory_data)[:5])
0     55.652054
1    110.202035
2    164.752015
3    219.301996
4    273.851977
dtype: float64
explanatory_data = pd.DataFrame({"length_cm": np.arange(20, 41)})

# Create a new DataFrame with a mass_g column of predictions added
prediction_data = explanatory_data.assign(mass_g=mdl_mass_vs_length.predict(explanatory_data))
display(prediction_data.head())
length_cm mass_g
0 20 55.652054
1 21 110.202035
2 22 164.752015
3 23 219.301996
4 24 273.851977
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure()

sns.regplot(x="length_cm",
            y="mass_g",
            ci=None,
            data=bream,)
            
sns.scatterplot(x="length_cm",
                y="mass_g",
                data=prediction_data,
                color="red",
                marker="s")
plt.show()
# Predicting outside the range of observed data (extrapolation)
little_bream = pd.DataFrame({"length_cm": [10]})
pred_little_bream = little_bream.assign(
    mass_g=mdl_mass_vs_length.predict(little_bream))
print(pred_little_bream)
   length_cm      mass_g
0         10 -489.847756
sns.regplot(x="length_cm",
            y="mass_g",
            ci=None,
            data=bream,)
            
sns.scatterplot(x="length_cm",
                y="mass_g",
                data=pred_little_bream,
                color="red",
                marker="s")
plt.show()

Predicting house prices

Perhaps the most useful feature of statistical models like linear regression is that you can make predictions. That is, you specify values for each of the explanatory variables, feed them to the model, and get a prediction for the corresponding response variable. The code flow is as follows.

explanatory_data = pd.DataFrame({"explanatory_var":list_of_values})predictions = model.predict(explanatory_data)
prediction_data = explanatory_data.assign(response_var=predictions)


Here, you'll make predictions for the house prices in the Taiwan real estate dataset.

taiwan_real_estate is available. The fitted linear regression model of house price versus number of convenience stores is available as mdl_price_vs_conv. For future exercises, when a model is available, it will also be fitted.

import numpy as np

# Create explanatory_data 
explanatory_data = pd.DataFrame({'n_convenience': np.arange(0, 11)})

# Use mdl_price_vs_conv to predict with explanatory_data, call it price_twd_msq
price_twd_msq = mdl_price_vs_conv.predict(explanatory_data)

# Create prediction_data
prediction_data = explanatory_data.assign(
    price_twd_msq = price_twd_msq)

# Print the result
display(prediction_data)
n_convenience price_twd_msq
0 0 8.224237
1 1 9.022317
2 2 9.820397
3 3 10.618477
4 4 11.416556
5 5 12.214636
6 6 13.012716
7 7 13.810795
8 8 14.608875
9 9 15.406955
10 10 16.205035

Visualizing predictions

The prediction DataFrame you created contains a column of explanatory variable values and a column of response variable values. That means you can plot it on the same scatter plot of response versus explanatory data values.

plt.rcParams['figure.figsize'] = [10, 8]
fig = plt.figure()

sns.regplot(x="n_convenience",
            y="price_twd_msq",
            data=taiwan_real_estate,
            ci=None)
# Add a scatter plot layer to the regplot
sns.scatterplot(x="n_convenience",
                y="price_twd_msq",
                data=prediction_data,
                color="red",
                marker="s")

# Show the layered plot
plt.show()

The limits of prediction

In the last exercise, you made predictions on some sensible, could-happen-in-real-life situations. That is, cases where the number of nearby convenience stores was between zero and ten. To test the limits of the model's ability to predict, try some impossible situations.

Use the console to try predicting house prices from mdl_price_vs_conv when there are -1 convenience stores. Do the same for 2.5 convenience stores. What happens in each case?

mdl_price_vs_conv is available.

impossible = pd.DataFrame({'n_convenience': [-1,2.5]})
mdl_price_vs_conv.predict(impossible)
0     7.426158
1    10.219437
dtype: float64

Question: Try making predictions on your two impossible cases. What happens?

Possible Answers:

  • The model throws an error because these cases are impossible in real life.

  • The model throws a warning because these cases are impossible in real life.

  • The model successfully gives a prediction about cases that are impossible in real life.

  • The model throws an error for -1 stores because it violates the assumptions of a linear regression. 2.5 stores gives a successful prediction.

  • The model throws an error for 2.5 stores because it violates the assumptions of a linear regression. -1 stores gives a successful prediction.

Linear models don't know what is possible or not in real life. That means that they can give you predictions that don't make any sense when applied to your data. You need to understand what your data means in order to determine whether a prediction is nonsense or not.

Working with model objects

from statsmodels.formula.api import ols
mdl_mass_vs_length = ols("mass_g ~ length_cm", data = bream).fit()
print(mdl_mass_vs_length.params)
Intercept   -1035.347565
length_cm      54.549981
dtype: float64
# Predicted response values
# print(mdl_mass_vs_length.fittedvalues)

# or equivalently
explanatory_data = bream["length_cm"]

print(np.array_equal(mdl_mass_vs_length.predict(explanatory_data), mdl_mass_vs_length.fittedvalues))
True
# Residuals: observed response values minus fitted values
# print(mdl_mass_vs_length.resid)

# or equivalently
print(np.array_equal(bream["mass_g"] - mdl_mass_vs_length.fittedvalues , mdl_mass_vs_length.resid))
True
mdl_mass_vs_length.summary()
OLS Regression Results
Dep. Variable: mass_g R-squared: 0.878
Model: OLS Adj. R-squared: 0.874
Method: Least Squares F-statistic: 237.6
Date: Wed, 16 Nov 2022 Prob (F-statistic): 1.22e-16
Time: 16:46:40 Log-Likelihood: -199.35
No. Observations: 35 AIC: 402.7
Df Residuals: 33 BIC: 405.8
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -1035.3476 107.973 -9.589 0.000 -1255.020 -815.676
length_cm 54.5500 3.539 15.415 0.000 47.350 61.750
Omnibus: 7.314 Durbin-Watson: 1.478
Prob(Omnibus): 0.026 Jarque-Bera (JB): 10.857
Skew: -0.252 Prob(JB): 0.00439
Kurtosis: 5.682 Cond. No. 263.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Extracting model elements

The model object created by ols() contains many elements. In order to perform further analysis on the model results, you need to extract its useful bits. The model coefficients, the fitted values, and the residuals are perhaps the most important pieces of the linear model object.

print(mdl_price_vs_conv.params)
Intercept        8.224237
n_convenience    0.798080
dtype: float64
display(mdl_price_vs_conv.fittedvalues)
0      16.205035
1      15.406955
2      12.214636
3      12.214636
4      12.214636
         ...    
409     8.224237
410    15.406955
411    13.810795
412    12.214636
413    15.406955
Length: 414, dtype: float64
display(mdl_price_vs_conv.resid)
0     -4.737561
1     -2.638422
2      2.097013
3      4.366302
4      0.826211
         ...   
409   -3.564631
410   -0.278362
411   -1.526378
412    3.670387
413    3.927387
Length: 414, dtype: float64
display(mdl_price_vs_conv.summary())
OLS Regression Results
Dep. Variable: price_twd_msq R-squared: 0.326
Model: OLS Adj. R-squared: 0.324
Method: Least Squares F-statistic: 199.3
Date: Wed, 16 Nov 2022 Prob (F-statistic): 3.41e-37
Time: 16:50:00 Log-Likelihood: -1091.1
No. Observations: 414 AIC: 2186.
Df Residuals: 412 BIC: 2194.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 8.2242 0.285 28.857 0.000 7.664 8.784
n_convenience 0.7981 0.057 14.118 0.000 0.687 0.909
Omnibus: 171.927 Durbin-Watson: 1.993
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1417.242
Skew: 1.553 Prob(JB): 1.78e-308
Kurtosis: 11.516 Cond. No. 8.87


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Manually predicting house prices

You can manually calculate the predictions from the model coefficients. When making predictions in real life, it is better to use .predict(), but doing this manually is helpful to reassure yourself that predictions aren't magic - they are simply arithmetic.

In fact, for a simple linear regression, the predicted value is just the intercept plus the slope times the explanatory variable.

$$\text{response} = \text{intercept} + \text{slope} \times \text{explanatory}$$

explanatory_data = pd.DataFrame({'n_convenience': np.arange(0,10)})
coeffs = mdl_price_vs_conv.params

# Get the intercept
intercept = coeffs["Intercept"]

# Get the slope
slope = coeffs["n_convenience"]

# Manually calculate the predictions
price_twd_msq = intercept + slope * explanatory_data
print(price_twd_msq)

# Compare to the results from .predict()
print(price_twd_msq.assign(predictions_auto=mdl_price_vs_conv.predict(explanatory_data)))
   n_convenience
0       8.224237
1       9.022317
2       9.820397
3      10.618477
4      11.416556
5      12.214636
6      13.012716
7      13.810795
8      14.608875
9      15.406955
   n_convenience  predictions_auto
0       8.224237          8.224237
1       9.022317          9.022317
2       9.820397          9.820397
3      10.618477         10.618477
4      11.416556         11.416556
5      12.214636         12.214636
6      13.012716         13.012716
7      13.810795         13.810795
8      14.608875         14.608875
9      15.406955         15.406955

Regression to the mean

The concept:

  • Response value = fitted value + residual
  • "The stuff you explained" + "the stuff you couldn't explain"

  • Residuals exist due to problems in the model and fundamental randomness
  • Extreme cases are often due to randomness
  • Regression to the mean means extreme cases don't persist over time
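
One way to see this (a small simulation sketch, not part of the course datasets) is to generate two noisy measurements of the same underlying quantity and regress the second on the first. The slope comes out well below one, so extreme first-year values predict second-year values closer to the mean.

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(42)
skill = rng.normal(0, 1, size=1000)            # underlying ability
year_1 = skill + rng.normal(0, 1, size=1000)   # ability plus luck in year 1
year_2 = skill + rng.normal(0, 1, size=1000)   # ability plus different luck in year 2

sim = pd.DataFrame({"year_1": year_1, "year_2": year_2})

# The slope is about 0.5, not 1: extreme year_1 values are not sustained in year_2
print(ols("year_2 ~ year_1", data=sim).fit().params)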

Home run!

Regression to the mean is an important concept in many areas, including sports.

Here you can see a dataset of baseball batting data in 2017 and 2018. Each point represents a player, and more home runs are better. A naive prediction might be that the performance in 2018 is the same as the performance in 2017. That is, a linear regression would lie on the "y equals x" line.

Explore the plot and make predictions. What does regression to the mean say about the number of home runs in 2018 for a player who was very successful in 2017?


Someone who hit 40 home runs in 2017 is predicted to hit 10 fewer home runs the next year because regression to the mean states that, on average, extremely high values are not sustained.

Due to regression to the mean, it's common that one player or team that does really well one year, doesn't do as well the next. Likewise players or teams that perform very badly one year do better the next year.

Plotting consecutive portfolio returns

Regression to the mean is also an important concept in investing. Here you'll look at the annual returns from investing in companies in the Standard and Poor 500 index (S&P 500), in 2018 and 2019.

The sp500_yearly_returns dataset contains three columns:

  • symbol: Stock ticker symbol uniquely identifying the company.
  • return_2018: A measure of investment performance in 2018.
  • return_2019: A measure of investment performance in 2019.


A positive number for the return means the investment increased in value; negative means it lost value.

Just as with baseball home runs, a naive prediction might be that the investment performance stays the same from year to year, lying on the y equals x line.

sp500_yearly_returns = pd.read_csv("./datasets/sp500_yearly_returns.csv")
sp500_yearly_returns.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493 entries, 0 to 492
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   symbol       493 non-null    object 
 1   return_2018  493 non-null    float64
 2   return_2019  493 non-null    float64
dtypes: float64(2), object(1)
memory usage: 11.7+ KB
fig = plt.figure()

# Plot the first layer: y = x
plt.axline(xy1=(0,0), slope=1, linewidth=2, color="green")

# Add scatter plot with linear regression trend line
sns.regplot(x="return_2018",
            y="return_2019",
            data=sp500_yearly_returns,
            ci = None,
            line_kws={"color": "black"})

# Set the axes so that the distances along the x and y axes look the same
plt.axis("equal")

# Show the plot
plt.show()

The regression trend line looks very different to the y equals x line. As the financial advisors say, "Past performance is no guarantee of future results."

Modeling consecutive returns

Let's quantify the relationship between returns in 2019 and 2018 by running a linear regression and making predictions. By looking at companies with extremely high or extremely low returns in 2018, we can see if their performance was similar in 2019.

mdl_returns = ols("return_2019 ~ return_2018", data = sp500_yearly_returns).fit()

# Print the parameters
print(mdl_returns.params)
Intercept      0.321321
return_2018    0.020069
dtype: float64
mdl_returns = ols("return_2019 ~ return_2018", data=sp500_yearly_returns).fit()

# Create a DataFrame with return_2018 at -1, 0, and 1 
explanatory_data = pd.DataFrame({"return_2018": [-1,0,1]})

# Use mdl_returns to predict with explanatory_data
print(mdl_returns.predict(explanatory_data))
0    0.301251
1    0.321321
2    0.341390
dtype: float64

Investments that gained a lot in value in 2018 on average gained only a small amount in 2019. Similarly, investments that lost a lot of value in 2018 on average also gained a small amount in 2019.

Transforming variables

perch = fish[fish["species"] == "Perch"]
print(perch.head())
   species  mass_g  length_cm
55   Perch     5.9        7.5
56   Perch    32.0       12.5
57   Perch    40.0       13.8
58   Perch    51.5       15.0
59   Perch    70.0       15.7
sns.regplot(x="length_cm",
            y="mass_g",
            data=perch,
            ci=None)
plt.show()
perch = perch.assign(length_cm_cubed=perch["length_cm"] ** 3)

sns.regplot(x="length_cm_cubed",
            y="mass_g",
            data=perch,
            ci=None)
plt.show()
mdl_perch = ols("mass_g ~ length_cm_cubed", data=perch).fit()
mdl_perch.params
Intercept         -0.117478
length_cm_cubed    0.016796
dtype: float64
explanatory_data = pd.DataFrame({"length_cm_cubed": np.arange(10, 41, 5) ** 3,
                                 "length_cm": np.arange(10, 41, 5)})

prediction_data = explanatory_data.assign(mass_g=mdl_perch.predict(explanatory_data))
plt.rcParams['figure.figsize'] = [7, 5]
fig = plt.figure()
sns.regplot(x="length_cm_cubed", y="mass_g",
            data=perch, ci=None)
            
sns.scatterplot(data=prediction_data,
                x="length_cm_cubed", y="mass_g",
                color="red", marker="s")
<AxesSubplot: xlabel='length_cm_cubed', ylabel='mass_g'>
fig = plt.figure()
sns.regplot(x="length_cm", y="mass_g",
            data=perch, ci=None)
            
sns.scatterplot(data=prediction_data,
                x="length_cm", y="mass_g",
                color="red", marker="s")
<AxesSubplot: xlabel='length_cm', ylabel='mass_g'>

Transforming the explanatory variable

If there is no straight-line relationship between the response variable and the explanatory variable, it is sometimes possible to create one by transforming one or both of the variables. Here, you'll look at transforming the explanatory variable.

You'll take another look at the Taiwan real estate dataset, this time using the distance to the nearest MRT (metro) station as the explanatory variable. You'll use code to make every commuter's dream come true: shortening the distance to the metro station by taking the square root. Take that, geography!

taiwan_real_estate["sqrt_dist_to_mrt_m"] = np.sqrt(
taiwan_real_estate["dist_to_mrt_m"])


plt.figure()

# Plot using the transformed variable
sns.regplot(x="sqrt_dist_to_mrt_m",
            y="price_twd_msq",
            data=taiwan_real_estate,
            ci=None)
plt.show()
taiwan_real_estate["sqrt_dist_to_mrt_m"] = np.sqrt(taiwan_real_estate["dist_to_mrt_m"])

# Run a linear regression of price_twd_msq vs. sqrt_dist_to_mrt_m
mdl_price_vs_dist = ols("price_twd_msq ~ sqrt_dist_to_mrt_m", data=taiwan_real_estate).fit()

# Print the parameters
display(mdl_price_vs_dist.params)

explanatory_data = pd.DataFrame({"sqrt_dist_to_mrt_m": np.sqrt(np.arange(0, 81, 10) ** 2),
                                "dist_to_mrt_m": np.arange(0, 81, 10) ** 2})

# Create prediction_data by adding a column of predictions to explantory_data
prediction_data = explanatory_data.assign(
    price_twd_msq = mdl_price_vs_dist.predict(explanatory_data)
)

# Print the result
display(prediction_data)
Intercept             16.709799
sqrt_dist_to_mrt_m    -0.182843
dtype: float64
sqrt_dist_to_mrt_m dist_to_mrt_m price_twd_msq
0 0.0 0 16.709799
1 10.0 100 14.881370
2 20.0 400 13.052942
3 30.0 900 11.224513
4 40.0 1600 9.396085
5 50.0 2500 7.567656
6 60.0 3600 5.739227
7 70.0 4900 3.910799
8 80.0 6400 2.082370
taiwan_real_estate["sqrt_dist_to_mrt_m"] = np.sqrt(taiwan_real_estate["dist_to_mrt_m"])

# Run a linear regression of price_twd_msq vs. sqrt_dist_to_mrt_m
mdl_price_vs_dist = ols("price_twd_msq ~ sqrt_dist_to_mrt_m", data=taiwan_real_estate).fit()

# Use this explanatory data
explanatory_data = pd.DataFrame({"sqrt_dist_to_mrt_m": np.sqrt(np.arange(0, 81, 10) ** 2),
                                "dist_to_mrt_m": np.arange(0, 81, 10) ** 2})

# Use mdl_price_vs_dist to predict explanatory_data
prediction_data = explanatory_data.assign(
    price_twd_msq = mdl_price_vs_dist.predict(explanatory_data)
)

fig = plt.figure()
sns.regplot(x="sqrt_dist_to_mrt_m", y="price_twd_msq", data=taiwan_real_estate, ci=None)

# Add a layer of your prediction points
sns.scatterplot(data = prediction_data, x="sqrt_dist_to_mrt_m", y="price_twd_msq", color="red")
plt.show()

Transforming the response variable too

The response variable can be transformed too, but this means you need an extra step at the end to undo that transformation. That is, you "back transform" the predictions.

In the video, you saw the first step of the digital advertising workflow: spending money to buy ads, and counting how many people see them (the "impressions"). The next step is determining how many people click on the advert after seeing it.

ad_conversion = pd.read_csv("./datasets/ad_conversion.csv")
display(ad_conversion.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 936 entries, 0 to 935
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   spent_usd      936 non-null    float64
 1   n_impressions  936 non-null    int64  
 2   n_clicks       936 non-null    int64  
dtypes: float64(1), int64(2)
memory usage: 22.1 KB
None
ad_conversion["qdrt_n_impressions"] = np.power(ad_conversion["n_impressions"], 0.25) 
ad_conversion["qdrt_n_clicks"] = np.power(ad_conversion["n_clicks"], 0.25)

plt.figure()

# Plot using the transformed variables
sns.regplot(x="qdrt_n_impressions",
            y="qdrt_n_clicks",
            data=ad_conversion,
            ci=None)
plt.show()
ad_conversion["qdrt_n_impressions"] = ad_conversion["n_impressions"] ** 0.25
ad_conversion["qdrt_n_clicks"] = ad_conversion["n_clicks"] ** 0.25

mdl_click_vs_impression = ols("qdrt_n_clicks ~ qdrt_n_impressions", data=ad_conversion).fit()

explanatory_data = pd.DataFrame({"qdrt_n_impressions": np.arange(0, 3e6+1, 5e5) ** .25,
                                 "n_impressions": np.arange(0, 3e6+1, 5e5)})

# Complete prediction_data
prediction_data = explanatory_data.assign(
    qdrt_n_clicks = mdl_click_vs_impression.predict(explanatory_data)
)

# Print the result
display(prediction_data)
qdrt_n_impressions n_impressions qdrt_n_clicks
0 0.000000 0.0 0.071748
1 26.591479 500000.0 3.037576
2 31.622777 1000000.0 3.598732
3 34.996355 1500000.0 3.974998
4 37.606031 2000000.0 4.266063
5 39.763536 2500000.0 4.506696
6 41.617915 3000000.0 4.713520

Back transformation

In the previous exercise, you transformed the response variable, ran a regression, and made predictions. But you're not done yet! In order to correctly interpret and visualize your predictions, you'll need to do a back-transformation.

prediction_data["n_clicks"] = prediction_data["qdrt_n_clicks"] **4
display(prediction_data)
qdrt_n_impressions n_impressions qdrt_n_clicks n_clicks
0 0.000000 0.0 0.071748 0.000026
1 26.591479 500000.0 3.037576 85.135121
2 31.622777 1000000.0 3.598732 167.725102
3 34.996355 1500000.0 3.974998 249.659131
4 37.606031 2000000.0 4.266063 331.214159
5 39.763536 2500000.0 4.506696 412.508546
6 41.617915 3000000.0 4.713520 493.607180
prediction_data["n_clicks"] = prediction_data["qdrt_n_clicks"] ** 4

# Plot the transformed variables
fig = plt.figure()
sns.regplot(x="qdrt_n_impressions", y="qdrt_n_clicks", data=ad_conversion, ci=None)

# Add a layer of your prediction points
sns.scatterplot(data=prediction_data, x="qdrt_n_impressions", y="qdrt_n_clicks", color="red")
plt.show()

Assessing model fit

In this chapter, you’ll learn how to ask questions of your model to assess fit. You’ll learn how to quantify how well a linear regression model fits, diagnose model problems using visualizations, and understand each observation's leverage and influence on the model.

Quantifying model fit

Coefficient of determination

Sometimes called "r-squared" or "R-squared".

The proportion of the variance in the response variable that is predictable from the explanatory variable.

  • 1 means a perfect fit
  • 0 means the worst possible fit
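
In terms of sums of squares (the standard definition, stated here for reference):

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$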
mdl_bream = ols("mass_g ~ length_cm", data=bream).fit()
display(mdl_bream.summary())
OLS Regression Results
Dep. Variable: mass_g R-squared: 0.878
Model: OLS Adj. R-squared: 0.874
Method: Least Squares F-statistic: 237.6
Date: Thu, 17 Nov 2022 Prob (F-statistic): 1.22e-16
Time: 13:44:39 Log-Likelihood: -199.35
No. Observations: 35 AIC: 402.7
Df Residuals: 33 BIC: 405.8
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -1035.3476 107.973 -9.589 0.000 -1255.020 -815.676
length_cm 54.5500 3.539 15.415 0.000 47.350 61.750
Omnibus: 7.314 Durbin-Watson: 1.478
Prob(Omnibus): 0.026 Jarque-Bera (JB): 10.857
Skew: -0.252 Prob(JB): 0.00439
Kurtosis: 5.682 Cond. No. 263.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
mdl_bream.rsquared
0.8780627095147174
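
As a quick check (using the bream data already loaded), for simple linear regression the coefficient of determination is just the squared correlation between the explanatory and response variables.

# Squared correlation matches mdl_bream.rsquared (about 0.878)
coeff_determination = bream["length_cm"].corr(bream["mass_g"]) ** 2
print(coeff_determination)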

Residual standard error (RSE)

  • A "typical" difference between a prediction and an observed response
  • It has the same unit as the response variable.
  • $MSE = RSE^2$
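
For a simple linear regression, with two fitted coefficients, the formula is

$$RSE = \sqrt{\frac{\sum_i e_i^2}{n - 2}}$$

where $e_i$ are the residuals and $n$ is the number of observations; this is exactly what the manual calculation below does.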
mse = mdl_bream.mse_resid
print('mse: ', mse)
rse = np.sqrt(mse)
print("rse: ", rse)
mse:  5498.555084973521
rse:  74.15224261594197
residuals_sq = mdl_bream.resid ** 2
resid_sum_of_sq = sum(residuals_sq)
deg_freedom = len(bream.index) - 2

print('mse: ', resid_sum_of_sq/deg_freedom)

print("rse :", np.sqrt(resid_sum_of_sq/deg_freedom))
mse:  5498.55508497352
rse : 74.15224261594197
residuals_sq = mdl_bream.resid ** 2
resid_sum_of_sq = sum(residuals_sq)
n_obs = len(bream.index)
rmse = np.sqrt(resid_sum_of_sq/n_obs)
print("rmse :", rmse)
rmse : 72.00244396727619

Interpreting RSE

  • mdl_bream has an RSE of about 74.

    The difference between predicted bream masses and observed bream masses is typically about 74g.

Coefficient of determination

The coefficient of determination is a measure of how well the linear regression line fits the observed values. For simple linear regression, it is equal to the square of the correlation between the explanatory and response variables.

Here, you'll take another look at the second stage of the advertising pipeline: modeling the click response to impressions. Two models are available: mdl_click_vs_impression_orig models n_clicks versus n_impressions. mdl_click_vs_impression_trans is the transformed model you saw in Chapter 2. It models n_clicks to the power of 0.25 versus n_impressions to the power of 0.25.

mdl_click_vs_impression_orig = ols("n_clicks ~ n_impressions", data=ad_conversion).fit()

mdl_click_vs_impression_trans = ols("qdrt_n_clicks ~ qdrt_n_impressions", data=ad_conversion).fit()
display(mdl_click_vs_impression_orig.summary())

# Print a summary of mdl_click_vs_impression_trans
display(mdl_click_vs_impression_trans.summary())
OLS Regression Results
Dep. Variable: n_clicks R-squared: 0.892
Model: OLS Adj. R-squared: 0.891
Method: Least Squares F-statistic: 7683.
Date: Thu, 17 Nov 2022 Prob (F-statistic): 0.00
Time: 14:30:20 Log-Likelihood: -4126.7
No. Observations: 936 AIC: 8257.
Df Residuals: 934 BIC: 8267.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.6829 0.789 2.133 0.033 0.135 3.231
n_impressions 0.0002 1.96e-06 87.654 0.000 0.000 0.000
Omnibus: 247.038 Durbin-Watson: 0.870
Prob(Omnibus): 0.000 Jarque-Bera (JB): 13215.277
Skew: -0.258 Prob(JB): 0.00
Kurtosis: 21.401 Cond. No. 4.88e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.88e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS Regression Results
Dep. Variable: qdrt_n_clicks R-squared: 0.945
Model: OLS Adj. R-squared: 0.944
Method: Least Squares F-statistic: 1.590e+04
Date: Thu, 17 Nov 2022 Prob (F-statistic): 0.00
Time: 14:30:20 Log-Likelihood: 193.90
No. Observations: 936 AIC: -383.8
Df Residuals: 934 BIC: -374.1
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.0717 0.017 4.171 0.000 0.038 0.106
qdrt_n_impressions 0.1115 0.001 126.108 0.000 0.110 0.113
Omnibus: 11.447 Durbin-Watson: 0.568
Prob(Omnibus): 0.003 Jarque-Bera (JB): 10.637
Skew: -0.216 Prob(JB): 0.00490
Kurtosis: 2.707 Cond. No. 52.1


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
print(mdl_click_vs_impression_orig.rsquared)

# Print the coeff of determination for mdl_click_vs_impression_trans
print(mdl_click_vs_impression_trans.rsquared)
0.8916134973508041
0.9445272817143905

Question: mdl_click_vs_impression_orig has a coefficient of determination of 0.89. Which statement about the model is true?

Possible Answers

  • The number of clicks explains 89% of the variability in the number of impressions.

  • The number of impressions explains 89% of the variability in the number of clicks.

  • The model is correct 89% of the time.

  • The correlation between the number of impressions and the number of clicks is 0.89.

The transformed model has a higher coefficient of determination than the original model, suggesting that it gives a better fit to the data.

Residual standard error

Residual standard error (RSE) is a measure of the typical size of the residuals. Equivalently, it's a measure of how wrong you can expect predictions to be. Smaller numbers are better, with zero being a perfect fit to the data.

mse_orig = mdl_click_vs_impression_orig.mse_resid

# Calculate rse_orig for mdl_click_vs_impression_orig and print it
rse_orig = np.sqrt(mse_orig)
print("RSE of original model: ", rse_orig)

# Calculate mse_trans for mdl_click_vs_impression_trans
mse_trans = mdl_click_vs_impression_trans.mse_resid

# Calculate rse_trans for mdl_click_vs_impression_trans and print it
rse_trans = np.sqrt(mse_trans)
print("RSE of transformed model: ", rse_trans)
RSE of original model:  19.905838862478138
RSE of transformed model:  0.19690640896875727

Question: mdl_click_vs_impression_orig has an RSE of about 20. Which statement is true?

Possible Answers:

  • The model explains 20% of the variability in the number of clicks.

  • 20% of the residuals are typically greater than the observed values.

  • The typical difference between observed number of clicks and predicted number of clicks is 20.

  • The typical difference between observed number of impressions and predicted number of impressions is 20.

  • The model predicts that you get one click for every 20 observed impressions.

Note:

  • mdl_click_vs_impression_orig has an RSE of about 20, while mdl_click_vs_impression_trans has an RSE of about 0.2. The transformed model, mdl_click_vs_impression_trans, gives more accurate predictions.
  • RSE is a measure of accuracy for regression models. It also works on other statistical model types, such as regression trees, so you can use it to compare accuracy across different classes of models.

Visualizing model fit

Residuals vs. fitted values

Here you can see diagnostic plots of residuals versus fitted values for two models on advertising conversion.

Original model (n_clicks versus n_impressions):

Transformed model (n_clicks ^ 0.25 versus n_impressions ^ 0.25):


Look at the numbers on the y-axis scales and how well the trend lines follow the $y=0$ line. Which statement is true?

Possible Answers:

  • The residuals track the $y=0$ line more closely in the original model compared to the transformed model, indicating that the original model is a better fit for the data.

  • The residuals track the $y=0$ line more closely in the transformed model compared to the original model, indicating that the original model is a better fit for the data.

  • The residuals track the $y=0$ line more closely in the original model compared to the transformed model, indicating that the transformed model is a better fit for the data.

  • The residuals track the $y=0$ line more closely in the transformed model compared to the original model, indicating that the transformed model is a better fit for the data.

In a good model, the residuals should have a trend line close to zero.

Q-Q plot of residuals

Here are normal Q-Q plots of the previous two models.

Original model (n_clicks versus n_impressions):

Transformed model (n_clicks ^ 0.25 versus n_impressions ^ 0.25):


Look at how well the points track the "normality" line. Which statement is true?

Possible Answers:

  • The residuals track the "normality" line more closely in the original model compared to the transformed model, indicating that the original model is a better fit for the data.

  • The residuals track the "normality" line more closely in the original model compared to the transformed model, indicating that the transformed model is a better fit for the data.

  • The residuals track the "normality" line more closely in the transformed model compared to the original model, indicating that the transformed model is a better fit for the data.

  • The residuals track the "normality" line more closely in the transformed model compared to the original model, indicating that the original model is a better fit for the data.

If the residuals from the model are normally distributed, then the points will track the line on the Q-Q plot. In this case, neither model is perfect, but the transformed model is closer.

Scale-location

Here are scale-location plots of the previous two models. That is, they show the size of the standardized residuals versus the fitted values.

Original model (n_clicks versus n_impressions):

Transformed model (n_clicks ^ 0.25 versus n_impressions ^ 0.25):

Look at the numbers on the y-axis and the slope of the trend line. Which statement is true?

Possible Answers:

  • The size of the standardized residuals is more consistent in the original model compared to the transformed model, indicating that the original model is a better fit for the data.

  • The size of the standardized residuals is more consistent in the transformed model compared to the original model, indicating that the transformed model is a better fit for the data.

  • The size of the standardized residuals is more consistent in the transformed model compared to the original model, indicating that the original model is a better fit for the data.

  • The size of the standardized residuals is more consistent in the original model compared to the transformed model, indicating that the transformed model is a better fit for the data.

Drawing diagnostic plots

It's time for you to draw these diagnostic plots yourself using the Taiwan real estate dataset and the model of house prices versus number of convenience stores.

sns.residplot(x="n_convenience", y="price_twd_msq", data=taiwan_real_estate, lowess = True)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

# Show the plot
plt.show()
from statsmodels.api import qqplot

# Create the Q-Q plot of the residuals
qqplot(data=mdl_price_vs_conv.resid, fit=True, line="45")

# Show the plot
plt.show()
model_norm_residuals = mdl_price_vs_conv.get_influence().resid_studentized_internal
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

# Create the scale-location plot
sns.regplot(x=mdl_price_vs_conv.fittedvalues, y=model_norm_residuals_abs_sqrt, ci=None, lowess=True, scatter_kws={"color": "blue"}, line_kws={"color": "red"})
plt.xlabel("Fitted values")
plt.ylabel("Sqrt of abs val of stdized residuals")

# Show the plot
plt.show()

Outliers, leverage, and influence

plt.rcParams['figure.figsize'] = [7.5, 5]
roach = fish[fish["species"] == "Roach"]
# Which points are outliers?
sns.regplot(x="length_cm",
            y="mass_g",
            data=roach,
            ci=None)
plt.show()
roach["extreme_l"] = ((roach["length_cm"] < 15) | (roach["length_cm"] > 26))
fig = plt.figure()

sns.regplot(x="length_cm",
            y="mass_g",
            data=roach,
            ci=None)
sns.scatterplot(x="length_cm",
            y="mass_g",
            hue="extreme_l",
            data=roach)
plt.show()
roach["extreme_m"] = roach["mass_g"] < 1
fig = plt.figure()
sns.regplot(x="length_cm",
            y="mass_g",
            data=roach,
            ci=None)

sns.scatterplot(x="length_cm",
                y="mass_g",
                hue="extreme_l",
                style="extreme_m",
                data=roach)

plt.show()
mdl_roach = ols("mass_g ~ length_cm", data=roach).fit()
summary_roach = mdl_roach.get_influence().summary_frame()
roach["leverage"] = summary_roach["hat_diag"]
display(roach.head())
species mass_g length_cm leverage cooks_dist
35 Roach 40.0 12.9 0.313729 1.074015
36 Roach 69.0 16.5 0.125538 0.010429
37 Roach 78.0 17.5 0.093487 0.000020
38 Roach 87.0 18.2 0.076283 0.001980
39 Roach 120.0 18.6 0.068387 0.006610
roach["cooks_dist"] = summary_roach["cooks_d"]
display(roach.head())
species mass_g length_cm leverage cooks_dist
35 Roach 40.0 12.9 0.313729 1.074015
36 Roach 69.0 16.5 0.125538 0.010429
37 Roach 78.0 17.5 0.093487 0.000020
38 Roach 87.0 18.2 0.076283 0.001980
39 Roach 120.0 18.6 0.068387 0.006610
display(roach.sort_values("cooks_dist", ascending = False)[:5])
species mass_g length_cm leverage cooks_dist
35 Roach 40.0 12.9 0.313729 1.074015
54 Roach 390.0 29.5 0.394740 0.365782
40 Roach 0.0 19.0 0.061897 0.311852
52 Roach 290.0 24.0 0.099488 0.150064
51 Roach 180.0 23.6 0.088391 0.061209
roach_not_short = roach[roach["length_cm"] != 12.9]
sns.regplot(x="length_cm",
            y="mass_g",
            data=roach,
            ci=None,
            line_kws={"color": "green"})
sns.regplot(x="length_cm",
            y="mass_g",
            data=roach_not_short,
            ci=None,
            line_kws={"color": "red"})
plt.show()

Leverage

Leverage measures how unusual or extreme the explanatory variables are for each observation. Very roughly, high leverage means that the explanatory variable has values that are different from other points in the dataset. In the case of simple linear regression, where there is only one explanatory value, this typically means values with a very high or very low explanatory value.
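
For simple linear regression, leverage has a simple closed form (a standard result, included here for reference):

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$$

so the further an observation's explanatory value $x_i$ is from the mean $\bar{x}$, the higher its leverage.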

Here, you'll look at highly leveraged values in the model of house price versus the square root of distance from the nearest MRT station in the Taiwan real estate dataset.


Guess which observations you think will have a high leverage, then move the slider to find out.

Which statement is true?

Observations with a large distance to the nearest MRT station have the highest leverage, because most of the observations have a short distance, so long distances are more extreme.

Highly leveraged points are the ones with explanatory variables that are furthest away from the others.

Influence

Influence measures how much a model would change if each observation was left out of the model calculations, one at a time. That is, it measures how different the prediction line would look if you ran a linear regression on all data points except that point, compared to running it on the whole dataset.

The standard metric for influence is Cook's distance, which calculates influence based on the residual size and the leverage of the point.
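
As a reference (the standard definition, not something you need to compute by hand), Cook's distance for observation $i$ combines its residual $e_i$ and leverage $h_i$:

$$D_i = \frac{e_i^2}{p\,\hat{\sigma}^2} \cdot \frac{h_i}{(1 - h_i)^2}$$

where $p$ is the number of model parameters and $\hat{\sigma}^2$ is the residual mean square, so a point is influential when it has both a large residual and high leverage.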


You can see the same model as last time: house price versus the square root of distance from the nearest MRT station in the Taiwan real estate dataset. Guess which observations you think will have a high influence, then move the slider to find out.

Which statement is true?

Observations far away from the trend line have high influence, because they have large residuals and are far away from other observations.

The majority of the influential houses were those with prices that were much higher than the model predicted (and one with a price that was much lower).

Extracting leverage and influence

taiwan_real_estate["sqrt_dist_to_mrt_m"] = np.sqrt(taiwan_real_estate["dist_to_mrt_m"])

# Run a linear regression of price_twd_msq vs. sqrt_dist_to_mrt_m
mdl_price_vs_dist = ols("price_twd_msq ~ sqrt_dist_to_mrt_m", data=taiwan_real_estate).fit()
summary_info = mdl_price_vs_dist.get_influence().summary_frame()

# Add the hat_diag column to taiwan_real_estate, name it leverage
taiwan_real_estate["leverage"] = summary_info["hat_diag"]

# Sort taiwan_real_estate by leverage in descending order and print the head
display(taiwan_real_estate.sort_values("leverage", ascending = False).head())
dist_to_mrt_m n_convenience house_age_years price_twd_msq sqrt_dist_to_mrt_m leverage
347 6488.021 1 15 to 30 3.388805 80.548253 0.026665
116 6396.283 1 30 to 45 3.691377 79.976765 0.026135
249 6306.153 1 15 to 30 4.538578 79.411290 0.025617
255 5512.038 1 30 to 45 5.264750 74.243101 0.021142
8 5512.038 1 30 to 45 5.688351 74.243101 0.021142
summary_info = mdl_price_vs_dist.get_influence().summary_frame()

# Add the hat_diag column to taiwan_real_estate, name it leverage
taiwan_real_estate["leverage"] = summary_info["hat_diag"]

# Add the cooks_d column to taiwan_real_estate, name it cooks_dist
taiwan_real_estate["cooks_dist"] = summary_info["cooks_d"]

# Sort taiwan_real_estate by cooks_dist in descending order and print the head.
display(taiwan_real_estate.sort_values("cooks_dist", ascending = False).head())
dist_to_mrt_m n_convenience house_age_years price_twd_msq sqrt_dist_to_mrt_m leverage cooks_dist
270 252.5822 1 0 to 15 35.552194 15.892835 0.003849 0.115549
148 3780.5900 0 15 to 30 13.645991 61.486503 0.012147 0.052440
228 3171.3290 0 0 to 15 14.099849 56.314554 0.009332 0.035384
220 186.5101 9 30 to 45 23.691377 13.656870 0.004401 0.025123
113 393.2606 6 0 to 15 2.299546 19.830799 0.003095 0.022813

Simple Logistic Regression Modeling

Learn to fit logistic regression models. Using real-world data, you’ll predict the likelihood of a customer closing their bank account, expressed as probabilities of success and odds ratios, and quantify model performance using confusion matrices.

Why you need logistic regression

churn = pd.read_csv("./datasets/churn.csv")
churn.head()
has_churned time_since_first_purchase time_since_last_purchase
0 0 -1.089221 -0.721322
1 0 1.182983 3.634435
2 0 -0.846156 -0.427582
3 0 0.086942 -0.535672
4 0 -1.166642 -0.672640
mdl_churn_vs_recency_lm = ols("has_churned ~ time_since_last_purchase",
data=churn).fit()
print(mdl_churn_vs_recency_lm.params)

intercept, slope = mdl_churn_vs_recency_lm.params
Intercept                   0.490780
time_since_last_purchase    0.063783
dtype: float64
sns.scatterplot(x="time_since_last_purchase",
                y="has_churned",
                data=churn)
plt.axline(xy1=(0, intercept),
           slope=slope)
plt.show()
sns.scatterplot(x="time_since_last_purchase",
                y="has_churned",
                data=churn)
plt.axline(xy1=(0, intercept),
           slope=slope)
plt.xlim(-10, 10)
plt.ylim(-0.2, 1.2)
plt.show()
from statsmodels.formula.api import logit
mdl_churn_vs_recency_logit = logit("has_churned ~ time_since_last_purchase",
data=churn).fit()
print(mdl_churn_vs_recency_logit.params)
Optimization terminated successfully.
         Current function value: 0.683000
         Iterations 4
Intercept                  -0.035019
time_since_last_purchase    0.269215
dtype: float64
sns.regplot(x="time_since_last_purchase",
            y="has_churned",
            data=churn,
            ci=None,
            logistic=True)
            
plt.axline(xy1=(0,intercept),
            slope=slope,
            color="black")

plt.show()

Exploring the explanatory variables

When the response variable is logical, all the points lie on the $y=0$ and $y=1$ lines, making it difficult to see what is happening. In the video, until you saw the trend line, it wasn't clear how the explanatory variable was distributed on each line. This can be solved with a histogram of the explanatory variable, grouped by the response.

You will use these histograms to get to know the financial services churn dataset seen in the video.

sns.displot(data=churn,
            x="time_since_last_purchase",
            col="has_churned")
plt.show()
sns.displot(data=churn,
            x="time_since_first_purchase",
            col="has_churned")
plt.show()

Visualizing linear and logistic models

As with linear regressions, regplot() will draw model predictions for a logistic regression without you having to worry about the modeling code yourself. To see how the predictions differ for linear and logistic regressions, try drawing both trend lines side by side. Spoiler: you should see a linear (straight-line) trend from the linear model, and a logistic (S-shaped) trend from the logistic model.

sns.regplot(x="time_since_first_purchase",
            y="has_churned",
            data=churn, 
            ci=None,
            line_kws={"color": "red"})

# Draw a logistic regression trend line and a scatter plot of time_since_first_purchase vs. has_churned
sns.regplot(x="time_since_first_purchase",
            y="has_churned",
            data=churn,
            ci=None,
            logistic=True,
            line_kws={"color": "blue"})

plt.show()

Logistic regression with logit()

Logistic regression requires another function from statsmodels.formula.api: logit(). It takes the same arguments as ols(): a formula and a data argument. You then use .fit() to fit the model to the data. Here, you'll model how the length of the relationship with a customer affects churn.

# Import logit
from statsmodels.formula.api import logit

# Fit a logistic regression of churn vs. 
# length of relationship using the churn dataset
mdl_churn_vs_relationship = logit("has_churned ~ time_since_first_purchase", data=churn).fit()

# Print the parameters of the fitted model
print(mdl_churn_vs_relationship.params)
Optimization terminated successfully.
         Current function value: 0.679663
         Iterations 4
Intercept                   -0.015185
time_since_first_purchase   -0.354795
dtype: float64

Predictions and odds ratios

Probabilities

There are four main ways of expressing the prediction from a logistic regression model – we'll look at each of them over the next four exercises. Firstly, since the response variable is either "yes" or "no", you can make a prediction of the probability of a "yes". Here, you'll calculate and visualize these probabilities.

Two variables are available:

  • mdl_churn_vs_relationship is the fitted logistic regression model of has_churned versus time_since_first_purchase.
  • explanatory_data is a DataFrame of explanatory values.
explanatory_data = pd.DataFrame({'time_since_first_purchase': np.arange(-1.5, 4.1, .25)})
# explanatory_data
prediction_data = explanatory_data.assign(
  has_churned = mdl_churn_vs_relationship.predict(explanatory_data)
)

# Print the head
print(prediction_data.head())
   time_since_first_purchase  has_churned
0                      -1.50     0.626448
1                      -1.25     0.605470
2                      -1.00     0.584096
3                      -0.75     0.562401
4                      -0.50     0.540465
prediction_data = explanatory_data.assign(
    has_churned = mdl_churn_vs_relationship.predict(explanatory_data)
)

fig = plt.figure()

# Create a scatter plot with logistic trend line
sns.regplot(x="time_since_first_purchase",
            y="has_churned",
            data=churn,
            ci=None,
            logistic=True)

# Overlay with prediction_data, colored red
sns.scatterplot(x="time_since_first_purchase",
                y="has_churned",
                data=prediction_data,
                color="darkred")

plt.show()

Most likely outcome

When explaining your results to a non-technical audience, you may wish to side-step talking about probabilities and simply explain the most likely outcome. That is, rather than saying there is a 60% chance of a customer churning, you say that the most likely outcome is that the customer will churn. The trade-off here is easier interpretation at the cost of nuance.

prediction_data["most_likely_outcome"] = np.round(prediction_data["has_churned"])

# Print the head
print(prediction_data.head())
   time_since_first_purchase  has_churned  most_likely_outcome
0                      -1.50     0.626448                  1.0
1                      -1.25     0.605470                  1.0
2                      -1.00     0.584096                  1.0
3                      -0.75     0.562401                  1.0
4                      -0.50     0.540465                  1.0
prediction_data["most_likely_outcome"] = np.round(prediction_data["has_churned"])

fig = plt.figure()

# Create a scatter plot with logistic trend line (from previous exercise)
sns.regplot(x="time_since_first_purchase",
            y="has_churned",
            data=churn,
            ci=None,
            logistic=True)

# Overlay with prediction_data, colored red
sns.scatterplot(x="time_since_first_purchase",
                y="most_likely_outcome",
                data=prediction_data,
                color="red")

plt.show()
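
Note that np.round() implicitly uses a cutoff of 0.5. As a small sketch (not part of the exercise, and the 0.3 threshold below is purely illustrative), you could apply a different cutoff with np.where() if, say, you wanted to flag likely churners more aggressively.

# Illustrative cutoff only; the exercise itself uses the default 0.5 via np.round()
threshold = 0.3

prediction_data["outcome_at_0.3"] = np.where(
    prediction_data["has_churned"] >= threshold, 1.0, 0.0)

print(prediction_data.head())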

Odds ratio

Odds ratios compare the probability of something happening with the probability of it not happening. This is sometimes easier to reason about than probabilities, particularly when weighing up choices. For example, if a customer has a 20% chance of churning, it may be more intuitive to say "the chance of them not churning is four times higher than the chance of them churning".
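
To make that example concrete, here is the arithmetic as a tiny sketch (the 20% figure is just the illustrative value from the sentence above).

prob_churn = 0.2

# Odds of churning: probability of churning over probability of not churning
odds_churn = prob_churn / (1 - prob_churn)          # 0.25
odds_not_churn = (1 - prob_churn) / prob_churn      # 4.0

print(odds_churn, odds_not_churn)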

prediction_data["odds_ratio"] = prediction_data["has_churned"] / (1 - prediction_data["has_churned"])

# Print the head
print(prediction_data.head())
   time_since_first_purchase  has_churned  most_likely_outcome  odds_ratio
0                      -1.50     0.626448                  1.0    1.677003
1                      -1.25     0.605470                  1.0    1.534661
2                      -1.00     0.584096                  1.0    1.404400
3                      -0.75     0.562401                  1.0    1.285197
4                      -0.50     0.540465                  1.0    1.176111
prediction_data["odds_ratio"] = prediction_data["has_churned"] / (1 - prediction_data["has_churned"])

# Create a new figure
fig = plt.figure()

# Create a line plot of odds_ratio vs time_since_first_purchase
sns.lineplot(x="time_since_first_purchase",
             y="odds_ratio",
             data=prediction_data)

# Add a dotted horizontal line at odds_ratio = 1
plt.axhline(y=1, linestyle="dotted")

# Show the plot
plt.show()

Odds ratios provide an alternative to probabilities that makes it easier to compare the likelihood of positive and negative responses.

Log odds ratio

One downside to probabilities and odds ratios for logistic regression predictions is that their prediction lines are curved. This makes it harder to reason about how the prediction changes when you change the explanatory variable. The logarithm of the odds ratio (the "log odds ratio", or "logit") does have a linear relationship with the explanatory variable, so as the explanatory variable changes, the predicted response changes steadily rather than dramatically.

Since the actual values of log odds ratio are less intuitive than (linear) odds ratio, for visualization purposes it's usually better to plot the odds ratio and apply a log transformation to the y-axis scale.

prediction_data["log_odds_ratio"] = np.log(prediction_data['odds_ratio'])

# Print the head
print(prediction_data.head())
   time_since_first_purchase  has_churned  most_likely_outcome  odds_ratio  log_odds_ratio
0                      -1.50     0.626448                  1.0    1.677003        0.517008
1                      -1.25     0.605470                  1.0    1.534661        0.428309
2                      -1.00     0.584096                  1.0    1.404400        0.339610
3                      -0.75     0.562401                  1.0    1.285197        0.250912
4                      -0.50     0.540465                  1.0    1.176111        0.162213
prediction_data["log_odds_ratio"] = np.log(prediction_data["odds_ratio"])

fig = plt.figure()

# Update the line plot: log_odds_ratio vs. time_since_first_purchase
sns.lineplot(x="time_since_first_purchase",
             y="log_odds_ratio",
             data=prediction_data)

# Add a dotted horizontal line at log_odds_ratio = 0
plt.axhline(y=0, linestyle="dotted")

plt.show()
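
As noted above, an alternative to plotting the log odds ratio directly is to plot the odds ratio and log-transform the y-axis scale. A minimal sketch of that approach (not one of the exercises) is below; the trend still appears linear, while the tick labels stay on the more intuitive odds-ratio scale.

fig = plt.figure()

# Plot the untransformed odds ratio
sns.lineplot(x="time_since_first_purchase",
             y="odds_ratio",
             data=prediction_data)

# Log-transform the y-axis scale
plt.yscale("log")

# An odds ratio of 1 still marks "churning and not churning equally likely"
plt.axhline(y=1, linestyle="dotted")

plt.show()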

Quantifying logistic regression fit

Residual plots, Q-Q plots, and scale-location plots are less useful for logistic regression. Instead, you can assess the fit using a confusion matrix of true/false positive and negative outcomes, and compute performance metrics from various ratios of those counts.

  • Accuracy: the proportion of correct predictions. Higher is better. $$ \dfrac{TN+TP}{TN+FN+FP+TP}$$

  • Sensitivity: the proportion of observations where the actual response was true and the model also predicted true. Higher is better.
    $$ \dfrac{TP}{FN + TP}$$

  • Specificity: the proportion of observations where the actual response was false and the model also predicted false. Higher is better. $$ \dfrac{TN}{TN + FP}$$

Calculating the confusion matrix

A confusion matrix (occasionally called a confusion table) is the basis of all performance metrics for models with a categorical response (such as a logistic regression). It contains the counts of each actual response-predicted response pair. In this case, where there are two possible responses (churn or not churn), there are four overall outcomes.

  • True positive: The customer churned and the model predicted they would.
  • False positive: The customer didn't churn, but the model predicted they would.
  • True negative: The customer didn't churn and the model predicted they wouldn't.
  • False negative: The customer churned, but the model predicted they wouldn't.
actual_response = churn["has_churned"]

# Get the predicted responses
predicted_response = np.round(mdl_churn_vs_relationship.predict())

# Create outcomes as a DataFrame of both Series
outcomes = pd.DataFrame({"actual_response": actual_response,
                         "predicted_response": predicted_response})

# Print the outcomes
print(outcomes.value_counts(sort = False))
actual_response  predicted_response
0                0.0                   112
                 1.0                    88
1                0.0                    76
                 1.0                   124
dtype: int64
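
The same counts can be laid out as a two-by-two table with pd.crosstab(), which some find easier to read than the stacked value_counts() output. This is an optional sketch, not part of the exercise.

# Cross-tabulate actual vs. predicted responses into a 2x2 table of counts
print(pd.crosstab(outcomes["actual_response"],
                  outcomes["predicted_response"]))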

Drawing a mosaic plot of the confusion matrix

While calculating the confusion matrix by hand might be fun, it would become tedious if you needed confusion matrices for several different models. Luckily, the .pred_table() method can calculate the confusion matrix for you.

Additionally, you can use the output from the .pred_table() method to visualize the confusion matrix, using the mosaic() function.

from statsmodels.graphics.mosaicplot import mosaic

# Calculate the confusion matrix conf_matrix
conf_matrix = mdl_churn_vs_relationship.pred_table()

# Print it
print(conf_matrix)

# Draw a mosaic plot of conf_matrix
mosaic(conf_matrix)
plt.show()
[[112.  88.]
 [ 76. 124.]]

Accuracy, sensitivity, specificity

Measuring logistic model performance

As you know by now, several metrics exist for measuring the performance of a logistic regression model. In this last exercise, you'll manually calculate accuracy, sensitivity, and specificity. Recall the following definitions:

Accuracy is the proportion of predictions that are correct. $$ Accuracy = \dfrac{TN+TP}{TN+FN+FP+TP}$$

Sensitivity is the proportion of true observations that are correctly predicted by the model as being true. $$ Sensitivity = \dfrac{TP}{FN + TP}$$

Specificity is the proportion of false observations that are correctly predicted by the model as being false. $$ Specificity = \dfrac{TN}{TN + FP}$$

# Extract TN, TP, FN, and FP from the confusion matrix
TN = conf_matrix[0, 0]
TP = conf_matrix[1, 1]
FN = conf_matrix[1, 0]
FP = conf_matrix[0, 1]

# Calculate and print the accuracy
accuracy = (TN + TP) / (TN + FN + FP + TP)
print("accuracy: ", accuracy)

# Calculate and print the sensitivity
sensitivity = TP / (TP + FN)
print("sensitivity: ", sensitivity)

# Calculate and print the specificity
specificity = TN / (TN + FP)
print("specificity: ", specificity)
accuracy:  0.59
sensitivity:  0.62
specificity:  0.56
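
If you find yourself computing these metrics for several models, the element-by-element arithmetic can be wrapped in a small helper. The function below is a hypothetical sketch, not part of the course code; it assumes a 2x2 matrix in the layout returned by .pred_table().

def confusion_metrics(conf_matrix):
    """Return (accuracy, sensitivity, specificity) from a 2x2 confusion matrix."""
    # Layout from .pred_table(): rows = actual (0, 1), columns = predicted (0, 1)
    TN, FP = conf_matrix[0, 0], conf_matrix[0, 1]
    FN, TP = conf_matrix[1, 0], conf_matrix[1, 1]
    accuracy = (TN + TP) / (TN + FN + FP + TP)
    sensitivity = TP / (TP + FN)
    specificity = TN / (TN + FP)
    return accuracy, sensitivity, specificity

print(confusion_metrics(conf_matrix))  # (0.59, 0.62, 0.56)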

Summary

  • Part 1
    • Fit a simple linear regression
    • Interpret coefficients
  • Part 2
    • Make predictions
    • Regression to the mean
    • Transforming variables
  • Part 3
    • Quantifying model fit
    • Outliers, leverage, and influence
  • Part 4
    • Fit a simple logistic regression
    • Make predictions
    • Get performance from confusion matrix