Chapter 4 - Using XGBoost in pipelines

PREREQUISITES: Supervised Learning with scikit-learn, Case Study: School Budgeting with Machine Learning in Python.

Take your XGBoost skills to the next level by incorporating your models into two end-to-end machine learning pipelines. You'll learn how to tune the most important XGBoost hyperparameters efficiently within a pipeline, and get an introduction to some more advanced preprocessing techniques.

Review of pipelines using sklearn

  • Data Content: Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows:

    • CRIM: per capita crime rate by town
    • ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
    • INDUS: proportion of non-retail business acres per town
    • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    • NOX: nitric oxides concentration (parts per 10 million)
    • RM: average number of rooms per dwelling
    • AGE: proportion of owner-occupied units built prior to 1940
    • DIS: weighted distances to five Boston employment centers
    • RAD: index of accessibility to radial highways
    • TAX: full-value property-tax rate per $10,000
    • PTRATIO: pupil-teacher ratio by town
    • B: 1000(Bk−0.63)^2 where Bk is the proportion of blacks by town
    • LSTAT: % lower status of the population
    • MEDV: Median value of owner-occupied homes in $1000s

We can see that the input attributes have a mixture of units, which is why the pipeline below standardizes the features before fitting the model.
import pandas as pd
import numpy as np
import warnings

pd.set_option('display.expand_frame_repr', False)

warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
""" Scikit-learn pipeline example """
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score


colnames = ["crime","zone","industry","charles","no","rooms","age", "distance","radial","tax","pupil","aam","lower","med_price"]

data = pd.read_csv("datasets/boston_housing.csv",skiprows = 1, names=colnames)

display(data.head())
display(data.info())
X, y = data.iloc[:,:-1], data.iloc[:,-1]
rf_pipeline = Pipeline([ ("st_scaler",StandardScaler()),
                        ("rf_model",RandomForestRegressor())])

# cross_val_score returns negative MSE values (scikit-learn always maximizes scores),
# so take the absolute value and the square root to recover the per-fold RMSE
scores = cross_val_score(rf_pipeline, X, y, scoring="neg_mean_squared_error", cv=10)

final_avg_rmse = np.mean(np.sqrt(np.abs(scores)))
print("Final RMSE:", final_avg_rmse)
crime zone industry charles no rooms age distance radial tax pupil aam lower med_price
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   crime      506 non-null    float64
 1   zone       506 non-null    float64
 2   industry   506 non-null    float64
 3   charles    506 non-null    int64  
 4   no         506 non-null    float64
 5   rooms      506 non-null    float64
 6   age        506 non-null    float64
 7   distance   506 non-null    float64
 8   radial     506 non-null    int64  
 9   tax        506 non-null    float64
 10  pupil      506 non-null    float64
 11  aam        506 non-null    float64
 12  lower      506 non-null    float64
 13  med_price  506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB
None
Final RMSE: 4.186097627860323

Exploratory data analysis

Before diving into the nitty-gritty of pipelines and preprocessing, let's do some exploratory analysis of the original, unprocessed Ames housing dataset. When you worked with this data in previous chapters, we preprocessed it for you so you could focus on the core XGBoost concepts. In this chapter, you'll do the preprocessing yourself!

A smaller version of this original, unprocessed dataset has been pre-loaded into a pandas DataFrame called df. Your task is to explore df in the Shell and pick the option that is incorrect. The larger purpose of this exercise is to understand the kinds of transformations you will need to perform in order to be able to use XGBoost.

df_raw = pd.read_csv("datasets/ames_unprocessed_data.csv",skiprows = None)
df_processed = pd.read_csv("datasets/ames_housing_trimmed_processed.csv",skiprows = None)
display(df_raw.info())
display(df_raw.head())

categorical_columns = [col for col in df_raw.columns if df_raw[col].dtype == "object"]
display(categorical_columns)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 21 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   MSSubClass    1460 non-null   int64  
 1   MSZoning      1460 non-null   object 
 2   LotFrontage   1201 non-null   float64
 3   LotArea       1460 non-null   int64  
 4   Neighborhood  1460 non-null   object 
 5   BldgType      1460 non-null   object 
 6   HouseStyle    1460 non-null   object 
 7   OverallQual   1460 non-null   int64  
 8   OverallCond   1460 non-null   int64  
 9   YearBuilt     1460 non-null   int64  
 10  Remodeled     1460 non-null   int64  
 11  GrLivArea     1460 non-null   int64  
 12  BsmtFullBath  1460 non-null   int64  
 13  BsmtHalfBath  1460 non-null   int64  
 14  FullBath      1460 non-null   int64  
 15  HalfBath      1460 non-null   int64  
 16  BedroomAbvGr  1460 non-null   int64  
 17  Fireplaces    1460 non-null   int64  
 18  GarageArea    1460 non-null   int64  
 19  PavedDrive    1460 non-null   object 
 20  SalePrice     1460 non-null   int64  
dtypes: float64(1), int64(15), object(5)
memory usage: 239.7+ KB
None
MSSubClass MSZoning LotFrontage LotArea Neighborhood BldgType HouseStyle OverallQual OverallCond YearBuilt ... GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr Fireplaces GarageArea PavedDrive SalePrice
0 60 RL 65.0 8450 CollgCr 1Fam 2Story 7 5 2003 ... 1710 1 0 2 1 3 0 548 Y 208500
1 20 RL 80.0 9600 Veenker 1Fam 1Story 6 8 1976 ... 1262 0 1 2 0 3 1 460 Y 181500
2 60 RL 68.0 11250 CollgCr 1Fam 2Story 7 5 2001 ... 1786 1 0 2 1 3 1 608 Y 223500
3 70 RL 60.0 9550 Crawfor 1Fam 2Story 7 5 1915 ... 1717 1 0 1 0 3 1 642 Y 140000
4 60 RL 84.0 14260 NoRidge 1Fam 2Story 8 5 2000 ... 2198 1 0 2 1 4 1 836 Y 250000

5 rows × 21 columns

['MSZoning', 'Neighborhood', 'BldgType', 'HouseStyle', 'PavedDrive']

Encoding categorical columns I: LabelEncoder

Now that you've seen what will need to be done to get the housing data ready for XGBoost, let's go through the process step-by-step.

First, you will need to fill in missing values - as you saw previously, the column LotFrontage has many missing values. Then, you will need to encode any categorical columns in the dataset using one-hot encoding so that they are encoded numerically. You can watch this video from Supervised Learning with scikit-learn for a refresher on the idea.

The data has five categorical columns: MSZoning, PavedDrive, Neighborhood, BldgType, and HouseStyle. Scikit-learn provides a LabelEncoder class that converts the values in each categorical column into integers. You'll practice using it here.

Instructions:

  • Import LabelEncoder from sklearn.preprocessing.
  • Fill in missing values in the LotFrontage column with 0 using .fillna().
  • Create a boolean mask for categorical columns. You can do this by checking for whether df.dtypes equals object.
  • Create a LabelEncoder object. You can do this in the same way you instantiate any scikit-learn estimator.
  • Encode all of the categorical columns into integers using LabelEncoder(). To do this, use the .fit_transform() method of le in the provided lambda function.
from sklearn.preprocessing import LabelEncoder

df = df_raw.copy()
# Fill missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)

# Create a boolean mask for categorical columns
categorical_mask = (df.dtypes == object)

# Get list of categorical column names
categorical_columns = df.columns[categorical_mask].tolist()

# Print the head of the categorical columns
print(df[categorical_columns].head())

# Create LabelEncoder object: le
le = LabelEncoder()
# Save the encoder's class-to-integer mappings to a list
transform_dicts = []

# Apply LabelEncoder to categorical columns
# df[categorical_columns] = df[categorical_columns].apply(lambda x: le.fit_transform(x))
for col in categorical_columns:
    df[col] = le.fit_transform(df[col])
    transform_dicts.append(dict(zip(le.classes_, le.transform(le.classes_))))


# Print the head of the LabelEncoded categorical columns
display(df[categorical_columns].head())
print(*transform_dicts,sep="\n")
  MSZoning Neighborhood BldgType HouseStyle PavedDrive
0       RL      CollgCr     1Fam     2Story          Y
1       RL      Veenker     1Fam     1Story          Y
2       RL      CollgCr     1Fam     2Story          Y
3       RL      Crawfor     1Fam     2Story          Y
4       RL      NoRidge     1Fam     2Story          Y
MSZoning Neighborhood BldgType HouseStyle PavedDrive
0 3 5 0 5 2
1 3 24 0 2 2
2 3 5 0 5 2
3 3 6 0 5 2
4 3 15 0 5 2
{'C (all)': 0, 'FV': 1, 'RH': 2, 'RL': 3, 'RM': 4}
{'Blmngtn': 0, 'Blueste': 1, 'BrDale': 2, 'BrkSide': 3, 'ClearCr': 4, 'CollgCr': 5, 'Crawfor': 6, 'Edwards': 7, 'Gilbert': 8, 'IDOTRR': 9, 'MeadowV': 10, 'Mitchel': 11, 'NAmes': 12, 'NPkVill': 13, 'NWAmes': 14, 'NoRidge': 15, 'NridgHt': 16, 'OldTown': 17, 'SWISU': 18, 'Sawyer': 19, 'SawyerW': 20, 'Somerst': 21, 'StoneBr': 22, 'Timber': 23, 'Veenker': 24}
{'1Fam': 0, '2fmCon': 1, 'Duplex': 2, 'Twnhs': 3, 'TwnhsE': 4}
{'1.5Fin': 0, '1.5Unf': 1, '1Story': 2, '2.5Fin': 3, '2.5Unf': 4, '2Story': 5, 'SFoyer': 6, 'SLvl': 7}
{'N': 0, 'P': 1, 'Y': 2}

Encoding categorical columns II: OneHotEncoder

Okay - so you have your categorical columns encoded numerically. Can you now move onto using pipelines and XGBoost? Not yet! In the categorical columns of this dataset, there is no natural ordering between the entries. As an example: Using LabelEncoder, the CollgCr Neighborhood was encoded as 5, while the Veenker Neighborhood was encoded as 24, and Crawfor as 6. Is Veenker "greater" than Crawfor and CollgCr? No - and allowing the model to assume this natural ordering may result in poor performance.

As a result, there is another step needed: You have to apply a one-hot encoding to create binary, or "dummy" variables. You can do this using scikit-learn's OneHotEncoder.
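
As a quick sanity check of the dummy-variable idea (a sketch that is not part of the exercise, reusing the df_raw DataFrame loaded earlier), pandas' get_dummies builds one indicator column per category of a single column:

# Quick sketch (not part of the exercise): one indicator column per category
import pandas as pd

print(pd.get_dummies(df_raw["PavedDrive"], prefix="PavedDrive").head())
# columns PavedDrive_N, PavedDrive_P, PavedDrive_Y, each row flagging its category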

Instructions:

  • Import OneHotEncoder from sklearn.preprocessing.
  • Instantiate a OneHotEncoder object called ohe. Specify the keyword arguments categorical_features=categorical_mask and sparse=False (these arguments have since been removed from scikit-learn; the code below uses a ColumnTransformer instead).
  • Using its .fit_transform() method, apply the OneHotEncoder to df and save the result as df_encoded. The output will be a NumPy array.
  • Print the first 5 rows of df_encoded, and then the shape of df as well as df_encoded to compare the difference.
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Create OneHotEncoder: ohe
# The exercise's OneHotEncoder(categorical_features=categorical_mask, sparse=False) call
# no longer works in recent scikit-learn versions; a ColumnTransformer selects the
# categorical columns (via the boolean mask) for one-hot encoding and passes the rest through
ct = ColumnTransformer([('my_ohe', OneHotEncoder(), categorical_mask)], remainder='passthrough')
# Apply OneHotEncoder to categorical columns - output is no longer a dataframe: df_encoded
df_encoded = ct.fit_transform(df)

# Print first 3 rows of the resulting dataset - again, this will no longer be a pandas dataframe
print(df_encoded[:3, :])

# Print the shape of the original DataFrame
print(df.shape)

# Print the shape of the transformed array
print(df_encoded.shape)
[[0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 1.000e+00 6.000e+01 6.500e+01 8.450e+03
  7.000e+00 5.000e+00 2.003e+03 0.000e+00 1.710e+03 1.000e+00 0.000e+00
  2.000e+00 1.000e+00 3.000e+00 0.000e+00 5.480e+02 2.085e+05]
 [0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 1.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 1.000e+00 2.000e+01 8.000e+01 9.600e+03
  6.000e+00 8.000e+00 1.976e+03 0.000e+00 1.262e+03 0.000e+00 1.000e+00
  2.000e+00 0.000e+00 3.000e+00 1.000e+00 4.600e+02 1.815e+05]
 [0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 1.000e+00 6.000e+01 6.800e+01 1.125e+04
  7.000e+00 5.000e+00 2.001e+03 1.000e+00 1.786e+03 1.000e+00 0.000e+00
  2.000e+00 1.000e+00 3.000e+00 1.000e+00 6.080e+02 2.235e+05]]
(1460, 21)
(1460, 62)

Encoding categorical columns III: DictVectorizer

Alright, one final trick before you dive into pipelines. The two step process you just went through - LabelEncoder followed by OneHotEncoder - can be simplified by using a DictVectorizer.

Using a DictVectorizer on a DataFrame that has been converted to a dictionary allows you to get label encoding as well as one-hot encoding in one go.

Your task is to work through this strategy in this exercise!
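
To see why this works, here is a tiny toy sketch (illustrative data, not the Ames dataset): .to_dict("records") turns each row into a dictionary, and DictVectorizer then one-hot encodes the string values while passing numeric values through unchanged.

# Toy sketch with a two-row frame (assumption: made-up illustrative data)
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

toy = pd.DataFrame({"Neighborhood": ["CollgCr", "Veenker"], "LotArea": [8450, 9600]})
toy_dict = toy.to_dict("records")
# [{'Neighborhood': 'CollgCr', 'LotArea': 8450}, {'Neighborhood': 'Veenker', 'LotArea': 9600}]

dv_toy = DictVectorizer(sparse=False)
print(dv_toy.fit_transform(toy_dict))   # one row per record, one column per feature
print(dv_toy.vocabulary_)               # e.g. {'LotArea': 0, 'Neighborhood=CollgCr': 1, ...}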

Instructions:

  • Import DictVectorizer from sklearn.feature_extraction.
  • Convert df into a dictionary called df_dict using its .to_dict() method with "records" as the argument.
  • Instantiate a DictVectorizer object called dv with the keyword argument sparse=False.
  • Apply the DictVectorizer on df_dict by using its .fit_transform() method.
  • Hit 'Submit Answer' to print the resulting first five rows and the vocabulary.
from sklearn.feature_extraction import DictVectorizer

# USE the unprocessed data for the following steps
df_raw = pd.read_csv("datasets/ames_unprocessed_data.csv",skiprows = None)
df = df_raw.copy()

# Confirm we are back to the raw (unencoded) data
display(df.info())
print(df.shape)
# Convert df into a dictionary: df_dict
df_dict = df.to_dict("records")

# Create the DictVectorizer object: dv
dv = DictVectorizer(sparse = False)

# Apply dv on df: df_encoded
df_encoded = dv.fit_transform(df_dict)

# Print the resulting first five rows
print(df_encoded[:5,:])
print(df_encoded.shape)

# Print the vocabulary
print(dv.vocabulary_)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 21 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   MSSubClass    1460 non-null   int64  
 1   MSZoning      1460 non-null   object 
 2   LotFrontage   1201 non-null   float64
 3   LotArea       1460 non-null   int64  
 4   Neighborhood  1460 non-null   object 
 5   BldgType      1460 non-null   object 
 6   HouseStyle    1460 non-null   object 
 7   OverallQual   1460 non-null   int64  
 8   OverallCond   1460 non-null   int64  
 9   YearBuilt     1460 non-null   int64  
 10  Remodeled     1460 non-null   int64  
 11  GrLivArea     1460 non-null   int64  
 12  BsmtFullBath  1460 non-null   int64  
 13  BsmtHalfBath  1460 non-null   int64  
 14  FullBath      1460 non-null   int64  
 15  HalfBath      1460 non-null   int64  
 16  BedroomAbvGr  1460 non-null   int64  
 17  Fireplaces    1460 non-null   int64  
 18  GarageArea    1460 non-null   int64  
 19  PavedDrive    1460 non-null   object 
 20  SalePrice     1460 non-null   int64  
dtypes: float64(1), int64(15), object(5)
memory usage: 239.7+ KB
None
(1460, 21)
[[3.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 0.000e+00 2.000e+00 5.480e+02 1.710e+03 1.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00
  8.450e+03 6.500e+01 6.000e+01 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 5.000e+00 7.000e+00
  0.000e+00 0.000e+00 1.000e+00 0.000e+00 2.085e+05 2.003e+03]
 [3.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  1.000e+00 1.000e+00 2.000e+00 4.600e+02 1.262e+03 0.000e+00 0.000e+00
  0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  9.600e+03 8.000e+01 2.000e+01 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 8.000e+00 6.000e+00
  0.000e+00 0.000e+00 1.000e+00 0.000e+00 1.815e+05 1.976e+03]
 [3.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 1.000e+00 2.000e+00 6.080e+02 1.786e+03 1.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00
  1.125e+04 6.800e+01 6.000e+01 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 5.000e+00 7.000e+00
  0.000e+00 0.000e+00 1.000e+00 1.000e+00 2.235e+05 2.001e+03]
 [3.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 1.000e+00 1.000e+00 6.420e+02 1.717e+03 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00
  9.550e+03 6.000e+01 7.000e+01 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 5.000e+00 7.000e+00
  0.000e+00 0.000e+00 1.000e+00 1.000e+00 1.400e+05 1.915e+03]
 [4.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 1.000e+00 2.000e+00 8.360e+02 2.198e+03 1.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00
  1.426e+04 8.400e+01 6.000e+01 0.000e+00 0.000e+00 0.000e+00 1.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 5.000e+00 8.000e+00
  0.000e+00 0.000e+00 1.000e+00 0.000e+00 2.500e+05 2.000e+03]]
(1460, 62)
{'MSSubClass': 23, 'MSZoning=RL': 27, 'LotFrontage': 22, 'LotArea': 21, 'Neighborhood=CollgCr': 34, 'BldgType=1Fam': 1, 'HouseStyle=2Story': 18, 'OverallQual': 55, 'OverallCond': 54, 'YearBuilt': 61, 'Remodeled': 59, 'GrLivArea': 11, 'BsmtFullBath': 6, 'BsmtHalfBath': 7, 'FullBath': 9, 'HalfBath': 12, 'BedroomAbvGr': 0, 'Fireplaces': 8, 'GarageArea': 10, 'PavedDrive=Y': 58, 'SalePrice': 60, 'Neighborhood=Veenker': 53, 'HouseStyle=1Story': 15, 'Neighborhood=Crawfor': 35, 'Neighborhood=NoRidge': 44, 'Neighborhood=Mitchel': 40, 'HouseStyle=1.5Fin': 13, 'Neighborhood=Somerst': 50, 'Neighborhood=NWAmes': 43, 'MSZoning=RM': 28, 'Neighborhood=OldTown': 46, 'Neighborhood=BrkSide': 32, 'BldgType=2fmCon': 2, 'HouseStyle=1.5Unf': 14, 'Neighborhood=Sawyer': 48, 'Neighborhood=NridgHt': 45, 'Neighborhood=NAmes': 41, 'BldgType=Duplex': 3, 'Neighborhood=SawyerW': 49, 'Neighborhood=IDOTRR': 38, 'PavedDrive=N': 56, 'Neighborhood=MeadowV': 39, 'BldgType=TwnhsE': 5, 'MSZoning=C (all)': 24, 'Neighborhood=Edwards': 36, 'Neighborhood=Timber': 52, 'PavedDrive=P': 57, 'HouseStyle=SFoyer': 19, 'MSZoning=FV': 25, 'Neighborhood=Gilbert': 37, 'HouseStyle=SLvl': 20, 'BldgType=Twnhs': 4, 'Neighborhood=StoneBr': 51, 'HouseStyle=2.5Unf': 17, 'Neighborhood=ClearCr': 33, 'Neighborhood=NPkVill': 42, 'HouseStyle=2.5Fin': 16, 'Neighborhood=Blmngtn': 29, 'Neighborhood=BrDale': 31, 'Neighborhood=SWISU': 47, 'MSZoning=RH': 26, 'Neighborhood=Blueste': 30}

Preprocessing within a pipeline

Now that you've seen what steps need to be taken individually to properly process the Ames housing data, let's use the much cleaner and more succinct DictVectorizer approach and put it alongside an XGBRegressor inside of a scikit-learn pipeline.

Instructions:

  • Import DictVectorizer from sklearn.feature_extraction and Pipeline from sklearn.pipeline.
  • Fill in any missing values in the LotFrontage column of X with 0.
  • Complete the steps of the pipeline with DictVectorizer(sparse=False) for "ohe_onestep" and xgb.XGBRegressor() for "xgb_model".
  • Create the pipeline using Pipeline() and steps.
  • Fit the Pipeline. Don't forget to convert X into a format that DictVectorizer understands by calling the to_dict("records") method on X.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline 
from sklearn.model_selection import cross_val_score

# import xgb
import xgboost as xgb
data = df_raw.copy()
X, y = data.iloc[:,:-1], data.iloc[:,-1]
# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)

# Setup the pipeline steps: steps
steps = [("ohe_onestep", DictVectorizer(sparse = False)),
         ("xgb_model", xgb.XGBRegressor())]

# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)

# The exercise only asks to fit the pipeline; cross-validating below gives an RMSE estimate instead
# xgb_pipeline.fit(X.to_dict("records"), y)

xgb_scores = cross_val_score(xgb_pipeline,X.to_dict("records"),y, scoring="neg_mean_squared_error",cv=10)

final_avg_rmse = np.mean(np.sqrt(np.abs(xgb_scores)))
print("Final RMSE:", final_avg_rmse)
Final RMSE: 28282.433580247784

Incorporating XGBoost into pipelines

Cross-validating your XGBoost model

In this exercise, you'll go one step further by using the pipeline you've created to preprocess and cross-validate your model.

Instructions

  • Create a pipeline called xgb_pipeline using steps.
  • Perform 10-fold cross-validation using cross_val_score(). You'll have to pass in the pipeline, X (as a dictionary, using .to_dict("records")), y, the number of folds you want to use, and scoring ("neg_mean_squared_error").
  • Print the 10-fold RMSE.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X.LotFrontage = X.LotFrontage.fillna(0)

# Setup the pipeline steps: steps
steps = [   ("ohe_onestep", DictVectorizer(sparse = False)),
            ("xgb_model", xgb.XGBRegressor(max_depth=2, objective="reg:squarederror"))]

# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)

# cross-validate the model
scores = cross_val_score(xgb_pipeline,X.to_dict("records"),y,scoring="neg_mean_squared_error",cv=10)

print("10-fold RMSE:", np.mean(np.sqrt(np.abs(scores))))
10-fold RMSE: 27683.04157118635

Kidney disease case study I: Categorical Imputer

You'll now continue your exploration of using pipelines with a dataset that requires significantly more wrangling. The chronic kidney disease dataset contains both categorical and numeric features, but has many missing values. The goal here is to predict who has chronic kidney disease given various blood indicators as features.

As Sergey mentioned in the video, you'll be introduced to a new library, sklearn_pandas, that allows you to chain many more processing steps inside of a pipeline than are currently supported in scikit-learn. Specifically, you'll be able to impute missing categorical values directly using the CategoricalImputer() class in sklearn_pandas, and use the DataFrameMapper() class to apply any arbitrary sklearn-compatible transformer on DataFrame columns, where the resulting output can be either a NumPy array or a DataFrame.

We've also created a transformer called a Dictifier that encapsulates converting a DataFrame using .to_dict("records") without you having to do it explicitly (and so that it works in a pipeline); its definition appears in the full-pipeline section below. Finally, we've provided the list of feature names in kidney_feature_names, the target name in kidney_target_name, the features in X, and the target in y.

In this exercise, your task is to apply the CategoricalImputer to impute all of the categorical columns in the dataset. You can refer to how the numeric imputation mapper was created as a template. Notice the keyword arguments input_df=True and df_out=True? This is so that you can work with DataFrames instead of arrays. By default, the transformers are passed a numpy array of the selected columns as input, and as a result, the output of the DataFrame mapper is also an array. Scikit-learn transformers have historically been designed to work with numpy arrays, not pandas DataFrames, even though their basic indexing interfaces are similar.

Instructions:

  • Apply the categorical imputer using DataFrameMapper() and CategoricalImputer(). CategoricalImputer() does not need any arguments to be passed in. The columns are contained in categorical_columns. Be sure to specify input_df=True and df_out=True, and use category_feature as your iterator variable in the list comprehension.
# import pandas as pd

# # data = arff.loadarff('datasets/Chronic_Kidney_Disease/chronic_kidney_disease_full.arff')
# # df = pd.DataFrame(data[0])
# # df.info()


# # read the kidney disease data
# df = pd.read_csv("datasets/kidney_disease.csv")
# df = df.drop(['id'], axis=1)
# print(df["classification"].unique())
# print(df["classification"].value_counts())
# df["classification"] = df["classification"].replace(['ckd\t'],'ckd')
# print(df["classification"].value_counts())

# df['classification'] = df['classification'].replace(['notckd','ckd'],[0,1])


# X, y = df.iloc[:,:-1], df.iloc[:,-1]

# print(df.info())

# #check number of nulls in each column
# nulls_per_column = X.isnull().sum()
# print(nulls_per_column)

# # create a boolean mask for categorical columns
# categorical_mask = (X.dtypes == object)

# # get list of categorical column names
# categorical_columns = X.columns[categorical_mask].tolist()

# # get list of non-categorical column names
# non_categorical_columns = X.columns[~categorical_mask].tolist()

# # apply numeric imputer to non-categorical columns
# from sklearn_pandas import DataFrameMapper
# from sklearn.impute import SimpleImputer




# # Apply numeric imputer
# numeric_imputation_mapper = DataFrameMapper(
#                                             [([numeric_feature], SimpleImputer(strategy="median")) for numeric_feature in non_categorical_columns],
#                                             input_df=True,
#                                             df_out=True
#                                            )

# # Apply categorical imputer
# categorical_imputation_mapper = DataFrameMapper(
#                                                 [(category_feature, SimpleImputer(strategy="most_frequent")) for category_feature in categorical_columns],
#                                                 input_df=True,
#                                                 df_out=True
#                                                )
['ckd' 'ckd\t' 'notckd']
ckd       248
notckd    150
ckd\t       2
Name: classification, dtype: int64
ckd       250
notckd    150
Name: classification, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             391 non-null    float64
 1   bp              388 non-null    float64
 2   sg              353 non-null    float64
 3   al              354 non-null    float64
 4   su              351 non-null    float64
 5   rbc             248 non-null    object 
 6   pc              335 non-null    object 
 7   pcc             396 non-null    object 
 8   ba              396 non-null    object 
 9   bgr             356 non-null    float64
 10  bu              381 non-null    float64
 11  sc              383 non-null    float64
 12  sod             313 non-null    float64
 13  pot             312 non-null    float64
 14  hemo            348 non-null    float64
 15  pcv             330 non-null    object 
 16  wc              295 non-null    object 
 17  rc              270 non-null    object 
 18  htn             398 non-null    object 
 19  dm              398 non-null    object 
 20  cad             398 non-null    object 
 21  appet           399 non-null    object 
 22  pe              399 non-null    object 
 23  ane             399 non-null    object 
 24  classification  400 non-null    int64  
dtypes: float64(11), int64(1), object(13)
memory usage: 78.2+ KB
None
age        9
bp        12
sg        47
al        46
su        49
rbc      152
pc        65
pcc        4
ba         4
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       70
wc       105
rc       130
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
dtype: int64
X = pd.read_csv('datasets/chronic_kidney_X.csv')
y = pd.read_csv('datasets/chronic_kidney_y.csv').to_numpy().ravel()

#check number of nulls in each column
nulls_per_column = X.isnull().sum()
print(nulls_per_column)

print(X.info())
print(X["htn"].unique())
print(X["htn"].value_counts())
# X["htn"] = X["htn"].replace(['ckd\t'],'ckd')
# print(y["classification"].value_counts())

# Create a boolean mask for categorical columns
categorical_feature_mask = X.dtypes == object
# Get list of categorical column names
categorical_columns = X.columns[categorical_feature_mask].tolist()


from sklearn.impute import SimpleImputer
from sklearn import preprocessing

# Impute each categorical column with its most frequent value,
# then label-encode it so that every column becomes numeric
for cat_col in categorical_columns:
    imp = SimpleImputer(strategy="most_frequent")
    X[cat_col] = imp.fit_transform(X[[cat_col]]).ravel()

    le = preprocessing.LabelEncoder()
    X[cat_col] = le.fit_transform(X[cat_col])


nulls_per_column = X.isnull().sum()
print(nulls_per_column)
age        9
bp        12
sg        47
al        46
su        49
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       71
wc       106
rc       131
rbc      152
pc        65
pcc        4
ba         4
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 24 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     391 non-null    float64
 1   bp      388 non-null    float64
 2   sg      353 non-null    float64
 3   al      354 non-null    float64
 4   su      351 non-null    float64
 5   bgr     356 non-null    float64
 6   bu      381 non-null    float64
 7   sc      383 non-null    float64
 8   sod     313 non-null    float64
 9   pot     312 non-null    float64
 10  hemo    348 non-null    float64
 11  pcv     329 non-null    float64
 12  wc      294 non-null    float64
 13  rc      269 non-null    float64
 14  rbc     248 non-null    object 
 15  pc      335 non-null    object 
 16  pcc     396 non-null    object 
 17  ba      396 non-null    object 
 18  htn     398 non-null    object 
 19  dm      398 non-null    object 
 20  cad     398 non-null    object 
 21  appet   399 non-null    object 
 22  pe      399 non-null    object 
 23  ane     399 non-null    object 
dtypes: float64(14), object(10)
memory usage: 75.1+ KB
None
['yes' 'no' nan]
no     251
yes    147
Name: htn, dtype: int64
age        9
bp        12
sg        47
al        46
su        49
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       71
wc       106
rc       131
rbc        0
pc         0
pcc        0
ba         0
htn        0
dm         0
cad        0
appet      0
pe         0
ane        0
dtype: int64
from sklearn.impute import SimpleImputer
# import standard scaler, one hot encoder and column transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Check number of nulls in each feature columns
nulls_per_column = X.isnull().sum()
print(nulls_per_column)

# Create a boolean mask for categorical columns
categorical_feature_mask = X.dtypes == object

# Get list of categorical column names
categorical_columns = X.columns[categorical_feature_mask].tolist()
# Get list of non-categorical column names
non_categorical_columns = X.columns[~categorical_feature_mask].tolist()
# convert for transformer
numeric_features, categorical_features = non_categorical_columns, categorical_columns


numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())])
categorical_transformer =Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")), ("ohe", OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
age        9
bp        12
sg        47
al        46
su        49
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       71
wc       106
rc       131
rbc        0
pc         0
pcc        0
ba         0
htn        0
dm         0
cad        0
appet      0
pe         0
ane        0
dtype: int64
# imputer = SimpleImputer(strategy='constant', fill_value='missing')

# df['rbc'] = imputer.fit_transform(df['rbc'].values.reshape(-1,1))[:,0]

# df['rbc']
categorical_transformer
Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                ('ohe', OneHotEncoder())])

Kidney disease case study II: Feature Union

Having separately imputed numeric as well as categorical columns, your task is now to use scikit-learn's FeatureUnion to concatenate their results, which are contained in two separate transformer objects - numeric_imputation_mapper, and categorical_imputation_mapper, respectively.

You may have already encountered FeatureUnion in Machine Learning with the Experts: School Budgets. Just like with pipelines, you have to pass it a list of (string, transformer) tuples, where the first half of each tuple is the name of the transformer.

  • Instructions

    • Import FeatureUnion from sklearn.pipeline.
    • Combine the results of numeric_imputation_mapper and categorical_imputation_mapper using FeatureUnion(), with the names "num_mapper" and "cat_mapper" respectively.
# from sklearn.pipeline import FeatureUnion

# # Combine the numeric and categorical transformations
# numeric_categorical_union = FeatureUnion([
#                                           ("num_mapper", numeric_imputation_mapper),
#                                           ("cat_mapper", categorical_imputation_mapper)
#                                          ])
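
Since the sklearn_pandas mappers above are left commented out, here is a minimal runnable sketch of the same FeatureUnion pattern built from scikit-learn pieces only. The make_selector helper is an assumption added for illustration (it is not part of the exercise), and it presumes non_categorical_columns and categorical_columns are non-empty lists of column names.

# Minimal sketch of the FeatureUnion pattern without sklearn_pandas
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer

def make_selector(cols):
    # hypothetical helper: a transformer that keeps only the requested DataFrame columns
    return FunctionTransformer(lambda df: df[cols])

numeric_categorical_union = FeatureUnion([
    ("num_mapper", Pipeline([("select", make_selector(non_categorical_columns)),
                             ("impute", SimpleImputer(strategy="median"))])),
    ("cat_mapper", Pipeline([("select", make_selector(categorical_columns)),
                             ("impute", SimpleImputer(strategy="most_frequent"))]))
])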

Kidney disease case study III: Full pipeline

It's time to piece together all of the transforms along with an XGBClassifier to build the full pipeline!

Besides the numeric_categorical_union that you created in the previous exercise, there are two other transforms needed: the Dictifier() transform which we created for you, and the DictVectorizer().

After creating the pipeline, your task is to cross-validate it to see how well it performs.

  • Instructions

    • Create the pipeline using the numeric_categorical_union, Dictifier(), and DictVectorizer(sort=False) transforms, and xgb.XGBClassifier() estimator with max_depth=3. Name the transforms "featureunion", "dictifier", "vectorizer", and the estimator "clf" (a sketch of this exact pipeline appears after the cross-validation cell below).
    • Perform 3-fold cross-validation on the pipeline using cross_val_score(). Pass it the pipeline, pipeline, the features, kidney_data, the outcomes, y. Also set scoring to "roc_auc" and cv to 3.
from sklearn.base import BaseEstimator, TransformerMixin

# Define Dictifier class to turn df into dictionary as part of pipeline
class Dictifier(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # DictVectorizer expects a list of per-row dictionaries
        if isinstance(X, pd.DataFrame):
            return X.to_dict("records")
        else:
            return pd.DataFrame(X).to_dict("records")
from sklearn.feature_extraction import DictVectorizer
# import Pipeline
from sklearn.pipeline import Pipeline
# import xgb
import xgboost as xgb
import numpy as np
# cross-validation
from sklearn.model_selection import cross_val_score

# Create the full pipeline. The exercise's featureunion/dictifier/vectorizer steps are replaced
# here by the ColumnTransformer preprocessor built earlier; eval_metric and use_label_encoder
# are set explicitly only to silence XGBoost warnings (the results were unchanged)
pipeline = Pipeline([
                     ("preprocessor", preprocessor),
                    #  ("dictifier", Dictifier()),
                    #  ("vectorizer", DictVectorizer(sort=False)),
                     ("clf", xgb.XGBClassifier( eval_metric='rmse',use_label_encoder=False,max_depth=3))
                    ])

# Perform cross-validation
cross_val_scores = cross_val_score(pipeline, X, y , scoring="roc_auc", cv=3)

# Print avg. AUC
print("3-fold AUC: ", np.mean(cross_val_scores))
3-fold AUC:  0.998237712755785
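
For reference, here is a sketch of the pipeline exactly as the exercise describes it, assuming the numeric_categorical_union from the FeatureUnion exercise had been built (it is left commented out above); the live code above uses the ColumnTransformer preprocessor instead.

# Sketch of the exercise's intended pipeline (assumes numeric_categorical_union exists)
pipeline_as_described = Pipeline([
    ("featureunion", numeric_categorical_union),
    ("dictifier", Dictifier()),
    ("vectorizer", DictVectorizer(sort=False)),
    ("clf", xgb.XGBClassifier(max_depth=3))
])
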
from sklearn.model_selection import RandomizedSearchCV

# Create the parameter grid; the "clf__" prefix routes each hyperparameter
# to the "clf" step (the XGBClassifier) inside the pipeline
gbm_param_grid = {
    'clf__learning_rate': np.arange(0.05, 1, 0.05),
    'clf__max_depth': np.arange(3, 10, 1),
    'clf__n_estimators': np.arange(50, 200, 50)
}

# Perform RandomizedSearchCV
randomized_roc_auc = RandomizedSearchCV(pipeline, param_distributions=gbm_param_grid, scoring='roc_auc', n_iter=20, cv=2, verbose=1)

# Fit the estimator
randomized_roc_auc.fit(X,y)

# Compute metrics
print(randomized_roc_auc.best_score_)
print(randomized_roc_auc.best_estimator_)
Fitting 2 folds for each of 20 candidates, totalling 40 fits
0.9975466666666666
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'bp', 'sg', 'al',
                                                   'su', 'bgr', 'bu', 'sc',
                                                   'sod', 'pot', 'hemo', 'pcv',
                                                   'wc', 'rc', 'rbc', 'pc',
                                                   'pcc', 'ba', 'htn', 'dm',
                                                   'cad', 'appet', 'pe',
                                                   'ane']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(st...
                               importance_type=None, interaction_constraints='',
                               learning_rate=0.5, max_delta_step=0, max_depth=9,
                               min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=100,
                               n_jobs=8, num_parallel_tree=1, predictor='auto',
                               random_state=0, reg_alpha=0, reg_lambda=1,
                               scale_pos_weight=1, subsample=1,
                               tree_method='exact', use_label_encoder=False,
                               validate_parameters=1, verbosity=None))])

Tuning XGBoost hyperparameters

Bringing it all together

Final Thoughts