Simple Understanding of Generalized Ensemble / Stacking with Cross-Validation

I recently came across an excellent blog post called the Kaggle Ensembling Guide. Based on the explanations and code in that post, I have put together a simple walkthrough of the generalized stacking algorithm.

Let's say we have a small training set of 12 rows, with two independent variables and one dependent variable, and a test set of 4 observations.

We decide to do 3-fold cross-validation (an arbitrary choice). For the sake of simplicity, we choose 2-layer stacking with 2 base models and 1 meta-learner.

The algorithm simulation proceeds roughly as follows.

The "Steps" mentioned below match the titles of the simulation images.

  • Steps 1-3: For Model 1 and the k-th split of the training data, fit the model on all but the k-th fold and predict the training column for the k-th fold.
    • Also build a test column for the full test data using the same fitted model at each step.
  • Step 4: Repeat this for all folds so that the training predictions are complete, then average the per-fold test columns into a single column for Model 1.
  • Steps 5-8: Repeat steps 1-4 for Model 2, building its test predictions at each fold in the same way and averaging the predicted columns at the end into the final test predictor for Model 2. In general, the test data ends up with as many predictor columns as there are base models.
    • The train and test data will now each have 2 predictor columns (XM1 and XM2), corresponding to Model 1 and Model 2.
  • Step 9: Apply the next level of stacking by fitting Model 3 on the full training data using only the derived columns (Y ~ XM1 + XM2), then use this model to predict on the test data's derived columns, giving the overall final prediction. A minimal sketch of the whole procedure follows this list.
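
To make these steps concrete, here is a minimal sketch of the procedure on the toy setup above (12 rows, 2 features, 3-fold CV, 2 base models, 1 meta-learner). The random data, the regression target, and the specific models (a linear regression and a shallow decision tree) are placeholder assumptions for illustration only. Note that the train columns are filled with out-of-fold predictions; this keeps the meta-learner from seeing predictions a base model made for rows it was trained on, which would leak the target.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_train, y_train = rng.rand(12, 2), rng.rand(12)  # toy train: 12 rows, 2 features
X_test = rng.rand(4, 2)                           # toy test: 4 rows

base_models = [LinearRegression(), DecisionTreeRegressor(max_depth=2)]  # Model 1, Model 2
meta_model = LinearRegression()                                         # Model 3

kf = KFold(n_splits=3)
train_meta = np.zeros((len(X_train), len(base_models)))  # XM1, XM2 for train
test_meta = np.zeros((len(X_test), len(base_models)))    # XM1, XM2 for test

for j, model in enumerate(base_models):
    test_cols = np.zeros((len(X_test), kf.get_n_splits()))
    for i, (tr_idx, val_idx) in enumerate(kf.split(X_train)):
        model.fit(X_train[tr_idx], y_train[tr_idx])
        # Steps 1-3 / 5-7: out-of-fold predictions fill the train column for model j
        train_meta[val_idx, j] = model.predict(X_train[val_idx])
        # ...and the same fitted model predicts the full test set
        test_cols[:, i] = model.predict(X_test)
    # Steps 4 / 8: average the per-fold test columns into one column per model
    test_meta[:, j] = test_cols.mean(axis=1)

# Step 9: fit the meta-learner on the derived columns and predict the test set
meta_model.fit(train_meta, y_train)
final_prediction = meta_model.predict(test_meta)
print(final_prediction)  # 4 final predictions, one per test row
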

 

Example: Kaggle TUT Head Pose Estimation Challenge

The challenge is to predict the gaze direction of the eyes (an angle for each eye) from the given data, so there are 2 Y variables. I have written sample code using the 2-level stacking architecture discussed above. The only differences are:

  • There are 2 Y variables, hence the process of modelling and prediction is repeated twice
  • I have used 5 models in Level 1 and 1 model in Level 2
  • K = 10

import os

import numpy as np
import pandas as pd
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import StratifiedKFold

cwd = os.getcwd()

loc_train_x = cwd + "/Input CSVs/tut-head-pose-estimation-challenge/X_train.csv"
loc_train_y = cwd + "/Input CSVs/tut-head-pose-estimation-challenge/y_train.csv"
loc_test = cwd + "/Input CSVs/tut-head-pose-estimation-challenge/X_test.csv"
loc_sample_submission = cwd + "/Input CSVs/tut-head-pose-estimation-challenge/sample_submission.csv"

X_train = pd.read_csv(loc_train_x)
y_train = pd.read_csv(loc_train_y)
X_test = pd.read_csv(loc_test)
X_submission = pd.read_csv(loc_sample_submission)

feature_cols = [col for col in X_train.columns if col not in ["Id"]]
X = X_train[feature_cols].values
y1 = y_train["Angle1"].values
y2 = y_train["Angle2"].values
X_test = X_test[feature_cols].values

# Level 1: five base models
clfs = [RandomForestClassifier(n_estimators=500, n_jobs=-1, criterion='gini'),
        RandomForestClassifier(n_estimators=500, n_jobs=-1, criterion='entropy'),
        ExtraTreesClassifier(n_estimators=500, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=500, n_jobs=-1, criterion='entropy'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50)]

# Cross-validation setup
n_folds = 10
shuffle = False
skf = StratifiedKFold(n_splits=n_folds, shuffle=shuffle)

# Derived predictor columns: one per base model, for train and test
dataset_blend_train = np.zeros((X.shape[0], len(clfs)))
dataset_blend_test = np.zeros((X_test.shape[0], len(clfs)))

loc_temp_save_base = cwd + "/Input CSVs/tut-head-pose-estimation-challenge/"

# Intermediate CSVs are logged below in case someone wants to see exactly how
# the stacking algorithm builds up its columns, step by step.

# There are 2 Y variables to predict
y_list = [y1, y2]

for e, y in enumerate(y_list):
    for j, clf in enumerate(clfs):
        print(j, clf)
        # One test-prediction column per fold; averaged after the CV loop
        dataset_blend_test_j = np.zeros((X_test.shape[0], skf.n_splits))
        for i, (train_index, test_index) in enumerate(skf.split(X, y)):
            print("Fold", i)
            X_tr, X_val = X[train_index], X[test_index]
            y_tr, y_val = y[train_index], y[test_index]
            clf.fit(X_tr, y_tr)
            # Out-of-fold predictions fill the train column for model j
            dataset_blend_train[test_index, j] = clf.predict(X_val)
            # Temp logging
            loc_temp_save = (loc_temp_save_base + "dataset_blend_train_"
                             + str(clf).split("(")[0] + str(j) + str(i) + ".csv")
            pd.DataFrame(dataset_blend_train).to_csv(loc_temp_save)

            # The same fitted model predicts the full test set
            dataset_blend_test_j[:, i] = clf.predict(X_test)
            # Temp logging
            loc_temp_save = (loc_temp_save_base + "dataset_blend_test_j_"
                             + str(clf).split("(")[0] + str(j) + str(i) + ".csv")
            pd.DataFrame(dataset_blend_test_j).to_csv(loc_temp_save)
        # Average the per-fold test columns into a single column for model j
        dataset_blend_test[:, j] = dataset_blend_test_j.mean(axis=1)
        # Temp logging
        loc_temp_save = (loc_temp_save_base + "dataset_blend_test_"
                         + str(clf).split("(")[0] + str(j) + ".csv")
        pd.DataFrame(dataset_blend_test).to_csv(loc_temp_save)

    # Level 2: fit the meta-learner on the derived columns
    print("Level 2 Fitting")
    clf = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, criterion='entropy')
    clf.fit(dataset_blend_train, y)
    y_submission = clf.predict(dataset_blend_test)

    # Assign predictions to the submission dataframe
    # (column 0 is Id, so target e goes to column e + 1)
    X_submission.iloc[:, e + 1] = y_submission

# Submission file
loc_actual_submission = cwd + "/Input CSVs/tut-head-pose-estimation-challenge/actual_submission.csv"
X_submission.to_csv(loc_actual_submission, index=False)

 

This submission scores an error of close to 5.5, which can easily put you in the top 15. Please let me know if you found this useful.

Reach out to me at: parijat@datatreeresearch.com
