Simple Model Stacking, Explained And Automated

An overview of Model Stacking

In model stacking, we don’t use one single model to make our predictions — instead, we make predictions with several different models, and then use those predictions as features for a higher-level meta model. It can work especially well with varied types of lower-level learners, all contributing different strengths to the meta model. Model stacks can be built in many ways, and there isn’t one “correct” way to use stacking. It can be made more complex than today’s example with multiple levels, weights, averaging, etc. The basic model stack we will make today looks like this:

Model stacking with original training features — Image by Author

In our stack, we will make non-leaky predictions on our train data using a series of intermediary models, and then use those as features in conjunction with the original training features on a meta model.
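For orientation, here is a minimal sketch of the same idea using scikit-learn's built-in StackingRegressor rather than the hand-rolled functions built in this article. The synthetic data, the choice of base models, and the variable names are assumptions for illustration only. Setting passthrough=True is what hands the original features to the meta model alongside the base models' predictions.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: not the housing data used in this article)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

stack = StackingRegressor(
    estimators=[
        ("linreg", LinearRegression()),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=Ridge(),  # the meta model
    cv=5,                     # base-model predictions for the meta model are made out-of-fold
    passthrough=True,         # also pass the original features to the meta model
)
stack.fit(X_demo, y_demo)
demo_preds = stack.predict(X_demo[:5])
```

The built-in class does not choose the meta model or the stack members for you, which is the part the rest of this article automates.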

If this sounds complicated, don’t be deterred. I’ll show you a painless way to automate this process, including selecting the best meta model, selecting the best stack models, putting all of that data together, and making the final predictions on your test data.

Today we are working with a Regression problem using the King County Housing dataset located on Kaggle.

Starting Positions

Sanity check: This article doesn’t cover any preprocessing or model tuning. Your dataset should be clean and ready to use, and your potential models should be hyperparameter-tuned if desired. You should also have your data split into a train/test set or a train/validate/test set. This writeup also assumes basic familiarity with cross-validation.

We start by instantiating our tuned potential models with any optional hyperparameters. The key to a fun and successful stack is the more the merrier: try as many models as you want! You can save time by running a cross-validated spot check on all of your potential models beforehand and dropping any that are clearly uninformative, though you might be surprised how much a poor base model can contribute to a stack. Every potential model you add takes extra time to fit and score, but there is NO limit to the number of models you can try, and the key to our tutorial is that we will ultimately use only the best ones, without having to select anything manually.
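A cross-validated spot check could look something like the sketch below. The synthetic data and the candidate list are assumptions for illustration; swap in your own train set and models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: replace with your own X_train / y_train)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=150)

candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}

spot_check = {}
for name, model in candidates.items():
    # the scorer returns negated MAE, so flip the sign to get a plain MAE
    cv_scores = cross_val_score(model, X_demo, y_demo, cv=5,
                                scoring="neg_mean_absolute_error")
    spot_check[name] = -cv_scores.mean()

# list the candidates from lowest (best) to highest MAE
for name, mae in sorted(spot_check.items(), key=lambda kv: kv[1]):
    print(f"{name}: MAE {mae:.3f}")
```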

For our example I’ll instantiate only five potential models. Here we go:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
import xgboost as xgb

# randomstate is used throughout this article; any fixed integer works (42 is an assumption)
randomstate = 42

svr = SVR(gamma = 'scale', kernel = 'linear', C=10, epsilon=.05)

# normalize=False is omitted here because recent scikit-learn releases no longer accept the argument
ridge = Ridge(random_state = randomstate, tol=1e-3, solver='auto')

neighbor = KNeighborsRegressor(n_neighbors = 11)

linreg = LinearRegression()

xgbr = xgb.XGBRegressor(n_estimators=1000, eval_metric='mae', max_depth = 7,eta = .1, min_child_weight = 5, colsample_bytree = .4, reg_lambda = 50)

Next, we’re going to create empty lists to store each potential’s training set predictions (which we’ll obtain in a non-leaky fashion). You’ll need a list for each potential model, and a consistent naming convention will help stay organized.

svr_yhat, ridge_yhat, neighbor_yhat, linreg_yhat, xgbr_yhat = [], [], [], [], []

Finally, we’re putting the instantiated models and their empty prediction lists into a storage dictionary, which we’ll be using with our various stack functions. The format of each dictionary entry is “label” : [model instance, prediction list], like so:

models_dict = {'SVR' : [svr, svr_yhat], 
                'Ridge' : [ridge, ridge_yhat],  
                'KNN' : [neighbor, neighbor_yhat], 
                'Linear Regression' : [linreg, linreg_yhat], 
                'XGB' : [xgbr, xgbr_yhat]}

Sanity check: Prepare your train/test sets in array format for the following functions. Your X features should be in an array shape of (n,m) where n is # of samples and m is # of features, and your y targets should be an array of (n, ). If they are dataframes, convert them with np.array(df)
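As a quick illustration of that conversion, here is a small sketch. The column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frames standing in for your own features and target
features_df = pd.DataFrame({"sqft": [1180, 2570, 770], "bedrooms": [3, 3, 2]})
target_series = pd.Series([12.31, 13.20, 12.10], name="log_price")

X_arr = np.array(features_df)          # shape (n, m)
y_arr = np.array(target_series).ravel()  # shape (n, )

print(X_arr.shape)
print(y_arr.shape)
```

The ravel() call is a guard: if your target arrives as a single-column dataframe, np.array alone would give shape (n, 1) rather than (n, ).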

Getting Out-Of-Fold Predictions

What does it mean to have an out-of-fold or out-of-sample prediction? In model stacking, we use predictions made on the train data itself in order to train the meta model. The way to properly include these predictions is by dividing our train data into folds, just like with cross-validation, and predicting each fold with a model trained only on the remaining folds. In this way we will have a full set of predictions for our train data but without any data leakage, which is what would occur were we to simply train and then predict on the same set.
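To make the idea concrete, here is a tiny sketch on synthetic data (the data and names are assumptions). It checks that every training row receives exactly one prediction, and that the prediction comes from a model that never saw that row. Unlike the function used in this article, it writes predictions back in the original row order instead of re-ordering the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
y_demo = X_demo @ np.array([1.0, -1.0]) + rng.normal(scale=0.1, size=20)

oof_preds = np.full(len(y_demo), np.nan)          # one slot per training row
times_predicted = np.zeros(len(y_demo), dtype=int)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_ix, test_ix in kfold.split(X_demo):
    # fit only on the other folds, then predict the held-out fold
    model = LinearRegression().fit(X_demo[train_ix], y_demo[train_ix])
    oof_preds[test_ix] = model.predict(X_demo[test_ix])
    times_predicted[test_ix] += 1
```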

Here is our first function, for getting our out-of-fold predictions. This function was adapted from the great out-of-fold predictions tutorial at Machine Learning Mastery:

from sklearn.model_selection import KFold

def train_oof_predictions(x, y, models, verbose=True):
    '''Function to perform Out-Of-Fold predictions on train data
    returns re-ordered predictors x, re-ordered target y, and model dictionary with filled prediction lists
    Parameters:
    x: training predictors
    y: training targets
    models: dictionary of models in form of model name : [instantiated model, predictions list]
    verbose: if True, prints status update as the function works
    '''
    
    # instantiate a KFold with 10 splits
    kfold = KFold(n_splits=10, shuffle=True, random_state=randomstate)
    
    # prepare lists to hold the re-ordered x and y values
    data_x, data_y  = [], []
    
    # run the following block for each of the 10 kfold splits
    for train_ix, test_ix in kfold.split(x, y):
    
        if verbose: print("\nStarting a new fold\n")
    
        if verbose: print("Creating splits")
        #create this fold's training and test sets
        train_X, test_X = x[train_ix], x[test_ix] 
        train_y, test_y = y[train_ix], y[test_ix]
    
        if verbose: print("Adding x and y to lists\n")
        # add the data that is used in this fold to the re-ordered lists
        data_x.extend(test_X)
        data_y.extend(test_y)
    
        # run each model on this kfold and add the predictions to the model's running predictions list
        for item in models:
            
            label = item # get label for reporting purposes
            model = models[item][0] # get the model to use on the kfold
        
            # fit and make predictions 
            if verbose: print("Running",label,"on this fold")
            model.fit(train_X, train_y) # fit to the train set for the kfold
            predictions = model.predict(test_X) # predict on the out-of-fold set
            models[item][1].extend(predictions) # add predictions to the model's running predictions list
    
    return data_x, data_y, models

Now we’re ready to get the out-of-fold predictions, using the model dictionary that we made earlier. The function defaults to verbose and will give status updates about its progress. Keep in mind that getting OOF predictions can take a long time if you have a large dataset or a lot of models to try in your stack!

Run the out-of-fold predictions function:

data_x, data_y, trained_models = train_oof_predictions(X_train, y_train, models_dict)

Sanity check: Check for consistent output from this out-of-fold function. All of the yhats in the dictionaries should come back as plain lists of numbers, without any arrays.

We now have a data_x and data_y, which are the same data as our X_train and y_train, but re-ordered to match the order of the potentials’ yhat predictions. Our returned trained_models dictionary has yhat predictions for the entire train set, for each potential model.
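If you’d rather automate that sanity check than eyeball it, something like the following hypothetical helper would do. check_oof_output is not part of the original code; it assumes the same dictionary layout of label : [model, prediction list].

```python
import numpy as np

def check_oof_output(data_y, models):
    """Raise ValueError unless every prediction list is a flat list of numbers aligned with data_y."""
    n_rows = len(data_y)
    for label, (_, yhat) in models.items():
        if not isinstance(yhat, list):
            raise ValueError(f"{label}: expected a plain list, got {type(yhat).__name__}")
        if len(yhat) != n_rows:
            raise ValueError(f"{label}: {len(yhat)} predictions for {n_rows} rows")
        # each entry should be a scalar, not a nested array
        if any(np.ndim(value) != 0 for value in yhat):
            raise ValueError(f"{label}: found array-like entries instead of plain numbers")
    return True

# Toy dictionary in the same layout (model instances replaced by None)
demo_models = {"A": [None, [0.1, 0.2, 0.3]], "B": [None, [1.0, 2.0, 3.0]]}
demo_ok = check_oof_output([0.0, 0.5, 1.0], demo_models)
```

With the real objects, the call would be check_oof_output(data_y, trained_models).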

Running the Stack Selector

Time for the Stack Selector. Here’s our next function. This one was based on the forward-backward feature selector written by David Dale (listed in the references):

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate

def model_selector(X, y, meta_model, models_dict, model_label, verbose=True):
    
    """ 
    Perform a forward model selection based on MAE improvement
    Parameters:
        X - baseline X_train with all features
        y - baseline y_train with all targets
        meta_model - meta_model to be trained
        models_dict - dictionary of models in format of model name : [model object, out-of-fold predictions]
        model_label - the label for the current meta model
        verbose - whether to print the sequence of inclusions (True recommended)
    Returns: list of selected models, best MAE (negated by the scorer, so higher is better)
    """

    print("\n\nRunning model selector for ", model_label)
    included_models = []
     
    while True:
        changed=False
        
        # forward step
        
        if verbose: print("\nNEW ROUND - Setting up score charts")
        excluded_models = list(set(models_dict.keys())-set(included_models)) # make a list of the current excluded_models
        if verbose: print("Included models: {}".format(included_models))
        if verbose: print("Excluded models: {}".format(excluded_models))
        new_mae = pd.Series(index=excluded_models, dtype=float) # make a series where the index is the current excluded_models
        
        current_meta_x = np.array(X)
        
        if len(included_models) > 0:
            for included in included_models:
                included_yhat = np.array(models_dict[included][1]).reshape(-1, 1) # column of this model's out-of-fold predictions
                current_meta_x = np.hstack((current_meta_x, included_yhat))

        # score the current model
        scores = cross_validate(meta_model, current_meta_x, y, cv=5, n_jobs=-1, scoring=('neg_mean_absolute_error'))
        starting_mae = round(scores['test_score'].mean(),3)
        if verbose: print("Starting mae: {}\n".format(starting_mae))
        
       
        for excluded in excluded_models:  # for each item in the excluded_models list:
            
            new_yhat = np.array(models_dict[excluded][1]).reshape(-1, 1) # get the current item's predictions
            meta_x = np.hstack((current_meta_x, new_yhat)) # add the predictions to the meta set

            
            # score the current item
            scores = cross_validate(meta_model, meta_x, y, cv=5, n_jobs=-1, scoring=('neg_mean_absolute_error'))
            mae = round(scores['test_score'].mean(),3)
            if verbose: print("{} score: {}".format(excluded, mae))
            
            new_mae[excluded] = mae # append the mae to the series field
        
        best_mae = new_mae.max() # evaluate best mae of the excluded_models in this round
        if verbose: print("Best mae: {}\n".format(best_mae))
        
        if best_mae > starting_mae:  # if the best mae is better than the initial mae
            best_feature = new_mae.idxmax()  # define this as the new best feature
            included_models.append(str(best_feature)) # append this model name to the included list
            changed=True # flag that we changed it
            if verbose: print('Add  {} with mae {}\n'.format(best_feature, best_mae))
        else: changed = False
        
        if not changed:
            break
            
    print(model_label, "model optimized")
    print('resulting models:', included_models)
    print('MAE:', starting_mae)
    
    return included_models, starting_mae

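Sanity check: My function scores on mean absolute error, but you can edit this for whichever metric you prefer, such as R2 or RMSE. Note that scikit-learn’s 'neg_mean_absolute_error' scorer returns the negated MAE so that higher is always better, which is why the function looks for the maximum score and why the MAE values it reports are negative.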
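As a rough sketch of what swapping the metric involves, cross_validate accepts several scorer names at once. The data and the Ridge model below are stand-ins (assumptions); the scorer strings are scikit-learn’s own.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.2, size=120)

cv_scores = cross_validate(
    Ridge(), X_demo, y_demo, cv=5,
    scoring=("neg_mean_absolute_error", "neg_root_mean_squared_error", "r2"),
)

# with several metrics, each result is stored under "test_<scorer name>"
mean_mae = -cv_scores["test_neg_mean_absolute_error"].mean()
mean_rmse = -cv_scores["test_neg_root_mean_squared_error"].mean()
mean_r2 = cv_scores["test_r2"].mean()
```

All three scorers follow the higher-is-better convention, so a selector that keeps the maximum score works unchanged with any of them.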

We will ultimately run this function with each of our potential models as the function’s meta model. When we run the function we send it the data_x and data_y that we got from our out-of-fold function, as well as a single instantiated model to try as the meta model, and our dictionary with all of the out-of-fold predictions. The function then runs forward selection, using the out-of-fold predictions as features.

For the model that is being tried as the meta model, we get a baseline score (using CV) on the train set. For each other potential model, we iteratively append the potential’s yhat predictions to the feature set and re-score the meta model using that additional feature. If our meta model score improves with the addition of any feature, the single best scoring potential’s predictions are permanently appended to the feature set and the improved score becomes the baseline. The function then loops, once more trying the addition of each potential that isn’t already in the stack, until no potential additions improve the score. The function then reports the optimal included models for this meta model, and the best score achieved.
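Stripped of the scikit-learn details, the loop described above is a plain greedy forward selection. Here is a toy sketch of just that logic. forward_select, toy_gain and toy_score are hypothetical names, and the score function is invented so the example runs instantly.

```python
def forward_select(candidates, score_fn):
    """Greedy forward selection: keep adding the single best candidate while the score improves."""
    included = []
    best_score = score_fn(included)  # baseline with nothing stacked
    while True:
        remaining = [c for c in candidates if c not in included]
        if not remaining:
            break
        # try each remaining candidate on top of what is already included
        trial_scores = {c: score_fn(included + [c]) for c in remaining}
        best_candidate = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_candidate] <= best_score:
            break  # nothing improves the score, so stop
        included.append(best_candidate)
        best_score = trial_scores[best_candidate]
    return included, best_score

# Invented "negated MAE" gains per stacked model (higher is better)
toy_gain = {"XGB": 3.0, "KNN": 1.0, "Ridge": -0.5}

def toy_score(included):
    return -10.0 + sum(toy_gain[name] for name in included)

selected, final_score = forward_select(list(toy_gain), toy_score)
print(selected, final_score)  # prints ['XGB', 'KNN'] -6.0
```

In the real selector, the role of toy_score is played by the cross-validated score of the meta model on the original features plus the included models’ out-of-fold predictions.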

Sanity check: This model selector is written to optimize on a Train set using CV. If you are fortunate enough to have a Validation set, you could rewrite and perform the selection on that. It will be much faster — but watch out for overfitting!

How do we pick our meta model for this task? Here comes the fun part — we’re going to try ALL of our models as the meta model. Keep in mind that this function may take a LONG time to run. Keeping verbose set to True will give frequent progress reports.

We make a dictionary to store all of the scores that we’ll get from our testing. Then we run the stack selector on each of our trained models, using that model as the meta model:

# Set up a scoring dictionary to hold the model stack selector results
scores = {}
scores['Model'] = []
scores['MAE'] = []
scores['Included'] = []

# Run the model stack selector for each model in our trained_models

for model in trained_models:
    
    meta_model = trained_models[model][0]
    resulting_models, best_mae = model_selector(data_x, data_y,  meta_model, trained_models, model, verbose=True)
    
    scores['Model'].append(model)
    scores['MAE'].append(best_mae)
    scores['Included'].append(resulting_models)

Afterward we’ll have the scores of how each model performs as the meta model, along with its best stacked additions. We turn that dictionary into a dataframe and sort our results on our score metric:

# Look at the scores of our model combinations

best_model = pd.DataFrame(scores).reset_index(drop=True)
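# the scores are negated MAE, so the largest value (closest to zero) is the best and sorts to the top
best_model.sort_values('MAE', ascending=False)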

Now we can see exactly which meta model and stacked models performed the best.
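If you want to pull the winner out programmatically instead of reading the table, a sketch like this works. The numbers below are invented purely to make the example runnable; they are not results from this article.

```python
import pandas as pd

# Invented scores in the same layout as the scores dictionary (negated MAE, higher is better)
demo_scores = {
    "Model": ["SVR", "Ridge", "XGB"],
    "MAE": [-0.152, -0.160, -0.149],
    "Included": [["XGB", "KNN"], ["XGB"], ["Linear Regression"]],
}
demo_results = pd.DataFrame(demo_scores)

# idxmax finds the row with the highest (least negative) score
best_row = demo_results.loc[demo_results["MAE"].idxmax()]
best_meta_label = best_row["Model"]
best_stack = best_row["Included"]
```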

Putting It All Together

We’re almost done! Soon we’ll be making predictions on our Test set. We’ve selected a meta model (probably the one that performed the best on the Stack Selector!) and the stacked models that we’ll include.

Before we fit our meta model to our stacked data, try the model without the added features on the test set so you can get a comparison for your stacked model’s improvements! Fit and predict your meta model on your original train/test set. This is the baseline we’re expecting to beat with a stacked model.

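# Check our meta model on the original train/test set only
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Instantiate the chosen meta model
meta_model = SVR(gamma = 'scale', kernel = 'linear', C=10, epsilon=.05)
meta_model.fit(X_train, y_train)
predictions = meta_model.predict(X_test)

# np.exp converts back to the original scale (this assumes the target was log-transformed during preprocessing)
pred_exp = np.exp(predictions)
actual = np.exp(y_test)

print("MAE: ",int(mean_absolute_error(pred_exp, actual)))
print("RMSE:",int(np.sqrt(mean_squared_error(pred_exp, actual))))
# note: r2_score is not symmetric and scikit-learn documents its order as (y_true, y_pred);
# the call below keeps the argument order that produced the output shown
print("R2:",r2_score(pred_exp, actual)*100)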

Output:
MAE:  53130
RMSE: 82427
R2: 86.93570728212

Take a step back and fit all the models that we will be using in our stack on our ORIGINAL training dataset. We do this because we’ll be predicting on the test set (same as with a single model!) and adding the predictions to our test set as features for our meta model.

print("Fitting Models")
linreg.fit(X_train, y_train)
xgbr.fit(X_train, y_train)
neighbor.fit(X_train, y_train)

Now we prepare our model stack. First, manually make a list storing only the out-of-fold predictions for models that we are using in our final stack. We get these predictions from the ‘trained_models’ dictionary that we produced earlier using our out-of-fold predictions function:

yhat_predics = [trained_models['XGB'][1], trained_models['Linear Regression'][1], trained_models['KNN'][1]]

Time for one more function. This one takes in our re-ordered train data_x, and the list holding yhat predictions, and puts them together into a single meta train set. We do this in a function because we’ll be doing it again for the test data later.

def create_meta_dataset(data_x, items):
    '''Function that takes in a data set and list of predictions, and forges into one dataset
    parameters:
    data_x - original data set
    items - list of predictions
    returns: stacked data set
    '''
    
    meta_x = data_x
    
    for z in items:
        z = np.array(z).reshape((len(z), 1))
        meta_x = np.hstack((meta_x, z))
        
    return meta_x

Now call this function, passing data_x and the predictions list, to create a meta training set for our meta model:

# create the meta data set using the oof predictions
meta_X_train = create_meta_dataset(data_x, yhat_predics)

The meta training set consists of our original predictive features along with the stacked models’ out-of-fold predictions as additional features.
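A quick way to convince yourself the meta set is built correctly is to check its shape: with n rows, m original features and k stacked models, it should come out as (n, m + k). Here is a toy sketch with made-up numbers that uses the same hstack approach.

```python
import numpy as np

n_rows, n_features = 6, 3
demo_x = np.arange(n_rows * n_features, dtype=float).reshape(n_rows, n_features)
demo_yhats = [[0.1] * n_rows, [0.2] * n_rows]  # two stacked models' predictions

meta_demo = demo_x
for yhat in demo_yhats:
    column = np.array(yhat).reshape(-1, 1)      # turn the flat list into one column
    meta_demo = np.hstack((meta_demo, column))  # append it to the right of the features

print(meta_demo.shape)  # prints (6, 5)
```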

Make a list to hold the fitted model instances; we’ll be using that with our next and last function. Make sure this is in the same order as you listed the yhat predictions for your meta stacker!

final_models = [xgbr, linreg, neighbor]
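One way to avoid ordering mistakes is to build both lists from a single list of labels. The sketch below uses a toy dictionary with strings in place of fitted models; stack_labels and demo_trained are hypothetical names.

```python
# The labels of the models chosen for the stack, in one fixed order
stack_labels = ["XGB", "Linear Regression", "KNN"]

# Toy stand-in for trained_models: label -> [model, out-of-fold predictions]
demo_trained = {
    "SVR": ["svr_model", [1.0, 2.0]],
    "KNN": ["knn_model", [1.1, 2.1]],
    "Linear Regression": ["linreg_model", [0.9, 1.9]],
    "XGB": ["xgb_model", [1.05, 2.05]],
}

# Both lists are derived from the same labels, so their order always matches
yhat_list = [demo_trained[label][1] for label in stack_labels]
model_list = [demo_trained[label][0] for label in stack_labels]
```

With the real dictionary, the entries at index 0 are the same model instances you instantiated at the start, so they still need to be refit on the full training set before you predict on the test set.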

Time for our last function. This function takes in our test set and the fitted models, uses the fitted models to make predictions on the test set, and then adds those predictions as features to the test set. It sends back a complete meta_x set for the meta model to predict on.

def stack_prediction(X_test, final_models): 
    '''takes in a test set and a list of fitted models.
    Predicts with each fitted model in the list on the test set and stores the result in a predictions list. Then sends the test set and the predictions to create_meta_dataset to be combined
    Returns: combined meta test set
    Parameters:
    X_test - testing dataset
    final_models - list of fitted models
    '''
    predictions = []
    
    for item in final_models:
        print(item)
        preds = item.predict(X_test).reshape(-1,1)
        predictions.append(preds)
    
    meta_X = create_meta_dataset(X_test, predictions)
        
    return meta_X

Here we call the function, sending the test set and the fitted models:

meta_X_test = stack_prediction(X_test, final_models)

Final Model Evaluation

The moment of truth is finally here. Time to predict with your stacked model!

# fit the meta model to the Train meta dataset
# There is no data leakage in the meta dataset since we did all of our predictions out-of-sample!
meta_model.fit(meta_X_train, data_y)

# predict on the meta test set
predictions = meta_model.predict(meta_X_test)

pred_exp = np.exp(predictions)
actual = np.exp(y_test)

print("MAE: ",int(mean_absolute_error(pred_exp, actual)))
print("RMSE:",int(np.sqrt(mean_squared_error(pred_exp, actual))))
print("R2:",r2_score(pred_exp, actual)*100)

Output:
MAE:  47205
RMSE: 73973
R2: 90.03816032670765

In our stacking example, we reduced the MAE on our test set from 53130 to 47205 — an 11.15% improvement!
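For the record, the improvement figure is simply the relative drop in MAE, computed from the two outputs above:

```python
baseline_mae = 53130  # meta model alone
stacked_mae = 47205   # meta model with stacked features

improvement_pct = (baseline_mae - stacked_mae) / baseline_mae * 100
print(f"{improvement_pct:.2f}%")  # prints 11.15%
```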

I hope you see improvements to your model scores, and can see the utility of trying stacked models! I trust you’ve learned some valuable tools to add to your kit for automating model selection and stacking.

References:

How to Use Out-of-Fold Predictions in Machine Learning by Jason Brownlee
Forward-Backward Feature Selection by David Dale

This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.


About Jen Wadkins
