Introduction to Ensemble Learning
"Even the weak become strong when they are united." - Friedrich Von Schiller
This is the basic idea behind Ensemble Methods. Weak learners, also known as base models, are strategically combined to form a strong learner called the ensemble model. The ensemble model uses multiple learning algorithms to solve a classification or regression problem that none of its constituent learners could solve as effectively on its own.
Ensemble techniques differ in how the outputs of the weak learners are aggregated to obtain the final output. We will look at two popular techniques for building an ensemble model:
1. Bagging
2. Boosting
Before going any further, let us build some intuition for the effect of these techniques on model performance. A good machine learning algorithm is ideally expected to achieve both low bias and low variance while learning from the training data. In practice, reducing one tends to increase the other; this tension is called the bias-variance trade-off. Both sources of error (bias and variance) prevent the model from generalizing from the training data to unseen data.
Bias is the error that arises from erroneous assumptions made by the learning algorithm. It causes the model to miss the relationship between the features and the target output in the training data (under-fitting).
Variance is the error caused by sensitivity to small fluctuations in the training data. It results in the model learning noise from the training data (over-fitting).
Ensemble methods are used to reduce either the bias or the variance, depending on the weakness of the base models. The combining technique is chosen according to the source of error we are trying to reduce.
Bagging (Bootstrap Aggregating)
Bagging involves training weak learners on different samples of the training data in parallel and combining the results of these base models using an averaging method.
1. Sampling
When I say multiple sets of training data, you might think this would require a lot of data to feed multiple data-hungry models. But all we actually need is multiple samples drawn from a single training set. These can be obtained using a technique called Bootstrapping.
Bootstrapping is a statistical method of sampling from the original data such that the samples are almost independent of one another and representative of the original data distribution (approximately i.i.d., i.e. independent and identically distributed).
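As a rough illustration, here is a minimal NumPy sketch of bootstrapping; the helper name bootstrap_samples is purely illustrative. Each sample is drawn with replacement and has the same size N as the original set.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_samples(X, y, n_samples):
    """Yield n_samples bootstrap resamples of (X, y), each of size N."""
    N = len(X)
    for _ in range(n_samples):
        idx = rng.integers(0, N, size=N)   # draw N indices with replacement
        yield X[idx], y[idx]               # an approximately i.i.d. resample
```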
2. Training
Say we draw L bootstrap samples, each of size N sampled with replacement, from the original data set of size N. We then train L homogeneous base models, one on each of these L samples.
3. Aggregating
Now we have predictions from L base models that we need to aggregate. In the case of a regression problem, we can simply take the average of the base models' predictions as the prediction of the ensemble model.
Y = ( y1+y2+y3+…+yL ) / L
For a classification problem, if the base models return class labels, one way of aggregating is Hard Voting: the class returned by each weak classifier counts as a vote, and the ensemble model returns the class with the highest number of votes.
Y = mode ( y1, y2, y3,…, yL )
On the other hand, if the base models return class probabilities, we can use Soft Voting: average the predicted probabilities of each class across all base models and return the class with the highest mean probability.
Y = argmax over c of [ ( P1(c) + P2(c) + … + PL(c) ) / L ]
where Pi(c) is the probability that base model i assigns to class c.
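As a minimal sketch (assuming the bootstrap_samples helper from the sampling step above, a training set X, y, and scikit-learn-style base models with predict/predict_proba and integer class labels), the training and aggregation steps could look like this; the function names are illustrative only:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Train L homogeneous base models, one per bootstrap sample (X, y assumed to exist).
L = 25
models = [DecisionTreeClassifier().fit(Xb, yb) for Xb, yb in bootstrap_samples(X, y, L)]

def aggregate_regression(models, X):
    # regression: simple average of the base predictions (use regressors as base models)
    return np.mean([m.predict(X) for m in models], axis=0)

def aggregate_hard_voting(models, X):
    # hard voting: each classifier casts one vote; the most frequent label wins
    votes = np.stack([m.predict(X) for m in models]).astype(int)   # shape (L, n)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

def aggregate_soft_voting(models, X):
    # soft voting: average the class probabilities, pick the class with the highest mean
    mean_probs = np.mean([m.predict_proba(X) for m in models], axis=0)  # (n, n_classes)
    return mean_probs.argmax(axis=1)
```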
When and when not to use Bagging?
- Bagging methods mainly focus on reducing the variance, with little effect on the bias. If the base models trained on different samples have high variance (over-fitting), the aggregation averages it out, thereby reducing the variance. This technique is therefore chosen when the base models have high variance and low bias, which is generally the case for models with many degrees of freedom fitted to complex data (Ex: deep decision trees); see the scikit-learn sketch after this list.
- Bagging does not work for models with high bias and low variance, because aggregating the results of base models that do not fit the training data well (under-fitting) does not improve them.
- Models with many degrees of freedom require more training time. Since the base models are trained in parallel, the total training time is roughly the time needed to train a single one of them (given enough parallel resources), which makes bagging a good choice of ensemble method for such models.
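For reference, here is a hedged usage sketch of bagging deep, high-variance trees with scikit-learn's BaggingClassifier; note that the keyword argument is estimator in scikit-learn 1.2 and later (older releases call it base_estimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

deep_tree = DecisionTreeClassifier(max_depth=None)      # low bias, high variance
bagged = BaggingClassifier(estimator=deep_tree,         # `base_estimator` before sklearn 1.2
                           n_estimators=50, n_jobs=-1, random_state=0)

print("single deep tree:", cross_val_score(deep_tree, X, y, cv=5).mean())
print("bagged deep trees:", cross_val_score(bagged, X, y, cv=5).mean())
```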
Boosting
Boosting involves training weak learners sequentially, where each new model focuses its effort on the observations that its predecessor found hard to learn.
The base models are combined in a very adaptive manner to form the ensemble model, which is a weighted sum of its constituent base models. Two popular meta-algorithms differ mainly in how the weak learners are aggregated; both optimize the ensemble model iteratively:
1. Adaptive Boosting (AdaBoost)
2. Gradient Boosting
Adaptive Boosting
Each base learner updates the weights attached to the observations in the dataset. The following steps repeat for L learners, with each base model trying to correct the mistakes of its predecessor using these weights.
(i) After a weak learner is trained, each data point in the dataset is assigned a weight that reflects how well it was classified: correctly classified points receive lower weights, while misclassified points receive higher weights.
(ii) This weighted dataset is then used to train the next weak learner, which naturally focuses more on the data points with higher weights and tries to classify them correctly.
(iii) The re-weighted dataset is used as the input for the next learner, and the cycle repeats.
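A minimal sketch of this reweighting loop, assuming binary labels encoded as -1/+1 and decision stumps as the weak learners; the alpha and weight-update formulas follow the classic AdaBoost derivation rather than anything prescribed above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    """Minimal AdaBoost sketch; assumes binary labels y encoded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # (i) start with uniform weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)           # (ii) train on the weighted data
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - err) / err)      # this learner's weight in the ensemble
        w *= np.exp(-alpha * y * pred)             # (iii) up-weight misclassified points
        w /= w.sum()                               # re-normalise the weights
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # the ensemble is the sign of the weighted sum of the base predictions
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```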
Gradient Boosting
Each base learner updates the target values for the observations in the dataset. As in AdaBoost, each new model depends on the error of the previous one. As the name suggests, the weak learners are combined sequentially using gradient descent: each base model is fitted in the direction opposite to the gradient of the error of the ensemble built so far.
Wᵢ = Wᵢ₋₁ − α * ∇Eᵢ₋₁
Here Wᵢ denotes the weights of base model i, α is the step size, and Eᵢ₋₁ is the error of base model (i-1).
For every observation in the dataset, we compute the difference between the observed and predicted values. These differences are called pseudo-residuals, and they indicate the direction in which the next learner's prediction should move to reach the right value.
1. Initially, the ensemble prediction is set to the average of the known targets, so the first pseudo-residuals are simply the differences between each target and this average.
2. Each weak learner i is then trained to predict the pseudo-residuals obtained at step (i-1).
3. The pseudo-residuals computed from the updated ensemble become the targets for the next weak learner.
pseudo_residuals = Yᵗᵃʳᵍᵉᵗ − Yᵖʳᵉᵈ
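Putting these steps together, here is a minimal gradient-boosting sketch for squared-error regression, where shallow trees are the weak learners and the learning rate lr plays the role of the step size α; the function names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, lr=0.1):
    """Minimal gradient-boosting sketch for squared-error regression."""
    f0 = float(np.mean(y))                  # 1. initial prediction: average of the targets
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                # pseudo_residuals = Y_target - Y_pred
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)              # 2./3. the next learner targets the residuals
        pred += lr * tree.predict(X)        # step in the negative-gradient direction
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, lr=0.1):
    # ensemble prediction: initial average plus the scaled contributions of all trees
    return f0 + lr * sum(t.predict(X) for t in trees)
```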
When and when not to use Boosting?
- Boosting mainly focuses on reducing the bias, so base models with low variance, high bias, and few degrees of freedom (Ex: shallow decision trees) are generally used; see the usage sketch after this list.
- Unlike bagging, where models are trained in parallel, the models here are trained sequentially, so boosting several complex models would be computationally expensive.
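For reference, a hedged usage sketch of scikit-learn's boosting ensembles, both of which use shallow trees as base learners by default (stumps for AdaBoost, depth-3 trees for gradient boosting):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, random_state=0)        # decision stumps by default
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0)     # shallow trees

for name, model in [("AdaBoost", ada), ("Gradient Boosting", gbm)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```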