Deep Ensembles
May, 2021

Abstract

Deep ensembles have been shown to be a promising approach for improving accuracy. We start by reviewing the theoretical background of deep ensembles, along with some early work from the 1990s, and then introduce several recent works on ensembles of deep neural networks.

The Bias-Variance Tradeoff for Ensemble Models

Consider a single-output regression setting in which our goal is to learn a function \(f: \mathbb{R}^d \mapsto \mathbb{R}\) from a set of i.i.d. samples \((x^\mu, y^\mu)\) such that \(y^\mu = f(x^\mu)\). Assume we have \(N\) networks in our ensemble. The output of network \(\alpha\) is denoted by \(V^\alpha(x)\). The output of the ensemble is the weighted average \(\bar{V}(x) = \sum_\alpha w_\alpha V^\alpha(x)\), where \(\sum_\alpha w_\alpha = 1\).

We define the ambiguity of ensemble member \(\alpha\) on input \(x\) to be \(a^\alpha(x) = (V^\alpha(x) - \bar{V}(x))^2\). The ensemble ambiguity on input \(x\) is \[\bar{a}(x) = \sum_\alpha w_\alpha a^\alpha(x)\] which is simply the (weighted) variance of the ensemble members' outputs around the ensemble output. The quadratic errors of network \(\alpha\) and of the ensemble are \[\begin{align*}\epsilon^\alpha(x) & = (f(x) - V^\alpha(x))^2 \\ e(x) & = (f(x) - \bar{V}(x))^2\end{align*}\] Substituting back into \(\bar{a}(x)\) we obtain \[\begin{align*}\bar{a}(x) & = \sum_\alpha w_\alpha \epsilon^\alpha(x) - e(x) \\ e(x) & = \bar{\epsilon}(x) - \bar{a}(x)\end{align*}\] Averaging over the input distribution gives \[E = \bar{E} - \bar{A}\] where \(E\) is the ensemble generalization error, \(\bar{E}\) is the weighted average of the generalization errors of the ensemble members, and \(\bar{A}\) is the weighted average of the ambiguities. This equation decomposes the generalization error of the ensemble into the average error of the individual networks minus a term measuring how much they disagree. Since \(\bar{A} \geq 0\), with uniform weights we have \[E \leq \frac{1}{N}\sum_\alpha E^\alpha\] These results have been noted by several authors [2][3].
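To make the decomposition concrete, here is a quick numerical sanity check in NumPy; the ensemble outputs are synthetic and the setup (five members, uniform weights) is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: N ensemble members evaluated on a batch of inputs.
N, n_points = 5, 1000
f = rng.normal(size=n_points)                        # "true" targets f(x)
V = f + rng.normal(scale=0.5, size=(N, n_points))    # member outputs V^alpha(x)
w = np.full(N, 1.0 / N)                              # uniform weights, sum to 1

V_bar = np.einsum("a,an->n", w, V)                   # ensemble output \bar{V}(x)

E_bar = np.einsum("a,an->", w, (f - V) ** 2) / n_points       # weighted avg member error
A_bar = np.einsum("a,an->", w, (V - V_bar) ** 2) / n_points   # weighted avg ambiguity
E = np.mean((f - V_bar) ** 2)                        # ensemble generalization error

print(E, E_bar - A_bar)        # the two numbers agree: E = E_bar - A_bar
assert np.isclose(E, E_bar - A_bar)
```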

Diverse Ensemble of Deep Networks

We have now seen why ensembles can help in theory. But what are the best practices for creating an ensemble? Lee et al. [1] first compared several standard approaches from ensemble learning and then proposed novel strategies for training an ensemble of deep networks.

Random initialization and bagging. Randomly initializing network weights and training on bootstrap resamples of the dataset (bagging) are probably the most widely used ways to induce randomness in ensemble members. The authors compared three training settings: (1) random initialization, (2) bagging, and (3) both combined. In the results below, all ensembles improve over the single-model baseline, but bagging can hurt ensemble-mean accuracy. This is likely because each bootstrap resample omits roughly 37% of the unique training examples, and throwing away data is generally considered bad practice for data-hungry deep networks.

(Figure: results for random initialization, bagging, and the combined setting.)
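The roughly 37% figure follows from the bootstrap itself: a resample of size \(n\) drawn with replacement misses any given example with probability \((1 - 1/n)^n \approx e^{-1} \approx 0.368\). A small illustrative check (not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000                                   # size of a hypothetical training set

# One bagging resample: draw n indices with replacement.
sample = rng.integers(0, n, size=n)
unique_frac = np.unique(sample).size / n

print(f"unique examples seen: {unique_frac:.3f}")      # ~0.632
print(f"examples never seen:  {1 - unique_frac:.3f}")  # ~0.368, the 'lost' 37%
```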

Parameter sharing with TreeNets. Classical ensembles train multiple independent instances of a base model, which likely introduces wasteful duplication of parameters and of low-level feature representations. The authors investigated the idea of parameter sharing by evaluating a family of tree-structured CNN ensembles called TreeNets, in which members share their initial layers and then branch into independent sub-networks.

(Figure: TreeNet architectures splitting into branches at different depths.)
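To make the TreeNet idea concrete, below is a minimal, hypothetical PyTorch sketch of a tree-structured ensemble with a shared trunk and independent branches; the layer sizes and split point are arbitrary and are not the architectures evaluated in the paper.

```python
import torch
import torch.nn as nn


class TreeNet(nn.Module):
    """A tree-structured ensemble: shared initial layers, M independent branches."""

    def __init__(self, num_classes: int = 10, num_branches: int = 3):
        super().__init__()
        # Shared trunk (the initial layers shared across members).
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Independent branches, one per ensemble member.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, num_classes),
            )
            for _ in range(num_branches)
        ])

    def forward(self, x):
        h = self.trunk(x)
        # One set of logits per branch; average the softmaxes at test time.
        return [branch(h) for branch in self.branches]


logits_list = TreeNet()(torch.randn(4, 3, 32, 32))  # 3 tensors of shape (4, 10)
```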

The authors evaluated TreeNets split at different depths on two large architectures trained on ImageNet, ILSVRC-Alex and ILSVRC-NiN. Parameter-shared networks not only retain the performance of full ensembles but can outperform them; in particular, TreeNets sharing the first one or two layers beat classical ensembles while using fewer parameters.

(Figure: TreeNet results on ILSVRC-Alex and ILSVRC-NiN for different split points.)

Ensemble-aware losses. Instead of using a separate objective for each ensemble member, the authors introduced ensemble-aware losses. A "natural" idea is to directly optimize the performance of the averaged beliefs of the ensemble, i.e., to use the ensemble-mean loss during training. However, this does not work well because averaging outputs reduces diversity: the gradients backpropagated into the ensemble members are identical, so "mistakes" are shared among members regardless of their individual performance. Moreover, averaging softmax probabilities can cause numerical instability during backpropagation.

Inspired by Multiple Choice Learning [4], the authors proposed a diversity-encouraging loss for classification. Instead of backpropagating the loss through every ensemble member, for each sample an "oracle" updates only the best predictor. Formally, given an ensemble with \(M\) predictors \(\{\theta_m\}\), the cross-entropy oracle set-loss on example \((x_i, y_i)\) is \[\mathcal{L}_{set} = \sum_{m=1}^M -\alpha_{mi} \log(p_{y_i}^{\theta_m})\] where \(\alpha_{mi}\) is a binary variable indicating whether \(\theta_m\) has the lowest loss on \((x_i, y_i)\). A straightforward extension is to choose the top-\(k\) predictors instead of only the best one. The experimental results are shown below.

(Figure: results for ensembles trained with the oracle set-loss.)
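A hedged PyTorch sketch of how such an oracle set-loss could be implemented on top of per-member cross-entropy losses; the normalization and top-\(k\) selection details are illustrative choices, not necessarily the authors' exact formulation.

```python
import torch
import torch.nn.functional as F


def oracle_set_loss(logits_list, targets, k: int = 1):
    """Cross-entropy oracle set-loss: for each example, backpropagate only
    through the k members with the lowest loss (the alpha_mi indicator)."""
    # Per-member, per-example cross-entropy: shape (M, batch).
    losses = torch.stack(
        [F.cross_entropy(logits, targets, reduction="none") for logits in logits_list]
    )
    # alpha_mi: select the k best members per example; no gradient through the choice.
    best = losses.detach().argsort(dim=0)[:k]           # (k, batch) member indices
    alpha = torch.zeros_like(losses).scatter_(0, best, 1.0)
    return (alpha * losses).sum() / targets.numel()


# Usage with the TreeNet sketch above (shapes are illustrative):
# loss = oracle_set_loss(TreeNet()(x_batch), y_batch, k=1)
```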

Predictive Uncertainty Estimation

The authors proposed a simple and scalable method for obtaining predictive uncertainty estimates from NNs. They used proper scoring rules to train the NNs to model predictive distributions, and combined ensembling with adversarial training to smooth the predictive estimates.

  • Proper scoring rules. Given a scoring rule \(S(p_\theta, (y, \mathbf{x}))\) and the true distribution \(q(y, \mathbf{x})\) of \((y, \mathbf{x})\), the expected score is \(S(p_\theta, q) = \int q(y, \mathbf{x})S(p_\theta, (y, \mathbf{x}))\,dy\,d\mathbf{x}\). A proper scoring rule is one where \(S(p_\theta, q) \leq S(q, q)\), with equality if and only if \(p_\theta = q\). Many common NN losses are proper scoring rules, such as the log-likelihood for classification and the Brier score; the MSE for regression, however, is not. Following [6], the authors used a network that outputs a predicted mean \(\mu_\theta(\mathbf{x})\) and variance \(\sigma_\theta^2(\mathbf{x}) > 0\) and minimized the negative log-likelihood \[-\log p_\theta(y_n \mid \mathbf{x}_n) = \frac{\log \sigma_\theta^2(\mathbf{x}_n)}{2} + \frac{(y_n-\mu_\theta(\mathbf{x}_n))^2}{2\sigma_\theta^2(\mathbf{x}_n)} + \text{const.}\] (see the sketch after this list).
  • Adversarial training. The authors interpreted adversarial training as a computationally efficient way to smooth the predictive distribution by increasing the likelihood of the target in an \(\epsilon\)-neighborhood of each observed training example.
  • Ensembles. The authors trained an ensemble of NNs with different random initializations on the entire dataset; they also observed that bagging deteriorated performance.
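Putting the first two ingredients together, here is a hedged PyTorch sketch of one training step: a network with mean and log-variance heads trained with the Gaussian NLL above, plus an FGSM-style adversarial example. The layer sizes, function names, and perturbation budget `eps` are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn


class MeanVarianceNet(nn.Module):
    """Regression network predicting mu(x) and sigma^2(x) > 0."""

    def __init__(self, in_dim: int = 8, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, 1)
        self.log_var_head = nn.Linear(hidden, 1)   # predict log-variance for positivity

    def forward(self, x):
        h = self.body(x)
        return self.mu_head(h), self.log_var_head(h)


def gaussian_nll(mu, log_var, y):
    # -log p(y|x) = 0.5 * log sigma^2 + (y - mu)^2 / (2 sigma^2) + const
    return (0.5 * log_var + (y - mu) ** 2 / (2 * log_var.exp())).mean()


def train_step(net, optimizer, x, y, eps: float = 0.01):
    x = x.clone().requires_grad_(True)
    loss = gaussian_nll(*net(x), y)
    # FGSM-style perturbation used to smooth the predictive distribution.
    grad_x, = torch.autograd.grad(loss, x, retain_graph=True)
    x_adv = (x + eps * grad_x.sign()).detach()
    total = loss + gaussian_nll(*net(x_adv), y)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()


# Illustrative usage with random data; y has shape (batch, 1) to match mu.
net = MeanVarianceNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
train_step(net, opt, torch.randn(32, 8), torch.randn(32, 1))
```

In the full method, M such networks are trained independently from different random initializations, and their predictive distributions are combined at test time.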

A Loss Landscape Perspective

TODO

References

[1] S. Lee, S. Purushwalkam, M. Cogswell, D. Crandall, D. Batra. Why M Heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314 (2015).

[2] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learning. In NIPS, 1995.

[3] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. In Neural Networks for Speech and Image Processing, 1993.

[4] A. Guzman-Rivera, D. Batra, and P. Kohli. Multiple choice learning: Learning to produce multiple structured outputs. In NIPS, 2012.

[5] S. Fort, H. Hu, B. Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757 (2019).

[6] D.A. Nix, A.S. Weigend. Estimating the mean and variance of the target probability distribution. In ICNN, 1994.