Born-Again Networks
BANs exploit the idea, demonstrated by knowledge distillation (KD), that the information contained in a teacher model's output distribution $f(x, \theta_1^*)$ can provide a rich source of training signal, leading to a second solution $f(x, \theta_2^*)$, $\theta_2^* \in \Theta_2$, with better generalization ability.
Sequence of teaching selves: born-again networks ensemble. The $k$-th model in the sequence is trained with knowledge transferred from the $(k-1)$-th student, i.e. by minimizing
\[ \mathcal{L}(f(x, \arg \min_{\theta_{k-1}} \mathcal{L}(f(x, \theta_{k-1}))), f(x, \theta_k)) \]
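A minimal sketch of this sequential training in PyTorch. The model factory `make_model`, the data loader `loader`, and the hyperparameters are hypothetical placeholders; the paper trains each generation to convergence before it becomes the next teacher.

```python
import torch
import torch.nn.functional as F

def train_generation(student, teacher, loader, epochs=1, lr=0.1):
    """Train one BAN generation: the student matches the (frozen) teacher's
    output distribution; with teacher=None this is ordinary supervised training."""
    opt = torch.optim.SGD(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            z = student(x)                       # student logits
            if teacher is None:
                loss = F.cross_entropy(z, y)     # generation 0: hard labels
            else:
                with torch.no_grad():
                    t = teacher(x)               # teacher logits (frozen)
                # cross-entropy between teacher and student distributions
                loss = -(F.softmax(t, dim=1) * F.log_softmax(z, dim=1)).sum(1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student

def born_again(make_model, loader, k):
    """Produce k generations; each one is born again from the previous."""
    teacher, generations = None, []
    for _ in range(k):
        student = train_generation(make_model(), teacher, loader)
        generations.append(student)
        teacher = student                        # student becomes the next teacher
    return generations
```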
Finally, Born-Again Network Ensembles (BANE) predict by averaging the predictions of multiple generations of BANs:
\[ \hat{f}^{k}(x) = \frac{1}{k} \sum_{i=1}^k f(x, \theta_i) \]
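A sketch of the BANE prediction rule, assuming `generations` is the list returned by the hypothetical `born_again` helper above:

```python
import torch
import torch.nn.functional as F

def bane_predict(generations, x):
    """Average the softmax predictions of all k generations (BANE)."""
    with torch.no_grad():
        probs = [F.softmax(m(x), dim=1) for m in generations]
    return torch.stack(probs).mean(0)            # corresponds to \hat{f}^k(x)
```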
Dark knowledge under the light. For a single sample, the gradient of the cross-entropy between the student's softmax output (logits $z_j$) and the teacher's softmax output (logits $t_j$), taken with respect to the $i$-th student logit, is:
\[ \frac{\partial \mathcal{L}_i}{\partial z_i} = q_i - p_i = \frac{\exp(z_i)}{\sum_{j=1}^n \exp(z_j)} - \frac{\exp(t_i)}{\sum_{j=1}^n \exp(t_j)} \]
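This identity is easy to verify numerically with autograd; a toy check with random logits (all names here are illustrative):

```python
import torch
import torch.nn.functional as F

z = torch.randn(5, requires_grad=True)           # student logits
t = torch.randn(5)                               # teacher logits
loss = -(F.softmax(t, dim=0) * F.log_softmax(z, dim=0)).sum()
loss.backward()
q = F.softmax(z, dim=0).detach()                 # student distribution q_i
p = F.softmax(t, dim=0)                          # teacher distribution p_i
assert torch.allclose(z.grad, q - p, atol=1e-6)  # matches dL/dz_i = q_i - p_i
```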
When the target probability distribution corresponds to the ground-truth one-hot label (with $*$ denoting the true class), this reduces to:
\[ \frac{\partial \mathcal{L}_*}{\partial z_*} = q_* - y_* = \frac{\exp(z_*)}{\sum_{j=1}^n \exp(z_j)} - 1 \]
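The same check for the one-hot case, where the teacher distribution is replaced by a ground-truth label (class 0 here, chosen arbitrarily):

```python
import torch
import torch.nn.functional as F

z = torch.randn(5, requires_grad=True)           # student logits
y = torch.tensor(0)                              # ground-truth class *
F.cross_entropy(z.unsqueeze(0), y.unsqueeze(0)).backward()
q = F.softmax(z, dim=0).detach()
assert torch.allclose(z.grad[y], q[y] - 1.0)     # matches dL/dz_* = q_* - 1
```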