In this work [1], the authors presented a deep mutual learning (DML) strategy in which, rather than one-way transfer from a static, pre-defined teacher to a student, an ensemble of students learns collaboratively, with the networks teaching each other throughout training. Experiments showed that DML achieved compelling results on the CIFAR-100 image classification and Market-1501 person re-identification benchmarks.
Deep Mutual Learning
The conventional supervised loss trains the network $\Theta_1$ to predict the correct labels for the training instances. To improve the generalization of $\Theta_1$ to test instances, we use a peer network $\Theta_2$ to provide training experience in the form of its posterior probability $p_2$. The Kullback-Leibler (KL) divergence is used to measure how well the two networks' predictions $p_1$ and $p_2$ match:
\[ D_\text{KL}(p_2 \mid\mid p_1) = \sum_{i=1}^N\sum_{m=1}^M p_2^m(x_i) \log\frac{p_2^m(x_i)}{p_1^m(x_i)} \]
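As a minimal sketch, the KL term above can be computed directly from the two networks' softmax outputs. The function name `kl_divergence` and the list-of-lists input format are illustrative choices, not from the paper:

```python
import math

def kl_divergence(p2, p1):
    """D_KL(p2 || p1), summed over N instances and M classes.

    p2, p1: lists of probability vectors (one per training instance x_i),
    e.g. the softmax posteriors of the two peer networks.
    """
    total = 0.0
    for q, p in zip(p2, p1):          # iterate over the N instances
        for qm, pm in zip(q, p):      # iterate over the M classes
            if qm > 0:                # 0 * log(0/p) is taken as 0
                total += qm * math.log(qm / pm)
    return total
```

Note that the KL divergence is asymmetric: $D_\text{KL}(p_2 \mid\mid p_1)$ measures how far $p_1$ is from the peer's posterior $p_2$, so each network gets its own mimicry term.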
The cross entropy error is given by
\[ L_{\text{C}, \Theta_1} = - \sum_{i=1}^N \sum_{m=1}^M I(y_i, m) \log(p_1^m(x_i)) \]
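The cross-entropy term only picks out the predicted probability of the true class for each instance, since the indicator $I(y_i, m)$ is 1 only when $m = y_i$. A small sketch (function name and input layout are illustrative):

```python
import math

def cross_entropy(probs, labels):
    """Supervised loss L_C summed over N instances.

    probs:  list of predicted probability vectors p_1(x_i)
    labels: list of integer class indices y_i; the indicator
            I(y_i, m) selects the entry p[y_i] of each vector.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels))
```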
The overall loss $L_{\Theta_1}$ is then
\[ L_{\Theta_1} = L_{\text{C}, \Theta_1} + D_\text{KL}(p_2 \mid\mid p_1) \]
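Putting the two terms together, a hedged sketch of the per-network objective follows (names are illustrative; the symmetric loss for $\Theta_2$ simply swaps the roles of $p_1$ and $p_2$):

```python
import math

def dml_loss(p1, p2, labels):
    """Overall DML loss for network Theta_1:
    supervised cross-entropy plus the KL mimicry term
    that pulls p1 toward the peer posterior p2.

    p1, p2: lists of probability vectors from the two networks
    labels: list of integer class indices y_i
    """
    ce = -sum(math.log(p[y]) for p, y in zip(p1, labels))
    kl = sum(qm * math.log(qm / pm)
             for q, p in zip(p2, p1)
             for qm, pm in zip(q, p) if qm > 0)
    return ce + kl
```

When the two networks agree exactly, the KL term vanishes and the loss reduces to the ordinary supervised cross-entropy.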
Results
Top-1 accuracy on the CIFAR-100 dataset obtained by various architectures.
Comparison with distillation on CIFAR-100.
Ablation study on the number of networks.
References
[1] Y. Zhang, T. Xiang, T. Hospedales, H. Lu. Deep Mutual Learning. In CVPR, 2018.