Perceptual GANs
The generator is a deep residual feature generative model that transforms the original poor features of small objects into highly discriminative ones by introducing fine-grained details from lower-level layers, achieving "super-resolution" on the intermediate representations. The discriminator serves as a supervisor and provides guidance on the quality and advantages of the generated fine-grained details. The Perceptual GAN also includes a new perceptual loss tailored to the detection task.
Perceptual GANs. Let $F_l$ and $F_s$ be the representations of large and small objects, respectively. We aim to learn a generator function $G$ that transforms the representation of a small object $F_s$ into a super-resolved one $G(F_s)$ that is similar to the representation of a large object $F_l$. A new conditional generator model is introduced to generate the residual representation between large and small objects, conditioned on extra auxiliary information, i.e. the low-level features of the small object $f$.
\[ \min_G \max_D L(D, G) \triangleq \mathbb{E}_{F_l \sim p_{data}(F_l)} [\log D(F_l)] + \mathbb{E}_{F_s \sim p_{F_s}(F_s \mid f)} [\log(1 - D(F_s + G(F_s \mid f)))] \]
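The objective can be evaluated numerically on a toy batch to see how the two expectation terms behave. The sketch below is illustrative only: the function names `super_resolve` and `gan_value`, the plain-list feature representation, and the idea of passing precomputed discriminator outputs are assumptions made for this example, not part of the original method description.

```python
import math

def super_resolve(f_s, residual):
    """Return F_s + G(F_s | f): small-object features plus the generated residual.

    Both arguments are plain lists of floats (an assumption for illustration).
    """
    return [a + b for a, b in zip(f_s, residual)]

def gan_value(d_large, d_super):
    """Monte Carlo estimate of L(D, G) from discriminator outputs.

    d_large: outputs D(F_l) in (0, 1) for large-object features.
    d_super: outputs D(F_s + G(F_s | f)) in (0, 1) for super-resolved features.
    """
    real_term = sum(math.log(d) for d in d_large) / len(d_large)
    fake_term = sum(math.log(1.0 - d) for d in d_super) / len(d_super)
    return real_term + fake_term
```

Under these assumptions, a discriminator that separates the two kinds of features well (outputs near 1 for large objects, near 0 for super-resolved ones) yields a larger value, which is what the inner maximization over $D$ seeks, while the generator tries to push the value back down.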
The generator $G_{\Theta_G}$ is obtained by optimizing the loss function $L_{dis}$:
\[ \Theta_G = \arg \min_{\Theta_G} L_{dis}(G_{\Theta_G}(F_s)) \]
The adversarial branch of the discriminator $D_{\Theta_a}$ is obtained by optimizing the loss function $L_a$:
\[ \begin{aligned}\Theta_a & = \arg \min_{\Theta_a} L_a(G_{\Theta_G}(F_s), F_l) \\ L_a & = -\log D_{\Theta_a}(F_l) - \log(1 - D_{\Theta_a}(G_{\Theta_G}(F_s)))\end{aligned} \]
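As a minimal sketch, $L_a$ can be computed for a single pair of discriminator outputs. The function name `adversarial_branch_loss` and the convention of passing the two scalar probabilities directly are assumptions for illustration.

```python
import math

def adversarial_branch_loss(d_large, d_generated):
    """L_a = -log D(F_l) - log(1 - D(G(F_s))).

    d_large: discriminator output on a large-object representation, in (0, 1).
    d_generated: discriminator output on a generated representation, in (0, 1).
    """
    return -math.log(d_large) - math.log(1.0 - d_generated)
```

The loss shrinks as the adversarial branch assigns a high probability to large-object representations and a low probability to generated ones, and it equals $2 \log 2$ when both outputs are 0.5.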
The perception branch of the discriminator $D_{\Theta_p}$ is obtained by optimizing the loss function $L_{dis\_p}$:
\[ \Theta_p = \arg \min_{\Theta_p} L_{dis\_p}(F_l) \]
Adversarial loss. An adversarial loss is introduced to encourage the generator network to produce a super-resolved representation for a small object that is similar to that of a large object.
\[ L_{dis\_a} = -\log D_{\Theta_a}(G_{\Theta_G}(F_s)) \]
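The generator-side adversarial term can be sketched the same way. The function name `generator_adversarial_loss` and the `eps` clamp that guards against $\log 0$ are assumptions added for numerical safety in this example.

```python
import math

def generator_adversarial_loss(d_generated, eps=1e-12):
    """L_dis_a = -log D(G(F_s)).

    d_generated: discriminator output on the generated representation, in [0, 1].
    eps: lower clamp to avoid log(0) (an assumption, not part of the formula).
    """
    return -math.log(max(d_generated, eps))
```

The loss approaches zero as the discriminator is fooled into scoring the generated representation close to 1, which is the behaviour the generator is trained toward.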
Perceptual loss. The multi-task loss $L_{dis\_p}$ measures how much the generated super-resolved features benefit detection accuracy for each object proposal:
\[ L_{dis\_p} = L_{cls}(p,g) + \mathbf{1}[g \geq 1] L_{loc}(r_g, r^*) \]
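where $L_{cls}(p, g) = -\log p_g$ is the log loss for the ground-truth class $g$ and $L_{loc}$ is a smooth $L_1$ loss. Here $p$ denotes the predicted class confidences, $r_g$ the predicted bounding-box regression offsets for class $g$, and $r^*$ the ground-truth regression target; the indicator $\mathbf{1}[g \geq 1]$ drops the localization term for background proposals ($g = 0$).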
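A minimal sketch of this multi-task loss for a single proposal follows. It assumes the Fast R-CNN form of the smooth $L_1$ function ($0.5x^2$ for $|x| < 1$, $|x| - 0.5$ otherwise), that class index 0 is background, and that the localization term sums over the box coordinates; the function names and list-based inputs are hypothetical.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere (Fast R-CNN form, assumed)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def perceptual_loss(p, g, r_g, r_star):
    """L_dis_p = L_cls(p, g) + 1[g >= 1] * L_loc(r_g, r*).

    p: list of class probabilities, index 0 assumed to be background.
    g: ground-truth class index.
    r_g: predicted box regression offsets for class g.
    r_star: ground-truth regression targets.
    """
    l_cls = -math.log(p[g])
    if g < 1:
        # Background proposal: no localization term.
        return l_cls
    l_loc = sum(smooth_l1(a - b) for a, b in zip(r_g, r_star))
    return l_cls + l_loc
```

For example, with `p = [0.2, 0.8]`, `g = 1`, offsets `[0.5, 2.0, 0.0, 0.0]` and zero targets, the classification term is $-\log 0.8$ and the localization term is $0.125 + 1.5$.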