Method
Let the convnet mapping be $f_\theta$, where $\theta$ is the set of parameters. Given a training set $X = \{x_1, \dots, x_N\}$ of $N$ images, we want to find a parameter $\theta^*$ such that the mapping $f_{\theta^*}$ produces good general-purpose features.
In the supervised learning domain, these parameters are learned from $k$ predefined classes, so each image $x_n$ is associated with a label $y_n \in \{0, 1\}^k$. With a classifier $g_W$ parameterized by $W$, the parameters $\theta$ and $W$ can be jointly learned by optimizing the multinomial logistic loss (negative log-softmax loss):
\[ \min_{\theta, W} \frac{1}{N} \sum_{n=1}^N l(g_W(f_\theta(x_n)), y_n) \;\;\;\;\;\;\;\; (1) \]
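The loss $l$ in Eq. (1) can be made concrete for one sample as follows (a minimal NumPy sketch; the function name and the example values are illustrative, not from the paper):

```python
import numpy as np

def softmax_nll(logits, label):
    """Multinomial logistic (negative log-softmax) loss for one sample.

    `logits` plays the role of g_W(f_theta(x_n)); `label` is the index of
    the 1 in the one-hot vector y_n. Names are illustrative.
    """
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log-softmax
    return -log_probs[label]

# The loss is small when the logit of the true class dominates.
loss = softmax_nll(np.array([4.0, 0.5, -1.0]), label=0)
```

Averaging this quantity over the $N$ training samples and minimizing it over $\theta$ and $W$ recovers Eq. (1).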
Unsupervised learning by clustering. The idea of DeepCluster is to bootstrap the discriminative power of a convnet. We cluster the output of the convnet and use the resulting cluster assignments as "pseudo-labels" to optimize Eq. (1). The clustering of features using $k$-means can be written as:
\[ \min_{C \in \mathbb{R}^{d \times k}} \frac{1}{N} \sum_{n=1}^N \min_{y_n \in \{0, 1\}^k} \lVert f_\theta(x_n) - Cy_n \rVert_2^2 \;\;\; \text{s.t.} \; y_n^\top 1_k = 1 \;\;\;\;\;\;\;\; (2) \]
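This objective can be minimized with a few Lloyd iterations, alternating the inner minimization over the assignments $y_n$ with the outer minimization over the centroid matrix $C$. A minimal NumPy sketch (illustrative, not the implementation used in the paper):

```python
import numpy as np

def kmeans_pseudolabels(features, k, n_iter=20, seed=0):
    """Minimal Lloyd's k-means over the features f_theta(x_n).

    Centroids are stored row-wise (k x d, the transpose of C in the text).
    Returns the centroids and the optimal assignments y_n, which serve as
    pseudo-labels. Illustrative sketch, not the paper's implementation.
    """
    rng = np.random.default_rng(seed)
    C = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assignment step: min over y_n of ||f_theta(x_n) - C y_n||_2^2
        d2 = ((features[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # update step: each non-empty cluster's centroid becomes its mean
        for j in range(k):
            if (labels == j).any():
                C[j] = features[labels == j].mean(0)
    return C, labels
```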
Overall, DeepCluster alternates between clustering the features to produce pseudo-labels and updating the parameters by predicting the pseudo-labels.
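One round of this alternation can be sketched as follows, where `features_fn` and `train_fn` are hypothetical stand-ins for the convnet forward pass and one round of optimizing Eq. (1) on the pseudo-labels (a schematic, not the paper's code):

```python
import numpy as np

def deepcluster_round(features_fn, train_fn, images, k, rng):
    """One round of the alternation (schematic, not the paper's code):
    cluster the current features into k pseudo-labels, then hand them to
    the supervised training step. `features_fn` and `train_fn` are
    hypothetical stand-ins for the convnet and the optimizer of Eq. (1).
    """
    F = features_fn(images)                       # N x d features f_theta(x_n)
    C = F[rng.choice(len(F), size=k, replace=False)].copy()
    for _ in range(10):                           # clustering step (k-means)
        labels = ((F[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                C[j] = F[labels == j].mean(0)
    train_fn(images, labels)                      # parameter update on pseudo-labels
    return labels
```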
Avoiding trivial solutions.
- Empty clusters. A common trick used in feature quantization consists in automatically reassigning empty clusters during $k$-means optimization. When a cluster becomes empty, we randomly select a non-empty cluster and use its centroid with a small perturbation as the new centroid for the empty cluster.
- Trivial parameterization. If the vast majority of images are assigned to a few clusters, the parameters $\theta$ will exclusively discriminate between them, which may lead to a trivial parameterization where the convnet predicts the same output regardless of the input. A strategy to circumvent this issue is to sample images based on a uniform distribution over the classes, or pseudo-labels; equivalently, the contribution of each input to the loss in Eq. (1) is weighted by the inverse of the size of its assigned cluster.
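Both tricks can be sketched in a few lines (illustrative NumPy code; the function names and the perturbation scale `eps` are assumptions, not from the paper):

```python
import numpy as np

def reassign_empty_clusters(C, labels, rng, eps=1e-4):
    """Replace each empty cluster's centroid with the centroid of a
    randomly chosen non-empty cluster plus a small perturbation
    (the scale `eps` is an arbitrary choice for illustration)."""
    counts = np.bincount(labels, minlength=len(C))
    nonempty = np.where(counts > 0)[0]
    for j in np.where(counts == 0)[0]:
        src = rng.choice(nonempty)
        C[j] = C[src] + eps * rng.standard_normal(C.shape[1])
    return C

def sample_uniform_over_clusters(labels, n_samples, rng):
    """Draw training images so that each pseudo-class is equally likely:
    an image's sampling weight is the inverse of its cluster's size."""
    counts = np.bincount(labels)
    w = 1.0 / counts[labels]
    return rng.choice(len(labels), size=n_samples, p=w / w.sum())
```

With a highly imbalanced clustering (say one cluster holding 99 of 100 images), the second function still draws each pseudo-class with probability $1/k$ in expectation, which prevents the dominant cluster from monopolizing the gradient updates.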