Method
Let the convnet mapping be $f_\theta$, where $\theta$ is the set of parameters. Given a training set $X = \{x_1, \dots, x_N\}$ of $N$ images, we want to find a parameter $\theta^*$ such that the mapping $f_{\theta^*}$ produces good general-purpose features.
In the supervised learning domain, these parameters are learned from $k$ predefined classes, so each image $x_n$ is associated with a label $y_n \in \{0, 1\}^k$. With a classifier $g_W$ parameterized by $W$, the parameters $\theta$ and $W$ can be jointly learned by optimizing the multinomial logistic loss (negative log-softmax loss):
\[ \min_{\theta, W} \frac{1}{N} \sum_{n=1}^N l(g_W(f_\theta(x_n)), y_n) \;\;\;\;\;\;\;\; (1) \]
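The loss $l$ in Eq. (1) can be made concrete for one sample as follows (a minimal NumPy sketch; the function name and the example values are illustrative, not from the paper):

```python
import numpy as np

def softmax_nll(logits, label):
    """Multinomial logistic (negative log-softmax) loss for one sample.

    `logits` plays the role of g_W(f_theta(x_n)); `label` is the index of
    the 1 in the one-hot vector y_n. Names are illustrative.
    """
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log-softmax
    return -log_probs[label]

# The loss is small when the logit of the true class dominates.
loss = softmax_nll(np.array([4.0, 0.5, -1.0]), label=0)
```

Averaging this quantity over the $N$ training samples and minimizing it over $\theta$ and $W$ recovers Eq. (1).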
Unsupervised learning by clustering. The idea of DeepCluster is to bootstrap the discriminative power of a convnet. We cluster the output of the convnet and use the resulting cluster assignments as "pseudo-labels" to optimize Eq. (1). The clustering of features using $k$-means can be written as:
\[ \min_{C \in \mathbb{R}^{d \times k}} \frac{1}{N} \sum_{n=1}^N \min_{y_n \in \{0, 1\}^k} \lVert f_\theta(x_n) - Cy_n \rVert_2^2 \;\;\; \text{s.t.} \; y_n^\top 1_k = 1 \;\;\;\;\;\;\;\; (2) \]
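This objective can be minimized with a few Lloyd iterations, alternating the inner minimization over the assignments $y_n$ with the outer minimization over the centroid matrix $C$. A minimal NumPy sketch (illustrative, not the implementation used in the paper):

```python
import numpy as np

def kmeans_pseudolabels(features, k, n_iter=20, seed=0):
    """Minimal Lloyd's k-means over the features f_theta(x_n).

    Centroids are stored row-wise (k x d, the transpose of C in the text).
    Returns the centroids and the optimal assignments y_n, which serve as
    pseudo-labels. Illustrative sketch, not the paper's implementation.
    """
    rng = np.random.default_rng(seed)
    C = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assignment step: min over y_n of ||f_theta(x_n) - C y_n||_2^2
        d2 = ((features[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # update step: each non-empty cluster's centroid becomes its mean
        for j in range(k):
            if (labels == j).any():
                C[j] = features[labels == j].mean(0)
    return C, labels
```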
Overall, DeepCluster alternates between clustering the features to produce pseudo-labels and updating the parameters by predicting the pseudo-labels.
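One round of this alternation can be sketched as follows, where `features_fn` and `train_fn` are hypothetical stand-ins for the convnet forward pass and one round of optimizing Eq. (1) on the pseudo-labels (a schematic, not the paper's code):

```python
import numpy as np

def deepcluster_round(features_fn, train_fn, images, k, rng):
    """One round of the alternation (schematic, not the paper's code):
    cluster the current features into k pseudo-labels, then hand them to
    the supervised training step. `features_fn` and `train_fn` are
    hypothetical stand-ins for the convnet and the optimizer of Eq. (1).
    """
    F = features_fn(images)                       # N x d features f_theta(x_n)
    C = F[rng.choice(len(F), size=k, replace=False)].copy()
    for _ in range(10):                           # clustering step (k-means)
        labels = ((F[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                C[j] = F[labels == j].mean(0)
    train_fn(images, labels)                      # parameter update on pseudo-labels
    return labels
```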
Avoiding trivial solutions.
- Empty clusters. A common trick used in feature quantization consists in automatically reassigning empty clusters during $k$-means optimization. When a cluster becomes empty, we randomly select a non-empty cluster and use its centroid with a small perturbation as the new centroid for the empty cluster.
- Trivial parameterization. If the vast majority of images are assigned to a few clusters, the parameters $\theta$ will exclusively discriminate between them, which may lead to a trivial parameterization where the convnet predicts the same output regardless of the input. A strategy to circumvent this issue is to sample images based on a uniform distribution over the classes, or pseudo-labels; equivalently, the contribution of each input to the loss in Eq. (1) is weighted by the inverse of the size of its assigned cluster.
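Both tricks can be sketched in a few lines (illustrative NumPy code; the function names and the perturbation scale `eps` are assumptions, not from the paper):

```python
import numpy as np

def reassign_empty_clusters(C, labels, rng, eps=1e-4):
    """Replace each empty cluster's centroid with the centroid of a
    randomly chosen non-empty cluster plus a small perturbation
    (the scale `eps` is an arbitrary choice for illustration)."""
    counts = np.bincount(labels, minlength=len(C))
    nonempty = np.where(counts > 0)[0]
    for j in np.where(counts == 0)[0]:
        src = rng.choice(nonempty)
        C[j] = C[src] + eps * rng.standard_normal(C.shape[1])
    return C

def sample_uniform_over_clusters(labels, n_samples, rng):
    """Draw training images so that each pseudo-class is equally likely:
    an image's sampling weight is the inverse of its cluster's size."""
    counts = np.bincount(labels)
    w = 1.0 / counts[labels]
    return rng.choice(len(labels), size=n_samples, p=w / w.sum())
```

With a highly imbalanced clustering (say one cluster holding 99 of 100 images), the second function still draws each pseudo-class with probability $1/k$ in expectation, which prevents the dominant cluster from monopolizing the gradient updates.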