Unsupervised Learning

July 2021

Wufei Ma
Purdue University

Abstract

Learning unsupervised learning without supervision.

In this post, we will review some previous literature on unsupervised learning:

  • J. Rissanen. Modeling by shortest data description. In Automatica, 1978.

Feature selection:

  • I. Guyon and A. Elisseeff. An introduction to variable and feature selection. In JMLR, 2003.
  • T. Liu, S. Liu, Z. Chen, and W. Ma. An evaluation on feature selection for text clustering. In ICML, 2003.
  • H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. In TKDE, 2005.
  • H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. In JCGS, 2006.
  • V. Kuleshov. Fast algorithms for sparse principal component analysis based on Rayleigh quotient iteration. In ICML, 2013.
Compressive Feature Learning

Unsupervised feature selection methods are particularly attractive. First, they do not require labeled examples. Second, they can be run a single time in an offline preprocessing step, producing a reduced feature space. Finally, a good data representation obtained in an unsupervised way captures inherent structure and can be reused in other tasks such as clustering, classification, or ranking.

In this work, Paskov et al. [1] propose an unsupervised feature selection method for text data based on ideas from data compression and formulated as an optimization problem. The method is grounded in the principle of minimum description length (MDL) [2] and uses a dictionary-based compression scheme to extract a succinct feature set. It can reduce the feature set size by two orders of magnitude without incurring a loss of performance on several text categorization tasks. Notably, the method seeks common substrings yet is unaffected by the concatenation order of corpus documents, and it avoids the instability issues of standard compression algorithms such as LZ77.

The MDL principle implies that a good feature representation for a document $D=x_1x_2\dots x_n$ of $n$ words minimizes some description length of $D$. The dictionary-based compression scheme achieves this by representing $D$ as a dictionary, i.e., a subset of $D$'s substrings, together with a sequence of pointers indicating where copies of each dictionary element should be placed to reconstruct the document exactly. The figure below shows different ways of representing a document $D$, with $\lambda$ controlling the tradeoff between dictionary costs and pointer costs.

[Figure: overview of different dictionary-and-pointer representations of a document $D$ for varying values of $\lambda$.]

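To make this encoding concrete, here is a minimal Python sketch that rebuilds a toy document from a dictionary and a list of pointers; the helper name, the toy data, and the 0-based positions are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the dictionary + pointer representation of a document.
# Positions are 0-based here; all names are illustrative, not the authors' code.

def reconstruct(n, pointers):
    """Rebuild a document of n words from (string, location) pointers.

    Each pointer (s, l) places a copy of the dictionary string s
    (a tuple of words) starting at position l.
    """
    doc = [None] * n
    for s, l in pointers:
        for offset, word in enumerate(s):
            doc[l + offset] = word
    assert None not in doc, "the pointers must cover every word"
    return doc

# Toy document "a rose is a rose" with dictionary {"a rose", "is"}.
dictionary = [("a", "rose"), ("is",)]
pointers = [(dictionary[0], 0), (dictionary[1], 2), (dictionary[0], 3)]
print(reconstruct(5, pointers))  # ['a', 'rose', 'is', 'a', 'rose']
```
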
Let $\mathcal{S}=\{x_i \dots x_{i+t-1} \mid 1\leq t\leq k,\ 1 \leq i \leq n-t+1\}$ be the set of all unique substrings of $D$ spanning at most $k$ words, and let $\mathcal{P} = \{ (s,l) \mid s \in \mathcal{S},\ s = x_l\dots x_{l+\lvert s \rvert-1} \}$ be the set of all $m=\lvert \mathcal{P} \rvert$ pointers, i.e., pairs of a candidate string and a location at which it occurs in $D$. We treat $\mathcal{P}$ as an ordered set so that its elements $p_i$ can be indexed by $i \in \{1, \dots, m\}$, and we define $J(s) \subset \{1, \dots, m\}$ to be the set of indices of all pointers whose string is $s$. Given a binary vector $w \in \{0, 1\}^m$, we say $w$ reconstructs word $x_j$ if there is some $i$ with $w_i=1$ whose pointer $p_i=(s,l)$ satisfies $l \leq j < l + \lvert s\rvert$.
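As a rough illustration of these definitions, the sketch below enumerates $\mathcal{S}$, $\mathcal{P}$, and $J(s)$ for a toy document; the function name and the choice $k=2$ are assumptions made for the example, and positions are 0-based in code.

```python
# Sketch: enumerate the candidate strings S (substrings of up to k words),
# the pointer set P (every occurrence of each candidate), and the index
# sets J(s). Names follow the post's notation but are otherwise illustrative.

def candidates_and_pointers(words, k):
    n = len(words)
    S = set()
    P = []  # list of (string, location) pairs; index i plays the role of p_i
    for l in range(n):
        for t in range(1, k + 1):
            if l + t > n:
                break
            s = tuple(words[l:l + t])
            S.add(s)
            P.append((s, l))
    # J[s] collects the indices of all pointers that share the string s.
    J = {}
    for i, (s, _) in enumerate(P):
        J.setdefault(s, []).append(i)
    return S, P, J

words = "a rose is a rose".split()
S, P, J = candidates_and_pointers(words, k=2)
print(len(S), len(P), J[("a", "rose")])  # 6 9 [1, 7]
```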

Compressing $D$ can then be formulated as a binary linear minimization problem over $w$: \[\min_w\; w^\top d + \sum_{s \in \mathcal{S}} c(s) \lVert w_{J(s)} \rVert_\infty, \quad \text{subject to } Xw \geq \mathbf{1},\ w\in \{0, 1\}^m,\] where $d \in \mathbb{R}_+^m$ assigns a storage cost to each pointer, $c(s)$ is the cost of adding string $s$ to the dictionary (for binary $w$ the $\ell_\infty$ term pays $c(s)$ exactly once if any pointer to $s$ is used), and $X \in \{0,1\}^{n \times m}$ is the coverage matrix with $X_{ji}=1$ iff pointer $p_i$ covers word $x_j$, so the constraint $Xw \geq \mathbf{1}$ requires every word of $D$ to be reconstructed.
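One simple way to handle the $\ell_\infty$ terms is to introduce an auxiliary variable $u_s \geq w_i$ for every $i \in J(s)$, which turns the objective into a linear program once the binary constraint is relaxed to $w \in [0,1]^m$. The sketch below sets up that relaxation with scipy's linprog, reusing the toy S, P, J, and words from the previous snippet; the cost choices ($d_i = \lambda$, $c(s) = (1-\lambda)\lvert s\rvert$) and the final rounding are assumptions for illustration, not the authors' exact setup.

```python
# Sketch of the LP relaxation of the compression objective on the toy data above.
# The binary constraint w in {0,1}^m is relaxed to w in [0,1]^m; costs are assumed.
import numpy as np
from scipy.optimize import linprog

lam = 0.5
m, n = len(P), len(words)
strings = sorted(S)                            # fix an order for the u_s variables
d = np.full(m, lam)                            # per-pointer storage cost (assumed uniform)
c = {s: (1 - lam) * len(s) for s in strings}   # dictionary cost of each candidate string

# Coverage matrix: X[j, i] = 1 iff pointer p_i = (s, l) covers word x_j.
X = np.zeros((n, m))
for i, (s, l) in enumerate(P):
    X[l:l + len(s), i] = 1

# Variables z = [w; u], one u_s per string; objective d^T w + sum_s c(s) u_s.
cost = np.concatenate([d, [c[s] for s in strings]])

A_ub, b_ub = [], []
# Coverage constraint Xw >= 1, written as -Xw <= -1.
A_ub.append(np.hstack([-X, np.zeros((n, len(strings)))]))
b_ub.append(-np.ones(n))
# Linking constraints w_i <= u_s for i in J(s); they model ||w_{J(s)}||_inf.
for si, s in enumerate(strings):
    for i in J[s]:
        row = np.zeros(m + len(strings))
        row[i], row[m + si] = 1.0, -1.0
        A_ub.append(row[None, :])
        b_ub.append(np.zeros(1))

res = linprog(cost, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
              bounds=[(0, 1)] * (m + len(strings)))
w = res.x[:m]
# Simple rounding for illustration; the relaxation need not be integral in general.
print("selected pointers:", [P[i] for i in np.flatnonzero(w > 0.5)])
```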

References

[1] H.S. Paskov, R. West, J.C. Mitchell, and T.J. Hastie. Compressive feature learning. In NIPS, 2013.

[2] J. Rissanen. Modeling by shortest data description. In Automatica, 1978.

Related Materials

[1] E. Hazan. COS 598 Unsupervised learning: Theory and practice (Spring 2017). Princeton University.

Copyright © 2017-21 Wufei Ma