Why Prototypes Collapse: Diagnosing and Preventing Partial Collapse in Prototypical Self-Supervised Learning


Abstract

Prototypical self-supervised learning methods consistently suffer from partial prototype collapse, where multiple prototypes converge to nearly identical representations. This undermines their central purpose of providing diverse and informative targets to guide encoders toward rich representations, and it has led practitioners to over-parameterize prototype sets or add ad-hoc regularizers, both of which mitigate symptoms rather than address the root cause. We empirically trace the collapse to the joint optimization of encoders and prototypes, which encourages a form of shortcut learning: early in training, prototypes drift toward redundant representations that minimize the loss without necessarily enhancing representation diversity. To break this joint optimization, we introduce a fully decoupled training strategy that learns prototypes and encoders under separate objectives. Concretely, we model the prototypes as a Gaussian mixture updated with an online EM-style procedure, independent of the encoder's loss. This simple yet principled decoupling eliminates prototype collapse without explicit regularization and yields consistently diverse prototypes, which in several settings translate into improved downstream performance.

Figure 1: Visualization of prototype distributions for different prototypical SSL methods. Existing methods exhibit pronounced mode collapse, with prototypes concentrating into a small number of regions. In contrast, ours maintains a well-spread and uniform prototype distribution, showing no visible signs of prototype collapse.


Background

Prototypical self-supervised learning (SSL) uses learnable prototypes to define structured targets that guide representation learning. However, many frameworks suffer from a phenomenon known as partial prototype collapse, in which a substantial fraction of the prototypes become redundant, converging to nearly identical representations. A key empirical observation in our study is that this collapse occurs early in training. This is problematic because it diminishes target diversity in the SSL objective, which can be especially harmful when training on long-tailed data distributions. A common response is to over-parameterize the prototype set, increasing computation and memory costs, or to add ad-hoc diversity regularizers at the cost of additional hyperparameters.

Problem Formulation

Traditional prototypical SSL jointly optimizes an encoder $f_\theta$ and a prototype set $C = \{c_k\}_{k=1}^K$ by minimizing a consistency loss over augmented views:
\(\begin{equation} \min_{\theta,\,C}\; \mathcal{L}_f(f_\theta, C). \end{equation}\)

We argue that this joint objective can induce shortcut learning: early in training, prototypes are incentivized to drift toward redundant configurations that reduce $\mathcal{L}_f$ without necessarily improving the encoder's representations, undermining the purpose of learning $C$.
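
To make this concrete, here is a minimal PyTorch-style sketch of the joint setup. It is an illustrative stand-in rather than the exact loss of any particular framework; the stand-in backbone, the temperature `tau`, and the cross-view targets are assumptions. The key point is that the prototypes sit inside the same optimizer as the encoder, so gradients of the consistency loss are free to reshape them.

```python
import torch
import torch.nn.functional as F

# Illustrative joint setup: prototypes are ordinary learnable parameters,
# so the same consistency loss that trains the encoder also reshapes them.
K, D = 1024, 256                                         # prototype count, feature dim
encoder = torch.nn.Linear(128, D)                        # stand-in backbone (assumption)
prototypes = torch.nn.Parameter(F.normalize(torch.randn(K, D), dim=1))
opt = torch.optim.SGD([*encoder.parameters(), prototypes], lr=0.1)

def consistency_loss(x1, x2, tau=0.1):
    """A generic cross-view prototype-assignment consistency loss (one instance of L_f)."""
    z1 = F.normalize(encoder(x1), dim=1)
    z2 = F.normalize(encoder(x2), dim=1)
    c = F.normalize(prototypes, dim=1)
    p1, p2 = z1 @ c.T / tau, z2 @ c.T / tau              # prototype scores per view
    targets = F.softmax(p2.detach(), dim=1)              # view 2's assignment as the target
    return -(targets * F.log_softmax(p1, dim=1)).sum(dim=1).mean()

# One joint step: gradients flow into both the encoder and the prototypes,
# so prototype configurations that lower L_f by becoming redundant are reachable early on.
x1, x2 = torch.randn(32, 128), torch.randn(32, 128)      # two augmented views (placeholder)
loss = consistency_loss(x1, x2)
opt.zero_grad(); loss.backward(); opt.step()
```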

Proposed Solution

Instead of optimizing $(\theta, C)$ jointly, we propose to fully decouple prototype estimation from encoder learning and alternate two updates at iteration $t$:

(i) Prototype update:
\(\begin{equation} C^{t} = \operatorname*{arg\,min}_{C\in\mathcal{C}}\; \mathcal{L}_C\left(C;\, C^{t-1}, f^{t}_\theta\right). \end{equation}\)
We update $C$ via an online expectation-maximization (EM) procedure, independently of the encoder's loss $\mathcal{L}_f$.

(ii) Encoder update:
\(\begin{equation} \theta^{t+1}=\operatorname*{arg\,min}_{\theta}\;\mathcal{L}_f\left(f_\theta, C^{t}\right). \end{equation}\)
We update the encoder while keeping the prototypes $C^t$ fixed. Our proposed solution is illustrated in the figure below, followed by a minimal code sketch of the alternation:

Figure 2: (a) Joint optimization of the encoder and prototypes, which can induce shortcut learning and prototype collapse. (b) Our decoupled framework which separates prototype updates from encoder learning, preventing collapse and preserving prototype diversity.
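
Below is a minimal sketch of how the decoupled alternation could be instantiated, assuming spherical Gaussian components with a shared fixed variance, an exponential-moving-average M-step, and a generic cross-view consistency loss for the encoder. These modeling choices, along with names such as `em_prototype_step`, are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def em_prototype_step(prototypes, mix_weights, z, sigma2=0.1, momentum=0.9):
    """One online EM-style update of Gaussian-mixture prototypes.

    prototypes:  (K, D) component means
    mix_weights: (K,)   mixing coefficients
    z:           (B, D) detached encoder embeddings of the current batch
    Assumes spherical components with shared variance sigma2 (illustrative choice).
    """
    # E-step: responsibilities r[b, k] proportional to pi_k * N(z_b | c_k, sigma2 I)
    sq_dist = torch.cdist(z, prototypes).pow(2)                      # (B, K)
    log_resp = torch.log(mix_weights + 1e-12) - sq_dist / (2 * sigma2)
    resp = F.softmax(log_resp, dim=1)                                # (B, K)

    # M-step (online): move each mean toward its responsibility-weighted batch mean
    resp_sum = resp.sum(dim=0)                                       # (K,)
    batch_means = (resp.T @ z) / (resp_sum.unsqueeze(1) + 1e-12)     # (K, D)
    updated = torch.where(
        (resp_sum > 1e-6).unsqueeze(1),                              # skip empty components
        momentum * prototypes + (1 - momentum) * batch_means,
        prototypes,
    )
    new_weights = momentum * mix_weights + (1 - momentum) * resp_sum / z.size(0)
    return updated, new_weights

# Decoupled alternation: the prototypes are a plain tensor outside the optimizer,
# so no gradient from the encoder loss L_f ever reaches them.
D, K, tau = 256, 1024, 0.1
encoder = torch.nn.Linear(128, D)                        # stand-in backbone (assumption)
prototypes = F.normalize(torch.randn(K, D), dim=1)
mix_weights = torch.full((K,), 1.0 / K)
opt = torch.optim.SGD(encoder.parameters(), lr=0.1)      # encoder parameters only

for _ in range(100):                                     # stand-in for a real dataloader
    x1, x2 = torch.randn(32, 128), torch.randn(32, 128)  # two augmented views (placeholder)

    # (i) prototype update: online EM on detached embeddings, independent of L_f
    with torch.no_grad():
        z = F.normalize(encoder(x1), dim=1)
    prototypes, mix_weights = em_prototype_step(prototypes, mix_weights, z)

    # (ii) encoder update: cross-view consistency against the fixed prototypes C^t
    z1 = F.normalize(encoder(x1), dim=1)
    z2 = F.normalize(encoder(x2), dim=1)
    c = F.normalize(prototypes, dim=1)
    p1, p2 = z1 @ c.T / tau, z2 @ c.T / tau
    targets = F.softmax(p2.detach(), dim=1)
    loss = -(targets * F.log_softmax(p1, dim=1)).sum(dim=1).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()
```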


Results

Across most existing prototypical SSL frameworks, we observe a consistent pattern of early partial prototype collapse: a substantial subset of the prototypes becomes redundant within the first stages of training. In contrast, our decoupled optimization strategy exhibits no observable collapse throughout training, and this behavior remains stable even as we progressively tighten the criterion used to declare collapse, i.e., under increasingly strict notions of prototype redundancy. While our method also improves performance on several downstream tasks, the primary finding is its robustness against prototype collapse, highlighting the effectiveness of decoupling prototype estimation from encoder learning.

Figure 3: Left: Percentage of unique prototypes versus the collapse threshold $\epsilon$, where larger $\epsilon$ enforces a stricter notion of uniqueness. Existing methods exhibit increasing prototype collapse as the criterion becomes stricter, whereas our proposed decoupled approach shows no collapse across all thresholds. Right: k-NN performance, indicating that preserving prototype diversity can improve downstream performance.
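
For readers who want to probe their own prototype sets, here is a rough sketch of one way such a uniqueness measure could be computed. The redundancy rule below, which counts two prototypes as duplicates when their cosine similarity exceeds $1-\epsilon$ so that larger $\epsilon$ is stricter, is an assumption for illustration and may not match the exact definition used in the paper.

```python
import torch
import torch.nn.functional as F

def unique_prototype_fraction(prototypes: torch.Tensor, eps: float) -> float:
    """Fraction of prototypes that are not near-duplicates of an earlier one.

    Illustrative redundancy rule (an assumption, not necessarily the paper's
    definition): two prototypes are duplicates when their cosine similarity
    exceeds 1 - eps, so a larger eps is a stricter notion of uniqueness.
    """
    c = F.normalize(prototypes, dim=1)
    sim = c @ c.T                                        # (K, K) pairwise cosine similarity
    K = c.size(0)
    duplicate = torch.zeros(K, dtype=torch.bool)
    for i in range(K):
        if duplicate[i]:
            continue
        # mark later prototypes that are nearly identical to prototype i
        duplicate[i + 1:] |= sim[i, i + 1:] > 1 - eps
    return (~duplicate).float().mean().item()

# usage: report the unique fraction over a range of thresholds
# for eps in (1e-4, 1e-3, 1e-2, 1e-1):
#     print(eps, unique_prototype_fraction(prototypes, eps))
```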


Citation

@inproceedings{arteaga2026prototypes,
title={Why Prototypes Collapse: Diagnosing and Preventing Partial Collapse in Prototypical Self-Supervised Learning},
author={Arteaga, Gabriel Y. and Aasan, Marius and Chakraborty, Rwiddhi and Hjelkrem-Tan, Martine and Silva, Thalles and Kampffmeyer, Michael and Ram\'irez Rivera, Ad\'in},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=fVJEWdwvLO}
}