Abstract

Masked Image Modeling (MIM) has become a ubiquitous self-supervised vision paradigm. In this work, we show that MIM objectives cause the learned representations to retain non-semantic information, which ultimately hurts performance during inference. We introduce a model-agnostic score for semantic invariance using Principal Component Analysis (PCA) on real and synthetic non-semantic images. Based on this score, we propose a simple method, Semantically Orthogonal Artifact Projection (SOAP), to directly suppress non-semantic information in patch representations, leading to consistent improvements in zero-shot performance across various MIM-based models. SOAP is a post-hoc suppression method, requires zero training, and can be attached to any model as a single linear head.

Self-supervised learning (SSL) via Masked Image Modeling (MIM) objectives has become a popular source of strong, generalized vision backbones. However, the representations learned by models that rely on MIM-based objectives suffer from key issues with artifacts and noise.

We present a novel method to measure the amount of non-semantic noise in ViT tokens for state-of-the-art MIM-based models. We characterize non-semantic noise as components that are invariant to the semantic content of the input. Examples include positional encodings, which are necessary for attention mechanisms but seldom useful at inference time, and structural artifacts.

We discover that strong principal components exhibit high levels of non-semantic noise, and that this behavior is pervasive in MIM-based models while nearly non-existent in other, non-MIM-based SSL models. Importantly, this holds regardless of which positional embedding method is employed and whether predictions are conducted in latent or input space, suggesting that the issue is implicit to MIM.

To suppress non-semantic information, we introduce Semantically Orthogonal Artifact Projection (SOAP) to remove unwanted artifacts that are not useful for inherently semantic tasks, such as instance-level classification and salient segmentation—cf. Fig. 1. SOAP is flexible: it is computed directly from data using a Gram-Schmidt based projection, thus requiring no training, and can be attached as an external module to any pretrained SSL backbone.

SOAP pipeline

Figure 1: Pipeline overview; a pretrained MIM encoder outputs dense representations $z$, which are used for downstream tasks—we show salient segmentation as an example. By identifying and suppressing principal components encoding positional noise, our SOAP module improves the representations and enhances downstream performance in zero-shot settings.

Methodology: The TL;DR

Semantic invariance refers to the property of a component yielding consistent responses even when the semantic content of local representations varies. In other words, a component is semantically invariant if it produces similar activations regardless of whether the input carries meaningful semantic information.

We introduce a new score for measuring Semantic Invariance (SI) in the learned patch representation space, by comparing the representations of real images to those of synthetically generated “noise” images. The synthetic images are generated by a mixture of pink noise, modulated white noise, and random low-frequency gradient fields.
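For intuition, one ingredient of such a generator—a 1/f ("pink") noise image—can be sketched in a few lines of NumPy. This is a minimal sketch only; the actual generator mixes several noise types as described above, and `pink_noise_image` is an illustrative name:

```python
import numpy as np

def pink_noise_image(size=224, seed=0):
    """Sample a 1/f ("pink") noise image: white noise shaped in the
    Fourier domain so amplitude falls off with spatial frequency."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal((size, size))
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    f = np.sqrt(fx**2 + fy**2)
    f[0, 0] = 1.0  # avoid division by zero at the DC component
    spectrum = np.fft.fft2(white) / f
    img = np.real(np.fft.ifft2(spectrum))
    # normalize to [0, 1] for use as an image
    return (img - img.min()) / (img.max() - img.min() + 1e-8)
```

Such images share the spatial frequency statistics of natural images while carrying no semantic content, which is what makes them useful probes for semantic invariance.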

In short, given a model $f$ that encodes an image $x$ into $N$ patch embedding representations $z_1,…,z_N$, we perform PCA of the patch representation space using Welford’s algorithm to estimate the covariance, and obtain the eigendecomposition $\text{Cov}(\mathcal{Z}) = V\Lambda V^\top \in \mathbb{R}^{D \times D}$, with principal component vectors $V=(v_1, …, v_D)$. Let $\Omega$ be the set of images with a natural spatial frequency spectrum. For each principal component, we calculate the aggregated activations of real and synthetic images by

\[P_d = \text{mean} \big[ \text{Pr}(A_{d,n}=1 \mid x \sim \mathcal{X}) \big], \quad \text{where } \mathcal{X} \subset \Omega \text{ is the set of semantically informative images,}\]

\[Q_d = \text{mean} \big[ \text{Pr}(A_{d,n}=1 \mid x^c \sim \mathcal{X}^c) \big], \quad \text{where } \mathcal{X}^c \subset \Omega \text{ is the set of images without semantic content,}\]

where $A_{d,n} = \mathbf{1}[ z_n^\top v_d \geq 0 ]$ is the activation of patch $z_n$ for component $v_d$. In practice, $\mathcal{X}$ can be instantiated as a set of natural images (for example the ImageNet validation set), while $\mathcal{X}^c$ is approximated by our synthetic image generator. The Semantic Invariance (SI) of component $d$ is then measured by

\[s_d = 2 \cdot \frac{P_d \cdot Q_d + (1-P_d)\cdot(1-Q_d)}{\sqrt{P_d^2 + (1-P_d)^2} + \sqrt{Q_d^2 + (1-Q_d)^2}} = 2 \cdot \frac{ \langle {P}_d, {Q}_d \rangle}{||{P}_d|| + ||{Q}_d||},\]

which assigns high scores when $P_d \approx Q_d$, and low scores when the activation rates diverge.
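Putting these pieces together, the covariance estimate, activation rates, and SI score can be sketched as follows. This assumes NumPy; `StreamingCovariance` and `activation_rates` are illustrative helper names, and the patch embeddings would come from any pretrained backbone:

```python
import numpy as np

class StreamingCovariance:
    """Welford-style streaming estimate of the patch-embedding covariance."""
    def __init__(self, dim):
        self.n, self.mean = 0, np.zeros(dim)
        self.m2 = np.zeros((dim, dim))      # running sum of outer products

    def update(self, z):                    # z: (num_patches, dim)
        for row in z:
            self.n += 1
            delta = row - self.mean
            self.mean += delta / self.n
            self.m2 += np.outer(delta, row - self.mean)

    def covariance(self):
        return self.m2 / (self.n - 1)

def activation_rates(patches, V):
    """Per-component fraction of patches with z_n^T v_d >= 0 (P_d or Q_d)."""
    return (patches @ V >= 0).mean(axis=0)

def si_score(P, Q):
    """Semantic Invariance score per component from activation rates."""
    p = np.stack([P, 1.0 - P], axis=-1)     # each rate as a 2-vector (P, 1-P)
    q = np.stack([Q, 1.0 - Q], axis=-1)
    inner = (p * q).sum(-1)
    return 2.0 * inner / (np.linalg.norm(p, axis=-1) + np.linalg.norm(q, axis=-1))
```

The principal components would then be obtained via `_, V = np.linalg.eigh(est.covariance())`, and `si_score(activation_rates(real_patches, V), activation_rates(noise_patches, V))` yields one score per component.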

Finally, we introduce Semantically Orthogonal Artifact Projection (SOAP), a simple Gram-Schmidt-based post-hoc projection method to suppress components with high SI score:

\[P_\phi = I - VWV^\top, \quad W=\text{diag} (w_1,...,w_D) \quad \text{s.t.} \quad w_d = s_d \cdot \frac{ \sigma \big( (\mu - r_d)/ \tau \big)}{\sigma (\mu/ \tau)}, \quad r_d = \text{rank}(s_d).\]

Here, $\mu$ and $\tau$ are hyperparameters to control the cutoff and smoothness of suppression. This projection can be used to “correct” the patch representations $\hat z = P_\phi z$ by removing the contribution of components that encode non-semantic information such as positional noise.
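In code, the projection can be sketched as follows. The `mu` and `tau` defaults here are illustrative placeholders, not the tuned values from the paper:

```python
import numpy as np

def soap_projection(V, s, mu=2.0, tau=0.5):
    """Build P_phi = I - V W V^T, softly suppressing components with
    high SI scores s. mu, tau control the cutoff and smoothness of
    suppression (illustrative defaults)."""
    D = V.shape[0]
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # rank components by SI score, descending: rank 1 = most invariant
    r = np.empty(D)
    r[np.argsort(-s)] = np.arange(1, D + 1)
    w = s * sigmoid((mu - r) / tau) / sigmoid(mu / tau)
    return np.eye(D) - V @ np.diag(w) @ V.T

# corrected patch embeddings (rows are patches):
# z_hat = z @ soap_projection(V, s).T
```

Since the weights satisfy $0 \leq w_d < 1$, the result is a soft projection: high-SI components are attenuated rather than hard-zeroed unless $s_d$ and its rank push $w_d$ toward 1.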

Read our paper for a more in-depth formulation and explanation of the methodology!

Semantic Invariance in MIM models

We analyze popular SSL models by measuring their Semantic Invariance, and find a striking contrast between models trained with and without a MIM-style objective. The main findings are summarized by the plots below. Models with MIM are shown in solid lines; models without MIM are shown in dashed lines.

SI scores plots

Figure 2: Semantic invariance (SI) score in descending order. All scores are shown in the left plot, while the right focuses on the top 10 semantically invariant scores.

The models trained with MIM have at least two principal components with high SI scores ($\geq 0.75$). Upon manual inspection, we find that these components encode positional information about the patch location. Critically, DINO and DeiT3, which are not trained with MIM, do not exhibit the same behavior and have lower SI scores in general ($\leq 0.75$).

Looking at the maximum SI score of each transformer block, we see that while some MIM models (DINOv2 and DINOv3) have lower SI scores in earlier layers, there is a sharp increase in the last layers, where all MIM models land well above the $0.75$ threshold. This can be explained by the model packing more positional information into the embeddings in preparation for solving the MIM task.

SI score vs. depth

Figure 3: Maximum semantic invariance score for MIM models (solid lines) and non-MIM models (dashed lines) vs. model depth. Critically, MIM models show high SI-scores in the last layers. This can be explained by the MIM objective encouraging positional information in the patch embeddings of deeper layers.

SOAP improves zero-shot performance

We use SOAP to correct for semantically invariant components in local embeddings, and find that this improves performance in zero-shot downstream tasks for all MIM models in our study. Read our paper for more results!

Salient segmentation

We use TokenCut for zero-shot evaluation of salient information present in patch embeddings, and find that correcting with SOAP can significantly improve performance. Below are the results for the ECSSD dataset.

| Pretrain | Model | max Fβ | IoU | Acc |
|---|---|---|---|---|
| *Original embeddings* | | | | |
| DINO | ViT-B16 | 82.580 | 74.325 | 90.929 |
| DINOv2 | ViT-B16 | 71.319 | 63.937 | 83.147 |
| DINOv3 | ViT-B16 | 36.975 | 29.122 | 52.953 |
| iBOT | ViT-B16 | 62.873 | 56.248 | 78.785 |
| CAPI | ViT-L14 | 72.456 | 66.083 | 84.334 |
| MAE | ViT-B16 | 79.952 | 71.067 | 89.410 |
| I-JEPA | ViT-H14 | 37.670 | 27.989 | 68.898 |
| *SOAP-corrected embeddings* | | | | |
| DINOv2 | ViT-B16 | 80.633 | 72.559 | 88.687 |
| DINOv3 | ViT-B16 | 42.633 | 33.742 | 61.975 |
| iBOT | ViT-B16 | 66.557 | 60.167 | 78.340 |
| CAPI | ViT-L14 | 85.219 | 78.084 | 92.600 |
| MAE | ViT-B16 | 82.094 | 72.118 | 91.444 |
| I-JEPA | ViT-H14 | 40.239 | 31.162 | 71.406 |
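For intuition, the kind of spectral grouping TokenCut performs on patch affinities can be approximated in a few lines. This is a simplified sketch (unnormalized Laplacian, mean-thresholded Fiedler vector), not the actual TokenCut implementation:

```python
import numpy as np

def spectral_bipartition(patches):
    """Split patches into two groups by thresholding the Fiedler
    (second-smallest) eigenvector of a cosine-affinity graph Laplacian.
    Simplified sketch of TokenCut-style grouping, not the full method."""
    z = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    A = np.maximum(z @ z.T, 0.0)     # non-negative cosine affinities
    L = np.diag(A.sum(1)) - A        # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)      # eigenvectors, ascending eigenvalues
    fiedler = vecs[:, 1]
    return fiedler > fiedler.mean()  # boolean foreground/background mask
```

Because the grouping depends entirely on the affinity structure of the patch embeddings, removing positionally contaminated components before building the affinity matrix directly changes which patches group together.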

kNN segmentation

We evaluate zero-shot segmentation on ADE20k by performing per-patch k-nearest neighbors (kNN) and upsampling the predictions to full image resolution using nearest neighbor interpolation.
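A minimal sketch of this evaluation, assuming NumPy, with labeled patch embeddings drawn from a training split; `knn_predict` and `upsample_nearest` are illustrative names:

```python
import numpy as np

def knn_predict(train_z, train_y, test_z, k=5):
    """Label each test patch by majority vote over its k nearest
    training patches under cosine similarity."""
    a = train_z / np.linalg.norm(train_z, axis=1, keepdims=True)
    b = test_z / np.linalg.norm(test_z, axis=1, keepdims=True)
    nn = np.argsort(-(b @ a.T), axis=1)[:, :k]   # k nearest per test patch
    votes = train_y[nn]                          # (n_test, k) label votes
    return np.array([np.bincount(v).argmax() for v in votes])

def upsample_nearest(patch_preds, grid, patch):
    """Nearest-neighbor upsampling of per-patch labels to pixel resolution."""
    return np.kron(patch_preds.reshape(grid, grid), np.ones((patch, patch), int))
```

Applying SOAP here amounts to replacing `train_z` and `test_z` with their projected counterparts before running the same kNN pipeline.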

| Pretrain | Model | IoU (Org.) | Acc (Org.) | IoU (SOAP) | Δ IoU | Acc (SOAP) | Δ Acc |
|---|---|---|---|---|---|---|---|
| DINOv2 | ViT-B16 | 40.253 | 74.603 | 40.808 | ↑ 0.556 | 74.723 | ↑ 0.120 |
| DINOv3 | ViT-B16 | 43.849 | 77.943 | 44.575 | ↑ 0.725 | 78.101 | ↑ 0.158 |
| iBOT | ViT-B16 | 27.726 | 70.859 | 28.426 | ↑ 0.700 | 71.262 | ↑ 0.403 |
| CAPI | ViT-L14 | 31.382 | 71.626 | 31.637 | ↑ 0.255 | 71.770 | ↑ 0.144 |
| MAE | ViT-B16 | 11.882 | 58.002 | 12.651 | ↑ 0.769 | 58.592 | ↑ 0.591 |
| I-JEPA | ViT-H14 | 20.952 | 60.273 | 21.258 | ↑ 0.306 | 60.287 | ↑ 0.014 |

Citation

@inproceedings{hjelkremtan2025soap,
  title={Suppressing Non-Semantic Noise in Masked Image Modeling Representations},
  author={Hjelkrem-Tan, Martine and Aasan, Marius and Chakraborty, Rwiddhi and Arteaga, Gabriel Y. and Choi, Changkyu and Ram\'irez Rivera, Ad\'in},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}