Abstract

Masked Image Modeling (MIM) has become a ubiquitous self-supervised vision paradigm. In this work, we show that MIM objectives cause the learned representations to retain non-semantic information, which ultimately hurts performance during inference. We introduce a model-agnostic score for semantic invariance using Principal Component Analysis (PCA) on real and synthetic non-semantic images. Based on this score, we propose a simple method, Semantically Orthogonal Artifact Projection (SOAP), to directly suppress non-semantic information in patch representations, leading to consistent improvements in zero-shot performance across various MIM-based models. SOAP is a post-hoc suppression method, requires zero training, and can be attached to any model as a single linear head.

Self-supervised learning (SSL) via Masked Image Modeling (MIM) objectives has become a popular source of strong, generalized vision backbones. However, the representations learned by models that rely on MIM-based objectives suffer from key issues with artifacts and noise.

We present a novel method to measure the amount of non-semantic noise in ViT tokens for state-of-the-art MIM-based models. We characterize non-semantic noise as components that are invariant to the semantic content of the input. Examples include positional encodings, which are necessary for attention mechanisms but seldom useful at inference time, and structural artifacts.

We discover that strong principal components exhibit high levels of non-semantic noise, and that this behavior is pervasive in MIM-based models while nearly non-existent in other, non-MIM-based SSL models. Importantly, this holds regardless of which positional embedding method is employed and whether predictions are conducted in latent or input space, suggesting that the issue is implicit to MIM.

To suppress non-semantic information, we introduce Semantically Orthogonal Artifact Projection (SOAP) to remove unwanted artifacts that are not useful for inherently semantic tasks, such as instance-level classification and salient segmentation—cf. Fig. 1. SOAP is flexible: it is computed directly from data using a Gram-Schmidt based projection, thus requiring no training, and can be attached as an external module to any pretrained SSL backbone.

SOAP pipeline

Figure 1: Pipeline overview; a pretrained MIM encoder outputs dense representations $z$, which are used for downstream tasks—we show salient segmentation as an example. By identifying and suppressing principal components encoding positional noise, our SOAP module improves the representations and enhances downstream performance in zero-shot settings.

Methodology: The TL;DR

Semantic invariance refers to the property of a component yielding consistent responses even when the semantic content of local representations varies. In other words, a component is semantically invariant if it produces similar activations regardless of whether the input carries meaningful semantic information.

We introduce a new score for measuring Semantic Invariance (SI) in the learned patch representation space, by comparing the representations of real images to those of synthetically generated “noise” images. The synthetic images are generated by a mixture of pink noise, modulated white noise, and random low-frequency gradient fields.
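For intuition, one ingredient of such a generator—a 1/f ("pink") noise image—can be sketched in a few lines of NumPy. This is a minimal sketch only; the actual generator mixes several noise types as described above, and `pink_noise_image` is an illustrative name:

```python
import numpy as np

def pink_noise_image(size=224, seed=0):
    """Sample a 1/f ("pink") noise image: white noise shaped in the
    Fourier domain so amplitude falls off with spatial frequency."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal((size, size))
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    f = np.sqrt(fx**2 + fy**2)
    f[0, 0] = 1.0  # avoid division by zero at the DC component
    spectrum = np.fft.fft2(white) / f
    img = np.real(np.fft.ifft2(spectrum))
    # normalize to [0, 1] for use as an image
    return (img - img.min()) / (img.max() - img.min() + 1e-8)
```

Such images share the spatial frequency statistics of natural images while carrying no semantic content, which is what makes them useful probes for semantic invariance.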

In short, given a model $f$ that encodes an image $x$ into $N$ patch embedding representations $z_1,…,z_N$, we perform PCA of the patch representation space using Welford’s algorithm to estimate the covariance, and obtain the eigendecomposition $\text{Cov}(\mathcal{Z}) = V\Lambda V^\top \in \mathbb{R}^{D \times D}$, with principal component vectors $V=(v_1, …, v_D)$. Let $\Omega$ be the set of images with a natural spatial frequency spectrum. For each principal component, we calculate the aggregated activations of real and synthetic images by

\[P_d = \text{mean} \big[ \text{Pr}(A_{d,n}=1 \mid x \sim \mathcal{X}) \big], \quad \text{where } \mathcal{X} \subset \Omega \text{ is the set of semantically informative images,}\]

\[Q_d = \text{mean} \big[ \text{Pr}(A_{d,n}=1 \mid x^c \sim \mathcal{X}^c) \big], \quad \text{where } \mathcal{X}^c \subset \Omega \text{ is the set of images without semantic content,}\]

where $A_{d,n} = \mathbf{1}[ z_n^\top v_d \geq 0 ]$ is the activation of patch $z_n$ for component $v_d$. In practice, $\mathcal{X}$ can be instantiated as a set of natural images (for example the ImageNet validation set), while $\mathcal{X}^c$ is approximated by our synthetic image generator. The Semantic Invariance (SI) of component $d$ is then measured by

\[s_d = 2 \cdot \frac{P_d \cdot Q_d + (1-P_d)\cdot(1-Q_d)}{\sqrt{P_d^2 + (1-P_d)^2} + \sqrt{Q_d^2 + (1-Q_d)^2}} = 2 \cdot \frac{ \langle {P}_d, {Q}_d \rangle}{||{P}_d|| + ||{Q}_d||},\]

which assigns high scores when $P_d \approx Q_d$, and low scores when the activation rates diverge.
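Putting these pieces together, the covariance estimate, activation rates, and SI score can be sketched as follows. This assumes NumPy; `StreamingCovariance` and `activation_rates` are illustrative helper names, and the patch embeddings would come from any pretrained backbone:

```python
import numpy as np

class StreamingCovariance:
    """Welford-style streaming estimate of the patch-embedding covariance."""
    def __init__(self, dim):
        self.n, self.mean = 0, np.zeros(dim)
        self.m2 = np.zeros((dim, dim))      # running sum of outer products

    def update(self, z):                    # z: (num_patches, dim)
        for row in z:
            self.n += 1
            delta = row - self.mean
            self.mean += delta / self.n
            self.m2 += np.outer(delta, row - self.mean)

    def covariance(self):
        return self.m2 / (self.n - 1)

def activation_rates(patches, V):
    """Per-component fraction of patches with z_n^T v_d >= 0 (P_d or Q_d)."""
    return (patches @ V >= 0).mean(axis=0)

def si_score(P, Q):
    """Semantic Invariance score per component from activation rates."""
    p = np.stack([P, 1.0 - P], axis=-1)     # each rate as a 2-vector (P, 1-P)
    q = np.stack([Q, 1.0 - Q], axis=-1)
    inner = (p * q).sum(-1)
    return 2.0 * inner / (np.linalg.norm(p, axis=-1) + np.linalg.norm(q, axis=-1))
```

The principal components would then be obtained via `_, V = np.linalg.eigh(est.covariance())`, and `si_score(activation_rates(real_patches, V), activation_rates(noise_patches, V))` yields one score per component.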

Finally, we introduce Semantically Orthogonal Artifact Projection (SOAP), a simple Gram-Schmidt-based post-hoc projection method to suppress components with high SI score:

\[P_\phi = I - VWV^\top, \quad W=\text{diag} (w_1,...,w_D) \quad \text{s.t.} \quad w_d = s_d \cdot \frac{ \sigma \big( (\mu - r_d)/ \tau \big)}{\sigma (\mu/ \tau)}, \quad r_d = \text{rank}(s_d).\]

Here, $\mu$ and $\tau$ are hyperparameters to control the cutoff and smoothness of suppression. This projection can be used to “correct” the patch representations $\hat z = P_\phi z$ by removing the contribution of components that encode non-semantic information such as positional noise.
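In code, the projection can be sketched as follows. The `mu` and `tau` defaults here are illustrative placeholders, not the tuned values from the paper:

```python
import numpy as np

def soap_projection(V, s, mu=2.0, tau=0.5):
    """Build P_phi = I - V W V^T, softly suppressing components with
    high SI scores s. mu, tau control the cutoff and smoothness of
    suppression (illustrative defaults)."""
    D = V.shape[0]
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # rank components by SI score, descending: rank 1 = most invariant
    r = np.empty(D)
    r[np.argsort(-s)] = np.arange(1, D + 1)
    w = s * sigmoid((mu - r) / tau) / sigmoid(mu / tau)
    return np.eye(D) - V @ np.diag(w) @ V.T

# corrected patch embeddings (rows are patches):
# z_hat = z @ soap_projection(V, s).T
```

Since the weights satisfy $0 \leq w_d < 1$, the result is a soft projection: high-SI components are attenuated rather than hard-zeroed unless $s_d$ and its rank push $w_d$ toward 1.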

Read our paper for a more in-depth formulation and explanation of the methodology!

Semantic Invariance in MIM models

We analyze popular SSL models by measuring their Semantic Invariance, and find a striking contrast between models trained with and without a MIM-style objective. The main findings are summarized by the plots below. Models with MIM are shown in solid lines; models without MIM are shown in dashed lines.

SI scores plots

Figure 2: Semantic invariance (SI) score in descending order. All scores are shown in the left plot, while the right focuses on the top 10 semantically invariant scores.

The models trained with MIM have at least two principal components with high SI scores ($\geq 0.75$). Upon manual inspection, we find that these components encode positional information about the patch location. Critically, DINO and DeiT3, which are not trained with MIM, do not exhibit the same behavior and have lower SI scores in general ($\leq 0.75$).

Looking at the maximum SI score of each transformer block, we see that while some MIM models (DINOv2 and DINOv3) have lower SI scores in earlier layers, there is a sharp increase in the last layers, where all MIM models land well above the $0.75$ threshold. This can be explained by the model packing more positional information into the embeddings in preparation for solving the MIM task.

SI score vs. depth

Figure 3: Maximum semantic invariance score for MIM models (solid lines) and non-MIM models (dashed lines) vs. model depth. Critically, MIM models show high SI-scores in the last layers. This can be explained by the MIM objective encouraging positional information in the patch embeddings of deeper layers.

SOAP improves zero-shot performance

We use SOAP to correct for semantically invariant components in local embeddings, and find that this improves performance in zero-shot downstream tasks for all MIM models in our study. Read our paper for more results!

Salient segmentation

We use TokenCut for zero-shot evaluation of salient information present in patch embeddings, and find that correcting with SOAP can significantly improve performance. Below are the results for the ECSSD dataset.

| Pretrain | Model | max Fβ | IoU | Acc |
|---|---|---|---|---|
| *Original embeddings* | | | | |
| DINO | ViT-B16 | 82.580 | 74.325 | 90.929 |
| DINOv2 | ViT-B16 | 71.319 | 63.937 | 83.147 |
| DINOv3 | ViT-B16 | 36.975 | 29.122 | 52.953 |
| iBOT | ViT-B16 | 62.873 | 56.248 | 78.785 |
| CAPI | ViT-L14 | 72.456 | 66.083 | 84.334 |
| MAE | ViT-B16 | 79.952 | 71.067 | 89.410 |
| I-JEPA | ViT-H14 | 37.670 | 27.989 | 68.898 |
| *SOAP-corrected embeddings* | | | | |
| DINOv2 | ViT-B16 | 80.633 | 72.559 | 88.687 |
| DINOv3 | ViT-B16 | 42.633 | 33.742 | 61.975 |
| iBOT | ViT-B16 | 66.557 | 60.167 | 78.340 |
| CAPI | ViT-L14 | 85.219 | 78.084 | 92.600 |
| MAE | ViT-B16 | 82.094 | 72.118 | 91.444 |
| I-JEPA | ViT-H14 | 40.239 | 31.162 | 71.406 |
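For intuition, the kind of spectral grouping TokenCut performs on patch affinities can be approximated in a few lines. This is a simplified sketch (unnormalized Laplacian, mean-thresholded Fiedler vector), not the actual TokenCut implementation:

```python
import numpy as np

def spectral_bipartition(patches):
    """Split patches into two groups by thresholding the Fiedler
    (second-smallest) eigenvector of a cosine-affinity graph Laplacian.
    Simplified sketch of TokenCut-style grouping, not the full method."""
    z = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    A = np.maximum(z @ z.T, 0.0)     # non-negative cosine affinities
    L = np.diag(A.sum(1)) - A        # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)      # eigenvectors, ascending eigenvalues
    fiedler = vecs[:, 1]
    return fiedler > fiedler.mean()  # boolean foreground/background mask
```

Because the grouping depends entirely on the affinity structure of the patch embeddings, removing positionally contaminated components before building the affinity matrix directly changes which patches group together.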

kNN segmentation

We evaluate zero-shot segmentation on ADE20k by performing per-patch k-nearest neighbors (kNN) and upsampling the predictions to full image resolution using nearest neighbor interpolation.
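A minimal sketch of this evaluation, assuming NumPy, with labeled patch embeddings drawn from a training split; `knn_predict` and `upsample_nearest` are illustrative names:

```python
import numpy as np

def knn_predict(train_z, train_y, test_z, k=5):
    """Label each test patch by majority vote over its k nearest
    training patches under cosine similarity."""
    a = train_z / np.linalg.norm(train_z, axis=1, keepdims=True)
    b = test_z / np.linalg.norm(test_z, axis=1, keepdims=True)
    nn = np.argsort(-(b @ a.T), axis=1)[:, :k]   # k nearest per test patch
    votes = train_y[nn]                          # (n_test, k) label votes
    return np.array([np.bincount(v).argmax() for v in votes])

def upsample_nearest(patch_preds, grid, patch):
    """Nearest-neighbor upsampling of per-patch labels to pixel resolution."""
    return np.kron(patch_preds.reshape(grid, grid), np.ones((patch, patch), int))
```

Applying SOAP here amounts to replacing `train_z` and `test_z` with their projected counterparts before running the same kNN pipeline.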

| Pretrain | Model | IoU (Org.) | Acc (Org.) | IoU (SOAP) | Δ IoU | Acc (SOAP) | Δ Acc |
|---|---|---|---|---|---|---|---|
| DINOv2 | ViT-B16 | 40.253 | 74.603 | 40.808 | ↑ 0.556 | 74.723 | ↑ 0.120 |
| DINOv3 | ViT-B16 | 43.849 | 77.943 | 44.575 | ↑ 0.725 | 78.101 | ↑ 0.158 |
| iBOT | ViT-B16 | 27.726 | 70.859 | 28.426 | ↑ 0.700 | 71.262 | ↑ 0.403 |
| CAPI | ViT-L14 | 31.382 | 71.626 | 31.637 | ↑ 0.255 | 71.770 | ↑ 0.144 |
| MAE | ViT-B16 | 11.882 | 58.002 | 12.651 | ↑ 0.769 | 58.592 | ↑ 0.591 |
| I-JEPA | ViT-H14 | 20.952 | 60.273 | 21.258 | ↑ 0.306 | 60.287 | ↑ 0.014 |

Citation

@inproceedings{hjelkremtan2025soap,
  title={Suppressing Non-Semantic Noise in Masked Image Modeling Representations},
  author={Hjelkrem-Tan, Martine and Aasan, Marius and Chakraborty, Rwiddhi and Arteaga, Gabriel Y. and Choi, Changkyu and Ram\'irez Rivera, Ad\'in},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}