Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

arXiv:2511.11041v2 Announce Type: replace-cross
Abstract: We find that current sentence-embedding models produce outputs with a consistent bias: every embedding $e$ decomposes as $\tilde e + \mu$, where the mean $\mu$ is near-identical across all sentences. We study two training-free corrections, subtracting $\mu$ directly (R1) or projecting each embedding off the mean direction (R2), and show, via a first-order error-propagation argument, that R2 cancels the parallel component of mean-estimation error that R1 retains. Across 38 models on the Massive Multilingual Text Embedding Benchmark (MMTEB)~\citep{MMTEB}, R2 yields consistent classification gains (paired $\bar t = 3.31$, 29 of 38 models with $t>2$, zero losses), and the per-model mean norm $\Vert\mu\Vert$ correlates with which models benefit most. A nine-method dose-response ablation on five models further reveals that mild single-direction removal helps, but full principal component analysis (PCA) whitening hurts every model we test, and that R2 and All-but-the-Top with depth one agree within $0.18$ pp downstream despite weak geometric alignment between $\hat\mu$ and the centered top principal component.
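
The two corrections named in the abstract can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, $\hat\mu$ is assumed to be the sample mean of a batch of embeddings, and the final rescaling to unit length is an assumption suggested by the word "renormalization" in the title.

```python
import numpy as np

def estimate_mean(embeddings):
    """Estimate mu-hat as the sample mean of an (n, d) embedding matrix (assumption)."""
    return embeddings.mean(axis=0)

def _unit_rows(x, eps=1e-12):
    """Rescale each row to unit L2 norm (assumed renormalization step)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def r1_subtract_mean(embeddings, mu_hat):
    """R1 (sketch): subtract mu-hat from every embedding, then rescale."""
    return _unit_rows(embeddings - mu_hat)

def r2_project_off_mean(embeddings, mu_hat):
    """R2 (sketch): remove each embedding's component along mu-hat, then rescale."""
    u = mu_hat / np.linalg.norm(mu_hat)        # unit vector in the mean direction
    parallel = np.outer(embeddings @ u, u)     # per-row component along u
    return _unit_rows(embeddings - parallel)

# Synthetic data: isotropic noise plus a large shared offset along one axis.
rng = np.random.default_rng(0)
mu = np.zeros(16)
mu[0] = 3.0
raw = rng.normal(size=(200, 16)) + mu

mu_hat = estimate_mean(raw)
r1 = r1_subtract_mean(raw, mu_hat)
r2 = r2_project_off_mean(raw, mu_hat)
```

In this sketch R2 uses only the direction of $\hat\mu$, never its length, so an error in the estimated norm of the mean does not change the R2 output, whereas R1 subtracts the full estimate and inherits that error.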

