arXiv:2604.11043v3 Announce Type: replace
Abstract: Unified multimodal embedding spaces underpin practical applications such as cross-modal retrieval and zero-shot recognition. In many real deployments, however, supervision is available only for a small subset of modality pairs (e.g., image–text), leaving unpaired modality pairs (e.g., audio$\leftrightarrow$depth, infrared$\leftrightarrow$audio) weakly connected and thus performing poorly on zero-shot transfer. Addressing this sparse-pairing regime is therefore essential for scaling unified embedding systems to new tasks without curating exhaustive pairwise data. We propose EmergentBridge, an embedding-level bridging framework that improves performance on these unpaired pairs without requiring exhaustive pairwise supervision. Our key observation is that naively aligning a new modality to a synthesized proxy embedding can introduce gradient interference, degrading the anchor-alignment structure that existing retrieval/classification relies on. EmergentBridge addresses this by (i) learning a mapping that produces a noisy bridge anchor (a proxy embedding of an already-aligned modality) from an anchor embedding, and (ii) enforcing proxy alignment only in the subspace orthogonal to the anchor-alignment direction, preserving anchor alignment while strengthening non-anchor connectivity. Across nine datasets spanning multiple modalities, EmergentBridge consistently outperforms prior binding baselines on zero-shot classification and retrieval, demonstrating strong emergent alignment.
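The abstract does not give the exact formulation of step (ii), so the following is only a minimal sketch of one possible reading: treat the anchor-alignment direction as a single vector and remove from the proxy-alignment update its component along that vector. The names `project_orthogonal`, `proxy_update`, and `anchor_dir` are hypothetical and do not come from the paper.

```python
# Illustrative sketch only (an assumption, not the paper's implementation):
# keep the part of a proxy-alignment update that is orthogonal to the
# anchor-alignment direction.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_orthogonal(update, anchor_dir, eps=1e-12):
    """Return `update` minus its component along `anchor_dir`.

    If `anchor_dir` is (numerically) zero there is no direction to
    protect, so the update is returned unchanged.
    """
    denom = dot(anchor_dir, anchor_dir)
    if denom < eps:
        return list(update)
    coeff = dot(update, anchor_dir) / denom
    return [u - coeff * a for u, a in zip(update, anchor_dir)]


# Toy vectors (hypothetical values chosen for illustration).
anchor_dir = [1.0, 0.0, 0.0]     # direction used by anchor alignment
proxy_update = [-2.0, 3.0, 1.0]  # update that partly opposes the anchor direction

projected = project_orthogonal(proxy_update, anchor_dir)
```

In this toy case the component of `proxy_update` that opposes `anchor_dir` is removed, so the projected update has zero inner product with the anchor direction while its other components are unchanged.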
AI needs a strong data fabric to deliver business value
Artificial intelligence is moving quickly in the enterprise, from experimentation to everyday use. Organizations are deploying copilots, agents, and predictive systems across finance, supply chains,


