arXiv:2604.05673v1 Announce Type: cross
Abstract: Visual navigation is a core challenge in Embodied AI, requiring autonomous agents to translate high-dimensional sensory observations into continuous, long-horizon action trajectories. While generative policies based on diffusion models and Schrödinger Bridges (SB) effectively capture multimodal action distributions, they require dozens of integration steps due to high-variance stochastic transport, posing a critical barrier for real-time robotic control. We propose Rectified Schrödinger Bridge Matching (RSBM), a framework that exploits a shared velocity-field structure between standard Schrödinger Bridges ($\varepsilon=1$, maximum-entropy transport) and deterministic Optimal Transport ($\varepsilon \to 0$, as in Conditional Flow Matching), controlled by a single entropic regularization parameter $\varepsilon$. We prove two key results: (1) the conditional velocity field’s functional form is invariant across the entire $\varepsilon$-spectrum (Velocity Structure Invariance), enabling a single network to serve all regularization strengths; and (2) reducing $\varepsilon$ linearly decreases the conditional velocity variance, enabling more stable coarse-step ODE integration. Anchored to a learned conditional prior that shortens transport distance, RSBM operates at an intermediate $\varepsilon$ that balances multimodal coverage and path straightness. Empirically, while standard bridges require $\geq 10$ steps to converge, RSBM achieves over 94% cosine similarity and 92% success rate in merely 3 integration steps, without distillation or multi-stage training, substantially narrowing the gap between high-fidelity generative policies and the low-latency demands of Embodied AI.
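
The two claimed results can be illustrated with a minimal numerical sketch. The abstract does not give the velocity formula, so the code below assumes the standard Brownian-bridge conditional path $x_t = (1-t)x_0 + t x_1 + \sqrt{\varepsilon\, t(1-t)}\, z$ and its probability-flow velocity, as commonly used in Schrödinger bridge matching; this may differ from the paper's exact construction. The function names (`bridge_sample`, `conditional_velocity`, `velocity_variance`, `closed_form_variance`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def bridge_sample(x0, x1, t, eps, rng):
    """Sample x_t from an assumed Brownian bridge with variance eps * t * (1 - t)."""
    mean = (1.0 - t) * x0 + t * x1
    std = np.sqrt(eps * t * (1.0 - t))
    return mean + std * rng.standard_normal(x0.shape)


def conditional_velocity(x, x0, x1, t):
    """Probability-flow velocity of the assumed bridge.

    eps is not an argument: it cancels in sigma_t' / sigma_t, so the
    functional form is the same for every regularization strength.
    """
    mean = (1.0 - t) * x0 + t * x1
    return (1.0 - 2.0 * t) / (2.0 * t * (1.0 - t)) * (x - mean) + (x1 - x0)


def velocity_variance(eps, t=0.25, n=200_000, seed=0):
    """Monte Carlo estimate of the conditional velocity variance at time t."""
    rng = np.random.default_rng(seed)
    x0 = np.zeros(n)  # toy 1-D endpoints, not robot actions
    x1 = np.ones(n)
    xt = bridge_sample(x0, x1, t, eps, rng)
    return conditional_velocity(xt, x0, x1, t).var()


def closed_form_variance(eps, t=0.25):
    """Variance implied by the assumed path: eps * (1 - 2t)^2 / (4 t (1 - t))."""
    return eps * (1.0 - 2.0 * t) ** 2 / (4.0 * t * (1.0 - t))


if __name__ == "__main__":
    for eps in (1.0, 0.5, 0.1):
        print(eps, velocity_variance(eps), closed_form_variance(eps))
```

Under these assumptions, the estimated variance scales in direct proportion to $\varepsilon$ while `conditional_velocity` never takes $\varepsilon$ as an input, which is consistent with the two results stated in the abstract.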
