arXiv:2601.00217v2 Announce Type: replace-cross
Abstract: Singing voice synthesis (SVS) aims to generate natural and expressive singing waveforms from symbolic musical scores. In cVAE-based SVS, however, a mismatch arises because the decoder is trained with latent representations inferred from target singing signals, while inference relies on latent representations predicted only from conditioning inputs. This discrepancy can weaken fine expressive acoustic details in the synthesized output. To mitigate this issue, we propose FM-Singer, a flow-matching-based latent refinement framework for cVAE-based singing voice synthesis. Rather than redesigning the acoustic decoder, the proposed method learns a continuous vector field that transports inference-time latent samples toward posterior-like latent representations through ODE-based integration before waveform generation. Because the refinement is performed in latent space, the method remains lightweight and compatible with a strong parallel synthesis backbone. Experimental results on Korean and Chinese singing datasets show that the proposed latent refinement improves objective metrics and perceptual quality while maintaining practical synthesis efficiency. These results suggest that reducing training-inference latent mismatch is a useful direction for improving expressive singing voice synthesis. Code, pre-trained checkpoints, and audio demos are available at https://github.com/alsgur9368/FM-Singer.
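The abstract's core mechanism, transporting inference-time latents toward posterior-like latents by integrating a learned vector field as an ODE, can be sketched minimally. This is not the paper's implementation: `toy_vector_field` stands in for the learned network, and the straight-line field toward a target mean is an illustrative assumption (rectified-flow-style), chosen only to show the Euler integration loop.

```python
import numpy as np

def toy_vector_field(z, t, target):
    # Stand-in for the learned vector field v_theta(z, t, condition).
    # Assumed toy form: a field pointing from the current latent z
    # toward a posterior-like target; the real model predicts this.
    return target - z

def refine_latent(z0, target, n_steps=100):
    # Euler integration of dz/dt = v(z, t) from t=0 to t=1,
    # transporting a prior-like latent z0 toward the target region.
    z = z0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        z = z + dt * toy_vector_field(z, t, target)
    return z

# Example: a prior-like latent is moved closer to a posterior-like one.
z0 = np.zeros(4)
target = np.ones(4)
z1 = refine_latent(z0, target)
```

Because the refinement is a short ODE solve in latent space, the per-sample cost is a handful of small network evaluations, which is why the abstract can claim the method stays lightweight relative to redesigning the decoder.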
Translating AI research into reality: summary of the 2025 Voice AI Symposium and Hackathon
The 2025 Voice AI Symposium marked a transition from conceptual research to clinical implementation in vocal biomarker science. Hosted by the NIH-funded Bridge2AI-Voice consortium, the