VS-Singer: Vision-Guided Stereo Singing Voice Synthesis with Consistency Schrödinger Bridge

 

Anonymous submission to Interspeech 2025

Abstract

We propose VS-Singer, a vision-guided model that generates stereo singing voices with room reverberation from scene images. VS-Singer consists of a modal interaction network, a decoder based on a consistency Schrödinger bridge (CSB), and a spatially-aware feature enhancement module (SFE). Specifically, the modal interaction network fuses spatial features into the text encoding to produce a linguistic representation enriched with spatial information. Subsequently, the decoder uses the CSB to establish a tractable diffusion bridge between this representation and binaural mel-spectrograms, enabling sample generation in a single step. Additionally, we employ the SFE to accelerate model convergence. To the best of our knowledge, this work is the first to integrate stereo singing voice synthesis with visual acoustic matching into a unified framework. Experimental results show that VS-Singer can efficiently generate stereo singing voices that match the scene perspective in a single step. In particular, our model achieves 2x faster inference than cascaded systems.
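For readers unfamiliar with consistency-style one-step sampling, the sketch below illustrates the general idea of producing a sample with a single network evaluation from a bridge endpoint built around the conditioning representation. Everything here is an illustrative assumption rather than the paper's architecture: the names (ConsistencyDecoder, one_step_sample), the 80-bin mel dimension, and the toy MLP are all hypothetical placeholders.

# Minimal sketch (assumptions): a consistency-style one-step sampler whose
# bridge endpoint is the spatially-aware linguistic representation plus
# Gaussian noise. Not the paper's actual CSB decoder.
import torch
import torch.nn as nn

class ConsistencyDecoder(nn.Module):
    """Toy stand-in for the CSB decoder: predicts the clean binaural
    mel-spectrogram from any point on the bridge in one call."""
    def __init__(self, mel_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * mel_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * mel_dim),  # left + right channel mels
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on the bridge time t, broadcast over frames.
        t_feat = t.expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))

@torch.no_grad()
def one_step_sample(decoder: ConsistencyDecoder,
                    linguistic_rep: torch.Tensor,
                    noise_scale: float = 1.0) -> torch.Tensor:
    """Start from the bridge endpoint (conditioning + noise) and map it to
    the data endpoint with a single network evaluation."""
    x_T = linguistic_rep + noise_scale * torch.randn_like(linguistic_rep)
    t_T = torch.ones(1)  # terminal bridge time
    return decoder(x_T, t_T)

# Usage with a dummy conditioning tensor of hypothetical shape
# [batch, frames, 2 * mel_dim]:
decoder = ConsistencyDecoder()
cond = torch.zeros(1, 120, 160)
mel_stereo = one_step_sample(decoder, cond)
print(mel_stereo.shape)  # torch.Size([1, 120, 160])

The key point the sketch conveys is that, unlike an iterative diffusion sampler, the decoder is called exactly once per utterance, which is where the reported speedup over cascaded systems comes from.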



[Figure: Mel spectrograms by diffusion step (Step 1, Step 4, Step 50).]
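As a point of comparison with the one-step sampler above, the hedged sketch below shows a generic multistep consistency-style sampler that alternately predicts the data endpoint and re-noises at a smaller bridge time; the schedule and noise scaling are illustrative assumptions, not the paper's exact procedure. It reuses the hypothetical ConsistencyDecoder from the earlier sketch.

import torch

@torch.no_grad()
def multi_step_sample(decoder, linguistic_rep,
                      n_steps: int = 4, noise_scale: float = 1.0):
    """Generic consistency-model multistep sampling: predict the clean
    endpoint, then re-inject noise at the next (smaller) bridge time."""
    x = linguistic_rep + noise_scale * torch.randn_like(linguistic_rep)
    times = torch.linspace(1.0, 0.0, n_steps + 1)
    for t_cur, t_next in zip(times[:-1], times[1:]):
        x0_hat = decoder(x, t_cur.view(1))
        if t_next > 0:
            # Noise scaled by the remaining bridge time.
            x = x0_hat + noise_scale * t_next * torch.randn_like(x0_hat)
        else:
            x = x0_hat
    return x

With n_steps set to 1, 4, or 50 this reduces to the settings compared in the figure; n_steps = 1 collapses to the one-step sampler shown earlier.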

Singing Voice Synthesis Demo

Script: 终于找到心有灵犀的美好 (English: Finally I found the beauty of two hearts in perfect sync)

GT

NSF-HiFiGAN (Recon.)

DiffSinger

VISinger2

CoMoSpeech

VS-Singer

Script: 然后孤单被吞没了 (English: And then the loneliness was swallowed up)

GT

NSF-HiFiGAN (Recon.)

DiffSinger

VISinger2

CoMoSpeech

VS-Singer

Script: 能不能给我一首歌的时间 (English: Could you give me the length of one song?)

GT

NSF-HiFiGAN (Recon.)

DiffSinger

VISinger2

CoMoSpeech

VS-Singer

Ablation Study Demo

Script: 终于找到心有灵犀的美好 (English: Finally I found the beauty of two hearts in perfect sync)

Full

W/o Modal Interaction

W/o Schrödinger Bridge

W/o SFE

Script: 然后孤单被吞没了 (English: And then the loneliness was swallowed up)

Full

W/o Modal Interaction

W/o Schrödinger Bridge

W/o SFE

Script: 能不能给我一首歌的时间 (English: Could you give me the length of one song?)

Full

W/o Modal Interaction

W/o Schrödinger Bridge

W/o SFE