VS-Singer: Vision-Guided Stereo Singing Voice Synthesis with Consistency Schrödinger Bridge

 

Anonymous submission to Interspeech 2025

Abstract

We propose VS-Singer, a vision-guided model that generates stereo singing voices with room reverberation from scene images. VS-Singer consists of a modal interaction network, a decoder based on a consistency Schrödinger bridge (CSB), and a spatially-aware feature enhancement module (SFE). Specifically, the modal interaction network fuses spatial features into the text encoding to produce a linguistic representation enriched with spatial information. Subsequently, the decoder uses the CSB to establish a tractable diffusion bridge between this representation and binaural mel-spectrograms, enabling sample generation in a single step. Additionally, we employ the SFE to accelerate model convergence. To the best of our knowledge, this work is the first to integrate stereo singing voice synthesis with visual acoustic matching into a unified framework. Experimental results show that VS-Singer can efficiently generate stereo singing voices that match the scene perspective in a single step. In particular, our model achieves 2x faster inference than cascaded systems.
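For readers unfamiliar with consistency-style one-step sampling, the sketch below illustrates the general idea of producing a sample with a single network evaluation from a bridge endpoint built around the conditioning representation. Everything here is an illustrative assumption rather than the paper's architecture: the names (ConsistencyDecoder, one_step_sample), the 80-bin mel dimension, and the toy MLP are all hypothetical placeholders.

# Minimal sketch (assumptions): a consistency-style one-step sampler whose
# bridge endpoint is the spatially-aware linguistic representation plus
# Gaussian noise. Not the paper's actual CSB decoder.
import torch
import torch.nn as nn

class ConsistencyDecoder(nn.Module):
    """Toy stand-in for the CSB decoder: predicts the clean binaural
    mel-spectrogram from any point on the bridge in one call."""
    def __init__(self, mel_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * mel_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * mel_dim),  # left + right channel mels
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on the bridge time t, broadcast over frames.
        t_feat = t.expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))

@torch.no_grad()
def one_step_sample(decoder: ConsistencyDecoder,
                    linguistic_rep: torch.Tensor,
                    noise_scale: float = 1.0) -> torch.Tensor:
    """Start from the bridge endpoint (conditioning + noise) and map it to
    the data endpoint with a single network evaluation."""
    x_T = linguistic_rep + noise_scale * torch.randn_like(linguistic_rep)
    t_T = torch.ones(1)  # terminal bridge time
    return decoder(x_T, t_T)

# Usage with a dummy conditioning tensor of hypothetical shape
# [batch, frames, 2 * mel_dim]:
decoder = ConsistencyDecoder()
cond = torch.zeros(1, 120, 160)
mel_stereo = one_step_sample(decoder, cond)
print(mel_stereo.shape)  # torch.Size([1, 120, 160])

The key point the sketch conveys is that, unlike an iterative diffusion sampler, the decoder is called exactly once per utterance, which is where the reported speedup over cascaded systems comes from.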



[Figure: Mel spectrograms by diffusion step (Step 1, Step 4, Step 50).]
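As a point of comparison with the one-step sampler above, the hedged sketch below shows a generic multistep consistency-style sampler that alternately predicts the data endpoint and re-noises at a smaller bridge time; the schedule and noise scaling are illustrative assumptions, not the paper's exact procedure. It reuses the hypothetical ConsistencyDecoder from the earlier sketch.

import torch

@torch.no_grad()
def multi_step_sample(decoder, linguistic_rep,
                      n_steps: int = 4, noise_scale: float = 1.0):
    """Generic consistency-model multistep sampling: predict the clean
    endpoint, then re-inject noise at the next (smaller) bridge time."""
    x = linguistic_rep + noise_scale * torch.randn_like(linguistic_rep)
    times = torch.linspace(1.0, 0.0, n_steps + 1)
    for t_cur, t_next in zip(times[:-1], times[1:]):
        x0_hat = decoder(x, t_cur.view(1))
        if t_next > 0:
            # Noise scaled by the remaining bridge time.
            x = x0_hat + noise_scale * t_next * torch.randn_like(x0_hat)
        else:
            x = x0_hat
    return x

With n_steps set to 1, 4, or 50 this reduces to the settings compared in the figure; n_steps = 1 collapses to the one-step sampler shown earlier.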

Singing Voice Synthesis Demo

Script: 终于找到心有灵犀的美好 (English: Finally I found the beauty of two hearts in perfect sync)

GT

NSF-HiFiGAN (Recon.)

DiffSinger

VISinger2

CoMoSpeech

VS-Singer

Script: 然后孤单被吞没了 (English: And then the loneliness was swallowed up)

GT

NSF-HiFiGAN (Recon.)

DiffSinger

VISinger2

CoMoSpeech

VS-Singer

Script: 能不能给我一首歌的时间 (English: Could you give me the length of one song?)

GT

NSF-HiFiGAN (Recon.)

DiffSinger

VISinger2

CoMoSpeech

VS-Singer

Ablation Study Demo

Script: 终于找到心有灵犀的美好 (English: Finally I found the beauty of two hearts in perfect sync)

Full

W/o Modal Interaction

W/o Schrödinger Bridge

W/o SFE

Script: 然后孤单被吞没了 (English: And then the loneliness was swallowed up)

Full

W/o Modal Interaction

W/o Schrödinger Bridge

W/o SFE

Script: 能不能给我一首歌的时间 (English: Could you give me the length of one song?)

Full

W/o Modal Interaction

W/o Schrödinger Bridge

W/o SFE