Anonymous submission to Interspeech 2025
We propose VS-Singer, a vision-guided model that generates stereo singing voices with room reverberation from scene images. VS-Singer consists of a modal interaction network, a decoder based on a consistency Schrödinger bridge (CSB), and a spatially-aware feature enhancement module (SFE). Specifically, the modal interaction network fuses spatial features into the text encoding to produce a linguistic representation that carries spatial information. The decoder then uses the CSB to establish a tractable diffusion bridge between this representation and binaural mel-spectrograms, which enables it to generate samples in one step. Additionally, we employ the SFE to accelerate model convergence. To the best of our knowledge, this work is the first to integrate stereo singing voice synthesis and visual acoustic matching into a unified framework. Experimental results show that VS-Singer can efficiently generate stereo singing voices that match the scene perspective in one step. In particular, our model achieves 2x faster inference than cascaded systems.
Figure: Mel-spectrogram by diffusion step (panels: Step 1, Step 4, Step 50).

| Script | Systems compared |
|---|---|
| 终于找到心有灵犀的美好 ("At last finding the beauty of two hearts in tune") | GT, NSF-HiFiGAN (Recon.), DiffSinger, VISinger2, ComoSpeech, VS-Singer |
| 然后孤单被吞没了 ("And then the loneliness was swallowed up") | GT, NSF-HiFiGAN (Recon.), DiffSinger, VISinger2, ComoSpeech, VS-Singer |
| 能不能给我一首歌的时间 ("Can you give me the time of one song") | GT, NSF-HiFiGAN (Recon.), DiffSinger, VISinger2, ComoSpeech, VS-Singer |


| Script | Model variants |
|---|---|
| 终于找到心有灵犀的美好 ("At last finding the beauty of two hearts in tune") | Full, W/o Modal Interaction, W/o Schrödinger Bridge, W/o SFE |
| 然后孤单被吞没了 ("And then the loneliness was swallowed up") | Full, W/o Modal Interaction, W/o Schrödinger Bridge, W/o SFE |
| 能不能给我一首歌的时间 ("Can you give me the time of one song") | Full, W/o Modal Interaction, W/o Schrödinger Bridge, W/o SFE |