EditSinger: Zero-Shot Text-Based Singing Voice Editing System with Diverse Prosody Modeling
Abstract
Zero-shot text-based singing voice editing lets users edit singing content simply by performing text operations on the lyrics, without any additional data from the target singer. However, because singing and speech place different demands on an editing system, applying existing speech editing methods to the singing voice editing task raises challenges: mainly the lack of systematic prosody modeling for insertion and deletion, and the trade-off between natural pronunciation and preservation of prosody in replacement. In this paper, we propose EditSinger, a novel singing voice editing model with specially designed diverse prosody modules that address these challenges. Specifically, 1) a general masked variance adaptor is introduced for comprehensive prosody modeling of the inserted lyrics and of the transition at the deletion boundary; and 2) we further design a fusion pitch predictor for replacement. By disentangling the reference pitch and fusing the predicted pronunciation, the edited pitch can be reconstructed, which can ensure natural pronunciation while preserving the prosody of the original audio. To the best of our knowledge, EditSinger is the first zero-shot text-based singing voice editing system. Experiments on the OpenSinger dataset show that EditSinger can synthesize high-quality edited singing voices with natural prosody for each editing operation.
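To make the three text operations concrete, here is a minimal sketch of how an edit to the lyrics could be classified as an insertion, deletion, or replacement at the character level. It is not part of EditSinger; it only uses Python's standard `difflib`, and the function name `lyric_edits` is a hypothetical helper introduced for illustration. The lyric pairs are taken from Exp. 2, 5, and 6 in the audio samples on this page.

```python
import difflib

def lyric_edits(original, edited):
    """Return the non-matching spans between two lyric strings.

    Each item is (operation, original_span, edited_span), where the
    operation is 'insert', 'delete', or 'replace' as reported by difflib.
    """
    matcher = difflib.SequenceMatcher(None, original, edited, autojunk=False)
    return [
        (tag, original[i1:i2], edited[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Lyric pairs copied from the samples on this page.
print(lyric_edits("爱可以不问对错", "爱怎么可以不问对错"))  # [('insert', '', '怎么')]
print(lyric_edits("在昏暗中的我", "在昏暗的我"))            # [('delete', '中', '')]
print(lyric_edits("被吹进了左耳", "被传递到左耳"))          # [('replace', '吹进了', '传递到')]
```

A character-level diff is only one possible way to locate the edited span; an actual system would also need the aligned phonemes and audio frames for that span.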
Introduction:
1 Audio Samples
Notes:
Exp. 1:
original lyrics: 朋友爱得那么苦痛 ——
insertion: 朋友如果爱的那么苦痛 ——
replacement: 朋友爱的那么认真(苦痛) —— k u | t ong)
deletion: 朋友爱的(那么)苦痛 —— n a | m e #) k u | t ong
GT | GT(Mel+PWG) | EditSinger(insertion) | EditSinger(replacement) | EditSinger(deletion) |
---|---|---|---|---|
Exp. 2:
original lyrics: 爱可以不问对错 ——
insertion: 爱怎么可以不问对错 ——
replacement: 爱怎么(可以)不问对错 —— k e | y i #) b u | w en # d ui | c uo
deletion: 爱(可以)不问对错 —— k e | y i #) b u | w en # d ui | c uo
GT | GT(Mel+PWG) | EditSinger(insertion) | EditSinger(replacement) | EditSinger(deletion) |
---|---|---|---|---|
Exp. 3:
original lyrics: 你何苦非为他等在雨中 ——
insertion: 你何苦非为他傻傻等在雨中 ——
replacement: 你何苦非为他伫立风(等在雨)中 —— d eng # z ai # y u #) zh ong
deletion: 你(何苦非)为他等在雨中 —— h e | k u # f ei |) w ei # t a # d eng # z ai # y u # zh ong
GT | GT(Mel+PWG) | EditSinger(insertion) | EditSinger(replacement) | EditSinger(deletion) |
---|---|---|---|---|
Exp. 4:
original lyrics: 几朵云在阴天忘了该往哪儿走 ——
insertion: 几朵孤独的云在阴天忘了该往哪儿走 ——
replacement: 几片叶(朵云)在阴天忘了该往哪儿走 —— d uo # y un #) z ai # y in | t ian # w ang # l e # g ai # w ang # n a | r # z ou
deletion: 几朵云(在阴天)忘了该往哪儿走 —— z ai # y in | t ian #) w ang # l e # g ai # w ang # n a | r # z ou
GT | GT(Mel+PWG) | EditSinger(insertion) | EditSinger(replacement) | EditSinger(deletion) |
---|---|---|---|---|
Exp. 5:
original lyrics: 被吹进了左耳 ——
insertion: 被思念吹进了左耳 ——
replacement: 被传递到(吹进了)左耳 —— ch ui | j in # l e # ) z uo | er
deletion: 被吹进(了)左耳 —— l e #) z uo | er
GT | GT(Mel+PWG) | EditSinger(insertion) | EditSinger(replacement) | EditSinger(deletion) |
---|---|---|---|---|
Exp. 6:
original lyrics: 在昏暗中的我 ——
insertion: 在那时昏暗中的我 ——
replacement: 在昏暗中与你(的我) —— d e # w o)
deletion: 在昏暗(中)的我 —— zh ong # ) d e # w o
GT | GT(Mel+PWG) | EditSinger(insertion) | EditSinger(replacement) | EditSinger(deletion) |
---|---|---|---|---|
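The phoneme strings shown next to the edited lyrics above appear to follow a simple notation: spaces separate the phonemes of a syllable, `|` separates syllables within a word, and `#` separates words. This reading is inferred from the samples and is not stated on the page, so the sketch below should be taken as an assumption. It also assumes that the `)` inside a string only marks the end of the edited span and can be dropped. The function name `parse_phonemes` is hypothetical.

```python
from typing import List

def parse_phonemes(line: str) -> List[List[List[str]]]:
    """Parse a phoneme string into words -> syllables -> phonemes.

    Assumed notation (inferred from the samples, not confirmed):
    '#' separates words, '|' separates syllables, spaces separate phonemes.
    A ')' is treated as a span marker and ignored.
    """
    words = []
    for word in line.replace(")", " ").split("#"):
        # Split each word into syllables, dropping empty pieces.
        syllables = [syl.split() for syl in word.split("|")]
        syllables = [syl for syl in syllables if syl]
        if syllables:
            words.append(syllables)
    return words

# Phoneme string copied from the deletion row of Exp. 2.
for word in parse_phonemes("k e | y i #) b u | w en # d ui | c uo"):
    print(word)
```

Under this reading, the string from Exp. 2 yields three two-syllable words, matching 可以, 不问, and 对错.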
2 Method Analyses
2.1 Insertion
Exp. 1:
original lyrics: 朋友爱得那么苦痛 ——
insertion: 朋友如果爱的那么苦痛 ——
GT | GT(Mel+PWG) | EditSinger(insertion) | w/o CVA | w/o ML-GAN |
---|---|---|---|---|
Exp. 2:
original lyrics: 几朵云在阴天忘了该往哪儿走 ——
insertion: 几朵孤独的云在阴天忘了该往哪儿走 ——
GT | GT(Mel+PWG) | EditSinger(insertion) | w/o CVA | w/o ML-GAN |
---|---|---|---|---|
2.2 Deletion
Exp. 1:
original lyrics: 你何苦非为他等在雨中 ——
deletion: 你(何苦非)为他等在雨中 —— h e | k u # f ei | ) w ei # t a # d eng # z ai # y u # zh ong
GT | GT(Mel+PWG) | EditSinger(deletion) | w/o CVA | w/o ML-GAN |
---|---|---|---|---|
Exp. 2:
original lyrics: 几朵云在阴天忘了该往哪儿走 ——
deletion: 几朵云(在阴天)忘了该往哪儿走 —— z ai # y in | t ian #) w ang # l e # g ai # w ang # n a | r # z ou
GT | GT(Mel+PWG) | EditSinger(deletion) | w/o CVA | w/o ML-GAN |
---|---|---|---|---|
2.3 Replacement (Dang Test)
Note: This part demonstrates the performance of EditSinger(replacement), which even supports replacing entire sentences, something that is not available in previous work. “Dang” (当) here can stand for any character. We have also tested many other characters, with both partial and whole-sentence replacement, and these experiments are also very effective. Directly migrating the original prosody to the new word without considering the attributes of the word (Direct) degrades the listening quality, while ignoring the prosody at the corresponding position (w/o FPIP) reduces the similarity to the original song.
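As an illustration of how the replacement lyrics for the Dang test can be built, the following sketch replaces either a whole sentence or a character span with a single filler character. It only constructs the text input; it does not perform any audio editing. The function name `dang_replace` and its parameters are hypothetical.

```python
def dang_replace(lyrics, start=0, end=None, filler="当"):
    """Replace lyrics[start:end] with the filler character, one per character.

    With the default arguments the whole sentence is replaced, which
    corresponds to the whole-sentence Dang test.
    """
    end = len(lyrics) if end is None else end
    if not 0 <= start <= end <= len(lyrics):
        raise ValueError("span must lie within the lyrics")
    return lyrics[:start] + filler * (end - start) + lyrics[end:]

# Whole-sentence replacement, as in Exp. 1 and Exp. 2 below.
print(dang_replace("想挡挡你心口里的风"))  # 当当当当当当当当当
print(dang_replace("听阴天说什么"))        # 当当当当当当

# Partial replacement of a span (characters 1 and 2).
print(dang_replace("听阴天说什么", 1, 3))  # 听当当说什么
```

Keeping the character count unchanged means each replaced character lines up with one position in the original lyrics.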
Exp. 1:
original lyrics: 想挡挡你心口里的风 ——
replacement: 当当当当当当当当当 ——
GT | GT(Mel+PWG) | EditSinger(replacement) | Direct | w/o FPIP | w/o VQVAE | w/o ML-GAN |
---|---|---|---|---|---|---|
Exp. 2:
original lyrics: 听阴天说什么 ——
replacement: 当当当当当当 ——
GT | GT(Mel+PWG) | EditSinger(replacement) | Direct | w/o FPIP | w/o VQVAE | w/o ML-GAN |
---|---|---|---|---|---|---|
3 More Samples
Editing at Different Positions (Beginning/Middle/End of the Sentence)
original lyrics: 朋友爱得那么苦痛 ——
Type | Beginning | Middle | End |
---|---|---|---|
insertion | 如果朋友爱的那么苦痛 | 朋友如果爱的那么苦痛 | 朋友爱的那么苦痛如果 |
deletion | ( | 朋友爱的( | 朋友爱的那么苦( |
replacement | 认真( | 朋友爱的认真( | 朋友爱的那么认真( |