EditSinger: Zero-Shot Text-Based Singing Voice Editing System with Diverse Prosody Modeling

Abstract

Zero-shot text-based singing voice editing lets users edit singing content simply by performing text operations on the lyrics, without requiring any additional data from the target singer. However, because the two tasks have different demands, challenges arise when existing speech editing methods are applied to the singing voice editing task. The main ones are the lack of systematic prosody modeling for insertion and deletion, and the trade-off between natural pronunciation and prosody preservation in replacement. In this paper, we propose EditSinger, a novel singing voice editing model with specially designed diverse prosody modules that address these challenges. Specifically, 1) a general masked variance adaptor is introduced for comprehensive prosody modeling of the inserted lyrics and of the transition at the deletion boundary; and 2) we further design a fusion pitch predictor for replacement. By disentangling the reference pitch and fusing the predicted pronunciation, the edited pitch can be reconstructed, which ensures natural pronunciation while preserving the prosody of the original audio. To the best of our knowledge, EditSinger is the first zero-shot text-based singing voice editing system. Experiments on the OpenSinger dataset show that EditSinger can synthesize high-quality edited singing voices with natural prosody for each editing operation.

Introduction:

  • In the first two sections (Audio Samples and Method Analyses), we present samples that demonstrate general performance and the comparison experiments.
  • In the third section (More Samples), we provide more samples of different aspects (e.g., comparisons of different editing positions).
  • Our research is based on the open-source dataset OpenSinger, and all experiments conducted in the paper have been authorized by the publisher. This project is currently used for research only and aims to contribute some ideas to the community. Please do not use it for commercial purposes.
    1 Audio Samples

    Notes:

  • GT denotes the original audio (the input audio to be edited).
  • The red part represents the editing region.
  • Words —— Phonemes
    Exp. 1:

    original lyrics: 朋友爱得那么苦痛 —— p eng | y ou # ai | d e # n a | m e # k u | t ong
    insertion: 朋友如果爱的那么苦痛 —— p eng | y ou # r u | g uo # ai # d e # n a | m e # k u | t ong
    replacement: 朋友爱的那么认真(苦痛) —— p eng | y ou # ai # d e # n a | m e # r en | zh en ( k u | t ong)
    deletion: 朋友爱的(那么)苦痛 —— p eng | y ou # ai # d e # ( n a | m e #) k u | t ong

    GT GT(Mel+PWG) EditSinger(insertion) EditSinger(replacement) EditSinger(deletion)
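The phoneme strings in these samples can be read programmatically. The sketch below is only an illustration and rests on an assumption inferred from the samples, not stated on this page: that `#` marks a word boundary and `|` marks a syllable boundary inside a word. The function name `parse_phonemes` is hypothetical, and the sketch handles only plain strings such as the "original lyrics" rows, not the parenthesised regions.

```python
from typing import List


def parse_phonemes(annotation: str) -> List[List[List[str]]]:
    """Split a phoneme string into words -> syllables -> phonemes.

    Assumption (inferred from the samples, not confirmed by the authors):
    '#' separates words and '|' separates syllables within a word.
    """
    words = []
    for word in annotation.split("#"):
        # Each syllable is a whitespace-separated run of phonemes.
        syllables = [syl.split() for syl in word.split("|")]
        # Discard empty pieces caused by stray separators.
        syllables = [syl for syl in syllables if syl]
        if syllables:
            words.append(syllables)
    return words


# Phoneme string of the original lyrics in Exp. 1.
words = parse_phonemes("p eng | y ou # ai | d e # n a | m e # k u | t ong")
syllable_count = sum(len(word) for word in words)
print(len(words), "words,", syllable_count, "syllables")
```

Under this reading, the syllable count matches the number of characters in the corresponding lyric line, which is a quick consistency check for an annotation.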

    Exp. 2:

    original lyrics: 爱可以不问对错 —— ai # k e | y i # b u | w en # d ui | c uo
    insertion: 爱怎么可以不问对错 —— ai # z en | m e # k e | y i # b u | w en # d ui | c uo
    replacement: 爱怎么(可以)不问对错 —— ai # z en | m e # (k e | y i #) b u | w en # d ui | c uo
    deletion: 爱(可以)不问对错 —— ai # ( k e | y i #) b u | w en # d ui | c uo

    GT GT(Mel+PWG) EditSinger(insertion) EditSinger(replacement) EditSinger(deletion)

    Exp. 3:

    original lyrics: 你何苦非为他等在雨中 —— n i # h e | k u # f ei | w ei # t a # d eng # z ai # y u # zh ong
    insertion: 你何苦非为他傻傻等在雨中 —— n i # h e | k u # f ei | w ei # t a # sh a | sh a # d eng # z ai # y u # zh ong
    replacement: 你何苦非为他伫立风(等在雨)中 —— n i # h e | k u # f ei | w ei # t a # zh u | l i # f eng | ( d eng # z ai # y u #) zh ong
    deletion: 你(何苦非)为他等在雨中 —— n i # ( h e | k u # f ei |) w ei # t a # d eng # z ai # y u # zh ong

    GT GT(Mel+PWG) EditSinger(insertion) EditSinger(replacement) EditSinger(deletion)

    Exp. 4:

    original lyrics: 几朵云在阴天忘了该往哪儿走 —— j i | d uo # y un # z ai # y in | t ian # w ang # l e # g ai # w ang # n a | r # z ou
    insertion: 几朵孤独的云在阴天忘了该往哪儿走 —— j i | d uo # g u | d u # d e # y un # z ai # y in | t ian # w ang # l e # g ai # w ang # n a | r # z ou
    replacement: 几片叶(朵云)在阴天忘了该往哪儿走 —— j i | p ian # y e | (d uo # y un #) z ai # y in | t ian # w ang # l e # g ai # w ang # n a | r # z ou
    deletion: 几朵云(在阴天)忘了该往哪儿走 —— j i | d uo # y un | (z ai # y in | t ian #) w ang # l e # g ai # w ang # n a | r # z ou

    GT GT(Mel+PWG) EditSinger(insertion) EditSinger(replacement) EditSinger(deletion)

    Exp. 5:

    original lyrics: 被吹进了左耳 —— b ei # ch ui | j in # l e # z uo | er
    insertion: 被思念吹进了左耳 —— b ei # s i | n ian # ch ui | j in # l e # z uo | er
    replacement: 被传递到(吹进了)左耳 —— b ei # ch uan | d i # d ao # (ch ui | j in # l e # ) z uo | er
    deletion: 被吹进(了)左耳 —— b ei # ch ui | j in # (l e #) z uo | er

    GT GT(Mel+PWG) EditSinger(insertion) EditSinger(replacement) EditSinger(deletion)

    Exp. 6:

    original lyrics: 在昏暗中的我 —— z ai # h un | an # zh ong # d e # w o
    insertion: 在那时昏暗中的我 —— z ai # n a | sh i # h un | an # zh ong # d e # w o
    replacement: 在昏暗中与你(的我) —— z ai # h un | an # zh ong # y u # n i (d e # w o)
    deletion: 在昏暗(中)的我 —— z ai # h un | an # ( zh ong # ) d e # w o

    GT GT(Mel+PWG) EditSinger(insertion) EditSinger(replacement) EditSinger(deletion)

    2 Method Analyses

    2.1 Insertion

    Exp. 1:

    original lyrics: 朋友爱得那么苦痛 —— p eng | y ou # ai | d e # n a | m e # k u | t ong
    insertion: 朋友如果爱的那么苦痛 —— p eng | y ou # r u | g uo # ai # d e # n a | m e # k u | t ong

    GT GT(Mel+PWG) EditSinger(insertion) w/o CVA w/o ML-GAN

    Exp. 2:

    original lyrics: 几朵云在阴天忘了该往哪儿走 —— j i | d uo # y un # z ai # y in | t ian # w ang # l e # g ai # w ang # n a | r # z ou
    insertion: 几朵孤独的云在阴天忘了该往哪儿走 —— j i | d uo # g u | d u # d e # y un # z ai # y in | t ian # w ang # l e # g ai # w ang # n a | r # z ou

    GT GT(Mel+PWG) EditSinger(insertion) w/o CVA w/o ML-GAN

    2.2 Deletion

    Exp. 1:

    original lyrics: 你何苦非为他等在雨中 —— n i # h e | k u # f ei | w ei # t a # d eng # z ai # y u # zh ong
    deletion: 你(何苦非)为他等在雨中 —— n i # (h e | k u # f ei | ) w ei # t a # d eng # z ai # y u # zh ong

    GT GT(Mel+PWG) EditSinger(deletion) w/o CVA w/o ML-GAN

    Exp. 2:

    original lyrics: 几朵云在阴天忘了该往哪儿走 —— j i | d uo # y un # z ai # y in | t ian # w ang # l e # g ai # w ang # n a | r # z ou
    deletion: 几朵云(在阴天)忘了该往哪儿走 —— j i | d uo # y un | (z ai # y in | t ian #) w ang # l e # g ai # w ang # n a | r # z ou

    GT GT(Mel+PWG) EditSinger(deletion) w/o CVA w/o ML-GAN

    2.3 Replacement (Dang Test)

    Note: This part demonstrates the performance of EditSinger(replacement), which even supports replacing entire sentences, something previous work does not support. “Dang” here stands in for any character. We have also tested many other characters, in both partial and whole-sentence replacement experiments, and the results are likewise effective. Directly migrating the original prosody to the new word (Direct) without considering the attributes of that word degrades how the result sounds, while ignoring the prosody at the corresponding position (w/o FPIP) reduces the similarity to the original song.

    Exp. 1:

    original lyrics: 想挡挡你心口里的风 —— x iang # d ang # d ang # n i # x in # k ou | l i # d e # f eng
    replacement: 当当当当当当当当当 —— d ang | d ang # d ang | d ang # d ang | d ang # d ang | d ang # d ang

    GT GT(Mel+PWG) EditSinger(replacement) Direct w/o FPIP w/o VQVAE w/o ML-GAN

    Exp. 2:

    original lyrics: 听阴天说什么 —— t ing # y in | t ian # sh uo # sh en | m e
    replacement: 当当当当当当 —— d ang | d ang # d ang | d ang # d ang | d ang

    GT GT(Mel+PWG) EditSinger(replacement) Direct w/o FPIP w/o VQVAE w/o ML-GAN

    3 More Samples

    Editing at Different Positions (Beginning/Middle/End of the Sentence)

    original lyrics: 朋友爱得那么苦痛 —— p eng | y ou # ai | d e # n a | m e # k u | t ong

    Type: Beginning / Middle / End
    insertion: 如果朋友爱的那么苦痛 / 朋友如果爱的那么苦痛 / 朋友爱的那么苦痛如果
    deletion: (朋友)爱的那么苦痛 / 朋友爱的(那么)苦痛 / 朋友爱的那么苦(痛)
    replacement: 认真(朋友)爱的那么苦痛 / 朋友爱的认真(那么)苦痛 / 朋友爱的那么认真(苦痛)
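The operation type and position of an edit can be recovered from the text alone by comparing the original and edited lyrics. The sketch below is an illustration, not EditSinger's own procedure: it runs a character-level diff with Python's standard `difflib`, and the function name `describe_edit` and the position labels are hypothetical. The base lyric is written as it appears in the edited rows of the table.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def describe_edit(original: str, edited: str) -> List[Tuple[str, str, str, str]]:
    """Return (operation, position, removed_text, added_text) for each edit.

    Illustrative only: a character-level diff, not the model's own logic.
    """
    matcher = SequenceMatcher(None, original, edited, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Locate the edited span relative to the original sentence.
        if i1 == 0 and i2 == len(original):
            position = "whole"
        elif i1 == 0:
            position = "beginning"
        elif i2 == len(original):
            position = "end"
        else:
            position = "middle"
        edits.append((tag, position, original[i1:i2], edited[j1:j2]))
    return edits


base = "朋友爱的那么苦痛"
print(describe_edit(base, "朋友如果爱的那么苦痛"))  # insertion in the middle
print(describe_edit(base, "朋友爱的苦痛"))          # deletion in the middle
print(describe_edit(base, "朋友爱的那么认真"))      # replacement at the end
```

A character-level diff is enough for these short lines. For lyrics with many repeated characters, such as the Dang test, the alignment can be ambiguous, so an explicit edit region would be a safer input.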