Publications

Modern voice conversion and anonymization architectures generally share a design that preserves the source linguistic content and expressivity while modifying the speaker's timbre characteristics. This approach yields a converted signal that is almost perfectly synchronized with the source signal. In this paper, we hypothesize that this paradigm can help quantify the amount of speaker identity preserved in the converted voice, referred to here as prosody (including speech melody and rhythm). Based on this observation, we propose a method to split and disentangle the speaker representation into complementary embeddings that convey prosodic and timbre information, respectively. Additionally, we propose a method to evaluate prosody preservation in standard voice privacy architectures, and we validate the ability of the prosodic and timbre embeddings to detect related voice attributes.