Explicit Voice Attributes ANR

Describing a voice in a few words remains a largely arbitrary task. We can speak of a "deep", "breathy", or "hoarse" voice, but fully characterizing a voice would require a closed set of rigorously defined attributes constituting an ontology. No such description grid exists, however.

Machine learning applied to speech suffers from the same weakness: in most automatic processing tasks, a speaker is modeled through abstract global representations whose characteristics are never made explicit. For instance, automatic speaker verification/identification is usually tackled with the x-vector paradigm, which describes a speaker's voice by an embedding vector designed only to distinguish speakers. Despite their very good accuracy for speaker identification, x-vectors are usually unsuitable for detecting similarities between different voices that share common characteristics.
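This limitation can be sketched with a toy example. Assuming x-vectors are available as fixed-length numpy arrays (the 512-dimensional vectors below are random stand-ins, not real embeddings), a similarity score between them tells us whether two recordings come from the same speaker, but says nothing about which attributes two different voices share:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Two hypothetical x-vectors of the same speaker score close to 1.0...
speaker = rng.normal(size=512)
same_speaker = speaker + 0.05 * rng.normal(size=512)
print(cosine_similarity(speaker, same_speaker))

# ...but a low score for a different speaker reveals nothing about
# which attributes (pitch, breathiness, accent, ...) the two voices share.
other_speaker = rng.normal(size=512)
print(cosine_similarity(speaker, other_speaker))
```

The embedding dimensions carry no individual meaning, so no subset of them can be read off as "breathiness" or "accent"; this is precisely the gap between discriminative embeddings and an explicit attribute ontology.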

The same observations can be made for speech generation. For example, speech synthesis is usually controlled by injecting speaker style or identity via unstructured representations (Global Style Tokens, Variational Auto-Encoders, etc.) extracted from a reference audio recording. These representations circumvent the difficult task of defining and learning ontologies, but they can only "mimic" a subset of the characteristics of a reference voice (gender, fundamental frequency, rhythm, intensity) without making its attributes explicit. They also remain limited by their inability to generate new, original voices.

The objective of this project is to crack the codes of human voices by learning explicit and structured representations of voice attributes. Achieving this main objective will have a strong scientific and technological impact in at least two fields of application: first, in speech analysis, it will unlock the understanding of the complex entanglement of the characteristics of a human voice; second, in voice generation, it will open the way to a wide range of applications for creating a voice with desired attributes, allowing the design of so-called voice personas.

The set of attributes will be defined by human expertise or discovered gradually from the data using weakly supervised or unsupervised neural networks. It will cover a detailed and explicit description of timbre, voice quality, phonation, "speaker biases" such as specific pronunciations (e.g. "lé"/"lait" in French) or speech impairments (e.g. lisping), regional or non-native accents, and paralinguistics such as emotions or style. Ideally, each relevant attribute could be controlled in synthesis and conversion by an intensity degree, allowing it to be amplified or erased from the voice, within a structured embedding. These new attributes could emerge from expert definition or from neural network algorithms such as voice disentanglement or self-supervised representations, which would automatically discover salient attributes in multi-speaker datasets.
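The intended control scheme can be illustrated with a minimal sketch, assuming a hypothetical structured embedding in which each named attribute owns a dedicated slice of the vector (the attribute names and slice layout below are illustrative, not the project's actual design):

```python
import numpy as np

# Hypothetical layout: each attribute owns a contiguous slice of the embedding.
ATTRIBUTE_SLICES = {
    "breathiness": slice(0, 8),
    "hoarseness": slice(8, 16),
    "regional_accent": slice(16, 24),
}

def set_intensity(embedding: np.ndarray, attribute: str, degree: float) -> np.ndarray:
    """Amplify (degree > 1), attenuate (0 < degree < 1), or erase (degree = 0)
    one attribute, leaving the rest of the embedding untouched."""
    out = embedding.copy()
    out[ATTRIBUTE_SLICES[attribute]] *= degree
    return out

voice = np.random.default_rng(1).normal(size=24)
no_breath = set_intensity(voice, "breathiness", 0.0)        # erase breathiness
strong_accent = set_intensity(voice, "regional_accent", 2.0)  # amplify the accent
```

The point of the sketch is the interface, not the mechanics: once attributes are disentangled into a structured embedding, a single scalar degree per attribute suffices to amplify or erase it independently of the others.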

The main expected industrial outcomes concern different use cases of voice transformation. The first is "voice anonymization": to enable GDPR-compliant voice recordings, voice conversion systems could be configured to remove attributes strongly associated with a speaker's identity, while leaving the other attributes unchanged to preserve the intelligibility, naturalness, and expressivity of the manipulated voice. The second is "voice creation": new voices could be sculpted from a set of desired attributes to serve the creative industry.
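The anonymization use case can be sketched under the same assumption of an explicit attribute representation. In this hypothetical illustration, attributes flagged as identity-bearing are replaced by a population average while the remaining attributes pass through untouched (the attribute names and the set of identity-bearing attributes are illustrative):

```python
import numpy as np

# Hypothetical set of attributes considered strongly identity-bearing.
IDENTITY_ATTRIBUTES = {"timbre", "specific_pronunciation"}

def anonymize(attributes: dict, population_mean: dict) -> dict:
    """Replace identity-bearing attributes with population averages;
    keep all other attributes (emotion, style, ...) unchanged."""
    return {
        name: population_mean[name] if name in IDENTITY_ATTRIBUTES else value
        for name, value in attributes.items()
    }

speaker_attrs = {
    "timbre": np.ones(4),
    "emotion": np.full(4, 0.5),
    "specific_pronunciation": np.ones(4),
}
pop_mean = {name: np.zeros(4) for name in speaker_attrs}
anon = anonymize(speaker_attrs, pop_mean)
```

Compared with anonymizing an opaque x-vector, the explicit representation makes the privacy/utility trade-off inspectable: one can state exactly which attributes were removed and which were preserved.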

Contract no.: ANR-23-CE23-0018