Modelling individual and cross-cultural variation in the mapping of emotions to speech prosody
The existence of a mapping between emotions and speech prosody is commonly assumed.
We propose a Bayesian modelling framework to analyse this mapping. Our models are fitted …
Giving robots a voice: Human-in-the-loop voice creation and open-ended labeling
Speech is a natural interface for humans to interact with robots. Yet, aligning a robot's voice
to its appearance is challenging due to the rich vocabulary of both modalities. Previous …
Words are all you need? Capturing human sensory similarity with textual descriptors
Recent advances in multimodal training use textual descriptions to significantly enhance
machine understanding of images and videos. Yet, it remains unclear to what extent …
VoiceMe: Personalized voice generation in TTS
Novel text-to-speech systems can generate entirely new voices that were not seen during
training. However, it remains a difficult task to efficiently create personalized voices from a …
Local Style Tokens: Fine-Grained Prosodic Representations For TTS Expressive Control
Neural Text-To-Speech (TTS) models achieve great performances regarding naturalness,
but modeling expressivity remains an ongoing challenge. Some success was found through …
Speaking Rate Control of end-to-end TTS Models by Direct Manipulation of the Encoder's Output Embeddings
Since neural Text-To-Speech models have achieved such high standards in terms of
naturalness, the main focus of the field has gradually shifted to gaining more control over the …
Analysis by synthesis: Using an expressive TTS model as feature extractor for paralinguistic speech classification
Modeling adequate features of speech prosody is one key factor to good performance in
affective speech classification. However, the distinction between the prosody that is induced …
Using Gibbs Sampling with People to characterize perceptual and aesthetic evaluations in multidimensional visual stimulus space
Aesthetic appreciation is inherently multidimensional: many different stimulus dimensions
(e.g., colors, shapes, sizes) contribute to our aesthetic experience. However, most studies in …
Characterizing Similarities and Divergences in Conversational Tones in Humans and LLMs by Sampling with People
Conversational tones--the manners and attitudes in which speakers communicate--are
essential to effective communication. Amidst the increasing popularization of Large …
VoiceX: A Text-To-Speech Framework for Custom Voices
Modern TTS systems are capable of creating highly realistic and natural-sounding speech.
Despite these developments, the process of customizing TTS voices remains a complex …