Multimodal intelligence: Representation learning, information fusion, and applications

C Zhang, Z Yang, X He, L Deng - IEEE Journal of Selected …, 2020 - ieeexplore.ieee.org
Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …

Balanced multimodal learning via on-the-fly gradient modulation

X Peng, Y Wei, A Deng, D Wang… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Audio-visual learning helps to comprehensively understand the world by integrating
different senses. Accordingly, multiple input modalities are expected to boost model …
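
To make the gradient-modulation idea concrete, here is a minimal PyTorch sketch: after the joint loss is backpropagated, each unimodal encoder's gradients are rescaled by a coefficient reflecting how dominant that modality currently is, so the stronger modality is slowed down and the weaker one can catch up. The coefficient rule, the `alpha` parameter, and the scalar confidence scores below are illustrative assumptions, not the paper's exact OGM-GE formulation.

```python
import torch

def modulation_coeffs(score_a, score_v, alpha=1.0):
    """Hypothetical modulation rule: suppress the gradient of whichever
    modality currently dominates. score_a / score_v are scalar tensors,
    e.g. each branch's mean softmax probability for the true class."""
    ratio = score_a / (score_v + 1e-8)
    k_a = 1.0 - torch.tanh(alpha * torch.relu(ratio - 1.0))
    k_v = 1.0 - torch.tanh(alpha * torch.relu(1.0 / (ratio + 1e-8) - 1.0))
    return k_a, k_v

def modulate_gradients(audio_encoder, visual_encoder, k_a, k_v):
    """Call between loss.backward() and optimizer.step(): rescale each
    unimodal encoder's gradients in place by its coefficient."""
    for p in audio_encoder.parameters():
        if p.grad is not None:
            p.grad.mul_(k_a)
    for p in visual_encoder.parameters():
        if p.grad is not None:
            p.grad.mul_(k_v)
```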

CTNet: Conversational transformer network for emotion recognition

Z Lian, B Liu, J Tao - IEEE/ACM Transactions on Audio, Speech …, 2021 - ieeexplore.ieee.org
Emotion recognition in conversation is a crucial topic owing to its widespread applications in
the field of human-computer interaction. Unlike vanilla emotion recognition of individual …
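
As a rough illustration of transformer-based context modeling over a conversation, the toy module below runs a transformer encoder across per-utterance embeddings so that each turn's emotion prediction can attend to the surrounding turns. It is a deliberate simplification: the single-modality input, dimensions, and layer counts are assumptions, not CTNet's actual multimodal architecture.

```python
import torch
import torch.nn as nn

class ConversationEmotionModel(nn.Module):
    """Toy context model: a transformer encoder over per-utterance
    embeddings, one emotion prediction per turn (simplified sketch)."""
    def __init__(self, d_utt=256, n_heads=4, n_layers=2, n_emotions=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_utt, nhead=n_heads,
                                           batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_utt, n_emotions)

    def forward(self, utt_embs):
        # utt_embs: (batch, turns, d_utt) precomputed utterance embeddings
        ctx = self.context(utt_embs)   # each turn attends to the others
        return self.head(ctx)          # (batch, turns, n_emotions)
```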

Speech emotion recognition with multi-task learning.

X Cai, J Yuan, R Zheng, L Huang, K Church - Interspeech, 2021 - academia.edu
Speech emotion recognition (SER) classifies speech into emotion categories such as
Happy, Angry, Sad, and Neutral. Recently, deep learning has been applied to the SER task …
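
A minimal sketch of the multi-task setup, assuming a shared acoustic encoder with an emotion head and one auxiliary head trained jointly; the GRU encoder, the auxiliary task, and the loss weight `lam` are placeholders for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiTaskSER(nn.Module):
    """Shared encoder with an emotion head plus an auxiliary head
    (illustrative architecture, not the paper's exact setup)."""
    def __init__(self, feat_dim=80, hidden=256, n_emotions=4, n_aux=2):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)
        self.aux_head = nn.Linear(2 * hidden, n_aux)  # hypothetical auxiliary task

    def forward(self, x):                 # x: (batch, frames, feat_dim)
        h, _ = self.encoder(x)            # (batch, frames, 2 * hidden)
        pooled = h.mean(dim=1)            # utterance-level average pooling
        return self.emotion_head(pooled), self.aux_head(pooled)

def multitask_loss(emo_logits, aux_logits, emo_y, aux_y, lam=0.3):
    """Weighted sum of the main and auxiliary cross-entropy losses."""
    ce = nn.functional.cross_entropy
    return ce(emo_logits, emo_y) + lam * ce(aux_logits, aux_y)
```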

Overview of speaker modeling and its applications: From the lens of deep speaker representation learning

S Wang, Z Chen, KA Lee, Y Qian… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
Speaker individuality information is among the most critical elements within speech signals.
Once this information is thoroughly and accurately modeled, it can be utilized in various …

The VoicePrivacy 2024 Challenge Evaluation Plan

N Tomashenko, X Miao, P Champion, S Meyer… - arXiv preprint arXiv …, 2024 - arxiv.org
The task of the challenge is to develop a voice anonymization system for speech data that
conceals the speaker's voice identity while protecting linguistic content and emotional states …
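
One widely used baseline family for this task swaps the utterance's speaker embedding for a pseudo-speaker embedding before resynthesis. The snippet below sketches a common selection strategy: average a random subset of the pool embeddings least similar to the original speaker. The pool, the cosine metric, and the `n_far`/`n_avg` values are assumptions for illustration, not the challenge's prescribed baseline.

```python
import torch
import torch.nn.functional as F

def pseudo_speaker_embedding(orig_emb, pool, n_far=100, n_avg=10):
    """Toy pseudo-speaker selection: take the pool embeddings farthest
    (by cosine similarity) from the original speaker, keep a random
    subset, and average them. orig_emb: (D,), pool: (N, D)."""
    sims = F.cosine_similarity(pool, orig_emb.unsqueeze(0), dim=1)  # (N,)
    far_idx = sims.argsort()[:n_far]               # least similar candidates
    keep = far_idx[torch.randperm(n_far)[:n_avg]]  # random subset of those
    return pool[keep].mean(dim=0)                  # averaged pseudo embedding
```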

Using State of the Art Speaker Recognition and Natural Language Processing Technologies to Detect Alzheimer's Disease and Assess its Severity.

R Pappagari, J Cho, L Moro-Velazquez, N Dehak - Interspeech, 2020 - researchgate.net
In this study, we analyze the use of state-of-the-art technologies for speaker recognition and
natural language processing to detect Alzheimer's Disease (AD) and to assess its severity …

Emotion recognition by fusing time synchronous and time asynchronous representations

W Wu, C Zhang, PC Woodland - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org
In this paper, a novel two-branch neural network model structure is proposed for multimodal
emotion recognition, which consists of a time synchronous branch (TSB) and a time …
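
A simplified sketch of the two-branch idea, assuming a time-synchronous branch over frame-aligned audio and text features and a time-asynchronous branch in which an utterance-level text embedding cross-attends to the audio frames; all dimensions and layer choices are assumptions, not the paper's exact TSB/TAB design.

```python
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    """Simplified sketch of two-branch fusion for emotion recognition."""
    def __init__(self, d_audio=80, d_text=768, d_model=256, n_emotions=4):
        super().__init__()
        self.sync_proj = nn.Linear(d_audio + d_text, d_model)
        self.sync_rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.q_proj = nn.Linear(d_text, d_model)
        self.kv_proj = nn.Linear(d_audio, d_model)
        self.async_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_emotions)

    def forward(self, audio, text_frames, text_utt):
        # audio: (B, T, d_audio); text_frames: (B, T, d_text) word-aligned;
        # text_utt: (B, 1, d_text) utterance-level text embedding
        sync = self.sync_proj(torch.cat([audio, text_frames], dim=-1))
        sync, _ = self.sync_rnn(sync)
        sync = sync.mean(dim=1)                    # (B, d_model)
        q = self.q_proj(text_utt)                  # (B, 1, d_model)
        kv = self.kv_proj(audio)                   # (B, T, d_model)
        asyn, _ = self.async_attn(q, kv, kv)       # text attends to audio
        fused = torch.cat([sync, asyn.squeeze(1)], dim=-1)
        return self.classifier(fused)
```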

Automatic Detection and Assessment of Alzheimer Disease Using Speech and Language Technologies in Low-Resource Scenarios.

R Pappagari, J Cho, S Joshi, L Moro-Velázquez… - Interspeech, 2021 - researchgate.net
In this study, we analyze the use of speech and speaker recognition technologies and
natural language processing to detect Alzheimer disease (AD) and estimate mini-mental …

Multimodal emotion recognition using transfer learning from speaker recognition and BERT-based models

S Padi, SO Sadjadi, D Manocha, RD Sriram - arXiv preprint arXiv …, 2022 - arxiv.org
Automatic emotion recognition plays a key role in computer-human interaction, as it has the
potential to enrich next-generation artificial intelligence with emotional intelligence. It …
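
A minimal late-fusion sketch of this transfer-learning recipe: utterance embeddings are precomputed with a pretrained speaker-recognition model and a pretrained BERT encoder, concatenated, and fed to a small classifier trained for emotion; the embedding dimensions and classifier shape are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Late fusion of precomputed speaker and BERT utterance embeddings
    (dimensions are assumptions, not the paper's exact models)."""
    def __init__(self, d_spk=512, d_bert=768, hidden=256, n_emotions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_spk + d_bert, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, n_emotions),
        )

    def forward(self, spk_emb, bert_emb):
        # spk_emb: (B, d_spk) speaker-model embedding;
        # bert_emb: (B, d_bert) BERT sentence embedding
        return self.net(torch.cat([spk_emb, bert_emb], dim=-1))
```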