Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Decoupled multimodal distilling for emotion recognition

Y Li, Y Wang, Z Cui - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Human multimodal emotion recognition (MER) aims to perceive human emotions via
language, visual and acoustic modalities. Despite the impressive performance of previous …

Disentangled representation learning for multimodal emotion recognition

D Yang, S Huang, H Kuang, Y Du… - Proceedings of the 30th …, 2022 - dl.acm.org
Multimodal emotion recognition aims to identify human emotions from text, audio, and visual
modalities. Previous methods either explore correlations between different modalities or …

A survey of deep learning-based multimodal emotion recognition: Speech, text, and face

H Lian, C Lu, S Li, Y Zhao, C Tang, Y Zong - Entropy, 2023 - mdpi.com
Multimodal emotion recognition (MER) refers to the identification and understanding of
human emotional states by combining different signals, including—but not limited to—text …

Incomplete multimodality-diffused emotion recognition

Y Wang, Y Li, Z Cui - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Human multimodal emotion recognition (MER) aims to perceive and understand human
emotions via various heterogeneous modalities, such as language, vision, and acoustic …

MART: Masked affective representation learning via masked temporal distribution distillation

Z Zhang, P Zhao, E Park… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Limited training data is a long-standing problem for video emotion analysis (VEA). Existing
works leverage the power of large-scale image datasets for transferring while failing to …

Target and source modality co-reinforcement for emotion understanding from asynchronous multimodal sequences

D Yang, Y Liu, C Huang, M Li, X Zhao, Y Wang… - Knowledge-Based …, 2023 - Elsevier
Perceiving human emotions from a multimodal perspective has received significant attention
in knowledge engineering communities. Due to the variable receiving frequency for …

Learning modality-specific and -agnostic representations for asynchronous multimodal language sequences

D Yang, H Kuang, S Huang, L Zhang - Proceedings of the 30th ACM …, 2022 - dl.acm.org
Understanding human behaviors and intents from videos is a challenging task. Video flows
usually involve time-series data from different modalities, such as natural language, facial …

Efficient multimodal transformer with dual-level feature restoration for robust multimodal sentiment analysis

L Sun, Z Lian, B Liu, J Tao - IEEE Transactions on Affective …, 2023 - ieeexplore.ieee.org
With the proliferation of user-generated online videos, Multimodal Sentiment Analysis (MSA)
has attracted increasing attention recently. Despite significant progress, there are still two …