Learning in audio-visual context: A review, analysis, and new perspective
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …
Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
Decoupled multimodal distilling for emotion recognition
Human multimodal emotion recognition (MER) aims to perceive human emotions via
language, visual and acoustic modalities. Despite the impressive performance of previous …
Disentangled representation learning for multimodal emotion recognition
Multimodal emotion recognition aims to identify human emotions from text, audio, and visual
modalities. Previous methods either explore correlations between different modalities or …
A survey of deep learning-based multimodal emotion recognition: Speech, text, and face
Multimodal emotion recognition (MER) refers to the identification and understanding of
human emotional states by combining different signals, including—but not limited to—text …
Incomplete multimodality-diffused emotion recognition
Human multimodal emotion recognition (MER) aims to perceive and understand human
emotions via various heterogeneous modalities, such as language, vision, and acoustic …
MART: Masked affective representation learning via masked temporal distribution distillation
Limited training data is a long-standing problem for video emotion analysis (VEA). Existing
works leverage the power of large-scale image datasets for transferring while failing to …
Target and source modality co-reinforcement for emotion understanding from asynchronous multimodal sequences
Perceiving human emotions from a multimodal perspective has received significant attention
in knowledge engineering communities. Due to the variable receiving frequency for …
Learning modality-specific and -agnostic representations for asynchronous multimodal language sequences
Understanding human behaviors and intents from videos is a challenging task. Video flows
usually involve time-series data from different modalities, such as natural language, facial …
Efficient multimodal transformer with dual-level feature restoration for robust multimodal sentiment analysis
With the proliferation of user-generated online videos, Multimodal Sentiment Analysis (MSA)
has attracted increasing attention recently. Despite significant progress, there are still two …