Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Multimodal fusion on low-quality data: A comprehensive survey

Q Zhang, Y Wei, Z Han, H Fu, X Peng, C Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal fusion focuses on integrating information from multiple modalities with the goal of
more accurate prediction, and it has achieved remarkable progress in a wide range of …

Dynamic neural networks: A survey

Y Han, G Huang, S Song, L Yang… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Dynamic neural networks are an emerging research topic in deep learning. Compared to static
models, which have fixed computational graphs and parameters at the inference stage …

Multimodal dynamics: Dynamical fusion for trustworthy multimodal classification

Z Han, F Yang, J Huang, C Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
Integration of heterogeneous and high-dimensional data (e.g., multiomics) is becoming
increasingly important. Existing multimodal classification algorithms mainly focus on …

Modality competition: What makes joint training of multi-modal network fail in deep learning? (Provably)

Y Huang, J Lin, C Zhou, H Yang… - … Conference on Machine …, 2022 - proceedings.mlr.press
Despite the remarkable success of deep multi-modal learning in practice, it has not been
well explained in theory. Recently, it has been observed that the best uni-modal network …

On uni-modal feature learning in supervised multi-modal learning

C Du, J Teng, T Li, Y Liu, T Yuan… - International …, 2023 - proceedings.mlr.press
We abstract the features (i.e., learned representations) of multi-modal data into 1) uni-modal
features, which can be learned from uni-modal training, and 2) paired features, which can …

Dynamic multimodal fusion

Z Xue, R Marculescu - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Deep multimodal learning has achieved great progress in recent years. However, current
fusion approaches are static in nature, i.e., they process and fuse multimodal inputs with …

Skeleton graph-neural-network-based human action recognition: A survey

M Feng, J Meunier - Sensors, 2022 - mdpi.com
Human action recognition has been applied in many fields, such as video surveillance and
human-computer interaction, where it helps to improve performance. Numerous reviews of …

Efficient deep visual and inertial odometry with adaptive visual modality selection

M Yang, Y Chen, HS Kim - European Conference on Computer Vision, 2022 - Springer
In recent years, deep learning-based approaches for visual-inertial odometry (VIO) have
shown remarkable performance, outperforming traditional geometric methods. Yet, all …

Curriculum-listener: Consistency- and complementarity-aware audio-enhanced temporal sentence grounding

H Chen, X Wang, X Lan, H Chen, X Duan… - Proceedings of the 31st …, 2023 - dl.acm.org
Temporal Sentence Grounding aims to retrieve a video moment given a natural language
query. Most existing literature merely focuses on visual information in videos without …