HIG: Hierarchical interlacement graph approach to scene graph generation in video understanding

TT Nguyen, P Nguyen, K Luu - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Visual interactivity understanding within visual scenes presents a significant challenge in
computer vision. Existing methods focus on complex interactivities while leveraging a simple …

Style-aware two-stage learning framework for video captioning

Y Ma, Z Zhu, Y Qi, A Beheshti, Y Li, L Qing… - Knowledge-Based Systems, 2024 - Elsevier
Significant progress has been made in video captioning in recent years. However, most
existing methods directly learn from all given captions without distinguishing the styles of …

Towards unified multimodal editing with enhanced knowledge collaboration

K Pan, Z Fan, J Li, Q Yu, H Fei, S Tang, R Hong… - arXiv preprint arXiv …, 2024 - arxiv.org
The swift advancement in Multimodal LLMs (MLLMs) also presents significant challenges for
effective knowledge editing. Current methods, including intrinsic knowledge editing and …

Contextual Augmented Global Contrast for Multimodal Intent Recognition

K Sun, Z Xie, M Ye, H Zhang - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
Multimodal intent recognition (MIR) aims to perceive the human intent polarity via language,
visual, and acoustic modalities. The inherent intent ambiguity makes it challenging to …

Low-rank Prompt Interaction for Continual Vision-Language Retrieval

W Yan, Y Wang, W Lin, Z Guo, Z Zhao… - Proceedings of the 32nd …, 2024 - dl.acm.org
Research on continual learning in multi-modal tasks has been receiving increasing
attention. However, most existing work overlooks the explicit cross-modal and cross-task …

Semantic Alignment for Multimodal Large Language Models

T Wu, M Li, J Chen, W Ji, W Lin, J Gao… - Proceedings of the …, 2024 - dl.acm.org
Research on Multi-modal Large Language Models (MLLMs) towards the multi-image
cross-modal instruction has received increasing attention and made significant progress …

Calibrating Prompt from History for Continual Vision-Language Retrieval and Grounding

T Jin, W Yan, Y Wang, S Cai, Q Shuai… - Proceedings of the 32nd …, 2024 - dl.acm.org
In the field of machine learning, continual learning is a crucial concept that allows models to
adapt to non-stationary data distributions. However, most of the existing works focus on uni …

CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos

TT Nguyen, P Nguyen, X Li, J Cothren, A Yilmaz… - arXiv preprint arXiv …, 2024 - arxiv.org
Video scene graph generation (VidSGG) has emerged as a transformative approach to
capturing and interpreting the intricate relationships among objects and their temporal …

Subject-Oriented Video Captioning

Y Ma, C Teng, Y Qi, G Li, L Qing, Q Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Describing video content according to users' needs is a long-held goal. Although existing
video captioning methods have made significant progress, the generated captions may not …

: Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset

W Lin, Y Feng, WK Han, T Jin, Z Zhao, F Wu… - The Thirty-eight … - openreview.net
Understanding human emotions is fundamental to enhancing human-computer interaction,
especially for embodied agents that mimic human behavior. Traditional emotion analysis …