PanoSent: A panoptic sextuple extraction benchmark for multimodal conversational aspect-based sentiment analysis

M Luo, H Fei, B Li, S Wu, Q Liu, S Poria… - Proceedings of the …, 2024 - dl.acm.org
While existing Aspect-based Sentiment Analysis (ABSA) has received extensive effort and
advancement, there are still gaps in defining a more holistic research target seamlessly …

Video-of-Thought: Step-by-step video reasoning from perception to cognition

H Fei, S Wu, W Ji, H Zhang, M Zhang… - Forty-first International …, 2024 - openreview.net
Existing research on video understanding still struggles to achieve in-depth comprehension
and reasoning in complex videos, primarily due to the under-exploration of two key …

Action Scene Graphs for Long-Form Understanding of Egocentric Videos

I Rodin, A Furnari, K Min, S Tripathi… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Egocentric Action Scene Graphs (EASGs), a new representation for long-
form understanding of egocentric videos. EASGs extend standard manually-annotated …

SpeechEE: A Novel Benchmark for Speech Event Extraction

B Wang, M Zhang, H Fei, Y Zhao, B Li, S Wu… - Proceedings of the …, 2024 - dl.acm.org
Event extraction (EE) is a critical direction in the field of information extraction, laying an
important foundation for the construction of structured knowledge bases. EE from text has …

Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos

D Verma, D Roy, B Fernando - arXiv preprint arXiv:2407.20642, 2024 - arxiv.org
Situation recognition refers to the ability of an agent to identify and understand various
situations or contexts based on available information and sensory inputs. It involves the …

NUS-Emo at SemEval-2024 Task 3: Instruction-Tuning LLM for Multimodal Emotion-Cause Analysis in Conversations

M Luo, H Zhang, S Wu, B Li, H Han… - Proceedings of the 18th …, 2024 - researchgate.net
This paper describes the architecture of our system developed for Task 3 of SemEval-2024:
Multimodal Emotion-Cause Analysis in Conversations. Our project targets the challenges of …

Self-Adaptive Fine-grained Multi-modal Data Augmentation for Semi-supervised Multi-modal Coreference Resolution

L Zheng, B Chen, H Fei, F Li, S Wu, L Liao… - Proceedings of the …, 2024 - dl.acm.org
Coreference resolution, an essential task in natural language processing, is particularly
challenging in multi-modal scenarios where data comes in various forms and modalities …

XFashion: Character Animation Generation via Facial-enhanced and Granularly Controlling

Y Zhao, B Li, H Fei - Proceedings of the 5th International Workshop on …, 2024 - dl.acm.org
Recent research has achieved advancements in animated fashion video synthesis.
However, existing methods generate videos only with the guidance of poses, thus resulting …

Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition

JL Romero, K Min, S Tripathi, M Karimzadeh - arXiv preprint arXiv …, 2025 - arxiv.org
Egocentric videos capture scenes from a wearer's viewpoint, resulting in dynamic
backgrounds, frequent motion, and occlusions, posing challenges to accurate keystep …

Fine-grained Structural Hallucination Detection for Unified Visual Comprehension and Generation in Multimodal LLM

H Fei, M Luo, J Xu, S Wu, W Ji, ML Lee… - Proceedings of the 1st …, 2024 - dl.acm.org
Multimodal large language models (MLLMs) are evolving rapidly but suffer from significant
challenges, such as hallucinations, which compromise the reliability and utility of MLLMs …