PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis
While existing Aspect-based Sentiment Analysis (ABSA) has received extensive effort and
advancement, there are still gaps in defining a more holistic research target seamlessly …
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
Existing research on video understanding still struggles to achieve in-depth comprehension
and reasoning in complex videos, primarily due to the under-exploration of two key …
Action Scene Graphs for Long-Form Understanding of Egocentric Videos
We present Egocentric Action Scene Graphs (EASGs), a new representation for
long-form understanding of egocentric videos. EASGs extend standard manually-annotated …
SpeechEE: A Novel Benchmark for Speech Event Extraction
Event extraction (EE) is a critical direction in the field of information extraction, laying an
important foundation for the construction of structured knowledge bases. EE from text has …
Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos
Situation recognition refers to the ability of an agent to identify and understand various
situations or contexts based on available information and sensory inputs. It involves the …
NUS-Emo at SemEval-2024 Task 3: Instruction-Tuning LLM for Multimodal Emotion-Cause Analysis in Conversations
This paper describes the architecture of our system developed for Task 3 of SemEval-2024:
Multimodal Emotion-Cause Analysis in Conversations. Our project targets the challenges of …
Self-Adaptive Fine-grained Multi-modal Data Augmentation for Semi-supervised Multi-modal Coreference Resolution
Coreference resolution, an essential task in natural language processing, is particularly
challenging in multi-modal scenarios where data comes in various forms and modalities …
XFashion: Character Animation Generation via Facial-enhanced and Granularly Controlling
Recent research has achieved advancements in animated fashion video synthesis.
However, existing methods generate videos only with the guidance of poses, thus resulting …
Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition
Egocentric videos capture scenes from a wearer's viewpoint, resulting in dynamic
backgrounds, frequent motion, and occlusions, posing challenges to accurate keystep …
Fine-grained Structural Hallucination Detection for Unified Visual Comprehension and Generation in Multimodal LLM
Multimodal large language models (MLLMs) are evolving rapidly but suffer from significant
challenges, such as hallucinations, which compromise the reliability and utility of MLLMs …