PanoSent: A panoptic sextuple extraction benchmark for multimodal conversational aspect-based sentiment analysis

M Luo, H Fei, B Li, S Wu, Q Liu, S Poria… - Proceedings of the …, 2024 - dl.acm.org
While existing Aspect-based Sentiment Analysis (ABSA) has received extensive effort and
advancement, there are still gaps in defining a more holistic research target seamlessly …

Video-of-Thought: Step-by-step video reasoning from perception to cognition

H Fei, S Wu, W Ji, H Zhang, M Zhang… - Forty-first International …, 2024 - openreview.net
Existing research on video understanding still struggles to achieve in-depth comprehension
and reasoning in complex videos, primarily due to the under-exploration of two key …

Action Scene Graphs for Long-Form Understanding of Egocentric Videos

I Rodin, A Furnari, K Min, S Tripathi… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Egocentric Action Scene Graphs (EASGs), a new representation for long-
form understanding of egocentric videos. EASGs extend standard manually-annotated …

SpeechEE: A Novel Benchmark for Speech Event Extraction

B Wang, M Zhang, H Fei, Y Zhao, B Li, S Wu… - Proceedings of the …, 2024 - dl.acm.org
Event extraction (EE) is a critical direction in the field of information extraction, laying an
important foundation for the construction of structured knowledge bases. EE from text has …

Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos

D Verma, D Roy, B Fernando - arXiv preprint arXiv:2407.20642, 2024 - arxiv.org
Situation recognition refers to the ability of an agent to identify and understand various
situations or contexts based on available information and sensory inputs. It involves the …

NUS-Emo at SemEval-2024 Task 3: Instruction-Tuning LLM for Multimodal Emotion-Cause Analysis in Conversations

M Luo, H Zhang, S Wu, B Li, H Han… - Proceedings of the 18th …, 2024 - researchgate.net
This paper describes the architecture of our system developed for Task 3 of SemEval-2024:
Multimodal Emotion-Cause Analysis in Conversations. Our project targets the challenges of …

Self-Adaptive Fine-grained Multi-modal Data Augmentation for Semi-supervised Multi-modal Coreference Resolution

L Zheng, B Chen, H Fei, F Li, S Wu, L Liao… - Proceedings of the …, 2024 - dl.acm.org
Coreference resolution, an essential task in natural language processing, is particularly
challenging in multi-modal scenarios where data comes in various forms and modalities …

XFashion: Character Animation Generation via Facial-enhanced and Granularly Controlling

Y Zhao, B Li, H Fei - Proceedings of the 5th International Workshop on …, 2024 - dl.acm.org
Recent research has achieved advancements in animated fashion video synthesis.
However, existing methods generate videos only with the guidance of poses, thus resulting …

Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition

JL Romero, K Min, S Tripathi, M Karimzadeh - arXiv preprint arXiv …, 2025 - arxiv.org
Egocentric videos capture scenes from a wearer's viewpoint, resulting in dynamic
backgrounds, frequent motion, and occlusions, posing challenges to accurate keystep …

Fine-grained Structural Hallucination Detection for Unified Visual Comprehension and Generation in Multimodal LLM

H Fei, M Luo, J Xu, S Wu, W Ji, ML Lee… - Proceedings of the 1st …, 2024 - dl.acm.org
Multimodal large language models (MLLMs) are evolving rapidly but suffer from significant
challenges, such as hallucinations, which compromise the reliability and utility of MLLMs …