Video-of-Thought: Step-by-step video reasoning from perception to cognition

H Fei, S Wu, W Ji, H Zhang, M Zhang, ML Lee… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing research on video understanding still struggles to achieve in-depth comprehension
and reasoning in complex videos, primarily due to the under-exploration of two key …

Dysen-VDM: Empowering dynamics-aware text-to-video diffusion with LLMs

H Fei, S Wu, W Ji, H Zhang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Text-to-video (T2V) synthesis has gained increasing attention in the community, in
which the recently emerged diffusion models (DMs) have promisingly shown stronger …

DreamLIP: Language-image pre-training with long captions

K Zheng, Y Zhang, W Wu, F Lu, S Ma, X … - … on Computer Vision, 2024 - Springer
Language-image pre-training largely relies on how precisely and thoroughly a text
describes its paired image. In practice, however, the contents of an image can be so rich that …

Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing

H Fei, S Wu, H Zhang, TS Chua, S Yan - arXiv preprint arXiv:2412.19806, 2024 - arxiv.org
Recent developments of vision large language models (LLMs) have seen remarkable
progress, yet still encounter challenges towards multimodal generalists, such as coarse …

PanoSent: A panoptic sextuple extraction benchmark for multimodal conversational aspect-based sentiment analysis

M Luo, H Fei, B Li, S Wu, Q Liu, S Poria… - Proceedings of the …, 2024 - dl.acm.org
While existing Aspect-based Sentiment Analysis (ABSA) has received extensive effort and
advancement, there are still gaps in defining a more holistic research target seamlessly …

Who evaluates the evaluations? Objectively scoring text-to-image prompt coherence metrics with T2IScoreScore (TS2)

M Saxon, F Jahara, M Khoshnoodi, Y Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
With advances in the quality of text-to-image (T2I) models has come interest in
benchmarking their prompt faithfulness, the semantic coherence of generated images to the …

NUS-Emo at SemEval-2024 Task 3: Instruction-tuning LLM for multimodal emotion-cause analysis in conversations

M Luo, H Zhang, S Wu, B Li, H Han, H Fei - arXiv preprint arXiv …, 2024 - arxiv.org
This paper describes the architecture of our system developed for Task 3 of SemEval-2024:
Multimodal Emotion-Cause Analysis in Conversations. Our project targets the challenges of …

Modeling implicit variable and latent structure for aspect-based sentiment quadruple extraction

Y Nie, J Fu, Y Zhang, C Li - Neurocomputing, 2024 - Elsevier
The realm of aspect-based sentiment analysis (ABSA), which delves into the nuanced
sentiment expressions individuals hold towards specific services or products, has …

Multimodal emotion-cause pair extraction with holistic interaction and label constraint

B Li, H Fei, F Li, T Chua, D Ji - ACM Transactions on Multimedia …, 2024 - dl.acm.org
The multimodal emotion-cause pair extraction (MECPE) task aims to detect the emotions,
causes, and emotion-cause pairs from multimodal conversations. Existing methods for this …

SpeechEE: A Novel Benchmark for Speech Event Extraction

B Wang, M Zhang, H Fei, Y Zhao, B Li, S Wu… - Proceedings of the …, 2024 - dl.acm.org
Event extraction (EE) is a critical direction in the field of information extraction, laying an
important foundation for the construction of structured knowledge bases. EE from text has …