Multi-object hallucination in vision-language models

X Chen, Z Ma, X Zhang, S Xu, S Qian, J Yang… - arxiv preprint arxiv …, 2024 - arxiv.org

Video-Language Alignment via LLM-Based Self-Questioning and Answering

J Chen, K Ma, H Huang, J Shen, H Fang… - arxiv preprint arxiv …, 2024 - arxiv.org
The development of multi-modal models has been rapidly advancing, with some
demonstrating remarkable capabilities. However, annotating video-text pairs remains …

SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context

J Li, S Tao, Y Yan, X Gu, H Xu, X Zheng, Y Lyu… - arxiv preprint arxiv …, 2024 - arxiv.org
Endeavors have been made to explore Large Language Models for video analysis (Video-LLMs), particularly in understanding and interpreting long videos. However, existing Video …

Mitigating Language Bias of LMMs in Social Intelligence Understanding with Virtual Counterfactual Calibration

P Chen, XY Guo, YF Li, X Zhang… - Proceedings of the 2024 …, 2024 - aclanthology.org
Social intelligence is essential for understanding complex human expressions and social
interactions. While large multimodal models (LMMs) have demonstrated remarkable …