Towards Analyzing and Mitigating Sycophancy in Large Vision-Language Models
Large Vision-Language Models (LVLMs) have shown significant capability in vision-language understanding. However, one critical issue that persists in these models is …
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene …
Question-Answering Dense Video Events
Multimodal Large Language Models (MLLMs) have shown excellent performance in question-answering of single-event videos. In this paper, we present question-answering …
On the Consistency of Video Large Language Models in Temporal Comprehension
Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet, such temporal comprehension capabilities are neither well …
Can Video Large Language Models Comprehend Language in Videos?
Recent advancements in video large language models (Video-LLMs) have shown capabilities of temporally-grounding language queries or retrieving video moments in …