Towards Analyzing and Mitigating Sycophancy in Large Vision-Language Models

Y Zhao, R Zhang, J Xiao, C Ke, R Hou, Y Hao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have shown significant capability in vision-
language understanding. However, one critical issue that persists in these models is …

EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering

S Zhou, J Xiao, Q Li, Y Li, X Yang, D Guo… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA
assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene …

Question-Answering Dense Video Events

H Qin, J Xiao, A Yao - arXiv preprint arXiv:2409.04388, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have shown excellent performance in
question-answering of single-event videos. In this paper, we present question-answering …

On the Consistency of Video Large Language Models in Temporal Comprehension

M Jung, J Xiao, BT Zhang, A Yao - arXiv preprint arXiv:2411.12951, 2024 - arxiv.org
Video large language models (Video-LLMs) can temporally ground language queries and
retrieve video moments. Yet, such temporal comprehension capabilities are neither well …

Can Video Large Language Models Comprehend Language in Videos?

M Jung, J Xiao, BT Zhang, A Yao - Workshop on Video-Language Models … - openreview.net
Recent advancements in video large language models (Video-LLMs) have shown
capabilities of temporally-grounding language queries or retrieving video moments in …