mPLUG-DocOwl2: High-resolution compressing for OCR-free multi-page document understanding

A Hu, H Xu, L Zhang, J Ye, M Yan, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have achieved promising OCR-free
Document Understanding performance by increasing the supported resolution of document …

A survey of video datasets for grounded event understanding

K Sanders, B Van Durme - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
While existing video benchmarks largely consider specialized downstream tasks like
retrieval or question-answering (QA), contemporary multimodal AI systems must be capable …

MultiVENT: Multilingual videos of events and aligned natural text

K Sanders, D Etter, R Kriz… - Advances in Neural …, 2023 - proceedings.neurips.cc
Everyday news coverage has shifted from traditional broadcasts towards a wide range of
presentation formats such as first-hand, unedited video footage. Datasets that reflect the …

Reading between the lanes: Text VideoQA on the road

G Tom, M Mathew, S Garcia-Bordils, D Karatzas… - … on Document Analysis …, 2023 - Springer
Text and signs around roads provide crucial information for drivers, vital for safe navigation
and situational awareness. Scene text recognition in motion is a challenging problem, while …

Making the V in text-VQA matter

S Hegde, S Jahagirdar… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Text-based VQA aims at answering questions by reading the text present in the images. It
requires a large amount of scene-text relationship understanding compared to the VQA task …

VTLayout: a multi-modal approach for video text layout

Y Zhao, J Ma, Z Qi, Z Xie, Y Luo, Q Kang… - Proceedings of the 31st …, 2023 - dl.acm.org
The rapid explosion of video distribution is accompanied by a massive amount of video text,
which encompasses rich information about the video content. While previous research has …

Understanding Video Scenes through Text: Insights from Text-based Video Question Answering

S Jahagirdar, M Mathew, D Karatzas… - Proceedings of the …, 2023 - openaccess.thecvf.com
Researchers have extensively studied the field of vision and language, discovering that both
visual and textual content is crucial for understanding scenes effectively. Particularly …

Scene-Text Grounding for Text-Based Video Question Answering

S Zhou, J Xiao, X Yang, P Song, D Guo, A Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing efforts in text-based video question answering (TextVideoQA) are criticized for their
opaque decision-making and heavy reliance on scene-text recognition. In this paper, we …

Video question answering for people with visual impairments using an egocentric 360-degree camera

I Song, M Joo, J Kwon, J Lee - arXiv preprint arXiv:2405.19794, 2024 - arxiv.org
This paper addresses the daily challenges encountered by visually impaired individuals,
such as limited access to information, navigation difficulties, and barriers to social …

Dissecting multimodality in VideoQA transformer models by impairing modality fusion

IS Rawal, A Matyasko, S Jaiswal, B Fernando… - arXiv preprint arXiv …, 2023 - arxiv.org
While VideoQA Transformer models demonstrate competitive performance on standard
benchmarks, the reasons behind their success are not fully understood. Do these models …