mPLUG-DocOwl2: High-resolution compressing for OCR-free multi-page document understanding

A Hu, H Xu, L Zhang, J Ye, M Yan, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have achieved promising OCR-free
Document Understanding performance by increasing the supported resolution of document …

A survey of video datasets for grounded event understanding

K Sanders, B Van Durme - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
While existing video benchmarks largely consider specialized downstream tasks like
retrieval or question-answering (QA), contemporary multimodal AI systems must be capable …

MultiVENT: Multilingual videos of events and aligned natural text

K Sanders, D Etter, R Kriz… - Advances in Neural …, 2023 - proceedings.neurips.cc
Everyday news coverage has shifted from traditional broadcasts towards a wide range of
presentation formats such as first-hand, unedited video footage. Datasets that reflect the …

Reading between the lanes: Text VideoQA on the road

G Tom, M Mathew, S Garcia-Bordils, D Karatzas… - … on Document Analysis …, 2023 - Springer
Text and signs around roads provide crucial information for drivers, vital for safe navigation
and situational awareness. Scene text recognition in motion is a challenging problem, while …

Making the V in text-VQA matter

S Hegde, S Jahagirdar… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Text-based VQA aims at answering questions by reading the text present in the images. It
requires a large amount of scene-text relationship understanding compared to the VQA task …

VTLayout: a multi-modal approach for video text layout

Y Zhao, J Ma, Z Qi, Z Xie, Y Luo, Q Kang… - Proceedings of the 31st …, 2023 - dl.acm.org
The rapid explosion of video distribution is accompanied by a massive amount of video text,
which encompasses rich information about the video content. While previous research has …

Understanding Video Scenes through Text: Insights from Text-based Video Question Answering

S Jahagirdar, M Mathew, D Karatzas… - Proceedings of the …, 2023 - openaccess.thecvf.com
Researchers have extensively studied the field of vision and language, discovering that both
visual and textual content is crucial for understanding scenes effectively. Particularly …

Scene-Text Grounding for Text-Based Video Question Answering

S Zhou, J Xiao, X Yang, P Song, D Guo, A Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing efforts in text-based video question answering (TextVideoQA) are criticized for their
opaque decision-making and heavy reliance on scene-text recognition. In this paper, we …

Video question answering for people with visual impairments using an egocentric 360-degree camera

I Song, M Joo, J Kwon, J Lee - arXiv preprint arXiv:2405.19794, 2024 - arxiv.org
This paper addresses the daily challenges encountered by visually impaired individuals,
such as limited access to information, navigation difficulties, and barriers to social …

Dissecting multimodality in VideoQA transformer models by impairing modality fusion

IS Rawal, A Matyasko, S Jaiswal, B Fernando… - arXiv preprint arXiv …, 2023 - arxiv.org
While VideoQA Transformer models demonstrate competitive performance on standard
benchmarks, the reasons behind their success are not fully understood. Do these models …