Knowledge graphs meet multi-modal learning: A comprehensive survey

Z Chen, Y Zhang, Y Fang, Y Geng, L Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the
semantic web community's exploration into multi-modal dimensions unlocking new avenues …

A survey on multimodal large language models

S Yin, C Fu, S Zhao, K Li, X Sun, T Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
Multimodal Large Language Models (MLLMs) have recently emerged as a rising research
hotspot, using powerful Large Language Models (LLMs) as a brain to perform …

EgoSchema: A diagnostic benchmark for very long-form video language understanding

K Mangalam, R Akshulakov… - Advances in Neural …, 2023 - proceedings.neurips.cc
We introduce EgoSchema, a very long-form video question-answering dataset, and
benchmark to evaluate long video understanding capabilities of modern vision and …

LLaMA-VID: An image is worth 2 tokens in large language models

Y Li, C Wang, J Jia - European Conference on Computer Vision, 2024 - Springer
In this work, we present a novel method to tackle the token generation challenge in Vision
Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current …

MovieChat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recently, integrating video foundation models and large language models to build video
understanding systems has made it possible to overcome the limitations of specific pre-defined vision tasks. Yet …

MVBench: A comprehensive multi-modal video understanding benchmark

K Li, Y Wang, Y He, Y Li, Y Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …

AutoAD II: The sequel - who, when, and what in movie audio description

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Distribution-balanced loss for multi-label classification in long-tailed datasets

T Wu, Q Huang, Z Liu, Y Wang, D Lin - … , Glasgow, UK, August 23–28, 2020 …, 2020 - Springer
We present a new loss function, called Distribution-Balanced Loss, for multi-label
recognition problems that exhibit long-tailed class distributions. Compared to conventional …
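For quick reference, a minimal PyTorch sketch of the re-balanced weighting idea behind this kind of loss is given below. The function name, smoothing constants, and simplified weighting scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def rebalanced_bce_loss(logits, targets, class_freq):
    """Illustrative sketch (not the paper's exact loss): binary cross-entropy
    re-weighted per label to compensate for class-aware re-sampling in
    long-tailed multi-label data.

    logits:     (B, C) raw scores
    targets:    (B, C) multi-hot labels in {0, 1}
    class_freq: (C,)   number of training instances per class
    """
    # Per-class sampling probability under class-aware sampling.
    p_class = 1.0 / class_freq.clamp(min=1).float()                      # (C,)
    # Per-instance expected sampling probability: any positive label of the
    # instance could have triggered its sampling.
    p_instance = (targets * p_class).sum(dim=1, keepdim=True).clamp(min=1e-12)  # (B, 1)
    # Re-balancing ratio, smoothed with a sigmoid (constants are hypothetical).
    r = p_class / p_instance                                             # (B, C)
    weight = 0.1 + torch.sigmoid(10.0 * (r - 0.2))
    bce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    return (weight * bce).mean()
```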

Towards long-form video understanding

CY Wu, P Krahenbuhl - … of the IEEE/CVF Conference on …, 2021 - openaccess.thecvf.com
Our world offers a never-ending stream of visual stimuli, yet today's vision systems only
accurately recognize patterns within a few seconds. These systems understand the present …