Gemini: a family of highly capable multimodal models

G Team, R Anil, S Borgeaud, JB Alayrac, J Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable
capabilities across image, audio, video, and text understanding. The Gemini family consists …

MVBench: A comprehensive multi-modal video understanding benchmark

K Li, Y Wang, Y He, Y Li, Y Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …

InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output

P Zhang, X Dong, Y Zang, Y Cao, R Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …

OneLLM: One framework to align all modalities with language

J Han, K Gong, Y Zhang, J Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However, existing works rely heavily on modality …

VideoAgent: Long-Form Video Understanding with Large Language Model as Agent

X Wang, Y Zhang, O Zohar, S Yeung-Levy - European Conference on …, 2024 - Springer
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …

VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding

Y Fan, X Ma, R Wu, Y Du, J Li, Z Gao, Q Li - European Conference on …, 2024 - Springer
We explore how reconciling several foundation models (large language models and vision-
language models) with a novel unified memory mechanism could tackle the challenging …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

BRAVE: Broadening the visual encoding of vision-language models

OF Kar, A Tonioni, P Poklukar, A Kulshrestha… - … on Computer Vision, 2024 - Springer
Vision-language models (VLMs) are typically composed of a vision encoder, e.g., CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …

Can I trust your answer? Visually grounded video question answering

J Xiao, A Yao, Y Li, TS Chua - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
We study visually grounded VideoQA in response to the emerging trends of utilizing
pretraining techniques for video-language understanding. Specifically, by forcing vision …

AnyMAL: An efficient and scalable any-modality augmented language model

S Moon, A Madotto, Z Lin, T Nagarajan… - Proceedings of the …, 2024 - aclanthology.org
We present Any-Modality Augmented Language Model (AnyMAL), a unified model
that reasons over diverse input modality signals (i.e., text, image, video, audio, IMU motion …