Gemini: a family of highly capable multimodal models
G Team, R Anil, S Borgeaud, JB Alayrac, J Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable
capabilities across image, audio, video, and text understanding. The Gemini family consists …
MVBench: A comprehensive multi-modal video understanding benchmark
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …
InternLM-XComposer-2.5: A versatile large vision-language model supporting long-contextual input and output
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …
OneLLM: One framework to align all modalities with language
Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However, existing works rely heavily on modality …
VideoAgent: Long-Form Video Understanding with Large Language Model as Agent
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …
VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding
We explore how reconciling several foundation models (large language models and vision-
language models) with a novel unified memory mechanism could tackle the challenging …
Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
BRAVE: Broadening the visual encoding of vision-language models
Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …
Can I trust your answer? Visually grounded video question answering
We study visually grounded VideoQA in response to the emerging trends of utilizing
pretraining techniques for video-language understanding. Specifically, by forcing vision …
AnyMAL: An efficient and scalable any-modality augmented language model
We present Any-Modality Augmented Language Model (AnyMAL), a unified model
that reasons over diverse input modality signals (i.e., text, image, video, audio, IMU motion …