Gemini: a family of highly capable multimodal models
G Team, R Anil, S Borgeaud, JB Alayrac, J Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable
capabilities across image, audio, video, and text understanding. The Gemini family consists …
MVBench: A comprehensive multi-modal video understanding benchmark
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …
InternLM-XComposer-2.5: A versatile large vision-language model supporting long-contextual input and output
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …
OneLLM: One framework to align all modalities with language
Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However, existing works rely heavily on modality …
VideoAgent: Long-Form Video Understanding with Large Language Model as Agent
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …
VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding
We explore how reconciling several foundation models (large language models and vision-
language models) with a novel unified memory mechanism could tackle the challenging …
Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
BRAVE: Broadening the visual encoding of vision-language models
Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …
Can I trust your answer? Visually grounded video question answering
We study visually grounded VideoQA in response to the emerging trends of utilizing
pretraining techniques for video-language understanding. Specifically, by forcing vision …
AnyMAL: An efficient and scalable any-modality augmented language model
We present Any-Modality Augmented Language Model (AnyMAL), a unified model
that reasons over diverse input modality signals (i.e., text, image, video, audio, IMU motion …