A survey on benchmarks of multimodal large language models
Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both
academia and industry due to their remarkable performance in various applications such as …
Egoschema: A diagnostic benchmark for very long-form video language understanding
K Mangalam, R Akshulakov… - Advances in Neural …, 2023 - proceedings.neurips.cc
We introduce EgoSchema, a very long-form video question-answering dataset and
benchmark to evaluate long video understanding capabilities of modern vision and …
One-peace: Exploring one general representation model toward unlimited modalities
In this work, we explore a scalable way to build a general representation model toward
unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B …
Valor: Vision-audio-language omni-perception pretraining model and dataset
In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multi-modal understanding and generation. Different from widely-studied vision …
From image to language: A critical analysis of visual question answering (vqa) approaches, challenges, and opportunities
The multimodal task of Visual Question Answering (VQA), encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …
Video question answering: Datasets, algorithms and challenges
Video Question Answering (VideoQA) aims to answer natural language questions based on
the given videos. It has attracted increasing attention with recent research trends in joint …
Cat: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios
This paper focuses on the challenge of answering questions in scenarios that are composed
of rich and complex dynamic audio-visual components. Although existing Multimodal Large …
Worldgpt: Empowering llm as multimodal world model
World models are progressively being employed across diverse fields, extending from basic
environment simulation to complex scenario construction. However, existing models are …
Funqa: Towards surprising video comprehension
Surprising videos, e.g., funny clips, creative performances, or visual illusions, attract
significant attention. Enjoyment of these videos is not simply a response to visual stimuli; …