Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis
In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs)
have emerged as a focal point in recent advancements. However, the predominant focus …
InternLM-XComposer-2.5: A versatile large vision-language model supporting long-contextual input and output
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …
Video instruction tuning with synthetic data
The development of video large multimodal models (LMMs) has been hindered by the
difficulty of curating large amounts of high-quality raw data from the web. To address this, we …
Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
Kangaroo: A powerful video-language model supporting long-context video input
Rapid advancements have been made in extending Large Language Models (LLMs) to
Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data …
MME-Survey: A comprehensive survey on evaluation of multimodal LLMs
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …
HourVideo: 1-hour video-language understanding
We present HourVideo, a benchmark dataset for hour-long video-language understanding.
Our dataset consists of a novel task suite comprising summarization, perception (recall …
AudioBench: A universal benchmark for audio large language models
We introduce AudioBench, a universal benchmark designed to evaluate Audio Large
Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, among …
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …
Grounded-VideoLLM: Sharpening fine-grained temporal grounding in video large language models
Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in
coarse-grained video understanding; however, they struggle with fine-grained temporal …