- Academic Search

A survey on benchmarks of multimodal large language models

J Li, W Lu, H Fei, M Luo, M Dai, M Xia, Y Jin… - arXiv preprint arXiv …, 2024 - arxiv.org

Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both
academia and industry due to their remarkable performance in various applications such as …

Cited by 14

[PDF] arxiv.org

On-device language models: A comprehensive review

J Xu, Z Li, W Chen, Q Wang, X Gao, Q Cai… - arXiv preprint arXiv …, 2024 - arxiv.org

The advent of large language models (LLMs) revolutionized natural language processing
applications, and running LLMs on edge devices has become increasingly attractive for …

Cited by 13

[PDF] arxiv.org

Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis

C Fu, Y Dai, Y Luo, L Li, S Ren, R Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org

In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs)
have emerged as a focal point in recent advancements. However, the predominant focus …

Cited by 136

[PDF] thecvf.com

Chat-UniVi: Unified visual representation empowers large language models with image and video understanding

P Jin, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …

Cited by 168

[PDF] arxiv.org

LITA: Language instructed temporal-localization assistant

DA Huang, S Liao, S Radhakrishnan, H Yin… - … on Computer Vision, 2024 - Springer

There has been tremendous progress in multimodal Large Language Models (LLMs).
Recent works have extended these models to video input with promising instruction …

Cited by 42

[PDF] arxiv.org

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Cited by 58

[PDF] arxiv.org

Kangaroo: A powerful video-language model supporting long-context video input

J Liu, Y Wang, H Ma, X Wu, X Ma, X Wei, J Jiao… - arXiv preprint arXiv …, 2024 - arxiv.org

Rapid advancements have been made in extending Large Language Models (LLMs) to
Large Multi-modal Models (LMMs). However, extending input modality of LLMs to video data …

Cited by 30

[PDF] arxiv.org

LVBench: An extreme long video understanding benchmark

W Wang, Z He, W Hong, Y Cheng, X Zhang, J Qi… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent progress in multimodal large language models has markedly enhanced the
understanding of short videos (typically under one minute), and several evaluation datasets …

Cited by 27

[PDF] arxiv.org

MMInstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity

Y Liu, Y Cao, Z Gao, W Wang, Z Chen, W Wang… - Science China …, 2024 - Springer

Despite the effectiveness of vision-language supervised fine-tuning in enhancing the
performance of vision large language models (VLLMs), existing visual instruction tuning …

Cited by 15

[PDF] arxiv.org

LongVU: Spatiotemporal adaptive compression for long video-language understanding

X Shen, Y Xiong, C Zhao, L Wu, J Chen, C Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org

Multimodal Large Language Models (MLLMs) have shown promising progress in
understanding and analyzing video content. However, processing long videos remains a …

Cited by 16
