Video Instruction Tuning with Synthetic Data

Y Zhang, J Wu, W Li, B Li, Z Ma, Z Liu, C Li - arXiv preprint arXiv …, 2024 - arxiv.org
The development of video large multimodal models (LMMs) has been hindered by the
difficulty of curating large amounts of high-quality raw data from the web. To address this, we …

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

F Xue, Y Chen, D Li, Q Hu, L Zhu, X Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context capability is critical for multi-modal foundation models, especially for long
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …

LongVLM: Efficient Long Video Understanding via Large Language Models

Y Weng, M Han, H He, X Chang, B Zhuang - European Conference on …, 2024 - Springer
Empowered by Large Language Models (LLMs), recent advancements in Video-based
LLMs (VideoLLMs) have driven progress in various video understanding tasks. These …

MLP Can Be a Good Transformer Learner

S Lin, P Lyu, D Liu, T Tang, X Liang… - Proceedings of the …, 2024 - openaccess.thecvf.com
The self-attention mechanism is the key to the Transformer but is often criticized for its
computational demands. Previous token pruning works motivate their methods from the view of …

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

H Hua, Y Tang, C Xu, J Luo - arXiv preprint arXiv:2404.12353, 2024 - arxiv.org
Video summarization aims to create short, accurate, and cohesive summaries of longer
videos. Despite the existence of various video summarization datasets, a notable limitation …

VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models

J Wang, C Wang, K Huang, J Huang, L ** - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in
numerous applications. However, the emphasis on brief summary texts during pre-training …

Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies

Y Gao, L Fischer, A Lintner, S Ebling - arXiv preprint arXiv:2410.08860, 2024 - arxiv.org
Audio descriptions (ADs) function as acoustic commentaries designed to assist blind
persons and persons with visual impairments in accessing digital media content on …

What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations

D Liu, C Whitehouse, X Yu, L Mahon, R Saxena… - arXiv preprint arXiv …, 2025 - arxiv.org
Transforming recorded videos into concise and accurate textual summaries is a growing
challenge in multimodal learning. This paper introduces VISTA, a dataset specifically …

COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation

X Deng, Q Yu, A Athar, C Yang, L Yang, X **… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper introduces the COCONut-PanCap dataset, created to enhance panoptic
segmentation and grounded image captioning. Building upon the COCO dataset with …

Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models for Video Captioning and Summarization

R Luo, A Peng, A Vasudev, R Jain - … of the 2nd International Workshop on …, 2024 - dl.acm.org
Video is an increasingly prominent and information-dense medium, yet it poses substantial
challenges for language models. A typical video consists of a sequence of shorter segments …