Google Scholar

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

C Yang, X Dong, X Zhu, W Su, J Wang, H Tian… - arXiv preprint arXiv …, 2024 - arxiv.org

Large Vision-Language Models (VLMs) have been extended to understand both images
and videos. Visual token compression is leveraged to reduce the considerable token length …

DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models

K Tao, C Qin, H You, Y Sui, H Wang - arXiv preprint arXiv:2411.15024, 2024 - arxiv.org

Video large language models (VLLMs) have significantly advanced recently in processing
complex video content, yet their inference efficiency remains constrained because of the …

InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

C Wei, Y Zhong, H Tan, Y Zeng, Y Liu, Z Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org

Boosted by Multi-modal Large Language Models (MLLMs), text-guided universal
segmentation models for the image and video domains have made rapid progress recently …

LinVT: Empower Your Image-level Large Language Model to Understand Videos

L Gao, Y Zhong, Y Zeng, H Tan, D Li, Z Zhao - arXiv preprint arXiv …, 2024 - arxiv.org

Large Language Models (LLMs) have been widely used in various tasks, motivating us to
develop an LLM-based assistant for videos. Instead of training from scratch, we propose a …

freePruner: A Training-free Approach for Large Multimodal Model Acceleration

B Xu, Y Shang, Y Ge, Q Lou, Y Yan - arXiv preprint arXiv:2411.15446, 2024 - arxiv.org

Large Multimodal Models (LMMs) have demonstrated impressive capabilities in visual-language
tasks but face significant deployment challenges due to their high computational …

Cited by 1

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs