PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

C Yang, X Dong, X Zhu, W Su, J Wang, H Tian… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (VLMs) have been extended to understand both images
and videos. Visual token compression is leveraged to reduce the considerable token length …

DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models

K Tao, C Qin, H You, Y Sui, H Wang - arXiv preprint arXiv:2411.15024, 2024 - arxiv.org
Video large language models (VLLMs) have significantly advanced recently in processing
complex video content, yet their inference efficiency remains constrained because of the …

InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models

C Wei, Y Zhong, H Tan, Y Zeng, Y Liu, Z Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Boosted by Multi-modal Large Language Models (MLLMs), text-guided universal
segmentation models for the image and video domains have made rapid progress recently …

LinVT: Empower Your Image-level Large Language Model to Understand Videos

L Gao, Y Zhong, Y Zeng, H Tan, D Li, Z Zhao - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have been widely used in various tasks, motivating us to
develop an LLM-based assistant for videos. Instead of training from scratch, we propose a …

freePruner: A Training-free Approach for Large Multimodal Model Acceleration

B Xu, Y Shang, Y Ge, Q Lou, Y Yan - arXiv preprint arXiv:2411.15446, 2024 - arxiv.org
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in
visual-language tasks but face significant deployment challenges due to their high computational …