Ponder & Press: Advancing Visual GUI Agent towards General Computer Control

Y Wang, H Zhang, J Tian, Y Tang - arXiv preprint arXiv:2412.01268, 2024 - arxiv.org
Most existing GUI agents typically depend on non-vision inputs like HTML source code or
accessibility trees, limiting their flexibility across diverse software environments and …

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

X Ye, Y Gan, Y Ge, XP Zhang, Y Tang - arXiv preprint arXiv:2412.00447, 2024 - arxiv.org
Large Vision Language Models (LVLMs) have achieved significant success across multimodal
tasks. However, the computational cost of processing long visual tokens can be …

AdaFV: Accelerating VLMs with Self-Adaptive Cross-Modality Attention Mixture

J Han, L Du, Y Wu, X Zhou, H Du, W Zheng - arXiv preprint arXiv …, 2025 - arxiv.org
The success of VLMs often relies on the dynamic high-resolution schema that adaptively
augments the input images to multiple crops, so that the details of the images can be …