Ponder & Press: Advancing Visual GUI Agent towards General Computer Control
Most existing GUI agents typically depend on non-vision inputs like HTML source code or
accessibility trees, limiting their flexibility across diverse software environments and …
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
Large Vision Language Models (LVLMs) have achieved significant success across multimodal tasks. However, the computational cost of processing long visual tokens can be …
AdaFV: Accelerating VLMs with Self-Adaptive Cross-Modality Attention Mixture
J Han, L Du, Y Wu, X Zhou, H Du, W Zheng - arXiv preprint arXiv …, 2025 - arxiv.org
The success of VLMs often relies on the dynamic high-resolution schema that adaptively
augments the input images to multiple crops, so that the details of the images can be …
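As a rough illustration of the dynamic high-resolution schema this abstract refers to (splitting the input image into multiple crops alongside a downscaled global view), here is a minimal sketch. The tile size, grid-selection rule, and PIL-based implementation are assumptions for illustration, not the paper's actual method.

```python
# Minimal sketch of a dynamic high-resolution scheme: the image is resized to
# the nearest supported grid and cut into fixed-size crops, plus one
# downscaled "global" view. Tile size and grid budget are illustrative
# assumptions, not AdaFV's actual configuration.
from PIL import Image

TILE = 336  # hypothetical vision-encoder input resolution


def dynamic_crops(img: Image.Image, max_tiles: int = 6):
    w, h = img.size
    # Pick a (cols, rows) grid that roughly preserves the aspect ratio
    # while keeping the total number of tiles within the budget.
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue
            err = abs((cols / rows) - (w / h))
            if err < best_err:
                best, best_err = (cols, rows), err
    cols, rows = best

    # Resize so the image exactly covers the chosen grid, then cut tiles.
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]
    # A downscaled global view is typically prepended so the model keeps
    # whole-image context alongside the detail crops.
    return [img.resize((TILE, TILE))] + tiles
```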