Ponder & Press: Advancing Visual GUI Agent towards General Computer Control

Y Wang, H Zhang, J Tian, Y Tang - arXiv preprint arXiv:2412.01268, 2024 - arxiv.org
Most existing GUI agents typically depend on non-vision inputs like HTML source code or
accessibility trees, limiting their flexibility across diverse software environments and …

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

X Ye, Y Gan, Y Ge, XP Zhang, Y Tang - arXiv preprint arXiv:2412.00447, 2024 - arxiv.org
Large Vision Language Models (LVLMs) have achieved significant success across multimodal
tasks. However, the computational cost of processing long visual tokens can be …

AdaFV: Accelerating VLMs with Self-Adaptive Cross-Modality Attention Mixture

J Han, L Du, Y Wu, X Zhou, H Du, W Zheng - arXiv preprint arXiv …, 2025 - arxiv.org
The success of VLMs often relies on the dynamic high-resolution schema that adaptively
augments the input images to multiple crops, so that the details of the images can be …