UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent
J Zhang, Y Guo, Y Hu, X Chen, X Zhu… - arXiv preprint arXiv …, 2025 - arxiv.org
Recent advancements in Vision-Language-Action (VLA) models have leveraged pre-trained
Vision-Language Models (VLMs) to improve the generalization capabilities. VLMs, typically …
Exploring annotation-free image captioning with retrieval-augmented pseudo sentence generation
Recently, training an image captioner without annotated image-sentence pairs has gained
traction. Previous methods have faced limitations due to either using mismatched corpora for …