Cross-modal adapter for text-video retrieval. H Jiang, J Zhang, R Huang, C Ge, Z Ni, J Lu, J Zhou, S Song, G Huang. arXiv preprint arXiv:2211.09623, 2022 | 43 | 2022 |
HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers. J Zhang, Y Guo, X Chen, YJ Wang, Y Hu, C Shi, J Chen. CoRL 2024, 2024 | 4 | 2024 |
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations. Y Hu, Y Guo, P Wang, X Chen, YJ Wang, J Zhang, K Sreenath, C Lu, ... arXiv preprint arXiv:2412.14803, 2024 | 2 | 2024 |
Prediction with action: Visual policy learning via joint denoising process. Y Guo, Y Hu, J Zhang, YJ Wang, X Chen, C Lu, J Chen. The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024 | 2 | 2024 |
Improving Vision-Language-Action Model with Online Reinforcement Learning. Y Guo, J Zhang, X Chen, X Ji, YJ Wang, Y Hu, J Chen. arXiv preprint arXiv:2501.16664, 2025 | 1 | 2025 |
UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent. J Zhang, Y Guo, Y Hu, X Chen, X Zhu, J Chen. arXiv preprint arXiv:2501.18867, 2025 | | 2025 |