Osprey: Pixel Understanding with Visual Instruction Tuning. Y Yuan, W Li, J Liu, D Tang, X Luo, C Qin, L Zhang, J Zhu. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. Cited by 72.
TokenPacker: Efficient Visual Projector for Multimodal LLM. W Li, Y Yuan, J Liu, D Tang, S Wang, J Qin, J Zhu, L Zhang. arXiv preprint arXiv:2407.02392, 2024. Cited by 29.
Point2Mask: Point-Supervised Panoptic Segmentation via Optimal Transport. W Li, Y Yuan, S Wang, J Zhu, J Li, J Liu, L Zhang. Proceedings of the IEEE/CVF International Conference on Computer Vision, 572-581, 2023. Cited by 24.
Chain of Ideas: Revolutionizing Research via Novel Idea Development with LLM Agents. L Li, W Xu, J Guo, R Zhao, X Li, Y Yuan, B Zhang, Y Jiang, Y Xin, R Dang, et al. arXiv preprint arXiv:2410.13185, 2024. Cited by 7.
Label-Efficient Segmentation via Affinity Propagation. W Li, Y Yuan, S Wang, W Liu, D Tang, J Zhu, L Zhang. Advances in Neural Information Processing Systems 36, 29901-29913, 2023. Cited by 6.
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM. Y Yuan, H Zhang, W Li, Z Cheng, B Zhang, L Li, X Li, D Zhao, W Zhang, et al. arXiv preprint arXiv:2501.00599, 2024. Cited by 1.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding. B Zhang, K Li, Z Cheng, Z Hu, Y Yuan, G Chen, S Leng, Y Jiang, H Zhang, et al. arXiv preprint arXiv:2501.13106, 2025.
ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark. R Dang, Y Yuan, W Zhang, Y Xin, B Zhang, L Li, L Wang, Q Zeng, X Li, et al. arXiv preprint arXiv:2501.05031, 2025.