InternVideo2: Scaling foundation models for multimodal video understanding Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei, R Zheng, Z Wang, Y Shi, ... European Conference on Computer Vision, 396-416, 2024 | 129 | 2024 |
TimeSuite: Improving MLLMs for long video understanding via grounded tuning X Zeng, K Li, C Wang, X Li, T Jiang, Z Yan, S Li, Y Shi, Z Yue, Y Wang, ... arXiv preprint arXiv:2410.19702, 2024 | 3 | 2024 |
Task preference optimization: Improving multimodal large language models with vision task alignment Z Yan, Z Li, Y He, C Wang, K Li, X Li, X Zeng, Z Wang, Y Wang, Y Qiao, ... arXiv preprint arXiv:2412.19326, 2024 | 1 | 2024 |
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling Y Wang, X Li, Z Yan, Y He, J Yu, X Zeng, C Wang, C Ma, H Huang, J Gao, ... arXiv preprint arXiv:2501.12386, 2025 | | 2025 |