InternVideo2: Scaling Foundation Models for Multimodal Video Understanding. Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei, R Zheng, Z Wang, Y Shi, et al. European Conference on Computer Vision, 396-416, 2024. Cited by 118.
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding. G Chen, Y Huang, J Xu, B Pei, Z Chen, Z Li, J Wang, K Li, T Lu, L Wang. arXiv preprint arXiv:2403.09626, 2024. Cited by 59.
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World. Y Huang, G Chen, J Xu, M Zhang, L Yang, B Pei, H Zhang, L Dong, et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2024. Cited by 23.
EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation. B Pei, G Chen, J Xu, Y He, Y Liu, K Pan, Y Huang, Y Wang, T Lu, L Wang, et al. arXiv preprint arXiv:2406.18070, 2024. Cited by 8.
Vinci: A Real-time Embodied Smart Assistant Based on Egocentric Vision-Language Model. Y Huang, J Xu, B Pei, Y He, G Chen, L Yang, X Chen, Y Wang, Z Nie, et al. arXiv preprint arXiv:2412.21080, 2024. Cited by 1.
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding. G Chen, Y Liu, Y Huang, Y He, B Pei, J Xu, Y Wang, T Lu, L Wang. arXiv preprint arXiv:2412.12075, 2024. Cited by 1.