Title | Authors | Venue | Cited by | Year |
Internvideo2: Scaling foundation models for multimodal video understanding | Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei, R Zheng, Z Wang, Y Shi, ... | European Conference on Computer Vision, 396-416, 2024 | 131 | 2024 |
Video mamba suite: State space model as a versatile alternative for video understanding | G Chen, Y Huang, J Xu, B Pei, Z Chen, Z Li, J Wang, K Li, T Lu, L Wang | arXiv preprint arXiv:2403.09626, 2024 | 68 | 2024 |
Egoexolearn: A dataset for bridging asynchronous ego-and exo-centric view of procedural activities in real world | Y Huang, G Chen, J Xu, M Zhang, L Yang, B Pei, H Zhang, L Dong, ... | Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2024 | 27 | 2024 |
Egovideo: Exploring egocentric foundation model and downstream adaptation | B Pei, G Chen, J Xu, Y He, Y Liu, K Pan, Y Huang, Y Wang, T Lu, L Wang, ... | arXiv preprint arXiv:2406.18070, 2024 | 8 | 2024 |
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model | Y Huang, J Xu, B Pei, Y He, G Chen, L Yang, X Chen, Y Wang, Z Nie, ... | arXiv preprint arXiv:2412.21080, 2024 | 1 | 2024 |
Cg-bench: Clue-grounded question answering benchmark for long video understanding | G Chen, Y Liu, Y Huang, Y He, B Pei, J Xu, Y Wang, T Lu, L Wang | arXiv preprint arXiv:2412.12075, 2024 | 1 | 2024 |
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning | B Pei, Y Huang, J Xu, G Chen, Y He, L Yang, Y Wang, W Xie, Y Qiao, ... | The Thirteenth International Conference on Learning Representations, 0 | | |