Yaya Shi
Verified email at mail.ustc.edu.cn - Homepage
Title | Cited by | Year
mPLUG-Owl: Modularization empowers large language models with multimodality
Q Ye, H Xu, G Xu, J Ye, M Yan, Y Zhou, J Wang, A Hu, P Shi, Y Shi, C Li, ...
arXiv preprint arXiv:2304.14178, 2023
Cited by 841 · 2023
Object relational graph with teacher-recommended learning for video captioning
Z Zhang, Y Shi, C Yuan, B Li, P Wang, W Hu, ZJ Zha
Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2020
Cited by 373 · 2020
mPLUG-2: A modularized multi-modal foundation model across text, image and video
H Xu, Q Ye, M Yan, Y Shi, J Ye, Y Xu, C Li, B Bi, Q Qian, W Wang, G Xu, ...
International Conference on Machine Learning, 38728-38748, 2023
Cited by 134 · 2023
EMScore: Evaluating video captioning via coarse-grained and fine-grained embedding matching
Y Shi, X Yang, H Xu, C Yuan, B Li, W Hu, ZJ Zha
Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2022
Cited by 41 · 2022
mPLUG-PaperOwl: Scientific diagram analysis with the multimodal large language model
A Hu, Y Shi, H Xu, J Ye, Q Ye, M Yan, C Li, Q Qian, J Zhang, F Huang
Proceedings of the 32nd ACM International Conference on Multimedia, 6929-6938, 2024
Cited by 32 · 2024
Youku-mPLUG: A 10 million large-scale Chinese video-language dataset for pre-training and benchmarks
H Xu, Q Ye, X Wu, M Yan, Y Miao, J Ye, G Xu, A Hu, Y Shi, G Xu, C Li, ...
arXiv preprint arXiv:2306.04362, 2023
Cited by 23 · 2023
Learning video-text aligned representations for video captioning
Y Shi, H Xu, C Yuan, B Li, W Hu, ZJ Zha
ACM Transactions on Multimedia Computing, Communications and Applications 19 …, 2023
Cited by 18 · 2023
MIBench: Evaluating multimodal large language models over multiple images
H Liu, X Zhang, H Xu, Y Shi, C Jiang, M Yan, J Zhang, F Huang, C Yuan, ...
arXiv preprint arXiv:2407.15272, 2024
Cited by 9 · 2024
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Q Ye, H Xu, G Xu, J Ye, M Yan, Y Zhou, J Wang, A Hu, P Shi, Y Shi, C Li, ...
CoRR abs/2304.14178, 2023
Cited by 9 · 2023
Learning semantics-grounded vocabulary representation for video-text retrieval
Y Shi, H Liu, H Xu, Z Ma, Q Ye, A Hu, M Yan, J Zhang, F Huang, C Yuan, ...
Proceedings of the 31st ACM International Conference on Multimedia, 4460-4470, 2023
Cited by 5 · 2023
UniQRNet: Unifying referring expression grounding and segmentation with QRNet
J Ye, J Tian, M Yan, H Xu, Q Ye, Y Shi, X Yang, X Wang, J Zhang, L He, ...
ACM Transactions on Multimedia Computing, Communications and Applications 20 …, 2024
Cited by 2 · 2024
iMOVE: Instance-Motion-Aware Video Understanding
J Li, Y Shi, Z Ma, H Xu, F Cheng, H Xiao, R Kang, F Yang, T Gao, D Zhang
arXiv preprint arXiv:2502.11594, 2025
2025
Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training
H Liu, Y Shi, H Xu, C Yuan, Q Ye, C Li, M Yan, J Zhang, F Huang, B Li, ...
arXiv preprint arXiv:2403.00249, 2024
2024
Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval
H Liu, Y Shi, H Xu, C Yuan, Q Ye, C Li, M Yan, J Zhang, F Huang, B Li, ...
arXiv preprint arXiv:2402.16769, 2024
2024
VATEX Captioning Challenge 2019: Multi-modal Information Fusion and Multi-stage Training Strategy for Video Captioning
Z Zhang, Y Shi, J Wei, C Yuan, B Li, W Hu
arXiv preprint arXiv:1910.05752, 2019
2019
Articles 1–15