Zijia Zhao
Institute of Automation, Chinese Academy of Sciences (CASIA)
Verified email at ia.ac.cn
Title
Cited by
Year
VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset
S Chen, H Li, Q Wang, Z Zhao, M Sun, X Zhu, J Liu
Advances in Neural Information Processing Systems 36, 72842-72866, 2023
Cited by: 110, Year: 2023
VL-Mamba: Exploring state space models for multimodal learning
Y Qiao, Z Yu, L Guo, S Chen, Z Zhao, M Sun, Q Wu, J Liu
arXiv preprint arXiv:2403.13600, 2024
Cited by: 63, Year: 2024
ChatBridge: Bridging modalities with large language model as a language catalyst
Z Zhao, L Guo, T Yue, S Chen, S Shao, X Zhu, Z Yuan, J Liu
arXiv preprint arXiv:2305.16103, 2023
Cited by: 54, Year: 2023
OPT: Omni-perception pre-trainer for cross-modal understanding and generation
J Liu, X Zhu, F Liu, L Guo, Z Zhao, M Sun, W Wang, H Lu, S Zhou, J Zhang, ...
arXiv preprint arXiv:2107.00249, 2021
Cited by: 47, Year: 2021
MAMO: Fine-grained vision-language representations learning with masked multimodal modeling
Z Zhao, L Guo, X He, S Shao, Z Yuan, J Liu
Proceedings of the 46th International ACM SIGIR Conference on Research and …, 2023
Cited by: 16*, Year: 2023
SC-Tune: Unleashing self-consistent referential comprehension in large vision language models
T Yue, J Cheng, L Guo, X Dai, Z Zhao, X He, G Xiong, Y Lv, J Liu
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2024
Cited by: 10, Year: 2024
Needle in a video haystack: A scalable synthetic framework for benchmarking video MLLMs
Z Zhao, H Lu, Y Huo, Y Du, T Yue, L Guo, B Wang, W Chen, J Liu
arXiv preprint arXiv:2406.09367, 2024
Cited by: 9, Year: 2024
MM21 pre-training for video understanding challenge: Video captioning with pretraining techniques
S Chen, X Zhu, D Hao, W Liu, J Liu, Z Zhao, L Guo, J Liu
Proceedings of the 29th ACM International Conference on Multimedia, 4853-4857, 2021
Cited by: 8, Year: 2021
Towards event-oriented long video understanding
Y Du, K Zhou, Y Huo, Y Li, WX Zhao, H Lu, Z Zhao, B Wang, W Chen, ...
arXiv preprint arXiv:2406.14129, 2024
Cited by: 7, Year: 2024
Beyond literal descriptions: understanding and locating open-world objects aligned with human intentions
W Wang, Y Zhang, X He, Y Yan, Z Zhao, X Wang, J Liu
arXiv preprint arXiv:2402.11265, 2024
Cited by: 2, Year: 2024
ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
Z Zhao, L Guo, T Yue, E Hu, S Shao, Z Yuan, H Huang, J Liu
arXiv preprint arXiv:2410.18715, 2024
Cited by: 1, Year: 2024
Exploring the design space of visual context representation in video MLLMs
Y Du, Y Huo, K Zhou, Z Zhao, H Lu, H Huang, WX Zhao, B Wang, W Chen, ...
arXiv preprint arXiv:2410.13694, 2024
Cited by: 1, Year: 2024
OneDiff: A Generalist Model for Image Difference Captioning
E Hu, L Guo, T Yue, Z Zhao, S Xue, J Liu
Proceedings of the Asian Conference on Computer Vision, 2439-2455, 2024
Cited by: 1, Year: 2024
Collaborative Training of Tiny-Large Vision Language Models
S Lu, L Guo, W Wang, Z Zhao, T Yue, J Liu, S Liu
Proceedings of the 32nd ACM International Conference on Multimedia, 4928-4937, 2024
Year: 2024
Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
H Huang, Y Huo, Z Zhao, H Lu, S Wu, B Wang, Q Liu, W Chen, L Wang
arXiv preprint arXiv:2410.16166, 2024
Year: 2024
Articles 1–15