A comprehensive evaluation of GPT-4V on knowledge-intensive visual question answering. Y Li, L Wang, B Hu, X Chen, W Zhong, C Lyu, W Wang, M Zhang. arXiv preprint arXiv:2311.07536, 2023. Cited by 37.
LMEye: An interactive perception network for large language models. Y Li, B Hu, X Chen, L Ma, Y Xu, M Zhang. IEEE Transactions on Multimedia, 2024. Cited by 36.
VideoVista: A versatile benchmark for video understanding and reasoning. Y Li, X Chen, B Hu, L Wang, H Shi, M Zhang. arXiv preprint arXiv:2406.11303, 2024. Cited by 18.
A multi-modal context reasoning approach for conditional inference on joint textual and visual clues. Y Li, B Hu, X Chen, Y Ding, L Ma, M Zhang. arXiv preprint arXiv:2305.04530, 2023. Cited by 14.
Vision-language model for generating textual descriptions from clinical images: Model development and validation study. J Ji, Y Hou, X Chen, Y Pan, Y Xiang. JMIR Formative Research 8, e32690, 2024. Cited by 10.
LLMs meet long video: Advancing long video comprehension with an interactive visual adapter in LLMs. Y Li, X Chen, B Hu, M Zhang. arXiv preprint arXiv:2402.13546, 2024. Cited by 7.
Cognitive visual-language mapper: Advancing multimodal comprehension with enhanced visual knowledge alignment. Y Li, X Chen, B Hu, H Shi, M Zhang. arXiv preprint arXiv:2402.13561, 2024. Cited by 2.