Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

MiniCPM-V: A GPT-4V level MLLM on your phone

Y Yao, T Yu, A Zhang, C Wang, J Cui, H Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally
reshaped the landscape of AI research and industry, shedding light on a promising path …

MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?

R Zhang, D Jiang, Y Zhang, H Lin, Z Guo, P Qiu… - … on Computer Vision, 2024 - Springer
The remarkable progress of Multi-modal Large Language Models (MLLMs) has gained
unparalleled attention. However, their capabilities in visual math problem-solving remain …

TheoremQA: A theorem-driven question answering dataset

W Chen, M Yin, M Ku, P Lu, Y Wan, X Ma… - Proceedings of the …, 2023 - aclanthology.org
The recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in solving
fundamental math problems like GSM8K by achieving over 90% accuracy. However, their …

A survey of deep learning for mathematical reasoning

P Lu, L Qiu, W Yu, S Welleck, KW Chang - arXiv preprint arXiv:2212.10535, 2022 - arxiv.org
Mathematical reasoning is a fundamental aspect of human intelligence and is applicable in
various fields, including science, engineering, finance, and everyday life. The development …

RS5M and GeoRSCLIP: A large scale vision-language dataset and a large vision-language model for remote sensing

Z Zhang, T Zhao, Y Guo, J Yin - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Pretrained vision-language models (VLMs) utilizing extensive image–text paired data have
demonstrated unprecedented image–text association capabilities, achieving remarkable …

Large language models for mathematical reasoning: Progresses and challenges

J Ahn, R Verma, R Lou, D Liu, R Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Mathematical reasoning serves as a cornerstone for assessing the fundamental cognitive
capabilities of human intelligence. In recent times, there has been a notable surge in the …

Document understanding dataset and evaluation (DUDE)

J Van Landeghem, R Tito… - Proceedings of the …, 2023 - openaccess.thecvf.com
We call on the Document AI (DocAI) community to re-evaluate current methodologies and
embrace the challenge of creating more practically-oriented benchmarks. Document …

UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression

J Chen, T Li, J Qin, P Lu, L Lin, C Chen… - arXiv preprint arXiv …, 2022 - arxiv.org
Geometry problem solving is a well-recognized testbed for evaluating the high-level multi-
modal reasoning capability of deep models. In most existing works, two main geometry …