Foundations & trends in multimodal machine learning: Principles, challenges, and open questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
Minicpm-v: A gpt-4v level mllm on your phone
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally
reshaped the landscape of AI research and industry, shedding light on a promising path …
reshaped the landscape of AI research and industry, shedding light on a promising path …
Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?
The remarkable progress of Multi-modal Large Language Models (MLLMs) has gained
unparalleled attention. However, their capabilities in visual math problem-solving remain …
unparalleled attention. However, their capabilities in visual math problem-solving remain …
Theoremqa: A theorem-driven question answering dataset
The recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in solving
fundamental math problems like GSM8K by achieving over 90% accuracy. However, their …
fundamental math problems like GSM8K by achieving over 90% accuracy. However, their …
A survey of deep learning for mathematical reasoning
Mathematical reasoning is a fundamental aspect of human intelligence and is applicable in
various fields, including science, engineering, finance, and everyday life. The development …
various fields, including science, engineering, finance, and everyday life. The development …
RS5M and GeoRSCLIP: A large scale vision-language dataset and a large vision-language model for remote sensing
Pretrained vision-language models (VLMs) utilizing extensive image–text paired data have
demonstrated unprecedented image–text association capabilities, achieving remarkable …
demonstrated unprecedented image–text association capabilities, achieving remarkable …
Large language models for mathematical reasoning: Progresses and challenges
Mathematical reasoning serves as a cornerstone for assessing the fundamental cognitive
capabilities of human intelligence. In recent times, there has been a notable surge in the …
capabilities of human intelligence. In recent times, there has been a notable surge in the …
Document understanding dataset and evaluation (dude)
We call on the Document AI (DocAI) community to re-evaluate current methodologies and
embrace the challenge of creating more practically-oriented benchmarks. Document …
embrace the challenge of creating more practically-oriented benchmarks. Document …
Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression
Geometry problem solving is a well-recognized testbed for evaluating the high-level multi-
modal reasoning capability of deep models. In most existing works, two main geometry …
modal reasoning capability of deep models. In most existing works, two main geometry …