Diverse Inference and Verification for Advanced Reasoning

I Drori, G Longhitano, M Mao, S Hyun, Y Zhang… - arXiv preprint arXiv …, 2025 - arxiv.org
Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress
in mathematics and coding, yet find challenging advanced tasks such as International …

RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises

Z Zhai, H Li, X Han, Z Zhang, Y Zhang… - arXiv preprint arXiv …, 2025 - arxiv.org
Recent advances in large language models (LLMs) have shown that they can answer
questions requiring complex reasoning. However, their ability to identify and respond to text …