Chatbot Arena: An open platform for evaluating LLMs by human preference

WL Chiang, L Zheng, Y Sheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have unlocked new capabilities and applications; however,
evaluating the alignment with human preferences still poses significant challenges. To …

Generalization or memorization: Data contamination and trustworthy evaluation for large language models

Y Dong, X Jiang, H Liu, Z Jin, B Gu, M Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent statements about the impressive capabilities of large language models (LLMs) are
usually supported by evaluating on open-access benchmarks. Considering the vast size and …

Spiking-PhysFormer: Camera-based remote photoplethysmography with parallel spike-driven transformer

M Liu, J Tang, Y Chen, H Li, J Qi, S Li, K Wang, J Gan… - Neural Networks, 2025 - Elsevier
Artificial neural networks (ANNs) can help camera-based remote photoplethysmography
(rPPG) in measuring cardiac activity and physiological signals from facial videos, such as …

Key-point-driven data synthesis with its enhancement on mathematical reasoning

Y Huang, X Liu, Y Gong, Z Gou, Y Shen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have shown great potential in complex reasoning tasks, yet
their performance is often hampered by the scarcity of high-quality, reasoning-focused …

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

J Li, G Li, X Zhang, Y Dong, Z Jin - arXiv preprint arXiv:2404.00599, 2024 - arxiv.org
How to evaluate Large Language Models (LLMs) in code generation is an open question.
Existing benchmarks demonstrate poor alignment with real-world code repositories and are …

LiveCodeBench: Holistic and contamination free evaluation of large language models for code

N Jain, K Han, A Gu, WD Li, F Yan, T Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) applied to code-related applications have emerged as a
prominent field, attracting significant interest from both academia and industry. However, as …

Can Language Models Solve Olympiad Programming?

Q Shi, M Tang, K Narasimhan, S Yao - arXiv preprint arXiv:2404.10952, 2024 - arxiv.org
Computing olympiads contain some of the most challenging problems for humans, requiring
complex algorithmic reasoning and puzzle solving, in addition to generating efficient code …

Benchmark Data Contamination of Large Language Models: A Survey

C Xu, S Guan, D Greene, M Kechadi - arXiv preprint arXiv:2406.04244, 2024 - arxiv.org
The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and
Gemini has transformed the field of natural language processing. However, it has also …

Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs

Z Zeng, P Chen, H Jiang, J Jia - arXiv preprint arXiv:2312.17080, 2023 - arxiv.org
In this work, we introduce a novel evaluation paradigm for Large Language Models, one that
challenges them to engage in meta-reasoning. This approach addresses critical …

Real-time Fake News from Adversarial Feedback

S Chen, Y Huang, B Dhingra - arXiv preprint arXiv:2410.14651, 2024 - arxiv.org
We show that existing evaluations for fake news detection based on conventional sources,
such as claims on fact-checking websites, result in high accuracies over time for LLM-based …