Chatbot Arena: An open platform for evaluating LLMs by human preference
Large Language Models (LLMs) have unlocked new capabilities and applications; however,
evaluating the alignment with human preferences still poses significant challenges. To …
Generalization or memorization: Data contamination and trustworthy evaluation for large language models
Recent statements about the impressive capabilities of large language models (LLMs) are
usually supported by evaluating on open-access benchmarks. Considering the vast size and …
Spiking-PhysFormer: Camera-based remote photoplethysmography with parallel spike-driven transformer
Artificial neural networks (ANNs) can help camera-based remote photoplethysmography
(rPPG) in measuring cardiac activity and physiological signals from facial videos, such as …
Key-point-driven data synthesis with its enhancement on mathematical reasoning
Large language models (LLMs) have shown great potential in complex reasoning tasks, yet
their performance is often hampered by the scarcity of high-quality, reasoning-focused …
EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories
How to evaluate Large Language Models (LLMs) in code generation is an open question.
Existing benchmarks demonstrate poor alignment with real-world code repositories and are …
LiveCodeBench: Holistic and contamination free evaluation of large language models for code
Large Language Models (LLMs) applied to code-related applications have emerged as a
prominent field, attracting significant interest from both academia and industry. However, as …
Can Language Models Solve Olympiad Programming?
Computing olympiads contain some of the most challenging problems for humans, requiring
complex algorithmic reasoning, puzzle solving, in addition to generating efficient code …
Benchmark Data Contamination of Large Language Models: A Survey
The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and
Gemini has transformed the field of natural language processing. However, it has also …
Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs
In this work, we introduce a novel evaluation paradigm for Large Language Models, one that
challenges them to engage in meta-reasoning. This approach addresses critical …
Real-time Fake News from Adversarial Feedback
We show that existing evaluations for fake news detection based on conventional sources,
such as claims on fact-checking websites, result in high accuracies over time for LLM-based …