From generation to judgment: Opportunities and challenges of LLM-as-a-judge

D Li, B Jiang, L Huang, A Beigi, C Zhao, Z Tan… - arXiv preprint arXiv …, 2024 - arxiv.org
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …
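
A minimal sketch of the judging pattern the survey covers: an LLM scores a candidate answer against a rubric instead of relying on exact matching. The prompt wording and the `judge` helper are illustrative, not taken from the paper.

```python
# LLM-as-a-judge sketch: the model rates an answer instead of string matching.
import re
from typing import Callable

JUDGE_TEMPLATE = """You are an impartial judge. Rate the answer from 1 to 10
for correctness and helpfulness. Reply with only the number.

Question: {question}
Answer: {answer}
Rating:"""

def judge(question: str, answer: str, call_llm: Callable[[str], str]) -> int:
    """Score one answer; `call_llm` is any text-in/text-out model client."""
    reply = call_llm(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else 0

# Usage with a stand-in client that always replies "9":
print(judge("What is 2+2?", "4", lambda prompt: "9"))  # -> 9
```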

Authorship attribution in the era of LLMs: Problems, methodologies, and challenges

B Huang, C Chen, K Shu - ACM SIGKDD Explorations Newsletter, 2025 - dl.acm.org
Accurate attribution of authorship is crucial for maintaining the integrity of digital content,
improving forensic investigations, and mitigating the risks of misinformation and plagiarism …
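
For contrast with the LLM-era methods the survey discusses, a classical stylometric baseline is easy to sketch: character n-gram features with a linear classifier. The toy corpus below is invented.

```python
# Classical authorship-attribution baseline: character n-gram TF-IDF
# features plus logistic regression (toy training data, for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["I reckon the weather will turn.", "The weather will turn, I reckon.",
         "Results indicate a significant effect.", "The effect is significant."]
authors = ["A", "A", "B", "B"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # sub-word style cues
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, authors)
print(clf.predict(["I reckon the effect will turn."]))
```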

DataComp-LM: In search of the next generation of training sets for language models

J Li, A Fang, G Smyrnis, M Ivgi, M Jordan… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset
experiments with the goal of improving language models. As part of DCLM, we provide a …

A survey of multimodal large language model from a data-centric perspective

T Bai, H Liang, B Wan, Y Xu, X Li, S Li, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …

Language models scale reliably with over-training and on downstream tasks

SY Gadre, G Smyrnis, V Shankar, S Gururangan… - arXiv preprint arXiv …, 2024 - arxiv.org
Scaling laws are useful guides for derisking expensive training runs, as they predict
performance of large models using cheaper, small-scale experiments. However, there …
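
The extrapolation workflow the abstract alludes to can be sketched by fitting a saturating power law to small-run losses and evaluating it at a larger scale. The Chinchilla-style form L(N) = E + A/N^alpha is a common assumption, not necessarily this paper's exact parameterization, and the data points are synthetic.

```python
# Fit L(N) = E + A / N**alpha to small-model losses, then extrapolate.
import numpy as np
from scipy.optimize import curve_fit

def power_law(N, E, A, alpha):
    return E + A / N**alpha

# Synthetic (parameter-count, loss) points standing in for small-scale runs.
N = np.array([1e7, 3e7, 1e8, 3e8])
L = power_law(N, E=1.8, A=500.0, alpha=0.35) \
    + np.random.default_rng(0).normal(0, 0.01, 4)

(E, A, alpha), _ = curve_fit(power_law, N, L, p0=[1.0, 100.0, 0.3])
print(f"predicted loss at N=1e10: {power_law(1e10, E, A, alpha):.3f}")
```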

Eagle and Finch: RWKV with matrix-valued states and dynamic recurrence

B Peng, D Goldstein, Q Anthony, A Albalak… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the
RWKV (RWKV-4) architecture. Our architectural design advancements include multi-headed …
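
"Matrix-valued states" refers to recurrences that carry a d×d state instead of a vector. A heavily simplified single-head sketch, omitting RWKV's gating, token-shift, and the dynamic per-step decay that Finch adds:

```python
# Matrix-valued recurrent state in the spirit of linear-attention RNNs
# (greatly simplified relative to the actual RWKV-5/6 formulation).
import numpy as np

d = 8                      # head dimension
S = np.zeros((d, d))       # matrix-valued state instead of a vector
w = np.full(d, 0.9)        # per-channel decay (fixed here; dynamic in Finch)

rng = np.random.default_rng(0)
for t in range(16):                       # process a toy sequence
    k, v, r = rng.normal(size=(3, d))     # key, value, receptance at step t
    S = np.diag(w) @ S + np.outer(k, v)   # decay old state, write outer(k, v)
    y = r @ S                             # read out with receptance
print(y.shape)  # (8,)
```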

ChatQA: Surpassing GPT-4 on conversational QA and RAG

Z Liu, W Ping, R Roy, P Xu, C Lee… - The Thirty-eighth …, 2024 - openreview.net
In this work, we introduce ChatQA, a suite of models that outperform GPT-4 on retrieval-
augmented generation (RAG) and conversational question answering (QA). To enhance …
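
A generic retrieve-then-generate loop (not ChatQA's actual method) shows the pipeline shape being evaluated; the corpus and the TF-IDF ranking below are purely illustrative.

```python
# Minimal RAG sketch: rank passages by TF-IDF similarity, then pack the
# top hit into the generation prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["RWKV is a recurrent architecture.", "DCLM is a dataset testbed.",
          "ChatQA targets conversational QA."]
question = "What does ChatQA target?"

vec = TfidfVectorizer().fit(corpus + [question])
sims = cosine_similarity(vec.transform([question]), vec.transform(corpus))[0]
context = corpus[sims.argmax()]

prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
print(prompt)  # would be sent to the generator model
```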

Scaling laws for precision

T Kumar, Z Ankner, BF Spector, B Bordelon… - arXiv preprint arXiv …, 2024 - arxiv.org
Low precision training and inference affect both the quality and cost of language models, but
current scaling laws do not account for this. In this work, we devise "precision-aware" scaling …
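
One illustrative way to make a scaling law "precision-aware" is to let low training precision shrink the effective parameter count. The functional form and constants below are made up for illustration and are not the paper's fitted law.

```python
# Illustrative precision-aware loss model: low precision P (in bits) is
# treated as reducing the effective parameter count N_eff.
import math

def loss(N, D, P, E=1.7, A=400.0, B=1e5, alpha=0.34, beta=0.28, gamma=4.0):
    N_eff = N * (1 - math.exp(-P / gamma))   # fewer "useful" params at low P
    return E + A / N_eff**alpha + B / D**beta

for P in (16, 8, 4):  # bits of training precision
    print(P, round(loss(N=1e9, D=2e10, P=P), 4))  # loss rises as P drops
```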

Entropy law: The story behind data compression and LLM performance

M Yin, C Wu, Y Wang, H Wang, W Guo, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Data is the cornerstone of large language models (LLMs), but not all data is useful for model
learning. Carefully selected data can better elicit the capabilities of LLMs with much less …
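
The link the paper draws between compression and data quality suggests a cheap redundancy probe: compress a corpus and inspect the ratio. zlib here is an illustrative stand-in, not the paper's exact measure.

```python
# Compressibility as a redundancy signal: highly repetitive text compresses
# to a small fraction of its raw size; diverse text barely compresses.
import zlib

def compression_ratio(text: str) -> float:
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, level=9)) / len(raw)

redundant = "the cat sat on the mat. " * 50
diverse = "Quantum moss hums beneath forgotten violet harbors at dawn."
print(f"redundant: {compression_ratio(redundant):.2f}")  # low ratio
print(f"diverse:   {compression_ratio(diverse):.2f}")    # higher ratio
```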

Training on the test task confounds evaluation and emergence

R Dominguez-Olmedo, FE Dorner, M Hardt - arXiv preprint arXiv …, 2024 - arxiv.org
We study a fundamental problem in the evaluation of large language models that we call
training on the test task. Unlike wrongful practices like training on the test data, leakage, or …