Pre-trained language models for text generation: A survey

J Li, T Tang, WX Zhao, JY Nie, JR Wen - ACM Computing Surveys, 2024 - dl.acm.org
Text Generation aims to produce plausible and readable text in human language from input
data. The resurgence of deep learning has greatly advanced this field, in particular, with the …

Bridging the gap: A survey on integrating (human) feedback for natural language generation

P Fernandes, A Madaan, E Liu, A Farinhas… - Transactions of the …, 2023 - direct.mit.edu
Natural language generation has witnessed significant advancements due to the training of
large language models on vast internet-scale datasets. Despite these advancements, there …

Holistic evaluation of language models

R Bommasani, P Liang, T Lee - … of the New York Academy of …, 2023 - Wiley Online Library
Language models (LMs) like GPT-3, PaLM, and ChatGPT are the foundation for
almost all major language technologies, but their capabilities, limitations, and risks are not …

NusaCrowd: Open source initiative for Indonesian NLP resources

S Cahyawijaya, H Lovenia, AF Aji… - Findings of the …, 2023 - aclanthology.org
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for
Indonesian languages, including opening access to previously non-public resources …

Prometheus 2: An open source language model specialized in evaluating other language models

S Kim, J Suk, S Longpre, BY Lin, J Shin… - arXiv preprint arXiv …, 2024 - arxiv.org
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from
various LMs. However, concerns including transparency, controllability, and affordability …

Evaluating general-purpose AI with psychometrics

X Wang, L Jiang, J Hernandez-Orallo… - arXiv preprint arXiv …, 2023 - arxiv.org
Comprehensive and accurate evaluation of general-purpose AI systems such as large
language models allows for effective mitigation of their risks and deepened understanding of …

ChEF: A comprehensive evaluation framework for standardized assessment of multimodal large language models

Z Shi, Z Wang, H Fan, Z Yin, L Sheng, Y Qiao… - arXiv preprint arXiv …, 2023 - arxiv.org
Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting
with visual content with myriad potential downstream tasks. However, even though a list of …

Dolphin: A Challenging and Diverse Benchmark for Arabic NLG

A Elmadany, A El-Shangiti… - Findings of the …, 2023 - aclanthology.org
We present Dolphin, a novel benchmark that addresses the need for a natural language
generation (NLG) evaluation framework dedicated to the wide collection of Arabic …

Measuring the measuring tools: An automatic evaluation of semantic metrics for text corpora

G Kour, S Ackerman, O Raz, E Farchi, B Carmeli… - arXiv preprint arXiv …, 2022 - arxiv.org
The ability to compare the semantic similarity between text corpora is important in a variety
of natural language processing applications. However, standard methods for evaluating …

LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English

T Santosh, C Weiss, M Grabmair - arXiv preprint arXiv:2410.09527, 2024 - arxiv.org
In the evolving NLP landscape, benchmarks serve as yardsticks for gauging progress.
However, existing Legal NLP benchmarks only focus on predictive tasks, overlooking …