A survey of GPT-3 family large language models including ChatGPT and GPT-4

KS Kalyan - Natural Language Processing Journal, 2024 - Elsevier
Large language models (LLMs) are a special class of pretrained language models (PLMs)
obtained by scaling model size, pretraining corpus and computation. LLMs, because of their …

An empirical survey on long document summarization: Datasets, models, and metrics

HY Koh, J Ju, M Liu, S Pan - ACM Computing Surveys, 2022 - dl.acm.org
Long documents such as academic articles and business reports have been the standard
format to detail out important issues and complicated subjects that require extra attention. An …

G-Eval: NLG evaluation using GPT-4 with better human alignment

Y Liu, D Iter, Y Xu, S Wang, R Xu, C Zhu - arXiv preprint arXiv:2303.16634, 2023 - arxiv.org
The quality of texts generated by natural language generation (NLG) systems is hard to
measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE …

ChatEval: Towards better LLM-based evaluators through multi-agent debate

CM Chan, W Chen, Y Su, J Yu, W Xue, S Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Text evaluation has historically posed significant challenges, often demanding substantial
labor and time cost. With the emergence of large language models (LLMs), researchers …

RAGAs: Automated evaluation of retrieval augmented generation

S Es, J James, LE Anke… - Proceedings of the 18th …, 2024 - aclanthology.org
We introduce RAGAs (Retrieval Augmented Generation Assessment), a framework
for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAGAs is …

Is ChatGPT a good NLG evaluator? A preliminary study

J Wang, Y Liang, F Meng, Z Sun, H Shi, Z Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, the emergence of ChatGPT has attracted wide attention from the computational
linguistics community. Many prior studies have shown that ChatGPT achieves remarkable …

News summarization and evaluation in the era of GPT-3

T Goyal, JJ Li, G Durrett - arXiv preprint arXiv:2209.12356, 2022 - arxiv.org
The recent success of zero- and few-shot prompting with models like GPT-3 has led to a
paradigm shift in NLP research. In this paper, we study its impact on text summarization …

Benchmarking foundation models with language-model-as-an-examiner

Y Bai, J Ying, Y Cao, X Lv, Y He… - Advances in …, 2023 - proceedings.neurips.cc
Numerous benchmarks have been established to assess the performance of foundation
models on open-ended question answering, which serves as a comprehensive test of a …

Towards a unified multi-dimensional evaluator for text generation

M Zhong, Y Liu, D Yin, Y Mao, Y Jiao, P Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural
Language Generation (NLG), i.e., evaluating the generated text from multiple explainable …

BARTScore: Evaluating generated text as text generation

W Yuan, G Neubig, P Liu - Advances in neural information …, 2021 - proceedings.neurips.cc
A wide variety of NLP applications, such as machine translation, summarization, and dialog,
involve text generation. One major challenge for these applications is how to evaluate …