- Academic Search

A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations

MTR Laskar, S Alqahtani, MS Bari… - Proceedings of the …, 2024 - aclanthology.org

Large Language Models (LLMs) have recently gained significant attention due to
their remarkable capabilities in performing diverse tasks across various domains. However …

Cited by 21; 4 versions

[PDF] arxiv.org

Benchmark data contamination of large language models: A survey

C Xu, S Guan, D Greene, M Kechadi - arXiv preprint arXiv:2406.04244, 2024 - arxiv.org

The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and
Gemini has transformed the field of natural language processing. However, it has also …

Cited by 28; 4 versions

[PDF] openreview.net

Chatbot Arena: An open platform for evaluating LLMs by human preference

WL Chiang, L Zheng, Y Sheng… - … on Machine Learning, 2024 - openreview.net

Large Language Models (LLMs) have unlocked new capabilities and applications; however,
evaluating the alignment with human preferences still poses significant challenges. To …

Cited by 423; 8 versions

[PDF] arxiv.org

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org

This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Cited by 135; 7 versions

[PDF] arxiv.org

Don't make your LLM an evaluation benchmark cheater

K Zhou, Y Zhu, Z Chen, W Chen, WX Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org

Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence,
attaining remarkable improvement in model capacity. To assess the model performance, a …

Cited by 140; 2 versions

[PDF] arxiv.org

LiveCodeBench: Holistic and contamination free evaluation of large language models for code

N Jain, K Han, A Gu, WD Li, F Yan, T Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org

Large Language Models (LLMs) applied to code-related applications have emerged as a
prominent field, attracting significant interest from both academia and industry. However, as …

Cited by 119; 5 versions

[PDF] neurips.cc

LLM Dataset Inference: Did you train on my dataset?

P Maini, H Jia, N Papernot… - Advances in Neural …, 2025 - proceedings.neurips.cc

The proliferation of large language models (LLMs) in the real world has come with a rise in
copyright cases against companies for training their models on unlicensed data from the …

Cited by 27; 6 versions

[PDF] arxiv.org

Bridging language and items for retrieval and recommendation

Y Hou, J Li, Z He, A Yan, X Chen, J McAuley - arXiv preprint arXiv …, 2024 - arxiv.org

This paper introduces BLaIR, a series of pretrained sentence embedding models
specialized for recommendation scenarios. BLaIR is trained to learn correlations between …

Cited by 83; 3 versions

[PDF] aaai.org

Task contamination: Language models may not be few-shot anymore

C Li, J Flanigan - Proceedings of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org

Large language models (LLMs) offer impressive performance in various zero-shot and few-shot
tasks. However, their success in zero-shot or few-shot settings may be affected by task …

Cited by 81; 4 versions

[PDF] arxiv.org

Hallucination-free? Assessing the reliability of leading AI legal research tools

V Magesh, F Surani, M Dahl, M Suzgun… - arXiv preprint arXiv …, 2024 - arxiv.org

Legal practice has witnessed a sharp rise in products incorporating artificial intelligence (AI).
Such tools are designed to assist with a wide range of core legal tasks, from search and …

Cited by 52; 5 versions

Proving test set contamination in black-box language models
