Gemini: a family of highly capable multimodal models G Team, R Anil, S Borgeaud, JB Alayrac, J Yu, R Soricut, J Schalkwyk, ... arXiv preprint arXiv:2312.11805, 2023 | 2556 | 2023 |
PaLM 2 technical report R Anil, AM Dai, O Firat, M Johnson, D Lepikhin, A Passos, S Shakeri, ... arXiv preprint arXiv:2305.10403, 2023 | 1579 | 2023 |
Beyond the imitation game: Quantifying and extrapolating the capabilities of language models A Srivastava, A Rastogi, A Rao, AAM Shoeb, A Abid, A Fisch, AR Brown, ... arXiv preprint arXiv:2206.04615, 2022 | 1318 | 2022 |
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context G Team, P Georgiev, VI Lei, R Burnell, L Bai, A Gulati, G Tanzer, ... arXiv preprint arXiv:2403.05530, 2024 | 1031 | 2024 |
BLiMP: The benchmark of linguistic minimal pairs for English A Warstadt, A Parrish, H Liu, A Mohananey, W Peng, SF Wang, ... Transactions of the Association for Computational Linguistics 8, 377-392, 2020 | 470 | 2020 |
Gemma 2: Improving open language models at a practical size G Team, M Riviere, S Pathak, PG Sessa, C Hardin, S Bhupatiraju, ... arXiv preprint arXiv:2408.00118, 2024 | 354 | 2024 |
BBQ: A hand-built bias benchmark for question answering A Parrish, A Chen, N Nangia, V Padmakumar, J Phang, J Thompson, ... arXiv preprint arXiv:2110.08193, 2021 | 324 | 2021 |
Investigating BERT’s Knowledge of Language: Five Analysis Methods with NPIs A Warstadt arXiv preprint arXiv:1909.02597, 2019 | 138 | 2019 |
DataPerf: Benchmarks for data-centric AI development M Mazumder, C Banbury, X Yao, B Karlaš, W Gaviria Rojas, S Diamos, ... Advances in Neural Information Processing Systems 36, 2024 | 136 | 2024 |
QuALITY: Question answering with long input texts, yes! RY Pang, A Parrish, N Joshi, N Nangia, J Phang, A Chen, V Padmakumar, ... arXiv preprint arXiv:2112.08608, 2021 | 126 | 2021 |
Inverse scaling: When bigger isn't better IR McKenzie, A Lyzhov, M Pieler, A Parrish, A Mueller, A Prabhu, ... arXiv preprint arXiv:2306.09479, 2023 | 85 | 2023 |
PaLM 2 Technical Report; 2023 R Anil, AM Dai, O Firat, M Johnson, D Lepikhin, A Passos, S Shakeri, ... arXiv preprint arXiv:2305.10403, 2023 | 59 | 2023 |
DICES dataset: Diversity in conversational AI evaluation for safety L Aroyo, A Taylor, M Diaz, C Homan, A Parrish, G Serapio-García, ... Advances in Neural Information Processing Systems 36, 2024 | 43 | 2024 |
Does putting a linguist in the loop improve NLU data collection? A Parrish, W Huang, O Agha, SH Lee, N Nangia, A Warstadt, K Aggarwal, ... arXiv preprint arXiv:2104.07179, 2021 | 41 | 2021 |
What do NLP researchers believe? Results of the NLP community metasurvey J Michael, A Holtzman, A Parrish, A Mueller, A Wang, A Chen, D Madaan, ... arXiv preprint arXiv:2208.12852, 2022 | 34 | 2022 |
Introducing v0.5 of the AI Safety Benchmark from MLCommons B Vidgen, A Agrawal, AM Ahmed, V Akinwande, N Al-Nuaimi, N Alfaraj, ... arXiv preprint arXiv:2404.12241, 2024 | 31 | 2024 |
A toolbox for surfacing health equity harms and biases in large language models SR Pfohl, H Cole-Lewis, R Sayres, D Neal, M Asiedu, A Dieng, ... Nature Medicine 30 (12), 3590-3600, 2024 | 29 | 2024 |
Two failures of self-consistency in the multi-step reasoning of LLMs A Chen, J Phang, A Parrish, V Padmakumar, C Zhao, SR Bowman, K Cho arXiv preprint arXiv:2305.14279, 2023 | 27 | 2023 |
NOPE: A corpus of naturally-occurring presuppositions in English A Parrish, S Schuster, A Warstadt, O Agha, SH Lee, Z Zhao, SR Bowman, ... arXiv preprint arXiv:2109.06987, 2021 | 24 | 2021 |
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation J Quaye, A Parrish, O Inel, C Rastogi, HR Kirk, M Kahng, E Van Liemt, ... The 2024 ACM Conference on Fairness, Accountability, and Transparency, 388-406, 2024 | 21* | 2024 |