GPT-4o system card A Hurst, A Lerer, AP Goucher, A Perelman, A Ramesh, A Clark, AJ Ostrow, ... arXiv preprint arXiv:2410.21276, 2024 | 144 | 2024 |
MLE-bench: Evaluating machine learning agents on machine learning engineering JS Chan, N Chowdhury, O Jaffe, J Aung, D Sherburn, E Mays, G Starace, ... arXiv preprint arXiv:2410.07095, 2024 | 17* | 2024 |
AI Sandbagging: Language Models can Strategically Underperform on Evaluations T van der Weij, F Hofstätter, O Jaffe, SF Brown, FR Ward arXiv preprint arXiv:2406.07358, 2024 | 13 | 2024 |
SWE-bench Verified N Chowdhury, J Aung, CJ Shern, O Jaffe, D Sherburn, G Starace, E Mays, ... Aug, 2024 | 4 | 2024 |
Tall tales at different scales: Evaluating scaling trends for deception in language models FR Ward, F Hofstätter, LA Thomson, HM Wood, O Jaffe, P Bartak, ... | 1 | 2023 |
AI Sandbagging: Language Models can Selectively Underperform on Evaluations T van der Weij, F Hofstätter, O Jaffe, SF Brown, FR Ward Workshop on Socially Responsible Language Modelling Research | | |