Training verifiers to solve math word problems K Cobbe, V Kosaraju, M Bavarian, M Chen, H Jun, L Kaiser, M Plappert, ... arXiv preprint arXiv:2110.14168, 2021 | 2614 | 2021 |
WebGPT: Browser-assisted question-answering with human feedback R Nakano, J Hilton, S Balaji, J Wu, L Ouyang, C Kim, C Hesse, S Jain, ... arXiv preprint arXiv:2112.09332, 2021 | 1120 | 2021 |
Quantifying generalization in reinforcement learning K Cobbe, O Klimov, C Hesse, T Kim, J Schulman International conference on machine learning, 1282-1289, 2019 | 777 | 2019 |
Leveraging procedural generation to benchmark reinforcement learning K Cobbe, C Hesse, J Hilton, J Schulman International conference on machine learning, 2048-2056, 2020 | 640 | 2020 |
Let's verify step by step H Lightman, V Kosaraju, Y Burda, H Edwards, B Baker, T Lee, J Leike, ... arXiv preprint arXiv:2305.20050, 2023 | 560 | 2023 |
Event scheduling presentation in a graphical user interface environment Y Shoham, JE Bank, K Cobbe, A Matta, M Rubin, ZI Weiner, KT Toft US Patent 10,088,973, 2018 | 230 | 2018 |
Phasic policy gradient KW Cobbe, J Hilton, O Klimov, J Schulman International Conference on Machine Learning, 2020-2027, 2021 | 193 | 2021 |
Training verifiers to solve math word problems, 2021 K Cobbe, V Kosaraju, M Bavarian, M Chen, H Jun, L Kaiser, M Plappert, ... URL https://arxiv.org/abs/2110.14168, 2021 | 164 | 2021 |
OpenAI o1 system card A Jaech, A Kalai, A Lerer, A Richardson, A El-Kishky, A Low, A Helyar, ... arXiv preprint arXiv:2412.16720, 2024 | 27 | 2024 |
Measuring sample efficiency and generalization in reinforcement learning benchmarks: NeurIPS 2020 Procgen benchmark S Mohanty, J Poonganam, A Gaidon, A Kolobov, B Wulfe, D Chakraborty, ... arXiv preprint arXiv:2103.15332, 2021 | 24 | 2021 |
Batch size-invariance for policy optimization J Hilton, K Cobbe, J Schulman Advances in Neural Information Processing Systems 35, 17086-17098, 2022 | 15 | 2022 |