CORE-bench v1.1
Benchmark for AI agents on scientific reproducibility, with mainline (39) and OOD (19) splits derived from Code Ocean capsules.
agent-evals/core-bench-v1.1-mainline
agent-evals/core-bench-v1.1-ood