Benchmark explorer
Every model & condition
Filter and sort the full benchmark. The funnel runs valid → correct → works; pitfall% is computed over the 28 known-bug prompts. All numbers come straight from the published data, which is embedded below.
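As a rough illustration of what filtering, sorting, and the pitfall% column amount to, here is a minimal Python sketch. The field names (`model`, `condition`, `valid`, `correct`, `works`, `pitfalls`), the helper names, and the sample rows are all hypothetical placeholders, not the published data or the explorer's actual implementation. The only figure taken from the text is the denominator of 28 known-bug prompts. The sketch also assumes the funnel stages nest (works ≤ correct ≤ valid), which the page implies but does not state.

```python
from dataclasses import dataclass

# From the text: pitfall% is computed over the 28 known-bug prompts.
KNOWN_BUG_PROMPTS = 28


@dataclass
class Row:
    """One model/condition result. All field names are hypothetical."""
    model: str
    condition: str
    valid: int      # funnel stage 1
    correct: int    # funnel stage 2
    works: int      # funnel stage 3
    pitfalls: int   # known-bug prompts where the pitfall showed up (assumed meaning)


def pitfall_pct(row: Row) -> float:
    """Pitfall rate as a percentage of the 28 known-bug prompts."""
    return 100.0 * row.pitfalls / KNOWN_BUG_PROMPTS


def is_funnel(row: Row) -> bool:
    """Check the assumed nesting: works <= correct <= valid."""
    return row.works <= row.correct <= row.valid


def explore(rows, condition=None, sort_key="works", descending=True):
    """Filter by condition (if given), then sort by one numeric column."""
    selected = [r for r in rows if condition is None or r.condition == condition]
    return sorted(selected, key=lambda r: getattr(r, sort_key), reverse=descending)


# Made-up placeholder rows, for illustration only.
rows = [
    Row("model-a", "baseline", valid=90, correct=70, works=60, pitfalls=7),
    Row("model-b", "baseline", valid=95, correct=80, works=50, pitfalls=14),
    Row("model-a", "hinted", valid=92, correct=75, works=65, pitfalls=0),
]

for r in explore(rows, condition="baseline"):
    print(r.model, r.condition, r.valid, r.correct, r.works, f"{pitfall_pct(r):.1f}%")
```

Sorting on a different column is a matter of passing another `sort_key`, for example `explore(rows, sort_key="valid")`.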