Benchmark explorer

Every model & condition

Filter and sort the full benchmark. The funnel runs valid → correct → works; pitfall% is computed over the 28 known-bug prompts. All numbers come straight from the published data, which is embedded in the page below.
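As a rough illustration of how these columns could be filtered, sorted, and turned into a percentage, here is a minimal Python sketch. Everything in it beyond the count of 28 known-bug prompts is an assumption: the `Row` fields, the helper names, the reading of pitfall% as "share of known-bug prompts where the pitfall was hit", and the reading of the funnel as nested counts are all hypothetical, and the sample rows are invented placeholders, not benchmark results.

```python
from dataclasses import dataclass

# From the page text: pitfall% is taken over the 28 known-bug prompts.
KNOWN_BUG_PROMPTS = 28


@dataclass
class Row:
    """One model/condition pair. All field names are hypothetical."""
    model: str
    condition: str
    valid: int     # assumed: count of outputs that pass validation
    correct: int   # assumed: count of valid outputs that are also correct
    works: int     # assumed: count of correct outputs that also work
    pitfalls: int  # assumed: known-bug prompts where the pitfall was hit


def pitfall_pct(row: Row) -> float:
    """Percentage of the known-bug prompts on which the pitfall occurred."""
    return 100.0 * row.pitfalls / KNOWN_BUG_PROMPTS


def is_funnel(row: Row) -> bool:
    """Check the assumed nesting: each funnel stage is no larger than the one before."""
    return row.valid >= row.correct >= row.works >= 0


def explore(rows, condition=None, key="works", descending=True):
    """Keep rows matching `condition` (if given), then sort by the named column."""
    selected = [r for r in rows if condition is None or r.condition == condition]
    return sorted(selected, key=lambda r: getattr(r, key), reverse=descending)


# Invented placeholder rows, for illustration only.
SAMPLE_ROWS = [
    Row("model-a", "baseline", valid=90, correct=70, works=60, pitfalls=7),
    Row("model-b", "baseline", valid=95, correct=80, works=75, pitfalls=14),
    Row("model-a", "hinted", valid=92, correct=85, works=80, pitfalls=0),
]

if __name__ == "__main__":
    for row in explore(SAMPLE_ROWS, condition="baseline"):
        print(row.model, row.condition, row.works, f"{pitfall_pct(row):.1f}%")
```

Sorting on a column name keeps the sketch close to what a table explorer does: one filter, one sort key, one direction. If the real data defines the funnel stages as rates rather than counts, only `is_funnel` and the field types would need to change.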
