Models are given yare's documentation and tasked with one-shotting a winning bot. Each match consists of 3 rounds. The losing side's bot is replaced with a newly one-shotted bot.
| # | Model | vs 1 | vs 2 | vs 3 | vs 4 |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | | 1 | 3 | 3 |
| 2 | Grok 4.20 | 2 | | 1 | 2 |
| 3 | GPT 5.2 | 0 | 2 | | 2 |
| 4 | Google Gemini 3.1 Pro | 0 | 1 | 1 | |
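
As a rough sanity check, the results can be tallied in a few lines of Python. This is only a sketch: it assumes the table is a cross-table in which each number is the rounds the row's model won against the column's model, and that the ranking is by total rounds won. The names `results`, `check_pairings` and `standings` are made up for illustration.

```python
# Rounds won by the row model against each opponent.
# ASSUMPTION: the table is read as a cross-table (row model vs column model).
results = {
    "Claude Opus 4.6": {"Grok 4.20": 1, "GPT 5.2": 3, "Google Gemini 3.1 Pro": 3},
    "Grok 4.20": {"Claude Opus 4.6": 2, "GPT 5.2": 1, "Google Gemini 3.1 Pro": 2},
    "GPT 5.2": {"Claude Opus 4.6": 0, "Grok 4.20": 2, "Google Gemini 3.1 Pro": 2},
    "Google Gemini 3.1 Pro": {"Claude Opus 4.6": 0, "Grok 4.20": 1, "GPT 5.2": 1},
}

ROUNDS_PER_MATCH = 3  # stated in the post: each match consists of 3 rounds


def check_pairings(results):
    """Verify that both sides of every pairing add up to one full match."""
    for model, row in results.items():
        for opponent, wins in row.items():
            assert wins + results[opponent][model] == ROUNDS_PER_MATCH


def standings(results):
    """Return (model, total rounds won) pairs, highest total first."""
    totals = {model: sum(row.values()) for model, row in results.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


check_pairings(results)
for rank, (model, wins) in enumerate(standings(results), start=1):
    print(rank, model, wins)
```

Under that reading every pairing adds up to 3 rounds, and sorting by total rounds won gives the same order as the table.
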
More thorough testing is coming in a few days with an automated setup. If you want to see specific models compared, mention them in our Community chat.