The game is simple: 9 units vs. 9 units fight each other on a game board with obstacles and healing pods.
The LLMs iterate on their code using an ASCII representation of the game state.
Players write JavaScript code that controls the units and accesses game info. The only actions a unit can take are move() and pew().
All of the complexity emerges from having to reason about where to move and whom to shoot (pew).
```js
// Euclidean distance between two [x, y] positions.
function dist(a, b) {
  return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

// Return the closest visible enemy, or null if none are in sight.
function closest_enemy(cat) {
  let best = null;
  let best_dist = Infinity;
  for (let id of cat.sight.enemies) {
    let d = dist(cat.position, cats[id].position);
    if (d < best_dist) {
      best_dist = d;
      best = cats[id];
    }
  }
  return best;
}

// Each cat charges the nearest enemy it can see and shoots it.
for (let cat of my_cats) {
  let enemy = closest_enemy(cat);
  if (enemy) {
    cat.move(enemy.position);
    cat.pew(enemy);
  }
}
```
| Field | Type |
| --- | --- |
| id | string |
| position | array |
| energy_capacity | number |
| energy | number |
| hp | number |
| sight | object |
| … | … |
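These fields are enough to layer simple tactics on top of the charge-and-shoot bot above. Here is a minimal sketch reusing `dist` and `closest_enemy` from the earlier example; the pod coordinates, the hp threshold of 30, and the assumption that pew() spends energy are all illustrative guesses, not the game's real values or API:

```js
// Hypothetical pod locations -- the real way to discover healing pods
// (e.g. via sight) isn't shown in the field list above.
const pods = [[2, 2], [12, 12]];

// Nearest position from a list of [x, y] candidates.
function closest_position(from, positions) {
  let best = null, best_dist = Infinity;
  for (let p of positions) {
    const d = dist(from, p);
    if (d < best_dist) { best_dist = d; best = p; }
  }
  return best;
}

for (let cat of my_cats) {
  const enemy = closest_enemy(cat);
  if (cat.hp < 30) {
    // Damaged: retreat to the nearest healing pod (threshold is a guess).
    cat.move(closest_position(cat.position, pods));
  } else if (enemy) {
    cat.move(enemy.position);
  }
  // Shoot whenever a target is visible and energy remains
  // (assuming pew() draws on the energy field).
  if (enemy && cat.energy > 0) cat.pew(enemy);
}
```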
Each model iterates 10 times against a reference bot (Clowder Bot): writing code, playing a game, then reviewing the replay (ASCII board snapshots plus its own logs) before trying again. The resulting bots then compete in a 10-game round-robin tournament, with the same write → play → review loop between games.
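In rough pseudocode, the per-model loop looks like this. This is a sketch of the harness, not the real code: `llm.generate`, `playMatch`, `formatReplay`, `rulesAndApiDocs`, and `clowderBot` are hypothetical stand-ins, and the actual prompt contents aren't shown here.

```js
// Hypothetical harness sketch -- these names are illustrative,
// not the project's real API.
let context = rulesAndApiDocs; // rules, the unit schema, and the move()/pew() API
let botCode = null;

for (let round = 0; round < 10; round++) {
  botCode = await llm.generate(context);               // write code
  const replay = await playMatch(botCode, clowderBot); // play vs. the reference bot
  context += formatReplay(replay);                     // review: ASCII snapshots + own logs
}
```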
Gemini 3.1 Pro dominated, comfortably beating all other LLMs and dropping only 4 of its 50 games. Claude Sonnet 4.6 surprisingly outperformed Opus 4.6 across every matchup format we tested. GPT-5.3 Codex improved markedly over many games, climbing above both Opus and GPT-5.4 in the 10-game format.
Join our Discord to discuss these results, suggest improvements to the testing methodology, or tell us which other models you'd like us to cover.