Yare AI Arena

LLMs playing against each other in a 1v1 game (round-robin tournament)

The game is simple: 9 units vs. 9 units fight each other on a game board with obstacles and healing pods.

Legend: P1 units, P2 units, Barricade + Pod

The LLMs iterate on their code using an ASCII representation of the game state.
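To give a feel for what such an ASCII view might look like, here is a minimal sketch of a board renderer. The function name renderBoard, the grid size, and the symbols (A, B, #, +, .) are all assumptions made for illustration; the format the arena actually feeds to the models is not specified here.

```javascript
// Hypothetical ASCII renderer. The symbols and board size are illustrative
// assumptions, not the arena's real format.
function renderBoard(width, height, p1Units, p2Units, barricades, pods) {
    // Start with an empty grid of '.' cells.
    const grid = Array.from({ length: height }, () => Array(width).fill('.'));

    // Write `symbol` at each [x, y] position that falls inside the grid.
    const place = (positions, symbol) => {
        for (const [x, y] of positions) {
            const col = Math.round(x);
            const row = Math.round(y);
            if (row >= 0 && row < height && col >= 0 && col < width) {
                grid[row][col] = symbol;
            }
        }
    };

    // Units are placed last so they stay visible on top of terrain.
    place(barricades, '#');
    place(pods, '+');
    place(p1Units, 'A');
    place(p2Units, 'B');

    return grid.map(row => row.join('')).join('\n');
}

const board = renderBoard(5, 3, [[0, 0]], [[4, 2]], [[2, 1]], [[2, 0]]);
console.log(board);
// A.+..
// ..#..
// ....B
```

A text grid like this is compact enough to include several snapshots per game in a prompt, which is presumably why a text view is used for the review step.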

Players write JavaScript code that controls the units and reads game info. The only actions a unit can take are move() and pew().

All of the complexity emerges from having to reason about where to move, and whom to shoot (pew).

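// Euclidean distance between two [x, y] positions.
function dist(a, b) {
    return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

// Return the nearest enemy in sight of `cat`, or null if none is visible.
function closest_enemy(cat) {
    let best = null;
    let best_dist = Infinity;
    for (let id of cat.sight.enemies) {
        let d = dist(cat.position, cats[id].position);
        if (d < best_dist) {
            best_dist = d;
            best = cats[id];
        }
    }
    return best;
}

// Every unit chases and shoots its nearest visible enemy.
for (let cat of my_cats) {
    let enemy = closest_enemy(cat);
    if (!enemy) continue; // no enemy in sight: nothing to chase or shoot
    cat.move(enemy.position);
    cat.pew(enemy);
}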
Primitive player code example
Try the game yourself →
id               string
position         array
energy_capacity  number
energy           number
hp               number
sight            object
Unit properties
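As a sketch of how these properties could drive a decision, the snippet below picks the visible enemy with the lowest hp instead of the closest one. The property names come from the list above and sight.enemies is treated as a list of ids, as in the primitive example. The helper name weakestVisibleEnemy and all mock values are assumptions for illustration, and the unit lookup is passed in as an argument so the snippet runs outside the game.

```javascript
// Euclidean distance between two [x, y] positions.
function dist(a, b) {
    return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

// Hypothetical helper: return the visible enemy with the lowest hp,
// breaking ties by distance. Returns null when no enemy is in sight.
function weakestVisibleEnemy(cat, cats) {
    let best = null;
    for (const id of cat.sight.enemies) {
        const enemy = cats[id];
        if (!enemy) continue; // skip ids missing from the lookup
        const weaker = best === null || enemy.hp < best.hp;
        const sameHpButCloser = best !== null && enemy.hp === best.hp &&
            dist(cat.position, enemy.position) < dist(cat.position, best.position);
        if (weaker || sameHpButCloser) {
            best = enemy;
        }
    }
    return best;
}

// Mock units; every value here is made up for the example.
const mockCats = {
    a1: { id: 'a1', position: [0, 0], energy_capacity: 10, energy: 10, hp: 3,
          sight: { enemies: ['b1', 'b2'] } },
    b1: { id: 'b1', position: [3, 4], energy_capacity: 10, energy: 4, hp: 2,
          sight: { enemies: ['a1'] } },
    b2: { id: 'b2', position: [1, 0], energy_capacity: 10, energy: 9, hp: 5,
          sight: { enemies: ['a1'] } },
};

const target = weakestVisibleEnemy(mockCats.a1, mockCats);
console.log(target.id); // b1 (hp 2 beats the closer b2 with hp 5)
```

Concentrating fire on a weak unit is only one possible heuristic; whether it beats chasing the nearest enemy depends on game rules not covered here.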

Testing method

Each model iterates 10 times against a reference bot (Clowder Bot): it writes code, plays a game, then reviews the replay (ASCII board snapshots plus its own logs) before trying again. The resulting bots then compete in a 10-game round-robin tournament, with the same write → play → review loop between games.
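The tournament structure can be sketched as a simple pairing generator. This assumes that "10-game" means 10 games per pairing and that the field is exactly the six models named on this page; both are assumptions, and the function names are hypothetical.

```javascript
// Hypothetical schedule builder: every player meets every other player once,
// and each pairing is played `gamesPerPairing` times.
function roundRobin(players, gamesPerPairing) {
    const pairings = [];
    for (let i = 0; i < players.length; i++) {
        for (let j = i + 1; j < players.length; j++) {
            pairings.push({ a: players[i], b: players[j], games: gamesPerPairing });
        }
    }
    return pairings;
}

// Total number of games each player takes part in.
function gamesPerPlayer(pairings) {
    const totals = {};
    for (const { a, b, games } of pairings) {
        totals[a] = (totals[a] || 0) + games;
        totals[b] = (totals[b] || 0) + games;
    }
    return totals;
}

// Assumed field: the six models mentioned on this page.
const models = ['Gemini 3.1 Pro', 'Sonnet 4.6', 'Opus 4.6',
                'GPT-5.3 Codex', 'GPT-5.4', 'Grok 4.1'];

const pairings = roundRobin(models, 10);
const totals = gamesPerPlayer(pairings);
console.log(pairings.length);          // 15
console.log(totals['Gemini 3.1 Pro']); // 50
```

Under these assumptions each bot plays 50 games, which lines up with the 50 games quoted for Gemini 3.1 Pro in the results.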

Results

Gemini 3.1 Pro dominated, comfortably beating all other LLMs — dropping only 4 games across 50 played. Claude Sonnet 4.6 surprisingly outperformed Opus 4.6 across every matchup format we tested. GPT-5.3 Codex showed strong improvement over many games, climbing above both Opus and GPT-5.4 in the 10-game format.

Highlights

Sonnet 4.6 vs. Gemini 3.1 · Game 4
Sonnet 4.6 vs. Grok 4.1 · Game 4
Opus 4.6 vs. Grok 4.1 · Game 9
GPT-5.3 Codex vs. Gemini 3.1 · Game 1

Discuss these results, suggest improvements to the testing methodology, or propose other models you would like us to cover in our Discord.