yare.io – AI Arena

The game is simple: 9 units vs. 9 units fight each other on a game board with obstacles and healing pods.

■ P1 units □ P2 units ○ Barricade + Pod

The LLMs iterate on their code using an ASCII representation of the game state.
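To make the idea of an ASCII game state concrete, here is a minimal sketch of how a board could be rendered as text. The grid size, the coordinate convention, and the characters (A and B for the two players' units, # for a barricade, + for a pod, . for empty space) are all assumptions chosen for illustration; the arena's actual snapshot format is not shown here.

```javascript
// Hypothetical renderer: none of these names or symbols come from the game itself.
function renderBoard(width, height, entities) {
    // Start with an empty grid of '.' cells.
    const grid = Array.from({ length: height }, () => Array(width).fill("."));
    for (const { x, y, symbol } of entities) {
        // Skip anything outside the board rather than throwing.
        if (x >= 0 && x < width && y >= 0 && y < height) {
            grid[y][x] = symbol;
        }
    }
    return grid.map((row) => row.join("")).join("\n");
}

const snapshot = renderBoard(6, 3, [
    { x: 0, y: 1, symbol: "A" }, // P1 unit (assumed symbol)
    { x: 5, y: 1, symbol: "B" }, // P2 unit (assumed symbol)
    { x: 3, y: 0, symbol: "#" }, // barricade (assumed symbol)
    { x: 2, y: 2, symbol: "+" }, // pod (assumed symbol)
]);
console.log(snapshot);
```

A text grid like this is easy to include in a prompt, which is presumably why a text representation suits LLM players.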

Players write JavaScript code that controls the units and reads game info. The only actions a unit can take are move() and pew().

All of the complexity emerges from having to reason about where to move, and whom to shoot (pew).

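// Euclidean distance between two [x, y] positions.
function dist(a, b) {
    return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

// Return the nearest enemy in sight of `cat`, or null if none is visible.
function closest_enemy(cat) {
    let best = null;
    let best_dist = Infinity;
    for (const id of cat.sight.enemies) {
        const d = dist(cat.position, cats[id].position);
        if (d < best_dist) {
            best_dist = d;
            best = cats[id];
        }
    }
    return best;
}

for (const cat of my_cats) {
    const enemy = closest_enemy(cat);
    if (!enemy) continue; // no enemy in sight: nothing to chase or shoot
    cat.move(enemy.position);
    cat.pew(enemy);
}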

Primitive player code example
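Deciding whom to shoot is where strategies start to diverge. As one possible variation on the example above, the sketch below picks the visible enemy with the lowest hp instead of the nearest one (a focus-fire heuristic). It relies only on the id, hp, and sight.enemies fields; the function name weakest_enemy and the mock_cats data are hypothetical and exist only so the snippet runs on its own.

```javascript
// Hypothetical helper: choose the visible enemy with the least hp.
// `cats` maps unit ids to unit objects, as in the example above.
function weakest_enemy(cat, cats) {
    let best = null;
    for (const id of cat.sight.enemies) {
        const enemy = cats[id];
        if (best === null || enemy.hp < best.hp) {
            best = enemy;
        }
    }
    return best; // null when no enemy is in sight
}

// Mock data for illustration only; the real game supplies these objects.
const mock_cats = {
    e1: { id: "e1", hp: 7, sight: { enemies: [] } },
    e2: { id: "e2", hp: 3, sight: { enemies: [] } },
    me: { id: "me", hp: 10, sight: { enemies: ["e1", "e2"] } },
};

const target = weakest_enemy(mock_cats.me, mock_cats);
console.log(target.id);
```

Whether focusing the weakest target beats chasing the nearest one depends on the game's range and damage rules, which this sketch does not model.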

Property	Type
id	string
position	array
energy_capacity	number
energy	number
hp	number
sight	object
…	…

Unit properties
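Because each bot can emit its own logs, a compact one-line summary of a unit can be handy when reviewing a game. The sketch below formats the properties listed above. The describe function and the sample values are made up for illustration, and treating position as an [x, y] pair and sight.enemies as a list of ids is inferred from the player code example rather than from the game's documentation.

```javascript
// Hypothetical log helper built on the listed unit properties.
function describe(unit) {
    const [x, y] = unit.position; // assumes position is an [x, y] pair
    return `${unit.id} @ (${x}, ${y}) hp=${unit.hp} ` +
        `energy=${unit.energy}/${unit.energy_capacity} ` +
        `enemies_in_sight=${unit.sight.enemies.length}`;
}

// Sample values are invented for illustration only.
const sample_unit = {
    id: "unit_1",
    position: [120, 80],
    energy_capacity: 10,
    energy: 4,
    hp: 6,
    sight: { enemies: ["unit_9"] },
};

const line = describe(sample_unit);
console.log(line);
```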

Testing method

Each model iterates 10 times against a reference bot (Clowder Bot): it writes code, plays a game, then reviews the replay (ASCII board snapshots plus its own logs) before trying again. The resulting bots then compete in a 10-game round-robin tournament, with the same write → play → review loop between games.
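The tournament structure can be sketched as a schedule generator. This assumes that the 10-game round-robin means every pair of models plays 10 games against each other, and it uses the six models named in this write-up as the roster. Both are assumptions, as is the function name round_robin.

```javascript
// Hypothetical schedule builder: every unordered pair plays `games_per_pair` games.
function round_robin(players, games_per_pair) {
    const schedule = [];
    for (let i = 0; i < players.length; i++) {
        for (let j = i + 1; j < players.length; j++) {
            for (let game = 1; game <= games_per_pair; game++) {
                schedule.push({ p1: players[i], p2: players[j], game });
            }
        }
    }
    return schedule;
}

// Roster assumed from the models mentioned in this write-up.
const models = [
    "Gemini 3.1 Pro", "Sonnet 4.6", "Opus 4.6",
    "GPT-5.3 Codex", "GPT-5.4", "Grok 4.1",
];
const schedule = round_robin(models, 10);

// Count how many games a single model plays.
const gemini_games = schedule.filter(
    (m) => m.p1 === "Gemini 3.1 Pro" || m.p2 === "Gemini 3.1 Pro"
).length;
console.log(schedule.length, gemini_games);
```

Under these assumptions each model plays 5 opponents for 10 games each, which lines up with the "50 played" figure quoted in the results.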

Results

Gemini 3.1 Pro dominated, comfortably beating all other LLMs and dropping only 4 games across 50 played. Claude Sonnet 4.6 surprisingly outperformed Opus 4.6 across every matchup format we tested. GPT-5.3 Codex improved strongly over many games, climbing above both Opus and GPT-5.4 in the 10-game format.

Highlights

Sonnet 4.6 vs. Gemini 3.1 · Game 4

Sonnet 4.6 vs. Grok 4.1 · Game 4

Opus 4.6 vs. Grok 4.1 · Game 9

GPT-5.3 Codex vs. Gemini 3.1 · Game 1

Join our Discord to discuss these results, suggest improvements to the testing methodology, or propose other models you would like us to cover.

LLMs playing against each other in a 1v1 game (round-robin tournament)
