Tools: Show HN: A Real-time Strategy Game That AI Agents Can Play 2026

It's been great to see the energy in the last year around using games to evaluate LLMs. Yet there's a weird disconnect between frontier LLMs one-shotting full coding projects and those same models struggling to get out of Pokemon Red's Mt. Moon.

We wanted to create an LLM game benchmark that put this generation of frontier LLMs' superpower, coding, on full display. Ten years ago, a team released a game called Screeps, described as an "MMO RTS sandbox for programmers." In Screeps, human players write JavaScript strategies that get executed in the game's environment. Players gain resources, lose territory, and have units wiped out. It's a traditional RTS, but controlled entirely through code.

The Screeps paradigm, writing code and having it execute in a real-time game environment, is well suited for an LLM benchmark. Drawing on a version of the Screeps open source API, LLM Skirmish pits LLMs head-to-head in a series of 1v1 real-time strategy games.
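To make the paradigm concrete, here is a minimal sketch of the kind of per-frame strategy script a player might submit. The names (`Game.creeps`, `harvest`, `transfer`) follow the style of the Screeps API but are assumptions here, not LLM Skirmish's actual interface; a tiny mock of the game state is included so the loop can run standalone.

```javascript
// Minimal mock of the game state, so the strategy loop below runs standalone.
// In the real environment, the engine would provide this object each frame.
const Game = {
  creeps: {
    worker1: {
      store: { energy: 0, capacity: 50 },
      harvest(source) {
        this.store.energy = Math.min(this.store.capacity, this.store.energy + 2);
      },
      transfer(target) {
        target.energy += this.store.energy;
        this.store.energy = 0;
      },
    },
  },
  spawns: { Spawn1: { energy: 0 } },
  sources: [{}],
};

// The strategy itself, executed once per game frame: economic units
// harvest until full, then deliver energy back to the spawn.
function loop() {
  for (const name in Game.creeps) {
    const creep = Game.creeps[name];
    if (creep.store.energy < creep.store.capacity) {
      creep.harvest(Game.sources[0]);
    } else {
      creep.transfer(Game.spawns.Spawn1);
    }
  }
}

// Simulate a few frames of the match.
for (let frame = 0; frame < 30; frame++) loop();
console.log(Game.spawns.Spawn1.energy); // → 50
```

The key property for the benchmark is that the model only writes the `loop` body; everything else is the environment's job, so a model's coding ability maps directly onto its in-game performance.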

1 GPT 5.2 was run with the high reasoning level. Future versions of LLM Skirmish could use the xhigh reasoning level, but in initial test rounds xhigh slowed down rounds without showing notable improvements over high.

2 Gemini 3 Pro's underperformance is driven by rounds 4-5. Explored in detail here.

In LLM Skirmish, each player begins with a "spawn" (a building that can create units), one military unit, and three economic units. The objective of each LLM Skirmish match is to eliminate your opponent's spawn. If a player is not eliminated within 2,000 game frames (each player is allowed up to one second of runtime computation per frame), the game ends and the victor is determined based on score.
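The termination rule above can be sketched as a small function. The field names (`spawnAlive`, `score`) are illustrative, and the source doesn't say how a tied score at the frame cap is resolved, so this sketch calls it a draw.

```javascript
// Sketch of the match-termination rule: a match ends when a spawn is
// destroyed, or at the 2,000-frame cap, where the higher score wins.
const FRAME_LIMIT = 2000;

function matchResult(state) {
  // Elimination: destroying the opponent's spawn wins outright.
  if (!state.p1.spawnAlive) return 'p2';
  if (!state.p2.spawnAlive) return 'p1';
  // Timeout: fall back to score once the frame cap is reached.
  if (state.frame >= FRAME_LIMIT) {
    if (state.p1.score === state.p2.score) return 'draw'; // assumption
    return state.p1.score > state.p2.score ? 'p1' : 'p2';
  }
  return null; // match still in progress
}

console.log(matchResult({
  frame: 2000,
  p1: { spawnAlive: true, score: 120 },
  p2: { spawnAlive: true, score: 95 },
})); // → 'p1'
```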

Every LLM Skirmish tournament consists of five rounds. In each round, each LLM is asked to write a script implementing its strategy. For all rounds after the first, each LLM can see the results of all its matches from the previous round and use that information to make changes to the script it submits for the next round. In every round, every player plays all other players once. This means there are 10 matches per round and 50 matches per tournament.
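The tournament arithmetic follows from a round-robin over five models: C(5,2) = 10 pairings per round, and five rounds give 50 matches per tournament. A quick sketch (model names are placeholders):

```javascript
// Generate all 1v1 pairings for a single round-robin round.
function roundRobinPairs(players) {
  const pairs = [];
  for (let i = 0; i < players.length; i++) {
    for (let j = i + 1; j < players.length; j++) {
      pairs.push([players[i], players[j]]);
    }
  }
  return pairs;
}

const models = ['A', 'B', 'C', 'D', 'E']; // placeholder names
const perRound = roundRobinPairs(models).length;
console.log(perRound, perRound * 5); // → 10 50
```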

LLM Skirmish was conducted using OpenCode, an open source general purpose agentic coding harness. OpenCo

Source: HackerNews