Building chessLLM: Why I Built an AI Chess Arena for Modern LLMs, How It Works, and What’s Next
Could Claude beat GPT-4 at chess? Does Gemini actually climb an ELO ladder—or just vibe its way through openings? I wanted a real answer, not Reddit lore. So I built chessLLM: a clean, terminal-first benchmark that lets modern language models battle each other (and real engines) in a controlled arena with budgets, telemetry, and receipts.
If you’ve seen my LedgerGlass app, you know how I work: minimal surface area, no mysterious black boxes, and ruthless attention to cost and UX. chessLLM follows the same philosophy—except the “UI” is a beautiful terminal board and a live dashboard that makes evaluation feel like a sport.
TL;DR
- What: A Python CLI that benchmarks LLMs via head-to-head chess games, with smart opponent selection and strict move validation.
- How: Two modes—pure prompts vs agent reasoning with a lightweight analysis toolkit and structured workflows.
- Why: I wanted a scientific, repeatable way to compare models; anecdotes weren’t cutting it.
- Nice bits: Real-time board + stats, PGN logging, historical leaderboards, and built-in cost tracking (because I hate surprises).
- Who: Runs providers I use daily—OpenAI, Anthropic, Google—plus Stockfish for grounding and sanity checks.
- Next: Opening books, tournament brackets, MCTS-style planning, and multi-agent collaboration.
Why I built it (and why this is my kind of project)
I’m a grad student at uOttawa and a part-time software engineer at Canada’s NRC. My weeks are a weird blend of academic rigor, production constraints, and building small tools that actually ship. That lens shaped this project:
- I needed real evaluation, not vibes. If I’m going to say “Model A > Model B for structured reasoning,” I want a dataset, not a hunch.
- I care about costs. I track expenses obsessively (LedgerGlass changed how I think about spend), so chessLLM tells me exactly what each game costs—per model, per move, per run.
- I love terminal UIs. A tight CLI with great ergonomics beats a heavy web stack when iteration speed matters.
- I’m curious about agents—practically. Not “agents that can do everything.” Just… do they help on a bounded task like move selection in chess? Answer: yes, dramatically.
What I set out to build (constraints that kept me honest)
- Terminal-first. Unicode pieces, smooth live updates, useful keyboard shortcuts. No browser. No CSS tantrums.
- Provider-agnostic. One clean interface for multiple vendors. Swap models without rewriting logic.
- Cost-aware. Live budget limits, spending breakdowns, and “stop before my wallet cries” safeguards.
- Engine-grade validation. All moves are verified; Stockfish can be your sparring partner at calibrated ELOs.
- Reproducible. Everything’s logged as PGN with full config so you can rerun, compare, and publish.
From web apps to terminal chess (and why it was addictive)
Most of my recent stuff is web + APIs. Flipping to a terminal app felt like a dopamine cheat code:
- Immediate feedback. Run → see a board animate → watch two AIs argue over the center.
- Focus. No React state or CSS breakpoints—just logic, data, and performance.
- Async that actually sings. Concurrent calls, UI refreshes, and game state all play nicely with modern Python.
- Performance as craft. Shaving milliseconds off rendering and batching calls feels like tuning a game engine.
How chessLLM works
Architecture at a glance
- Core models: BotSpec, GameRecord, LadderStats, LiveState keep data types explicit and debuggable (a rough dataclass sketch follows this list).
- Engine layer: Stockfish handles legality + optional opponent play at specific ELOs.
- LLM layer: Unified clients for OpenAI/Anthropic/Google plus a random baseline.
- Game manager: Orchestrates turns, logs PGNs, tracks timing/costs, and updates the live dashboard.
- Agent system: Optional reasoning pass with lightweight evaluation and scored move selection.
- Storage: SQLite-backed history, leaderboards, and spend analytics.
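To make the “explicit data types” bit concrete, here’s roughly what I mean by dataclass-driven records. The field names below are illustrative placeholders, not the project’s exact schema:

```python
# Illustrative only -- the exact fields of BotSpec/GameRecord aren't shown in
# this post, so treat these attributes as placeholders.
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class BotSpec:
    provider: str               # e.g. "openai", "anthropic", "google", "random"
    model: str                  # provider-specific model name
    use_agent: bool = False     # traditional prompting vs agent mode
    max_cost_usd: float = 1.00  # per-game budget guard


@dataclass
class GameRecord:
    white: BotSpec
    black: BotSpec
    result: str                 # "1-0", "0-1", or "1/2-1/2"
    pgn: str                    # full game text for reproducibility
    cost_usd: float = 0.0
    started_at: datetime = field(default_factory=datetime.utcnow)
```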
Two ways to play
Traditional prompting (default)
- Send FEN → ask for a UCI move in strict format
- Parse → validate → apply → repeat
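The loop itself is about as simple as it sounds. A minimal sketch with python-chess, where ask_model is a stand-in for the unified provider client (not the project’s real API):

```python
# Minimal sketch of the prompt -> parse -> validate -> apply loop.
# `ask_model(prompt) -> str` stands in for the unified provider client.
import chess


def play_prompt_only(ask_model, max_plies: int = 200) -> chess.Board:
    """Run a single game where every move comes from the model."""
    board = chess.Board()
    while not board.is_game_over() and board.ply() < max_plies:
        prompt = (
            f"Position (FEN): {board.fen()}\n"
            "Reply with one legal move in UCI format (e.g. e2e4) and nothing else."
        )
        for _ in range(3):  # small retry budget for unparseable or illegal replies
            reply = ask_model(prompt).strip()
            try:
                move = chess.Move.from_uci(reply)
            except ValueError:
                continue
            if move in board.legal_moves:
                board.push(move)
                break
        else:
            break  # model never produced a legal move; abandon the game
    return board
```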
Agent mode (--use-agent)
- Observe the position with quick heuristics (material, structure, king safety, center, activity)
- Generate all legal moves and score them
- Adjust strategy by phase (opening/middlegame/endgame)
- Pick the move with a confidence score and emit a compact reasoning trace
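In code, the shape is “score everything legal, pick the top, report how sure you are.” The heuristic below is a placeholder to show the pattern, not the agent’s actual evaluation:

```python
# Sketch of the agent-mode shape: score every legal move, pick the best, and
# report a rough confidence. `score_move` is a placeholder heuristic.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}


def score_move(board: chess.Board, move: chess.Move) -> float:
    score = 0.0
    if board.is_capture(move):
        captured = board.piece_at(move.to_square)
        score += PIECE_VALUES.get(captured.piece_type, 1) if captured else 1  # en passant ~ pawn
    if board.gives_check(move):
        score += 0.5
    if board.is_castling(move):
        score += 1.0
    return score


def pick_move(board: chess.Board) -> tuple[chess.Move, float]:
    """Assumes the position has at least one legal move."""
    scored = [(score_move(board, m), m) for m in board.legal_moves]
    scored.sort(key=lambda t: t[0], reverse=True)
    best_score, best_move = scored[0]
    spread = (best_score - scored[-1][0]) or 1.0
    confidence = (best_score - scored[len(scored) // 2][0]) / spread  # crude 0..1 signal
    return best_move, confidence
```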
The analysis toolkit (simple, on purpose)
- Position: center control, piece activity, king safety, pawn structure, development
- Material: standard piece values + imbalance tracking
- Move classes: captures, checks, castling, development, tactical, defensive
- Endgames: king activity, pawn race progress, coordination
- Threats: detect/defend simple tactics without overfitting
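“Simple, on purpose” really does mean simple. A sketch of the flavor (material balance plus attacks on the four central squares), with arbitrary placeholder weights rather than the toolkit’s real ones:

```python
# Rough sketch of the "simple on purpose" style: material balance plus central
# presence, scored from White's point of view. Weights are arbitrary placeholders.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9}
CENTER = [chess.D4, chess.E4, chess.D5, chess.E5]


def evaluate(board: chess.Board) -> float:
    score = 0.0
    # Material balance
    for piece_type, value in PIECE_VALUES.items():
        score += value * (len(board.pieces(piece_type, chess.WHITE))
                          - len(board.pieces(piece_type, chess.BLACK)))
    # Center control: who attacks the four central squares
    for sq in CENTER:
        score += 0.1 * (len(board.attackers(chess.WHITE, sq))
                        - len(board.attackers(chess.BLACK, sq)))
    return score
```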
Real-time terminal UI
- Unicode board with last-move highlights and coordinates
- Live ELO, win-rate, timers, token/cost meters
- Opponent picker: random, 600-ELO Stockfish, or full ladder climbing
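This isn’t the project’s actual renderer, but it shows how little rich needs to turn a Unicode board plus a stats panel into a live view:

```python
# Not the real chessLLM renderer -- just how a rich Live view can pair a
# Unicode board with a small stats panel.
import time

import chess
from rich.live import Live
from rich.table import Table


def render(board: chess.Board, cost_usd: float) -> Table:
    table = Table(title="chessLLM")
    table.add_column("Board")
    table.add_column("Stats")
    stats = (f"move: {board.fullmove_number}\n"
             f"turn: {'white' if board.turn else 'black'}\n"
             f"cost: ${cost_usd:.4f}")
    table.add_row(board.unicode(borders=True), stats)
    return table


board = chess.Board()
with Live(render(board, 0.0), refresh_per_second=4) as live:
    for i, uci in enumerate(["e2e4", "e7e5", "g1f3", "b8c6"]):
        board.push_uci(uci)
        live.update(render(board, cost_usd=0.0005 * (i + 1)))
        time.sleep(0.5)
```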
Performance analytics
- Historical leaderboards + ELO deltas
- PGN logs with metadata
- Timing per move and per provider
- Cost reports that make budgeting easy to reason about
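Under the hood this is just SQLite. The table and column names below are assumptions rather than the real schema (with result stored from the listed model’s perspective), but the reporting looks roughly like this:

```python
# Sketch of the kind of query behind a leaderboard / cost report.
# The `games` table and its columns are assumptions, not chessLLM's real schema.
import sqlite3

SQL = """
SELECT model,
       COUNT(*)                                             AS games,
       AVG(CASE WHEN result = 'win' THEN 1.0 ELSE 0.0 END)  AS win_rate,
       SUM(cost_usd)                                        AS total_cost
FROM games
GROUP BY model
ORDER BY win_rate DESC;
"""

with sqlite3.connect("chessllm.db") as conn:
    for model, games, win_rate, total_cost in conn.execute(SQL):
        print(f"{model:<28} {games:>4} games  {win_rate:6.1%}  ${total_cost:.2f}")
```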
Patterns that held up
AsyncIO that stays readable
- Concurrent provider calls without starving the UI
- Graceful recovery if a model suggests an illegal move or times out
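The pattern, sketched with placeholder names (ask_model_async and refresh_ui are not the project’s real API): await the provider with a timeout, retry a couple of times on bad output, and keep the dashboard task running the whole time.

```python
# Sketch of the async shape: await the provider with a timeout, retry on
# illegal or unparseable output, and never block the UI refresh task.
import asyncio

import chess


async def get_valid_move(board: chess.Board, ask_model_async, retries: int = 3) -> chess.Move:
    for _ in range(retries):
        try:
            reply = await asyncio.wait_for(ask_model_async(board.fen()), timeout=30)
            move = chess.Move.from_uci(reply.strip())
            if move in board.legal_moves:
                return move
        except (asyncio.TimeoutError, ValueError):
            pass  # timeout or garbage move: try again
    # Graceful fallback so the game keeps going
    return next(iter(board.legal_moves))


async def game_loop(board: chess.Board, ask_model_async, refresh_ui):
    ui_task = asyncio.create_task(refresh_ui(board))   # dashboard keeps updating
    while not board.is_game_over():
        board.push(await get_valid_move(board, ask_model_async))
    ui_task.cancel()
```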
Terminal UIs can be beautiful
- rich makes dashboards feel like a real app, not a log dump
- Frame rates that feel smooth but not noisy

LLM chess behavior emerges fast
- Tiny format tweaks can make/break parsing stability
- Context management matters with long games
- Agent > raw prompts once positions get even slightly tactical
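One example of a “tiny format tweak” that bought a lot of stability: instead of expecting a bare move, pull the first UCI-looking token out of a possibly chatty reply. The regex below is illustrative, not necessarily what chessLLM ships:

```python
# Illustrative parser hardening: extract the first UCI-looking token from a
# reply instead of assuming the model returned a bare move.
import re

UCI_RE = re.compile(r"\b([a-h][1-8][a-h][1-8][qrbn]?)\b", re.IGNORECASE)


def extract_uci(reply: str) -> str | None:
    match = UCI_RE.search(reply)
    return match.group(1).lower() if match else None


assert extract_uci("Best move: E2E4, controlling the center.") == "e2e4"
assert extract_uci("I resign.") is None
```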
Engines keep us honest
- Stockfish is a rock—fast, fair, and great for grounding difficulty
- ELO calibration gives “levels” that are meaningful, not arbitrary
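For calibrated strength, the common route (and the one I’d reach for, though I’m not claiming it’s exactly what the code does) is python-chess plus Stockfish’s own UCI options. Note that recent Stockfish builds clamp UCI_Elo well above 600, so very weak “levels” are usually approximated with the Skill Level option instead:

```python
# One common way to get calibrated strength (not necessarily how chessLLM does it):
# python-chess + Stockfish's UCI_LimitStrength / UCI_Elo options.
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # path to your binary
engine.configure({"UCI_LimitStrength": True, "UCI_Elo": 1400})
# engine.configure({"Skill Level": 0})  # crudest setting for a very weak opponent

board = chess.Board()
result = engine.play(board, chess.engine.Limit(time=0.1))
board.push(result.move)
engine.quit()
```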
Python patterns I reach for
- Dataclasses for clarity and serialization
- Type hints across async boundaries
- Factory methods for model clients and agents
- Env-driven config + CLI flags for everything that matters
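The factory bit, sketched; the class names, spec format, and env vars here are illustrative rather than the project’s actual ones:

```python
# Sketch of the "factory for model clients" idea; class names, spec format,
# and env vars are illustrative, not the project's actual ones.
import os
import random
from dataclasses import dataclass

import chess


class RandomClient:
    """Baseline bot: plays a uniformly random legal move."""

    def ask(self, fen: str) -> str:
        board = chess.Board(fen)
        return random.choice(list(board.legal_moves)).uci()


@dataclass
class OpenAIClient:
    model: str
    api_key: str

    def ask(self, fen: str) -> str:
        raise NotImplementedError("call the provider SDK here")


def make_client(spec: str):
    """spec looks like 'openai:gpt-4o' or just 'random'."""
    provider, _, model = spec.partition(":")
    if provider == "openai":
        return OpenAIClient(model, api_key=os.environ["OPENAI_API_KEY"])
    if provider == "random":
        return RandomClient()
    raise ValueError(f"unknown provider: {provider}")
```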
The agent “oh wow” moment
Side-by-side runs made it obvious:
Prompts only (struggles)
- Illegal moves happen more than you’d expect
- Decisions feel unprincipled in messy middlegames
- Notation/formatting errors are a tax you keep paying
Agent mode (wins)
- Structured scoring leads to saner, repeatable choices
- Quick threat checks catch blunders before they happen
- Confidence scores + traces make debugging feel like analysis, not guesswork
I ship tools for people who care about control and observability; this was the first time I felt that same “observability” applied to a model’s thinking in a game loop.
Roadmap
Short-term (next month)
- Opening book support
- Elimination brackets with seeds
- Custom evaluation hooks
- Batch runs for large experiments
Quarter-scale
- Stronger evaluation via Leela-style nets
- MCTS-like lookahead for the agent path
- Multi-agent collaboration (committee of styles)
- Deeper stats: opening repertoires, tactical motifs, endgames
Exploratory
- Learn-from-own-games mode
- “Play like X” style transfer
- Puzzle-only benchmarks
- Real-time commentary streams
- Cross-domain strategy experiments (Go, poker, etc.)
