
AI Agent for the Squadro Board Game

Personal Project · 2025

Python · PyTorch · Outperforms Humans

Demo — Author vs AI

I (yellow) play the Squadro board game against a deep reinforcement-learning agent (red) built from scratch in Python.

8+ algorithms implemented · From random to AlphaZero
100 games per matchup · Controlled benchmark
1.8M model parameters · 5-pawn DQL agent
7.1 MB model size on disk · Lightweight inference
About

What is Squadro?

♟️

The Game

Squadro is a two-player board game played on a 5×5 grid. The goal is to complete a round trip with four of your five pawns before your opponent does. Each pawn moves at a speed (1–3 spaces) determined by the dots at its starting position. Landing on an opponent's pawn sends it back to the start.
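A minimal Python sketch of these rules, assuming a hypothetical pawn representation (illustrative only; the squadro package's internals differ):

from dataclasses import dataclass

TRACK_LEN = 6  # illustrative lane length before a pawn turns around

@dataclass
class Pawn:
    speed: int             # 1-3, from the dots at the starting position
    progress: int = 0      # 0 = home, TRACK_LEN = far side
    returning: bool = False

def advance(pawn: Pawn) -> bool:
    """Move a pawn by its speed; turn around at the far side.
    Returns True once the return trip is complete."""
    if pawn.returning:
        pawn.progress = max(0, pawn.progress - pawn.speed)
        return pawn.progress == 0
    pawn.progress += pawn.speed
    if pawn.progress >= TRACK_LEN:
        pawn.progress, pawn.returning = TRACK_LEN, True
    return False

def send_back(defender: Pawn) -> None:
    """Landing on an opponent's pawn sends it back to the start."""
    defender.progress, defender.returning = 0, False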

🤖

The Challenge

The game tree is large but not intractable: exactly the sweet spot where learned evaluation can sometimes beat pure rollout, and vice versa. This project explores that boundary by pitting eight different algorithms against each other under controlled conditions.

🏆

The Result

MCTS Rollout outperforms all agents including the author (human). MCTS Deep Q-Learning is second overall, but beats Rollout at very short time budgets (<0.2s/move), where neural inference costs dominate the search budget.

Agents

Eight algorithms, one leaderboard

Every algorithm navigates the same exploration–exploitation tradeoff: explore the game tree, then evaluate the states you reach. The quality of the evaluation and the number of simulations that fit in the time budget together determine who wins.
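A minimal sketch of the select/expand/evaluate/backup loop the MCTS agents share, assuming hypothetical Node and state APIs (not the squadro package's real interface); each agent below plugs in a different evaluate function:

import math
import time

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    """Upper Confidence Bound for Trees: balance win rate against novelty."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def search(root, evaluate, budget_s=3.0):
    """Repeat select -> expand -> evaluate -> back up until time runs out.
    (Two-player sign flipping omitted for brevity.)"""
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        node = root
        while node.children:                      # 1. select by UCT
            node = max(node.children, key=uct)
        for s in node.state.successors():         # 2. expand
            node.children.append(Node(s, node))
        value = evaluate(node.state)              # 3. evaluate (agent-specific)
        while node is not None:                   # 4. back up the value
            node.visits += 1
            node.value += value
            node = node.parent
    return max(root.children, key=lambda n: n.visits)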

MCTS Deep Q-Learning · Top Tier

MCTS with a policy-value CNN trained by self-play. The AlphaZero variant. Fewer tree searches than rollout but each is guided by a learned neural network.

Beats rollout at <0.2s/move; slower at longer budgets due to CPU inference cost.
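A sketch of the PUCT selection rule this AlphaZero-style agent uses, where the CNN's policy prior steers exploration (Node fields as in the generic search sketch above):

import math

def puct(node, prior, c=1.5):
    """Mean value (exploitation) plus a prior-weighted exploration bonus."""
    q = node.value / node.visits if node.visits else 0.0
    u = c * prior * math.sqrt(node.parent.visits) / (1 + node.visits)
    return q + u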

MCTS Rollout · Top Tier

MCTS with random playouts to end-of-game as the state evaluator. Simple but fast — runs ~10x more simulations than DQL per move.

Best overall at 3s/move. Small state space makes fast rollouts decisive.
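The rollout evaluator is the simplest plug-in for the generic search sketch above (hypothetical state API):

import random

def rollout_evaluate(state):
    """Estimate a state's value with one uniformly random playout."""
    while not state.is_terminal():
        state = random.choice(state.successors())
    return state.outcome()    # e.g. +1 for a win, -1 for a loss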

MCTS Advancement · Mid Tier

MCTS with a heuristic evaluation function based on relative pawn advancement. No neural network; cheaper per simulation but lower evaluation quality.
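A sketch of the relative-advancement evaluator (hypothetical pawn/state API); it is far cheaper per call than a playout or a network inference:

def advancement_evaluate(state):
    """Score a state by how far ahead the current player's pawns are."""
    mine = sum(p.progress for p in state.my_pawns())
    theirs = sum(p.progress for p in state.opponent_pawns())
    return mine - theirs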

Minimax + Alpha-Beta Pruning · Mid Tier

Exhaustive tree search to fixed depth with alpha-beta pruning to skip provably suboptimal branches. Deterministic and interpretable.
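A sketch of fixed-depth minimax with alpha-beta pruning (hypothetical state API; evaluate is any leaf heuristic, such as relative advancement):

def alphabeta(state, depth, evaluate,
              alpha=float("-inf"), beta=float("inf"), maximizing=True):
    """Minimax to a fixed depth, skipping provably suboptimal branches."""
    if depth == 0 or state.is_terminal():
        return evaluate(state)
    if maximizing:
        value = float("-inf")
        for child in state.successors():
            value = max(value, alphabeta(child, depth - 1, evaluate,
                                         alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:   # prune: this branch cannot improve the result
                break
        return value
    value = float("inf")
    for child in state.successors():
        value = min(value, alphabeta(child, depth - 1, evaluate,
                                     alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value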

Relative Advancement · Basic

Greedy one-move lookahead using relative advancement as the evaluation function. No tree search.

Advancement · Basic

Greedy one-move lookahead using absolute advancement (ignores opponent state).
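A sketch covering both greedy agents, reusing the pawn-progress idea from the heuristic above (hypothetical state API):

def greedy_move(state, relative=True):
    """Pick the move whose immediate successor scores best; no deeper search."""
    def score(s):
        mine = sum(p.progress for p in s.my_pawns())
        if not relative:               # "Advancement": ignore the opponent
            return mine
        return mine - sum(p.progress for p in s.opponent_pawns())
    return max(state.successors(), key=score)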

MCTS Q-Learning · Mid Tier

MCTS guided by a learned Q-value lookup table. Practical only for small grids (≤3 pawns) where the state space fits in memory.
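A sketch of the tabular evaluator (hypothetical state encoding; the real table is filled in during Q-learning):

from collections import defaultdict

Q = defaultdict(float)            # (state_key, move) -> learned value estimate

def q_evaluate(state):
    """Evaluate a state by the best stored Q-value among its legal moves."""
    key = state.encode()          # hypothetical hashable state encoding
    return max((Q[(key, m)] for m in state.legal_moves()), default=0.0)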

Random · Basic

Uniformly random move selection. Baseline for all comparisons.

Results

Pairwise win-rate matrix

All agents evaluated head-to-head under identical conditions: max 3 seconds per move, 100 games per pair, original 5×5 grid. Values show the win rate of the row agent against the column agent.

| Row vs Column → | Human | MCTS DQL ★ | MCTS Rollout ★ | MCTS Advancement | AB Relative Advancement | Relative Advancement | Advancement | Random |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human | — | 0.20 | 0.00 | 0.40 | 0.80 | 1.00 | 1.00 | 1.00 |
| MCTS DQL ★ | 0.80 | — | 0.24 | 0.75 | 0.54 | 1.00 | 1.00 | 1.00 |
| MCTS Rollout ★ | 1.00 | 0.76 | — | 0.94 | 0.77 | 0.98 | 0.99 | 1.00 |
| MCTS Advancement | 0.60 | 0.25 | 0.06 | — | 0.32 | 1.00 | 1.00 | 1.00 |
| AB Relative Advancement | 0.20 | 0.46 | 0.23 | 0.68 | — | 1.00 | 1.00 | 1.00 |
| Relative Advancement | 0.00 | 0.00 | 0.02 | 0.00 | 0.00 | — | 0.50 | 0.97 |
| Advancement | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.50 | — | 0.95 |
| Random | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.05 | — |

Win rate of the row agent against the column agent over 100 games (max 3 s/move). ★ = top-tier agents. Diagonal entries (an agent against itself) are omitted.
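A sketch of the round-robin protocol behind this matrix, assuming a hypothetical play_game(a, b) helper that returns the index of the winner (the package's real benchmarking entry point may differ):

import itertools

AGENTS = ["mcts_dql", "mcts_rollout", "mcts_advancement",
          "ab_relative_advancement", "relative_advancement",
          "advancement", "random"]          # illustrative agent names

def round_robin(play_game, n_games=100):
    """Win rate of each (row, column) ordered pair over n_games."""
    rates = {}
    for a, b in itertools.permutations(AGENTS, 2):
        wins = sum(play_game(a, b) == 0 for _ in range(n_games))
        rates[(a, b)] = wins / n_games
    return rates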

Architecture

Inside the AlphaZero variant

The MCTS Deep Q-Learning agent is a variant of AlphaZero. Each move is selected by running MCTS guided by a policy-value CNN trained purely through self-play — no human data or hard-coded heuristics.
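A minimal PyTorch sketch of such a dual-head network (illustrative input planes, channel counts, and move space; not the project's exact architecture):

import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, in_planes=3, channels=64, board=5, n_moves=5):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Flatten())
        flat = channels * board * board
        self.policy_head = nn.Linear(flat, n_moves)       # move logits
        self.value_head = nn.Sequential(                  # win-probability estimate
            nn.Linear(flat, 1), nn.Tanh())

    def forward(self, x):
        h = self.trunk(x)
        return self.policy_head(h), self.value_head(h)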

Several techniques stabilize and accelerate training:

Board flipping: exploits board symmetry to double the effective training data (see the sketch after this list)
Dual-value head: separate heads for policy and win-probability estimation
Cosine annealing LR: adaptive learning-rate scheduling for stable convergence
Per-player loss balancing: prevents one player from dominating the gradient
Entropy regularization: an entropy term in the policy loss encourages exploration
Experience replay sampling: adaptive buffer sampling based on the self-play win rate
Backpropagation freeze: player-dependent backpropagation freeze in case of deep Elo asymmetry
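Board flipping, for example, can be sketched as a symmetry augmentation over the training tensors (assumed layout: channels × height × width, with the policy indexed per lane; the project's exact format may differ):

import torch

def flip_sample(board: torch.Tensor, policy: torch.Tensor, value: float):
    """Mirror a self-play sample left-right; the value is symmetric."""
    flipped_board = torch.flip(board, dims=[-1])    # mirror along the width axis
    flipped_policy = torch.flip(policy, dims=[-1])  # lane i <-> lane n-1-i
    return flipped_board, flipped_policy, value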

Training Metrics

Training metrics (figure): self-play win rate, buffer diversity, policy/value loss, Elo

Self-play win rate: stays near 50%, meaning the model improves symmetrically
Buffer diversity: remains above 80%, which prevents overfitting
Checkpoint win rate: rising above 70% triggers checkpoint replacement
Elo: increases smoothly; the key convergence metric
Models

Pre-trained agents on Hugging Face

Q-Learning · 2 pawns · 18 kB
Q-Learning · 3 pawns · 6.2 MB
Deep Q-Learning · 3 pawns · 380k params · 1.5 MB
Deep Q-Learning · 4 pawns · 1.8M params · 7.1 MB
Deep Q-Learning · 5 pawns · 1.8M params · 7.1 MB
Usage

Get started in two commands

Install
pip install squadro
Works on Linux, Windows, macOS · Python ≥ 3.11
Play against the best agent
import squadro
squadro.GamePlay(agent_1='best').run()
Downloads the pre-trained model automatically on first run
Train your own agent
trainer = squadro.DeepQLearningTrainer(
  n_pawns=5, model_path='my_model'
)
trainer.run()
A few days on CPU for 5 pawns; much faster on GPU