
AI Agent for the Squadro Board Game

Personal Project · 2025

Python · PyTorch · Outperforms Humans

Demo — Author vs AI

I (yellow) play the Squadro board game against a deep reinforcement-learning agent (red) built from scratch in Python.

8+ algorithms implemented · From random to AlphaZero
100 games per matchup · Controlled benchmark
1.8M model parameters · 5-pawn DQL agent
7.1 MB model size on disk · Lightweight inference
About

What is Squadro?

♟️

The Game

Squadro is a two-player board game played on a 5×5 grid. The goal is to complete a round trip with four of your five pawns before your opponent does. Each pawn moves at a speed (1–3 spaces) determined by the dots at its starting position. Landing on an opponent's pawn sends it back to the start.
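A minimal Python sketch of these rules, assuming a hypothetical pawn representation (illustrative only; the squadro package's internals differ):

from dataclasses import dataclass

TRACK_LEN = 6  # illustrative lane length before a pawn turns around

@dataclass
class Pawn:
    speed: int             # 1-3, from the dots at the starting position
    progress: int = 0      # 0 = home, TRACK_LEN = far side
    returning: bool = False

def advance(pawn: Pawn) -> bool:
    """Move a pawn by its speed; turn around at the far side.
    Returns True once the return trip is complete."""
    if pawn.returning:
        pawn.progress = max(0, pawn.progress - pawn.speed)
        return pawn.progress == 0
    pawn.progress += pawn.speed
    if pawn.progress >= TRACK_LEN:
        pawn.progress, pawn.returning = TRACK_LEN, True
    return False

def send_back(defender: Pawn) -> None:
    """Landing on an opponent's pawn sends it back to the start."""
    defender.progress, defender.returning = 0, False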

🤖

The Challenge

The game tree is large but not intractable: exactly the sweet spot where learned evaluation can sometimes beat pure rollout, and vice versa. This project explores that boundary by pitting eight different algorithms against each other under controlled conditions.

🏆

The Result

MCTS Rollout outperforms all agents including the author (human). MCTS Deep Q-Learning is second overall, but beats Rollout at very short time budgets (<0.2s/move), where neural inference costs dominate the search budget.

Agents

Eight algorithms, one leaderboard

Every algorithm navigates the same exploration–exploitation tradeoff: explore the game tree, then evaluate the states you reach. The quality of the evaluation and the number of simulations that fit in the time budget together determine who wins.
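A minimal sketch of the select/expand/evaluate/backup loop the MCTS agents share, assuming hypothetical Node and state APIs (not the squadro package's real interface); each agent below plugs in a different evaluate function:

import math
import time

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    """Upper Confidence Bound for Trees: balance win rate against novelty."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def search(root, evaluate, budget_s=3.0):
    """Repeat select -> expand -> evaluate -> back up until time runs out.
    (Two-player sign flipping omitted for brevity.)"""
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        node = root
        while node.children:                      # 1. select by UCT
            node = max(node.children, key=uct)
        for s in node.state.successors():         # 2. expand
            node.children.append(Node(s, node))
        value = evaluate(node.state)              # 3. evaluate (agent-specific)
        while node is not None:                   # 4. back up the value
            node.visits += 1
            node.value += value
            node = node.parent
    return max(root.children, key=lambda n: n.visits)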

MCTS Deep Q-Learning · Top Tier

MCTS with a policy-value CNN trained by self-play. The AlphaZero variant. Fewer tree searches than rollout but each is guided by a learned neural network.

Beats rollout at <0.2s/move; slower at longer budgets due to CPU inference cost.
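A sketch of the PUCT selection rule this AlphaZero-style agent uses, where the CNN's policy prior steers exploration (Node fields as in the generic search sketch above):

import math

def puct(node, prior, c=1.5):
    """Mean value (exploitation) plus a prior-weighted exploration bonus."""
    q = node.value / node.visits if node.visits else 0.0
    u = c * prior * math.sqrt(node.parent.visits) / (1 + node.visits)
    return q + u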

MCTS Rollout · Top Tier

MCTS with random playouts to end-of-game as the state evaluator. Simple but fast — runs ~10x more simulations than DQL per move.

Best overall at 3s/move. Small state space makes fast rollouts decisive.
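The rollout evaluator is the simplest plug-in for the generic search sketch above (hypothetical state API):

import random

def rollout_evaluate(state):
    """Estimate a state's value with one uniformly random playout."""
    while not state.is_terminal():
        state = random.choice(state.successors())
    return state.outcome()    # e.g. +1 for a win, -1 for a loss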

MCTS Advancement · Mid Tier

MCTS with a heuristic evaluation function based on relative pawn advancement. No neural network; cheaper per simulation but lower evaluation quality.
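A sketch of the relative-advancement evaluator (hypothetical pawn/state API); it is far cheaper per call than a playout or a network inference:

def advancement_evaluate(state):
    """Score a state by how far ahead the current player's pawns are."""
    mine = sum(p.progress for p in state.my_pawns())
    theirs = sum(p.progress for p in state.opponent_pawns())
    return mine - theirs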

Minimax + Alpha-Beta Pruning · Mid Tier

Exhaustive tree search to fixed depth with alpha-beta pruning to skip provably suboptimal branches. Deterministic and interpretable.
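A sketch of fixed-depth minimax with alpha-beta pruning (hypothetical state API; evaluate is any leaf heuristic, such as relative advancement):

def alphabeta(state, depth, evaluate,
              alpha=float("-inf"), beta=float("inf"), maximizing=True):
    """Minimax to a fixed depth, skipping provably suboptimal branches."""
    if depth == 0 or state.is_terminal():
        return evaluate(state)
    if maximizing:
        value = float("-inf")
        for child in state.successors():
            value = max(value, alphabeta(child, depth - 1, evaluate,
                                         alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:   # prune: this branch cannot improve the result
                break
        return value
    value = float("inf")
    for child in state.successors():
        value = min(value, alphabeta(child, depth - 1, evaluate,
                                     alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value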

Relative Advancement · Basic

Greedy one-move lookahead using relative advancement as the evaluation function. No tree search.

Advancement · Basic

Greedy one-move lookahead using absolute advancement (ignores opponent state).
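A sketch covering both greedy agents, reusing the pawn-progress idea from the heuristic above (hypothetical state API):

def greedy_move(state, relative=True):
    """Pick the move whose immediate successor scores best; no deeper search."""
    def score(s):
        mine = sum(p.progress for p in s.my_pawns())
        if not relative:               # "Advancement": ignore the opponent
            return mine
        return mine - sum(p.progress for p in s.opponent_pawns())
    return max(state.successors(), key=score)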

MCTS Q-Learning · Mid Tier

MCTS guided by a learned Q-value lookup table. Practical only for small grids (≤3 pawns) where the state space fits in memory.
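A sketch of the tabular evaluator (hypothetical state encoding; the real table is filled in during Q-learning):

from collections import defaultdict

Q = defaultdict(float)            # (state_key, move) -> learned value estimate

def q_evaluate(state):
    """Evaluate a state by the best stored Q-value among its legal moves."""
    key = state.encode()          # hypothetical hashable state encoding
    return max((Q[(key, m)] for m in state.legal_moves()), default=0.0)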

Random · Basic

Uniformly random move selection. Baseline for all comparisons.

Results

Pairwise win-rate matrix

All agents evaluated head-to-head under identical conditions: max 3 seconds per move, 100 games per pair, original 5×5 grid. Values show the win rate of the row agent against the column agent.

| Row vs Column → | Human | MCTS DQL ★ | MCTS Rollout ★ | MCTS Advancement | AB Relative Advancement | Relative Advancement | Advancement | Random |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human | — | 0.20 | 0.00 | 0.40 | 0.80 | 1.00 | 1.00 | 1.00 |
| MCTS DQL ★ | 0.80 | — | 0.24 | 0.75 | 0.54 | 1.00 | 1.00 | 1.00 |
| MCTS Rollout ★ | 1.00 | 0.76 | — | 0.94 | 0.77 | 0.98 | 0.99 | 1.00 |
| MCTS Advancement | 0.60 | 0.25 | 0.06 | — | 0.32 | 1.00 | 1.00 | 1.00 |
| AB Relative Advancement | 0.20 | 0.46 | 0.23 | 0.68 | — | 1.00 | 1.00 | 1.00 |
| Relative Advancement | 0.00 | 0.00 | 0.02 | 0.00 | 0.00 | — | 0.50 | 0.97 |
| Advancement | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.50 | — | 0.95 |
| Random | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.05 | — |

Win rate of the row agent against the column agent over 100 games (max 3 s/move). ★ = top-tier agents. Diagonal entries (an agent against itself) are omitted.
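A sketch of the round-robin protocol behind this matrix, assuming a hypothetical play_game(a, b) helper that returns the index of the winner (the package's real benchmarking entry point may differ):

import itertools

AGENTS = ["mcts_dql", "mcts_rollout", "mcts_advancement",
          "ab_relative_advancement", "relative_advancement",
          "advancement", "random"]          # illustrative agent names

def round_robin(play_game, n_games=100):
    """Win rate of each (row, column) ordered pair over n_games."""
    rates = {}
    for a, b in itertools.permutations(AGENTS, 2):
        wins = sum(play_game(a, b) == 0 for _ in range(n_games))
        rates[(a, b)] = wins / n_games
    return rates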

Architecture

Inside the AlphaZero variant

The MCTS Deep Q-Learning agent is a variant of AlphaZero. Each move is selected by running MCTS guided by a policy-value CNN trained purely through self-play — no human data or hard-coded heuristics.
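A minimal PyTorch sketch of such a dual-head network (illustrative input planes, channel counts, and move space; not the project's exact architecture):

import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, in_planes=3, channels=64, board=5, n_moves=5):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Flatten())
        flat = channels * board * board
        self.policy_head = nn.Linear(flat, n_moves)       # move logits
        self.value_head = nn.Sequential(                  # win-probability estimate
            nn.Linear(flat, 1), nn.Tanh())

    def forward(self, x):
        h = self.trunk(x)
        return self.policy_head(h), self.value_head(h)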

Several techniques stabilize and accelerate training:

Board flipping: exploits board symmetry to double the effective training data (see the sketch after this list)
Dual-value head: separate heads for policy and win-probability estimation
Cosine annealing LR: adaptive learning-rate scheduling for stable convergence
Per-player loss balancing: prevents one player from dominating the gradient
Entropy regularization: an entropy term in the policy loss encourages exploration
Experience replay sampling: adaptive buffer sampling based on the self-play win rate
Backpropagation freeze: player-dependent backpropagation freeze in case of deep Elo asymmetry
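Board flipping, for example, can be sketched as a symmetry augmentation over the training tensors (assumed layout: channels × height × width, with the policy indexed per lane; the project's exact format may differ):

import torch

def flip_sample(board: torch.Tensor, policy: torch.Tensor, value: float):
    """Mirror a self-play sample left-right; the value is symmetric."""
    flipped_board = torch.flip(board, dims=[-1])    # mirror along the width axis
    flipped_policy = torch.flip(policy, dims=[-1])  # lane i <-> lane n-1-i
    return flipped_board, flipped_policy, value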

Training Metrics

Training metrics (figure): self-play win rate, buffer diversity, policy/value loss, Elo

Self-play win rate: stays near 50%, meaning the model improves symmetrically
Buffer diversity: remains above 80%, which prevents overfitting
Checkpoint win rate: rising above 70% triggers checkpoint replacement
Elo: increases smoothly; the key convergence metric
Models

Pre-trained agents on Hugging Face

Q-Learning · 2 pawns · 18 kB
Q-Learning · 3 pawns · 6.2 MB
Deep Q-Learning · 3 pawns · 380k params · 1.5 MB
Deep Q-Learning · 4 pawns · 1.8M params · 7.1 MB
Deep Q-Learning · 5 pawns · 1.8M params · 7.1 MB
Usage

Get started in two commands

Install
pip install squadro
Works on Linux, Windows, macOS · Python ≥ 3.11
Play against the best agent
import squadro
squadro.GamePlay(agent_1='best').run()
Downloads the pre-trained model automatically on first run
Train your own agent
trainer = squadro.DeepQLearningTrainer(
  n_pawns=5, model_path='my_model'
)
trainer.run()
A few days on CPU for 5 pawns; much faster on GPU