♟️ Screenshot → FEN: Teaching a CNN to Read Chess Boards
I wanted a simple thing: look at a chess.com screenshot and get the board position as a FEN string. Voice command → screenshot → FEN → Stockfish analysis. A coaching loop for a ~1200 rated player who keeps blundering bishops.
What I got instead was a crash course in why computer vision is harder than it looks, why heuristics break on the second image, and why the deeplearning.ai CNN course is worth doing before you attempt something like this.
This post walks through the full journey — the five failed attempts at board detection, the from-scratch piece classifier, synthetic data generation, constrained decoding, and what we shipped. It's co-authored with Joe, my AI assistant running on OpenClaw, who pair-programmed the entire thing with me in a Jupyter notebook.
The Goal
Given a random chess.com screenshot, return:
- The FEN for the largest visible board
- Board orientation (white or black on bottom)
- Confidence metrics per square
Sounds straightforward. It wasn't.
Module 1: Finding the Board (Five Attempts)
The first problem is deceptively simple: given a full-screen screenshot with browser chrome, sidebars, ads, and maybe a YouTube video, find the chess board.
Attempt 1: Pretrained YOLO
YOLOv8 out of the box. COCO doesn't include "chess board" as a class, but maybe it'd detect it as a TV screen or monitor?
Nope. It found people, laptops, and a cell phone. No board. Fair enough — COCO has 80 classes and chess boards aren't one of them.
Attempt 2: Edge Detection + Contours
Classic CV approach: Canny edges → find contours → filter for squares. Zero candidates. The board's internal grid lines confused the contour detector, and the pieces sitting on squares broke any clean edge signal.
Attempt 3: Color Template Matching
Chess.com uses known color themes. If we know the exact light/dark square colors, we can create a binary mask and find the largest matching region.
```python
THEMES = {
    'green':  {'light': (235, 236, 208), 'dark': (115, 149, 82)},
    'brown':  {'light': (240, 217, 181), 'dark': (181, 136, 99)},
    'blue':   {'light': (222, 227, 230), 'dark': (140, 162, 173)},
    'purple': {'light': (230, 213, 236), 'dark': (150, 111, 168)},
}
```
This found board-colored regions but couldn't reliably isolate the board from the rest of the UI. Chess.com's sidebar uses similar greens.
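The masking step itself is simple. A numpy-only sketch of the idea (my reconstruction; the tolerance value is illustrative):

```python
import numpy as np

def board_color_bbox(img_rgb, theme, tol=20):
    """Mask pixels near the theme's light or dark square color,
    then take the bounding box of the matching region."""
    mask = np.zeros(img_rgb.shape[:2], dtype=bool)
    for color in (theme['light'], theme['dark']):
        # Chebyshev distance to the reference color, per pixel
        dist = np.abs(img_rgb.astype(int) - np.array(color)).max(axis=-1)
        mask |= dist <= tol
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return xs.min(), ys.min(), xs.max(), ys.max()
```

The failure mode wasn't the mask. It was that chess.com's surrounding UI contains near-identical greens, so the bounding box kept swallowing parts of the sidebar.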
Attempt 4: Sliding Window + Checkerboard Scoring
Combine color matching with structural validation. Slide a square window across the image at multiple scales, score each position on how well it matches an 8×8 alternating grid:
```python
import numpy as np

def checkerboard_score(region):
    """Score how well a region matches an 8x8 alternating grid."""
    h, w = region.shape[:2]
    sq_h, sq_w = h // 8, w // 8
    light_vals, dark_vals = [], []
    for r in range(8):
        for c in range(8):
            cell = region[r*sq_h:(r+1)*sq_h, c*sq_w:(c+1)*sq_w]
            mean_val = cell.mean()
            if (r + c) % 2 == 0:
                light_vals.append(mean_val)
            else:
                dark_vals.append(mean_val)
    # Reward high contrast between the two cell groups and low variance within each
    contrast = abs(np.mean(light_vals) - np.mean(dark_vals))
    uniformity = 1 / (1 + np.std(light_vals) + np.std(dark_vals))
    return contrast * uniformity
```
This actually worked! On one image. It found the correct board with a slight offset of about a quarter-square. A refinement pass (nudging the window a few pixels in each direction and keeping the highest-scoring position) got it close enough.
Then I tried a second screenshot. It broke.
Lesson learned: Heuristics are great for exploration and documentation. They teach you what features matter. But they're brittle across varied inputs. After four attempts, the answer was obvious: train a model.
Attempt 5: Fine-Tune YOLOv8
I grabbed a labeled chess board dataset (498 training images, YOLO format) and fine-tuned YOLOv8n for 50 epochs on our RTX 4090. Single class: chessboard.
```python
from ultralytics import YOLO

model = YOLO('yolov8n.pt')
model.train(
    data='fen-data/data.yaml',
    epochs=50,
    imgsz=640,
    batch=16,
    name='fen-yolo/chessboard_v1',
)
```
Trained in minutes. Worked on every test image. Reliable bounding boxes with 90%+ confidence.
Four attempts of hand-crafted heuristics → one fine-tuning run that just works. This is the lesson the CNN course drills into you, and it hits differently when you've lived through the alternative.
| Attempt | Approach | Result |
|---|---|---|
| 1 | Pretrained YOLO | Detected people, not boards ❌ |
| 2 | Edge detection + contours | Zero candidates ❌ |
| 3 | Color template matching | Found regions, couldn't isolate ❌ |
| 4 | Sliding window + checkerboard | Worked on 1 image, broke on 2nd ⚠️ |
| 5 | Fine-tuned YOLOv8 | Reliable across all test images ✅ |
Module 2: Piece Classification (From Scratch)
With the board reliably cropped, the next step: split it into 64 squares and classify each one. 13 classes: empty, wp, wn, wb, wr, wq, wk, bp, bn, bb, br, bq, bk.
This is where the deeplearning.ai Deep Learning Specialization (specifically Course 4: Convolutional Neural Networks) paid off directly. The course walks you through building CNN architectures from first principles — conv layers, pooling, batch norm, data augmentation — and that's exactly what this classifier needed.
The Architecture
Nothing fancy. A straightforward CNN stack built with TensorFlow/Keras, following patterns from the course:
```python
model = keras.Sequential([
    layers.Rescaling(1./255, input_shape=(*IMG_SIZE, 3)),
    # Data augmentation (active during training only)
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.02),
    layers.RandomZoom(0.05),
    # Conv blocks with increasing filters
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, padding='same', activation='relu'),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    # Dense head
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.4),
    layers.Dense(13, activation='softmax'),
])
```
96×96 input images. Three conv blocks (32→64→128 filters), batch normalization after each, max pooling to downsample. Flatten → Dense → Dropout → 13-class softmax. Adam optimizer, sparse categorical crossentropy.
If you've done Course 4, this should look familiar. It's the standard architecture the course builds up to, minus the residual connections (which weren't needed for 13 classes on 96×96 crops).
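A quick sanity check on the shapes: the `'same'`-padded convolutions preserve spatial size, and each of the three pooling layers halves it, so the dense head sees a fixed-size feature vector:

```python
size = 96                      # input resolution (96x96 crops)
for _ in range(3):             # three MaxPooling2D layers
    size //= 2                 # 96 -> 48 -> 24 -> 12
flatten_features = size * size * 128  # last conv block has 128 filters
print(flatten_features)        # 12 * 12 * 128 = 18432
```

That 18,432-wide flatten feeding a 256-unit dense layer is where most of the model's parameters live, which is also why the dropout sits right after it.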
The Data Problem
I started with manually labeled square crops from a few chess.com screenshots. Maybe 200 images total. Training worked — 95%+ validation accuracy.
Then I ran end-to-end on a new screenshot and it fell apart. Bishops got classified as pawns. Some empty squares got classified as pieces. The model had memorized the training screenshots' specific piece styles and colors.
The CNN course warns you about this: small datasets + high-capacity models = overfitting. The fix isn't more layers — it's more data.
Synthetic Data to the Rescue
I wrote a generator using python-chess that creates random legal board positions, renders them as images with chess.com-style piece sprites, and auto-labels all 64 squares:
```python
import random
from pathlib import Path

import chess
import cv2

def generate_synthetic_boards(num_boards, out_dir):
    """Generate random legal positions, render, split into labeled squares."""
    for i in range(num_boards):
        board = chess.Board()
        # Play random moves to get a realistic mid-game position
        for _ in range(random.randint(5, 40)):
            moves = list(board.legal_moves)
            if not moves:
                break
            board.push(random.choice(moves))
        # Render board and split into 64 squares
        img = render_board(board)  # custom renderer using piece sprites
        for rank in range(8):
            for file in range(8):
                square = chess.square(file, 7 - rank)
                piece = board.piece_at(square)
                label = piece_to_label(piece)  # e.g., 'wp', 'bk', 'empty'
                crop = extract_square(img, rank, file)
                save_path = Path(out_dir) / label / f'synth_{i}_r{rank}c{file}.png'
                save_path.parent.mkdir(parents=True, exist_ok=True)
                cv2.imwrite(str(save_path), crop)
```
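The `piece_to_label` helper the generator relies on is essentially a symbol mapping. A pure-Python sketch of the same idea, working from the one-letter FEN symbol that python-chess's `Piece.symbol()` returns (uppercase for white, lowercase for black) rather than the Piece object itself:

```python
def symbol_to_label(symbol):
    """Map a FEN-style piece symbol ('P', 'k', ...) to a class label.

    None means an empty square. White symbols are uppercase, so
    'P' -> 'wp', 'k' -> 'bk', None -> 'empty'.
    """
    if symbol is None:
        return 'empty'
    color = 'w' if symbol.isupper() else 'b'
    return color + symbol.lower()
```

The label doubles as the directory name on disk, which is what lets Keras's directory-based dataset loader pick up the classes with zero extra bookkeeping.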
I generated 2,000 synthetic boards (128,000 square crops), merged them with the real data, and retrained. The merged model generalized much better — bishops were no longer pawns.
Module 3: From Predictions to Valid FEN
Even with good per-square accuracy, the raw predictions often produced invalid boards. Chess has rules: one king per side, max 8 pawns, max 2 bishops (before promotions), etc. A board with 3 white kings and 10 black pawns is nonsense.
Constrained Decoding
Instead of blindly taking the argmax prediction for each square, I used the top-k probabilities and applied chess-legal constraints:
```python
MAX_COUNTS = {
    'bk': 1, 'wk': 1,
    'bp': 8, 'wp': 8,
    'bb': 2, 'wb': 2,
    'bn': 2, 'wn': 2,
    'br': 2, 'wr': 2,
    'bq': 1, 'wq': 1,
}

def decode_with_constraints(topk_grid):
    """Build a valid board from top-k predictions per square."""
    pred = [[topk_grid[r][c][0][0] for c in range(8)] for r in range(8)]
    for piece, max_count in MAX_COUNTS.items():
        while count(pred, piece) > max_count:
            # Find the square where this piece has the lowest confidence
            worst = find_lowest_confidence_square(pred, topk_grid, piece)
            # Replace with the next-best prediction that doesn't violate constraints
            pred[worst.r][worst.c] = next_valid_alternative(topk_grid, worst, pred)
    return board_to_fen(pred)
```
The constrained decoder acts as a safety net. When the CNN says "3 white kings," the decoder keeps the most confident one and downgrades the others to their second-choice prediction. It's a simple greedy repair, not beam search — but it catches most invalid boards.
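To make the repair concrete, here's a runnable toy version (simplified from the pseudocode above: flat top-2 lists per square, and it assumes each square's runner-up label differs from its top choice):

```python
MAX_COUNTS = {'wk': 1, 'bk': 1, 'wq': 1, 'bq': 1,
              'wr': 2, 'br': 2, 'wb': 2, 'bb': 2,
              'wn': 2, 'bn': 2, 'wp': 8, 'bp': 8}

def repair(topk_grid):
    """topk_grid[r][c] is a list of (label, prob) pairs, best first."""
    pred = [[topk_grid[r][c][0][0] for c in range(8)] for r in range(8)]
    conf = [[topk_grid[r][c][0][1] for c in range(8)] for r in range(8)]
    for piece, max_count in MAX_COUNTS.items():
        while sum(row.count(piece) for row in pred) > max_count:
            # Demote the lowest-confidence occurrence to its runner-up label
            squares = [(conf[r][c], r, c)
                       for r in range(8) for c in range(8)
                       if pred[r][c] == piece]
            _, r, c = min(squares)
            pred[r][c], conf[r][c] = topk_grid[r][c][1]  # second choice
    return pred
```

Feed it a grid with two white kings and the less confident one gets demoted to its second-choice label, exactly the "3 white kings" repair described above.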
Orientation Detection
Chess.com renders rank numbers (1–8) and file letters (a–h) in the corners of edge squares. If '1' is at the bottom, white's on the bottom. If '8' is at the bottom, black is.
I used Tesseract OCR on the corner squares. It was flaky — the labels are tiny, anti-aliased, and partially covered by pieces. Multiple rounds of preprocessing (thresholding, cropping to the exact corner pixel region, upscaling) got it working reliably enough.
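Once orientation is known, the 8×8 label grid read top-to-bottom maps to ranks 8 down to 1 (white on bottom) or 1 up to 8 (black on bottom), so the grid just gets flipped before FEN encoding. A sketch of the piece-placement encoding (hypothetical helper name; white pieces uppercase per the FEN spec):

```python
def grid_to_fen_placement(grid, white_on_bottom=True):
    """grid[0] is the top row of the screenshot; labels like 'wp', 'bk', 'empty'."""
    if not white_on_bottom:
        # Black on bottom: the screenshot's top row is rank 1, so flip
        # both row order and column order to normalize to White's view.
        grid = [row[::-1] for row in grid[::-1]]
    ranks = []
    for row in grid:
        fen_row, run = '', 0
        for label in row:
            if label == 'empty':
                run += 1          # FEN compresses empty squares into a digit
            else:
                if run:
                    fen_row += str(run)
                    run = 0
                sym = label[1]
                fen_row += sym.upper() if label[0] == 'w' else sym
        if run:
            fen_row += str(run)
        ranks.append(fen_row)
    return '/'.join(ranks)
```

Feeding it the starting-position grid yields `rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR`; the full FEN additionally needs side-to-move and castling fields, which a screenshot alone can't always determine.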
The API
The whole pipeline is wrapped in a FastAPI service:
```python
@app.post("/fen")
async def extract_fen(file: UploadFile):
    img = load_image(file)
    board_crop = yolo_detect_largest_board(img)
    orientation = detect_orientation(board_crop)
    squares = split_8x8(board_crop)
    predictions = classify_squares(squares)
    fen = constrained_decode(predictions)
    return {
        "fen": fen,
        "orientation": orientation,
        "confidence_summary": compute_confidence(predictions),
        "warnings": collect_warnings(predictions),
    }
```
Dockerized, runs on our RTX 4090 at home. The end-to-end dream: say "analyze this position" → OpenClaw takes a screenshot → hits the API → feeds the FEN to Stockfish → tells me what I should've played instead of hanging my bishop.
What We Learned
1. Heuristics teach, models ship
The four failed heuristic attempts weren't wasted. They taught me what features matter (checkerboard contrast, color clustering, grid alignment) and made the YOLO labels and CNN architecture choices informed rather than random. But heuristics don't generalize. The moment you test on a second image, they break.
2. The CNN course content is directly applicable
The deeplearning.ai CNN course (Course 4 of the Deep Learning Specialization) covers conv layers → pooling → batch norm → data augmentation → transfer learning. Every single one of those showed up in this project. If you're a student working through the course and wondering "when will I use this?" — this is when. Build something real alongside the coursework.
3. Synthetic data is a superpower
When your model overfits on 200 real images, generating 128,000 synthetic squares from python-chess is a game-changer. The key insight: you don't need photorealistic renders. You need variety in piece positions and board states. The model learns piece shapes, not pixel-perfect rendering.
4. Post-processing matters as much as the model
A 97% per-square accuracy still means ~2 wrong squares per board. That's enough to produce an illegal position. Constrained decoding — using domain knowledge (chess rules) to repair predictions — turned "mostly right" into "usably correct."
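The arithmetic behind that claim, assuming per-square errors are independent:

```python
acc = 0.97
expected_errors = 64 * (1 - acc)   # expected wrong squares per board: ~1.92
p_perfect_board = acc ** 64        # chance every square is right: ~0.14

print(expected_errors)   # ~1.92
print(p_perfect_board)   # ~0.14 -> only about 1 in 7 boards is fully correct raw
```

So even a classifier that looks excellent square-by-square produces a flawed board six times out of seven, which is why the constrained decoder isn't optional.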
5. Keep the failure log
The notebook is intentionally a full engineering log, including every failed attempt. When you're debugging at 1am and the sliding window scorer is returning garbage, you want to see exactly why you abandoned that approach three days ago. Future-you will thank present-you.
Human + AI Pair Programming
This project was built entirely through pair programming between me (Andre) and Joe, an AI assistant running on OpenClaw. The workflow: I'd describe what I wanted, Joe would write the code, I'd run it in the Jupyter notebook, we'd look at the results together, and iterate.
What worked well about this:
- Rapid prototyping — each heuristic attempt was coded and tested in minutes
- No context loss — the notebook captured every decision and its rationale
- Complementary strengths — I knew what chess boards look like; Joe knew OpenCV APIs and TensorFlow patterns
- Honest feedback — when an approach was clearly failing, neither of us was emotionally attached to it
The full notebook (with all the failed attempts preserved) is open source on GitHub.
What's Next
- Domain-specific training data — ingest more chess.com piece styles to reduce bishop/pawn confusion on unseen themes
- Transfer learning — try MobileNet or EfficientNet as the backbone instead of from-scratch (Course 4, Week 2 material)
- Voice integration — the original dream: "Hey Joe, analyze this position" → screenshot → FEN → Stockfish eval → spoken analysis
- Mobile — run the pipeline on a phone screenshot from the chess.com app
If you're taking the deeplearning.ai CNN course and want a project to apply what you're learning, chess piece classification is a great one. It's visual, the classes are well-defined, you can generate unlimited training data, and the domain constraints give you a natural post-processing layer to implement. Plus you end up with something you can actually use.