Familiar Actors

Why this exists

I'm terrible at recognizing faces. When I'm watching a show, I'll often see an actor and think "I know that face..." — then I'll look them up and realize it's not who I thought it was. Now I'm stuck wondering: who was I thinking of?

Familiar Actors aims to solve that. Type in the actor you just looked up, and it'll show you actors who look similar. One of them is probably (hopefully) who you had in mind.

How it works

Actor headshots are sourced from TMDB. The project's database currently holds 406,323 actors.
Each headshot is processed through OpenCLIP (ViT-B-32) to generate an embedding — a 512-dimensional vector that captures visual features like face shape, coloring, and overall appearance.
When you search for an actor, their embedding is compared against every other embedding in the 406,317-entry search index using cosine similarity.
The most similar faces are returned, ranked by score.
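
The comparison and ranking steps above can be sketched with numpy. This is a minimal illustration, not the project's actual code: the function names are hypothetical, and random vectors stand in for the real 512-dimensional CLIP embeddings.

```python
import numpy as np


def normalize_rows(matrix: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)


def top_matches(embeddings: np.ndarray, query_index: int, k: int = 5):
    """Return (index, score) pairs for the k rows most similar to the query row."""
    unit = normalize_rows(embeddings)
    scores = unit @ unit[query_index]            # cosine similarity against every row
    scores[query_index] = -np.inf                # exclude the query actor itself
    k = min(k, len(scores) - 1)
    best = np.argpartition(-scores, k - 1)[:k]   # unordered top k
    best = best[np.argsort(-scores[best])]       # rank by score, highest first
    return [(int(i), float(scores[i])) for i in best]


# Toy data standing in for the real CLIP embeddings (assumption for illustration).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512)).astype(np.float32)
embeddings[42] = embeddings[7] + 0.01 * rng.normal(size=512)  # near-duplicate of row 7
matches = top_matches(embeddings, query_index=7, k=3)
```

Here row 42 comes back as the top match for row 7 because it was built as a near-copy. A real service would likely normalize the matrix once at startup rather than on every query.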

Tech stack

FastAPI + Jinja2 + HTMX — Python backend with server-rendered templates and async interactivity
OpenCLIP (ViT-B-32) — visual similarity embeddings from OpenAI's CLIP model
SQLModel / SQLite — actor metadata storage
numpy — in-memory cosine similarity search across 400k+ embeddings
rapidfuzz — typo-tolerant actor name search
TMDB API — actor data, headshots, and movie/show cast lists
Deployed on Railway

The development story

This project was built collaboratively with Claude (Anthropic's AI assistant) over the course of a few intensive sessions. The collaboration was genuine — not "AI writes code, human approves" but a back-and-forth where we workshopped ideas, debated approaches, and course-corrected together.

Some highlights from the process:

The ArcFace pivot. We initially used deepface with ArcFace embeddings for facial similarity. The results were terrible — Samuel L. Jackson's closest match was a young white woman. It turned out ArcFace is designed for face verification (is this the same person?), not visual similarity. Swapping to CLIP, which captures holistic visual similarity, immediately brought match scores from 35% to 89%.
The Railway deployment battle. Getting the dataset onto Railway's volume was an adventure involving inode limits (180k tiny files), ephemeral disk confusion, and multiple volume wipes. The solution was consolidating 180k individual embedding files into a single numpy array — the app now downloads a 206MB tarball on first boot.
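
The consolidation step might look roughly like the sketch below. The one-file-per-actor layout, the file names, and the `consolidate` helper are assumptions made for illustration; the source only says that many small embedding files were merged into a single numpy array.

```python
import tempfile
from pathlib import Path

import numpy as np


def consolidate(embedding_dir: Path, out_path: Path) -> list[str]:
    """Stack every per-actor .npy file into one matrix and return the row order."""
    files = sorted(embedding_dir.glob("*.npy"))
    matrix = np.stack([np.load(f) for f in files]).astype(np.float32)
    np.save(out_path, matrix)
    return [f.stem for f in files]   # row i of the matrix belongs to ids[i]


# Demo with a handful of fake embeddings in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "embeddings"
    src.mkdir()
    for actor_id in (101, 202, 303):
        np.save(src / f"{actor_id}.npy", np.full(512, actor_id, dtype=np.float32))

    ids = consolidate(src, Path(tmp) / "all_embeddings.npy")
    combined = np.load(Path(tmp) / "all_embeddings.npy")
    combined_shape = combined.shape
```

One file plus a list of ids replaces thousands of directory entries, which is what sidesteps an inode limit: the limit counts files, not bytes.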
TV cast discovery. We found that TMDB's regular credits endpoint only returns 4 actors for The Office (not even Steve Carell!). Switching to the aggregate_credits endpoint returned all 691 cast members across every season.
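
For reference, the two TMDB endpoints differ only in the last path segment. The sketch below builds each URL and pulls names out of a response body. The helper names, the placeholder series id, and the sample payload are hypothetical, and authentication is omitted.

```python
API_BASE = "https://api.themoviedb.org/3"


def credits_url(series_id: int, aggregate: bool = True) -> str:
    """Build the TV credits URL; aggregate_credits covers every season."""
    endpoint = "aggregate_credits" if aggregate else "credits"
    return f"{API_BASE}/tv/{series_id}/{endpoint}"


def cast_names(payload: dict) -> list[str]:
    """Pull actor names out of a credits response body."""
    return [member["name"] for member in payload.get("cast", [])]


# Hand-written payload for illustration (not real TMDB output).
sample = {"cast": [{"id": 1, "name": "Actor One"}, {"id": 2, "name": "Actor Two"}]}
url = credits_url(1234)          # 1234 is a placeholder series id
names = cast_names(sample)
```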

Dataset

The database currently contains 406,323 actors with 406,317 searchable embeddings. Data is sourced from TMDB's popular actors endpoint, cast lists from top-rated movies and TV shows, and a background crawler that systematically discovers actors across decades of film and television.

Building this dataset is a slow, respectful process. TMDB generously provides free API access, but their rate limits (40 requests per 10 seconds) mean collecting data on hundreds of thousands of actors takes days of continuous crawling. The current dataset represents multiple days of background processing: crawling credits from movies and TV shows spanning decades, downloading headshots, and generating CLIP embeddings for each one.

For example, here is the log line from the most recent crawl session, which fetched info for actors in movies released from 1970 to 2026:

INFO Discover crawl complete: 214030 new actors from 439382 titles in 251809s

It took nearly 3 days to fetch and process data for 214k actors from ~440k movies.
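
A crawler can stay under a limit of 40 requests per 10 seconds with a sliding-window limiter like the one sketched below. This is an illustrative assumption about how such throttling could be done, not the crawler's actual code. The class name is hypothetical, and the clock and sleep functions are injectable so the demo runs instantly.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any rolling window of `period` seconds."""

    def __init__(self, max_calls=40, period=10.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()   # timestamps of recent requests, oldest first

    def wait(self) -> None:
        """Block until another request is allowed, then record it."""
        now = self.clock()
        # Forget requests that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Window is full: sleep until the oldest request expires.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)


class FakeClock:
    """Simulated time source so the demo does not actually sleep."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
limiter = SlidingWindowLimiter(max_calls=40, period=10.0, clock=fake.time, sleep=fake.sleep)
for _ in range(41):
    limiter.wait()             # call before each (simulated) API request
elapsed = fake.now
```

The first 40 calls pass immediately and the 41st waits out the full 10-second window. At that ceiling of 4 requests per second, a crawler can make at most 345,600 requests per day, which is why a crawl of this size runs for days.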

About me

I'm Steven Zuber, a software engineer at SymbyAI and a longtime artificial intelligence enthusiast. While I've used Claude in my professional work for years and have turned to it for dozens of small, one-off problems, this is the first project I've started from scratch with Claude Code. You can find this project's source code on GitHub.

If you have any feedback, suggestions for improvement, or just want to chat about the project, please feel free to check out the GitHub Discussions page for this project.