Familiar Actors

That actor looks familiar — but who are you actually thinking of? Search by name or by what you're watching.

Want the full breakdown? Check out the technical deep dive for details on the embedding pipeline, similarity search, deployment architecture, and lessons learned.

Why this exists

I'm terrible at recognizing faces. When I'm watching a show, I'll often see an actor and think "I know that face..." — then I'll look them up and realize it's not who I thought it was. Now I'm stuck wondering: who was I thinking of?

Familiar Actors aims to solve that. Type in the actor you just looked up, and it'll show you actors who look similar. One of them is probably (hopefully) who you had in mind.

How it works

  1. Actor headshots are sourced from TMDB's database of 406,323 actors.
      By the way, that's the real current number. The counts on this page come from two SQL queries in the /about route — one for actors, one for embedding paths.
  2. Each headshot is processed through OpenCLIP (ViT-B-32) to generate an embedding — a 512-dimensional vector that captures visual features like face shape, coloring, and overall appearance.
  3. When you search for an actor, their embedding is compared against all 406,317 stored embeddings using cosine similarity.
  4. The most similar faces are returned, ranked by score.
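Steps 3 and 4 can be sketched with plain numpy. This is a minimal illustration, not the app's actual code: the function names `normalize_rows` and `top_matches` are hypothetical, and the toy vectors stand in for real 512-dimensional CLIP embeddings. The idea is that once every row is scaled to unit length, a single matrix-vector product yields the cosine similarity against every stored embedding.

```python
import numpy as np


def normalize_rows(matrix: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)


def top_matches(embeddings: np.ndarray, query_index: int, k: int = 5) -> list[tuple[int, float]]:
    """Return (row index, score) for the k rows most similar to the query row."""
    unit = normalize_rows(embeddings)
    scores = unit @ unit[query_index]   # one dot product per stored embedding
    scores[query_index] = -np.inf       # never return the searched actor themselves
    k = min(k, len(scores) - 1)
    if k <= 0:
        return []
    # argpartition picks the k best rows without sorting the whole array
    candidates = np.argpartition(-scores, k - 1)[:k]
    ranked = candidates[np.argsort(-scores[candidates])]
    return [(int(i), float(scores[i])) for i in ranked]


if __name__ == "__main__":
    # Toy data (assumption): random vectors, with row 7 made a near-copy of row 3
    rng = np.random.default_rng(0)
    toy = rng.normal(size=(1000, 512)).astype(np.float32)
    toy[7] = toy[3] + 0.01 * rng.normal(size=512)
    print(top_matches(toy, query_index=3, k=3))
```

In a real deployment the normalization would likely be done once at load time rather than on every search, since the stored embeddings do not change between queries.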

Tech stack

  • FastAPI + Jinja2 + HTMX — Python backend with server-rendered templates and async interactivity
  • OpenCLIP (ViT-B-32) — visual similarity embeddings from an open-source implementation of OpenAI's CLIP model
  • SQLModel / SQLite — actor metadata storage
  • numpy — in-memory cosine similarity search across 400k+ embeddings
  • rapidfuzz — typo-tolerant actor name search
  • TMDB API — actor data, headshots, and movie/show cast lists
  • Deployed on Railway

The development story

This project was built collaboratively with Claude (Anthropic's AI assistant) over the course of a few intensive sessions. The collaboration was genuine — not "AI writes code, human approves" but a back-and-forth where we workshopped ideas, debated approaches, and course-corrected together.

Some highlights from the process:

  • The ArcFace pivot. We initially used deepface with ArcFace embeddings for facial similarity. The results were terrible — Samuel L. Jackson's closest match was a young white woman. It turned out ArcFace is designed for face verification (is this the same person?), not visual similarity. Swapping to CLIP, which captures holistic visual similarity, immediately brought match scores from 35% to 89%.
  • The Railway deployment battle. Getting the dataset onto Railway's volume was an adventure involving inode limits (180k tiny files), ephemeral disk confusion, and multiple volume wipes. The solution was consolidating 180k individual embedding files into a single numpy array — the app now downloads a 206MB tarball on first boot.
  • TV cast discovery. We found that TMDB's regular credits endpoint only returns 4 actors for The Office (not even Steve Carell!). Switching to the aggregate_credits endpoint returned all 691 cast members across every season.
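The consolidation step from the Railway story can be sketched roughly as follows. The file layout here is an assumption: one `.npy` file per actor, named by a numeric ID. The `consolidate` function and the `ids` / `embeddings` archive keys are hypothetical names chosen for illustration. The point is that thousands of tiny files collapse into one archive, which sidesteps per-file inode limits.

```python
from pathlib import Path
import tempfile

import numpy as np


def consolidate(embedding_dir: Path, out_file: Path) -> int:
    """Stack per-actor .npy vectors into one .npz archive; return the row count."""
    # Assumption: each file is named "<actor_id>.npy" and holds one 1-D vector
    files = sorted(embedding_dir.glob("*.npy"), key=lambda f: int(f.stem))
    if not files:
        raise ValueError(f"no .npy files found in {embedding_dir}")
    ids = np.array([int(f.stem) for f in files], dtype=np.int64)
    matrix = np.stack([np.load(f) for f in files]).astype(np.float32)
    # Row i of `embeddings` belongs to the actor whose ID is ids[i]
    np.savez(out_file, ids=ids, embeddings=matrix)
    return len(files)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        for actor_id in (31, 192, 2888):  # made-up IDs for the demo
            np.save(tmp_path / f"{actor_id}.npy", np.random.rand(512).astype(np.float32))
        count = consolidate(tmp_path, tmp_path / "embeddings.npz")
        with np.load(tmp_path / "embeddings.npz") as packed:
            print(count, packed["embeddings"].shape, packed["ids"].tolist())
```

Keeping a parallel `ids` array alongside the matrix is one simple way to map a row in the search results back to an actor record.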

Dataset

The database currently contains 406,323 actors with 406,317 searchable embeddings. Data is sourced from TMDB's popular actors endpoint, cast lists from top-rated movies and TV shows, and a background crawler that systematically discovers actors across decades of film and television.
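The two numbers differ because an actor can exist in the database without a usable embedding. A rough sketch of how two such counts could be queried is below. The schema is an assumption made up for illustration (an `actor` table with a nullable `embedding_path` column), and it uses the standard-library `sqlite3` module as a stand-in rather than SQLModel.

```python
import sqlite3


def dataset_counts(conn: sqlite3.Connection) -> tuple[int, int]:
    """Return (total actors, actors with an embedding path)."""
    # Hypothetical schema: table `actor`, nullable column `embedding_path`
    total = conn.execute("SELECT COUNT(*) FROM actor").fetchone()[0]
    embedded = conn.execute(
        "SELECT COUNT(*) FROM actor WHERE embedding_path IS NOT NULL"
    ).fetchone()[0]
    return total, embedded


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE actor (id INTEGER PRIMARY KEY, name TEXT, embedding_path TEXT)"
    )
    conn.executemany(
        "INSERT INTO actor (name, embedding_path) VALUES (?, ?)",
        [("Actor A", "a.npy"), ("Actor B", None), ("Actor C", "c.npy")],
    )
    print(dataset_counts(conn))
```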

Building this dataset is a slow, respectful process. TMDB generously provides free API access, but their rate limits (40 requests per 10 seconds) mean collecting data on hundreds of thousands of actors takes days of continuous crawling. The current dataset represents multiple days of background processing: crawling credits from movies and TV shows spanning decades, downloading headshots, and generating CLIP embeddings for each one.

For example, check out this last crawl session that just fetched info for actors in movies released from 1970-2026:

  INFO Discover crawl complete: 214030 new actors from 439382 titles in 251809s

It took nearly 3 days to fetch and process data for 214k actors from ~440k movies.
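The arithmetic behind those figures is easy to check. The sketch below uses only the numbers quoted above: the rate limit and the three values from the crawl log line. The "one request per title" floor is a deliberately optimistic assumption for illustration. A real crawl also needs requests for person details and headshot downloads, plus embedding time, so the actual run is expected to take longer than the floor.

```python
REQUESTS_PER_WINDOW = 40   # rate limit quoted above
WINDOW_SECONDS = 10
SECONDS_PER_DAY = 86_400


def min_seconds_for(requests: int) -> float:
    """Shortest possible wall time to issue `requests` calls at the rate limit."""
    return requests * WINDOW_SECONDS / REQUESTS_PER_WINDOW


# Figures taken from the crawl log line
crawl_seconds = 251_809
new_actors = 214_030
titles = 439_382

crawl_days = crawl_seconds / SECONDS_PER_DAY
seconds_per_actor = crawl_seconds / new_actors
# Optimistic floor (assumption): exactly one API request per title
floor_days = min_seconds_for(titles) / SECONDS_PER_DAY

print(f"crawl duration:      {crawl_days:.2f} days")
print(f"time per new actor:  {seconds_per_actor:.2f} s")
print(f"rate-limit floor:    {floor_days:.2f} days")
```

The crawl works out to about 2.9 days, or roughly 1.2 seconds per newly discovered actor, while the rate limit alone would account for about 1.3 days even at one request per title.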

About me

I'm Steven Zuber, a software engineer at SymbyAI and a longtime artificial intelligence enthusiast. While I've used Claude in my professional work for years and have used Claude to help with dozens of small, one-off problems, this is the first project I've started from scratch with Claude Code. You can find this project's source code on GitHub.

If you have any feedback, suggestions for improvement, or just want to chat about the project, please feel free to check out the GitHub Discussions page for this project.