Technical Deep Dive
This page covers the engineering decisions behind Familiar Actors — what worked, what didn't, and why things are built the way they are.
The Embedding Pipeline
The core idea is simple: turn each actor's headshot into a list of 512 numbers (a "vector") that captures what they look like, then compare those lists to find similar-looking people.
Why CLIP, not a face recognition model?
Our first attempt used ArcFace via deepface — a model designed for face verification ("is this the same person?"). The results were terrible. Samuel L. Jackson's closest match was a young white woman named Camilla Belle. Known lookalike pairs like Bryce Dallas Howard and Jessica Chastain didn't match at all.
The problem: ArcFace is trained with angular margin loss that maximizes the distance between different identities. It's literally optimized to make different people look as different as possible. The similarity scores between different people were meaningless noise.
CLIP (Contrastive Language-Image Pre-training) captures holistic visual similarity — face shape, coloring, hair, expression, overall vibe. When someone says "that actor looks familiar," they're matching on this kind of high-level impression, not pixel-perfect facial geometry. Switching to CLIP immediately brought match scores from 35% to 89% for known similar actors.
Multi-photo averaging
A single headshot can be misleading — bad lighting, unusual angle, a styled look for a particular role. To get a more stable representation, we fetch up to 5 profile photos per actor from TMDB (ranked by community votes), generate an embedding for each, and average them. The math: L2-normalize each embedding so every photo contributes equally, compute the element-wise mean, then re-normalize the result back to the unit sphere.
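The averaging step described above can be sketched in a few lines of numpy. This is a minimal illustration of the normalize-average-renormalize sequence, not the project's actual code; the function name is ours.

```python
import numpy as np

def average_embeddings(embeddings: list[np.ndarray]) -> np.ndarray:
    """Average per-photo CLIP embeddings on the unit sphere.

    Each embedding is L2-normalized first so every photo contributes
    equally, then the element-wise mean is re-normalized to length 1.
    """
    unit = [e / np.linalg.norm(e) for e in embeddings]
    mean = np.mean(unit, axis=0)
    return mean / np.linalg.norm(mean)
```

Without the first normalization, a photo that happened to produce a large-magnitude embedding would dominate the average.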
Data Architecture
The system has three layers of data, each serving a different purpose:
The database (familiar_actors.db)
A SQLite database that stores metadata for every actor: their name, TMDB ID, image URLs, and file paths pointing to their headshots and embeddings. This is the source of truth — everything else can be regenerated from it (with time and API calls). It's small (~24MB) and gets committed to the repo as a backup.
Headshots (data/headshots/)
JPEG thumbnails (185px wide) downloaded from TMDB's image CDN. These serve two purposes: they're the input for CLIP embedding generation, and they could be used for local image display. The web app actually hotlinks TMDB's CDN for display (saving disk space on the server), so the local headshots are only needed when generating new embeddings. They're expendable — if deleted, they can be re-downloaded, but doing so takes days due to rate limits.
Embeddings (data/embeddings_clip/ and data/embeddings_avg/)
Each actor's visual representation as a 512-dimensional numpy array, saved as individual .npy files. embeddings_clip/ contains single-photo embeddings; embeddings_avg/ contains multi-photo averaged embeddings (higher quality, when available). The similarity index prefers averaged embeddings and falls back to single-photo.
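The prefer-averaged-with-fallback logic might look like this. A sketch only: the `<actor_id>.npy` file naming is an assumption, not confirmed by the source.

```python
from pathlib import Path
import numpy as np

def load_embedding(actor_id: int,
                   avg_dir: Path = Path("data/embeddings_avg"),
                   clip_dir: Path = Path("data/embeddings_clip")) -> np.ndarray:
    """Load an actor's embedding, preferring the multi-photo average.

    Assumes files are named <actor_id>.npy (hypothetical naming).
    """
    for directory in (avg_dir, clip_dir):
        path = directory / f"{actor_id}.npy"
        if path.exists():
            return np.load(path)
    raise FileNotFoundError(f"no embedding for actor {actor_id}")
```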
The consolidated index (embeddings_index.npy + embeddings_ids.json)
For deployment, all individual embedding files are consolidated into a single numpy array (~200MB) plus a JSON file mapping array positions to database IDs. This exists because Railway's volume has inode limits that prevented deploying 180k+ individual files. The consolidation script (consolidate_index.py) rebuilds these from the individual files whenever the dataset changes.
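The core of such a consolidation step is a stack-and-save. This is a hedged sketch of the idea, not the contents of consolidate_index.py; the `<database_id>.npy` naming is assumed.

```python
import json
from pathlib import Path
import numpy as np

def consolidate(embedding_dir: Path, out_array: Path, out_ids: Path) -> None:
    """Stack individual .npy embeddings into one matrix plus an ID map.

    Assumes each file is named <database_id>.npy (hypothetical naming).
    Row i of the matrix corresponds to entry i of the ID list.
    """
    files = sorted(embedding_dir.glob("*.npy"))
    matrix = np.stack([np.load(f) for f in files]).astype(np.float32)
    ids = [f.stem for f in files]
    np.save(out_array, matrix)
    out_ids.write_text(json.dumps(ids))
```

Keeping the ID list in the same order as the matrix rows is what lets a similarity score at position i be mapped back to a database record.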
The relationship
Database (source of truth)
↓ points to
Headshots (raw images) → fed into CLIP → Individual Embeddings
↓ consolidated into
Index Files (for Railway)
↓ loaded at startup
In-Memory Numpy Array (for search)
Locally, the app loads individual embedding files directly. On Railway, it loads the consolidated index. Either way, the end result is the same: a single numpy matrix in memory, ready for cosine similarity search.
Similarity Search
With 406,317 actor embeddings loaded into memory as a single numpy array, finding similar actors is a matrix multiplication:
similarities = embeddings @ query_vector
Since all embeddings are L2-normalized at load time, this dot product gives us cosine similarity directly. The result is a score for every actor in the database, computed in a single vectorized operation. Sorting and taking the top 10 is the slow part — and it's still under a millisecond.
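The whole search step fits in a few lines. A minimal sketch, assuming the rows and the query are already L2-normalized; using `argpartition` avoids fully sorting 400k scores just to get the top 10.

```python
import numpy as np

def top_k(embeddings: np.ndarray, query: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k rows most similar to `query`.

    Assumes rows of `embeddings` and `query` are L2-normalized,
    so the dot product is cosine similarity.
    """
    scores = embeddings @ query
    idx = np.argpartition(scores, -k)[-k:]      # top-k, unordered
    return idx[np.argsort(scores[idx])[::-1]]   # order by score, descending
```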
What is L2-normalization?
Each CLIP embedding is a vector of 512 numbers — but those numbers can have different magnitudes depending on the image. One photo might produce a vector with values ranging from -2 to 3, another from -0.5 to 0.8. If we compared these raw vectors, the magnitude would affect the similarity score — a "louder" vector would appear more similar to everything.
L2-normalization fixes this by scaling every vector to have a length (magnitude) of exactly 1. Think of it as projecting every point onto the surface of a sphere. After normalization, the only thing that differs between vectors is their direction — which is exactly what we care about. Two actors who "point" in similar directions in this 512-dimensional space look similar.
The math is straightforward: divide each vector by its L2-norm (the square root of the sum of its squared components). Once every vector sits on the unit sphere, the dot product between any two of them equals their cosine similarity — a value from -1 (opposite) to 1 (identical). This is why a single matrix multiplication gives us all the similarity scores at once.
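A tiny worked example makes the point concrete: two vectors pointing the same direction but with different magnitudes get a cosine similarity of exactly 1 once normalized.

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])   # raw vector, L2-norm = 3
b = np.array([2.0, 4.0, 4.0])   # same direction, twice the magnitude

a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# Same direction -> cosine similarity 1.0, despite different magnitudes
assert np.isclose(a_unit @ b_unit, 1.0)
```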
We also L2-normalize during multi-photo averaging: normalize each individual photo's embedding before averaging so that every photo contributes equally regardless of its raw magnitude, then re-normalize the average back onto the unit sphere.
For the current dataset size, plain numpy is more than fast enough. If the dataset grows to millions of actors, we'll swap the internals to FAISS (Facebook's similarity search library) for approximate nearest-neighbor search. The interface wouldn't change — just faster lookups.
Actor Name Search
The search bar uses a two-stage approach. First, a prefix match checks if any actor name starts with what you've typed — this is the fast path for normal autocomplete. If no prefix matches are found (likely a typo), it falls back to fuzzy matching via rapidfuzz's WRatio scorer, which handles character transpositions, partial matches, and token reordering.
All 406,323 actor names are loaded into memory at startup (~10MB). Both search stages operate entirely in-memory with no database queries, keeping autocomplete latency under 30ms.
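The two-stage shape of the search is easy to sketch. This version substitutes the standard library's difflib for rapidfuzz's WRatio scorer so it runs with no dependencies; the real fallback described above uses rapidfuzz, which is more robust to transpositions and token reordering.

```python
import difflib

def search_names(query: str, names: list[str], limit: int = 10) -> list[str]:
    """Two-stage name lookup: fast prefix match, fuzzy fallback for typos.

    Stage 2 here uses difflib as a stdlib stand-in for rapidfuzz.WRatio.
    """
    q = query.lower()
    prefix = [n for n in names if n.lower().startswith(q)]
    if prefix:
        return prefix[:limit]
    return difflib.get_close_matches(query, names, n=limit, cutoff=0.6)
```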
Data Collection
TMDB's free API provides the actor data, headshots, and movie/TV cast lists that power this app. Building the dataset involves several data sources, each with different coverage:
- Popular actors endpoint — the starting point, but caps at ~10k people and skews toward currently trending actors.
- Credits crawling — fetching cast lists from top-rated movies and TV shows. This is how we find character actors and supporting cast who would never appear on the "popular" list.
- Discover endpoint crawling — systematically iterating through decades of film and TV (1970–present), fetching the cast for every discovered title. This is the broadest source and runs as a background process over multiple days.
TMDB's rate limit of 40 requests per 10 seconds means the crawler respects a careful pace. Each movie requires one API call for credits, plus individual calls to download headshots. The current dataset took multiple days of continuous crawling to build.
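A sliding-window limiter is one common way to respect that kind of cap. A sketch of the pacing idea only; the crawler's actual throttling code isn't shown in the source.

```python
import time

class RateLimiter:
    """Cap calls at `limit` per `window` seconds (TMDB: 40 per 10s)."""

    def __init__(self, limit: int = 40, window: float = 10.0):
        self.limit = limit
        self.window = window
        self.timestamps: list[float] = []

    def wait(self) -> None:
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        self.timestamps = [t for t in self.timestamps if now - t < self.window]
        if len(self.timestamps) >= self.limit:
            # Sleep until the oldest call in the window expires
            time.sleep(self.window - (now - self.timestamps[0]))
        self.timestamps.append(time.monotonic())
```

Calling `limiter.wait()` before each API request keeps the crawler at or under the cap without hard-coding a fixed sleep between every call.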
Frontend Architecture
The UI is server-rendered HTML with HTMX for interactivity — no JavaScript framework, no build step, no client-side routing. When you click an actor card, HTMX sends a request to the server, receives a rendered HTML fragment, and swaps it into the page. The search input, tab switching, and autocomplete are vanilla JavaScript.
Browser history is managed via HTMX's hx-push-url attribute, which pushes each search URL to the browser's history stack. The server detects whether a request came from HTMX (partial page swap) or direct navigation (bookmark, back button) via the HX-Request header and returns either a fragment or a full page accordingly.
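The fragment-or-full-page decision boils down to one header check. A framework-agnostic sketch (HTMX sets `HX-Request: true` on its own requests; direct navigations won't carry it); the function name and signature are ours.

```python
def render_for(headers: dict, fragment: str, full_page: str) -> str:
    """Return a partial for HTMX swaps, a full page for direct navigation.

    HTMX adds the HX-Request header to requests it initiates;
    bookmarks and back/forward navigations do not include it.
    """
    if headers.get("HX-Request") == "true":
        return fragment
    return full_page
```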
Deployment
The app runs on Railway with a persistent volume for the SQLite database and consolidated embedding index. The production deploy is lightweight — it doesn't include PyTorch or the CLIP model, since those are only needed for generating embeddings (done locally). The similarity search just loads the pre-computed numpy arrays.
On first deploy, the app downloads a ~200MB tarball from a GitHub Release containing the database and consolidated embeddings. This self-populating approach means the deployment is a single git push — no manual data transfer needed.
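The self-populating startup check can be sketched as: if the data isn't on the volume yet, fetch and unpack it. The URL parameter and the use of the database file as the sentinel are assumptions for illustration.

```python
import tarfile
import urllib.request
from pathlib import Path

def ensure_data(data_dir: Path, release_url: str) -> None:
    """Download and unpack the data tarball on first boot, if missing.

    `release_url` is a placeholder for the GitHub Release asset URL;
    the presence of familiar_actors.db marks an already-populated volume.
    """
    if (data_dir / "familiar_actors.db").exists():
        return  # volume already populated, nothing to do
    data_dir.mkdir(parents=True, exist_ok=True)
    archive = data_dir / "data.tar.gz"
    urllib.request.urlretrieve(release_url, archive)
    with tarfile.open(archive) as tar:
        tar.extractall(data_dir)
    archive.unlink()  # drop the archive once extracted
```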
The inode lesson
Originally, each actor's embedding was stored as an individual .npy file — 180,000+ files totaling ~700MB. This worked locally but failed spectacularly on Railway, which has inode limits on its volumes. The solution was consolidating all embeddings into a single numpy array file (~200MB), bringing the file count from 180k down to 3 (database + embeddings array + ID mapping).
What's next
- Photo upload search — skip the name entirely: upload a screenshot from what you're watching and find similar actors directly. CLIP already embeds images, so this is architecturally straightforward; the hard part is the image upload and processing pipeline. A simple drag-and-drop interface would work fine, but snapping a picture of your TV from across the room with your phone is a more complicated story.
- "Known for" labels — show 1-2 movie/TV titles under each result so you can quickly identify where you've seen them.
- Larger dataset — the crawler is still running, systematically expanding coverage across decades of film and television.