Lumina: Video & Image Search
January 1, 2026
Search engine for videos and images using OpenAI CLIP for vector embeddings. Indexed 2k+ YouTube videos (by frame) and 108k COCO images. Also includes Wikipedia search.
What is it?
I wanted to build a search engine that could find images and video frames from plain English without relying on tags. So instead of matching keywords against metadata, I converted both text and visual content into vectors and searched by similarity in embedding space.
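To make "searching by similarity in embedding space" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and file names are made up for illustration; real CLIP embeddings have hundreds of dimensions, and a vector database would replace the brute-force loop. Cosine similarity is assumed as the metric, which is the usual choice for CLIP but is not stated above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=3):
    """Return the top_k (item_id, score) pairs, best match first."""
    scored = [(item_id, cosine_similarity(query_vec, vec))
              for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for image embeddings (hypothetical values).
toy_index = {
    "dog_on_beach.jpg": [0.9, 0.1, 0.0],
    "city_at_night.jpg": [0.0, 0.2, 0.9],
    "puppy_in_park.jpg": [0.8, 0.3, 0.1],
}

# Pretend this is the text embedding of "a dog outside".
query = [1.0, 0.2, 0.0]
results = search(query, toy_index, top_k=2)
```

The query never has to share a word with a file name or tag: the two dog images rank first only because their vectors point in a similar direction to the query vector.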
The result is a system where I can type a scene description and get back matching images or video frames even if those exact words never appear anywhere in the source data. That was the part that made vector search feel genuinely powerful to me.
Two systems, one frontend
Under the hood, this was really two different retrieval systems sharing one UI. For image search, I used one CLIP-style model and one vector database. For video search, I used a different embedding model and a different database setup. The frontend hides that split, but the backend knows which pipeline to route each request through.
This was not the cleanest architecture in the world, but it reflected how the project actually evolved. I changed tools mid-build as I learned more, and instead of pretending that never happened, the repo still shows the transitions. That is pretty normal in personal projects once the experimentation gets real.
Video embedding pipeline
The video pipeline was one of the heavier data jobs I had built at that point. I scraped YouTube channel data with `yt-dlp`, downloaded short low-quality clips to keep the workload manageable, extracted frames with OpenCV, and then generated embeddings for those frames with CLIP.
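The frame extraction step could look roughly like the sketch below. The sampling interval (one frame every two seconds) and both function names are assumptions for illustration; the post does not say how frames were chosen. OpenCV is imported inside `extract_frames` so the index helper can be used without it installed.

```python
def sample_frame_indices(total_frames, fps, every_seconds=2.0):
    """Indices of the frames to keep when sampling one frame per interval."""
    if total_frames <= 0 or fps <= 0 or every_seconds <= 0:
        return []
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))

def extract_frames(video_path, every_seconds=2.0):
    """Yield (frame_index, frame) pairs from a video file using OpenCV."""
    import cv2  # lazy import: only needed when frames are actually read

    cap = cv2.VideoCapture(video_path)
    try:
        fps = cap.get(cv2.CAP_PROP_FPS)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        for idx in sample_frame_indices(total, fps, every_seconds):
            # Seek to the target frame, then decode it.
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                yield idx, frame
    finally:
        cap.release()
```

Sampling at a fixed interval instead of embedding every frame keeps the number of vectors per video small, which matters when the whole job has to fit on a free GPU.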
I also had to think about failure recovery. Embedding thousands of videos is not something you want to restart from zero every time a script dies. So I added checkpointing after batches of work and introduced delays between requests to avoid acting like a bot that deserved to be throttled. Most of the compute ran on Kaggle’s free T4 GPU, which made the whole thing feel like a puzzle in budget-aware engineering.
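A checkpoint-and-resume loop along those lines can be sketched with the standard library alone. The function names, the JSON file format, the batch size, and the delay are all illustrative assumptions, not the project's actual values; `process_one` stands in for whatever downloads and embeds a single video.

```python
import json
import os
import tempfile
import time

def load_checkpoint(path):
    """Return the set of IDs already processed, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as f:
        return set(json.load(f))

def save_checkpoint(path, done_ids):
    """Write to a temp file and rename, so a crash cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(sorted(done_ids), f)
    os.replace(tmp_path, path)

def process_all(video_ids, process_one, checkpoint_path,
                batch_size=50, delay_seconds=1.0):
    """Process pending IDs, saving progress after every batch."""
    done = load_checkpoint(checkpoint_path)
    pending = [v for v in video_ids if v not in done]
    for i, video_id in enumerate(pending, start=1):
        process_one(video_id)
        done.add(video_id)
        if i % batch_size == 0:
            save_checkpoint(checkpoint_path, done)
        time.sleep(delay_seconds)  # polite pause between requests
    save_checkpoint(checkpoint_path, done)
    return done
```

If the script dies partway through, rerunning `process_all` with the same checkpoint path skips everything recorded in the last saved batch, so at most `batch_size - 1` items are repeated.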
Multi-stage ranking
Pure vector search is cool, but I learned pretty quickly that it is not always enough on its own. If I searched for a person or channel by name, the nearest visual match was sometimes not the exact thing I wanted. So I added a second ranking step that boosts results when text-based clues line up too.
That made the system feel much smarter in practice. First I do vector retrieval to find visually or semantically similar candidates. Then I adjust scores based on metadata like captions or channel names. It is a simple reranking idea, but it helps a lot because the final answer is based on both meaning and context instead of just one embedding score.
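The two-stage idea can be sketched as a small reranking function. The field names (`caption`, `channel`), the additive boost, and its weight of 0.15 per matching word are assumptions chosen for illustration; the actual scoring formula is not given here.

```python
def rerank(candidates, query, boost=0.15):
    """Boost vector-search candidates whose metadata shares words with the query.

    candidates: list of dicts with 'id', 'score' (vector similarity),
    and optional 'caption' / 'channel' text fields.
    """
    query_terms = set(query.lower().split())
    reranked = []
    for cand in candidates:
        text = " ".join([cand.get("caption", ""), cand.get("channel", "")])
        text_terms = set(text.lower().split())
        overlap = len(query_terms & text_terms)
        final_score = cand["score"] + boost * overlap
        reranked.append({**cand, "final_score": final_score})
    reranked.sort(key=lambda c: c["final_score"], reverse=True)
    return reranked

# Stage 1 output: hypothetical candidates from vector retrieval.
candidates = [
    {"id": "frame_1", "score": 0.82,
     "caption": "man talking in studio", "channel": "Random Vlogs"},
    {"id": "frame_2", "score": 0.78,
     "caption": "interview clip", "channel": "Example Science Channel"},
]

# Stage 2: metadata matches lift frame_2 above the closer visual match.
ranked = rerank(candidates, "example science")
```

Here `frame_1` has the higher embedding score, but `frame_2` ends up first because two query words appear in its channel name.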
Infrastructure and the /proxy_image endpoint
I also ran into deployment details that sound small but matter a lot in production. Model downloads were heavy, so I baked the CLIP weights into the Cloud Run image instead of redownloading them on every deploy or cold start. That made the service much more predictable.
Another small but important fix was the `/proxy_image` endpoint. Some external image sources, especially thumbnails, are served without the CORS headers that browser-side requests need. Proxying those requests through my backend made the frontend far more reliable. It was one of those classic cases where a tiny endpoint removed a really annoying user-facing problem.
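The core of such a proxy can be sketched with the standard library. The web framework is not named here, so this shows only the fetch logic that a `/proxy_image` handler might call. The host allowlist, the function names, and the `User-Agent` string are assumptions added for illustration; the allowlist is there so the endpoint does not become an open proxy.

```python
from urllib.parse import urlparse
from urllib.request import Request, urlopen

# Hypothetical allowlist of image hosts the proxy is willing to fetch from.
ALLOWED_HOSTS = {"i.ytimg.com", "images.cocodataset.org"}

def is_allowed_image_url(url):
    """Accept only HTTPS URLs whose host is on the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS

def fetch_image(url, timeout=10):
    """Fetch an image server-side and return (bytes, content_type)."""
    if not is_allowed_image_url(url):
        raise ValueError(f"refusing to proxy: {url}")
    request = Request(url, headers={"User-Agent": "lumina-proxy-sketch"})
    with urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get(
            "Content-Type", "application/octet-stream")
        if not content_type.startswith("image/"):
            raise ValueError(f"not an image: {content_type}")
        return response.read(), content_type
```

A route handler would pass the requested URL to `fetch_image` and return the bytes with the same content type. Because the browser then loads the image from the app's own origin, cross-origin restrictions no longer apply.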
Key takeaways
- CLIP: dual encoder architecture, shared embedding space for text and images
- Kaggle free GPU workflow: T4, parallel workers across data slices, checkpoint/resume
- Qdrant vs Pinecone: self-hosted Docker vs managed cloud, the mid-project switch
- yt-dlp for structured YouTube data: flat playlist scraping, polite rate limiting
- Cloud Run base image pattern: pre-bake large model downloads to reduce deploy time