#3

Real-time Streaming Search Engine

December 12, 2025

Python · FastAPI · Typesense · Groq · Gemini · SSE

Perplexity.ai-style search with token-by-token streaming answers. Hybrid search combining keyword + semantic + BM25, with clickable source citations. Cost ~$20 on GCP.

What is it?

This was my attempt at building a Perplexity-style search engine that does more than just return links. I wanted users to ask a question and get a streaming answer with real source citations, while still keeping the retrieval side grounded in actual indexed content. That pushed me into hybrid search, embeddings, streaming APIs, and search ranking logic.

What I like about this project is that the repo still shows the real build journey. It was not one clean implementation. It was multiple versions, failed attempts, migrations, and fixes stacked on top of each other until the whole thing actually worked.

The step-by-step build (what actually happened)

I built this in stages because the first approach kept breaking. First I indexed Wikipedia and generated embeddings with SentenceTransformers. Then I tried using Vertex AI embeddings with ChromaDB, but quota limits made that route painful almost immediately. So I changed the plan, created GPU VMs only when I needed batch embedding jobs, wrote the vectors out, stored the outputs, and then tore the machines down after the job finished.

That pattern ended up saving money and making the pipeline more repeatable. Later I did another embedding pass to fill gaps, rebuilt the search collection cleanly, and switched to a more production-ready API with server-sent event streaming. The final result looks polished, but the actual process was mostly me solving one bottleneck at a time.
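A minimal sketch of what a repeatable, resumable batch embedding job can look like, assuming the documents fit in a list and each has `id` and `text` keys. The function name `embed_in_shards`, the shard file naming, and the `embed_batch` callable are all hypothetical; on a GPU VM, `embed_batch` could be something like `SentenceTransformer("all-MiniLM-L6-v2").encode`. Uploading the shards to a bucket and deleting the VM are left out.

```python
import json
from pathlib import Path


def embed_in_shards(docs, embed_batch, out_dir, shard_size=1000):
    """Embed docs shard by shard, skipping shards that already exist on disk.

    docs:        list of dicts with "id" and "text" keys (assumed layout)
    embed_batch: callable taking a list of strings, returning one vector per string
    Returns the names of the shard files written during this run.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(docs), shard_size):
        shard_path = out_dir / f"shard-{start // shard_size:05d}.jsonl"
        if shard_path.exists():
            continue  # finished in an earlier run, so a re-run only fills gaps
        batch = docs[start:start + shard_size]
        vectors = embed_batch([doc["text"] for doc in batch])
        tmp_path = shard_path.with_name(shard_path.name + ".tmp")
        with tmp_path.open("w", encoding="utf-8") as fh:
            for doc, vec in zip(batch, vectors):
                row = {"id": doc["id"], "embedding": [float(x) for x in vec]}
                fh.write(json.dumps(row) + "\n")
        # Rename last, so only complete shards ever carry the final name.
        tmp_path.replace(shard_path)
        written.append(shard_path.name)
    return written
```

Because finished shards are skipped, the same script can be run again on a fresh VM after a crash or a quota error, and it only does the missing work.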

Hybrid search implementation

The interesting part here is that I do not trust just one retrieval method. Keyword search is great when the query uses exact words from the source text. Vector search is better when the wording is different but the meaning is close. So I run both for the same query and then merge the ranked results.

I used Reciprocal Rank Fusion for that merge. In simple terms, documents get a score based on how high they appear in each list, and documents that rank well in both lists rise to the top. I weighted lexical search a bit more heavily than semantic search because exact matches still matter a lot for factual questions. This made the final results feel much more balanced than pure vector search or pure keyword search on their own.
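Here is a small sketch of weighted Reciprocal Rank Fusion over two ranked lists of document ids. The function name, the `alpha = 0.6` lexical weight, and the smoothing constant `k = 60` (a commonly used default) are illustrative assumptions, not the exact values from the project.

```python
from collections import defaultdict


def weighted_rrf(lexical_ids, semantic_ids, alpha=0.6, k=60):
    """Fuse two ranked lists of doc ids with weighted Reciprocal Rank Fusion.

    alpha is the weight on the lexical list; the semantic list gets 1 - alpha.
    Each list contributes weight / (k + rank) for every document it contains.
    """
    scores = defaultdict(float)
    for weight, ranking in ((alpha, lexical_ids), (1.0 - alpha, semantic_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


lexical = ["doc_a", "doc_b", "doc_c"]   # keyword ranking, best first
semantic = ["doc_b", "doc_d", "doc_a"]  # vector ranking, best first
fused = weighted_rrf(lexical, semantic)
print([doc_id for doc_id, _ in fused])  # ['doc_b', 'doc_a', 'doc_c', 'doc_d']
```

`doc_b` wins even though it is not first in the lexical list, because it ranks near the top of both lists. Documents that appear in only one list still make it through, just lower down.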

SSE streaming

I wanted the answer to feel alive while it was being generated, so I streamed the response token by token using SSE instead of waiting for the full answer. The backend emits chunks as the model generates them, and the browser uses `EventSource` to append them in real time.
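To make the wire format concrete, here is a minimal sketch of turning model tokens into SSE frames. The helper names and the JSON payload shape (`{"token": ...}` plus a final `done` event) are assumptions for illustration. In a FastAPI app, a generator like this could be handed to `StreamingResponse` with `media_type="text/event-stream"`.

```python
import json


def sse_event(data, event=None):
    """Format one Server-Sent Events frame: optional event name, data line, blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # JSON-encode the payload so a newline inside a token cannot break the frame.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


def stream_answer(tokens):
    """Yield one SSE frame per token, then a final named 'done' event."""
    for token in tokens:
        yield sse_event({"token": token})
    yield sse_event({}, event="done")


frames = list(stream_answer(["Hybrid", " search"]))
print(frames[0], end="")  # data: {"token": "Hybrid"}
```

On the browser side, unnamed frames arrive through `EventSource.onmessage`, while the named `done` event needs its own `addEventListener("done", ...)` handler, which is a convenient place to close the connection.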

I liked SSE here because it solved the exact problem without dragging in WebSockets. This is one-way server-to-client streaming, and SSE is perfect for that. It is simpler to deploy, simpler to reason about, and good enough for a search engine answer stream. Gemini was the main model, and I kept Groq as a fallback when the primary path failed.
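The primary-plus-fallback idea can be sketched without any real SDK calls. In this sketch, each provider is just a callable that takes a prompt and returns an iterator of tokens; the function name, the provider labels, and the two fake generators are hypothetical stand-ins for the actual Gemini and Groq clients.

```python
def stream_with_fallback(prompt, providers):
    """Yield tokens from the first provider that starts streaming successfully.

    providers: list of (name, generate) pairs, where generate(prompt) returns
    an iterator of tokens. A provider that fails before emitting anything is
    skipped; a failure mid-stream is re-raised so two models' answers are
    never stitched together.
    """
    errors = []
    for name, generate in providers:
        emitted = False
        try:
            for token in generate(prompt):
                emitted = True
                yield token
            return
        except Exception as exc:
            if emitted:
                raise
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_primary(prompt):
    raise TimeoutError("primary unavailable")
    yield  # unreachable, but makes this a generator function


def backup(prompt):
    yield from ["fallback", " answer"]


tokens = list(stream_with_fallback(
    "what is RRF?", [("gemini", flaky_primary), ("groq", backup)]
))
print(tokens)  # ['fallback', ' answer']
```

Only falling back before the first token is a design choice in this sketch. Once text has reached the user, switching models mid-answer tends to produce a worse result than surfacing the error.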

Vector infrastructure and abandoned approaches

One of the more honest parts of this project is how many vector approaches I tried before settling down. I experimented with ChromaDB, LanceDB, Pinecone, and Qdrant in different projects and half-projects. Some had quota issues, some felt annoying to operate, and some were just extra moving parts I did not want to keep carrying.

Eventually I liked the idea of Typesense handling both text and vector retrieval in one place. That meant fewer services to maintain, fewer sync problems between systems, and one simpler deployment story. Sometimes the best architecture is not the most advanced stack. It is the one you can actually operate consistently.
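To show what "one place" means in practice, here is a sketch of a request body for Typesense's `multi_search` endpoint that carries a keyword query and a vector query together. The collection name `wiki` and the field names `title`, `text`, and `embedding` are assumptions; the real schema may differ. The function only builds the payload, which would then be POSTed to `/multi_search`.

```python
import json


def build_hybrid_searches(query, query_vector, collection="wiki", k=20):
    """Build a Typesense multi_search body with one lexical and one vector search."""
    vector_literal = ",".join(f"{x:.6f}" for x in query_vector)
    return {
        "searches": [
            {   # Lexical search over the text fields (assumed field names).
                "collection": collection,
                "q": query,
                "query_by": "title,text",
                "per_page": k,
            },
            {   # Pure vector search: wildcard q plus a nearest-neighbour query.
                "collection": collection,
                "q": "*",
                "vector_query": f"embedding:([{vector_literal}], k:{k})",
                "per_page": k,
            },
        ]
    }


body = build_hybrid_searches("who invented radar", [0.1, -0.2], k=2)
print(json.dumps(body["searches"][1]))
```

The response comes back as one result set per search, in the same order, so the two hit lists can be fused on the application side.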

Key takeaways

  • Hybrid search: RRF formula with alpha weighting, lexical vs semantic signal tradeoffs
  • GPU cost management: create VM for batch embeddings, delete when done, re-import from GCS
  • Typesense multi-search API: running lexical and vector queries in one round-trip
  • SentenceTransformers all-MiniLM-L6-v2: 384-dim, fast CPU inference, good quality tradeoff
  • Atomic collection swap in Typesense: alias pointing, zero-downtime schema migration
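
The atomic collection swap from the last takeaway boils down to a short sequence of Typesense REST calls. This sketch only plans the calls and sends nothing; the function name and the collection and alias names are hypothetical. The application always queries the alias, so repointing it is the single step that switches traffic.

```python
def plan_alias_swap(alias, new_collection, old_collection=None):
    """Return the ordered REST calls for a zero-downtime collection swap."""
    steps = [
        {"method": "POST", "path": "/collections",
         "note": f"create {new_collection} with the new schema"},
        {"method": "POST", "path": f"/collections/{new_collection}/documents/import",
         "note": "bulk-import documents as JSONL"},
        # Upserting the alias is the switch: readers move to the new collection.
        {"method": "PUT", "path": f"/aliases/{alias}",
         "body": {"collection_name": new_collection}},
    ]
    if old_collection:
        steps.append({"method": "DELETE", "path": f"/collections/{old_collection}",
                      "note": "drop the old collection once nothing reads it"})
    return steps


for step in plan_alias_swap("wiki", "wiki_v2", old_collection="wiki_v1"):
    print(step["method"], step["path"])
```

Keeping the old collection around until the new one has been verified also gives a cheap rollback: point the alias back.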