Search Engine
Late November 2025
My very first project: a full search engine built from the ground up, with keyword indexing, ranking, and a clean frontend.
What is it?
This was my very first serious project, and I didn’t want it to be just another search box that throws links back at you. I wanted to ask a question in plain English and actually get an answer. So I built a Wikipedia search engine that supports two flows: a normal keyword search mode for speed, and an AI ask mode that reads the retrieved results and writes a grounded response.
The core idea was simple: first find relevant Wikipedia pages, then let the model explain them instead of hallucinating from memory. That one idea pushed me into search infrastructure, indexing, retrieval, prompt construction, and deployment way earlier than I expected.
How it works
I kept the backend simple on purpose. The FastAPI app exposes two main endpoints. `/search` goes straight to Typesense and returns regular keyword matches as fast as possible. `/ask` adds one step: it first queries Typesense, takes the top few Wikipedia snippets, and inserts those snippets into an LLM prompt so the answer is based on retrieved text rather than on whatever the model happens to remember.
In practice, the flow is: user asks a question, backend retrieves top documents, trims them down to a small context window, and sends that context to Gemini. If Gemini fails, I fall back to Groq. I liked this design because each piece had one job. Typesense handles retrieval. The LLM handles explanation. FastAPI just glues the pieces together.
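The failover step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `generate_with_fallback`, `call_gemini`, and `call_groq` are hypothetical names, and the two provider functions are stand-ins that simulate an outage and a successful reply instead of calling the real APIs.

```python
from typing import Callable, Sequence

# A provider is any function that turns a prompt into an answer string.
Provider = Callable[[str], str]


def generate_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try each provider in order; return (name, answer) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-ins for the real Gemini and Groq client calls (hypothetical).
def call_gemini(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def call_groq(prompt: str) -> str:
    return f"answer based on: {prompt}"


provider_used, answer = generate_with_fallback(
    "What is Typesense?", [("gemini", call_gemini), ("groq", call_groq)]
)
print(provider_used)  # groq
```

Keeping the providers in an ordered list means a third fallback could be added without touching the loop.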
The dataset: from 1M docs to 100K
At first I tried to be ambitious and index around a million Wikipedia articles. That failed pretty quickly because my machine and deployment budget were nowhere near big enough for that much data. Typesense started eating too much RAM, and I had to stop pretending the hardware problem would magically solve itself.
So I made a practical compromise. I filtered Wikipedia down to the top 100K most-linked articles. My reasoning was that if a page gets linked a lot, it is probably more central and more useful for general questions. Indexing roughly a tenth of the articles cut the memory footprint sharply while still keeping the dataset useful. It was one of my first real lessons that engineering is usually not about doing the biggest thing, but about doing the biggest thing your infrastructure can actually survive.
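As a rough sketch of the "most-linked" filter: the post does not say how link counts were computed, so this assumes a list of `(source_title, target_title)` link pairs is available. The function name `top_linked` and the toy data are made up for illustration.

```python
import heapq
from collections import Counter
from typing import Iterable


def top_linked(links: Iterable[tuple[str, str]], n: int) -> list[str]:
    """Return the n titles with the most inbound links.

    `links` is an iterable of (source_title, target_title) pairs.
    """
    inbound = Counter(target for _source, target in links)
    # Rank by count descending, then title ascending so ties are deterministic.
    ranked = heapq.nsmallest(n, inbound.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _count in ranked]


links = [
    ("Paris", "France"),
    ("Lyon", "France"),
    ("Berlin", "Germany"),
    ("Paris", "Europe"),
    ("Berlin", "Europe"),
    ("France", "Europe"),
]
print(top_linked(links, 2))  # ['Europe', 'France']
```

With real data, `n` would be 100,000 and only the returned titles would be sent to the indexer.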
RAG architecture
This project was my first real hands-on introduction to retrieval-augmented generation. Instead of asking the model a question with no context, I first search a real document collection and then pass the best snippets into the prompt. That sounds obvious now, but building it myself made the value click instantly.
I kept the retrieved context small so the system stayed cheap and fast. The backend takes only the top few snippets, trims them, and feeds them to the model with a prompt that basically says: answer using this context. That does two things. First, it reduces hallucinations because the model has something concrete to read. Second, the answer reads as something built on sources rather than on the model's unfounded confidence.
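A minimal sketch of that trimming and prompt-building step might look like this. The limits (`top_k=3`, `max_chars=400`), the numbered snippet labels, and the exact prompt wording are assumptions for illustration. The post only says the top few snippets are trimmed and passed in as context.

```python
def build_prompt(
    question: str, snippets: list[str], top_k: int = 3, max_chars: int = 400
) -> str:
    """Keep the first top_k snippets, cap each at max_chars, wrap in a grounded prompt."""
    trimmed = [s.strip()[:max_chars] for s in snippets[:top_k]]
    # Number the snippets so the model (and a reader) can tell them apart.
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(trimmed, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "Who designed the Eiffel Tower?",
    ["The Eiffel Tower is a wrought-iron lattice tower in Paris.", "Paris is the capital of France."],
)
print(prompt)
```

Capping both the number of snippets and their length keeps the prompt size predictable, which is what keeps the `/ask` path cheap.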
Why systemd instead of Cloud Run?
I originally liked the idea of using a fully managed deployment, but Typesense needs persistent local storage and behaves much better when it lives on a box that stays around. Cloud Run is great for stateless APIs. A search engine with an on-disk index is a different story.
So I deployed it on a regular GCP VM and used `systemd` to keep the services alive. The unit file was basic: start the service on boot, restart on failure, keep logs easy to inspect. It was not flashy, but it matched the workload. That mattered more than choosing the most modern-sounding platform.
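For reference, a unit file along those lines could look like the sketch below. The service name, user, paths, and the `uvicorn main:app` entry point are all hypothetical. Only the general shape (start on boot, restart automatically, logs through the journal) comes from the description above.

```ini
# /etc/systemd/system/searchapi.service  (hypothetical name and paths)
[Unit]
Description=Search API (FastAPI)
After=network-online.target
Wants=network-online.target

[Service]
# Assumed layout: app in /opt/searchapp with a virtualenv, run by a dedicated user.
User=searchapp
WorkingDirectory=/opt/searchapp
ExecStart=/opt/searchapp/venv/bin/uvicorn main:app --host 127.0.0.1 --port 8000
# Restart whenever the process exits, waiting 5 seconds between attempts.
Restart=always
RestartSec=5

[Install]
# Enabling the unit under this target starts it on boot.
WantedBy=multi-user.target
```

With a unit like this, `sudo systemctl enable --now searchapi` would start the service and register it for boot, and `journalctl -u searchapi` would show its logs.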
Key takeaways
- RAG pipeline: how to combine keyword retrieval with LLM generation for grounded answers
- Typesense collection management: schema definition, memory constraints, rebuilding collections
- Dataset size vs infrastructure limits: reducing 1M Wikipedia docs to 100K to fit in VM RAM
- `systemd` service deployment: unit files, `Restart=always`, running on GCP VMs
- Gemini + Groq failover pattern for LLM availability