MiniGPT: GPT-2 from Scratch

January 24, 2026

PyTorch · RunPod · GPT-2 · Transformers

Trained a GPT-2-style language model from scratch on RunPod (~$15). Strong at language modeling and factual recall, weaker at reasoning and math. Later fine-tuned into a chatbot.

What is it?

I wanted to stop just using language models and actually train one from scratch. Not fine-tuning. Not adapter tuning. Real pre-training from random initialization. I trained a model at GPT-2 small scale, with 117M parameters, mainly because I wanted to understand the full pipeline by doing it myself.

That meant I had to care about tokenization, model architecture, training stability, optimizer settings, checkpointing, GPU cost, and what the model can and cannot learn at that size. It was one of the most educational projects I have built.

The architecture

I followed the general GPT-2 small shape: 12 transformer blocks, 12 attention heads, and 768-dimensional embeddings. Each block applies layer norm, then self-attention, and adds the result back through a residual connection. It then applies a second layer norm and a feed-forward network, with another residual connection around them.
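To make that shape concrete, here is a rough parameter count for a model with those dimensions. The post does not state the vocabulary size, context length, feed-forward width, or whether the output head shares weights with the token embedding, so the defaults below (50,257 tokens, 1,024 positions, a 4x feed-forward layer, tied embeddings) are assumptions borrowed from the standard GPT-2 small configuration. With them the total comes out near 124M, so the exact figure for this project depends on choices not listed here.

```python
def gpt2_param_count(n_layer=12, d_model=768, vocab_size=50257, n_ctx=1024):
    """Count parameters of a GPT-2-style decoder with tied input/output embeddings.

    vocab_size, n_ctx and the 4x feed-forward width are assumed values,
    not figures taken from the post.
    """
    tok_emb = vocab_size * d_model                 # token embedding table
    pos_emb = n_ctx * d_model                      # learned position embeddings
    layer_norm = 2 * d_model                       # gain + bias
    attn = d_model * 3 * d_model + 3 * d_model     # fused Q, K, V projection
    attn += d_model * d_model + d_model            # attention output projection
    d_ff = 4 * d_model
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    block = 2 * layer_norm + attn + mlp            # one transformer block
    return tok_emb + pos_emb + n_layer * block + layer_norm  # + final layer norm


total = gpt2_param_count()
print(f"{total:,}")  # 124,439,808
```

Most of the budget sits in the 12 blocks (about 85M), with the token embedding table taking most of the rest, which is why vocabulary size moves the total so much at this scale.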

What made this valuable was not just copying the shape. It was finally seeing how the pieces fit together in code. The architecture stopped feeling like a diagram from a paper and started feeling like something I could actually inspect, debug, and train.

Under the hood: multi-head self-attention

Self-attention clicked for me once I stopped treating it like magic. For every token, the model computes queries, keys, and values. A query is basically what this token is looking for. A key is what another token offers. A value is the information that gets passed along if there is a match. The attention weights decide which tokens matter most for the current token.
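The query/key/value description above can be written out as a minimal single-head sketch in plain Python. It assumes the inputs are already projected into Q, K, and V, and it applies a causal mask (each token only attends to itself and earlier tokens), which is the usual choice for GPT-style models but is not spelled out in the post. A real implementation would use batched tensor operations instead of loops.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V are lists of per-token vectors (one list of floats per token).
    Returns the attended outputs and the attention weights per token.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Token i scores itself and every earlier token; later tokens are masked out.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors the token is allowed to see.
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(i + 1)) for c in range(len(V[0]))]
        )
    return outputs, weights


# Toy example: three tokens with 2-dimensional vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = causal_attention(Q, K, V)
for row in attn:
    print([round(x, 3) for x in row])
```

Each row of weights sums to 1, and the first token can only attend to itself, so its output is exactly its own value vector.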

Multi-head attention just means the model runs that same computation several times in parallel, each in its own smaller subspace. One head might focus more on nearby syntax. Another might capture longer-range relations. After that, the outputs get concatenated back together. Dividing the scores by `sqrt(d_k)` keeps the dot products from growing with the head dimension, which would otherwise make the softmax too sharp too early and leave very little gradient to learn from.
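Two small illustrations of that paragraph, as a sketch rather than the project's actual code. The first shows how a 768-dimensional vector divides into 12 heads of 64 dimensions and concatenates back. The second draws random unit-variance vectors (an assumption standing in for real queries and keys) and compares the spread of raw dot products with the spread after dividing by `sqrt(d_k)`.

```python
import math
import random
import statistics

D_MODEL, N_HEAD = 768, 12
D_K = D_MODEL // N_HEAD  # 64 dimensions per head


def split_heads(vec):
    """Cut one D_MODEL-dim vector into N_HEAD contiguous D_K-dim slices."""
    return [vec[h * D_K:(h + 1) * D_K] for h in range(N_HEAD)]


def merge_heads(heads):
    """Concatenate per-head vectors back into one D_MODEL-dim vector."""
    return [x for head in heads for x in head]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]


random.seed(0)  # fixed seed so the comparison is repeatable

raw_scores = [dot(random_vec(D_K), random_vec(D_K)) for _ in range(2000)]
scaled_scores = [s / math.sqrt(D_K) for s in raw_scores]

std_raw = statistics.pstdev(raw_scores)
std_scaled = statistics.pstdev(scaled_scores)
print(f"std of raw scores:    {std_raw:.2f}")     # close to sqrt(64) = 8
print(f"std of scaled scores: {std_scaled:.2f}")  # close to 1
```

Scores with a standard deviation around 8 push the softmax toward near one-hot outputs, while scores with a standard deviation around 1 leave it room to spread attention across tokens.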

Training from scratch on RunPod

The training objective was next-token prediction. The model sees a sequence and learns to guess the following token. If you do that over enough text, the model slowly absorbs grammar, patterns, and some factual structure as a side effect of compression.
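As a small sketch of that objective: the targets are just the inputs shifted by one position, and the loss at each position is the cross-entropy between the model's logits and the true next token. The token IDs and the 50,257-token vocabulary below are illustrative assumptions, not values from the post.

```python
import math


def next_token_pairs(token_ids):
    """Inputs are every token but the last; targets are the same sequence shifted left."""
    return token_ids[:-1], token_ids[1:]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


tokens = [5, 2, 9, 2, 7]  # made-up token IDs
inputs, targets = next_token_pairs(tokens)
print(inputs, targets)  # [5, 2, 9, 2] [2, 9, 2, 7]

# An untrained model that spreads probability evenly over the vocabulary
# has a loss of ln(vocab_size), a handy sanity check for the first steps.
vocab_size = 50257  # assumed GPT-2 vocabulary size
uniform_logits = [0.0] * vocab_size
print(round(cross_entropy(uniform_logits, target=7), 2))  # 10.82
```

If the first logged training loss is far from `ln(vocab_size)`, something is usually wrong with initialization or with how inputs and targets are aligned.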

I trained on RunPod because it gave me straightforward GPU access without too much setup overhead. To make the run fit in memory and stay stable, I used mixed precision, gradient checkpointing, AdamW, and a cosine learning rate schedule. Mixed precision and gradient checkpointing cut memory use, while AdamW and the cosine schedule keep optimization well behaved. Those are the kinds of settings that sound like random ML jargon until you need them to stop your training run from exploding or costing way more than necessary.
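Of those settings, the cosine schedule is the easiest to show in isolation. The sketch below is a plain-Python version with a linear warmup in front of it. The warmup phase and every number in it (peak and floor learning rates, step counts) are hypothetical placeholders, since the post does not give the actual hyperparameters.

```python
import math


def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=2000, total_steps=100_000):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr.

    All default values are illustrative assumptions, not the project's settings.
    """
    if step < warmup_steps:
        # Ramp up linearly so early updates on a random model stay small.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + cosine * (max_lr - min_lr)


for step in (0, 1000, 2000, 51_000, 100_000):
    print(step, f"{lr_at(step):.2e}")
```

In a training loop, the returned value would be written into the optimizer's learning rate before each update. The slow decay toward a small floor lets the model take large steps early and settle later in the run.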

Fine-tuning into a chatbot

After pre-training, I fine-tuned the model on instruction-style data so it would behave more like a chatbot. Pre-training gives the model raw language competence. Fine-tuning teaches it how to answer in a format people actually want.

That distinction became really obvious after building both stages myself. A base model can know language patterns and still be awkward to interact with. Instruction tuning is what turns that raw ability into something that feels usable in a chat interface.
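One way to picture the fine-tuning data, as a sketch only: each example wraps an instruction and a response in a fixed chat template, and the loss is computed only on the response tokens. The template, the end-of-turn marker, the whitespace "tokenizer", and the loss masking are all hypothetical choices made for illustration. The post does not describe the format that was actually used.

```python
# Hypothetical chat template; the real project may have used a different one.
PROMPT_TEMPLATE = "### User:\n{instruction}\n\n### Assistant:\n"
END_OF_TURN = "<|endoftext|>"  # assumed end-of-response marker


def build_example(instruction, response, tokenize=str.split):
    """Return (tokens, loss_mask) for one instruction/response pair.

    `tokenize` defaults to whitespace splitting as a stand-in for a real
    BPE tokenizer. The mask is 1 on response tokens, 0 on prompt tokens.
    """
    prompt_tokens = tokenize(PROMPT_TEMPLATE.format(instruction=instruction))
    response_tokens = tokenize(response) + [END_OF_TURN]
    tokens = prompt_tokens + response_tokens
    # Train only on the response so the model learns to answer, not to
    # reproduce the user's side of the conversation.
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


tokens, mask = build_example("What is BPE?", "A subword tokenizer.")
for tok, m in zip(tokens, mask):
    print(m, tok)
```

The training objective is still next-token prediction. What changes is the data: every sequence now follows the same question-then-answer layout, which is how the model picks up the chat format.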

Key takeaways

Transformer architecture from scratch: attention, residual connections, layer norm placement
Multi-head self-attention: QKV projections, scaled dot-product, head concatenation
Byte-pair encoding tokenization: vocab building, encoding efficiency
Gradient checkpointing and mixed precision: training 117M params on a single GPU
Pre-training vs fine-tuning: what each phase teaches the model
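The byte-pair encoding takeaway can be made concrete with a toy vocabulary builder. This is a character-level sketch with a made-up word-frequency table. GPT-2's own tokenizer works on bytes and adds pre-tokenization rules, and the post does not say which tokenizer this project used, so treat the details as assumptions.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most common adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} table."""
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)
```

After two merges, the frequent stem "low" has become a single symbol, so the three words encode as 1, 3, and 4 symbols instead of 3, 5, and 6. That is the encoding-efficiency gain in miniature.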
