Python Code Generation Assistant

December 16, 2025

Python, Hugging Face, QLoRA, RunPod

I fine-tuned an LLM to generate clean Python code snippets. It was my first real fine-tuning experiment, and I tried GCP, Vultr, and RunPod before getting it right.

What is it?

For this project, I fine-tuned Mistral-7B to generate Python code and then exposed it through a simple REST API. I wanted to understand the full path from model adaptation to deployment, not just run inference on a hosted API. That meant dealing with adapter training, memory constraints, GPU selection, prompt formatting, and serving on actual infrastructure.

What made this project fun was that it felt like a full-stack ML system. Not just training, and not just frontend. I had to care about the model, the serving layer, cold starts, GPU cost, and how the API would actually be used.

The fine-tuning config

I used Mistral-7B-v0.1 as the base model and trained it with LoRA adapters instead of full fine-tuning. The adapter setup used rank 16, alpha 32, and dropout 0.05. I targeted the main attention and MLP projection layers so the adapter had enough capacity to meaningfully change behavior without touching the full base weights.
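As a rough sketch, the adapter settings above could be written down like this for the PEFT library. The rank, alpha, and dropout come from the text. The exact module names, `bias`, and `task_type` are my assumptions about what "main attention and MLP projection layers" maps to on Mistral, so treat them as illustrative rather than the exact config in the repo.

```python
# Hyperparameters stated in the post: rank 16, alpha 32, dropout 0.05.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    # Assumption: no bias training and a causal language modeling task.
    "bias": "none",
    "task_type": "CAUSAL_LM",
    # Assumption: Mistral's attention and MLP projection module names.
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}


def make_lora_config():
    """Build a peft LoraConfig; peft is imported lazily so this file loads without it."""
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)
```

With PEFT installed, the result of `make_lora_config()` is what would be passed to `get_peft_model` together with the base model.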

That setup mattered because I was trying to hit a good tradeoff: strong enough adaptation to produce cleaner Python code, but light enough to train without needing a giant multi-GPU setup. The nice part is that the trained adapter is real and sits in the repo, so this was not just me describing a training setup in theory.
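To put a number on "light enough", here is a back-of-the-envelope count of the trainable adapter parameters. It assumes the adapter covers all seven attention and MLP projections in every layer (my reading of the setup, not something the repo is quoted on here) and uses the layer shapes from the published Mistral-7B-v0.1 config.

```python
RANK = 16
NUM_LAYERS = 32  # decoder layers in Mistral-7B-v0.1

# (in_features, out_features) per projection: hidden size 4096, MLP size 14336,
# and 8 key/value heads of dimension 128 (so k_proj and v_proj output 1024).
PROJECTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

BASE_PARAMS = 7_241_732_096  # total parameters in the base model


def lora_params(in_features, out_features, rank):
    # LoRA adds two small matrices per frozen weight: A (rank x in) and B (out x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTION_SHAPES.values())
trainable = per_layer * NUM_LAYERS

print(f"{trainable:,} trainable adapter parameters")  # 41,943,040
print(f"{100 * trainable / BASE_PARAMS:.2f}% of the base model")
```

Under those assumptions the adapter is well under 1% of the base model, which is why the optimizer state and gradients stay small.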

Under the hood: QLoRA

QLoRA is what made this practical for me. Fully fine-tuning a 7B model would have needed far more VRAM than I wanted to pay for: the 16-bit weights alone are roughly 14 GB, before gradients and optimizer state. With QLoRA, I load the base model in 4-bit form so the frozen weights take much less memory, and then I train only the small LoRA adapter layers on top.
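The memory argument can be made concrete with some simple arithmetic. This is only an estimate: it ignores activations, the quantization constants, and the layers that typically stay in higher precision, and the 16 bytes per parameter figure for full fine-tuning is a common rule of thumb for mixed-precision Adam, not a measurement from this project.

```python
PARAMS = 7_241_732_096  # parameter count of Mistral-7B-v0.1


def gigabytes(num_bytes):
    return num_bytes / 1e9


# Frozen weights stored at 2 bytes each (fp16 or bf16).
fp16_weights = gigabytes(PARAMS * 2)

# Frozen weights stored at 4 bits each; a lower bound, since quantization
# constants and any non-quantized layers add to this.
nf4_weights = gigabytes(PARAMS * 0.5)

# Rule of thumb for full fine-tuning with mixed-precision Adam:
# fp16 weights and gradients (2 + 2 bytes) plus fp32 master weights
# and two optimizer moments (4 + 4 + 4 bytes) per parameter.
full_finetune = gigabytes(PARAMS * 16)

print(f"16-bit weights:        {fp16_weights:.1f} GB")
print(f"4-bit weights:         {nf4_weights:.1f} GB")
print(f"full fine-tune (est.): {full_finetune:.1f} GB")
```

The gap between the last two lines is the whole point of QLoRA: the frozen base shrinks to a few gigabytes, and only the small adapter needs gradients and optimizer state.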

In simple terms, the huge model stays frozen, and I only learn a lightweight set of changes. That is why I could get this working under much smaller GPU limits. At inference time, the server loads the quantized base model, attaches the adapter to it, and generates normally. It is one of those ideas that sounds complicated until you see the workflow once, and then it becomes very intuitive.
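The inference-time workflow looks roughly like the sketch below, using Transformers, bitsandbytes, and PEFT. The adapter path is a placeholder, and the bf16 compute dtype and `device_map="auto"` are my assumptions rather than settings confirmed by the repo. Running it needs a CUDA GPU with those libraries installed.

```python
BASE_MODEL_ID = "mistralai/Mistral-7B-v0.1"
ADAPTER_DIR = "./adapter"  # hypothetical path to the trained LoRA adapter


def load_model(base_model_id=BASE_MODEL_ID, adapter_dir=ADAPTER_DIR):
    """Load the 4-bit base model and attach the LoRA adapter on top of it."""
    # Heavy imports live here so this module can be imported without a GPU stack.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: bf16 compute
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        quantization_config=quant_config,
        device_map="auto",  # assumption: let Accelerate place the weights
    )
    # The adapter weights are loaded next to the frozen 4-bit weights.
    model = PeftModel.from_pretrained(base, adapter_dir)
    model.eval()
    return model, tokenizer
```

The base weights stay quantized and frozen here; the adapter contributes its low-rank update during the forward pass.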

The platform journey: GCP to Vultr to RunPod to GCP Cloud Run

The platform side was way messier than the model side. On Vertex AI, I ran into quota issues. Vultr did not feel great for the exact training workflow I wanted. RunPod ended up being the most practical option because I could get GPU access quickly, launch a Jupyter environment, run the training job, and stop paying when I was done.

For serving, I came back to GCP because Cloud Run GPU support made the deployment story much cleaner than I expected. I could package the inference app, push it through Artifact Registry, and deploy it with an L4 GPU. That split ended up making sense: train where GPU access is easiest, serve where operations are easiest.

Inference API

The API itself is simple. Clients send a prompt and a few generation settings like `max_tokens`, `temperature`, and `top_p`. On the server side, I wrap the prompt into the instruction format the model saw during fine-tuning, because keeping the inference prompt close to training format usually gives cleaner behavior.
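Here is a minimal sketch of that request handling, using only the standard library. The field names match the ones above, but the default values, the validation rules, and the Alpaca-style template are hypothetical; the real template should be whatever the training data used.

```python
from dataclasses import dataclass

# Hypothetical instruction template; swap in the format used during fine-tuning.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


@dataclass
class GenerateRequest:
    prompt: str
    max_tokens: int = 256      # assumed default
    temperature: float = 0.2   # assumed default
    top_p: float = 0.95        # assumed default

    def __post_init__(self):
        # Reject obviously bad settings before they reach the model.
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if self.max_tokens < 1:
            raise ValueError("max_tokens must be at least 1")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")


def build_prompt(request):
    """Wrap the raw user prompt in the instruction template."""
    return PROMPT_TEMPLATE.format(instruction=request.prompt.strip())


req = GenerateRequest(prompt="Write a function that reverses a string.")
print(build_prompt(req))
```

Keeping the template in one constant makes it harder for the serving code and the training data to drift apart.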

I also learned that serving models is not just about getting a response. Model load time matters. Cold start matters. GPU memory footprint matters. The first request is always slower because the model has to load, but after that the service settles and generation becomes reasonable. That is the kind of detail you do not appreciate until you deploy your own model instead of just calling someone else’s endpoint.
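The cold-start behavior is easy to reproduce in miniature. In this sketch, `slow_load` is a made-up stand-in that sleeps instead of loading real weights, and `get_model` caches its result so only the first call pays the cost. It illustrates the pattern and is not the actual serving code.

```python
import time
from functools import lru_cache


def slow_load():
    """Hypothetical stand-in for the real model load; sleeps instead of reading weights."""
    time.sleep(0.2)
    return {"name": "fake-model"}


@lru_cache(maxsize=1)
def get_model():
    # The first call pays the load cost; later calls return the cached object.
    return slow_load()


def timed_call():
    start = time.perf_counter()
    model = get_model()
    return model, time.perf_counter() - start


_, cold = timed_call()
_, warm = timed_call()
print(f"cold: {cold:.3f}s, warm: {warm:.6f}s")
```

An alternative is to call the loader once at process startup, which moves the wait from the first request to container boot.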

Key takeaways

QLoRA mechanics: NF4 4-bit quantization plus LoRA adapters, and how that lets a 7B model fit in consumer-grade VRAM
LoRA rank selection: r=16 with full target module coverage for code generation quality
GPU platform comparison: Vertex AI quota issues, RunPod for training, Cloud Run GPU for serving
PEFT library: `PeftModel.from_pretrained`, loading adapters onto quantized base models
GCP Cloud Run GPU: `--gpu` flag, `nvidia-l4`, max instance constraints, Artifact Registry workflow
