How I Built Trix
Trix is the tiny AI assistant on this site that answers questions about me. Here is the full pipeline: how the dataset was created, how a small model was taught to behave like a large one, and how it all runs cheaply in the cloud.
The Full Pipeline
Dataset
2,105 prompt-response pairs
Deduplication
Semantic clustering
Distillation
12B → 270M
Deployment
Serverless GGUF
Real users, synthetic coverage, semantic compression
The dataset mixes real questions people asked on pooria.dev with augmented variants and LLM-generated edge cases. The final set was compressed with semantic clustering so the model sees diversity without memorizing duplicates.
Data Collection Pipeline
Collect
Real prompts from pooria.dev
Augment
Wording + chaining
Generate
14 LLMs cover edge cases
Embed
all-MiniLM-L6-v2
Cluster
Threshold 0.15
Respond
Gemma 3 12B
Real-Data Augmentation
Each real prompt was rewritten to create wording variations, reordered phrases, and chained multi-part questions. This kept the data grounded in real user behavior while increasing diversity.
Original: "What is Pooria studying?"
Variant: "Can you tell me what Pooria is majoring in at university?"
Chained: "Where did Pooria go to school and what is his favorite project?"
LLM Generation
14 different models were prompted to generate edge-case messages covering adversarial, vague, multilingual, and out-of-scope inputs.
Semantic Deduplication
Prompts were embedded with all-MiniLM-L6-v2 and clustered by semantic distance. A threshold of 0.15 compressed the set by about 2.2× while keeping one representative from each cluster.
Teaching a tiny model to think like a big one
Instead of ordinary supervised fine-tuning, distillation trains the student on the teacher's output distribution. The small model learns why the large model chooses each word, not just the final answer.
Teacher
Gemma 3 12B
12 billion parameters
Student
Gemma 3 270M
270 million parameters
RunPod
Trained on an RTX PRO 6000 for about 6 hours.
~$8 compute
Total cost to distill the final model.
GGUF + Quantize
Converted and quantized for llama.cpp inference.
Why Distillation Beats SFT
Supervised fine-tuning only penalizes the model when it picks the wrong word. Distillation also teaches the shape of the teacher's probability distribution, preserving nuance like synonyms, tone, and uncertainty.
Supervised Fine-Tuning
Distillation
In SFT, the label is a single correct token. Every other token is treated as equally wrong, even if it is a valid synonym.
In distillation, the student sees the teacher's full softmax distribution. It learns that "majors" and "focuses" are plausible alternatives, so its own outputs stay fluent and context-aware.
Serverless, quantized, fast cold starts
The quantized model is wrapped in a llama.cpp Docker container and deployed to Google Cloud Run. It scales to zero when idle and bundles weights inside the image for ~5 second cold starts.
Visitor
Sends a message
Cloud Run
Routes to container
llama.cpp
Streams tokens back