Behind the Bot

How I Built Trix

Trix is the tiny AI assistant on this site that answers questions about me. Here is the full pipeline: how the dataset was created, how a small model was taught to behave like a large one, and how it all runs cheaply in the cloud.

The Full Pipeline

Dataset

2,105 prompt-response pairs

Deduplication

Semantic clustering

Distillation

12B → 270M

Deployment

Serverless GGUF

1. Dataset

Real users, synthetic coverage, semantic compression

The dataset mixes real questions people asked on pooria.dev with augmented variants and LLM-generated edge cases. The final set was compressed with semantic clustering so the model sees diversity without memorizing duplicates.

917 Real collected prompts

2,835 After real-data augmentation

4,525 After LLM generation

2,105 Final semantic clusters

Data Collection Pipeline

Collect

Real prompts from pooria.dev

Augment

Wording + chaining

Generate

14 LLMs cover edge cases

Embed

all-MiniLM-L6-v2

Cluster

Threshold 0.15

Respond

Gemma 3 12B

Real-Data Augmentation

Each real prompt was rewritten to create wording variations, reordered phrases, and chained multi-part questions. This kept the data grounded in real user behavior while increasing diversity.

Original: "What is Pooria studying?"

Variant: "Can you tell me what Pooria is majoring in at university?"

Chained: "Where did Pooria go to school and what is his favorite project?"

LLM Generation

14 different models were prompted to generate edge-case messages covering adversarial, vague, multilingual, and out-of-scope inputs.

Jailbreaks

Follow-ups

Multi-language

Vague inputs

Hostile prompts

Typo-heavy

Semantic Deduplication

Prompts were embedded with all-MiniLM-L6-v2 and clustered by semantic distance. A threshold of 0.15 compressed the set by about 2.2× while keeping one representative from each cluster.

4,525 raw prompts

2,105 semantic clusters

2. Model Distillation

Teaching a tiny model to think like a big one

Instead of ordinary supervised fine-tuning, distillation trains the student on the teacher's output distribution. The small model learns why the large model chooses each word, not just the final answer.

Teacher

Gemma 3 12B

12 billion parameters

Produces soft target distributions over every token

Distillation loss

Student

Gemma 3 270M

270 million parameters

Learns to match the teacher's next-token probabilities

RunPod

Trained on an RTX PRO 6000 for about 6 hours.

~$8 compute

Total cost to distill the final model.

GGUF + Quantize

Converted and quantized for llama.cpp inference.

Why Distillation Beats SFT

Supervised fine-tuning only penalizes the model when it picks the wrong word. Distillation also teaches the shape of the teacher's probability distribution, preserving nuance like synonyms, tone, and uncertainty.

Supervised Fine-Tuning

"studies"

100%

"majors"

Distillation

"studies"

75%

"majors"

20%

"focuses"

In SFT, the label is a single correct token. Every other token is treated as equally wrong, even if it is a valid synonym.

In distillation, the student sees the teacher's full softmax distribution. It learns that "majors" and "focuses" are plausible alternatives, so its own outputs stay fluent and context-aware.

3. Deployment

Serverless, quantized, fast cold starts

The quantized model is wrapped in a llama.cpp Docker container and deployed to Google Cloud Run. It scales to zero when idle and bundles weights inside the image for ~5 second cold starts.

Visitor

Sends a message

Cloud Run

Routes to container

llama.cpp

Streams tokens back

Scaling

Max containers 1

Idle behavior Scale to zero

Cold start ~5s

Model bundled In image

Try Trix yourself

Ask a question on the home page and see the distilled model in action.

Ask Trix