Until you play with a transformer and watch it learn, you will never truly feel the AGI. The folks at OpenAI felt it firsthand, scaling from GPT-2 to GPT-3 to GPT-4, watching language models go from parlor tricks to something that felt like understanding. This page attempts to give you that same feeling: load Pythia-14M, a real 14M-parameter transformer from EleutherAI, fine-tune it on your own instruction-completion pairs, and watch a model that spits out gibberish start producing structured answers. All through SFT, completely in your browser.

What is SFT? Supervised Fine-Tuning takes a pretrained language model (which only knows how to predict the next token) and teaches it to follow instructions by training on (prompt, completion) pairs. It is the first step in the standard alignment pipeline: Pretrain → SFT → RLHF.
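Concretely, an SFT example is just a prompt paired with its target completion. A minimal sketch in JavaScript (the pair format and the `buildExample` helper are illustrative, not this demo's actual API):

```javascript
// Illustrative shape of one SFT example: loss is computed on the
// completion tokens only, never on the prompt tokens.
const sftExamples = [
  { prompt: "Q: What is the capital of France?\nA:", completion: " Paris" },
  { prompt: "Q: What color is the sky?\nA:", completion: " Blue" },
];

// Concatenate prompt and completion into one token stream and record
// where the completion starts, so the prompt can be masked out of the loss.
function buildExample(tokenize, { prompt, completion }) {
  const promptIds = tokenize(prompt);
  const completionIds = tokenize(completion);
  return {
    inputIds: promptIds.concat(completionIds),
    lossStart: promptIds.length, // positions before this index are masked
  };
}
```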
Browser requirements: This demo downloads ~54MB of model weights and runs inference/training in JavaScript. WebGPU is used when available (Chrome 113+); otherwise it falls back to CPU. Training on CPU is slower but functional. Tested on an Apple M1 Max laptop, where it runs smoothly with WebGPU enabled in Chrome.
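Feature detection for this kind of fallback is straightforward; a hedged sketch (not this page's actual code) using the standard `navigator.gpu` entry point:

```javascript
// Pick a compute backend: WebGPU when the browser exposes a usable
// adapter, otherwise fall back to plain CPU execution.
async function pickBackend() {
  if (typeof navigator !== "undefined" && navigator.gpu) {
    const adapter = await navigator.gpu.requestAdapter();
    if (adapter) return "webgpu";
  }
  return "cpu"; // slower, but training still works
}
```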
LM Head mode: Only the output projection layer (50304 x 128 = 6.4M params) is trained. The transformer backbone stays frozen. Fast, uses GPU when available.

1. Load Model

2. Training Data (15 examples)

Feel free to add your own SFT examples.

Prompt | Completion (target)

3. Before Training (baseline)

What the model generates before any fine-tuning:

Prompt | Model response

4. Fine-Tune

LM Head only freezes the transformer backbone and trains just the final output projection layer that maps hidden states to vocabulary logits. It is fast and sufficient when the pretrained representations already capture what you need. Full Model updates all parameters, including the embeddings and every transformer layer. This lets the model learn deeper representations, but it is slower and more prone to overfitting on small datasets.
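The two modes differ only in which parameter tensors receive gradient updates. A sketch, assuming a flat list of named weight tensors (the names here are hypothetical):

```javascript
// Hypothetical parameter registry: each entry is a named weight tensor.
const params = [
  { name: "embedding", data: new Float32Array(8) },
  { name: "block0.attention", data: new Float32Array(8) },
  { name: "block0.mlp", data: new Float32Array(8) },
  { name: "lmHead", data: new Float32Array(8) }, // output projection
];

// "head" mode freezes the backbone and updates only the output projection;
// "full" mode updates every tensor, embeddings included.
function trainableParams(allParams, mode) {
  if (mode === "full") return allParams;
  return allParams.filter(p => p.name === "lmHead");
}
```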

Hyperparameters

Mode: what to train (LM head only vs. full model)
Epochs: full passes over the dataset
Learning rate: Head 1e-3, Full 1e-4 recommended
Optimizer: optimization algorithm
Beta1: Adam first moment decay
Beta2: Adam second moment decay
Weight decay: AdamW decoupled weight decay
Gradient clip: 0 = disabled; try 1.0 for full model
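The moment decays and decoupled weight decay above combine into a single AdamW update per parameter. A minimal sketch of one step (the buffer layout and function shape are illustrative):

```javascript
// One AdamW step for a flat parameter tensor w with gradient grad.
// m and v are first/second moment buffers, initialized to zero;
// t is the 1-based step count used for bias correction.
function adamwStep(w, grad, m, v, t, { lr, beta1, beta2, eps, weightDecay }) {
  for (let i = 0; i < w.length; i++) {
    m[i] = beta1 * m[i] + (1 - beta1) * grad[i];           // first moment
    v[i] = beta2 * v[i] + (1 - beta2) * grad[i] * grad[i]; // second moment
    const mHat = m[i] / (1 - Math.pow(beta1, t));          // bias correction
    const vHat = v[i] / (1 - Math.pow(beta2, t));
    // Decoupled weight decay acts on the weight itself, not the gradient.
    w[i] -= lr * (mHat / (Math.sqrt(vHat) + eps) + weightDecay * w[i]);
  }
}
```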

5. Test Model

Run the same baseline prompts through the current model. Use it at any point to check progress.

Prompt | Model response

Anything outside your training data will produce nonsense, but hey, at least it'll be better nonsense than the base model.

What Just Happened

You took a pretrained language model that only knew how to predict the next token and taught it to follow instructions. That is supervised fine-tuning: you provided (prompt, completion) pairs, computed a cross-entropy loss on the completion tokens only, and updated the weights with gradient descent.
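The masked loss can be sketched in a few lines. This simplified version ignores the one-position shift between logits and targets that a real autoregressive loss uses, and all names are illustrative:

```javascript
// Mean cross-entropy over completion tokens only. logits[pos] is the
// vocabulary logit vector at position pos, targets[pos] the true token id,
// and lossStart the index of the first completion token.
function maskedCrossEntropy(logits, targets, lossStart) {
  let total = 0, count = 0;
  for (let pos = lossStart; pos < targets.length; pos++) {
    const row = logits[pos];
    const maxL = Math.max(...row);                        // numerical stability
    const sumExp = row.reduce((s, x) => s + Math.exp(x - maxL), 0);
    total -= row[targets[pos]] - maxL - Math.log(sumExp); // -log p(target)
    count++;
  }
  return total / count; // average negative log-likelihood of the completion
}
```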

This is the same first step used to build ChatGPT, Claude, and every other instruction-following LLM. The difference is scale: they use billions of parameters and millions of examples. The mechanism is identical.

SFT alone does not produce a safe or well-aligned model. It teaches format and surface-level instruction following, but not preference or judgment. That requires the next step in the pipeline: reinforcement learning from human feedback (RLHF), using algorithms like PPO or GRPO. But SFT is where the magic first becomes visible.

I hope you felt the AGI. If not fully, at least a little bit.