Everyone wants to train the big model. The instinct is understandable — larger models perform better, and performance is the point. But there is a discipline that precedes scale, and skipping it is how you end up running expensive jobs that teach you nothing. The discipline is starting small. Not because small is the goal, but because small is the only place where you can actually see what is happening.
Why Start Small
A 60-million parameter model trained on a single GPU takes minutes per run. A 7-billion parameter model takes days on a cluster. This difference in wall-clock time is not just an inconvenience — it is a fundamental constraint on how much you can learn. The researcher who can run ten experiments today knows ten times more than the one waiting on a single job.
The argument for starting small is really an argument about iteration density. Every training run is a hypothesis. You have a configuration — optimizer, learning rate, batch size, architecture — and you are asking whether it produces a model that learns. Small models let you run that hypothesis cheaply. The feedback loop is tight enough that you can develop genuine intuition: you begin to notice what healthy training feels like before it finishes, what warning signs look like in the first thousand steps, what a well-chosen learning rate does to the loss curve versus a poorly-chosen one.
There is also something clarifying about constraint. A model that cannot learn a task at 60M parameters usually reveals the problem more clearly than one that partially learns it at 7B. Small models fail loudly. The instabilities, the dead gradients, the sensitivity to initialization — all of it shows up fast and visibly. At scale, these same problems hide behind longer training and larger parameter budgets. You spend more to learn less.
HuggingFace's SmolLM2 family is worth studying as an existence proof. Their 135M and 360M parameter models are competitive on standard benchmarks — HellaSwag, ARC, PIQA — against much larger models from earlier generations. They got there through careful data curation, architecture choices, and training recipe tuning that was almost certainly iterated at small scale first. The small model is not a practice target. It is the research vehicle.
The Optimizer Landscape
Before your first run, you need to choose an optimizer. The choice matters more than people expect, and it interacts with everything else: your learning rate, your batch size, your weight decay, your warmup schedule. Getting this wrong does not necessarily cause obvious failure — it can just make your model quietly worse in ways you only discover when you compare against a better-configured baseline.
AdamW is the correct starting point. It works by maintaining per-parameter adaptive learning rates — parameters that receive large gradients get smaller effective updates, and vice versa. The "W" distinguishes it from the original Adam: weight decay is applied directly to the weights rather than being folded into the gradient update, which matters for regularization. This is the optimizer on which most of what we know about transformer training has been established. When you read a paper and it says the model trained with AdamW, you know exactly what baseline they're working from. Start here.
The standard hyperparameters are not arbitrary. A peak learning rate between 3e-4 and 1e-3 covers the range that works for most small transformer configurations — start at 3e-4 and adjust from there. Weight decay of 0.1 is the consensus value across recent language model work; it provides enough regularization without aggressively suppressing learning. Gradient clipping at 1.0 is close to universal for transformers because it prevents the occasional large gradient from destabilizing training.
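To make the decoupling concrete, here is a minimal pure-Python sketch of a single AdamW step next to original Adam with L2 regularization, on a scalar weight. The hyperparameter values mirror the recipe above; β₂ = 0.95 follows common LLM practice rather than Adam's 0.999 default, and both functions are illustrative, not a production optimizer.

```python
# Scalar sketch: AdamW (decoupled weight decay) vs. Adam with L2 folded into
# the gradient. Hyperparameter values are assumptions matching the recipe above.

def adamw_step(w, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8, wd=0.1):
    """One AdamW step: weight decay is applied directly to the weight."""
    m = beta1 * m + (1 - beta1) * g        # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (v_hat ** 0.5 + eps) + wd * w)  # decay outside the adaptive term
    return w, m, v

def adam_l2_step(w, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8, wd=0.1):
    """Original Adam with L2: decay rides along in the gradient, so it gets
    rescaled by the adaptive denominator -- the behavior AdamW removes."""
    g = g + wd * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v

# Same starting weight, same gradient -- different effective regularization.
w1, m1, v1 = adamw_step(1.0, 0.5, 0.0, 0.0, t=1)
w2, m2, v2 = adam_l2_step(1.0, 0.5, 0.0, 0.0, t=1)
```

The two updates differ even after a single step: under Adam-with-L2, a large second-moment estimate shrinks the decay along with everything else, while AdamW regularizes at full strength regardless of the gradient history.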
Muon is worth understanding as a challenger. Rather than following the gradient directly, it orthogonalizes the momentum-smoothed update for each 2D weight matrix — replacing it with an approximation of the nearest semi-orthogonal matrix, computed by a few Newton–Schulz iterations — enforcing a kind of geometric constraint on how weights change. It is typically applied only to the hidden-layer weight matrices; embeddings and the output head usually stay on AdamW. In recent experiments on transformer pre-training, Muon has shown lower validation loss at equivalent compute compared to AdamW. It is not yet established enough to be the default, but if you are doing serious ablation work, it belongs in the comparison.
The learning rate schedule is as important as the optimizer. Cosine decay with linear warmup is the standard for a reason. During warmup — typically 1-5% of total training steps — the learning rate increases linearly from near zero to its peak value. This gives the model time to orient its parameters before large updates begin; starting at full learning rate from random initialization can cause early instability that propagates through the entire run. After warmup, cosine decay brings the learning rate smoothly to near zero by the end of training. Linear decay and constant learning rate both leave performance on the table.
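The warmup-plus-cosine schedule is simple enough to write down directly. A sketch, assuming a 2% warmup fraction (inside the 1–5% range above) and a decay floor of zero; the function name and defaults are illustrative:

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_frac=0.02, min_lr=0.0):
    """Linear warmup from ~0 to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + (peak_lr - min_lr) * cosine
```

Plotting `lr_at_step` over a run shows the characteristic shape: a straight ramp for the first 2% of steps, a peak at 3e-4, then a smooth descent to near zero at the final step.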
The Batch Size Equation
Batch size and learning rate are not independent knobs. They are coupled — changing one without adjusting the other is a common source of silent performance degradation that shows up in your loss curve as a vague underperformance with no obvious failure mode.
The right unit for thinking about batch size in language model training is not samples — it is tokens per update. Your global token batch size is the product of the number of GPUs, gradient accumulation steps, micro-batch size per GPU, and sequence length. At the SmolLM ablation baseline (8 GPUs, 16 accumulation steps, micro-batch 3, sequence length 4096), that works out to 1.5 million tokens per gradient update. SmolLM3's final 3B training used 2.3 million tokens per batch.
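The arithmetic is worth encoding once so it never has to be redone by hand. A small helper, checked against the baseline figures quoted above:

```python
def tokens_per_update(gpus, grad_accum, micro_batch, seq_len):
    """Global token batch size: the unit that actually couples to the LR."""
    return gpus * grad_accum * micro_batch * seq_len

# The SmolLM ablation baseline from the text:
baseline = tokens_per_update(gpus=8, grad_accum=16, micro_batch=3, seq_len=4096)
# 8 * 16 * 3 * 4096 = 1,572,864 tokens -- the ~1.5M figure above
```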
The coupling to learning rate follows an empirical rule: larger batches support higher peak learning rates. The intuition is about gradient noise. A small batch produces a noisier gradient estimate of the true loss surface — there is more variance between batches. A noisy gradient can still move you in the right direction on average, but it needs to be taken carefully; a large step in the wrong direction is worse than a small step in the right one. A large batch produces a cleaner, lower-variance gradient estimate, which means each update is more trustworthy and you can afford to step more aggressively.
The playbook data bears this out concretely. DeepSeek LLM 7B trains with a 9.4M token batch at LR 4.2×10⁻⁴. SmolLM3 3B trains with a 2.3M token batch at LR 2×10⁻⁴. OLMo 2 7B uses a 4.2M token batch at LR 3×10⁻⁴. Larger batch, higher LR — consistently, across labs and architectures.
The practical implication: when you halve your batch size (because you have fewer GPUs, or because you want faster iteration), scale your learning rate down, not up. A common starting point is the square-root scaling rule: halve the batch → divide LR by √2. Linear scaling (halve the batch → halve the LR) is theoretically motivated but often over-corrects in practice; run in the other direction, doubling the LR along with the batch is a common source of early instability. The right scaling factor is itself something to ablate — but the direction is always the same.
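A sketch of the rescaling rules, written under the common convention that the learning rate scales with a power of the token-batch ratio (exponent 0.5 for square-root scaling, 1.0 for linear); the function name and `rule` argument are illustrative:

```python
def scaled_lr(base_lr, base_batch_tokens, new_batch_tokens, rule="sqrt"):
    """Rescale a peak LR when the token batch size changes.

    'sqrt'  : LR proportional to batch ** 0.5 (often more robust in practice)
    'linear': LR proportional to batch (theoretically motivated, aggressive)
    """
    ratio = new_batch_tokens / base_batch_tokens
    if rule == "sqrt":
        return base_lr * ratio ** 0.5
    if rule == "linear":
        return base_lr * ratio
    raise ValueError(f"unknown rule: {rule}")
```

For example, halving a 4.2M-token batch under the sqrt rule takes a 3e-4 peak LR down to 3e-4/√2 ≈ 2.1e-4 — a candidate starting point for the ablation, not a final answer.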
One more thing the token batch size makes visible: sequence length is part of the equation. If you reduce your context length from 4096 to 2048 for a quick experiment, you have effectively halved your batch size in tokens, even if your micro-batch and accumulation settings are unchanged. Your LR may need to be adjusted accordingly, or your results from short-context experiments may not transfer cleanly to the full-context run.
Reading the Training Charts
Training a model without monitoring it is navigation without instruments. The charts are not administrative overhead — they are the primary feedback mechanism. Knowing what to look for in them is the core skill of training at any scale. The following visualizations show the three signals that matter most.
Loss curve. The training loss is the most fundamental signal. A healthy loss curve decays smoothly and consistently, with small random noise around the trend. Deviations from this pattern carry specific meaning.
Fig. 1 — Training Loss Curve
Healthy: loss decays smoothly with minor noise. The curve is convex early, flattening as the model approaches convergence. Gradient noise is expected but bounded.
A loss spike — a sudden upward jump followed by recovery — usually indicates a bad batch, a numerically unstable operation, or a learning rate that is slightly too high. One spike can be ignored. Recurring spikes at regular intervals suggest a systematic data problem. Spikes that don't recover mean your learning rate is too large and training is on the edge of instability.
A plateau means the model has stopped learning. This can happen when the learning rate decays too aggressively too early, when the dataset has been effectively exhausted, or when a bottleneck in the architecture is preventing further improvement. A plateau late in training is often expected; one that appears in the first 20% of the run is a problem worth diagnosing before continuing.
Gradient norm. The magnitude of your gradients tells you how aggressively the model is trying to update its weights at each step. Too large, and you risk instability; too small, and learning has effectively stopped.
Fig. 2 — Gradient Norm Over Training
Healthy: gradient norm is high early as the model makes large adjustments from random initialization, then stabilizes to a bounded range. The clipping threshold (dashed line at 1.0) is rarely triggered.
Gradient explosion — the norm suddenly jumping to 10× or 100× its normal value — means something has gone very wrong numerically. This is precisely what gradient clipping is designed to catch; if your clipping threshold is being triggered on nearly every step, that is a sign the learning rate is too high or the model architecture has a stability problem. Gradient collapse — norm dropping to near zero — means gradients are not propagating through the network. Common causes include dead ReLU units, improper initialization, or too-aggressive weight decay.
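Global-norm clipping and the logging that goes with it fit in a few lines. A pure-Python sketch over nested lists standing in for parameter tensors; the key design point is returning the pre-clip norm, since that — not the clipped value — is the quantity worth plotting:

```python
def global_grad_norm(grads):
    """L2 norm over all gradient entries across all parameter tensors."""
    return sum(g * g for tensor in grads for g in tensor) ** 0.5

def clip_grads(grads, max_norm=1.0):
    """Scale all gradients down uniformly if the global norm exceeds max_norm.

    Returns (clipped_grads, pre_clip_norm). Logging the pre-clip norm is what
    lets you see how often clipping fires -- clipping on nearly every step is
    the warning sign described above.
    """
    norm = global_grad_norm(grads)
    if norm > max_norm:
        scale = max_norm / norm
        grads = [[g * scale for g in tensor] for tensor in grads]
    return grads, norm

# Example: two one-element "tensors" with global norm 5.0, clipped to 1.0.
clipped, pre_clip = clip_grads([[3.0], [4.0]], max_norm=1.0)
```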
Ablation comparison. Once you have a stable baseline, the ablation is the instrument for understanding which choices actually matter. The chart below shows what a structured comparison looks like: one variable changed at a time, all else held constant, evaluated against final validation loss.
Fig. 3 — Ablation Comparison (Final Validation Loss)
Each bar represents a single training run with one variable changed from the baseline. Lower validation loss is better. The difference between best and worst config here is 0.4 nats — a meaningful gap at this scale.
The goal of the ablation chart is not to find the single best configuration. It is to understand the sensitivity of your training to each variable. A configuration that produces 0.3 nats better loss than the baseline is a meaningful finding. A configuration that produces 0.01 nats better loss is probably within noise. Knowing which levers matter — and how much — is the knowledge that transfers to scale.
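One way to operationalize the noise-floor judgment is to rank runs by their delta from the baseline and flag which deltas clear an estimated noise threshold. The 0.02-nat floor below is an assumed placeholder — estimate yours from repeated identical baseline runs:

```python
def rank_ablations(baseline_loss, runs, noise_floor=0.02):
    """Rank ablation runs by final validation loss against a baseline.

    Returns (name, delta_vs_baseline, significant) tuples sorted best-first.
    noise_floor is the loss difference below which two runs are treated as
    indistinguishable; the 0.02-nat default is a placeholder assumption.
    """
    ranked = sorted(runs.items(), key=lambda kv: kv[1])
    return [(name, loss - baseline_loss,
             abs(loss - baseline_loss) > noise_floor)
            for name, loss in ranked]

# Hypothetical run names and losses, purely for illustration:
results = rank_ablations(3.20, {"muon": 2.90, "lr-1e-3": 3.21, "wd-0.0": 3.35})
```

On this made-up data, the muon run's −0.30-nat delta is flagged as meaningful while the +0.01-nat learning-rate run is correctly marked as within noise — exactly the distinction the paragraph above is drawing.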
The Ablation Mindset
Ablation is not a technique. It is an epistemic discipline. The rule is simple: change one thing at a time. If you change the optimizer and the learning rate simultaneously, and performance improves, you do not know why. You have learned that the combination is better, which is weaker knowledge than understanding which component drove the improvement. Compound changes produce compound ignorance.
The order in which you ablate matters. Start with the variables that have the largest expected effect on training stability: learning rate first, then warmup duration, then batch size, then optimizer choice. Architecture decisions — attention mechanism variants, positional encoding schemes, normalization placement — come after you have a stable training recipe, because architectural changes interact with all of the above. If you ablate architecture against an unstable baseline, your results will not generalize.
Every run needs a name, a config file, and at minimum these three tracked metrics: training loss, validation loss, and gradient norm. The name should describe the hypothesis being tested, not be a timestamp or an arbitrary identifier. Six months from now, a run called muon-lr3e4-cosine-60m is legible; a run called run_47 is archaeology. Log everything; delete nothing.
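A minimal sketch of what "every run needs a name and a config file" can look like in practice — the field names here are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class RunConfig:
    """One run = one hypothesis. The name encodes what is being tested."""
    name: str          # e.g. "adamw-lr3e4-cosine-60m", never "run_47"
    optimizer: str
    peak_lr: float
    batch_tokens: int
    warmup_frac: float

cfg = RunConfig(name="adamw-lr3e4-cosine-60m", optimizer="adamw",
                peak_lr=3e-4, batch_tokens=1_572_864, warmup_frac=0.02)

# Log everything; delete nothing: serialize alongside the metrics.
record = json.dumps(asdict(cfg))
```

Writing the config to disk next to the loss, validation, and gradient-norm logs is what makes the run reproducible six months later.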
"SOTA small" is a specific target, not a vague aspiration. For a sub-1B parameter language model, competitive means scoring above 70 on HellaSwag, above 50 on ARC-Challenge, and above 75 on PIQA with 0-shot evaluation. These thresholds have moved over the last two years as training recipes have improved, but they are achievable at 135M–360M parameters if the data, optimizer, and schedule are well-chosen. The architecture matters less than people think at this scale; the recipe matters more.
The Roadmap
What I am building toward is not a single model. It is a research loop — a structured process for understanding what actually works in small transformer training, which I can then apply to increasingly capable architectures.
The first stage is a clean baseline: a GPT-2-style or LLaMA-style decoder-only transformer at 60M–300M parameters, trained on FineWeb or a similar high-quality web corpus, with the standard recipe established above. The goal is a model that trains stably, hits competitive benchmarks, and serves as the control for every subsequent experiment. No tricks, no novel components — just a solid, well-understood foundation.
The second stage is custom YAT — Yet Another Transformer — implementations. Not using someone else's RoPE, someone else's SwiGLU, someone else's normalization placement. Writing these from scratch matters because it forces a level of understanding that reading papers does not. When you implement rotary position embeddings and watch how they affect the attention patterns, you understand something about positional encoding that you cannot get from a formula. Each component gets ablated against the baseline at small scale before it touches a larger model.
The third stage is MTEB fine-tuning. MTEB — the Massive Text Embedding Benchmark — evaluates embedding models across retrieval, classification, clustering, and reranking tasks. Fine-tuning a pre-trained decoder model into a strong embedding model requires specific training objectives (contrastive learning, instruction-following for asymmetric retrieval) that are worth understanding deeply. The small pre-trained model becomes the starting point; the fine-tuning process is itself an ablation surface.
This is the same spirit as Chapter 5 — building in public, learning in public — but applied to research rather than product. The experiments will be logged and the findings will be written up. The dead ends are as informative as the successes.
What Small Teaches
A model that cannot generalize at 60M parameters will not magically generalize at 7B. The failure mode is the same; only the cost has increased. If your loss curve is spiky at small scale, it will be spiky at large scale — you will just spend more compute discovering this. If your architecture choice does not improve over baseline at 60M, it probably does not improve at 300M either.
The discipline of reading small trains the instinct for reading big. When you have watched a hundred loss curves at small scale, you recognize the patterns immediately at any scale. The gradient norm explosion you saw at step 200 of a 60M-parameter run is the same explosion you will see at step 2000 of a 7B-parameter run — just more expensive to recover from.
The previous chapter proposed that training might be understood as measuring the curvature of a learned geometry — that weights are not arbitrary numbers but the shape of the space the model has learned to inhabit. If that framing is right, then starting small is not a compromise. It is the appropriate instrument. Small models produce cleaner geometry. The signal is less noisy. The training charts are a direct readout of how the curvature is forming, where it is healthy, where it is distorted. The instincts you develop by reading them carefully are not small-model instincts. They are instincts about learning itself.