Chapter 11

The Tokens You Don't See

Derived from and in dialogue with The Smol Training Playbook by Hugging Face (Ben Allal, Tunstall, Tazi et al., 2025). This chapter is my own synthesis and interpretation.

Most discussions of language model training focus on what goes into the model — architecture, optimizer, learning rate. Fewer focus on what happens to the data before training begins. But the invisible decisions made at the sequence level — how documents are packed, where attention is allowed to flow, how boundaries between unrelated texts are handled — shape what the model learns as surely as any architectural choice.

The Packing Problem

Language models train on fixed-length sequences. The transformer's attention mechanism operates over a sequence of some length — 2048 tokens, 4096 tokens, whatever you've chosen. But the documents in your training corpus have no idea what your sequence length is. A research paper might span 12,000 tokens. A short Python function might be 80. A tweet might be 15.

The naive solution to variable-length documents is padding: take each document, add padding tokens to bring it up to the sequence length, and process each one individually. This is simple, but it is extraordinarily wasteful. If your average document is 400 tokens and your sequence length is 4096, you are wasting roughly 90% of your compute on attention over padding tokens that contribute nothing to learning. At the scale of billions of training tokens, this is not an acceptable inefficiency.
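The waste claim is easy to check with the illustrative numbers above (400-token average document, 4096-token sequences; both figures are assumptions carried over from the text, not measurements):

```python
# Back-of-envelope padding waste for naive pad-to-length training.
avg_doc_len = 400   # assumed average document length in tokens
seq_len = 4096      # chosen training sequence length

useful_fraction = avg_doc_len / seq_len
wasted_fraction = 1 - useful_fraction
print(f"useful: {useful_fraction:.1%}, wasted: {wasted_fraction:.1%}")
# useful: 9.8%, wasted: 90.2%
```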

The real solution is packing: shuffle your documents, concatenate them end-to-end with an end-of-sequence (EOS) token between each, then chunk the result into fixed-length sequences. A document that ends partway through a chunk is continued in the next chunk. You fill every training sequence to capacity, and no compute is wasted.
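The procedure above can be sketched in a few lines. This is a simplified version: a real pipeline streams documents and carries the trailing partial chunk into the next batch, and `EOS = 0` is a placeholder token id, not any particular tokenizer's:

```python
from typing import Iterator

EOS = 0  # placeholder end-of-sequence token id

def pack(docs: list[list[int]], seq_len: int) -> Iterator[list[int]]:
    """Concatenate tokenized documents with EOS separators, then
    chunk the stream into fixed-length training sequences. A document
    that ends mid-chunk simply continues in the next chunk."""
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(EOS)
    # For simplicity, drop the final partial chunk; a real pipeline
    # would carry it over into the next packing round.
    for start in range(0, len(stream) - seq_len + 1, seq_len):
        yield stream[start:start + seq_len]
```

Every yielded sequence is full to capacity; the only per-document overhead is the single EOS separator.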

Fig. 1 — Document Packing into Fixed-Length Sequences

Packing: four documents of different lengths are concatenated with EOS tokens and cut into two full-capacity sequences. No compute is wasted on padding.

In the HuggingFace playbook's ablation setup, sequences are packed to length 4096 tokens. About 80–90% of files in Common Crawl and GitHub are shorter than 2000 tokens, meaning the vast majority of documents are short enough that several will end up packed into the same training sequence. This is efficient — and it introduces a subtle problem.

The Cross-Document Contamination Problem

Standard causal attention allows every token to attend to all preceding tokens in its sequence, regardless of whether those tokens came from the same document or from a different one packed next to it. A token at position 3000 in a training sequence can, under standard masking, attend to tokens at positions 1, 2, and 3, even if those positions belong to a completely different document than the one the token is actually part of.

This is not just conceptually odd; it introduces noise. Research by Zhao et al. found that allowing attention across document boundaries degrades model quality because the model receives spurious context: tokens from a recipe article attend to a nearby code snippet and learn associations between unrelated content. The model's representation of each document gets contaminated by its neighbors.

An additional observation from the SkyLadder team: intra-document masking effectively reduces the model's average context length during training. With packing and standard masking, every token sees up to 4096 tokens of context, even early in training. With document masking, each token only sees the context within its own document — which, for most of the short documents in Common Crawl, means seeing far fewer than 4096 tokens. SkyLadder found that shorter effective context lengths are actually better for early training stability, which provides a second, independent reason to mask document boundaries.

Intra-Document Masking

The fix is conceptually simple: modify the attention mask so that tokens can only attend to previous tokens within the same document. When a training sequence contains four packed documents, each document operates as if it is the only one — tokens from document B cannot attend to tokens from document A, even if they're in the same sequence at earlier positions.

Fig. 2 — Attention Mask: Causal vs. Intra-Document


Standard causal masking lets every token attend to all previous tokens in the sequence, including those from different packed documents, so cross-document attention introduces noise from unrelated content. Intra-document masking blocks attention across EOS boundaries, restricting each token to its own document.

The attention mask is a matrix where entry (i, j) indicates whether query token i is allowed to attend to key token j. Standard causal masking makes this a lower-triangular matrix of ones. Intra-document masking sets entries to zero wherever tokens i and j come from different documents — even if j comes before i in the sequence. The rest of the causal constraint is preserved: a token still cannot attend to future tokens within its own document.
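A minimal sketch of that mask construction, assuming each position's document ID is already known. The `intra_document_mask` helper and the dense NumPy representation are illustrative; production kernels never materialize the full matrix:

```python
import numpy as np

def intra_document_mask(doc_ids: np.ndarray) -> np.ndarray:
    """Boolean attention mask for one packed sequence.
    mask[i, j] is True iff query token i may attend to key token j:
    j must not be in the future (causal constraint) and must belong
    to the same document as i (intra-document constraint)."""
    n = len(doc_ids)
    positions = np.arange(n)
    causal = positions[None, :] <= positions[:, None]    # j <= i
    same_doc = doc_ids[None, :] == doc_ids[:, None]      # doc(j) == doc(i)
    return causal & same_doc

# Two packed documents: positions 0-2 are doc 0, positions 3-4 are doc 1.
mask = intra_document_mask(np.array([0, 0, 0, 1, 1]))
```

The result is block-diagonal: a lower-triangular block per document, with everything outside the blocks zeroed out.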

Operationally, implementing this requires knowing, for each position in a packed sequence, which document that position belongs to. This is tracked through a document ID tensor that travels alongside the token IDs. The attention implementation (typically FlashAttention at this scale) uses the document boundaries to construct the block-diagonal mask structure that intra-document masking requires.
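Variable-length attention kernels typically take boundary offsets rather than a dense mask (FlashAttention's varlen interface calls these cumulative sequence lengths, `cu_seqlens`). Deriving them from the document ID tensor is straightforward; a hedged sketch, with the kernel call itself omitted:

```python
import numpy as np

def doc_boundaries(doc_ids: np.ndarray) -> np.ndarray:
    """Cumulative document offsets for one packed sequence, in the
    cu_seqlens style used by variable-length attention kernels:
    [0, end_of_doc_0, end_of_doc_1, ..., seq_len]."""
    # A new document starts wherever the ID differs from the previous position.
    starts = np.flatnonzero(np.diff(doc_ids)) + 1
    return np.concatenate(([0], starts, [len(doc_ids)]))

# doc_boundaries(np.array([0, 0, 0, 1, 1])) -> [0, 3, 5]
```

The kernel then restricts attention to within each [start, end) span, which is exactly the block-diagonal structure intra-document masking requires.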

When Masking Matters

The ablation results from the playbook tell a specific story: at short context, document masking has limited impact; at long context, it becomes crucial.

When HuggingFace ablated document masking during SmolLM3 pretraining at 4k context, they found identical loss curves and near-identical downstream evaluation scores compared to standard causal masking, with one small exception: a slight improvement on PIQA, a physical commonsense benchmark. The efficiency benefit (less spurious computation across document boundaries) is real, but the quality signal barely moves at short context because the noise from cross-document attention is small when few documents share a sequence.

This mirrors what Meta found training Llama 3. At short context lengths, document masking was essentially a wash. But when they extended to long context — 64k, then 128k tokens — document masking became significant. At those context lengths, each training sequence spans many more documents, and the noise from cross-document attention accumulates. Masking it out produces meaningfully better representations of long-range context.

The implication for training strategy: you should enable document masking from the start, even if it doesn't visibly help early. The cost of enabling it is negligible (a mask computation), and it sets you up correctly for long context extension later. Retrofitting it after the fact — turning on document masking midway through a training run — may produce inconsistent representations because part of the model's learning happened with different attention patterns.

Fig. 3 — Document Masking Impact: Short vs. Long Context

At 4k context, standard masking and document masking produce nearly identical downstream scores. The gap opens significantly at 64k context, where sequences span many more documents and cross-document noise accumulates.

Long Context Extension

Packing and masking are not just pretraining concerns — they are directly implicated in how you extend a model's context window after pretraining. Most current practice involves training on short sequences (4k tokens) for most of the pretraining budget, then extending to longer sequences (32k, 64k, 128k) in a continuation phase. SmolLM3 went from 4k to 64k to 128k this way.

During long-context extension, the effective batch size in tokens stays roughly constant (you still target the same number of tokens per gradient update), but now each sequence is much longer. Fewer documents fit into each sequence. With document masking already in place from pretraining, the transition is clean — the mask structure is already correct, and you're just extending the range over which within-document attention is allowed.
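A quick arithmetic check of that constant token budget. The 4M-token step size is illustrative, not SmolLM3's actual setting:

```python
tokens_per_step = 2**22  # ~4.2M tokens per optimizer step (assumed)

# At each stage of context extension, the same token budget is spread
# over fewer, longer sequences.
for seq_len in (4096, 65536, 131072):
    n_seqs = tokens_per_step // seq_len
    print(f"{seq_len:>6}-token sequences -> {n_seqs:>4} per step")
```

Same gradient signal per step in token terms, but each sequence now spans far more (or, with masking, far longer) documents.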

Without document masking, the extension would require either retrofitting (risky) or accepting that the model was trained with inconsistent attention patterns during pretraining. Neither is ideal. This is why HuggingFace adopted document masking for SmolLM3 throughout the full training run despite it not visibly helping at short context — the benefit pays off when you extend.

What This Means for the Roadmap

For the research loop described in Chapter 10 — baseline model, then YAT ablations, then MTEB fine-tuning — sequence preparation choices are not secondary. They are part of the recipe.

At 60M–300M parameter scale on 4k sequences, document masking will likely not change your benchmark numbers. But it costs almost nothing to implement, and it sets the foundation for a training recipe that will generalize correctly when you extend context. The embedding fine-tuning stage for MTEB in particular will benefit from models that have learned clean within-document representations — retrieval and reranking tasks depend on understanding documents as coherent units, not arbitrary windows of tokens.

Packing, meanwhile, is non-negotiable at any scale. Padding-based training is simply too inefficient to justify. The question is not whether to pack, but how to pack — and specifically whether to accompany packing with the mask that correctly constrains what each token is allowed to see. The answer is yes, and the implementation is straightforward enough that there is no good reason not to.

The tokens you don't see — the padding tokens you eliminated, the cross-document contexts you masked, the EOS boundaries you carefully placed — shape the learned geometry as much as the tokens you do. What you leave out of the attention computation is itself a design decision.