Ask GPT-4 how many r’s are in “strawberry” and it will confidently say two. The right answer is three. This isn’t because the model can’t count. It’s because it never sees the letters at all.

Every Large Language Model (LLM) starts with the same operation: text comes in, gets chopped into chunks called tokens, and those chunks become integer IDs that index into an embedding matrix. The chunks aren’t characters and they aren’t words. They’re something more specific, and the specificity matters more than most people realize.

What a “token” really is

Most people first meet the word “token” through prices and limits: “1,500 tokens used”, “the context window is 128K tokens”. Those numbers are real, but they hide what a token actually is.

A token is the smallest unit of input a specific model can perceive. Each model has its own fixed list of tokens, called its vocabulary, decided once at training time. GPT-4’s vocabulary isn’t Claude’s. Claude’s isn’t Llama’s.

When you send text to a model, the text gets chopped into pieces from that model’s vocabulary, and each piece is swapped for an integer ID. Only those IDs ever reach the model. The model never sees text. It sees a sequence of integer indices into its own private alphabet.

So tokens aren’t “roughly like words” or “kind of like characters”. They’re the atoms of perception for one specific model, and they’re the only language that model speaks. Two models fed the same English sentence will produce two different integer sequences, often of different lengths:

"I love strawberry milkshakes!"

GPT-4:    I  ·love  ·str  aw  berry  ·milk  sh  akes  !    (9 tokens)
Llama 3:  I  ·love  ·straw  berry  ·milk  shakes  !        (7 tokens)

Each chip is one token. · marks a leading space (so ·love is the token “ love” with its leading space, distinct from love). Splits are approximate; the interactive playground at the end of the post shows exact tokenization.

The same sentence is nine tokens to GPT-4 and seven tokens to Llama 3. Not because Llama is smarter or the sentence changed, but because the two models have different vocabularies. To GPT-4, the token ·straw doesn’t exist as a single chunk, so “strawberry” splits across three pieces. Llama 3’s vocabulary happens to include ·straw, so it gets through in two.

[Interactive widget: GPT-4’s tokenizer running in the browser. Type anything (your name, a strange word, a sentence in another language) and each chip shown is one token.]

How does a model end up with one specific vocabulary instead of another? The dominant algorithm is Byte Pair Encoding, or BPE.

BPE, the algorithm

BPE is an algorithm for deciding which subword chunks deserve to be tokens, given a corpus and a target vocabulary size. (A corpus is the dataset of text used to train the tokenizer and the model, typically a giant mix of web pages, books, code, and other text. For modern models it’s measured in trillions of tokens.) BPE starts small and grows the vocabulary one merge at a time, always merging the most frequent adjacent pair in the corpus.

The whole algorithm fits on a sticky note.

The setup. You have:

- A corpus to tokenize.
- A target vocabulary size $V$ (a number you choose; typical values are 30,000 to 100,000).

You want to end up with a list of $V$ tokens such that common substrings (the, ing, to) get their own token, so common text compresses into short sequences. Rare substrings decompose into smaller pieces, down to single characters in the worst case, so nothing is ever out-of-vocabulary.

The algorithm.

1. Initialize the vocabulary as every distinct character in the corpus.
2. Scan the corpus and count every adjacent pair of tokens.
3. Take the most frequent pair, merge it into a new token, and add it to the vocabulary.
4. Repeat steps 2 and 3 until the vocabulary has $V$ entries.

That’s it. No clever scoring, no neural network, no second pass. (A neural network is a computational model made of layers of trainable mathematical functions whose parameters are tuned to fit data. Modern LLMs are massive neural networks; BPE, by contrast, is plain bookkeeping with no learned parameters.)
The “merge” in step 3 doesn’t do anything sophisticated. It just declares: from now on, whenever you see t followed by h in this corpus, treat them as one symbol called th.
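The four-step loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any production tokenizer's implementation: the function names `train_bpe` and `merge_pair`, the three-word corpus, and the tie-breaking rule (the first pair counted wins) are all hypothetical choices made for the example.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def train_bpe(word_counts, vocab_size):
    """Learn merges from a {word: count} corpus until the vocabulary has vocab_size entries."""
    # Step 1: every word starts as a sequence of single-character tokens.
    words = {tuple(word): count for word, count in word_counts.items()}
    vocab = sorted({ch for word in words for ch in word})
    merges = []
    while len(vocab) < vocab_size:
        # Step 2: count adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for tokens, count in words.items():
            for pair in zip(tokens, tokens[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every word is already a single token; nothing left to merge
        # Step 3: merge the most frequent pair (assumed tie-break: first pair counted).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab.append(best[0] + best[1])  # the originals stay in the vocabulary
        words = {merge_pair(tokens, best): count for tokens, count in words.items()}
    return vocab, merges


# Hypothetical toy corpus: 5 distinct characters, so vocab_size=7 allows two merges.
vocab, merges = train_bpe({"the": 4, "that": 3, "hen": 2}, vocab_size=7)
print(merges)  # [('t', 'h'), ('th', 'e')]
print(vocab)   # ['a', 'e', 'h', 'n', 't', 'th', 'the']
```

On this corpus the pair (t, h) occurs 7 times and wins the first round; after the merge, (th, e) occurs 4 times and wins the second. Real tokenizers add pre-tokenization and faster pair bookkeeping, which this sketch leaves out.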
Two details matter:

- The originals don’t disappear: when t and h get merged into th, all three are now in the vocabulary. If a word later happens to use t followed by some other character, the tokenizer can still represent it. The vocabulary grows monotonically.
- Pairs get re-counted after each merge: once th is a token, the next iteration might find that th + e is the new top pair → merge → the. Then · + the → ·the (the word together with its leading space). Multi-character common words emerge from running the same 4-step loop with no extra cleverness. The vocabulary builds up compositionally, each new token assembled from two existing ones.

A worked example

Let me run it on a tiny corpus: just two words, cat appearing 3 times and mat appearing 2 times.

The initial vocabulary is the 4 distinct characters that appear: c, a, t, m. Every word starts as a sequence of single-character tokens.

Initial state

| word | count | tokens |
|------|-------|--------|
| cat  | 3     | c a t  |
| mat  | 2     | m a t  |

Vocabulary (4 tokens): c a t m

Iteration 1. Count every adjacent pair, weighted by word frequency:

| pair   | count     |
|--------|-----------|
| (c, a) | 3         |
| (a, t) | 3 + 2 = 5 |
| (m, a) | 2         |

Winner: (a, t) → at. The suffix at appears in both words, which is why it scores highest. Merge it:

After merge (a, t) → at

| word | count | tokens |
|------|-------|--------|
| cat  | 3     | c at   |
| mat  | 2     | m at   |

Vocabulary (5 tokens): c a t m at

Iteration 2. Re-count:

| pair    | count |
|---------|-------|
| (c, at) | 3     |
| (m, at) | 2     |

(c, at) → cat wins because cat is the more frequent word. Merge:

After merge (c, at) → cat

| word | count | tokens |
|------|-------|--------|
| cat  | 3     | cat    |
| mat  | 2     | m at   |

Vocabulary (6 tokens): c a t m at cat

After two merges the vocabulary holds 6 tokens: c, a, t, m, at, cat. Notice what just happened. The word cat now tokenizes to a single token. The word mat still takes two tokens (m + at), because BPE judged cat worth its own ID but not yet mat. In a larger corpus where mat was more common, it would eventually merge too.
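To see how a trained vocabulary is applied to text, here is a small sketch that replays the two merges from the worked example, in the order they were learned. The function name `encode` and the constant `MERGES` are hypothetical, and replaying merges in training order is one common way to apply BPE, assumed here for illustration.

```python
def encode(word, merges):
    """Tokenize one word by replaying learned merges in the order they were learned."""
    tokens = list(word)  # start from single characters
    for left, right in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(left + right)  # apply this merge
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens


# The two merges learned in the worked example: (a, t) -> at, then (c, at) -> cat.
MERGES = [("a", "t"), ("c", "at")]

print(encode("cat", MERGES))   # ['cat']
print(encode("mat", MERGES))   # ['m', 'at']
print(encode("tact", MERGES))  # ['t', 'a', 'c', 't']
```

The last call shows why keeping the original characters in the vocabulary matters: tact contains no adjacent a, t pair, so no merge applies and the word falls back to single-character tokens.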
This is exactly what real tokenizers look like: common words collapse to one token, rarer words decompose into shared subword pieces like the at suffix.

A two-word corpus only takes the algorithm so far. Let’s step through a richer four-word corpus to watch meaningful subwords emerge.

[Interactive step-through of the four-word corpus.]

So the whole algorithm is bookkeeping. No machine learning, no scoring functions. The structure that emerges (suffixes like est, common words like low, eventually multi-character tokens for frequent strings like the, ing, tion) is a direct snapshot of the corpus’s frequency statistics.

Byte-level BPE

Look back at one line from the algorithm: “the initial vocabulary is every distinct character in the corpus”. That works fine if the corpus is plain English with no surprises. The moment you feed BPE the actual internet (Chinese, emoji, code, accented letters, rare Unicode codepoints), the “distinct characters” set explodes, and worse: any rare codepoint the corpus didn’t include is still out-of-vocabulary at the character level. (Codepoints are Unicode’s numeric IDs for characters, written as U+XXXX in hex, e.g. U+0041 for A and U+1F353 for 🍓. About 150,000 characters have been assigned so far, covering every script, symbol, and emoji.)

GPT-2 introduced a fix that’s now near-universal: don’t start with characters. Start with bytes. (A byte is just 8 bits, a number from 0 to 255. Everything stored on a computer, whether text, images, or programs, ultimately lives as a sequence of bytes; text is just a particular interpretation of byte sequences via an encoding like UTF-8.)

There are exactly 256 possible byte values, so:

- The initial vocabulary is fixed at 256, regardless of corpus.
- Every byte is in the vocabulary, by definition.
- Any text representable on a computer is, by definition, a byte sequence.
- Out-of-vocabulary is eliminated by construction. The worst case for any input is “fall back to bytes”.

The UTF-8 wrinkle. Most modern text is encoded as UTF-8, a variable-length encoding where each Unicode character becomes a sequence of 1 to 4 bytes. ASCII takes 1 byte, most European scripts 2, most Asian scripts 3, emoji 4:

| character | bytes (hex) | bytes |
|-----------|-------------|-------|
| A         | 41          | 1     |
| é         | C3 A9       | 2     |
| 中        | E4 B8 AD    | 3     |
| 🍓        | F0 9F 8D 93 | 4     |

ASCII is just “UTF-8 where every character is one byte”, so plain English text is unchanged. But 中 enters the tokenizer as the 3-byte sequence E4 B8 AD, not as a single character.

After BPE training on a multilingual corpus, the merges could end up producing a single token for the sequence E4 B8 AD. Those three bytes always appear together in any valid UTF-8 encoding of 中. The byte triple gets compressed into a “character-shaped” token via merging, the same way est and low did in the English example. The algorithm doesn’t change. We just swapped the starting alphabet.

Input: "Hello 🍓!"

Character-level:  H  e  l  l  o  ·  <UNK>  !    (8 tokens)
🍓 isn’t in the vocabulary, so it is replaced with <UNK>. The character is lost, and the model can never recover it.

Byte-level:  H  e  l  l  o  ·  F0  9F  8D  93  !    (11 tokens)
🍓 decomposes into 4 bytes F0 9F 8D 93. Every byte is in the vocabulary by construction. Nothing is lost.

Same input, two tokenizers. The character-level one fails on any character it wasn’t trained to know. The byte-level one cannot fail.

Byte-level BPE pays in tokens to win in coverage:

- The cost: non-ASCII text uses more tokens when the training corpus underrepresents the script. A Chinese sentence run through an English-heavy model decomposes into byte-level chunks rather than character-shaped tokens. Same string, more tokens. This is why API pricing tends to hit Chinese, Arabic, and Hindi harder than English.
- The guarantee: nothing is ever out-of-vocabulary. The starting vocabulary is fixed at 256 entries, every byte sequence is representable by construction, and there’s no <UNK> token to lose information to.
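The byte counts and the two tokenizations of "Hello 🍓!" above can be checked directly with Python's built-in UTF-8 encoder. This is a sketch of the starting alphabets only, before any merges; the helper name `utf8_hex` and the tiny character vocabulary `CHAR_VOCAB` are hypothetical, chosen just to mirror the example.

```python
def utf8_hex(text):
    """Return the UTF-8 bytes of `text` as two-digit uppercase hex strings."""
    return [f"{b:02X}" for b in text.encode("utf-8")]


# A, é (U+00E9), 中 (U+4E2D), 🍓 (U+1F353): 1, 2, 3, and 4 bytes respectively.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F353"]:
    print(ch, utf8_hex(ch))

TEXT = "Hello \U0001F353!"

# Character-level: assume a vocabulary that only knows the ASCII characters used here.
CHAR_VOCAB = set("Hello !")
char_tokens = [ch if ch in CHAR_VOCAB else "<UNK>" for ch in TEXT]
print(len(char_tokens), char_tokens)  # 8 tokens, one of them <UNK>

# Byte-level: every value 0..255 is in the vocabulary, so nothing maps to <UNK>.
byte_tokens = list(TEXT.encode("utf-8"))
print(len(byte_tokens), byte_tokens)  # 11 tokens

# The byte sequence round-trips back to the original text; the <UNK> version cannot.
print(bytes(byte_tokens).decode("utf-8") == TEXT)  # True
```

The round-trip check at the end is the guarantee in miniature: the byte-level IDs carry enough information to reconstruct the strawberry, while the character-level sequence has already discarded it.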
Once you internalize that the model literally never sees characters (only integer IDs corresponding to byte sequences that may or may not align with human characters), a bunch of LLM weirdness stops being mysterious. The strawberry problem is one example. We’ll get there.

Vocabulary size as a design knob

Vocabulary size $V$ (the number of distinct tokens in the model’s vocabulary) is a hyperparameter, meaning it is set by hand before training rather than learned from data. The obvious instinct is that bigger should be better, since common substrings collapse into single tokens and text compresses into shorter sequences. So why do real models stop at 32K to 256K? Why not a vocabulary of a million tokens, or ten million?

The short answer: $V$ controls three different costs at once and only one benefit, and the costs quickly become severe.

Alongside $V$ sits one other number that shows up in nearly every formula below: $d$, the model’s hidden dimension. It’s the width of every vector the model passes around internally. For a 7B-class model $d$ is around 4,096; for 70B-class models it grows to 8,192. Bigger $d$ gives vectors more room to encode meaning, but compute in the model’s dense layers grows roughly with $d^2$. Most of the formulas below are some flavor of $V \cdot d$.

A quick clarification on those size labels: “7B” means the model has 7 billion learned parameters in total, “70B” means 70 billion. That total is a fixed budget the whole model has to share. Even the vocab tables we’re about to discuss come out of it: every parameter the designer spends on one part of the model is a parameter that cannot go to another part.

The benefit: compression. Bigger $V$ means more common substrings get their own token, which means a given document encodes into fewer tokens.
Shorter sequences are worth a lot:

- Less work per document: the model processes fewer tokens to read the same text.
- More content per budget: a fixed input window holds more real text.
- Lower compute cost: both training and inference scale with token count, so each gets cheaper.

Cost 1: embedding matrices. Every token needs its own row in the embedding matrix, which has shape (a matrix’s shape names its row and column counts)