Salah Adawi

How LLMs Work — A Visual Deep Dive

We believe that this document is primarily AI-generated with some human-written content.

Hacker News Article AI Analysis

Content Label: AI (AI Generated: 88% · Human: 12%)
Window 1 - 100% AI-Generated
How LLMs Actually Work

A complete walkthrough of how large language models like ChatGPT are built — from raw internet text to a conversational assistant. Based on Andrej Karpathy's technical deep dive.

Training Tokens: 15T · Parameters: 405B · Text Data: 44 TB · Token Vocabulary: 100K

Chapter 1 · Pre-Training · Stage 1
Downloading the Internet

The first step is collecting an enormous amount of text. Organizations like Common Crawl have been crawling the web since 2007 — indexing 2.7 billion pages by 2024. This raw data is then filtered into a high-quality dataset like FineWeb.

The goal: a large quantity of high-quality, diverse documents. After aggressive filtering, you end up with about 44 terabytes — roughly what fits on a single hard drive — representing ~15 trillion tokens.

Key Insight: The quality and diversity of this training data have more impact on the final model than almost anything else. Garbage in, garbage out — but at trillion-token scale.

🌐 Common Crawl (2.7B web pages · Raw HTML · Since 2007)
A non-profit organization that crawls the web and freely provides its data. Its bots follow links from seed pages, recursively indexing the internet. The raw archive is petabytes of gzip'd WARC files containing raw HTML.

🚫 URL Filtering (Blocklists · Malware · Spam · Adult content)
Blocklists of known malware sites, spam networks, adult content, marketing pages, and low-quality domains are applied. Entire domains can be removed. This is the cheapest filter, so it runs first.

📄 Text Extraction (HTML → clean text · Remove navigation & CSS)
Raw HTML contains <div> tags, CSS, JavaScript, navigation menus, and ads. Parsers extract just the meaningful text content. This is harder than it sounds — heuristics decide what is "content" vs. "chrome".

🌍 Language Filtering (Keep pages ≥65% English · Language classifier)
A language classifier estimates the language of each page. Pages with less than 65% target-language content are dropped.
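The early filtering stages can be sketched in a few lines. The blocklist and the common-word "language score" below are toy stand-ins for the curated blocklists and trained language classifiers (e.g. fastText) a real pipeline would use:

```python
# Toy sketch of the URL-filtering and language-filtering stages.
# BLOCKED_DOMAINS and english_score are illustrative stand-ins, not
# the heuristics any real pipeline (e.g. FineWeb's) actually uses.
BLOCKED_DOMAINS = {"spam.example", "malware.example"}

def url_filter(url: str) -> bool:
    """Cheapest filter runs first: drop pages from blocked domains."""
    domain = url.split("/")[2] if "://" in url else url.split("/")[0]
    return domain not in BLOCKED_DOMAINS

def english_score(text: str) -> float:
    """Toy proxy for a language classifier: fraction of common English words."""
    common = {"the", "and", "of", "to", "a", "in", "is", "that", "it", "for"}
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in common for w in words) / len(words)

def keep_page(url: str, text: str, threshold: float = 0.15) -> bool:
    """Apply the cheap URL filter, then the language filter."""
    return url_filter(url) and english_score(text) >= threshold
```

The ordering mirrors the article's point: the cheap domain check runs before the more expensive content scoring.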
Window 2 - 100% AI-Generated
This is a design decision — filter aggressively for one language, or train multilingual.

♻️ Deduplication (Exact & fuzzy matching · Reduce repetition)
Identical or near-identical pages appear millions of times on the internet (copied articles, boilerplate). Training on the same text repeatedly causes memorization. Dedup uses MinHash and exact-match techniques to remove duplicates.

🔒 PII Removal (Names · Addresses · SSNs · Emails)
Personally Identifiable Information is detected and either redacted or the page is dropped. Regex patterns and ML classifiers find phone numbers, emails, Social Security numbers, physical addresses, and named individuals.

✅ FineWeb Dataset (44 TB · 15 trillion tokens · High quality)
The final filtered dataset. Articles about tornadoes in 2012, medical facts, history, code, recipes, science papers — the full breadth of human knowledge expressed in text. This becomes the training corpus.

Chapter 1 · Pre-Training · Stage 2
Tokenization

Neural networks can't process raw text — they need numbers. The solution is tokenization: breaking text into "tokens" (sub-word chunks) and assigning each an ID.

GPT-4 uses a vocabulary of 100,277 tokens, built via the Byte Pair Encoding (BPE) algorithm. BPE starts with individual bytes (256 symbols), then iteratively merges the most frequent adjacent pairs — compressing the sequence length while expanding the vocabulary.

Why not just use words? Words have endless variants: "run", "running", and "runner" would be three separate entries. Subword tokens share roots: "run" + "ning", "run" + "ner". This also handles new words, typos, and multiple languages efficiently.

[Interactive diagram: BPE in Action — Byte Pair Encoding progressively merges characters into subword tokens. Interactive demo: live tokenizer.]
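The BPE merge loop described above can be sketched as follows. This is a character-level toy: real tokenizers like GPT-4's start from bytes and record a reusable merge table, not just the final tokens:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair, new_token):
    """Replace every occurrence of `pair` with the merged `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def bpe_train(text, num_merges):
    """Toy BPE: start from characters, greedily merge frequent pairs."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair, pair[0] + pair[1])
    return tokens
```

Each merge shortens the sequence while adding one entry to the vocabulary, which is exactly the compression trade-off the text describes.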
Window 3 - 100% AI-Generated
Explore tokenization across GPT-4, Claude, Llama and more → tiktokenizer.vercel.app

Chapter 1 · Pre-Training · Stage 3
Training the Neural Network

The Transformer neural network is initialized with random parameters — billions of "knobs". Training adjusts these knobs so the network gets better at predicting the next token in any sequence.

Every training step: sample a window of tokens → feed it to the network → compare the prediction to the actual next token → nudge all parameters slightly in the right direction. Repeat billions of times.

The loss — a single number measuring prediction error — falls steadily as the model learns the statistical patterns of human language.

Scale: GPT-2 (2019): 1.6B params, 100B tokens, ~$40K to train. Today: the same quality costs ~$100. Llama 3: 405B params, 15T tokens. Modern frontier models: hundreds of billions of parameters, trillions of tokens.

[Interactive diagram: Transformer architecture, with cross-entropy training loss plotted against training step. Sample model output at step 500: "the model has learning but confustion still the wqp mxr model bns to predict..."]

What the model is learning: At step 1, pure noise. By step 500, local coherence appears. By step 32K, fluent English. The model is learning grammar, facts, and reasoning patterns — all implicitly from token prediction.

Chapter 1 · Pre-Training · Stage 4
Inference & Token Sampling

Once trained, the network generates text autoregressively: feed in a sequence of tokens → get a probability distribution over all 100K possible next tokens → sample one → append it → repeat.

This process is stochastic — the same prompt generates different outputs every time, because we're flipping a biased coin. Higher-probability tokens are more likely, but not guaranteed, to be chosen.

Temperature controls randomness. Low temperature (0.1) → the model almost always picks the top token.
High temperature (2.0) → uniform chaos.
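Temperature-scaled sampling takes only a few lines. This sketch works on a toy logit vector in plain Python rather than a real model's 100K-way output:

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities; temperature rescales them first.
    T < 1 sharpens the distribution, T > 1 flattens it toward uniform."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=0.8, rng=random):
    """One 'informed coin flip': sample a token index from the distribution."""
    probs = softmax_with_temperature(logits, temperature)
    r, cum = rng.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1
```

Dividing the logits by a small temperature exaggerates the gaps between them, which is why low temperature collapses toward the top token and high temperature approaches uniform chaos.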
Window 4 - 100% AI-Generated
0.7–1.0 is the sweet spot for coherent-but-creative text.

Key Mental Model: The model doesn't "think" about what to say. It computes a probability distribution over all possible next tokens and samples from it. Every word is a coin flip — just a very informed one.

[Interactive demo: token sampling. Each bar shows the probability of a candidate next token for the prompt "The sky appears blue", at temperature 0.8.]

Chapter 2 · The Base Model
The Internet Simulator

After pre-training, you have a base model — a sophisticated autocomplete engine. It's not an assistant. It doesn't answer questions. It continues token sequences based on what it saw on the internet.

Give it a Wikipedia sentence and it'll complete it from memory. Ask it "What is 2+2?" and it might give you a math textbook page, a quiz answer key, or go off on a tangent — whatever was statistically common in its training data.

The base model's knowledge lives in its 405 billion parameters — a lossy compression of the internet, like a zip file that approximates rather than perfectly stores information.

Base Model Behavior

Few-Shot Prompting
Hello: Bonjour | Cat: Chat | Dog: Chien | Teacher: → Professeur ✓ correct

Memorization
Zebras (/ˈzɛbrə, ˈziːbrə/) are African equines with distinctive... → ...black-and-white striped coats. There are three living species: the Grévy's zebra, plains zebra, and mountain zebra...
↑ Verbatim Wikipedia recall from weights

Hallucination
The Republican Party nominated Trump and [running mate] in the 2024 election against...
→ ...Mike Pence, facing Hillary Clinton and Tim Kaine...
→ ...Ron DeSantis, against Joe Biden and Kamala Harris...
↑ Knowledge cutoff → plausible confabulation

In-Context Learning: Base models can perform translation, classification, and Q&A via few-shot prompts — no fine-tuning needed. The model infers the task from the pattern of examples in its context window.

Chapter 3 · Post-Training
Building the Assistant

The base model is a token simulator.
Window 5 - Human
To turn it into a helpful assistant, we need post-training — a much cheaper but equally critical stage. This is where the model learns conversations.

Supervised Fine-Tuning (SFT)

Human labelers create a dataset of ideal conversations, following detailed labeling instructions: be helpful, be truthful, be harmless. The model is then trained on these conversations — not from scratch, but by continuing to adjust the pre-trained weights on this new data.

Modern SFT datasets (like UltraChat) contain millions of conversations — mostly synthetic (LLM-generated), with human review. The model learns by imitation: it adopts the persona of the ideal assistant reflected in the data.

Human: What is 2 + 2?
Assistant: 2 + 2 = 4. Is there anything else you'd like help with?
Human: What if it was multiplication instead?
Assistant: 2 × 2 = 4 as well — the same result! For multiplication, 2 × 2 means adding 2 to itself once, giving you 4.

What you're really talking to: ChatGPT is a statistical simulation of the human labelers OpenAI hired — experts following labeling instructions. When it answers a coding question, it's imitating what a skilled developer-labeler would write.

Conversation Token Format

Every conversation must be encoded as a flat token sequence. Special tokens mark the structure:

<|im_start|>user<|im_sep|>What is 2 + 2?<|im_end|>
<|im_start|>assistant<|im_sep|>2 + 2 = 4.<|im_end|>

Then RLHF refines the assistant's behavior further:

RLHF — Reinforcement Learning from Human Feedback

Human raters rank multiple model responses. A reward model learns to predict human preferences. The language model is then trained via reinforcement learning to generate responses the reward model scores highly.

✓ Preferred: Here are the top 5 landmarks in Paris: 1) Eiffel Tower — iconic iron lattice structure... 2) The Louvre — world's largest art museum...

✗ Rejected: Paris has many landmarks. You should visit the Eiffel Tower. There is also a museum called the Louvre.
Also Notre-Dame Cathedral is there...

Why RLHF matters: SFT teaches the model what to say.
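The conversation token format described above can be reproduced with a small helper. The <|im_start|>/<|im_sep|>/<|im_end|> strings follow the article's example; the function itself is illustrative, not any particular library's API:

```python
def render_conversation(turns):
    """Flatten a list of (role, message) turns into the special-token
    format shown above: <|im_start|>role<|im_sep|>message<|im_end|>.
    In a real system each special marker maps to a dedicated token ID."""
    parts = []
    for role, message in turns:
        parts.append(f"<|im_start|>{role}<|im_sep|>{message}<|im_end|>")
    return "".join(parts)

# Example: the 2 + 2 exchange from the SFT section.
encoded = render_conversation([
    ("user", "What is 2 + 2?"),
    ("assistant", "2 + 2 = 4."),
])
```

The resulting string is then tokenized like any other text, which is how a conversation becomes the flat token sequence the model is trained on.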