Making a vintage LLM from scratch - Cr;Lf;

C crlf.link ↗

▲ 103 points • 29 comments • by croqaz • 2w ago • HN discussion ↗

2026 May 25, Mon 50 min

In this blog post, I will share the adventures I had creating my own LLM, from (almost) scratch, trained only on old texts. I made my own base-training and fine-tuning scripts, data processing pipelines and custom datasets.
("Almost from scratch" means I did use existing programming languages and libraries. I didn't write in Assembly, just like anyone else who builds an AI "from scratch"...)

The model can be found on HuggingFace: https://huggingface.co/croqaz/vintage-LLM-340m-v1-base
All the code is open source at: https://github.com/croqaz/vintage-LLM
If you want to check out bigger Vintage models, see my previous post: Vintage LLM models.

The idea

Three months ago, at the end of February, I discovered a few Reddit posts by Hayk Grigorian, where he described creating his temporal gated language model. I was absolutely fascinated.

Training an LLM only on 1800s London texts, 90GB dataset:
https://reddit.com/r/LocalLLaMA/comments/1pkpsee/training_an_llm_only_on_1800s_london_texts_90gb
LLM trained from scratch on only 1800s London texts brings up a real protest from 1834:
https://reddit.com/r/LocalLLaMA/comments/1mvnmjo/my_llm_trained_from_scratch_on_only_1800s_london

Obviously, I had read posts from other people who made their own LLMs, but maybe I wasn't ready to do it myself, or the models they were working on weren't that interesting to me. Anyway, the thought of having my own Victorian chat bot... fuckin' epic !!

Since then, I have worked on my own "Vintage LLM" every single day. Without exceptions. Even when I was sick.

In the meantime, a lot more historic LLMs have been released, like: Violet-1B4-Chat, Mr. Chatterbox, GPT-1900, Talkie and TypewriterLM-base.

What, why, where and how?

What?
This is a time-locked LLM / historical LLM, English only, and its knowledge cutoff is the year 1900.
(Limiting to a specific year is error prone, but I did my best.)
It is based on the Llama architecture and has 340M (0.3B) params.

Why?
Because I can only learn if I do it myself, and it's a super fun project.

Where and how?
I made my own dataset, my own processing and training code.
The code is semi-vibe-coded with whatever LLM I had with VS-Code and PI (OpenRouter models).
I checked and validated every single function and I deeply understand what every single code file is doing.
The dataset processing took the most time. I tried all sorts of things that didn't work, and I wasted a ton of time. Complicated solutions are the worst...

I processed all the data on my own PC and I trained smaller versions of the LLM on my PC (Cachy OS Linux, AMD Ryzen 7 9700X CPU, 64GB RAM, Radeon RX 9070 16GB VRAM).
As for the larger 340M model, I trained it on RunPod, ThunderCompute and Vast.ai. It would have taken forever on my PC.

The total cost of this project was ~$80, GPU costs only.
That's because I have a decent PC to process the data. If I had more RAM, I could have processed some of the data much faster, especially when it comes to de-duplicating texts in memory.

Disclaimer: This is a toy / hobby LLM (but I treat it very seriously).
It will hallucinate and generate semi-accurate historic content which was considered normal at the time, but by today's standards is considered toxic, offensive and unsafe. This is expected, because I didn't do any alignment. Aligning (or censoring) the model requires significant effort and it would ruin the historic accuracy.
Also, I can't guarantee that my model is strictly limited to the year 1900 (even if I did my best), e.g. well enough to pass the "Albert Einstein test".

The plan

I use AI every day at work and I understand how it works, but I had never built an LLM myself.

I ran specific AI training and fine-tuning pipelines at work, and I built tiny neural networks in C and Python in the past, but when I started this project I didn't know how people usually build LLMs. I searched for a week and I chatted with multiple bots to get different points of view (like I always do when I research a topic).

In short, to build an LLM you need 4 things:

1. the data -- an LLM has no discernment or understanding. It will learn from anything you give it, good or bad. This is the longest process.
2. tokenization -- the Tokenizer is a little program that converts words or letters into numbers (tokens). LLMs don't understand words, they only understand numbers.
3. pre-training -- it's a confusing expression and it means "base-training", where the LLM learns to autocomplete text. If you're going for a model with 300M+ params, this is the most expensive process.
4. fine-tuning -- where the LLM learns how to chat in turns, question & answer.

Well, there's a bit more to it than these simple steps, but I won't go super deep in this article. Now, let's look at each step in more detail.
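
To make the tokenization step a bit more concrete, here is a minimal sketch of a character-level tokenizer in Python. This is only an illustration under loud assumptions: `CharTokenizer` is a hypothetical name, it is not the tokenizer used for the Vintage LLM, and real LLM tokenizers usually work on sub-word pieces instead of single characters.

```python
class CharTokenizer:
    """Toy character-level tokenizer (hypothetical, for illustration only)."""

    def __init__(self, corpus: str) -> None:
        # Sort the distinct characters so the vocabulary is deterministic
        self.itos = sorted(set(corpus))
        # Reverse mapping: character -> integer ID
        self.stoi = {ch: i for i, ch in enumerate(self.itos)}

    def encode(self, text: str) -> list[int]:
        # Raises KeyError for characters that were not in the corpus
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


tok = CharTokenizer("hello world")
ids = tok.encode("hello")    # the model only ever sees these numbers
text = tok.decode(ids)       # and the numbers map back to the same text
```

The point is only that text goes in, numbers come out, and the mapping is reversible.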

Initial experiments

It's worth mentioning that I made lots of mistakes, and I experimented with some datasets and model architectures before I settled on the "big" model.
("Big" in quotes here because, you know, compared to larger models like Talkie-13B or TypewriterLM-7.24B, my model is just a toy.)

Some details about my v2 toy EleutherAI/pythia-14m model that I trained on my own PC:
https://github.com/croqaz/vintage-LLM/tree/e272b94fcf96316f874babbed549d20809fe5a39/models/m-v2

If you look at the validation loss and perplexity SVG files, you'll see huge jumps. That's because I didn't randomize the file chunks and the dataset files were tokenized in alphabetic order. It happened that the clean book files came first, and as soon as the model started to get exposed to the Time-Capsule dataset, it gradually became worse and worse, because that dataset was bad, with lots of weird OCR artefacts, broken words & sentences and so on.
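
The ordering problem described above can be avoided by shuffling the token chunks before training. Here is a minimal sketch, assuming the tokenized files fit in memory as lists of token IDs. The function name `shuffled_chunks`, its parameters and the file names are hypothetical, and this is not the code from the repository.

```python
import random


def shuffled_chunks(
    files: dict[str, list[int]], chunk_len: int, seed: int = 42
) -> list[list[int]]:
    """Cut every tokenized file into fixed-size chunks, then shuffle them all."""
    chunks = []
    for name in sorted(files):  # alphabetic order, like the original mistake
        tokens = files[name]
        # Step through the file; a trailing partial chunk is dropped
        for start in range(0, len(tokens) - chunk_len + 1, chunk_len):
            chunks.append(tokens[start:start + chunk_len])
    # A seeded shuffle mixes clean and noisy sources, and stays reproducible
    random.Random(seed).shuffle(chunks)
    return chunks


files = {
    "a_clean_books": list(range(8)),        # stand-in for clean book tokens
    "z_noisy_ocr": list(range(100, 108)),   # stand-in for noisy OCR tokens
}
chunks = shuffled_chunks(files, chunk_len=4)
```

With the chunks mixed like this, each training batch should see a blend of clean and noisy sources, instead of all the clean files first and all the noisy ones last.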

Mistakes were made ... but I learned from them. I was stuck for a while, until I started to filter out the bad documents.

Data

The data processing was the longest and most boring process by far, and I hope you understand why...
There are plenty of modern, high quality datasets with data scraped from the internet, but I didn't want my LLM to learn about computers, atomic bombs and space-ships, so I had no choice but to make my own.
Luckily there are some datasets available, but their quality is pretty bad, and most of my work was to de-duplicate, filter out the really bad texts and enhance some of the existing texts.

The historic datasets are quite limited. The old books are all we have, unless we discover more old books and someone scans them, so we have to use what we have.

A few datasets worth mentioning are: Project Gutenberg, Oxford Text Archive, Internet Archive books, TheBritishLibrary/blbooks, storytracer/LoC-PD-Books, dell-research-harvard/AmericanStories, dell-research-harvard/NewsWire, Heritage Made Digital Newspapers (HMD).

I did my best to find out the year and language of each dataset, so I could limit everything to English before the year 1900.
I completely ignored documents that don't specify a year, or where I couldn't find a date in the text, even if they had good quality, just to be safe.
As a side-project, I created the Book-Metadata HF dataset with lots of old books, with title, author, book ID and source: https://huggingface.co/datasets/croqaz/book-metadata
My goal was to find out the year of all Gutenberg books, but I only found 5300 books where I'm 100% sure of the year.

Again, this took absolutely forever and as I am writing this blog post, I'm still not completely finished.
If I ever train another LLM, I'll have more and better data next time.

Initially I wanted to use several de-duplication methods, including MinHash and embeddings vector similarity. If you don't know what that means, don't worry, I won't go into details. It was way too slow and expensive and I had to abandon it.

Just to give you a taste of how slow this was: I had a beefy DEV server calculate embeddings for my short text datasets, and it got through 10% of the dataset in a week, with the server running day & night. The server had an RTX 3090 GPU that I shared and so on.

In the end, I de-duplicated based on the normalized text (lower-case text with all spacing removed). Basically the text "hello world" is identical to the text " Hello World " (notice the spaces and title case), and the text is only saved once in my dataset.

From the start, I knew that the data is the most important part: garbage in -> garbage out.

I did lots and lots of experiments and iterations. I played with DBs like Qdrant, Zvec, Lance, Valkey, and LevelDB for storing the datasets.

I dropped Qdrant because the DB disk size was huge even before I started adding many entries.
I dropped Zvec because I had no way to cycle through the DB entries. Basically, once you save them, there's no way to explore the DB. I created an issue for this. Zvec is also super new and I should probably give it more time to mature.
I dropped Lance because of the versioning: the DB becomes slower and slower once you start adding more than a few million entries. This could be my fault, and I'm sure I can find a way to do this better.
I dropped Valkey because I ran out of memory. After ingesting something like 10 mil records, the server started to OOM crash on my PC, and I still had to ingest way more data. Other than that, Valkey was really great.
I ended up using LevelDB (which was used by local wallet apps to store Bitcoin and Ethereum transactions), so I know it can scale on my PC. I ingested 12 mil rows without any issues and with minimal CPU or RAM usage. LevelDB can be slow at times, but it's consistently reliable.
If I had a better PC, or a super computer, I would probably have used Valkey all the way.

To get a sense of the quality of the texts, I was first looking at the length and unique characters of each document. For the first stage I decided to use short texts (up to 32k long) and for the second, long texts up to 10MB.
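
The normalized-text de-duplication described above can be sketched in a few lines of Python. This is only a sketch: the function names `normalize` and `dedup` are hypothetical, and hashing the normalized text to keep the "seen" set small is my assumption, not something taken from the actual pipeline.

```python
import hashlib


def normalize(text: str) -> str:
    """Lower-case the text and remove all whitespace."""
    return "".join(text.lower().split())


def dedup(texts: list[str]) -> list[str]:
    """Keep only the first occurrence of each normalized text."""
    seen = set()
    unique = []
    for text in texts:
        # Assumption: store a fixed-size digest instead of the full text
        key = hashlib.sha1(normalize(text).encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


result = dedup(["hello world", " Hello World ", "another text"])
# result keeps "hello world" and "another text"; the second entry is a duplicate
```

For datasets that don't fit in RAM, the same key could be stored in an on-disk key-value store instead of a Python set.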

English shouldn't use more than 30-50 symbols normally; if a chunk of text has 100 or more unique symbols, it's not English, so I discarded those. And if a text has only 8 unique symbols, it doesn't make any sense, so I removed those too.

I applied 3 more filters.

The first is a super easy metric: the ZLIB compression ratio. Text that is too short and diverse gets a big value, and text that is super repetitive gets a really small value.

    import zlib

    # ZLIB compression ratio
    # A good window is 0.5...0.7;
    def compression_ratio(text: str) -> float:
        raw = text.encode("utf-8")
        compressed = zlib.compress(raw)
        return len(compressed) / len(raw)

    compression_ratio("Lorem ipsum dolor sit amet")  # 1.3
    compression_ratio("other, and other and other" * 100)  # 0.01 -- very repeated
    sample = (
        "The President has nominated Thomas Johnson, William Cranch, and Charles\n"
        "Simms, Judges of the district of Columbia.\n\n"
        "On Saturday last, Thomas Jefferson, at\n"
        "present Vice President of the United States,\n"
        "and President of the Senate, took leave of\n"
        "that body on which occasion he delivered\n"
        "the following address:\n\n"
        "Gentlemen of the Senate,\n\n"
        "To give the usual opportunity"
    )
    compression_ratio(sample)  # 0.64 -- Regular text

Also the Shannon entropy:

    from collections import Counter
    from math import log2

    # Shannon Entropy, in bits per character (character frequencies only)
    # For regular English text, this value lands around 4.2...5.5;
    def char_entropy(text: str) -> float:
        counts = Counter(text)
        total = len(text)
        entropy = 0.0
        for count in counts.values():
            p = count / total
            entropy -= p * log2(p)
        return entropy

    char_entropy(("a " * 10 + "!"))  # 1.22 -- Too low
    char_entropy("Lorem ipsum dolor sit amet")  # 3.6 -- A bit low
    char_entropy("IN the High court of Chancery for the Rich\nmond District,\nBetween\nHenry Banks plaintiff,\nAnd\nNathaniel Anderson, Robert Pollard.")  # 4.5 -- Regular english
    char_entropy(''.join(chr(i) for i in range(200)))  # 7.6 -- Super high entropy

And my own quality detection filter.