Training an LLM in Swift, Part 1: Taking matrix multiplication from Gflop/s to Tflop/s | Cocoa with Love
In this article, I try to get my own handwritten matrix multiplication code running as fast as possible for training a Large Language Model (LLM) in Swift. The aim is to give some insight into the key steps for optimizing mathematics code in Swift. I also hope that these examples will offer a sense of scale about the capabilities of the different units on Apple Silicon – CPU, SIMD, AMX and GPU.

This will be the first in a series where I look at training neural networks in Swift on Apple Silicon. Future articles will look at the maybe-too-many frameworks Apple offer for machine learning on the Mac. Those established frameworks are what you should really use for matrix multiplication and machine learning (they’ve spent a few more years optimizing matrix kernels than I have). But until then, I’m having fun writing everything for myself in a “no frameworks, no libraries” plain code approach.

And I’m not just writing matrix multiplication kernels. The sample app will use these kernels as part of a full LLM implementation and the numbers I’ll quote will be for entire forward and backward training iterations. The reference implementation for this series will be Andrej Karpathy’s llm.c (a plain C implementation of a GPT2-compatible model). It’s a fairly basic model but it does contain all the necessary components and is representative of real-world workloads.

That means it’s time for my favorite game: optimize Swift until it’s faster than C.

Contents
Backstory
llm.c
Basic Swift
Span, Egg and Span
Chill vibes
They see me rollin'
Multi-threaded
Top-secret hacks
Maximum power draw
Test harness
Conclusion

Backstory

About two years ago, I dug up my engineering thesis from the early 2000s. It’s an image recognizer written in C++ that uses a neural network for classifying images. I wanted to get my old code running again but I hadn’t worked on ML code in a long time. It got annoying and I gave up.
For all the discussion around LLMs in early 2024, it felt like no one was training neural networks on the Mac. At least, not in languages like Swift. I played with some Python libraries like PyTorch and TensorFlow but Python never does the calculations itself – it operates more like an orchestrator of another computational engine under the hood – and the separation left me feeling like I wasn’t in control.
A month later, Andrej Karpathy released llm.c. This reached me in a way that other machine learning content didn’t because nothing is hidden. It is around 1000 lines of plain C and (although it’s filled with some pretty cryptic variable names) it’s relatively readable. So naturally, I immediately rewrote it in Swift. And it was a lot of fun to play with.

Of course, playing with the code required some work to make it run fast. Some foreshadowing here: the initial Swift implementation was really super slow. But optimization is a constant process: there’s always something more you can try. Which finally brings me to this article: I’m going to walk through the different explorations I wrote then (and a couple I’ve added in the last week) to make an LLM train fairly quickly without resorting to using a library. Most of the code will be in Swift (although I’ll show a Metal implementation at the end).

By the way, I will not be explaining how a neural network or an LLM works. If you’re interested, Karpathy’s video “Let’s build GPT: from scratch, in code, spelled out.” is practically the definitive guide to learning how GPT-like LLMs work, and his earlier 5-video series, starting with “The spelled-out intro to language modeling: building makemore”, covers plenty of introductory concepts if you want a gentler start. Of course, both are in Python, so please come back here when you’re ready to see how we can do things in Swift.

llm.c

Machine learning is essentially the application of model weights to input data (called the forward pass, a.k.a. inference), then the calculation of error gradients and an update to those weights (the backward pass). We typically package these calculations together and try to make them run as fast as possible. These packages of operations might be called: “linear tensor projection”, “matrix multiplication”, or even a series of “vector dot products” (depending on how big or small you slice the units of work).
It’s ultimately a loop that performs z += x * y a lot of times. Since these matrix multiplications represent so much of the work in machine learning, I’m going to focus on the code that does this. I will be updating the rest of the implementation as I go, but only using the same improvements I’m showing to matrix multiplication.
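To make the z += x * y idea concrete, here is a minimal sketch of the smallest unit of work mentioned above, a vector dot product. The function name dot and the sample values are my own illustration and are not taken from llm.c or the sample app.

```swift
/// Sum of element-wise products: the `z += x * y` loop in its smallest form.
/// `dot` is a hypothetical helper used only for illustration.
func dot(_ x: [Float], _ y: [Float]) -> Float {
    precondition(x.count == y.count, "vectors must have the same length")
    var z: Float = 0
    for i in 0..<x.count {
        z += x[i] * y[i]   // one multiply and one add per element
    }
    return z
}

// 1*4 + 2*5 + 3*6 = 32
let z = dot([1, 2, 3], [4, 5, 6])
print(z) // prints "32.0"
```

A matrix multiplication is this same loop repeated once for every cell of the output matrix.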
Let’s start by looking at matmul_forward from llm.c, which is the core matrix multiplication used on the forward pass. It iterates over the input (inp), multiplies by model weights (weight), and adds the result to the running total (val).

    void matmul_forward(float* out,
                        const float* inp, const float* weight, const float* bias,
                        int B, int T, int C, int OC) {
        for (int b = 0; b < B; b++) {
            for (int t = 0; t < T; t++) {
                int bt = b * T + t;  // row index into inp and out
                for (int o = 0; o < OC; o++) {
                    float val = (bias != NULL) ? bias[o] : 0.0f;
                    for (int i = 0; i < C; i++) {
                        val += inp[bt * C + i] * weight[o*C + i];
                    }
                    out[bt * OC + o] = val;
                }
            }
        }
    }

Maths aside: The code treats the input values inp as a matrix of (B * T) x C and the weights weight as a matrix of OC x C to create a new matrix of dimension (B * T) x OC by calculating inp x transpose(weight), where each cell in the output is the sum of products of a row from inp and a row from weight (normally, row-major multiplication is row by column but we’re treating weight as transposed so it’s row by row).

The four layers of loops add some visual complexity but in reality, that val += inp[bt * C + i] * weight[o*C + i]; line is the heart of a neural network. Like I said: z += x * y a lot.

How much? The val line contains 2 floating point operations but Karpathy says the number of floating point operations in a full training iteration should be roughly 6 x N x D where N is the number of weights in the model (124,439,808 in our case) and D is B * T = 4 * 64 = 256 for our app.
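As a quick check of the 6 x N x D estimate, here is a small sketch of the arithmetic using the GPT-2 parameter count of 124,439,808 with B = 4 and T = 64. The constant names and the gflopsPerSecond helper are hypothetical, introduced only for this illustration.

```swift
// Rough flop estimate for one training iteration: 6 x N x D.
let parameterCount = 124_439_808                      // N: number of weights
let batchSize = 4                                     // B
let sequenceLength = 64                               // T
let tokensPerIteration = batchSize * sequenceLength   // D = B * T = 256

let flopsPerIteration = 6 * parameterCount * tokensPerIteration
print(flopsPerIteration) // prints "191139545088"

/// Converts a measured training rate into Gflop/s using the estimate above.
/// Hypothetical helper, not part of the sample app.
func gflopsPerSecond(iterationsPerSecond: Double) -> Double {
    Double(flopsPerIteration) * iterationsPerSecond / 1e9
}
```

Multiplying any of the "training iterations/s" figures by roughly 191 gives an approximate Gflop/s rate for that run.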
So we’re talking about 6 x 124,439,808 x 256 ≈ 1.911×10¹¹ ≈ 0.2 trillion floating point operations per training iteration. So it’s got to run quick.

| Model | Tokens/s | Training iterations/s |
|-------|----------|-----------------------|
| llm.c | 0.92     | 0.174                 |

The plain C code runs easily in a Swift Package. I’ve fixed the C implementation to always run at -O3 optimization level (regardless of Xcode settings). Even at this optimization level, the C implementation manages just one training iteration roughly every 6 seconds and inference at less than 1 token per second. A wonderful proof of concept but 10 times slower than would ever be useful.

Basic Swift

I’ve tried my best to keep the basic Swift version as true to the C version as possible:

    static func matmul_forward(out: inout [Float], inp: [Float], weight: [Float], bias: [Float]?,
                               B: Int, T: Int, C: Int, OC: Int) {
        for b in 0..<B {
            for t in 0..<T {
                let bt = b * T + t
                for o in 0..<OC {
                    var value = bias?[o] ?? 0
                    for i in 0..<C {
                        value += inp[bt * C + i] * weight[o * C + i]
                    }
                    out[bt * OC + o] = value
                }
            }
        }
    }

Since the C code is inherently “unsafe”, I went ahead and gave the Swift code the same advantage by setting it to run with -remove-runtime-asserts (removing the runtime checking on array indices) and made sure to always run the app in “Release” configuration. So the Swift and C implementations should be fairly equivalent, right?

Don’t run in Debug. I will only be quoting Release configuration numbers. While I have run sections of this in Debug, I’ve never waited around for a full 20 iteration training run in Debug. I usually keep the Scheme in Xcode set to “Release” – even during debugging.

If you read the backstory, I’ve already mentioned: this was “really super slow”.
| Model       | Tokens/s | Training iterations/s | Training versus llm.c |
|-------------|----------|-----------------------|-----------------------|
| llm.c       | 0.926    | 0.175                 | 100%                  |
| Basic Swift | 0.054    | 0.014                 | 7.3%                  |

The Swift code is between 15 and 20 times slower. That’s an LLM producing 1 token every 19 seconds. Running 20 training iterations on this engine takes nearly 30 minutes. What on Earth is going on?

This performance represents about 2.8 Gflop/s. In 1999 Apple ran ads for the Power Mac G4 claiming its 1 Gflop/s capability made it a weapon in the eyes of the US military. Now 2.8 Gflop/s is completely unacceptable.

Span, Egg and Span

Inspecting in Instruments, by far the biggest time impact in the previous run is _ArrayBuffer.beginCOWMutation(). Swift thinks someone else could be using our Arrays and even though they are unique (so we’re not getting array copies), the uniqueness checks alone are our biggest overhead.

Huh? Sometimes you run into issues that might just be a bug. This might be one of them. When I first worked on this code in 2024, my memory is that this wasn’t a problem. I don’t know if there’s been a regression or if a safety hole was covered over, leading to _ArrayBuffer.beginCOWMutation() clogging up performance. This problem also moves around when I disable function inlining with @inline(never), so it feels like the optimizer is just not able to do its job properly.

In any case, we can’t use Array and get the performance we need. Fortunately, Swift 6.2 gives us a reliable fix with basically zero overhead: MutableSpan.

    static func matmul_forward(out: inout [Float], inp: [Float], weight: [Float], bias: [Float]?,
                               B: Int, T: Int, C: Int, OC: Int) {
        var out = out.mutableSpan
        for b in 0..<B {
            for t in 0..<T {
                let bt = b * T + t
                for o in 0..<OC {
                    var value = bias?[o] ?? 0
                    for i in 0..<C {
                        value += inp[bt * C + i] * weight[o * C + i]
                    }
                    out[bt * OC + o] = value
                }
            }
        }
    }

All I’ve added is the var out = out.mutableSpan line at the top, shadowing out with its own mutableSpan. I’ve also applied the same pattern throughout the file.

| Model                     | Tokens/s | Training iterations/s | Training versus llm.c |
|---------------------------|----------|-----------------------|-----------------------|
| llm.c                     | 0.926    | 0.175                 | 100%                  |
| Basic Swift               | 0.054    | 0.014                 | 7.3%                  |
| Basic Swift + mutableSpan | 0.056    | 0.042                 | 24.0%                 |

Curiously, while this change didn’t have much impact on the forward pass, it did make the training iterations (forward + backward + update) about 3 times faster.

Chill vibes

But we need to turn our attention to why the forward pass is slow. Instruments confirms what we already know: the hottest line on the forward pass is the center of our loop, value += inp[bt * C + i] * weight[o * C + i].

It’s time to face a hard truth: C has some compiler optimization flags that Swift doesn’t have. In this specific case, C has -ffast-math (which some compilers enable as part of their aggressive optimization levels, such as -Ofast in GCC and Clang). What does this mean? C is letting the compiler use fused multiply-add (FMA) instructions, which perform a floating-point multiply and addition in a single instruction, and generally doesn’t worry too much about precise correctness.

+0xa34 fmadd s0, s17, s16, s0 +0xa38 ldr s17, [x20, #0x4] +0xa3c fmadd s7, s17, s16, s7 +0xa40 ldr s17, [x21, #0x4] +0xa44 fmadd s4,