
idlemachines

▲ 134 points 47 comments by smaddrellmander 4w ago HN discussion ↗

Pangram verdict · v3.3

We believe that this document is primarily human-written, with some AI-generated content detected

31 %

AI likelihood · overall

Mixed
78% human-written 22% AI-generated
SEGMENTS · HUMAN 6 of 7
SEGMENTS · AI 1 of 7
WORD COUNT 1,715
PEAK AI % 87% · §1
Analyzed
May 1
backend: pangram/v3.3
Segments scanned
7 windows
avg 245 words each

Article text · 1,715 words · 7 segments analyzed

Human AI-generated
§1 AI · 87%

Multiclass output? Softmax. Normalising probabilities? Softmax. Attention weights? Softmax. Partition function? You guessed it, Softmax. This function comes up everywhere, but how often have you really thought about what's going on inside?

What does softmax actually do to your distribution?

The softmax function is deceptively simple:

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

We take the exponential of each input and normalize by the sum of all exponentials. This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1. Technically this is a pseudo-probability distribution (the values are not derived from a probability space), but it is close enough to a probability distribution that for practical purposes it works just fine.

One useful way to think about softmax is that it maps vectors into a very specific geometric object: the probability simplex. For an n-dimensional output, this is the set of all vectors where each entry is non-negative and everything sums to 1. In 3 dimensions, this looks like a triangle sitting in 3D space; in higher dimensions, it's the same idea generalised. Softmax takes an unconstrained vector in $\mathbb{R}^n$ and smoothly maps it onto this simplex. The constraint that all outputs must sum to 1 is exactly what creates the interactions between dimensions that we'll see later in the Jacobian.

Let's visualize what this actually does in a real language model scenario: predicting the next token after "the cat sat on a".

[Figure: Distribution Shift. Left: raw logit values for candidate tokens. Right: probabilities after softmax. The highest logit ("mat" at 3.2) gets dramatically amplified to 48% probability, while others are suppressed. The transformation turns unbounded scores into a probability distribution that sums to 1.]

The transformation is pretty dramatic. The differences between values get exaggerated: a gap of d between two logits becomes a ratio of $e^d$ between their probabilities, which means the largest logit value dominates the output, while smaller values are squashed. This is exactly what we want for confident predictions, but it also explains why softmax can be problematic when you want uncertainty estimates: it's very opinionated about which class should win.
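To make the amplification concrete, here is a minimal NumPy sketch of the formula. Only the 3.2 logit for "mat" comes from the figure caption; the other tokens and scores are made-up assumptions, so the printed probabilities will not match the figure exactly.

```python
import numpy as np

def naive_softmax(logits):
    """Direct transcription of the formula: exponentiate, then normalise."""
    exps = np.exp(logits)
    return exps / np.sum(exps)

# Hypothetical candidate tokens and logits (illustrative values only).
tokens = ["mat", "sofa", "chair", "roof", "dog"]
logits = np.array([3.2, 2.5, 2.0, 1.0, 0.1])

probs = naive_softmax(logits)
for token, logit, prob in zip(tokens, logits, probs):
    print(f"{token:>6}  logit={logit:4.1f}  prob={prob:.3f}")

# A gap of d between two logits becomes a ratio of e^d between probabilities.
ratio = probs[0] / probs[1]
print(f"prob ratio mat/sofa = {ratio:.3f}, exp(0.7) = {np.exp(0.7):.3f}")
```

The last line is the useful takeaway: softmax only sees differences between logits, and it turns each difference into a multiplicative ratio.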

§2 Human · 26%

We can see this "winner takes most" behavior even more clearly with a batch of vectors:

[Figure: Softmax Focus. Each column is one sample, each row is a class. Left: logits with relatively similar magnitudes (green). Right: probabilities after softmax (purple). Notice how probability mass concentrates sharply on the highest logit value in each sample as we take the softmax over each column.]

This focusing behavior cuts both ways. On one hand, it's what makes softmax so powerful for classification: by making the outputs more decisive, it makes it easier to predict classes and train the model. On the other hand, we're baking in a sense of difference between options that isn't always present in the original logits. That said, a trained model has learned to produce logits where these differences are meaningful.

Numerical Stability: We need to talk about overflow

A naive implementation works fine for small inputs, but as with any function containing an exponential we should hear some alarm bells ringing.
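To see where the alarm bells come from, here is a small sketch of a naive implementation run on small and large inputs. The name `naive_softmax` is ours, not a library function, and NumPy's overflow warnings are silenced only to keep the output readable.

```python
import numpy as np

def naive_softmax(x):
    """Exponentiate and normalise with no protection against overflow."""
    exps = np.exp(x)
    return exps / np.sum(exps)

small = naive_softmax(np.array([1.0, 2.0, 3.0]))

# float64 overflows to inf once the exponent exceeds roughly 709.
with np.errstate(over="ignore", invalid="ignore"):
    large = naive_softmax(np.array([1000.0, 1001.0, 1002.0]))

print(small)  # a valid distribution
print(large)  # every entry is nan, because each one is inf / inf
```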

§3 Human · 2%

In the sigmoid we had a single exponential; here we have N of them.

If we feed in x = [1000, 1001, 1002], then for x_i = 1002 we get $e^{1002}$ = inf in floating point, and the sum is also inf, so we get inf / inf = nan. Not good. This isn't a graceful failure, this is a catastrophic failure.

The interconnected nature of the softmax makes this worse as well. In a sigmoid we have a single input to a single output, so an overflow in one activation only corrupts that one output. With the softmax each output is a function of all the inputs, so if any input overflows the denominator becomes inf and every output is corrupted: the overflowing entries become NaN and the rest collapse to 0. (NB: this is a simplification of the sigmoid. Once we backpropagate a NaN into the network it can cause NaNs to spread.)

The fix - shifting the inputs

With an exponential in general we are worried about supplying large positive inputs. Very negative inputs are less of a problem, as they just tend towards 0 and most frameworks will underflow to 0 without causing a catastrophic failure.

The other thing to note is that the softmax is invariant to shifts in the input, because it ultimately just gives us a normalised distribution and only the differences between inputs matter.

So we can use the identity that, for an exponential function and some constant c, $e^{x_i + c} = e^{x_i} \cdot e^c$. Applying this to the softmax function we find:

$$\mathrm{softmax}(x_i) = \frac{e^{x_i + c}}{\sum_{j} e^{x_j + c}} = \frac{e^{x_i} \cdot e^c}{\sum_{j} e^{x_j} \cdot e^c} = \frac{e^{c}}{e^{c}} \cdot \frac{e^{x_i}}{\sum_{j} e^{x_j}} = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

§4 Human · 5%

So we can shift every x by any constant c without changing the output of the function. Then naturally we can ask: what value of c is best to avoid overflow? The answer is $c = -\max(x)$, so we shift the inputs down by the maximum value, ensuring that the largest input to the exponential is 0. The largest exponential is then $e^0 = 1$, so nothing can overflow and the denominator is at least 1.

This is super easy to write in numpy, all we need is

```python
import numpy as np

def stable_softmax(x, axis=-1):  # axis=-1 means the last dimension
    # Shift so the largest entry along `axis` is 0; the output is unchanged
    shift_x = x - np.max(x, axis=axis, keepdims=True)
    exp_shift_x = np.exp(shift_x)
    return exp_shift_x / np.sum(exp_shift_x, axis=axis, keepdims=True)
```

The trick means that no input to the exponential is ever larger than 0, and because it's all about the ratio of exponentials we don't change the output. The keepdims=True ensures that the shapes broadcast correctly when working with batches. You might point out there's always a concern that if our maximum is an outlier, we might be shifting the rest of the values down and losing precision. This is valid. But if this is the case the normal softmax enhances this difference in the same way, and solving it requires fixing the outlier rather than the softmax function.

The Jacobian

This is where it gets a bit more interesting. The softmax is a vector function, so we have to think about the Jacobian rather than the simple derivative. For anyone unfamiliar with the term, the Jacobian refers to the matrix of all first order partial derivatives of a vector function. So for a function $f: \mathbb{R}^n \to \mathbb{R}^m$ the Jacobian is an m x n matrix where the entry in the ith row and jth column is the partial derivative of the ith output with respect to the jth input.
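To make the m x n shape convention concrete, the sketch below estimates a Jacobian numerically with central differences. The helper `numerical_jacobian` and the toy function `f` are illustrative names of ours, not taken from the article or from any library.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Estimate J[i, j] = d f_i / d x_j using central differences."""
    x = np.asarray(x, dtype=float)
    m = np.asarray(f(x)).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        # Column j holds the response of every output to input j.
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

# Toy function from R^2 to R^3, so the Jacobian should be 3 x 2.
def f(x):
    return np.array([x[0] * x[1], x[0] + x[1], x[1] ** 2])

x0 = np.array([2.0, 3.0])
J = numerical_jacobian(f, x0)

# Analytic Jacobian at x0, written out by hand for comparison.
J_exact = np.array([[x0[1], x0[0]],
                    [1.0, 1.0],
                    [0.0, 2 * x0[1]]])
print(J.shape)
print(np.allclose(J, J_exact, atol=1e-6))
```

A finite-difference check like this is slow, but it is a handy way to validate a hand-derived Jacobian before relying on it.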

§5 Human · 6%

(An example of a function like this would be a linear layer mapping from an n dimensional input to an m dimensional output; interestingly, the Jacobian of this function is just the weight matrix.)

The most important property of softmax, and the one people tend to miss, is that it couples all dimensions together. Increasing one input doesn't just increase its own output; it necessarily decreases the others, because the outputs must still sum to 1. This is very different from elementwise functions like ReLU or sigmoid. This coupling is exactly what shows up in the Jacobian: the diagonal terms represent how each output responds to its own input, and the off-diagonal terms capture the competition between different entries.

Computing the Jacobian

Back to the softmax: we have a function that maps from $\mathbb{R}^n$ to $\mathbb{R}^n$, so the Jacobian is an n x n matrix. The entries of this matrix are given by:

$$J_{ij} = \frac{\partial \mathrm{softmax}(x_i)}{\partial x_j}$$

i.e. the derivative of the ith output with respect to the jth input.

There are two natural cases to consider here: the case where i = j and the case where i != j. For the case where i = j (on the diagonal) we have:

$$J_{ii} = \frac{\partial \mathrm{s}(x_i)}{\partial x_i} = \mathrm{s}(x_i) \cdot (1 - \mathrm{s}(x_i))$$

where we are using s(x_i), or $s_i$ for short, as a shorthand for softmax(x_i).

To see why: applying the quotient rule to $\frac{e^{x_i}}{\sum_k e^{x_k}}$ gives

$$\frac{\partial s_i}{\partial x_i} = \frac{e^{x_i} \sum_k e^{x_k} - e^{x_i} \cdot e^{x_i}}{(\sum_k e^{x_k})^2} = \frac{e^{x_i}}{\sum_k e^{x_k}} \left(1 - \frac{e^{x_i}}{\sum_k e^{x_k}}\right) = s_i(1-s_i)$$

§6 Human · 3%

This is a nice expression, and we can see that the derivative is always positive, and tends towards 0 as the output of the softmax tends towards either 0 or 1. This should also be familiar as it's the same expression as the derivative of the sigmoid function, which makes sense as the softmax is a generalisation of the sigmoid to multiple dimensions.

For the case where i != j (off the diagonal) we have:

$$J_{ij} = \frac{\partial \mathrm{s}(x_i)}{\partial x_j} = -\mathrm{s}(x_i) \cdot \mathrm{s}(x_j)$$

Here, when we differentiate $s_i = \frac{e^{x_i}}{\sum_k e^{x_k}}$ with respect to $x_j$ (where $j \neq i$), only the denominator depends on $x_j$:

$$\frac{\partial s_i}{\partial x_j} = \frac{0 \cdot \sum_k e^{x_k} - e^{x_i} \cdot e^{x_j}}{(\sum_k e^{x_k})^2} = -\frac{e^{x_i}}{\sum_k e^{x_k}} \cdot \frac{e^{x_j}}{\sum_k e^{x_k}} = -s_i \cdot s_j$$

This derivative is always negative, and tends towards 0 whenever either $s_i$ or $s_j$ tends towards 0, which includes the case where one output tends towards 1 and squeezes out the rest.

Combining the two cases

§7 Human · 4%

we write the whole Jacobian as:

$$J = \mathrm{diag}(s) - s \cdot s^T$$

where diag(s) is a diagonal matrix with the outputs of the softmax on the diagonal, and $s \cdot s^T$ is the outer product of the softmax output with itself. Entry by entry this reads $J_{ij} = s_i(\delta_{ij} - s_j)$, which reproduces both the diagonal and off-diagonal cases.

The structure: diagonal plus rank-1

There's an important structural detail hidden in this expression. The second term is an outer product, which means it has rank 1. So the full Jacobian is a diagonal matrix with a rank-1 correction. This is exactly why we can compute the backward pass efficiently. Instead of working with an n x n matrix, we only need a dot product and a few elementwise operations. The structure of the Jacobian is doing all the work for us.

Why size matters

In this form we should note something that should worry any engineers out there. What shape is the Jacobian? It's n x n. In a typical transformer model, n is the sequence length in the attention mechanism, or the vocab size in the final output layer. In either case this risks becoming a huge matrix. We'll see in the next section how to avoid fully materialising this.

The backwards pass

In the backwards pass of a neural network, we have an upstream gradient $\frac{dL}{ds}$ which we want to propagate back through the softmax to get the gradient with respect to the inputs x. By the chain rule we have:

$$\frac{dL}{dx} = J^T \cdot \frac{dL}{ds}$$

where $J^T$ is the transpose of the Jacobian. This is the canonical way to write the backwards pass, but as we noted above the Jacobian is an n x n matrix which is potentially huge.
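As a sanity check on the expressions above, the sketch below builds the full Jacobian for a small input, compares it against a finite-difference estimate, and applies the chain rule. It also evaluates the same product using only a dot product and elementwise operations, which follows from expanding $(\mathrm{diag}(s) - s s^T) g = s \odot g - s\,(s \cdot g)$. Treat that vector form as our own derivation from the rank-1 structure, not necessarily the implementation the article goes on to present. All function names and the upstream gradient values are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Shifted softmax for a 1-D input."""
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def softmax_jacobian(s):
    """Full n x n Jacobian, J = diag(s) - s s^T, from the softmax output s."""
    return np.diag(s) - np.outer(s, s)

def backward_full(s, grad_s):
    """Chain rule with the materialised Jacobian: dL/dx = J^T dL/ds."""
    return softmax_jacobian(s).T @ grad_s

def backward_vector(s, grad_s):
    """Same product without the n x n matrix: s * (g - s.g)."""
    return s * (grad_s - np.dot(s, grad_s))

x = np.array([0.5, -1.0, 2.0, 0.0])
s = softmax(x)
J = softmax_jacobian(s)

# Finite-difference check of J[i, j] = d s_i / d x_j.
eps = 1e-6
J_fd = np.zeros_like(J)
for j in range(x.size):
    step = np.zeros_like(x)
    step[j] = eps
    J_fd[:, j] = (softmax(x + step) - softmax(x - step)) / (2 * eps)

grad_s = np.array([0.1, -0.3, 0.2, 0.4])  # made-up upstream gradient
dx_full = backward_full(s, grad_s)
dx_vec = backward_vector(s, grad_s)

print(np.allclose(J, J_fd, atol=1e-8))
print(np.allclose(dx_full, dx_vec))
```

For a vocabulary-sized softmax the difference matters: the full Jacobian has n^2 entries, while the vector form only ever touches n.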