Types and Neural Networks

B brunogavranovic.com ↗

▲ 83 points • 27 comments • by bgavran • 5w ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is fully human-written

2 %

AI likelihood · overall

Human

100% human-written 0% AI-generated

SEGMENTS · HUMAN 5 of 5

SEGMENTS · AI 0 of 5

WORD COUNT 1,795

PEAK AI % 7% · §2

Analyzed

Apr 21

backend: pangram/v3.3

Segments scanned

5 windows

avg 359 words each

Distribution

100 / 0%

human / AI fraction

Verdict

Human

Pangram v3.3

Article text · 1,795 words · 5 segments analyzed

Human AI-generated

§1 Human · 5%

Posted on April 20, 2026 [This is cross posted to the GLAIVE blog ]Neural networks are used to generate increasingly more code in languages which enable highly generic and provably correct programming: Idris, Lean, and Agda, for example.However, most frontier models generating the code – Large Language Models – separate the process of training from the process of typechecking. They are trained to produce output of a fixed type: List Token. To get valid code, that output is then during post-training parsed, in multiple ad-hoc ways, into the types of a particular language.What are these ad-hoc ways? Do they work, and should we expect them to? And more importantly, can we rebuild LLMs from the ground up so that they’re trained to produce typed output?Enforcing types after trainingToday, LLMs are trained to predict the next token in a given training corpus, resulting in functions of typeLLM : List Token -> D (List Token)where D (List Token) here denotes a distribution over a list of tokens. LLMs consume an input prompt and produce this distribution one token at a time – sampling from the next-token distribution and feeding the result back in. For the purposes of this blog post, we’ll consider Token to simply be Maybe Char, i.e. either a character the model is able to generate, or a STOP token which signals the termination of the generative process.1The approaches in this section all take this trained network as fixed. They modify only the inference procedure, and fall along two axes: Granularity: how often is the typechecker consulted? When the model emits a STOP token, or after every token? Bandwidth: what does the typechecker communicate back? A structured error message, or merely a binary accept/reject signal? These approaches sit at opposite extremes along these axes, and neither fundamentally solves typed generation, but we describe them here because they help us understand where the issue is.1. Try; Compile; If Error RepeatThis is what most programmers do. They type raw text into the editor; the compiler either processes it into structured data, or returns an error the programmer has to internalise before resubmitting.

§2 Human · 7%

The same loop can be performed with LLMs.2 Given the task of producing a value of type List Int, the model generates a candidate – say, “[ 1 , 2 , 3 ] STOP” – and the controller hands it to a typechecker. If it passes, we’re done. If not, we feed the error back and ask the model to try again. Along our two axes, this approach has low granularity (the typechecker is consulted only after the STOP token is produced) and high bandwidth (errors are structured messages the model can reason about).But there is a problem. Suppose the task is instead to produce a value of Either Char Double, and the model begins with “[ 1 , 2” – already unrecoverable, since no value of this type starts with [. Nothing notices until the full generation completes! At that point the entire sequence is rejected and the model starts over. This is incredibly wasteful.It can be remedied by tightening the loop – generating fewer tokens before each check – and in the limit this is what experienced dependently-typed programmers do: make a small move, see the compiler’s response, make the next move. But tightening only addresses granularity. The LLM, unlike the programmer, does not carry what it learns across sessions: nothing in this approach updates the weights, meaning that the next session starts from the same state as the last.2. Constrained decodingA different approach is to consult the typechecker before each token is sampled, and mask out any tokens that would lead to an ill-typed result.For instance, if the model is generating a value of type List Int and has already produced the tokens “[ 1 , 2 , 3”, we set the probability for alphabetic characters to zero and sample from the remains.3 Despite the fact that type inference is in general not decidable, this approach is well-known, and has been applied to JSON schemas and restricted fragments of various type systems.4 Along our axes, this sits at the opposite corner: maximum granularity (every token), and minimum bandwidth (one bit per token). Here, output is guaranteed to typecheck! However, there is a big problem: this only prevents the model from saying certain things, and does not change what the model wants to say.

§3 Human · 1%

To see this, consider the case where the model has learned to assign high probability to alphabetic characters in our example. Masking them out forces it to sample tokens it considers unlikely. At each step, the mask checks only whether a token can lead to a well-typed output, without accounting for how much of the model’s probability mass remains on each branch. This can trap generation into branches where only low-probability completions exist, producing output that typechecks but is increasingly nonsensical.5Like with the previous approach, no gradient flows through the mask, so the model cannot learn to adapt and assign a higher probability to densely populated branches, as the weights are never updated.Learning types during trainingThe above approaches, as strange as it may seem, work. Frontier models using retry loops are making genuine progress: FrontierMath scores climbed from under 2% upon release to almost 50% in less than two years, and so did the scores for ARC-AGI-2 and SWE-bench. In chess, GPT-4 reaches roughly 1371 ELO (intermediate player) as a byproduct of general training.These are significant results. But in domains where we know how to train models that utilise structure during training, we see even more dramatic performance improvements. Chess is one of those. The system AlphaZero utilises the game structure during training and reaches superhuman ELO (>3400) with ~30x fewer parameters than GPT-4 (<60 million vs 1.8 trillion). And AlphaZero never makes an illegal move, compared to GPT-4 which makes one in 30% of games.What about dependently-typed programming languages? These are an incredibly rich domain, and one can wonder what happens if we tightly integrate their compilers into neural network training. Unfortunately, this hasn’t happened yet: we don’t really know how to do that. Most current models generating typed code are playing the game without being told its rules. What is surprising is how good they are.If chess is any indication of what happens to performance when you train using structure, encoding the rules of a language is worth figuring out. The only question is: how do you differentiate through type systems?

§4 Human · 1%

How do you learn to produce a typed output, given that types are discrete, non-differentiable objects?Differentiating through structureTo understand differentiation, let’s start with a simpler question. How do you write a function of the following type in Haskell, modelling a simple choice of output structure: a coproduct?f :: (x, p) -> Either a b f = ?If you try to write it down, no matter how creative you get, it will end up having the shapef (x, p) = if c (x, p) then Left (f_l (x, p)) else Right (f_r (x, p))for some predicate c : (x, p) -> Bool and two maps f_l :: (x, p) -> a, f_r :: (x, p) -> b. There is no other option.6 To produce a Left or a Right, you must commit to one or the other by partitioning the input via c. When it returns True you go left, otherwise you go right.This decomposition is not an accident of Haskell – it is one of the properties of the category \(\mathbf{Set}\), which is an extensive category.7 And this property is central to research on differentiating maps into coproducts, most notably CHAD and Higher Order AD of Higher Order Functions. They use this theorem, and seemingly – without retries or constrained decoding – show us how to build a network whose output is a structured type, at train time.But as we’ve seen above – these approaches require us, the programmer, to partition the input space, instead of having the neural network automatically learn it. This means that, given a network \(f : X \times P \to A + B\), if your model starts out guessing the wrong branch \(A\), it can never fix it through gradient updates, because it was never able to learn it to begin with.8The same story applies to autodiff through other structures, such as pairs, inductive types, and function objects. All of these relinquish the model’s ability to learn the structure. This is an important perspective, but one that’s overlooked in the aforementioned literature. Of course, this is justified since their focus is autodiff, and not learning. But I have not seen this distinction acknowledged, and it deserves to be said explicitly.

§5 Human · 0%

Often, we don’t want just a method for propagating gradients through a known program, instead we want the ability to learn the program.Differentiating with respect to structureThere is a different way we can go about learning a program \(f : X \times P \to A + B\). Instead of fixing a partitioning, what if we learn it? This is coincidentally the answer to our question of how to backpropagate gradients while making discrete choices: we never differentiate through discrete choices to begin with. Instead, we collect evidence for each discrete choice, normalise it, and then sample to decide what next.That is, we use three maps: A differentiable map \(f_c : X \times P \to D(\text{Bool})\) A differentiable map \(f_A : X \times P \to A\) A differentiable map \(f_B : X \times P \to B\) Then, to produce an output of type \(A + B\) given some \((x, p) : X \times P\), we apply \(f_c\) to \((x, p)\), and then sample from the resulting distribution. The resulting boolean tells whether to use \(f_A\) or \(f_B\). Another way to say this: \(f\) is now a map of type \(X \to D(A + B)\), producing a distribution over both the choice of the output types, and their values.The question is now: how do we differentiate this, and train it using supervised learning? Perhaps surprisingly, this is straightforward. We differentiate only through the branch taken, and use cross entropy loss for comparing the produced distribution with the actual one. This means that, if we’re ready to provide these three neural networks, we can obtain an effectful neural network of type \(f : X \times P \to A + B\).This approach produces well-typed output by construction. And unlike the previous approach, it allows the network to learn the right output type. It also directly subsumes the previous approach: setting the sampling temperature to t=0 recovers argmax, i.e. differentiation through structure.Even more, this coproduct can be thought of as an instance of a dependent pair indexed by a finite set. This generalises in a number of different ways, and can be iterated to produce an array of typed structures, all in a completely principled way!