An idiot's guide to lead optimisation for proteins

M magnusross.github.io ↗

▲ 175 points • 16 comments • by magni121 • 2w ago • HN discussion ↗

Or, understanding the Cradle-1 pipeline.

Lead optimisation is the step in drug design where you take a molecule that sort of works and try to make it actually work, and it’s arguably the step where most real-world design campaigns succeed or die. Due to the influence of a couple of my pals, I have recently become interested in using machine learning to do lead optimisation for proteins. Said pals have been kind enough to indulge my extremely beginner-level questioning over the past few weeks. I’m going to use this post to share what they have taught me, in the hope it might in turn help you understand a bit more about this fascinating area. As in any field, there are some established principles that are never spelled out explicitly in the literature, which can make it confusing for newcomers. We’re going to try to understand how we could actually build a real system for lead optimisation by studying one that has been shown to work well in real life.

Firstly, before we get into it, what actually is a protein? I think to answer this in full we would need multiple textbooks/degrees. Since basically all of my knowledge of biology comes from one read-through of Philip Ball’s How Life Works, I am convinced the answer to any question about biology is arbitrarily arcane and complex, so for now it will suffice to say that proteins are a class of molecule which are an integral part of basically all the processes that sustain life.

Proteins are chains of smaller molecules called amino acids, of which there are 20 different types.1 As such we can represent a protein as a string of characters, with each character standing for a different amino acid, using all the letters of the alphabet apart from BJOUXZ.2 For example, we can write myoglobin, a protein responsible for shuttling oxygen around our cells, as:

MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG

For certain combinations of amino acids, this chain folds up into a fixed shape, which allows that protein to perform a function.
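To make the "protein as a string" idea concrete, here is a minimal Python sketch. The names `AMINO_ACIDS`, `TOKEN_INDEX`, `is_valid_protein` and `encode` are my own illustrative choices, not part of any real library, and the integer encoding is just one plausible way a model might consume the string.

```python
import string

# The 20 standard amino acids: every capital letter except B, J, O, U, X and Z.
AMINO_ACIDS = "".join(c for c in string.ascii_uppercase if c not in "BJOUXZ")

# Hypothetical lookup table from letter to integer index (an assumption for
# illustration; real models define their own vocabularies).
TOKEN_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def is_valid_protein(sequence: str) -> bool:
    """Return True if the string is non-empty and uses only the 20 letters."""
    return len(sequence) > 0 and all(residue in TOKEN_INDEX for residue in sequence)


def encode(sequence: str) -> list[int]:
    """Turn a protein string into a list of integer indices, one per residue."""
    if not is_valid_protein(sequence):
        raise ValueError("sequence contains characters outside the 20-letter alphabet")
    return [TOKEN_INDEX[residue] for residue in sequence]


# The first few residues of the myoglobin sequence above.
print(encode("MGLSDG"))
```

Anything containing one of the six excluded letters, such as "MGLSB", is rejected by `is_valid_protein`.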

Predicting the shape of the fold from the sequence is, to put it mildly, highly non-trivial, and is the problem that the AlphaFold-2 model “solved” to some extent, which earned its creators a share of the 2024 Nobel Prize in Chemistry. It is important to understand that there are many possible combinations of amino acids, and most of them will not fold into a regular, predictable shape, and do not do anything useful.3

The folded myoglobin looks something like this:

Typically proteins consist of about 300 amino acids, but the largest, which is comically named PKZILLA, contains upwards of 40,000, and they can be as small as 20. There are estimated to be between 80,000 and 400,000 different proteins that perform functions in human cells. Note that when the amino acids are bonded to form the protein, each one is known as a residue.

The proteins that exist in nature are the result of evolution. Because of this we can group proteins into families which have some similarity to each other; within these families a large proportion of the sequence representing each protein will be the same.

So, now that we very roughly know what a protein is, what is protein design? The ultimate goal of protein design4 is to generate new molecules that perform a certain function. This function can range from catalysing a specific chemical reaction to binding to a disease-causing molecule.

Lead optimisation is one of the most important steps in the design process. We assume that we have an existing template molecule that is functional to some extent, but is not sufficient for our ultimate goal. This molecule might have been the result of a previous failed design campaign, generated de novo (i.e. from scratch) by some other model, or chosen from nature. The process of lead optimisation involves proposing changes to this initial molecule with the ultimate goal of improving its properties for the task at hand. In practice, the function of the protein for this task can rarely be summarised by a single number: we usually care about several properties at once, and these can trade off against each other. For simplicity, we’ll mostly talk as if we’re optimising a single property in this post, but keep in mind that the real picture is multi-objective.

Lead optimisation proceeds by proposing a set of candidate changes, testing them in the lab, and then proposing more changes after integrating the results of the lab tests. Traditionally this was done using directed evolution, which essentially involves introducing random mutations, testing them, keeping the ones that improve function, and mutating those further, in a loop, until we get to sufficient performance. Given that we have massive databases of proteins, and the awesome power of deep learning, these days we can probably do better.5

And indeed some people have! Cradle is a biotech startup that sells a system for ML-based protein lead optimisation. They’re somewhat unusual for a bio-ML company in operating their own wet lab, which they claim gives them a unique ability to keep the loop between model suggestions and experimental feedback tight. Their system seems to be a market leader, and they are working with a number of the world’s largest pharmaceutical companies (e.g. Novo Nordisk, Bayer, J&J). They have demonstrated impressive results across a number of different contexts.

Cradle recently published a white paper describing how their system works at a high level. Here it is:

Confused? I was too! But don't worry, by the end of this post mini-series we'll have a good understanding of what's going on here, and why each of these colourful boxes is necessary. In this part we will cover the structure of the base model, the process of pushing it towards making more useful suggestions using fine-tuning, and how we can use it to estimate the properties of a given protein. In the next edition, we'll cover how we can actually use this model to generate proteins for further testing.

The base model

There are a lot of models shown on the big diagram, a somewhat scary number, but actually most of them are tweaked versions of the same fundamental backbone: a transformer-based protein language model. If we can understand that, we can go a long way to understanding the pipeline. The standard language modelling paradigm (that powers Claude, ChatGPT etc.) is fundamentally based on next-token prediction; a token is a word, or part of a word, and we task the model with predicting the next token given the preceding context. Typically there are tens of thousands of possible tokens, corresponding to the large number of words in our vocabulary.

We can apply this paradigm directly to modelling proteins by swapping our large vocabulary of words for the 20 different types of amino acid, since proteins are just sequences of these. To train our model, we take a dataset of protein sequences, cover up some of the residues, and get the model to predict which amino acid was in each covered position based on the others around it. We can do this either by covering up the end of the sequence, in the same way we would in a standard language model, which is known as next-token prediction, or by covering up a residue in the middle of the sequence, which is known as masked language modelling. The base model in the Cradle pipeline takes the latter approach, so we will focus on that. The problem essentially looks something like this:

    M K T A [?] G L S E R ...
             |
             L ██████ 0.42
             V ███    0.28
             I █      0.15
             ...
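To make the covering-up step concrete, here is a minimal Python sketch of how one masked training example might be built from a sequence. Everything here is illustrative: the function name `make_masked_example`, the `[?]` mask token and the 15% default mask fraction (borrowed from common practice in text models) are my assumptions, not details from the Cradle white paper.

```python
import random

MASK = "[?]"  # placeholder token, chosen to match the picture above


def make_masked_example(sequence: str, rng: random.Random, mask_fraction: float = 0.15):
    """Hide a random subset of residues.

    Returns the token list with some positions replaced by MASK, plus a dict
    mapping each hidden position to the residue the model should predict there.
    """
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = rng.sample(range(len(sequence)), n_masked)
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # remember the true residue (the training label)
        tokens[pos] = MASK          # hide it from the model
    return tokens, targets


rng = random.Random(0)  # fixed seed so the example is repeatable
tokens, targets = make_masked_example("MKTAVGLSER", rng)
print(" ".join(tokens))
print(targets)
```

During training the model would see only `tokens` and be scored on how much probability it puts on each residue in `targets`.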

The base model produces a distribution of probabilities over amino acids likely to fill the missing position. The first step of the Cradle pipeline is to take this model structure, and train it on a dataset of tens of millions of natural proteins. This is what is happening at the top of the diagram in the section labelled pre-training. The idea here is to produce a model that has learned something about the features of natural proteins.

Such a model is actually useful in its own right because it gives us a way of understanding if certain edits to a protein are likely to ruin it, that is to say, cause it to become disordered, or not expressed, or whatever. The model gives us suggestions for residues we could place at a specific position to make a natural protein. Say we mask a G in a known protein and the model predicts that the position is 0.001% likely to be a W but 40% likely to be a V: we can conclude that the protein with the W is unlikely to be functional. The model is saying that it has never seen a protein that looks like that in nature, and we believe that since many millions of years of evolution have produced the set of proteins that exist, those that don’t are unlikely to be useful. We should note that this is not something we know for sure to be true.
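The W-versus-V comparison above can be turned into a single number by taking the log of the ratio of the two probabilities. This is a minimal sketch using the made-up probabilities from the example; the function name `substitution_log_odds` is hypothetical, and I am not claiming this is the exact score the Cradle pipeline uses.

```python
import math


def substitution_log_odds(probs: dict[str, float], candidate: str, reference: str) -> float:
    """log p(candidate) - log p(reference) at one masked position.

    Assumes both probabilities are strictly positive. Negative values mean the
    model finds the candidate less natural than the reference at that position.
    """
    return math.log(probs[candidate]) - math.log(probs[reference])


# Probabilities from the example in the text: W is 0.001% likely, V is 40% likely.
probs = {"V": 0.40, "W": 0.00001}
score = substitution_log_odds(probs, candidate="W", reference="V")
print(f"log-odds of W relative to V: {score:.2f}")
```

A strongly negative score like this one is the model's way of saying the W variant looks nothing like the natural proteins it was trained on.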

It is possible that there may be proteins that are functional that are very different from natural ones; these are going to be very difficult to model with machine learning because, by definition, we have very little data about them.

Now we have an understanding of what the base model does (it predicts how “natural” specific edits to individual amino acids are) we have covered the top three boxes. We can now move on to the next section, which has a lot more going on.

Evotuning

The space of natural proteins is very large and diverse. The process of lead optimisation is about improving the function of a protein for a specific use case. When using a model trained on the space of all proteins, the model is unaware of the specifics of the function we are trying to optimise for; it is too general. It is possible for the base model to make suggestions that are natural, but would be totally unhelpful for our use case. How can we push the model to make more relevant suggestions?

The answer is to use fine-tuning. Fine-tuning is the process of adapting a more general model to perform well on a more specific task by training it on a representative set of data from that task. In lead optimisation, we start with a single template protein which already functions to some extent, and we want to improve this function further. We want to form a fine-tuning task to push the model to suggest proteins that are likely to be functional. We can do this by finding all the natural proteins that are likely to be evolutionarily related to the template, and training the model on them. The idea is that if they are evolutionarily related, they likely share some function. We want the model to “focus” on this area of the full protein space.

We form the set of evolutionarily related proteins by using something called a Multiple Sequence Alignment (MSA). Understanding MSAs fully would be its own whole thing, but the rough idea has two parts: first, we search a huge database of protein sequences for ones that are statistically likely to share a common ancestor with our template (these are called homologs). Then we align them, i.e. line them up residue by residue so that positions playing the same structural or functional role sit on top of each other. This is complex because not all of the homologs will necessarily be the same length, so we need to account for insertions and deletions.
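To get a feel for what "accounting for insertions and deletions" means, here is a toy Python sketch that aligns just two sequences using the classic Needleman-Wunsch dynamic programme. This is a deliberate simplification: it is pairwise rather than multiple alignment, it skips the database search entirely, and the match, mismatch and gap scores are arbitrary values I picked for illustration (real tools use substitution matrices and more careful gap penalties).

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Needleman-Wunsch global alignment; returns the two sequences with '-' gaps."""

    def pair_score(x: str, y: str) -> int:
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),  # pair residues
                score[i - 1][j] + gap,  # gap in b (residue missing from b)
                score[i][j - 1] + gap,  # gap in a (residue missing from a)
            )

    # Walk back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


template, homolog = global_align("MKTAVGLSER", "MKTAGLSER")
print(template)
print(homolog)
```

Here the second sequence is missing the V, so the aligner inserts a gap character at that position and the remaining residues line up column by column, which is the property an MSA gives us across many homologs at once.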