Protein design with RL from AI feedback
Overview
AlphaFold2 allows us to predict protein structure from sequence. Inverse models have also been developed, which predict the sequences of naturally occurring proteins from their structure. However, both kinds of models fall short in many protein engineering contexts, where we want to generate an engineered protein with some particular structural feature.
A crucial problem here is that we don’t know exactly which protein structures are possible (i.e. which structures have some sequence that folds into them), and our structure-to-sequence models cannot yet find structures close to a target structure for which sequences do exist. As Po-Ssu Huang says, “creating novel backbones for which there exist a foldable sequence remains one of the greatest challenges in protein engineering”. Additionally, aside from the target design feature, the rest of the protein structure is often irrelevant, so we would like a way to perform a fuzzy structural search for proteins. One approach has been to search over a latent design space (e.g. from a VAE) for proteins with the desired structural feature.
Here, I propose using an RL protein design agent together with AlphaFold2 to assemble proteins conditioned on loose structural constraints. The RL agent generates an amino acid sequence incrementally and, after each generation step, receives feedback on how well the AlphaFold2-predicted structure of the current candidate sequence meets the design criteria.
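To make the loop concrete, here is a minimal sketch of one design episode. The `fold_and_score` wrapper around AlphaFold2 and the random-edit stand-in for the learned policy are placeholders, not the actual code:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fold_and_score(sequence: str) -> float:
    """Placeholder for an AlphaFold2 call (e.g. via ColabFold): fold the
    sequence and return a scalar score for how well the predicted structure
    meets the design criteria (e.g. mean pLDDT plus a structural-feature term)."""
    return 0.0  # stand-in value; replace with a real fold-and-score pipeline

def propose_edit(sequence: str) -> str:
    """Stand-in for the RL policy: substitute one residue at random.
    The actual agent chooses the position and amino acid."""
    pos = random.randrange(len(sequence))
    return sequence[:pos] + random.choice(AMINO_ACIDS) + sequence[pos + 1:]

def run_episode(start_sequence: str, num_steps: int = 4):
    """One design episode: edit the sequence step by step, folding after each
    step and using the change in score as the per-step reward."""
    sequence, score = start_sequence, fold_and_score(start_sequence)
    trajectory = []
    for _ in range(num_steps):
        candidate = propose_edit(sequence)
        new_score = fold_and_score(candidate)
        trajectory.append((sequence, candidate, new_score - score))  # reward = improvement
        sequence, score = candidate, new_score
    return trajectory
```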
Experiment 1
I first trained the agent to apply a minimal number of edits to a starting sequence in order to maximize the AlphaFold pLDDT score, which quantifies how well-defined the protein’s structure is. I only looked at short sequences of 20 residues. I used a 2-layer transformer and jointly embedded the starting and current sequences on a per-residue basis. After training for 16h on 8 RTX 4090s, the model increases the aggregate pLDDT score by 25% within an edit distance of 4.
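As a rough illustration (not the exact network used), a policy along these lines might look like the following in PyTorch; the hyperparameters, the additive combination of the two embeddings, and the per-position substitution head are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EditPolicy(nn.Module):
    """Sketch of an edit policy: each position gets a joint embedding of the
    starting and current residue, a 2-layer transformer encoder mixes
    positions, and a head scores substituting each amino acid at each position."""

    def __init__(self, num_aa: int = 20, d_model: int = 128, seq_len: int = 20):
        super().__init__()
        self.start_embed = nn.Embedding(num_aa, d_model)
        self.current_embed = nn.Embedding(num_aa, d_model)
        self.pos_embed = nn.Embedding(seq_len, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.edit_head = nn.Linear(d_model, num_aa)  # per-position logits over substitutions

    def forward(self, start_seq: torch.Tensor, current_seq: torch.Tensor) -> torch.Tensor:
        # start_seq, current_seq: (batch, seq_len) integer-encoded residues
        positions = torch.arange(start_seq.size(1), device=start_seq.device)
        x = (
            self.start_embed(start_seq)
            + self.current_embed(current_seq)
            + self.pos_embed(positions)
        )
        h = self.encoder(x)
        return self.edit_head(h)  # (batch, seq_len, num_aa)
```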
Gallery
As seen in the gallery, the model’s strategy heavily favors adding Leucine (L) in particular locations.
Observations and musings on online learning
Possibly the most interesting part of this experiment is how much online learning was happening during training. I see a lot of talk these days about how deep neural networks can’t do online learning, for example the paper Loss of plasticity in deep continual learning from Rich Sutton’s group, which finds that simply adding L2 weight decay vastly improves online learning (and while this result is super cool, I find it ironic given the title and the general message I gathered the paper was trying to convey: that deep learning is doomed). Indeed, I used L2 weight decay, and training the agent with SGD on experiences from an episode led to much more consistent pLDDT scores later in the episode than when there was no SGD and the agent was purely in inference mode. My somewhat hot take is that the most vanilla form of deep learning can’t do online learning, but that simple modifications (e.g. L2 weight decay, injecting noise) exist that can essentially fill this gap.
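To illustrate the kind of within-episode updates I mean, here is a minimal sketch of taking an SGD step with L2 weight decay on each new experience; the stand-in linear policy and the REINFORCE-style loss are placeholders, not the actual training code:

```python
import torch
import torch.nn as nn

policy = nn.Linear(16, 20)  # stand-in for the real policy network
# weight_decay is the L2 regularization discussed above
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-4, weight_decay=1e-4)

def update_on_experience(features: torch.Tensor, action: int, advantage: float):
    """One online update on a single experience from the current episode."""
    logits = policy(features)
    log_prob = torch.log_softmax(logits, dim=-1)[action]
    loss = -advantage * log_prob  # REINFORCE-style example loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```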
More reflection on this approach
One can think of this setup in two ways: (1) the sequence generator is a model-free agent acting in a simulated environment, where AlphaFold2 is the environment; or (2) it is a model-based RL agent, where the sequence generator produces the policy and AlphaFold2 serves as the model, and the objective is to produce an output sequence that meets the design criteria when it is actually synthesized in a lab. Under view (2), one can see a sort of capability bootstrapping: we can increase intelligence related to proteins without having to collect any more measurements from the real world. This is along the same line of thinking as Yoshua Bengio’s idea of using a world model to develop a very large amortized inference machine.