Machine-Learning System Can Rapidly Predict the Way Two Proteins Will Bind
Antibodies are small proteins produced by the immune system that can attach to specific parts of a virus to neutralize it.
As experts continue to fight SARS-CoV-2, the virus that causes COVID-19, one possible line of defense is a synthetic antibody that binds to the virus's spike proteins and stops it from entering a human cell.
To build an effective synthetic antibody, scientists have to understand precisely how that binding will take place. Proteins, with lumpy 3D structures comprising many folds, can stick together in millions of combinations, so finding the right protein complex among a nearly countless number of candidates is extremely time-consuming.
To simplify the process, MIT scientists developed a machine-learning model that can directly predict the complex that will form when two proteins bind together. Their method is between 80 and 500 times faster than state-of-the-art software approaches, and it often predicts protein structures that are closer to the real structures observed experimentally.
This method could help researchers better comprehend certain biological processes that involve protein interactions, such as DNA replication and repair; it could also accelerate the process of producing new medicines.
Deep learning is very good at capturing interactions between different proteins that are otherwise difficult for chemists or biologists to write experimentally. Some of these interactions are very complicated, and people haven’t found good ways to express them. This deep-learning model can learn these types of interactions from data.
Octavian-Eugen Ganea, Study Co-Lead and Postdoc, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
The study’s co-lead author is Xinyuan Huang, a graduate student at ETH Zurich. MIT co-authors include Regina Barzilay, the School of Engineering Distinguished Professor for AI and Health in CSAIL, and Tommi Jaakkola, the Thomas Siebel Professor of Electrical Engineering in CSAIL and a member of the Institute for Data, Systems, and Society. The study will be presented at the International Conference on Learning Representations.
Protein Attachment
The model the scientists created, referred to as Equidock, focuses on rigid body docking, which happens when two proteins bind by translating or rotating in 3D space, but their shapes do not bend or squeeze.
The model takes the 3D structures of two proteins and converts those structures into 3D graphs that can be processed by the neural network. Proteins are formed from chains of amino acids, and each of those amino acids is represented by a node in the graph.
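The graph construction described above can be sketched in a few lines. This is an illustrative sketch only: the article says each amino acid becomes a node but does not say how edges are chosen, so the distance cutoff, the function name `build_residue_graph`, and the toy coordinates are all assumptions made for the example.

```python
import math

def build_residue_graph(coords, cutoff=8.0):
    """Turn one protein into a graph with one node per amino acid.

    coords: list of (x, y, z) tuples, one representative point per residue.
    cutoff: hypothetical distance threshold (angstroms) for adding an edge;
            the article does not specify how edges are defined.
    """
    nodes = list(range(len(coords)))
    edges = []
    for i in nodes:
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            if d <= cutoff:
                # Store the distance so it can serve as an edge feature.
                edges.append((i, j, d))
    return nodes, edges

# Toy chain of four residues spaced 3.8 angstroms apart along the x axis.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
nodes, edges = build_residue_graph(toy_coords)
```

With this cutoff, every pair of residues is linked except the two ends of the chain, which are 11.4 angstroms apart.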
The team incorporated geometric knowledge into the model, so it understands how objects change when they are translated or rotated in 3D space. Furthermore, the model has built-in mathematical knowledge that guarantees the proteins always bind in the same manner, regardless of where they sit in 3D space. This is the same way proteins dock in the human body.
With this information, the machine-learning system identifies the atoms of the two proteins that are most likely to interact and form chemical reactions, called binding-pocket points. Then, it uses these points to position the two proteins together into a complex.
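A small numerical check can make the geometric idea concrete. The sketch below is not Equidock's architecture; it only shows, under assumed toy coordinates and hypothetical helper names, that quantities built from relative geometry (here, pairwise distances) stay the same when a set of points is rotated and translated as a rigid body.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def rigid_move(points, angle, shift):
    """Apply the same rotation and translation to every point."""
    moved = []
    for p in points:
        x, y, z = rotate_z(p, angle)
        moved.append((x + shift[0], y + shift[1], z + shift[2]))
    return moved

def pairwise_distances(points):
    """Distances between all unordered pairs of points."""
    return [math.dist(points[i], points[j])
            for i in range(len(points))
            for j in range(i + 1, len(points))]

# Toy coordinates standing in for three residues of a protein.
points = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.5), (-3.0, 1.0, 4.0)]
moved = rigid_move(points, angle=0.7, shift=(10.0, -5.0, 2.0))

before = pairwise_distances(points)
after = pairwise_distances(moved)
unchanged = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
                for a, b in zip(before, after))
```

A model whose internal features depend only on such invariant quantities cannot be confused by where a protein happens to be placed in space.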
If we can understand from the proteins which individual parts are likely to be these binding pocket points, then that will capture all the information we need to place the two proteins together. Assuming we can find these two sets of points, then we can just find out how to rotate and translate the proteins so one set matches the other set.
Octavian-Eugen Ganea, Study Co-Lead and Postdoc, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
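Once two matching sets of points are known, finding the rotation and translation that map one onto the other is a classical least-squares problem. One standard technique for it is the Kabsch algorithm, sketched here with NumPy; the article does not say which procedure Equidock uses, so treat this as an illustration of the idea rather than the authors' implementation. The point sets and the function name `kabsch` are assumptions for the example.

```python
import numpy as np

def kabsch(P, Q):
    """Return rotation R and translation t so that P @ R.T + t best matches Q.

    P, Q: (N, 3) arrays of corresponding points (row i of P pairs with row i of Q).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (P - p_mean).T @ (Q - q_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t

# Toy "pocket points" on one protein, and the same points after a known rigid motion.
P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
angle = 0.5
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([4.0, -1.0, 2.5])
Q = P @ R_true.T + t_true

R, t = kabsch(P, Q)
aligned = P @ R.T + t
```

In this toy case the recovered `R` and `t` reproduce the known motion, so `aligned` coincides with `Q` up to floating-point error.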
One of the greatest challenges of designing this model was overcoming the lack of training data. As there is very little experimental 3D data for proteins, it was particularly crucial to build geometric knowledge into Equidock, Ganea says. Without those geometric constraints, the model might pick up false correlations in the dataset.
Seconds vs. Hours
Once the model was trained, the scientists compared it to four software approaches. Equidock could predict the final protein complex within one to five seconds. All the baselines took much longer, between 10 minutes and an hour or more.
In quality measures, which compute how closely the predicted protein complex matches the actual protein complex, Equidock was often comparable with the baselines, but it sometimes underperformed them.
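A common quality measure of this kind in structural biology is the root-mean-square deviation (RMSD) between predicted and experimental coordinates. The article does not name the exact metrics used in the study, so the following is only a minimal sketch of one plausible measure, with toy coordinates and a hypothetical function name.

```python
import math

def rmsd(predicted, actual):
    """Root-mean-square deviation between two equal-length lists of 3D points."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("point lists must be non-empty and of equal length")
    total = sum(math.dist(p, a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(total / len(predicted))

# Toy example: the prediction is the true structure shifted by 1 unit along z.
actual = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
score = rmsd(predicted, actual)
```

Lower values mean the predicted complex lies closer to the experimental one; a perfect prediction gives an RMSD of zero.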
We are still lagging behind one of the baselines. Our method can still be improved, and it can still be useful. It could be used in a very large virtual screening where we want to understand how thousands of proteins can interact and form complexes. Our method could be used to generate an initial set of candidates very fast, and then these could be fine-tuned with some of the more accurate, but slower, traditional methods.
Octavian-Eugen Ganea, Study Co-Lead and Postdoc, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
In addition to using this technique alongside traditional models, the researchers want to incorporate specific atomic interactions into Equidock so it can deliver more accurate predictions. For instance, atoms in proteins will sometimes attach through hydrophobic interactions, which involve water molecules.
Their method could also be applied to the development of small, drug-like molecules, Ganea explains. These molecules bind to protein surfaces in particular ways, so quickly determining how that attachment takes place could shorten the timeline of drug development.
Going forward, they plan to improve Equidock so it can make predictions for flexible protein docking. The biggest obstacle is insufficient data for training, so Ganea and his colleagues are working to generate synthetic data they could use to improve the model.
This study was funded, partly, by the DTRA Discovery of Medical Countermeasures Against New and Emerging (DOMANE) threats program, the Machine Learning for Pharmaceutical Discovery and Synthesis Consortium, the Abdul Latif Jameel Clinic for Machine Learning in Health, the Swiss National Science Foundation, and the DARPA Accelerated Molecular Discovery program.
Journal Reference: Ganea, O.-E., et al. (2021) Independent SE(3)-Equivariant Models for End-to-End Rigid Protein Docking. Advanced Materials.
Source: https://mit.edu