Blog: ProtGPT2: Designing Novel Proteins with Deep Learning


by Thalita Cirino do Nascimento, track « Chemoinformatics and Physical Chemistry », Milan-Strasbourg, 2022

My interest in chemistry began in high school, specifically with understanding the origin and evolution of life. While reading about related research, I came across an interview with Noelia Ferruz from the Institute of Molecular Biology of Barcelona (IBMB). What intrigued me about Noelia's work was her innovative approach, inspired by how Nature has evolutionarily 'designed' a variety of proteins with different functions and topologies.

Indeed, peptides and protein structures have evolved through mutations and recombination events accumulated over more than 3 billion years. This has allowed nature to explore a huge protein sequence space in search of new biological functions.

Noelia combined this understanding of molecular evolution with natural language processing (NLP) systems, such as GPT-3, which can generate human-like text after 'reading' millions of web pages and books. Since a protein is represented by a sequence of letters, a similar model could be trained on a massive database of over 50 million natural protein sequences: this is ProtGPT2 [1]. It implicitly learns patterns and rules about how amino acids are strung together and, unlike previously designed de novo structures, ProtGPT2's proteins resemble the complexity of natural proteins, with folding patterns and longer loops necessary for interacting with other molecules and for functionalization (Figure 1). However, database searches revealed that the artificially generated and natural proteins are only distantly related, more like a third-degree cousin than a sibling. This suggests that ProtGPT2 is not simply copying existing proteins but combines amino acid building blocks in new ways.
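The idea of "writing" a protein one letter at a time can be sketched with a toy example. The snippet below is not ProtGPT2 itself (the trained model is released by the authors on the Hugging Face Hub as `nferruz/ProtGPT2` and can be loaded with the `transformers` library); here, the learned probability distribution is replaced by uniform random choice, purely to illustrate the autoregressive loop. All function names are invented for this sketch.

```python
import random

# The 20 standard amino acids, written as one-letter codes: this is the
# "alphabet" a protein language model reads and writes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sample_next(prefix):
    """Toy stand-in for a language model. A real model such as ProtGPT2
    returns a probability distribution over the next residue conditioned
    on the prefix generated so far; here we simply draw uniformly."""
    return random.choice(AMINO_ACIDS)

def generate_sequence(length=60, seed=0):
    """Generate a sequence autoregressively: each new letter is sampled
    given everything generated before it, then appended to the prefix."""
    random.seed(seed)
    sequence = ""
    for _ in range(length):
        sequence += sample_next(sequence)
    return sequence

print(generate_sequence())
```

A uniform sampler like this produces random strings with no structure; the whole point of training on ~50 million natural sequences is that the model's conditional distribution becomes heavily biased toward letter combinations that fold and function.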

Therefore, ProtGPT2 shows potential as a generative model capable of rapidly exploring new areas of the protein sequence space. Through numerous computational predictions, Noelia Ferruz's team provides encouraging evidence that a large proportion of these sequences may fold into stable and functional structures resembling those found in Nature. Experimental confirmation, while beyond the scope of the 2022 study, is still needed to draw conclusions about the folding and activities of ProtGPT2-generated proteins.

Figure 1. An overview of the protein sequence space. Each node represents a sequence; two nodes are linked when they are sufficiently homologous. Colors depict the different structural domains, and examples of AlphaFold-predicted structures of ProtGPT2-generated sequences are given with their respective numbers: all-β (751), α/β (4266 and 1068), membrane protein (4307), α+β (486), and all-α (785). ProtGPT2-generated sequences are represented by white nodes. The PDB code of the most homologous natural structure is given, with the corresponding identity percentage, along with the AlphaFold confidence score (pLDDT).

1. Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun 13, 4348 (2022). https://doi.org/10.1038/s41467-022-32007-7