Blog: ProtGPT2: Designing Novel Proteins with Deep Learning


By Thalita Cirino do Nascimento, track "Chemoinformatics and Physical Chemistry", Milan-Strasbourg, 2022

My interest in chemistry began in high school and centered on understanding the origin and evolution of life. While reading about related research, I came across an interview with Noelia Ferruz from the Institute of Molecular Biology of Barcelona (IBMB). What intrigued me about Noelia’s work was her innovative approach, inspired by how Nature has evolutionarily ‘designed’ a variety of proteins with different functions and topologies.

Indeed, peptide and protein structures have evolved through mutations and recombination accumulated over more than 3 billion years. This has allowed nature to explore a huge protein sequence space and find new biological functions. Ferruz combined this understanding of molecular evolution with natural language processing (NLP) systems such as GPT-3, which can generate human-like text after ‘reading’ millions of web pages and books. Since a protein can be represented as a sequence of letters, one per amino acid, a similar model can be trained on a massive database of natural protein sequences. The result is ProtGPT2 [1], trained on about 50 million of them. The model implicitly learns patterns and rules about how amino acids are strung together. Unlike previously designed de novo structures, ProtGPT2’s proteins resemble natural proteins in their complexity, with the folding patterns and longer loops needed for interacting with other molecules and for functionalization (Figure 1). Database searches nevertheless revealed that the generated proteins are only distantly related to natural ones, more like a third-degree cousin than a sibling. This suggests that ProtGPT2 is not simply copying existing proteins but is combining amino acid building blocks in new ways.
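To make the "protein as text" idea concrete, here is a minimal sketch of autoregressive generation over amino-acid letters. It is not ProtGPT2 or its architecture: it swaps in a simple bigram counter that only looks at the previous residue, whereas ProtGPT2 is a far larger neural language model. The training strings in `TOY_SEQUENCES` are invented for illustration and are not real proteins, and the names `train_bigram` and `generate` are hypothetical helpers.

```python
import random
from collections import defaultdict

# Hypothetical toy "training set": invented strings over the amino-acid
# alphabet, NOT real protein sequences.
TOY_SEQUENCES = [
    "MKTAYIAKQR",
    "MKVLAAGIVG",
    "MSTAYLAKQG",
    "MKTLYIAGQR",
]

# Special tokens marking the start and end of a sequence.
START, END = "^", "$"


def train_bigram(sequences):
    """Count how often each residue follows each other residue."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        tokens = [START] + list(seq) + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, rng, max_len=50):
    """Sample one sequence residue by residue, conditioned on the previous residue."""
    out = []
    prev = START
    while len(out) < max_len:
        options = counts[prev]
        residues = list(options)
        weights = [options[r] for r in residues]
        nxt = rng.choices(residues, weights=weights, k=1)[0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


model = train_bigram(TOY_SEQUENCES)
rng = random.Random(0)  # fixed seed so repeated runs give the same samples
samples = [generate(model, rng) for _ in range(5)]
for s in samples:
    print(s)
```

Every pair of neighbouring residues in a sample was seen in the toy training set, yet the samples need not match any single training string. That is a very small-scale analogue of recombining known building blocks in new ways.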

Therefore, ProtGPT2 shows potential as a generative model capable of rapidly exploring new areas of the protein sequence space. Through numerous computational predictions, Noelia Ferruz’s team provides encouraging evidence that a large proportion of these sequences may fold into stable and functional structures resembling those found in Nature. Experimental confirmation, which was beyond the scope of the 2022 study, is still needed to draw firm conclusions about the folding and activity of ProtGPT2’s generated proteins.

Figure 1. An overview of the protein sequence space. Each node represents a sequence. Two nodes are linked when they are sufficiently homologous. Colors represent the different structural domains, and examples of AlphaFold-predicted structures of ProtGPT2-generated sequences are shown with their respective numbers: all-β structures (751), α/β (4266 and 1068), membrane protein (4307), α+β (486), and all-α structures (785). ProtGPT2-generated sequences are shown as white nodes. The PDB code of the most homologous natural structure is given, with the corresponding percentage identity. The AlphaFold confidence score (pLDDT) is also given.
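The percentage identity quoted in the Figure 1 caption can be illustrated with a small sketch. It assumes the two sequences have already been aligned (gaps written as `-`) and counts identical residues over the columns where neither sequence has a gap. This is only one of several conventions for the denominator, and the alignment tool and convention behind the figure's numbers are not specified here. The function name `percent_identity` and the example strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of gap-free aligned columns where both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns with a gap on either side
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Invented, pre-aligned toy strings (not real proteins).
generated = "MKTAYIAKQR"
natural = "MKVAY-AGQR"
print(f"{percent_identity(generated, natural):.1f}% identity")
```

In this toy pair, 9 columns are compared and 7 match. A low identity to the closest natural hit, combined with a confident predicted fold, is the kind of pattern the caption reports for the generated sequences.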

1. Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun 13, 4348 (2022). https://doi.org/10.1038/s41467-022-32007-7