Erasmus Mundus Joint Master - ChEMoinformatics+ : Deep Learning with SMILES

by: Abdulfatai Lawal Track « Chemoinformatics and Physical Chemistry », Milan-Strasbourg, 2022

The simplified molecular input line entry system (SMILES), proposed by Weininger [1], is widely recognized and used as a standard representation of compounds in modern chemical information processing. It is relatively compact and can be read and edited by both computers and human beings, which makes it especially suitable for being created and/or curated by advanced computer programs.
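To illustrate how a program can read a SMILES string, the following is a minimal sketch of a tokenizer. The regular expression and the function name `tokenize_smiles` are assumptions made for illustration and are not taken from [1]; the pattern only separates bracket atoms, the two-letter halogens Cl and Br, and two-digit ring closures from single-character symbols, and it does not validate the chemistry.

```python
import re

# Assumed token pattern (illustrative, not part of the SMILES specification):
# bracket atoms such as [NH4+], the two-letter halogens Br and Cl,
# two-digit ring closures such as %12, then any other single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into atom, bond, branch and ring-closure tokens."""
    return SMILES_TOKEN.findall(smiles)

# Aspirin written as SMILES; every character here is its own token.
aspirin_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")

# Multi-character tokens are kept together.
halide_tokens = tokenize_smiles("ClCBr")
ammonium_tokens = tokenize_smiles("[NH4+]")
```

Joining the tokens back together reproduces the input string, so the tokenization loses no information.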

The SMILES notation proposed by Weininger has some drawbacks: it is proprietary, and it is based on valence bond theory and hence inherits the imperfections of that theory. Despite these drawbacks, several modifications of SMILES, and algorithms built on it, are used in deep learning applications, including SMILES Pair Encoding, multiple-SMILES-based augmentation, DeepSMILES, OpenSMILES, CurlySMILES and SELF-referencing embedded strings (SELFIES).

Compared with other machine learning methods, deep learning has a much more flexible architecture, so a neural network can be tailor-made for a specific problem [2]. This has led to superior performance in areas such as image and voice recognition and natural language processing, among others. In the field of chemoinformatics, it has shown remarkable success in bioactivity prediction, de novo molecular design, reaction prediction and retrosynthetic analysis.

The deep learning techniques that have succeeded in natural language processing (NLP) can be applied to text-based molecular representations. Perhaps for this reason, SMILES-based deep learning models are emerging as an important research topic in chemoinformatics, with applications already in virtual screening of chemical compounds and identification of functional substructures [3] (Figure 1).

Figure 1. Strategy for applying a one-dimensional CNN to the SMILES linear representations of chemical compounds and extracting the learned filters to discover chemical motifs [3].
Hirohara, M., Saito, Y., Koda, Y. et al. BMC Bioinformatics 19 (Suppl 19), 526 (2018). https://doi.org/10.1186/s12859-018-2523-5
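The idea in Figure 1 can be sketched in a few lines of plain Python. This is a simplified stand-in under stated assumptions, not the model of [3]: the input here is a plain one-hot encoding of SMILES tokens rather than the feature matrix used in that paper, and the single filter is set by hand to the motif "C(=O)O" instead of being learned during training. The helper names `one_hot` and `conv1d` are hypothetical.

```python
import re

# Assumed token pattern (illustrative): bracket atoms, Br, Cl,
# two-digit ring closures, then any single character.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def one_hot(tokens, vocab):
    """Return a len(tokens) x len(vocab) matrix with one 1.0 per row."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [[1.0 if index[t] == j else 0.0 for j in range(len(vocab))]
            for t in tokens]

def conv1d(matrix, kernel):
    """Slide a (width x n_features) filter along the sequence axis,
    without padding, and return one activation per window."""
    width = len(kernel)
    activations = []
    for start in range(len(matrix) - width + 1):
        total = 0.0
        for k in range(width):
            for j, weight in enumerate(kernel[k]):
                total += weight * matrix[start + k][j]
        activations.append(total)
    return activations

tokens = TOKEN.findall("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
vocab = sorted(set(tokens))
x = one_hot(tokens, vocab)

# Hand-set filter standing in for a learned one: the one-hot
# encoding of the motif "C(=O)O" (an assumption for illustration).
motif = TOKEN.findall("C(=O)O")
kernel = one_hot(motif, vocab)

activations = conv1d(x, kernel)
pooled = max(activations)  # global max pooling over the sequence
# Window start positions where the filter responds most strongly.
hits = [i for i, a in enumerate(activations) if a == pooled]
```

Each activation counts how many tokens in the window agree with the filter, so the strongest responses mark the positions where the motif occurs. Reading a filter back as the substructure it responds to is the kind of motif extraction the figure describes, although in a trained network the weights are real-valued and the mapping is less direct.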

References:
1. Weininger, D. J. Chem. Inf. Model., 1988, 28, 1, 31–36, DOI: https://doi.org/10.1021/ci00057a005.

2. Chen, H., Engkvist, O., Wang, Y., Olivecrona, M. and Blaschke, T. Drug Discovery Today, 2018, 23, 6, 1241–1250, DOI: https://doi.org/10.1016/j.drudis.2018.01.039.

3. Hirohara, M., Saito, Y., Koda, Y. et al. BMC Bioinformatics, 2018, 19 (Suppl 19), 526, DOI: https://doi.org/10.1186/s12859-018-2523-5.