by: Xinyue Gao, Track « In Silico Drug Design », Strasbourg-Milan-Paris, 2022
Chemical databases have become indispensable resources empowering research across pharmaceutical, biotech, and materials science domains. By aggregating vast collections of compounds and associated data, these databases allow scientists to efficiently explore chemical space to identify new drug candidates, optimize materials properties, and understand fundamental molecular interactions.
Among available chemical repositories, ZINC20 stands out for its commitment to comprehensiveness, advanced search capabilities, and accessibility. As discovering new biologically active small molecules relies on screening diverse compound libraries, ZINC20’s extensive collection of over 1.4 billion compounds gives researchers an unprecedented starting point.
However, sheer size poses steep computational challenges. Traditional fingerprint-based similarity searches scale linearly with database size, while database sizes are currently growing by orders of magnitude. Approximate feature-tree methods enable fast exploration of huge tangible chemical spaces by representing compounds as reduced feature trees, but at the cost of a potentially less expressive molecular representation.
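To make the linear-scaling point concrete, here is a minimal sketch of a brute-force fingerprint search. It is not ZINC20 code: the fingerprints are toy bit patterns stored as Python integers, and the names `tanimoto`, `linear_scan`, and the compound labels are assumptions chosen for illustration. Every query has to touch every library entry, which is why the cost grows in step with database size.

```python
from typing import Dict, List, Tuple


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as integer bit sets."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def linear_scan(query: int, library: Dict[str, int], top_k: int = 3) -> List[Tuple[str, float]]:
    """Score the query against every entry, then keep the top_k most similar."""
    # One similarity evaluation per library entry: O(N) work per query.
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy 8-bit fingerprints (hypothetical, for illustration only).
library = {
    "cpd_A": 0b10110010,
    "cpd_B": 0b10110011,
    "cpd_C": 0b01001100,
    "cpd_D": 0b10000010,
}
query = 0b10110010
hits = linear_scan(query, library, top_k=2)
print(hits)
```

With four compounds the scan is instantaneous, but the same loop over billions of entries performs billions of comparisons per query, which motivates indexed approaches.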
ZINC20 balances these trade-offs via SmallWorld – an algorithm that indexes explicit molecular graphs for rapid similarity calculations. By precomputing synthetically accessible organic molecule graphs, SmallWorld can look up a query graph and quickly traverse the map to identify nearest neighbors in graph edit distance space. This retains full structure details while allowing sub-second searches across databases of >100 billion compounds.
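The lookup-and-traverse idea can be illustrated with a toy example. The sketch below is not SmallWorld's implementation: it assumes a tiny, hand-written index (`EDIT_NEIGHBORS`, a hypothetical name) in which each molecule is linked to the molecules one graph edit away, and it uses a plain breadth-first search to collect neighbors up to a chosen edit distance. The point is only that, once the map is precomputed, a query costs a lookup plus a local traversal rather than a scan of the whole database.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical precomputed index: each key is a molecule (written as SMILES),
# each value lists molecules that differ from it by a single graph edit.
EDIT_NEIGHBORS: Dict[str, List[str]] = {
    "CCO": ["CC", "CCCO", "CCN"],
    "CC": ["CCO", "CCN"],
    "CCCO": ["CCO", "CCCN"],
    "CCN": ["CCO", "CC", "CCCN"],
    "CCCN": ["CCCO", "CCN"],
}


def neighbors_within(query: str, index: Dict[str, List[str]], max_distance: int) -> List[Tuple[str, int]]:
    """Breadth-first traversal returning (molecule, edit distance) pairs."""
    if query not in index:
        return []
    seen = {query: 0}
    queue = deque([query])
    hits: List[Tuple[str, int]] = []
    while queue:
        current = queue.popleft()
        dist = seen[current]
        if dist == max_distance:
            continue  # do not expand beyond the requested radius
        for nxt in index[current]:
            if nxt not in seen:
                seen[nxt] = dist + 1
                hits.append((nxt, dist + 1))
                queue.append(nxt)
    return hits


result = neighbors_within("CCO", EDIT_NEIGHBORS, max_distance=2)
print(result)
```

Because breadth-first search visits nodes in order of increasing distance, the closest analogues come first in the returned list.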
Complementing SmallWorld, ZINC20 also incorporates Arthor – a custom toolkit for ultrafast substructure and pattern matching. Arthor represents molecules in a compact binary format optimized for regex-style queries. By distributing across a compute cluster, Arthor can search for complex molecular patterns in just seconds.
These innovations allow ZINC20 to make chemical space exploration a truly interactive experience. Researchers can quickly retrieve analogues in response to biological data, interactively explore structure-activity hypotheses, and easily purchase compounds for testing. Virtual screening workflows also become nimbler and more comprehensive.
However, users should be aware that ZINC20 focuses on commercially available content. So, while coverage spans billions of novel compounds, ZINC20 may lack some public or poorly documented molecules. Integrating additional databases like ChEMBL, DrugBank, and PubChem can help fill the gaps.
ZINC20 makes cross-database integration straightforward via a suite of flexible web APIs. Users can access compounds, substructures, similarity calculations, and more programmatically. The full database is also downloadable, so custom workflows can be built locally.
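As an illustration of a local workflow on downloaded data, the sketch below loads a SMILES table into a dictionary keyed by compound identifier. The file layout is an assumption (whitespace-separated `smiles` and `zinc_id` columns with a header line), so it should be checked against the files actually downloaded; the function name `load_smiles_table` and the identifiers in the sample are hypothetical placeholders.

```python
import io
from typing import Dict, Iterable


def load_smiles_table(lines: Iterable[str]) -> Dict[str, str]:
    """Map compound ID -> SMILES from whitespace-separated 'smiles id' lines."""
    table: Dict[str, str] = {}
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines and the assumed header row.
        if len(parts) < 2 or parts[0].lower() == "smiles":
            continue
        smiles, compound_id = parts[0], parts[1]
        table[compound_id] = smiles
    return table


# In-memory stand-in for a downloaded file; the IDs are placeholders.
sample = io.StringIO(
    "smiles zinc_id\n"
    "CCO ZINC_EXAMPLE_1\n"
    "\n"
    "c1ccccc1 ZINC_EXAMPLE_2\n"
)
table = load_smiles_table(sample)
print(table["ZINC_EXAMPLE_2"])
```

For a real file, the same function can be passed an open file handle, which is read line by line instead of being loaded into memory at once.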
ZINC20 also sets best practices with rigorous attention to data quality and standardization. Compounds are regularly updated and annotated with curated purchasability data to simplify acquisition. Structures and calculated physicochemical properties are checked against reference datasets, and community feedback helps ZINC20 keep improving.
This communal spirit epitomizes the promise of open chemical databases – democratizing access to fuel more inclusive research. By providing robust tools freely to all scientists without restrictions, resources like ZINC20 empower investigations that would otherwise be infeasible. And support for virtual screening of huge make-on-demand catalogs allows pursuing risky but high-reward hypotheses relying on novel chemistry.
Ultimately, ZINC20’s technical innovations and commitment to accessibility usher in a new era for computational drug discovery. As datasets continue to grow into the big data regime, advanced machine learning approaches are becoming essential. Resources like ZINC20 that lower barriers for exploring billions of compounds will only increase in strategic value. As computational power catches up with data volumes, comprehensive, high-quality open databases are expected to enable a new wave of therapeutics to enhance health and longevity worldwide.
References:
Irwin, John J., Khanh G. Tang, Jennifer Young, Chinzorig Dandarchuluun, Benjamin R. Wong, Munkhzul Khurelbaatar, Yurii S. Moroz, John Mayfield, and Roger A. Sayle. “ZINC20—A Free Ultralarge-Scale Chemical Database for Ligand Discovery.” Journal of Chemical Information and Modeling 60, no. 12 (December 2020): 6065–73. https://doi.org/10.1021/acs.jcim.0c00675.
Nicola, George, Tiqing Liu, and Michael K. Gilson. “Public Domain Databases for Medicinal Chemistry.” Journal of Medicinal Chemistry 55, no. 16 (August 23, 2012): 6987–7002. https://doi.org/10.1021/jm300501t.