idw – Informationsdienst Wissenschaft

15.10.2025 13:04

Chemical language models don't need to understand chemistry

Johannes Seiler Dezernat 8 - Hochschulkommunikation
Rheinische Friedrich-Wilhelms-Universität Bonn

    Language models are now also being used in the natural sciences. In chemistry, they are employed, for instance, to predict new biologically active compounds. Chemical language models (CLMs) must be extensively trained. However, they do not necessarily acquire knowledge of biochemical relationships during training. Instead, they draw conclusions based on similarities and statistical correlations, as a recent study by the University of Bonn demonstrates. The results have now been published in the journal Patterns.

    Large language models are often astonishingly good at what they do, whether that's proving mathematical theorems, composing music, or drafting advertising slogans. But how do they arrive at their results? Do they actually understand what constitutes a symphony or a good joke? It is not so easy to answer that question. “All language models are a black box,” emphasises Prof. Dr Jürgen Bajorath. “It's difficult to look inside their heads, metaphorically speaking.”

    Nevertheless, Jürgen Bajorath, a cheminformatics scientist at the Lamarr Institute for Machine Learning and Artificial Intelligence at the University of Bonn, has attempted to do just that. Specifically, he and his team have focused on a special type of AI algorithm: transformer CLMs. These models work in a similar way to ChatGPT, Google Gemini and Elon Musk's 'Grok', which are trained on vast quantities of text and can therefore generate sentences independently. CLMs, on the other hand, are usually trained on significantly less data. They acquire their knowledge from molecular representations and relationships, for example so-called SMILES strings. These are character strings that represent molecules and their structure as a sequence of letters and symbols.
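
    To make the idea of SMILES strings concrete, the following minimal Python sketch splits such a string into the symbols a language model could treat as tokens. The token pattern is a simplified assumption chosen for illustration; it is not the tokenizer used in the study, and the function name tokenize_smiles is hypothetical.

```python
import re

# Simplified SMILES token pattern (an illustrative assumption, not the
# study's tokenizer): bracket atoms, two-letter halogens, two-digit ring
# closures, single-letter atoms, ring-closure digits, bonds and branches.
_SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#+\-\\/.:@*~$]"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; reject unrecognised characters."""
    tokens = _SMILES_TOKEN.findall(smiles)
    # If any character was skipped, the tokens no longer rebuild the input.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


# Aspirin written as a SMILES string.
ASPIRIN = "CC(=O)Oc1ccccc1C(=O)O"

if __name__ == "__main__":
    print(tokenize_smiles("CCO"))    # ethanol
    print(tokenize_smiles(ASPIRIN))
```

    In this sketch a molecule is nothing more than a sequence of symbols, which is why statistical patterns in such sequences can be learned without any explicit chemical rules.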

    Systematic manipulation of training data

    In pharmaceutical research, scientists often attempt to identify substances that can inhibit certain enzymes or block receptors. CLMs can be used to predict active molecules based on the amino acid sequences of target proteins. “We used sequence-based molecular design as a test system to better understand how transformers arrive at their predictions,” explains Jannik Roth, a doctoral student working with Bajorath. “After the training phase, if you introduce a new enzyme to such a model, it may produce a compound that can inhibit it. But does that mean that the AI has learned the biochemical principles behind such inhibition?”

    CLMs are trained using pairs of amino acid sequences of target proteins and their respective known active compounds. In order to address their research question, the scientists systematically manipulated the training data. “For example, we initially only fed the model specific families of enzymes and their inhibitors,” explains Bajorath. “When we then used a new enzyme from the same family for testing purposes, the algorithm actually suggested a plausible inhibitor.” However, the situation was different when the researchers used an enzyme from a different family in the test, i.e. one that performs a different function in the body. In this case, the CLM failed to correctly predict active compounds.
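
    The experimental design described above can be sketched as a family-aware data split. The following Python example is only an illustration under stated assumptions: the Pair record, the function family_splits and all identifiers and values in the usage example are hypothetical placeholders, not data or code from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One training example: a target protein and a known active compound."""
    family: str       # protein family label (hypothetical)
    protein_id: str   # identifier of the target protein (hypothetical)
    sequence: str     # amino acid sequence of the target
    smiles: str       # SMILES string of the active compound


def family_splits(pairs, train_families, held_out_ids):
    """Return (train, in_family_test, out_of_family_test).

    Proteins from train_families go to training, except those listed in
    held_out_ids, which form the in-family test set. Proteins from any
    other family form the out-of-family test set.
    """
    train, in_family, out_of_family = [], [], []
    for pair in pairs:
        if pair.family not in train_families:
            out_of_family.append(pair)
        elif pair.protein_id in held_out_ids:
            in_family.append(pair)
        else:
            train.append(pair)
    return train, in_family, out_of_family


if __name__ == "__main__":
    # Placeholder records only; sequences and compounds are not real data.
    toy = [
        Pair("family_A", "A1", "SEQ_A1", "CCO"),
        Pair("family_A", "A2", "SEQ_A2", "CCN"),
        Pair("family_B", "B1", "SEQ_B1", "CCC"),
    ]
    train, same_family, other_family = family_splits(toy, {"family_A"}, {"A2"})
    print(len(train), len(same_family), len(other_family))
```

    Comparing model performance on the in-family and out-of-family test sets is one way to probe whether predictions transfer beyond the families seen during training.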

    Statistical rule of thumb

    “This suggests that the model has not learned generally applicable chemical principles, i.e. how enzyme inhibition usually works chemically,” says the scientist. Instead, the suggestions are based solely on statistical correlations, i.e. patterns in the data. For example, if the new enzyme resembles a training sequence, a similar inhibitor will probably be active. In other words, similar enzymes tend to interact with similar compounds. “Such a rule of thumb based on statistically detectable similarity is not necessarily a bad thing,” emphasises Bajorath, who leads the area “AI in Life Sciences and Health” at the Lamarr Institute. “After all, it can also help to identify new applications for existing active substances.”

    However, the models used in the study lacked biochemical knowledge when estimating similarities. They considered enzymes (or receptors and other proteins) to be similar if 50–60 percent of their amino acid sequences matched, and accordingly suggested similar inhibitors. The researchers could randomize and scramble the sequences at will, as long as a sufficient number of the original amino acids was retained. In reality, however, often only very specific parts of an enzyme are necessary for it to perform its task. A single amino acid change in such a region can render an enzyme dysfunctional. Other areas are more important for structural integrity and less relevant for specific functions. “During their training, the models did not learn to distinguish between functionally important and unimportant sequence parts,” emphasises Bajorath.
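
    The notion of retaining a share of the original amino acids can be illustrated with a short Python sketch. It randomizes a chosen fraction of positions and measures position-wise identity between two equal-length sequences. Both the randomization scheme and the identity measure are simplifying assumptions made for illustration (no sequence alignment is performed); the exact protocol of the study is not described here, and the function names are hypothetical.

```python
import random

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def percent_identity(a: str, b: str) -> float:
    """Fraction of positions with identical residues (equal lengths only)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def randomize_positions(sequence: str, fraction: float, rng: random.Random) -> str:
    """Replace a fraction of positions with a different random residue."""
    n_changes = round(fraction * len(sequence))
    positions = rng.sample(range(len(sequence)), n_changes)
    residues = list(sequence)
    for i in positions:
        # Always pick a different residue so the change is guaranteed.
        alternatives = [aa for aa in AMINO_ACIDS if aa != residues[i]]
        residues[i] = rng.choice(alternatives)
    return "".join(residues)


if __name__ == "__main__":
    rng = random.Random(0)
    original = AMINO_ACIDS  # toy 20-residue sequence, not a real protein
    scrambled = randomize_positions(original, 0.4, rng)
    print(scrambled, percent_identity(original, scrambled))
```

    Note that this measure treats every position as equally important. That is exactly the limitation described above: it cannot tell a change in a functionally critical region from a change elsewhere in the sequence.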

    Models simply repeat what they have read before

    The results of the study therefore show that the transformer CLMs trained for sequence-based compound design lack any deeper chemical understanding, at least for this test system. In other words, they merely recapitulate, with minor variations, what they have already picked up in a similar context at some point. “This does not mean that they are unsuitable for drug research,” emphasises Bajorath, who is also a member of the Transdisciplinary Research Area (TRA) “Modelling” at the University of Bonn. “It is quite possible that they suggest drugs that actually block certain receptors or inhibit enzymes.” However, this is certainly not because they understand chemistry so well, but because they recognise similarities in text-based molecular representations and statistical correlations that remain hidden from us. This does not discredit their results. However, they should not be overinterpreted either.

    Participating institutions and funding

    The work was financially supported by the German Academic Scholarship Foundation.


    Scientific contact:

    Prof. Dr. Jürgen Bajorath
    Life Science Informatics
    University of Bonn
    Phone: +49 (0)228/73-69100
    Email: bajorath@bit.uni-bonn.de


    Original publication:

    Jannik P. Roth, Jürgen Bajorath: Unraveling learning characteristics of transformer models for molecular design, Patterns, https://doi.org/10.1016/j.patter.2025.101392, URL: https://www.cell.com/patterns/fulltext/S2666-3899(25)00240-5


    Further information:

    https://lamarr-institute.org/de/


    Images

    Prof. Dr. Jürgen Bajorath and doctoral student Jannik P. Roth from Life Science Informatics at the University of Bonn.

    Copyright: Photo: Gregor Hübl/University of Bonn

    Schematic representation of a transformer model for predicting new compounds from protein sequence data.

    Copyright: Graphics: J. P. Roth and J. Bajorath




     
