Article
Open access
Published: 12 June 2024

The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications

Communications Chemistry volume 7, Article number: 134 (2024) Cite this article

2165 Accesses
5 Altmetric
Metrics details

Subjects

Abstract

Recent advances in machine learning (ML) have led to newer model architectures including transformers (large language models, LLMs) showing state of the art results in text generation and image analysis as well as few-shot learning (FSLC) models which offer predictive power with extremely small datasets. These new architectures may offer promise, yet the ‘no-free lunch’ theorem suggests that no single model algorithm can outperform at all possible tasks. Here, we explore the capabilities of classical (SVR), FSLC, and transformer models (MolBART) over a range of dataset tasks and show a ‘goldilocks zone’ for each model type, in which dataset size and feature distribution (i.e. dataset “diversity”) determines the optimal algorithm strategy. When datasets are small ( < 50 molecules), FSLC tend to outperform both classical ML and transformers. When datasets are small-to-medium sized (50-240 molecules) and diverse, transformers outperform both classical models and few-shot learning. Finally, when datasets are of larger and of sufficient size, classical models then perform the best, suggesting that the optimal model to choose likely depends on the dataset available, its size and diversity. These findings may help to answer the perennial question of which ML algorithm is to be used when faced with a new dataset.

CancerGPT for few shot drug pair synergy prediction using large pretrained language models

Article Open access 19 February 2024

Leveraging large language models for predictive chemistry

Article Open access 06 February 2024

Augmenting large language models with chemistry tools

Article Open access 08 May 2024

Introduction

The past two decades have seen the meteoric rise of machine learning (ML) as a powerful tool in the field of early drug discovery¹. Harnessing the ability of ML algorithms to generate predictive models has enabled researchers to virtually screen novel molecules for target-based activity, as well as for predicted absorption, distribution, metabolism, excretion, and toxicology (ADME/Tox) properties that would potentially preclude candidate molecules from use as potential therapeutics². This in silico screening strategy is aimed at decreasing the rounds of in vitro testing required in preclinical stages and improving the overall cost-effectiveness of preclinical drug discovery^1,2,3,4. Some of the most successful ML efforts in drug discovery use quantitative structure-activity relationship (QSAR), quantitative structure-property relationship (QSPR) as well as classification models or other ligand-based strategies to this end^5,6. Traditionally, such ligand-based modeling involves the training of ML algorithms such as Random Forest (RF), Support Vector Regression (SVR)^{7,8,9,10,11,12,13}, Naïve Bayes (NB)^{14,15,16,17,18}, K nearest neighbor (KNN)¹⁹, and Deep Neural Nets (DNN)^{20,21,22,23,24,25,26,27,28} on 2D structural fingerprints (ECFP6, MACCS²⁹), physiochemical descriptors (RDKit, Mordred³⁰ and many others), or some combination thereof³¹.

These traditional ML approaches often require a significant amount of data before their predictive ability reaches significance, reducing their application to targets or datasets with a substantial number of molecules. Recent prediction methods have utilized Transfer Learning and Multi-task output to take advantage of larger dataset sizes that may exist for biologically related targets, an effort that has been bolstered by the introduction and expansion of state-of-the-art large-language models (LLMs) like the one used by the popular chatbot, ChatGPT³². These new modeling architectures are likely rapidly overtaking traditional approaches for performing a variety of cheminformatics analyses. Recurrent neural networks (RNN), and long short-term memory (LSTM) networks^33,34 have been found to be very useful in a variety of prediction and optimization tasks³⁵. More recently, simplified molecular line entry system (SMILES)³⁶ strings have been used as input for Sequence-To-Sequence (Seq2Seq) and Transformer models³⁷. SMILES represent a natural format for Seq2Seq modeling because the linear encoding of 2D structural information is a good analog for the “word and sentence” structure of Seq2Seq models³⁸. However, the performance of Transformer-based architectures in drug discovery has been limited due to the average size of regression and classification datasets; these model architectures require large training sets, encompassing millions of compounds, before state-of-the-art performance emerges³⁸. Available structure-activity relationship (SAR) datasets for drug targets, however, often number in the tens of compounds, rendering them unusable for Transformer or even classical ML algorithms. One strategy to navigate this problem is to pre-train Transformer models on large datasets, and then fine-tune them for a single target or endpoint of interest^39,40. Another approach is to use a modeling technique specifically developed for small datasets, such as few-shot⁴¹ or zero-shot learning⁴².

As the size of regression and classification datasets can also vary greatly from target to target, and transfer or meta-learning allows for the use of smaller datasets, it can be tempting to apply these newer modeling techniques to all targets in drug discovery. However, few, if any, direct comparisons have been made between classical ML algorithms and these newer models across a wide spectrum of dataset sizes. Here, we evaluate three methods of ligand-based ML modeling at multiple scales of data, including small, medium, and large dataset sizes of different chemical diversities, to find a model selection heuristic for drug discovery. Unsurprisingly, we show that few-shot-learning classification (FSLC) models outperform both transformer (MolBART) and classical ML algorithm support vector classification (SVC) models when trained on small datasets ( < 50 compounds), and SVC models had more predictive power than either the MolBART or FSLC models when the training set exceeded 240 compounds. However, in the “medium” dataset range between 50 and 240 compounds, the advantage of MolBART or SVC modeling becomes dependent on the composition of the dataset, rather than the size. Increasing molecular diversity, quantified by increasing unique Murcko scaffolds in the dataset, favors MolBART modeling over SVC in this middle ground. Because of the “just right” nature of these observations, which consider both size and structural diversity for optimum modeling, we have termed our heuristic the “Goldilocks learning paradigm.” and developed a predictive model to aid in the selection of the modeling method based on inputs of dataset size and diversity. We then tested this paradigm further by modeling five kinases, with vastly different dataset sizes and complexity, that are implicated in the pathology of Alzheimer’s disease (AD). We ultimately show which of the ML approaches performs the best, and in the process, we identify some new inhibitors for MARK1.

Results

Transformer models outperform traditional modeling

Our previous use of ML to model enzyme inhibition has relied upon two-dimensional molecular descriptors (e.g., ECFP6) to generate molecular fingerprints, and a suite of traditional ML algorithms to create predominantly classification models^43,44,45. More recently we have applied these ML models for regression models for predicting compound activity^46,47. This approach is completely ligand-based, with the descriptors representing the presence or absence of substructures within the molecule. Each algorithm generates its own model for activity, for which a nested 5-fold cross-validation strategy is performed for hyperparameter optimization and to internally validate the models. The models are then either used to predict the activity of new compounds. To investigate the impact that dataset size has on traditional modeling, we trained SVR models on 2401 datasets of various sizes using a nested 5x cross-validation strategy (see: Large-Language Model Dataset Curation).

To determine whether large-language transformer models can outperform traditional ML methods, we fine-tuned a pre-trained large-language model called MolBART on the 2401 individual-target datasets from ChEMBL and explored the predictive power of the model in comparison to SVR (Supplementary Data 1). While the comparison to SVR was performed in the original publication, less than 100 datasets were investigated³⁸. We first determined whether MolBART would “ignore” small datasets in favor of large datasets, as presumably, MolBART could achieve overall low error rates if it focused on large datasets while ignoring smaller ones. We found that MolBART test prediction statistics were relatively insensitive to the number of endpoints in the target dataset with a correlation coefficient of 0.068 (Fig. 1). This suggests the utility of transfer learning as both large and small datasets can be used for training the model without the unbalanced dataset “overwhelming” the predictive power of smaller dataset endpoints. We next investigated whether the training set size impacted the predictive power of the fine-tuned MolBART model vs. the individual SVR models. The standard cross-validated SVR models have increased R² as the number of endpoints for a particular target increases. In contrast, the MolBART R² is independent of the number of target endpoints, suggesting the model is taking advantage of transfer learning for low-number datasets (Fig. 2A). Increasing the diversity generally results in a decrease in R² for the SVR models but not for MolBART, suggesting the latter can handle more diverse datasets better (Fig. 2B). Comparing two different endpoints for the SVR and MolBART models (Fig. 3) shows the accuracy of predictions with a small dataset size (e.g. opioid receptors, training set size = 80 endpoints, test set size = 20 endpoints) compared to a larger dataset size (Nicotinamide phosphoribosyltransferase, training set size = 2249 endpoints, test set size = 562 endpoints).

**Fig. 1: Correlation plots between MolBART R², SVR R², the molecular diversity of the training set, and the number of molecules in the training dataset.**

**Fig. 2: Comparison of MolBART and SVR model R² with dataset size and molecule diversity.**

**Fig. 3: Example true vs. predicted -log(M) values for a small and large dataset for MolBART and SVR.**

The top 10 datasets with the largest differences between MolBART R² and SVR R² all have <100 training datapoints, suggesting that datasets with the smallest number of endpoints gain the most predictive benefit from transfer learning acquired using pre-trained transformers (Table S1). To identify other factors that might impact the accuracy of prediction, we investigated the structural diversity of each of the training datasets for each of the individual targets. We chose to examine the distribution of molecules within each dataset and the number of unique Murcko scaffolds present⁴⁸. Datasets with a wider array of distinct scaffolds cover more chemical property space and therefore may lead to more generalizable ML models. An established method for identifying the structural diversity of a dataset is to use a Cumulative Scaffold frequency Plot (CSFP)⁴⁹ which compares the percentage of molecules within a dataset that share the same scaffold (Fig. 4 shows how the graph changes as diversity decreases). This is usually plotted as a fraction of molecules, sorted by frequency from the most frequent scaffold to the least frequent scaffold vs. the percentage of unique scaffolds. The scaffolds for each molecule in a dataset were determined using the RDKit Cheminformatics package, with the MurckoScaffoldSmilesFromSmiles function in the rdkit.Chem.Scaffolds.MurckoScaffold module. These plots are similar in structure to ROC curves in that each axis goes from 0 to 100%. A perfectly diverse dataset with all unique scaffolds would be a straight diagonal line, while a dataset comprised of only one scaffold would encapsulate the entire area of the plot. We defined a diversity metric based on the area under the CSFP curve (AUC). We defined our diversity metric as

$${{{{{\rm{div}}}}}}=2(1-{{{{{\rm{AUC}}}}}})$$

**Fig. 4: Sample CSFP plot curves of a non-diverse dataset (MAP Kinase MNK1) and a diverse dataset (Protein Kinase C).**

Using this metric a perfect diversity score would have a div = 1. If all of the molecules have the same scaffold the dataset would not be diverse at all and so diversity = 0. Figure 1 shows the predictive ability of MolBART independent of target diversity with a correlation of 0.13. However, there is a strong negative correlation between target diversity and the predictive ability of SVR models (Fig. 2B). As the diversity of a target dataset increases there is more structural information within it that needs to be incorporated into the SVR model, making accurate predictions less likely.

Exploring the correlation between R², dataset diversity, and the number of molecules in a dataset reveals a “sweet spot” in which our pre-trained transformer excels, as well as where traditional ML modeling with ECFP6 tends to outperform the fine-tuned MolBART model. Essentially, when diversity is high and the number of molecules in the training datasets is low (~ <240 datapoints), MolBART however excels at predictive power over SVR. However, as the size of the dataset increases, traditional ML modeling (exemplified by SVR in this case) outperforms MolBART (Fig. 5).

**Fig. 5: Correlation plots between MolBART R²–SVR R², molecular diversity per training set (diversity), and the number of molecules per training set.**

Few-shot-learning classifiers outcompete large-language models and classical machine learning with extremely low data

While LLMs are performative with high diversity and low dataset numbers, oftentimes predictive modeling is required at even smaller dataset size extremes. Until recently modeling for targets with little data (≤10 known actives and ≤20 total datapoints, herein referred to as micro-data) has been unusable due to limited information in these small datasets. Recently, few-shot and zero-shot learning models have shown state-of-the-art performance in text generation, image classification⁵⁰, and ML model classification predictions for micro-data, paving the way for ML to apply in data-poor situations (for example, these approaches could be used with PROTACS datasets for which there is currently very limited ADME data⁵¹, or for dark kinases also with limited data⁵²). However, the decision point on when to use few-shot-learning models vs. traditional ML modeling or LLMs for modeling training data based on the number of datapoints has not been extensively investigated. Given MolBART’s apparent advantages over the classical learning model at low dataset numbers, we decided to implement and test a prototypical network few-shot-learning classifier (FSLC) to benchmark it (Fig. S1; see few-shot-learning model architecture and training). As FSLC are classification models, we built SVM classification models and fine-tuned the molBART pre-trained model for classification to better compare the model types.

Few-shot-learning models benefit from training on similar tasks to what they will predict, and thus we extracted a subset of kinase-specific target datasets (371 target datasets) from the original 2401 datasets. From these, we eliminated kinase datasets that had less than 20 active and 20 inactive compounds and were left with a total of 95 kinase datasets for training. 64 kinases were used as the training set, with 14 kinases held out for validation and 14 held out for testing. The FSLC is first trained on the 64 separate kinases. To simulate a micro-dataset, 2–20 datapoints are sampled from each training kinase dataset as examples (the support set), and the model uses these datapoints as references for predictions (the test set). Varying sample numbers were drawn to determine the effect of class imbalance and dataset size on the FSLC (Table 1). Once trained, the pre-trained FSLC can predict new molecules similarly for the test set kinases (see methods for training and testing details). Traditional ML models (e.g. SVM) were trained using the same datapoints and dataset size.

Table 1 Support set (“training” set) and query set (test set) compositions for different shots during training of FSLC

Full size table

The pre-trained FSLC could predict correctly even under extremely small support sets of a presented 1 active, 1 inactive dataset (Table 2). In comparison, SVR failed to learn under micro-datasets until at least 5 actives and 5 inactives were presented to the model (Table 3). As the dataset size approached 20 (10 actives and 10 inactives), the classical ML model and the FSCL rapidly converged, suggesting that FSLC are powerful below the 20 datapoint mark, but lose comparative power once the datasets are large enough.

Table 2 Summary of the classification metrics for FSLC trained with graph descriptors for 14 validation tasks

Full size table

Table 3 Summary of the classification metrics for the SVC models trained with ECFP6 descriptors for 14 validation tasks

Full size table

Our results suggest that the different learning modalities have strengths and weaknesses at different levels of data information. To determine if this relationship is computationally predictable, we trained an ML model to predict which approach (MolBART, SVM, or FSLC) is likely to have the highest predictive power using Fast Interpretable Greedy-Tree Sums (FIGS)⁵³. FIGS is a generalized classification and regression tree (CART) model, which creates highly interpretable decision trees from decision leaves in the model. We split the 95 kinase datasets (described above) into a training and test set. We set the hyperparameter number of trees = 3 and used only dataset size and molecular diversity as the input, and the output was a multi-class decision (0,1,2 for molBART, SVM, and FSLC respectively). Using the holdout test set of datasets that were not seen during training, the FIGS classifier was able to predict the correct winning ML model type (Table 4, ROC 0.74). The decision tree produced by the FIGS classifier (Fig. S2) gives a heuristic decision tree, suggesting that relative model performance can be predicted based on dataset size and diversity alone.

Table 4 FIGS classifier statistics when predicting which machine-learning model (MolBART, SVM, FSLC) will perform best when using dataset size and dataset diversity as input

Full size table

From these results, we propose a simple heuristic for model selection based on the dataset size: For datasets <50 molecules, FSLCs dominate. From 50-240 molecules and particularly for diverse datasets, LLMs such as MolBART outperform other modeling types. Finally, as dataset size increases>240 molecules, traditional ML algorithms such as SVR models with ECFP-feature inputs are recommended.

The Goldilocks zone for model selection: discovery of novel kinase inhibitors for Alzheimer’s disease

To compare these ML model types and show the application of where each of them excels, respectively, we built and applied ML models to discover new kinase inhibitors for AD. The mechanisms underlying the AD pathophysiology are still unclear. Aggregation of tau and amyloid beta (Aβ) proteins as well as decreased acetylcholine are the focus of many studies⁵⁴. Neurofibrillary tangles (NFTs) are one of two hallmark plaques in AD^55,56,57 and are comprised of a hyperphosphorylated version of the microtubule-stabilizing protein tau (Ptau)⁵⁸. Ptau dissociates from the microtubule, migrating away from the axon, and forming insoluble paired helical filaments in the cytoplasm of the soma. This leads to destabilization and loss of the cytoskeletal microtubule, part of a cascade of events that leads to neuronal death⁵⁸. Three major classes of kinases can phosphorylate tau⁵⁹: proline-directed kinases like glycogen synthase kinase 3 beta (GSK-3β) or cyclin-dependent kinase 5 (CDK5)^60,61, non-proline-directed kinases such as tau-tubulin kinases (TTBK)⁶² or microtubule affinity regulated kinases (MARK)⁶³, and tyrosine kinases such as Fyn⁶⁴ or Abl kinases⁶⁵. Using IC₅₀ data from the ChEMBL database and ECFP6 Fingerprints with 1024 bits, we built ML classification and regression models for GSK3β, ABL1, FYN, CDK5, and MARK1. The classification models for FYN, CDK5, and MARK1 were built with activity thresholds of 1 µM, meaning compounds with IC₅₀ values lower than 1 µM are classified as active (Table 5 and Table S2). The kinase inhibitor datasets spanned several orders of magnitude in size, from 18-2969 datapoints and differing molecular diversity, making them an excellent dataset to investigate the Goldilocks zone model selection process. The datasets for both GSK3β and ABL1 contained sufficient entries of low-nanomolar inhibitors that we could build good classification models (according to the cross-validation statistics generated) based on the lower activity threshold of 100 nM (Table 5 and Table S2). The MARK1 dataset was by far the smallest dataset and therefore has the least predictive power for classical ML algorithms, which is reflected in the internal cross-validation scores for the classification models (Table S2). The fine-tuned MolBART model was further trained on each of these datasets, following the same training procedure as performed in the original fine-tuned training of the model.

Table 5 Training and test sets for classification models from publicly available databases

Full size table

The differences in the predictive power of the three model types, track with what we have seen from our own internal inhibition testing and model predictions for these four kinases and from the earlier comparative analysis (Table S3) with the FYN and CDK5 models performing less well. MARK1 presents a unique challenge, as it has sparse data and no external test set (Table 5). Because we did not have an external test set for our MARK1 model, and no measure of confidence for our classical ML model predictions, we performed a high-throughput screen of FDA-Approved compounds for MARK1 activity to create a prospective test set (Fig. 6A, B). We virtually screened the FDA-approved library with our classical models, fine-tuned MolBART model, and our pre-trained Kinase FSLC model to determine model performance with the sparse MARK1 dataset. The MedChemExpress FDA-Approved and Pharmacopeial Drug Library (HY-L066) was experimentally screened for MARK1 inhibition using the Promega ADP-Glo Kinase Assay. We selected a subset of hits from our high-throughput screen, along with a few compounds from our internal library, to be tested for IC₅₀ value determination (Fig. 6C), and then used those results as a test set for our MARK1 models.

**Fig. 6: Developing a Test Set for MARK1 Inhibition.**

Out of the 13 tested compounds, 5 (baricitinib, AT9283, ON123300, upadacitinib, and tofacitinib citrate) were true novel MARK1 active inhibitors that to our knowledge had not been described previously with this activity (Fig. 6C and Table S4). Baricitinib is a JAK1 and JAK2 inhibitor⁶⁶. AT9283 is an aurora kinase and JAK2 inhibitor^67,68. ON123300 is a CDK4, PI3K/AKT/mTOR inhibitor^69,70,71. Upadacitinib is a JAK-1 inhibitor⁷². Tofacitinib is a JAK1 and JAK3 inhibitor⁷³. We have also compared the maximal Tanimoto similarity (using MACCS key fingerprints) of the compounds to the MARK1 training set, with the closest being 0.79 (tofacitinib), while the similarity of the majority of the other hits ranged from 0.56-0.75, suggesting some structural diversity compared to the training dataset. We next visualized these molecules using t-SNE (Fig. 7 and Table S5). Each of the kinase datasets reveals distinct coverage of chemical space (Fig. 7A). When we compare the MARK1 hits discovered, they fall in various regions of chemical space that do not overlap with the known MARK1 training data, indicating the machine-learning model can find unique chemical structures.

**Fig. 7: t-SNE plots of the MACCS key fingerprints of the kinase datasets and the discovered MARK1 inhibitors.**

Although the classical ML model SVR performed the best followed by MolBART and the FSLC on the external test sets, the inverse performance was seen for the MARK1 prospective test set (Table 6). The SVR model predicted no actives for MARK1. MolBART, even though it was only trained on a test set of 18 compounds, was capable of discovering 3 of the 5 novel inhibitors. This tracks with the notion that pre-trained LLMs can be used to improve predictive power in sparse data situations. FSLC, despite performing worse overall when trained on large datasets, excelled at discovering novel MARK1 inhibitors, finding all 5 inhibitors with high precision (Table 6).

Table 6 Truth table of the predictions of SVC, MolBART, and FSLC on the MARK1 prospective test set

Full size table

Discussion

There have been several large-scale comparisons of ML models using on the order 1000 s datasets^{74,75,76,77,78}. We have previously described using over 5000 datasets from ChEMBL with the ECFP6 fingerprint descriptor and provided a comparison of various ML algorithms. In this case, the model performance was assessed using five-fold cross-validation metrics as well as F1 score, Cohen’s kappa, and Matthews correlation coefficient. We created ranked normalized scores for the metrics for all methods and showed that they appeared comparable while the Bayesian algorithm and support vector classification were the best performing⁴³. Other very large-scale evaluations include that of Novartis Institute for Biomedical Research which used 8558 proprietary Novartis assays to generate Random Forest Regressor models for their datasets⁷⁹.

Recent advances in ML have shown that transformer-based large-language models scale with the size of the data and number of parameters⁸⁰. While QSAR modeling has been performed with large datasets, the total number of endpoints and thus number of datapoints for training a single model have remained relatively small in comparison to the available data. In this study, we first evaluated the effect of dataset size using 2401 datasets on SVR models and compared performance with fine-tuning a pre-trained MolBART model. We found that the R² improved with dataset size for SVR but there was no effect of dataset size on MolBART. As dataset diversity increases the SVR model R² decreases whereas with the MolBART model R² was independent of diversity. This suggests that the pre-training of the MolBART model allowed it to capture relevant information for all QSAR tasks, giving it a higher predictive power over the SVR models on smaller datasets, which must learn all the relevant QSAR information from the individual datasets. As the dataset size increases, however, SVR models become more specialized in comparison to the multi-endpoint predicting MolBART model, allowing it to capture more nuanced information for each single dataset. Interestingly, the diversity of the dataset correlated negatively with the predictive power of SVR models, suggesting that similar feature representation may play a stronger role in the support vectors generated for the model decision boundary.

Often when we embark on drug discovery projects for new targets there is generally little if any data available to build ML models. In these cases, we traditionally would use approaches like pharmacophores to perform virtual screening based on molecular shape for a few hits^5,6. In addition, when we are asked by our colleagues or collaborators to build ML models, we are also often queried on how many molecules are required in order to build a ‘useful model?’. We have now explored an ML approach that handles extremely low data called FSCL⁵⁰. Initially, we used a subset of human kinases and simulated small datasets, and directly compared how this FSLC method performed relative to SVR. We found that FSCL performed well with small datasets (5 actives and 5 inactives) and as these numbers doubled its advantages decreased in comparison to both MolBART and SVR models.

Our results suggest that each model type (classical SVM, pre-trained LLM, and the FSCL) occupies a niche in which its predictive power excels over alternatives and that a Goldilocks zone exists for different ML model types. When the dataset is small (<50 molecules), FSLC performs best, capable of generalizing class-features from a small set of representations. When the datasets are diverse with a range of 50-240 molecules, pre-trained LLMs such as MolBART tend to outperform other ML approaches, leveraging the transfer learning from pre-training to outperform SVM while learning more from the slightly larger dataset size compared to FSLC. Finally, as the molecule dataset size increases past 240 datapoints, then classical ML methods such as SVR will dominate. The ability to predict which model algorithm is most likely to outperform the others using just the dataset size and dataset diversity as input suggests this relationship generalizes across most target datasets, giving us a heuristic with which to decide “how should I model my data?”.

We further demonstrate the relevance of these findings using the identification of inhibitors for GSK3β, ABL1, FYN, CDK5, and MARK1 for AD. Using the extreme of a small training dataset of 13 compounds and FSLC, We identified 5 new inhibitors of the kinase MARK1 (most of which are also known as JAK inhibitors) which regulates microtubule dynamics in neurons⁸¹ and phosphorylates tau which results in cellular transport inhibition, impacts postsynaptic molecular makeup and can induce either spine enlargement or tau toxicity as well as spine decay-depending on expression levels and duration⁸². Targeting tau phosphorylation is a valid target for AD that can form neurofibrillary tangles inside neurons and lead to neuronal death⁵⁸. The MARK1 dataset provided an example to compare the FSLC, MolBART, and classical ML approaches. Our results demonstrated that for a small dataset, FSLC performed the best at identifying the active molecules, outperforming MolBART and classical ML methods. MolBART was able to predict 3 of the 5 inhibitors while SVM predicted 0 inhibitors correctly, demonstrating the Goldilocks zone application.

Important limitations of this study to consider are that we have focused on kinase inhibitors which: 1. are all very similar; 2. there is more data than for other targets; and 3. there is generally a high degree of similarity between targets. To counter this, we would add that in the case of MARK1 there was limited data available, and it was an accessible assay which we could in turn generate new data for. We would suggest that future work could explore our approach with additional targets outside of kinases. For example, we have previously used ML to perform virtual screening to identify new ligands for various GPCRs^83,84. In addition, we did not explore other molecular descriptors or take into account any of the target 3D structures in this study. We solely focused on ChEMBL and BindingDB as sources of public data for modeling and there may be other published datasets we could include to increase our dataset size in the future. It is also important to remember we are curating data from many publications so there will be considerable variability and experimental error for some assays which we also need to consider for each dataset as well as dataset composition bias. The approach we have taken could also be used to explore and compare many other ML approaches beyond the few described here.

While there is currently considerable interest in methods such as LLMs, they may not be a panacea for our ML modeling of drug discovery relevant datasets just yet for the reasons mentioned. While one advantage of them is that they enable transfer learning across datasets and the construction of a single model that can output predictions for all the targets. We now describe where this algorithm may be the most useful, where the datasets may be midsize and diverse. It will be important to continue to explore whether this Goldilocks paradigm holds up as we look at larger datasets in the future. In the meantime, we have provided some recommendations that may assist others in their selection of which ML methods may be appropriate for datasets of various sizes and diversity.

Experimental section

General dataset curation

Publicly available datasets for Fyn, MARK1, Abl, and GSK3β, were downloaded from ChEMBL and BindingDB^85,86, then standardized with our proprietary “E-Clean” software to remove salts, and neutralize charges, and assign InChIKeys and canonical SMILES using open-source RDkit functions. Continuous activity values were converted to −log[M], and duplicate molecules were averaged based on InChIKey. The Assay Central software⁴³ further standardizes datasets using Indigo Toolkit 7 to, dearomatize, standardize, and reposition stereo bonds, standardize or flag erroneous charges, flag erroneous valences, remove isotopes, remove dative and hydrogen bonds, flag multicomponent chemicals, and remove any remaining duplicates.

Large-language model dataset curation

Over 5000 datasets were curated from ChEMBL⁸⁶ as described previously⁴³, and original activity thresholds were retained. Datasets with fewer than 20 total datapoints were removed, resulting in 2401 individual-target endpoint datasets for K_i, IC₅₀, or EC₅₀ activity. The endpoints for the aggregated dataset were randomly split into training/validation/test sets by following a 70/5/25 split. All datasets were then subjected to the “cleaning” procedures described above.

Few-shot-learning dataset curation

Datasets for 371 individual kinases were downloaded from ChEMBL. Datasets with fewer than 20 datapoints for either active or inactive compounds were removed, leaving only 95 kinase datasets. These were binarized on a threshold of 100 nM (IC₅₀, ≤100 nM or −log[M] ≥ 7), then randomly split into train/validation/test sets by following a 70/15/15 split. All datasets were then subjected to the “cleaning” procedures described above.

Assay central model building

Our proprietary AC software⁴³ uses multiple classic ML algorithms that are integrated into our web-based software to build classification models using the following algorithms: deep learning, adaboost classifier, Bernoulli Naïve Bayes, K nearest neighbors (kNN) classifier, logistic regression, random forest classifier, Support vector classification (SVC) and XG boost (XGB) classifier. For model building of continuous data, we have implemented multiple regression algorithms which include adaboost regression, Bayesian regression, elastic net regression, kNN regression, random forest regression, support vector machine regression, and XGboost regression. In all cases, nested 5-fold cross-validation was performed except for deep learning for which we removed 20% of the training set, in a stratified manner for the classification models, and these were used as external test sets for models trained on the remainder of the data.

Large-language model architecture and training

To explore the effect of “tuning-dataset” size for a pre-trained LLM transformer model, we fine-tuned the base Chemformer pre-trained model provided by Irwin et al.³⁸ on progressively larger sets of data. The Large-Language Model is a molecule pre-trained Bidirectional Auto-regressive Transformer (molBART). The BART architecture uses both encoder and decoder layers of the Transformer models, allowing the model to learn contextual molecule encodings (using the encoder) while the auto-regressive decoder module learns molecular structure. After pre-training molBART, the bidirectional encoder can be fine-tuned for downstream tasks such as property predictions (e.g., IC₅₀ prediction of molecules against a specific target). The fine-tuning of the pre-trained model was performed using PyTorch using the Lightning⁸⁷ framework for 150 epochs and each of the 2401 endpoint datasets was used in the fine-tuning.

Few-shot-learning model architecture and training

Few-shot-learning classification (FSLC) models were originally introduced for multi-class image and text classification⁵⁰. However, our application is different as we only have two classes for the majority of the tasks (active or inactive). The few-shot-learning network was trained on a select number of tasks (e.g. kinase dataset), which have sufficient data and tested on the remaining kinases. Shot is the number of molecules sampled per class, and added to the support set during each round of training, also known as episode.

FSLC consists of three modules the first two take in molecular descriptors and create embeddings for the molecules. The last module creates “prototypes” for each class by taking the average of molecular features and making predictions, similar to the previous one/few-shot-learning models^42,88,89. FSLC uses episodic learning to match training and testing conditions. In each episode, the training algorithm samples two batches of labeled, non-overlapping datapoints from each class in each task. One batch is used as the support set, S, to create prototypes. The other batch is the query set, B, which is used to calculate loss and update the network’s parameters. g′ and f^′, and g and f are the same embedding modules. During testing, support and query sets are sampled for each unseen kinase. Embeddings are created for each new compound, and binary labels are predicted based on the distance between the query compounds and the class prototypes. We used the same FSLC architecture described in Vella and Ebejer⁴². The first module, g′, is either a fully connected feed-forward network (FNN) or a graph convolutional network (GCN) depending on the chemical descriptors used. When we trained an FSLC with ECFP chemical descriptors generated with rdkit (2048 bit vectors and radius = 5), we used an FNN architecture. We used GCN when we used the graph convolution-learned embeddings⁹⁰. The second module, g is a long short-term memory (LSTM) network with iterative refinement (IterRefLSTM). The details of IterRefLSTM can be found in Altae-Tran et al.⁸⁹. Briefly, the IterRefLSTM uses a context-based attention mechanism to refine the initial embeddings iteratively and simultaneously for the support and query embeddings. The third module is a Prototypical Network (PN) which creates class prototypes and predicts the label of unseen datapoints based on the Euclidean distance. A more detailed explanation can be found in the original Prototypical Networks paper by Snell, Swersky, and Zemel⁵⁰. We trained a number of FSLCs using 1 and 10 shots. FSLCs were trained until the loss was < 1e⁻⁶. We used the negative log-likelihood loss and Adam optimizer⁹¹ to optimize the model parameters. We used the AUC of the PRC to evaluate the model’s general performance similar to Altae-Tran et al.⁸⁹ and Vella and Ebejer⁴². We tested the models by randomly sampling each test task 1000 times. The average statistics scores were calculated for each task and for all models.

In vitro screening

Chemical library

The FDA-Approved & Pharmacopeial Drug Library (MedChemExpress, Monmouth Junction, NJ) contains 2743 compounds that were approved by institutions (FDA, PDMA, EMA, or NMPA), or contained in pharmacopeia (JP, BP, EP, or USP). The portion of the library supplied at 10 mM in DMSO (2637 compounds) was screened for MARK1 inhibition. 200 nL of each compound was dispensed into a 384-well assay plate (Perkin Elmer 6008280) using a Mosquito HTS (TTP Labtech Melbourn, UK), and stored at −80 °C until use.

Experimental procedures

Compounds were evaluated for MARK1 inhibition using the ADP-Glo™ Assay with the MARK1 Kinase Enzyme System from Promega. Reactions were performed using 10 µM ATP, 0.2 µg/µL CHKtide substrate, 50 µM DTT, and 5 ng MARK1 enzyme, final concentration. All but plate HYCPK12959 contained 0.01% Triton-x 100. The final concentration of the compounds was 385 µM in 3.8% DMSO. Compounds were pre-incubated with enzyme for 10 min at room temperature before the addition of ATP and substrate. The kinase reaction was incubated for 60 min at room temperature, followed by a 40-min incubation with ADP-Glo reagent, then a 30-min incubation with Kinase Detection Reagent. The resulting luminescent signal was read on a SpectraMax iD5 (Molecular Devices, San Jose, CA), using an integration time of 1000 ms.

Data analysis

Statistical analysis was performed in Excel and GraphPad Prism 9.5.1. Z’ factor was calculated using the formula ${Z}^{{\prime} }=1-(3x({\sigma }_{p}+{\sigma }_{n}))/|{\mu }_{p}-{\mu }_{n}|$.

Data availability

All relevant data are available from the authors upon written request.

References

Ekins, S. et al. Exploiting machine learning for end-to-end drug discovery and development. Nat. Mater. 18, 435–441 (2019).
Article CAS PubMed PubMed Central Google Scholar
Ekins, S., Lane, T. R., Urbina, F. & Puhl A. C. In silico ADME/tox comes of age: twenty years later. Xenobiotica 1–7, https://doi.org/10.1080/00498254.2023.2245049 (2023).
Cheng, F., Li, W., Liu, G. & Tang, Y. In silico ADMET prediction: recent advances, current challenges and future trends. Curr. Top. Med. Chem. 13, 1273–1289 (2013).
Article CAS PubMed Google Scholar
Zhavoronkov, A. et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol. 37, 1038–1040 (2019).
Article CAS PubMed Google Scholar
Ekins, S., Mestres, J. & Testa, B. In silico pharmacology for drug discovery: applications to targets and beyond. Br. J. Pharm. 152, 21–37 (2007).
Article CAS Google Scholar
Ekins, S., Mestres, J. & Testa, B. In silico pharmacology for drug discovery: methods for virtual ligand screening and profiling. Br. J. Pharm. 152, 9–20 (2007).
Article CAS Google Scholar
Bennet, K. P. & Campbell, C. Support vector machines: hype or hallelujah? SIGKDD Explor. 2, 1–13 (2000).
Article Google Scholar
Christianini, N. & Shawe-Taylor, J. Support Vector Machines and Other Kernel-based Learning Methods. (Cambridge University Press, 2000).
Chang, C. C. & Lin, C.-J. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, https://www.csie.ntu.edu.tw/~cjlin/libsvm/ (2011).
Lei, T. et al. ADMET evaluation in drug discovery. Part 17: Development of quantitative and qualitative prediction models for chemical-induced respiratory toxicity. Mol. Pharm. 14, 2407–2421 (2017).
Article CAS PubMed Google Scholar
Kriegl, J. M., Arnhold, T., Beck, B. & Fox, T. A support vector machine approach to classify human cytochrome P450 3A4 inhibitors. J. Comput. Aided Mol. Des. 19, 189–201 (2005).
Article CAS PubMed Google Scholar
Guangli, M. & Yiyu, C. Predicting Caco-2 permeability using support vector machine and chemistry development kit. J. Pharm. Pharm. Sci. 9, 210–221 (2006).
PubMed Google Scholar
Kortagere, S., Chekmarev, D., Welsh, W. J. & Ekins, S. Hybrid scoring and classification approaches to predict human pregnane X receptor activators. Pharm. Res. 26, 1001–1011 (2009).
Article CAS PubMed Google Scholar
Wang, S. et al. ADMET evaluation in drug discovery. 16. Predicting hERG blockers by combining multiple pharmacophores and machine learning approaches. Mol. Pharmaceut. 13, 2855–2866 (2016).
Article CAS Google Scholar
Li, D. et al. ADMET evaluation in drug discovery. 13. Development of in silico prediction models for P-glycoprotein substrates. Mol. Pharm. 11, 716–726 (2014).
Article CAS PubMed Google Scholar
Nidhi, Glick, M., Davies, J. W. & Jenkins, J. L. Prediction of biological targets for compounds using multiple-category Bayesian models trained on chemogenomics databases. J. Chem. Inf. Model 46, 1124–1133 (2006).
Article CAS PubMed Google Scholar
Azzaoui, K. et al. Modeling promiscuity based on in vitro safety pharmacology profiling data. ChemMedChem 2, 874–880 (2007).
Article CAS PubMed Google Scholar
Bender, A. et al. Analysis of pharmacology data and the prediction of adverse drug reactions and off-target effects from chemical structure. ChemMedChem 2, 861–873 (2007).
Article CAS PubMed Google Scholar
Shen, M., Xiao, Y., Golbraikh, A., Gombar, V. K. & Tropsha, A. Development and validation of k-nearest neighbour QSPR models of metabolic stability of drug candidates. J. Med. Chem. 46, 3013–3020 (2003).
Article CAS PubMed Google Scholar
Schmidhuber, J. Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015).
Article PubMed Google Scholar
Capuzzi, S. J., Politi, R., Isayev, O., Farag, S. & Tropsha, A. QSAR modeling of Tox21 challenge stress response and nuclear receptor signaling toxicity assays. Front. Environ. Sci. 4, https://doi.org/10.3389/fenvs.2016.00003 (2016).
Russakovsky, O. et al. ImageNet Large Scale Visual Recognition Challenge. https://arxiv.org/abs/1409.0575 (Arxiv, 2015).
Zhu, H. et al. Big data in chemical toxicity research: the use of high-throughput screening assays to identify potential toxicants. Chem. Res. Toxicol. 27, 1643–1651 (2014).
Article CAS PubMed PubMed Central Google Scholar
Clark, A. M. & Ekins, S. Open source Bayesian models: 2. Mining a “big dataset” to create and validate models with ChEMBL. J. Chem. Inf. Model. 55, 1246–1260 (2015).
Article CAS PubMed Google Scholar
Ekins, S., Clark, A. M., Swamidass, S. J., Litterman, N. & Williams, A. J. Bigger data, collaborative tools and the future of predictive drug discovery. J. Comput. Aided Mol. Des. 28, 997–1008 (2014).
Article CAS PubMed PubMed Central Google Scholar
Ekins, S., Freundlich, J. S. & Reynolds, R. C. Are bigger data sets better for machine learning? Fusing single-point and dual-event dose response data for Mycobacterium tuberculosis. J. Chem. Inf. Model. 54, 2157–2165 (2014).
Article CAS PubMed PubMed Central Google Scholar
Ekins, S. The next era: deep learning in pharmaceutical research. Pharm. Res. 33, 2594–2603 (2016).
Article CAS PubMed PubMed Central Google Scholar
Baskin, I. I., Winkler, D. & Tetko, I. V. A renaissance of neural networks in drug discovery. Expert Opin. Drug Discov. 11, 785–795 (2016).
Article CAS PubMed Google Scholar
Kuwahara, H. & Gao, X. Analysis of the effects of related fingerprints on molecular similarity using an eigenvalue entropy approach. J. Cheminformatics 13, 27 (2021).
Article CAS Google Scholar
Moriwaki, H., Tian, Y.-S., Kawashita, N. & Takagi, T. Mordred: a molecular descriptor calculator. J. Cheminformatics 10, 4 (2018).
Article Google Scholar
Kausar, S. & Falcao A. O. Analysis and Comparison of Vector Space and Metric Space Representations in QSAR Modeling. Molecules 24, 1698 (2019).
Liu, Y. et al. Summary of ChatGPT-related research and perspective towards the future of large language models. Meta-Radiol. 1, 100017 (2023).
Article Google Scholar
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. LSTM: a search space Odyssey. IEEE Trans. Neural Netw. Learn. Syst. 28, 2222–2232 (2017).
Article PubMed Google Scholar
Urbina, F. et al. UV-adVISor: attention-based recurrent neural networks to predict UV-Vis spectra. Anal. Chem. 93, 16076–16085 (2021).
Article CAS PubMed PubMed Central Google Scholar
Blay, V., Li, X., Gerlach, J., Urbina, F. & Ekins, S. Combining DELs and machine learning for toxicology prediction. Drug Discov. Today 27, 103351 (2022).
Article CAS PubMed PubMed Central Google Scholar
Weininger, D. S. M. I. L. E. S. Introduction and encoding rules. J. Chem. Inf. Comput Sci. 28, 31 (1988).
Article CAS Google Scholar
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv https://arxiv.org/abs/1810.04805 (2018).
Irwin, R., Dimitriadis, S., He, J. & Bjerrum, E. J. Chemformer: a pre-trained transformer for computational chemistry. Mach. Learn. Sci. Technol. 3, 015022 (2022).
Article Google Scholar
Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1–67 (2020).
Yenduri, G. et al. Generative pre-trained transformer: a comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions. In IEEE Access. vol. 12, pp. 54608–54649 (2024).
Stanley, M. et al. FS-Mol: a few-shot learning dataset of molecules. In: NeurIPS 2021 https://openreview.net/forum?id=701FtuyLlAd (2021).
Vella, D. & Ebejer, J.-P. Few-shot learning for low-data drug discovery. J. Chem. Inf. Model. 63, 27–42 (2023).
Article CAS PubMed Google Scholar
Lane, T. R. et al. Bioactivity comparison across multiple machine learning algorithms using over 5000 datasets for drug discovery. Mol. Pharm. 18, 403–415 (2021).
Article CAS PubMed Google Scholar
Lane, T. R. et al. Machine learning models identify new inhibitors for human OATP1B1. Mol. Pharm. 19, 4320–4332 (2022).
Article CAS PubMed PubMed Central Google Scholar
Zorn, K. M. et al. Multiple machine learning comparisons of HIV cell-based and reverse transcriptase data sets. Mol. Pharm. 16, 1620–1632 (2019).
Lane, T. R., Harris, J., Urbina, F. & Ekins, S. Comparing LD₅₀50/LC(50) Machine learning models for multiple species. J. Chem. Health Saf. 30, 83–97 (2023).
Vignaux, P. A. et al. Validation of acetylcholinesterase inhibition machine learning models for multiple species. Chem. Res. Toxicol. 36, 188–201 (2023).
Article CAS PubMed PubMed Central Google Scholar
Bemis, G. W. & Murcko, M. A. The properties of known drugs 1. molcular frameworks. J. Med. Chem. 39, 2887–2893 (1996).
Article CAS PubMed Google Scholar
Langdon, S. R., Brown, N. & Blagg, J. Scaffold diversity of exemplified medicinal chemistry space. J. Chem. Inf. Model. 51, 2174–2185 (2011).
Article CAS PubMed PubMed Central Google Scholar
Snell, J., Swersky, K. & Zemel, R. S. Prototypical networks for few-shot learning. NeurIPS Proceedings. https://papers.nips.cc/paper_files/paper/2017/hash/cb8da6767461f2812ae4290eac7cbc42-Abstract.html (2017).
Caron, G. et al. Steering new drug discovery campaigns: permeability, solubility, and physicochemical properties in the bRo5 chemical space. ACS Med. Chem. Lett. 12, 13–23 (2021).
Article CAS PubMed PubMed Central Google Scholar
Berginski, M. E. et al. The Dark Kinase Knowledgebase: an online compendium of knowledge and experimental results of understudied kinases. Nucleic Acids Res. 49, D529–D535 (2020).
Article PubMed Central Google Scholar
Shuo Tan, Y. et al. Fast interpretable greedy-tree sums. Preprint at https://ui.adsabs.harvard.edu/abs/2022arXiv220111931S (2022).
West, S. & Bhugra, P. Emerging drug targets for Abeta and tau in Alzheimer’s disease: a systematic review. Br. J. Clin. Pharm. 80, 221–234 (2015).
Article CAS Google Scholar
Hanger, D. P., Hughes, K., Woodgett, J. R., Brion, J. P. & Anderton, B. H. Glycogen synthase kinase-3 induces Alzheimer’s disease-like phosphorylation of tau: generation of paired helical filament epitopes and neuronal localisation of the kinase. Neurosci. Lett. 147, 58–62 (1992).
Article CAS PubMed Google Scholar
Vanden Dries, V. et al. Amyloid precursor protein reduction enhances the formation of neurofibrillary tangles in a mutant tau transgenic mouse model. Neurobiol. Aging 55, 202–212 (2017).
Article CAS PubMed Google Scholar
Engel, T., Goni-Oliver, P., Lucas, J. J., Avila, J. & Hernandez, F. Chronic lithium administration to FTDP-17 tau and GSK-3beta overexpressing mice prevents tau hyperphosphorylation and neurofibrillary tangle formation, but pre-formed neurofibrillary tangles do not revert. J. Neurochem. 99, 1445–1455 (2006).
Article CAS PubMed Google Scholar
Simic, G. et al. Tau protein hyperphosphorylation and aggregation in Alzheimer’s disease and other tauopathies, and possible neuroprotective strategies. Biomolecules 6, 6 (2016).
Article PubMed PubMed Central Google Scholar
Martin, L. et al. Tau protein kinases: involvement in Alzheimer’s disease. Ageing Res. Rev. 12, 289–309 (2013).
Article CAS PubMed Google Scholar
Llorens-Martín, M., Jurado, J., Hernández, F. & Avila, J. GSK-3β, a pivotal kinase in Alzheimer disease. Front. Mol. Neurosci. 7, 46 (2014).
PubMed Google Scholar
Kimura, T., Ishiguro, K. & Hisanaga, S. Physiological and pathological phosphorylation of tau by Cdk5. Front. Mol. Neurosci. 7, 65 (2014).
Article PubMed PubMed Central Google Scholar
Tomizawa, K., Omori, A., Ohtake, A., Sato, K. & Takahashi, M. Tau-tubulin kinase phosphorylates tau at Ser-208 and Ser-210, sites found in paired helical filament-tau. FEBS Lett. 492, 221–227 (2001).
Article CAS PubMed Google Scholar
Matenia, D. & Mandelkow, E. M. The tau of MARK: a polarized view of the cytoskeleton. Trends Biochem. Sci. 34, 332–342 (2009).
Article CAS PubMed Google Scholar
Lee, G. et al. Phosphorylation of tau by fyn: implications for Alzheimer’s disease. J. Neurosci. 24, 2304–2312 (2004).
Article CAS PubMed PubMed Central Google Scholar
Derkinderen, P. et al. Tyrosine 394 is phosphorylated in Alzheimer’s paired helical filament tau and in fetal tau with c-Abl as the candidate tyrosine kinase. J. Neurosci. 25, 6584–6593 (2005).
Article CAS PubMed PubMed Central Google Scholar
Shi, J. G. et al. The pharmacokinetics, pharmacodynamics, and safety of baricitinib, an oral JAK 1/2 inhibitor, in healthy volunteers. J. Clin. Pharm. 54, 1354–1361 (2014).
Article CAS Google Scholar
Howard, S. et al. Fragment-based discovery of the pyrazol-4-yl urea (AT9283), a multitargeted kinase inhibitor with potent aurora kinase activity. J. Med. Chem. 52, 379–388 (2009).
Article CAS PubMed Google Scholar
Dawson, M. A. et al. AT9283, a potent inhibitor of the Aurora kinases and Jak2, has therapeutic potential in myeloproliferative disorders. Br. J. Haematol. 150, 46–57 (2010).
Article CAS PubMed Google Scholar
Perumal, D. et al. Dual targeting of CDK4 and ARK5 using a novel kinase inhibitor ON123300 exerts potent anticancer activity against multiple myeloma. Cancer Res. 76, 1225–1236 (2016).
Article CAS PubMed PubMed Central Google Scholar
Divakar, S. K. et al. Dual inhibition of CDK4/Rb and PI3K/AKT/mTOR pathways by ON123300 induces synthetic lethality in mantle cell lymphomas. Leukemia 30, 86–93 (2016).
Article CAS PubMed Google Scholar
Zhang, X. et al. Preclinical pharmacological evaluation of a novel multiple kinase inhibitor, ON123300, in brain tumor models. Mol. Cancer Ther. 13, 1105–1116 (2014).
Article CAS PubMed PubMed Central Google Scholar
Upadacitinib (Rinvoq)—a new JAK inhibitor for rheumatoid arthritis. Med. Lett. Drugs Ther. 61, 183–185 (2019).
Kawalec, P., Mikrut, A., Wisniewska, N. & Pilc, A. The effectiveness of tofacitinib, a novel Janus kinase inhibitor, in the treatment of rheumatoid arthritis: a systematic review and meta-analysis. Clin. Rheumatol. 32, 1415–1424 (2013).
Article PubMed PubMed Central Google Scholar
Lenselink, E. B. et al. Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J. Cheminform 9, 45 (2017).
Article PubMed PubMed Central Google Scholar
Mayr, A. et al. Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem. Sci. 9, 5441–5451 (2018).
Article CAS PubMed PubMed Central Google Scholar
Lee, K. & Kim, D. In-silico molecular binding prediction for human drug targets using deep neural multi-task learning. Genes (Basel) 10, 906 (2019).
Article CAS PubMed Google Scholar
Awale, M. & Reymond, J. L. Polypharmacology browser PPB2: target prediction combining nearest neighbors with machine learning. J. Chem. Inf. Model. 59, 10–17 (2019).
Article CAS PubMed Google Scholar
Škuta, C. et al. QSAR-derived affinity fingerprints (part 1): fingerprint construction and modeling performance for similarity searching, bioactivity classification and scaffold hopping. J. Cheminformatics 12, 39 (2020).
Article Google Scholar
Martin, E. J. et al. All-Assay-Max2 pQSAR: activity predictions as accurate as four-concentration IC50s for 8558 novartis assays. J. Chem. Inf. Model. 59, 4450–4459 (2019).
Article CAS PubMed Google Scholar
Hoffmann, J. et al. Training compute-optimal large language models. 36th Conference on Neural Information Processing Systems (NeurIPS 2022). https://proceedings.neurips.cc/paper_files/paper/2022/file/c1e2faff6f588870935f114ebe04a3e5-Paper-Conference.pdf (2022).
Chudobová, J. & Zempel, H. Microtubule affinity regulating kinase (MARK/Par1) isoforms differentially regulate Alzheimer-like TAU missorting and Aβ-mediated synapse pathology. Neural Regen. Res. 18, 335–336 (2023).
Article PubMed Google Scholar
Zempel, H. & Mandelkow, E. Mechanisms of axonal sorting of tau and influence of the axon initial segment on tau cell polarity. Adv. Exp. Med. Biol. 1184, 69–77 (2019).
Article CAS PubMed Google Scholar
Puhl, A. C., Gao, Z. G., Jacobson, K. A. & Ekins, S. Machine learning for discovery of new ADORA modulators. Front. Pharm. 13, 920643 (2022).
Article CAS Google Scholar
Puhl, A. C. et al. Machine learning-aided search for ligands of P2Y(6) and other P2Y receptors. Purinergic Signal. https://doi.org/10.1007/s11302-024-10003-4 (2024).
Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2012).
Article CAS PubMed Google Scholar
Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2018).
Article PubMed Central Google Scholar
Falcon, W. PyTorchLightning/PyTorch-lightning: 0.7.6 release (0.7.6).) Zenodo https://zenodo.org/records/3828935 (2020).
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K. & Wierstra D. Matching networks for one shot learning. 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain. https://proceedings.neurips.cc/paper_files/paper/2016/file/90e1357833654983612fb05e3ec9148c-Paper.pdf (2016).
Altae-Tran, H., Ramsundar, B., Pappu, A. S. & Pande, V. Low data drug discovery with one-shot learning. ACS Cent. Sci. 3, 283–293 (2017).
Article CAS PubMed PubMed Central Google Scholar
Duvenaud, D. et al. Convolutional networks on graphs for learning molecular fingerprints. NIPS Proceedings 2015. https://papers.nips.cc/paper_files/paper/2015/hash/f9be311e65d81a9ad8150a60844bb94c-Abstract.html (2015).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. 3rd International Conference for Learning Representations. https://ui.adsabs.harvard.edu/abs/2014arXiv1412.6980K (San Diego, 2015).

Download references

Acknowledgements

We acknowledge our many colleagues at Collaborations Pharmaceuticals, Inc. for their assistance with this project and the reviewers for their constructive feedback. We kindly acknowledge NIH funding: R44GM122196-02A1 from NIGMS and 1R44ES031038-01 from NIEHS (PI–Sean Ekins). “Research reported in this publication was supported by the National Institute of Environmental Health Sciences of the National Institutes of Health under Award Number R44ES031038. We also acknowledge 1R43AT010585-01 from NIH/NCCAM The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.”

Author information

Authors and Affiliations

Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC, 27606, USA
Scott H. Snyder, Patricia A. Vignaux, Mustafa Kemal Ozalp, Jacob Gerlach, Ana C. Puhl, Thomas R. Lane, John Corbett, Fabio Urbina & Sean Ekins

Authors

Scott H. Snyder
View author publications
You can also search for this author in PubMed Google Scholar
Patricia A. Vignaux
View author publications
You can also search for this author in PubMed Google Scholar
Mustafa Kemal Ozalp
View author publications
You can also search for this author in PubMed Google Scholar
Jacob Gerlach
View author publications
You can also search for this author in PubMed Google Scholar
Ana C. Puhl
View author publications
You can also search for this author in PubMed Google Scholar
Thomas R. Lane
View author publications
You can also search for this author in PubMed Google Scholar
John Corbett
View author publications
You can also search for this author in PubMed Google Scholar
Fabio Urbina
View author publications
You can also search for this author in PubMed Google Scholar
Sean Ekins
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Scott H. Snyder: Model building, data generation, analysis, and software (MolBART). Patricia A. Vignaux: Experimental data generation, data analysis, and manuscript writing. Mustafa Kemal Ozalp: Model building, data generation, and software (few-shot learning). Jacob Gerlach: Data curation software. Ana C. Puhl: Experimental data generation, kinase expertise. Thomas R. Lane: Data curation, experimental methods, and data analysis. John Corbett: Data curation software. Fabio Urbina: Project supervision, data generation, data analysis, and manuscript writing. Sean Ekins: Funding, project supervision, and manuscript writing.

Corresponding authors

Correspondence to Fabio Urbina or Sean Ekins.

Ethics declarations

Competing interests

S.E. is the owner and all others are employees of Collaborations Pharmaceuticals, Inc.

Peer review

Peer review information

Communications Chemistry thanks Fang Bai and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary Files

Supplemental Data 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Snyder, S.H., Vignaux, P.A., Ozalp, M.K. et al. The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications. Commun Chem 7, 134 (2024). https://doi.org/10.1038/s42004-024-01220-4

Download citation

Received: 21 December 2023
Accepted: 04 June 2024
Published: 12 June 2024
DOI: https://doi.org/10.1038/s42004-024-01220-4

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

CancerGPT for few shot drug pair synergy prediction using large pretrained language models

Leveraging large language models for predictive chemistry

Augmenting large language models with chemistry tools

Introduction

Results

Transformer models outperform traditional modeling

Few-shot-learning classifiers outcompete large-language models and classical machine learning with extremely low data

The Goldilocks zone for model selection: discovery of novel kinase inhibitors for Alzheimer’s disease

Discussion

Experimental section

General dataset curation

Large-language model dataset curation

Few-shot-learning dataset curation

Assay central model building

Large-language model architecture and training

Few-shot-learning model architecture and training

In vitro screening

Chemical library

Experimental procedures

Data analysis

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Supplementary Information

Description of Additional Supplementary Files

Supplemental Data 1.

Rights and permissions

About this article

Cite this article

Share this article

Comments

Search

Quick links