Main

Chemical and biochemical structures and processes are of great interest to machine learning with graph neural networks (GNNs), as they are supposedly well represented by undirected and directed graphs, respectively1. Conversely, machine learning plays an ever-increasing role in the life sciences, where new methods have been adopted and adapted for a wide range of tasks such as the prediction of physico- and quantum-chemical properties in material science, the prediction of pharmacokinetic properties in drug development, the prediction of binding affinities between small-molecule ligands and proteins in drug discovery, or the prediction of chemical reaction yields in synthetic and process chemistry2,3,4,5,6.

With the introduction of GNNs, graph representation learning for molecules attracted growing interest, as these neural network architectures allowed representation learning directly on the molecular graphs rather than first extracting features or precalculating descriptors from a given chemical structure7, with newly proposed methods continuously escalating the architectures’ complexity or the amount of precomputed or, less frequently, experimentally measured molecular properties, such as three-dimensional coordinates or structural motifs, that are added to the graph8,9,10. However, even though the published literature often touts the benefits of introducing new features based on molecular composition and topology, it is generally ignored that these features are also implicitly encoded by lower-order representations. For example, the most widely used text representation of molecules, the simplified molecular-input line-entry system (SMILES), is a string encoding of the depth-first tree traversal of the molecular graph that retains all topological information and, in the case of isomeric SMILES, geometrical information pertaining to stereochemistry11. The same applies to molecular fingerprints, which often even explicitly encode topological features12. Meanwhile, the molecular graph introduces implicit constraints on the molecular three-dimensional geometry, as topology and atom types induce structure. Conversely, representing a molecule as a graph often encapsulates only an approximation of molecular topology and geometry, as only a subset of chemical bonds, the covalent bonds, can be represented in this static data structure. However, in addition to covalent bonds, molecules and substances used as drugs or materials also contain ionic and metallic bonds that are generally not well represented in molecular graphs. In addition, dynamic intermolecular interactions such as hydrogen bonds and π-stacking are common occurrences in ligand–protein binding, which is of high interest in medicinal chemistry13. Furthermore, specific bonds are not well defined in conjugated systems such as aromatic rings, as electrons are delocalized over multiple atoms and bonds.

Given this often somewhat fuzzy notion of bonds in molecules, we hypothesize that representing a molecule as a set—formally a multiset, as the vectors representing the atoms can be identical for two or more atoms—of atoms rather than a graph may capture the true nature of molecules better than explicit graph representations while preserving implicit information about molecular structure. In this set-based approach, each atom is represented as a vector of one-hot encoded atom invariants as defined in ref. 12 (see details in ‘Choice of atom and bond invariants’ in Methods). While this representation may encode the local topology of a molecular graph through these invariants, for example, the degree of an atom, it does not encode any explicit connectivity of the molecular graph. In the context of set representation learning, a molecule is therefore defined as a set of k-dimensional vectors a, where the set’s cardinality is the number of non-hydrogen atoms in the molecule. However, this set-based representation introduces two properties that are not supported by typical neural network-based machine learning architectures: (1) the cardinality of the molecular sets differs depending on the number of non-hydrogen atoms in a molecule, and (2) the molecular sets are unordered. In addition, the architecture must support multiset input, as the number of identical vector representations of atoms matters in the context of molecular sets. Given these requirements, specifically the need for permutation invariance, it is insufficient to simply pad the sets. Therefore, performing machine learning tasks on such molecular sets requires a neural network architecture capable of permutation-invariant representation of variable-sized sets. Over the past five years, multiple architectures, including DeepSets, Set-Transformer and RepSet, have become available that fulfil these requirements14,15,16.
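
To make this concrete, the following is a minimal sketch of a permutation-invariant encoder for variable-sized molecular multisets in the style of the DeepSets sum-decomposition; it is not the RepSet layer used in our models, and the class name, layer widths and masking scheme are illustrative. Padding is used only to batch sets of differing cardinality; the permutation invariance comes from the masked sum.

```python
import torch
from torch import nn


class DeepSetsEncoder(nn.Module):
    """Sum-decomposition encoder: invariant to the order of atoms in the set."""

    def __init__(self, atom_dim: int, hidden: int = 128, out_dim: int = 1):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(atom_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, atoms: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # atoms: (batch, max_atoms, atom_dim); mask: (batch, max_atoms), 1 for real atoms, 0 for padding.
        h = self.phi(atoms) * mask.unsqueeze(-1)  # zero out padded positions
        return self.rho(h.sum(dim=1))             # sum over the set -> permutation invariant


# Two molecules with 3 and 5 non-hydrogen atoms, padded to length 5 and masked accordingly.
atoms = torch.zeros(2, 5, 133)
mask = torch.tensor([[1., 1., 1., 0., 0.], [1., 1., 1., 1., 1.]])
print(DeepSetsEncoder(atom_dim=133)(atoms, mask).shape)  # torch.Size([2, 1])
```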

Here we introduce several architectures, which we collectively denote molecular set representation learning architectures, based on the scheme described above and in Fig. 1, and evaluate them against widely used GNN methods on a wide array of tasks, including physico- and quantum-chemical property prediction for material science, pharmacokinetic property prediction for drug design, protein–ligand-affinity prediction for drug discovery and biochemistry, and reaction-yield prediction for synthetic and process chemistry. We show that using the concept of molecular set representation learning and combining it with GNN architectures, we can not only simplify current approaches but also improve on commonly used GNN implementations such as the directed message-passing neural network (D-MPNN), graph attention network (GAT) or DimeNet17,18,19. Furthermore, we find that, compared with more modern benchmark datasets, extensively used older benchmark datasets may not be well suited to evaluate the advantages of GNNs, as the GNNs perform worse than even the simplest of our models on them. Finally, we introduce an easy-to-use, extensible collection of molecular set representation architectures ready to be used in various fields, including materials science, drug discovery and development, biochemistry, and synthetic and process chemistry.

Fig. 1: Overview of set-based and set-enhanced models.
figure 1

All implemented models consist of three parts: an encoding or embedding layer, a set representation layer and, finally, a readout/MLP layer. a, The simplest molecular set representation model, MSR1, takes molecules as input and encodes the atoms of the ith molecule as 133-dimensional binary vectors a0 to ak, where k is the number of atoms, forming the molecular set Ai. The resulting molecular sets with differing cardinalities are passed into a RepSet set representation layer and read out by a regression (regr.) or classification (class.) MLP. b, The dual molecular set representation model MSR2 encodes the atoms and bonds of molecules into two distinct sets Ai and Bi and passes them to two separate RepSet layers, whose outputs Aout and Bout are concatenated (Cat), followed by either a regression or classification MLP. c, SR-GINE is a GIN model with a GINEConv layer enhanced with a set representation layer replacing global pooling. The node (atom) embeddings n0 to nk are passed to a RepSet layer as graph sets Gi, followed by an MLP regressor or classifier. d, SR-BIND follows the dual-set architecture of MSR2 by employing two parallel RepSet layers. Atoms al of the ligand Li are added to a set if they are within radius r of any protein atom am. Conversely, atoms am from protein Mi are added to a second set only if they are within radius r of any ligand atom al. Both sets are passed to separate RepSet layers whose output is concatenated and passed to a regression or classification MLP. e, MSR2-RXN also follows the dual-set architecture of MSR2 by employing two parallel RepSet layers. All reactants rj and products pj that are part of the ith reaction in an input dataset are encoded using ECFP with a radius of 3 and size of 2,048 into molecular sets Ri and Pi. Both sets are passed to separate RepSet layers whose output is concatenated and passed to a regression or classification MLP.

Results and discussion

We introduce multiple implementations of molecular set representation architectures, including the single-set atom-based MSR1, the dual-set atom- and bond-based MSR2, and the graph isomorphism network-based SR-GINE (set representation-graph isomorphism network with edge features), which replaces the global pooling function with a set representation layer. For input into MSR1, a molecule is encoded as a set of one-hot encoded vectors, where each vector encodes the atom invariants of a single atom in the molecule, similar to an extended-connectivity fingerprint (ECFP) with radius zero, and therefore contains no explicit information about molecular topology. Following refs. 12,17, we include the atom invariants described in ‘Choice of atom and bond invariants’ in Methods. MSR2 expands on this concept and implements a neural network architecture with two parallel set representation layers, where the first takes the same input as the single layer in MSR1, while the second takes a set of vectors encoding bond invariants (see ‘Choice of atom and bond invariants’ in Methods) as input. Therefore, with MSR2, we introduce more information about molecular topology while avoiding explicit definitions of topology beyond bonded atom pairs. In addition to the purely set-based architectures, we introduce SR-GINE, a graph isomorphism network with edge features (GINE) with a RepSet set representation pooling layer in place of the global mean pooling of the vanilla GINE implementation. In all our architectures, the output of the set representation layer is read by a multilayer perceptron (MLP) with a single hidden layer for regression or classification; in the case of dual-set architectures, such as MSR2, the two sets’ outputs are first concatenated. Finally, we chose the combination of RepSet and GINE based on their respective reported performances and the inherent interpretability of RepSet7,16. Further reasoning and data substantiating this choice can be found in ‘Choice of set representation and GNN layers’ in Methods.
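
To illustrate the set-enhanced idea behind SR-GINE, the sketch below stacks PyTorch Geometric GINEConv layers and replaces global mean pooling with a learned, permutation-invariant set pooling over the node embeddings. Set2Set is used here purely as a stand-in for the RepSet layer described above, and all layer sizes are illustrative.

```python
import torch
from torch import nn
from torch_geometric.nn import GINEConv, Set2Set


class SRGINESketch(nn.Module):
    def __init__(self, num_atom_feats: int, num_bond_feats: int, hidden: int = 64, out_dim: int = 1):
        super().__init__()
        self.atom_emb = nn.Linear(num_atom_feats, hidden)
        self.conv1 = GINEConv(
            nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden)),
            edge_dim=num_bond_feats)
        self.conv2 = GINEConv(
            nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden)),
            edge_dim=num_bond_feats)
        # Learned, permutation-invariant set pooling over node embeddings,
        # used here in place of global mean pooling.
        self.set_pool = Set2Set(hidden, processing_steps=3)
        self.readout = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, out_dim))

    def forward(self, x, edge_index, edge_attr, batch):
        h = self.atom_emb(x)
        h = torch.relu(self.conv1(h, edge_index, edge_attr))
        h = torch.relu(self.conv2(h, edge_index, edge_attr))
        g = self.set_pool(h, batch)  # one embedding per molecule in the mini-batch
        return self.readout(g)
```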

To benchmark our proposed architectures, we rely on well-known and recently published datasets to compare our methods with widely used graph-based methods as a baseline. Initially, we focused on evaluating our methods on the datasets commonly known as the MoleculeNet benchmark. We selected the relatively small blood–brain barrier penetration dataset (BBBP) (n = 2,039), with the task of classifying small molecules by whether they penetrate the blood–brain barrier, to tune the hyperparameters, namely, the number of hidden sets, the number of elements in the hidden sets, the number of epochs, as well as the number of hidden channels in the MLPs of all our models. Unless otherwise indicated, the established hyperparameters were used for all the benchmarks discussed in this study. We then compared our models with the performance achieved by the multiscale graph convolutional network (MGCN), SchNet (a quantum-chemical deep tensor neural network), graph convolutional network (GCN), graph isomorphism network (GIN) and D-MPNN, as reported in ref. 20. In addition, we compared our approach with the current state of the art outside GNN-based approaches, namely, MolFormer, a chemical large language model trained on approximately 100 million molecules extracted from the ZINC and PubChem databases21. As a control, we benchmarked an implementation of GINE with the standard global mean-pooling layer and the same one-hot encoded atom and bond attribute vectors that were used for MSR1 and MSR27. For all datasets, Murcko scaffold splits in accordance with refs. 17,20 were used.

As shown in Table 1, MSR1, the simplest of our models, shows a performance close to existing GNN approaches, namely, GIN and D-MPNN, without any explicit topological information about the molecular graph. Indeed, it performed better than D-MPNN in 5 out of 11 benchmark datasets and better than GIN in 8 out of 11. This may suggest that, up to now, too much value has been assigned to representing a molecule as a graph rather than as a loose set of atoms. Alternatively, these results may be due to the nature of the benchmark datasets or the molecules contained therein, meaning that they are potentially not suited for evaluating GNN architectures. Furthermore, contrary to our expectation, MSR2 did not improve on the performance of MSR1 but generally performed worse. Finally, replacing the global mean-pooling layer of GINE with RepSet improved its performance in 8 out of 11 benchmarks. Overall, these results are promising in light of our hypothesis, which stipulates that a more relaxed definition of molecules than the one provided by graph encoding may be beneficial. However, as discussed previously, the performance of our simplistic models, especially MSR1, may be due to limitations of the datasets. Hence, we benchmarked our architectures on two recent and well-received datasets, the results of which are discussed in ‘Physico- and quantum-chemical property prediction’ and ‘Pharmacokinetic property prediction’. In addition to these more generalized architectures, we introduce two architectures based on the dual-set approach of MSR2 tailored towards binding-affinity prediction in protein–ligand complexes and reaction-yield prediction in ‘Binding-affinity prediction’ and ‘Reaction-yield prediction’, respectively.

Table 1 Benchmarking the implemented molecular set representation architectures against GNN baselines on the MoleculeNet datasets

Physico- and quantum-chemical property prediction

The prediction of physicochemical properties is a common task in machine learning for chemistry, as the elucidation of these properties through either laboratory high-throughput screening or simulation-based computation is expensive and time-consuming at best and intractable at worst22,23. Hence, multiple approaches have been proposed to enable data-driven approximation of properties such as frontier molecular orbital energies, the molecular dipole moment, rotational constants or ionization potentials. During our exploratory study of our proposed architectures on the quantum-chemical benchmark datasets QM7 and QM8 (Table 1), we found that our simplest model (MSR1) showed better performance than both GIN(E) and D-MPNN on QM7 and better performance than GIN(E) on QM8. In this section, we further investigate the performance of our models on additional physico- and quantum-chemical prediction tasks using the OCELOT chromophore dataset24.

The OCELOT chromophore dataset contains chemically diverse π-conjugated molecules, meaning that our set-based architectures should, according to our hypothesis, perform better than graph-based architectures, as they put less emphasis on specific bonds24. For this dataset, we changed our hyperparameters to reduce the size of all our models to fewer than 100,000 parameters, as the relatively high number of samples and tasks, combined with the 5-fold cross-validation, requires non-trivial computational resources. The original study used the dataset to train a hierarchy of models that follow increasingly complex architectures, peaking with an MPNN for quantum chemistry25. In addition, the output of the MPNN was concatenated with precomputed molecular descriptors before the feedforward neural network. We denote this model MPNN+ in Table 2. Furthermore, a method based on a fingerprint-descriptor combination, which we denote ECFP2+, showed exceptional performance in the original study. ECFP2+ is a feedforward neural network that takes ECFP fingerprints with radius r = 2 concatenated with precomputed molecular descriptors as input.

Table 2 Benchmarking the implemented molecular set representation architectures against reported baselines on the OCELOT chromophore data containing chemically diverse π-conjugated molecules
Table 3 Benchmarking the implemented molecular set representation architectures against reported baselines on the data from Biogen ADME in vitro assays

Our most straightforward model, MSR1, performs as well as MPNN over all predicted properties (paired t-test, P = 0.529), as MPNN performs remarkably poorly on the highest occupied molecular orbital (HOMO) and HOMO–LUMO gap (H–L) tasks (LUMO, lowest unoccupied molecular orbital). However, both set-based models, MSR1 and MSR2, perform significantly worse than MPNN+ (paired t-test, P < 0.001 and P < 0.002, respectively). These results suggest that adding explicit bonds and introducing message passing does not significantly improve the predictive accuracy on the OCELOT dataset and that additional precomputed molecular descriptors are needed to improve performance; in practice, however, the choice of model may be influenced by the property of interest, given the poor performance of MPNN on specific tasks, namely, HOMO and H–L. Over all properties, the set-enhanced model SR-GINE performs significantly better than GINE (paired t-test, P < 0.0001) and not significantly differently from MPNN (paired t-test, P < 0.138) or MPNN+ (paired t-test, P < 0.409), yet without the pronounced outliers observed when predicting HOMO and the H–L gap with MPNN and without the additional precomputed RDKit two-dimensional descriptors required by MPNN+.
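
For reference, per-property comparisons such as those above can be computed with a paired t-test over matched task-level scores; the sketch below uses SciPy with placeholder values rather than the reported results.

```python
from scipy.stats import ttest_rel

# Per-task errors of two models on the same tasks (placeholder numbers, not reported results).
mae_model_a = [0.31, 0.27, 0.45, 0.22, 0.39]
mae_model_b = [0.33, 0.29, 0.44, 0.25, 0.41]

t_stat, p_value = ttest_rel(mae_model_a, mae_model_b)
print(f"paired t-test: t = {t_stat:.3f}, P = {p_value:.3f}")
```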

To compare our set-based approach to a current state-of-the-art chemical large language model, we finetuned the publicly available pretrained variant of MolFormer on OCELOT (‘Finetuning MolFormer’ in Methods). Interestingly, the finetuned MolFormer performed below expectations compared with the MoleculeNet benchmark. A reason for the comparatively lower performance could be a lack of relevant samples in the original training set of the publicly available variant of MolFormer26,27.

Together with the results on QM7 and QM8 (Table 1), the results of the OCELOT chromophore benchmark suggest that set representation learning can perform as well as graph representation learning on physico- and quantum-chemical prediction tasks. Furthermore, combining graph and set representation learning, as is done with SR-GINE, shows synergistic effects that result in overall enhanced performance. Specifically, the significant performance increase of SR-GINE over GINE indicates that a set representation layer in lieu of a global pooling function can improve existing GNN architectures.

Pharmacokinetic property prediction

A molecule’s pharmacokinetic properties play an essential role in designing and developing new therapeutics28. However, experimental pharmacokinetic data from in vitro and in vivo experiments is scarce, as it is expensive to generate, making its approximation a priority for machine learning research29,30. To evaluate our models’ performance on pharmacokinetic tasks, we use the dataset recently published by ref. 31. The dataset contains 3,521 commercially available compounds that were tested against the following endpoints in Biogen absorption, distribution, metabolism and excretion (ADME) in vitro assays: human liver microsomal stability (HLM, clearance in ml min−1 kg−1), MDR1–MDCK efflux ratio (permeability with Madin–Darby canine kidney cells transfected with MDR1), solubility at a pH of 6.8 (μg ml−1), rat liver microsomal stability (RLM, clearance in ml min−1 kg−1), human plasma protein binding (hPPB, percent unbound) and rat plasma protein binding (rPPB, percent unbound). Reference 31 provides baselines trained on the dataset for random forests (RF), gradient boosting (LightGBM), a hyperparameter-tuned D-MPNN and a hyperparameter-tuned D-MPNN plus precomputed RDKit two-dimensional descriptors (D-MPNN+)17. Values for additional models can be found in the original publication. As input for RF and LightGBM, ref. 31 used 1,024-bit binary functional connectivity fingerprints with radius 4 (FCFP4) concatenated with 316 two-dimensional descriptors that were precomputed using the RDKit package12. In addition, we finetuned the publicly available pretrained variant of MolFormer on the Biogen ADME dataset (‘Finetuning MolFormer’ in Methods). The benchmark results from the original publication are compared with ours and those of MolFormer in Table 3.
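
As an illustration of the fingerprint-plus-descriptor input used for the RF and LightGBM baselines, the sketch below builds a feature-based (functional-class) Morgan fingerprint concatenated with RDKit two-dimensional descriptors. The Morgan radius of 2 (corresponding to the conventional FCFP4 naming, that is, diameter 4), the bit size and the use of the full RDKit descriptor list are assumptions for illustration and not the exact configuration of ref. 31.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors


def baseline_features(smiles: str, n_bits: int = 1024) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    # Feature-based (functional-class) Morgan fingerprint.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits, useFeatures=True)
    fp_arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(fp, fp_arr)
    # Full RDKit 2D descriptor list as a stand-in for the curated descriptor set of ref. 31.
    desc = np.array([fn(mol) for _, fn in Descriptors.descList])
    return np.concatenate([fp_arr, desc])


print(baseline_features("CCO").shape)  # fingerprint bits plus descriptor values
```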

Over all pharmacokinetic endpoints, SR-GINE performs significantly better than the control GINE (paired t-test, P = 1.98 × 10−5), which uses the standard mean pooling instead of a set layer. Furthermore, the SR-GINE model performs better than the hyperparameter-tuned D-MPNN and D-MPNN+ models without relying on hyperparameter tuning or additional precomputed descriptors (RDKit two-dimensional descriptors). While MSR1 and MSR2 perform worse than D-MPNN(+) and SR-GINE, they do not perform significantly differently to GINE (paired t-test, P = 0.096 and P = 1.0, respectively). In other words, they perform as well as an out-of-the-box state-of-the-art GNN. The results of MSR1 compared with D-MPNN are interesting, as they contradict drug discovery and design-related findings from benchmarks on the MoleculeNet datasets, where MSR1 often compared favourably to D-MPNN, a prominent example being solubility prediction (Table 1). As the benchmarks found in MoleculeNet remain the most used and cited when benchmarking new GNN architectures, our results may show a need to adopt a different practice or update MoleculeNet with more recent datasets when assessing possible advantages of graph representation-based approaches. In addition, the difference in performance between graph- and set-based methods and that of the chemical large language model (MolFormer) is not as stark on the Biogen ADME dataset as it was for the MoleculeNet benchmarks. Indeed, MolFormer does not perform significantly better than SR-GINE (paired t-test, P = 0.158).

Finally, the renewed significant improvement of SR-GINE over GINE strengthens the case for improving predictive performance by enhancing GNN architectures with a set representation layer.

Binding-affinity prediction

Binding affinity is an important metric in biology and medicinal chemistry that measures the strength of a reversible association between biological macromolecules, such as proteins or DNA, and small-molecule ligands, such as drugs. It is, therefore, a central concept of rational drug design, where the potential efficacy of a drug is measured by its binding affinity to a known biological target implicated in a pathology the drug should treat32. Traditionally, simulation-based molecular docking techniques and their scoring functions overfit to use cases that adhere to rigid modelling assumptions, do not fully represent protein flexibility and do not directly account for solvent effects33,34. Non-parametric machine learning has been proposed as an alternative that infers, directly from experimental data, complex binding effects that are difficult to represent explicitly35. Therefore, a wide array of neural network architectures combined with a multitude of scoring functions have been proposed36,37,38,39,40.

However, representation learning methods—foremost molecular graph representation learning—have yet to reach the performance established by physics-informed methods such as PAMNet41, or RF and gradient-boosting-based approaches that make use of engineered features and precomputed molecular descriptors such as the ECFP-inspired extended connectivity interaction features (ECIF)42.

For our model used for binding-affinity prediction, we borrowed the architecture and all hyperparameters from MSR2. The encoding of the ligand–protein complex consists of creating one multiset of atoms for the ligand and one for the protein. The set L representing the ligand is constructed by iterating over the atoms of the ligand and adding those that are within a radius r of any atom in the protein to the set. The set M representing the protein is constructed in the same way, but with the roles of the ligand and protein reversed. The optimal value of r = 5.5 was determined by treating it as a hyperparameter and evaluating its effect on the validation set; this value supports the findings of ref. 42. All other hyperparameters were kept from the original MSR2 architecture.
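
A minimal sketch of this radius-based set construction is shown below, assuming the ligand and protein atom coordinates are available as NumPy arrays of shape (n_atoms, 3); the function name is illustrative and atom featurization is omitted.

```python
import numpy as np
from scipy.spatial import cKDTree


def interface_sets(ligand_coords: np.ndarray, protein_coords: np.ndarray, r: float = 5.5):
    """Return indices of ligand atoms near the protein and protein atoms near the ligand."""
    ligand_tree = cKDTree(ligand_coords)
    protein_tree = cKDTree(protein_coords)
    # Ligand atoms with at least one protein atom within radius r.
    ligand_idx = [i for i, nbrs in enumerate(protein_tree.query_ball_point(ligand_coords, r)) if nbrs]
    # Protein atoms with at least one ligand atom within radius r.
    protein_idx = [i for i, nbrs in enumerate(ligand_tree.query_ball_point(protein_coords, r)) if nbrs]
    return ligand_idx, protein_idx
```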

We evaluated our method based on the PDBbind splits and the metrics for basic GNN architectures reported by ref. 39. SR-BIND compares well to all graph drug–target affinity (GraphDTA) and GNN-based methods40 (Table 4). This result suggests that, compared with the distances between specific protein and ligand atoms, the molecular topology of the ligand plays a relatively minor role. It must be noted that the various GraphDTA methods do not use geometric information but rely on the more readily available amino acid sequence information for representing the protein.

Table 4 Performance of our set-based method SR-BIND compared against GNN-based methods

Reaction-yield prediction

Predicting outcomes of chemical reactions, such as their yield based on data gathered in high-throughput screening, is an important task in machine learning for chemistry. Previously, we were able to show that relatively simple fingerprint-based gradient-boosting models can perform at least as well as computationally expensive density functional theory (DFT) and transformer-based methods43. In cheminformatics, chemical reactions are often defined as two sets of molecules, reactants and products, where the reactants are fully or partially transformed into the products during the reaction process. This set-based definition of chemical reactions hints at a potential use for set representation learning. Again, we focused on the most straightforward implementation by creating a dual-set neural network, where the inputs are binary ECFP (with r = 3 and including stereochemistry) vectors of reactant and product molecules, respectively.
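
A minimal sketch of this dual-set reaction encoding is shown below; it assumes a reaction SMILES of the form reactants>agents>products, ignores agents for simplicity, and uses illustrative helper names.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def ecfp(smiles: str, radius: int = 3, n_bits: int = 2048) -> np.ndarray:
    """ECFP-style Morgan fingerprint with chirality, as described for MSR2-RXN."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits, useChirality=True)
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr


def reaction_to_sets(rxn_smiles: str):
    """Split a reaction SMILES into one set of reactant and one set of product fingerprints."""
    reactants, _, products = rxn_smiles.split(">")
    reactant_set = [ecfp(s) for s in reactants.split(".") if s]
    product_set = [ecfp(s) for s in products.split(".") if s]
    return reactant_set, product_set
```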

We evaluated this architecture, denoted MSR2-RXN, against the state of the art using a high-throughput dataset of Buchwald–Hartwig cross-coupling reactions with the task of predicting reaction yields, that is, the percentage of input material (reactants) that is transformed into output material (product). The baseline models include our previously introduced gradient-boosting and fingerprint-based method (DRFP), a DFT-based random-forest model (DFT), as well as the transformer-based model Yield-BERT (based on bidirectional encoder representations from transformers) and its augmented variant Yield-BERT (aug.)43. MSR2-RXN performs similarly to the state-of-the-art methods on the ablation study (Rand x/y) using random splits and on the out-of-distribution splits (test n), as shown in Table 5. Indeed, MSR2-RXN does not perform significantly differently from Yield-BERT, Yield-BERT (aug.) and DRFP (paired t-test, P = 0.283, P = 1.000 and P = 0.307, respectively) and performs significantly better than the DFT-based method (paired t-test, P = 0.008). In addition, while it also performs below a usable threshold on the Buchwald–Hartwig electronic laboratory notebook (ELN) dataset, it performs better than both the BERT- and GNN-based models, coming closer in performance to DRFP.

Table 5 Performance of MSR2-RXN compared with the state of the art in reaction-yield prediction on experimentally determined yields of Buchwald–Hartwig reactions through HTEs and extracted from an ELN

The results illustrate the power and flexibility of molecular set representation learning. While the other methods rely on DFT calculations, pretrained large language models or, in the case of DRFP, a custom molecular fingerprint, MSR2-RXN uses the simple and well-established ECFP encoding to represent the molecules participating in the reaction. Given the development of advanced and specialized alternatives to ECFP44, our set-based architecture provides a new avenue for improving machine learning for chemical reactions using diverse molecular representations.

Conclusion

With this initial foray into molecular set representation methodology, we were able to show competitive results of the technique across a wide range of use cases, including the prediction of quantum-chemical properties, pharmacokinetic properties, binding affinities and reaction yields, with minimal hyperparameter adjustments and consistently straightforward architectures. Our most straightforward model, MSR1, which is essentially a set of ECFP fingerprints with a radius of zero, performs as well as or better than D-MPNN on 5 out of 11 and better than GIN(E) on 9 out of 11 of the tested MoleculeNet benchmark datasets. With the poorer performance of the purely set representation-based methods MSR1 and MSR2 compared with SR-GINE, GINE and (D-)MPNN on more recent datasets, we have shown that the commonly used benchmarks provided by MoleculeNet and other cheminformatics and machine learning libraries may not be well suited to benchmark GNN architectures, as including explicit molecular topology does not seem to provide an advantage on them. Given researchers’ past and current reliance on these datasets when benchmarking new molecular deep learning architectures, we suggest that future architectures also be evaluated on other datasets, including those made available by refs. 24,31.

Furthermore, we showed that introducing a set representation layer in place of a global pooling function in a GNN (specifically a GINE) improves its performance in virtually all benchmarks and, in more recently released datasets, performs equal to or better than MPNN and D-MPNN without the need to introduce additional precomputed molecular descriptors. This insight may be used to extend and improve the performance of all currently used GNN-based molecular representation approaches. In addition, we introduced a set-based model for protein–ligand binding-affinity prediction, which allows for the introduction of implicit geometric information through a radius near-neighbour search between the protein and the bound ligand, allowing the model to perform better than existing graph-based approaches, which often cannot integrate such information easily. Finally, our conceptually naive set-based reaction-yield prediction model more than doubles the performance (R2) of YieldGNN on the ELN-extracted dataset of Buchwald–Hartwig cross-coupling reactions and matches the performance of established methods on the high-throughput experiment (HTE) data.

In addition, an initial comparison between SR-GINE and MolFormer, a large language model pretrained on approximately 100 million compounds, shows no significant differences in performance on recently released benchmark datasets.

Overall, we present results that introduce and back the value of considering molecular set representation learning as an additional important branch of machine learning in computational chemistry and cheminformatics, as it provides a relaxation of the explicit molecular graph topology used in graph representation learning, which we showed to improve and extend capabilities across a wide range of tasks. The results of our set-based and set-enhanced methods on the OCELOT chromophore and PDBbind datasets also support our initial hypothesis that a more relaxed definition of molecular topologies enables the neural network to learn a more meaningful representation of molecules involving conjugated or transient bonds.

Methods

Study design

Initially, we evaluated our general approach against the well-known MoleculeNet datasets, excluding datasets that would require a high amount of training time due to their size or number of tasks (ToxCast, MUV, PCBA). QM9 is used to evaluate the scalability of SR-GINE. Furthermore, we excluded PDBbind from our initial evaluation, as our base models do not support protein–ligand complexes.

After the initial experiments, we chose more modern and specific benchmark datasets, namely, the OCELOT chromophore dataset and a set of compounds evaluated against Biogen ADME in vitro assays. Furthermore, we introduced two additional set-based models to handle protein–ligand complexes from the PDBbind dataset and reactions from the Buchwald–Hartwig HTE and ELN datasets.

Choice of atom and bond invariants

The concept of atom and bond invariants has been borrowed from the Daylight atom invariants11 and allows us to represent each atom and bond in the molecule as a one-hot encoded vector.

In all models, the atom and bond invariants were chosen based on the choices by refs. 12,17. As shown in Fig. 1a, the atom invariants are (1) the degree (total number of bonds) of the atom; (2) the atomic number limited to 100 (elements above fermium are assigned the one-hot-encoded position 101); (3) the formal charge; (4) the hybridization state; (5) the chiral tag representing, if applicable, the chiral type such as tetrahedral or octahedral; (6) whether the atom is part of a ring; and (7) the total number of hydrogens bonded to the atom. Figure 1b shows the bond invariants, namely: (1) the bond type, (2) the stereochemistry of the bond, (3) whether the bond is part of an aromatic system, (4) whether the bond is part of a conjugated system, (5) the type of the first atom connected by the bond, and (6) the atomic numbers of the two atoms forming the bond.
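
The listed invariants map onto standard RDKit accessors; the sketch below shows how they can be read before one-hot encoding (the mapping of the bond invariants (5) and (6) to accessors is approximate, and encoding widths are omitted).

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin as an example

atom_invariants = [
    (
        atom.GetDegree(),         # (1) total number of bonds
        atom.GetAtomicNum(),      # (2) atomic number
        atom.GetFormalCharge(),   # (3) formal charge
        atom.GetHybridization(),  # (4) hybridization state
        atom.GetChiralTag(),      # (5) chiral tag
        atom.IsInRing(),          # (6) ring membership
        atom.GetTotalNumHs(),     # (7) total number of bonded hydrogens
    )
    for atom in mol.GetAtoms()
]

bond_invariants = [
    (
        bond.GetBondType(),                  # (1) bond type
        bond.GetStereo(),                    # (2) bond stereochemistry
        bond.GetIsAromatic(),                # (3) part of an aromatic system
        bond.GetIsConjugated(),              # (4) part of a conjugated system
        bond.GetBeginAtom().GetAtomicNum(),  # atomic numbers of the two bonded atoms, cf. (5) and (6)
        bond.GetEndAtom().GetAtomicNum(),
    )
    for bond in mol.GetBonds()
]

print(len(atom_invariants), len(bond_invariants))  # 13 heavy atoms, 13 bonds for aspirin
```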

Choice of set representation and GNN layers

Initially, we chose the set representation layer for MSR1 (which consists of only the set representation layer after a non-learned embedding based on atom invariants) by comparing the three architectures DeepSets, Set-Transformer and RepSet14,15,16. The hyperparameters were chosen based on defaults and findings from the initial publications of each method.

On the basis of the performance of the three different implementations of MSR1 (Extended Data Table 1) on the six datasets BACE, BBBP, ClinTox, ESOL, FreeSolv and Lipo, which were chosen for their relatively small size resulting in fast learning, we selected RepSet as our set representation layer of choice.

For the GNN layer, we implemented and evaluated GIN, GAT and GCN as graph embedding layers for our set representation extended graph neural network7,18,45. Again, we used default hyperparameters based on the original publications and, pairing each set representation layer with each GNN layer, evaluated different versions of our set representation-enhanced GNN. Using the same selection of benchmark datasets as with the set-representation layer selection, we evaluated the nine combinations of GNN and set representation layers (Extended Data Tables 2 and 3).

Overall, the results point towards a need for hyperparameter optimization depending on the combination of models. We therefore picked the combination based not on the performance shown in Extended Data Tables 2 and 3 but on the fact that we use RepSet in MSR1 and MSR2, and that GIN(E) is generally reported as a well-performing baseline architecture in chemistry tasks8,40,46.

Hyperparameter optimization

For the graph-based models (the GINE baseline and SR-GINE), we ran an initial simple hyperparameter search over both models using the BBBP dataset while keeping the set representation parameters for SR-GINE fixed at eight hidden sets with eight elements each. For each set of hyperparameters, six models were trained with different random seeds and their performance averaged. The best models were selected based on the lowest cross-entropy loss during validation. During this search, SR-GINE performed consistently better than the baseline (GINE). In a next step, we conducted a further simple grid search over the number of hidden sets and elements, as the authors of the original publication have shown the performance to be influenced by these parameters16. We chose 128 hidden sets with 64 elements each based on tests on the BBBP dataset.
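
A minimal sketch of such a seed-averaged grid search is shown below; the grid values are illustrative, and train_and_validate is a hypothetical stand-in for training and validating one model with a given configuration and seed.

```python
import itertools
import numpy as np


def train_and_validate(config, seed):
    """Hypothetical stand-in: train one model with `config` and `seed`, return validation loss."""
    rng = np.random.default_rng(seed)
    return rng.random()  # placeholder; a real implementation would train and evaluate a model


# Illustrative grid; the values reported in the text (for example, 128 hidden sets
# with 64 elements) are among the candidates.
grid = {"hidden_channels": [64, 128, 256],
        "hidden_sets": [8, 32, 128],
        "set_elements": [8, 16, 64]}
seeds = range(6)

best_config, best_loss = None, float("inf")
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    losses = [train_and_validate(config, seed=seed) for seed in seeds]
    if np.mean(losses) < best_loss:
        best_config, best_loss = config, float(np.mean(losses))

print(best_config, best_loss)
```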

Computational cost and scalability

We empirically investigated the SR-GINE approach’s scalability by comparing its training time and performance with the non-extended GINE architecture. The average runtime for training SR-GINE on the OCELOT chromophore dataset (Ntrain = 20,201) is 9 min 44 s ± 44 s, which is an increase of 6.2% compared with GINE (9 min 10 s ± 42 s). On the larger QM9 dataset (Ntrain = 107,108), the training time increases to 50 min 25 s ± 0 min 47 s for GINE and 62 min 25 s ± 3 min 27 s for SR-GINE, an increase of 23.8%. In the case of QM9, this increase in training time is close to the average increase in performance, which is 24.4% (Extended Data Table 4). Although these numbers suggest that SR-GINE can scale well in terms of a performance–training time trade-off, it is not guaranteed that this is the case for all datasets. However, our experiments throughout this study showed a substantial increase of performance of SR-GINE over GINE on small- to medium-sized datasets where the increase in training time was marginal, as shown with the example of OCELOT.

Overall, these observations agree with the evaluation of RepSet by ref. 16. The timing data were taken from training runs on a Nvidia RTX 4090Ti graphics processing unit with 12 GB random-access memory for GINE and SR-GINE.

Datasets and preprocessing

MoleculeNet datasets

The datasets that are often collected under the name MoleculeNet datasets, namely, HIV, BACE, BBBP, Tox21, SIDER, ClinTox, ESOL, FreeSolv, Lipo, QM7 and QM8, were split into train, validation and test sets based on Murcko scaffolds in accordance with refs. 17,20. Four other datasets from the MoleculeNet collection that are often used (ToxCast, MUV, PCBA, QM9) were omitted from the benchmarks, as they are either too large or have too many tasks to be processed with limited compute.
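
For reference, a simplified Murcko scaffold split can be constructed with RDKit as sketched below; the greedy assignment of whole scaffold groups is an illustrative simplification of the procedures used in refs. 17,20, and the function name and split fractions are assumptions.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles


def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    # Group molecule indices by their Bemis-Murcko scaffold SMILES.
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        groups[MurckoScaffoldSmiles(smiles=smi, includeChirality=False)].append(idx)

    # Greedily assign whole scaffold groups, largest first, to train, then validation, then test,
    # so that molecules sharing a scaffold never end up in different splits.
    train, valid, test = [], [], []
    n = len(smiles_list)
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```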

OCELOT

The OCELOT dataset was used, in accordance with the original publication24, with a fivefold split and the resulting five training–test iterations. In addition, a randomly sampled fraction of 0.2 of the training set was used as a validation set to facilitate early stopping.

Biogen ADME assay data

Reference 31 released their data with a predefined training/test split. No further processing of the data was necessary.

PDBbind

For PDBbind, the splits suggested by ref. 39 were used. The complexes found in the PDBbind core set were removed from the PDBbind refined set (of which core is a subset). PDBbind refined was subsequently used as a training set and PDBbind core as a test set.

Buchwald–Hartwig (HTE)

The data and splits were provided by ref. 47 in their original publication. Comparison results were taken from ref. 43. No further processing of the data was necessary.

Buchwald–Hartwig (ELN)

The data and splits were provided by ref. 48. No further processing of the data was necessary.

Finetuning MolFormer

For our experiments that include MolFormer benchmark data that were not available in the original publication, we finetuned the model on the respective datasets. Specifically, we finetuned the publicly available pretrained MolFormer model (checkpoint N-Step-Checkpoint_3_30000.ckpt) on the Biogen ADME and OCELOT datasets. The ADME and OCELOT models were finetuned for 500 and 300 epochs, respectively. For OCELOT, the number of epochs was reduced as we observed early plateauing. All other hyperparameters were chosen as suggested by ref. 21. Note that the finetuning required a substantial increase in hardware resources over the training of the graph- and set-based models due to a high memory requirement.