Activity Cliffs (ACs), generally defined as pairs of structurally similar molecules that are active against the same bio-target but differ greatly in binding potency, are of great interest to the drug discovery community. However, the AC prediction task, i.e., predicting whether a pair of similar molecules exhibits an AC relationship, has not yet been fully explored. This work introduces ACNet, a large-scale benchmark for Activity Cliff prediction. ACNet curates over 400K Matched Molecular Pairs (MMPs) against 190 targets, including over 20K MMP-cliffs and 380K non-AC MMPs, and provides five subsets for model development and evaluation. In the experiments, 15 baseline molecular property prediction models are evaluated as molecular representation encoders for the AC prediction task. The traditional fingerprint-based method outperforms the more complex deep learning models on these tasks, and the imbalanced, low-data and out-of-distribution features of the ACNet benchmark make it challenging for existing molecular property prediction models. Our work is the first large-scale benchmark for the AC prediction task, and we hope it will stimulate the study of AC prediction models and prompt further breakthroughs in AI-aided drug discovery.
The data of ACNet are collected from the publicly available ChEMBL database (version 28). Over 17 million activities, each of which records the binding affinity of a molecule against a target, are screened by the rules shown in the following figure. As a result, 142,307 activities are retained in our benchmark.
Next, to identify pairs of molecules exhibiting AC relationships against each target, the activities are processed separately for each target. The Matched Molecular Pair (MMP) is selected as the similarity criterion because of its wide use in previous AC prediction work. All possible MMPs are identified by the algorithm proposed by Hussain et al. Size restrictions on the substituents are also applied, following the rules in previous work. These restrictions keep the identified MMPs consistent with typical structural analogues in practice.
For each MMP, if the difference in potency is greater than 100-fold, the MMP is considered an MMP-cliff and given a positive label. If the potency difference is lower than 10-fold, the MMP is denoted a non-AC MMP and given a negative label. This criterion leaves a distinct margin between the potency differences of positive samples and those of negative samples, which limits the influence of observational error from the source assays. Based on the data collection method and screening rules above, we found a total of 21,352 MMPs exhibiting AC relationships and 423,282 negative non-AC MMPs.
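The labeling rule above can be sketched as a small function. This is only an illustration, not the code shipped with ACNet: the function name `label_mmp` is hypothetical, potencies are assumed to be given in linear concentration units (for example Ki in nM), and pairs falling inside the 10-fold to 100-fold margin are assumed to be excluded, which is what the margin implies.

```python
from typing import Optional

CLIFF_FOLD = 100.0      # potency ratio above which a pair is an MMP-cliff
NON_CLIFF_FOLD = 10.0   # potency ratio below which a pair is a non-AC MMP


def label_mmp(potency_a: float, potency_b: float) -> Optional[int]:
    """Label one MMP from the potencies of its two molecules.

    Assumption: both potencies are positive values in the same linear unit.
    Returns 1 for an MMP-cliff, 0 for a non-AC MMP, and None for pairs
    inside the margin, which are assumed to be left out of the benchmark.
    """
    if potency_a <= 0 or potency_b <= 0:
        raise ValueError("potencies must be positive")
    fold = max(potency_a, potency_b) / min(potency_a, potency_b)
    if fold > CLIFF_FOLD:
        return 1
    if fold < NON_CLIFF_FOLD:
        return 0
    return None


if __name__ == "__main__":
    print(label_mmp(1.0, 500.0))  # 500-fold difference
    print(label_mmp(5.0, 20.0))   # 4-fold difference
    print(label_mmp(1.0, 50.0))   # 50-fold difference, inside the margin
```

If potencies are stored on a negative logarithmic scale instead (for example pKi), the same rule corresponds to a difference of more than 2 log units for a positive label and less than 1 log unit for a negative label.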
In ACNet, each sample represents an MMP, and the label of each sample indicates whether it exhibits an AC relationship against a certain target. It is natural to organize the samples against different targets into different prediction tasks. To construct the dataset for each task, the positive and negative samples against the same target are gathered first. In this step, a threshold is applied to screen out tasks with extremely few positive samples, since so few positives carry little information about the task and make it too hard to train a deep learning model. The threshold can be adjusted in the user-customized configuration file; by default, tasks with fewer than 10 positive samples are discarded.
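As a rough sketch of this step, the samples can be grouped by target and the groups with too few positives dropped. The tuple layout and the names `build_tasks` and `min_positives` are assumptions made for illustration; they are not taken from the ACNet configuration file.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical sample layout: (target_id, smiles_a, smiles_b, label)
Sample = Tuple[str, str, str, int]


def build_tasks(samples: Iterable[Sample],
                min_positives: int = 10) -> Dict[str, List[Sample]]:
    """Group samples by target and keep tasks with enough positives.

    Tasks with fewer than `min_positives` positive samples are discarded,
    matching the default threshold of 10 described in the text.
    """
    by_target: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        by_target[sample[0]].append(sample)
    return {
        target: rows
        for target, rows in by_target.items()
        if sum(row[3] for row in rows) >= min_positives
    }
```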
In addition, positive samples are far fewer than negative ones in each task, so the tasks in the ACNet benchmark are generally imbalanced. When constructing the dataset of a task, users can choose either to use all negative samples, giving an imbalanced dataset, or to randomly select a portion of the negative samples, giving a relatively balanced one. By default, we use all of the negative samples for all tasks to generate imbalanced datasets.
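The balanced option could look like the following sketch, which keeps every positive sample and draws a random subset of the negatives. The `ratio` and `seed` parameters are hypothetical; the text does not say how many negatives the balanced setting keeps.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical sample layout: (smiles_a, smiles_b, label)
Sample = Tuple[str, str, int]


def subsample_negatives(samples: Sequence[Sample],
                        ratio: float = 1.0,
                        seed: int = 0) -> List[Sample]:
    """Keep all positives and about `ratio` negatives per positive.

    Assumption: `ratio` is an illustrative knob, not an ACNet setting.
    A fixed seed makes the random selection reproducible.
    """
    positives = [s for s in samples if s[-1] == 1]
    negatives = [s for s in samples if s[-1] == 0]
    n_keep = min(len(negatives), round(ratio * len(positives)))
    rng = random.Random(seed)
    return positives + rng.sample(negatives, n_keep)
```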
Under the default configuration, ACNet contains MMPs against 190 targets, i.e., 190 tasks, and the number of samples per task ranges from 36 to 26,376. Because the number of tasks is large and the data volume varies greatly between tasks, we divide the original 190 tasks into several groups according to task size to make model evaluation and comparison more convenient. Information on the subsets is shown in the following table. The thresholds for dividing tasks into subsets can also be customized in the configuration file. For convenience, we refer to the Large, Medium and Small subsets collectively as ordinary subsets in the following.
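Grouping tasks by size amounts to a threshold lookup, sketched below. The boundary values in `PLACEHOLDER_BOUNDS` are invented for illustration and are not the thresholds used by ACNet; the actual values are those given in the table and the configuration file.

```python
from typing import Dict, List, Sequence, Tuple

# Placeholder boundaries (subset name, minimum number of samples), ordered
# from the largest minimum to the smallest. These are NOT ACNet's values.
PLACEHOLDER_BOUNDS = [("Large", 1000), ("Medium", 100), ("Small", 0)]


def group_tasks_by_size(task_sizes: Dict[str, int],
                        bounds: Sequence[Tuple[str, int]]
                        ) -> Dict[str, List[str]]:
    """Assign each task to the first subset whose minimum size it reaches."""
    groups: Dict[str, List[str]] = {name: [] for name, _ in bounds}
    for task, size in task_sizes.items():
        for name, minimum in bounds:
            if size >= minimum:
                groups[name].append(task)
                break
    return groups
```

A task smaller than every minimum is left out of all groups, so the last boundary should be 0 if every task must be assigned.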
The statistical information of the benchmark is shown in the following figure. The ACNet benchmark is clearly both imbalanced and low-data.
We further provide a Domain Generalization dataset, which adds an out-of-distribution (OOD) feature to the ACNet benchmark. Specifically, we propose an extra Mix subset in which the samples against all targets are organized into a single task to construct a mixed dataset. To avoid ambiguity, MMPs with conflicting labels against different targets are discarded. The number of samples in the Mix subset is 278,367. To force deep models to learn common knowledge from the Mix subset, a target splitting method is proposed, analogous to the scaffold splitting method used in molecular property prediction tasks. Specifically, when the Mix subset is split into train/valid/test sets, samples against the same target must be placed in the same set.
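A target split can be sketched as a group-wise split in which whole targets, rather than individual samples, are assigned to a set. The sample layout, the function name `target_split`, the 80/10/10 fractions and the random assignment of targets are all assumptions for illustration; the text does not specify how targets are distributed across the three sets.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical sample layout: (target_id, smiles_a, smiles_b, label)
Sample = Tuple[str, str, str, int]


def target_split(samples: Sequence[Sample],
                 fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),
                 seed: int = 0) -> Dict[str, List[Sample]]:
    """Split samples so that no target appears in more than one set.

    The fractions apply to the number of targets, not the number of
    samples, so the resulting set sizes can differ from the fractions.
    """
    targets = sorted({s[0] for s in samples})
    random.Random(seed).shuffle(targets)
    n_train = int(fractions[0] * len(targets))
    n_valid = int(fractions[1] * len(targets))
    assignment = {}
    for i, target in enumerate(targets):
        if i < n_train:
            assignment[target] = "train"
        elif i < n_train + n_valid:
            assignment[target] = "valid"
        else:
            assignment[target] = "test"
    splits: Dict[str, List[Sample]] = {"train": [], "valid": [], "test": []}
    for sample in samples:
        splits[assignment[sample[0]]].append(sample)
    return splits
```

Because every test-set target is unseen during training, a model can score well only by learning patterns that transfer across targets, which is the point of the domain generalization setting.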
To evaluate the benchmark, we develop a simple baseline framework for the AC prediction task. The structure of the baseline framework is shown in the following figure.
A molecular property prediction model is used as an encoder to extract the representations of the two compounds in each MMP. The representations of the two molecules are then concatenated, and an MLP is used as the head for AC prediction. In total, 15 molecular property prediction models are evaluated. The experiments in which each of these baseline models participates are shown in the following table.
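The encoder-concatenate-MLP structure can be illustrated with a dependency-free forward pass. This is a sketch under stated assumptions, not the ACNet implementation: `toy_encoder` is a stand-in that counts characters of a SMILES string (it is not ECFP or any of the 15 baseline models), the class name `PairClassifier` is hypothetical, and the weights are random and untrained.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def toy_encoder(smiles: str, dim: int = 8) -> Vector:
    """Stand-in encoder: bucketed character counts of a SMILES string."""
    vec = [0.0] * dim
    for ch in smiles:
        vec[ord(ch) % dim] += 1.0
    return vec


class PairClassifier:
    """Encode both molecules, concatenate, and apply a one-hidden-layer MLP."""

    def __init__(self, encoder: Callable[[str], Vector], in_dim: int,
                 hidden_dim: int = 16, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.encoder = encoder
        # The first layer takes the concatenated pair, hence 2 * in_dim inputs.
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(2 * in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def predict_proba(self, smiles_a: str, smiles_b: str) -> float:
        """Return the (untrained) probability that the pair is an MMP-cliff."""
        pair = self.encoder(smiles_a) + self.encoder(smiles_b)  # concatenation
        hidden = [max(0.0, sum(w * x for w, x in zip(row, pair)) + b)  # ReLU
                  for row, b in zip(self.w1, self.b1)]
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


if __name__ == "__main__":
    model = PairClassifier(toy_encoder, in_dim=8)
    print(model.predict_proba("CCO", "CCN"))
```

Swapping in a different encoder only requires a callable that maps a molecule to a fixed-length vector, which is how a single framework can host many molecular property prediction models.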
The performance of the molecular property prediction models under the baseline framework on the ACNet benchmark is shown in the following tables.
The ECFP+MLP model achieves striking performance, outperforming all of the more complex deep models. The reason is that ECFP has a natural advantage in extracting representations of the similar molecules in MMPs.
Because the data in the Few subset are so limited, deep models cannot be trained on it from scratch. Following the pretrain-finetune paradigm in few-shot learning, self-supervised pre-trained models (PTMs) are used here.
The results further verify the superiority of ECFP as a molecular representation for MMP-cliff prediction. In contrast, although the SOTA PTMs have been trained on large amounts of unlabeled data, all of them except the S.T. model perform worse than ECFP, which reveals the difficulty of the Few subset.
With the target splitting method, the ECFP+MLP model fails. Although the GCN model achieves the best performance on this domain generalization task, its AUC of 0.579 gives no grounds to assume that the model has learnt the common latent mechanism behind the AC phenomenon. Moreover, even the SOTA model Graphormer does not generalize well on this task. These findings show that the Mix subset of ACNet is highly challenging for deep learning models.