Recent efforts on the subject of model building are getting close to fruition. As mentioned a few months back when I presented a CINF webinar, the SAR Table app has a not-yet-released feature that allows the current set of structures and their activities (“responses”) to be packed off to a webservice, which proceeds to construct a model based on structural features, then predicts values for any structures that lack a measurement for that particular property.
Since then, the back-end has been redesigned to use a more interesting, and hopefully more effective, approach. The models that get built are based on the idea that typed subgraphs can be used as descriptors in an additive, linear combination. This approach has been used frequently in the cheminformatics literature, and while it is not the most sophisticated method for modelling a response, it has the advantage of being chemically interpretable (e.g. each occurrence of C-C-C is worth +0.015 units of log P). The trick is then to find a collection of typed subgraphs and assign each of them a contributing weight. Once this is accomplished, a prediction can be made quite quickly by counting the number of occurrences of each of these subgraphs and adding up the contributions.
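To make the arithmetic concrete, here is a minimal sketch of how such an additive model issues a prediction; the subgraph keys and weight values are invented for illustration, not taken from any real model:

```python
# Minimal sketch of an additive subgraph-count prediction. The subgraph
# keys and weights are invented placeholders, not real fitted values.
def predict(subgraph_counts, weights):
    """subgraph_counts: occurrences of each subgraph in one molecule;
       weights: fitted contribution per occurrence of each subgraph."""
    return sum(weights[key] * count
               for key, count in subgraph_counts.items()
               if key in weights)

# e.g. 4 occurrences of C-C-C at +0.015 each, plus 1 of C-C=O at -0.21
logp = predict({"C-C-C": 4, "C-C=O": 1},
               {"C-C-C": 0.015, "C-C=O": -0.21})
print(logp)  # 4*0.015 - 0.21 = -0.15 (approximately)
```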
Not knowing which subgraphs to use, or what weights to assign to them, is of course a hugely combinatorial problem. The approach that I’m describing uses a “genetic algorithm” to carry out the search. It’s a fancy term with a lot of literature behind it, but a genetic algorithm is conceptually quite simple. Instead of working on a single candidate and trying to refine it into the best result, the method operates on a collection (“population”) of candidates. Each iteration (a “generation”) involves randomly modifying (“mutating”) or merging (“breeding”) candidates, some number of which are kept on for the next iteration. The idea is to give the overall optimisation a chance to keep some less promising candidates around for a while, just in case they are a transitional phase moving towards something good. The losers get ditched sooner or later if they don’t pan out.
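In skeletal form, the loop looks something like the sketch below; the score, mutate and breed functions are placeholders standing in for operations on {subgraphs + weights} candidates:

```python
import random

# Generic genetic algorithm loop, just to pin down the terminology; the
# score/mutate/breed callables are placeholders, not the real operators.
def evolve(population, score, mutate, breed, generations=100, keep=20):
    for _ in range(generations):                        # each "generation"
        offspring = [mutate(random.choice(population)) for _ in range(keep)]
        offspring += [breed(*random.sample(population, 2)) for _ in range(keep)]
        ranked = sorted(population + offspring, key=score, reverse=True)
        # keep the winners, plus a couple of losers in case they turn out
        # to be a transitional phase moving towards something good
        population = ranked[:keep] + random.sample(ranked[keep:], 2)
    return max(population, key=score)
```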
The function being optimised is {set of subgraphs} + {weight of each subgraph}, which generates a prediction for each of the inputs, so a deviation can be measured. The subgraphs are initially chosen to be the union of all subgraphs up to size 6 (including all possible branching modes, but not encoding ring junctures), typed by atomic element and bond order (with aromatic being distinct). The scoring method encourages the number of subgraph descriptors to be reduced, ideally to fewer than the number of responses, which improves the general applicability of the model and curbs the opportunity to overtrain on the input data. The number of subgraphs can be reduced by merging them together. For example, C-C-C and C-C=N can be merged into C-C[-=][CN], where the last node can match either carbon or nitrogen, and the last bond can match either single or double. The iteration method gives slightly higher priority to tuning the weights of the subgraphs, then coalesces them into such “query subgraphs” as the iterations proceed.
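As a sketch of what that coalescing step might look like (the list-of-sets representation is my own illustration, not the service’s internal format):

```python
# Illustrative coalescing of two typed linear subgraphs into one "query
# subgraph" by taking the union of allowed labels at each position.
def merge_paths(path_a, path_b):
    """Each path alternates atom-label sets and bond-label sets,
       e.g. C-C-C is [{'C'}, {'-'}, {'C'}, {'-'}, {'C'}]."""
    assert len(path_a) == len(path_b), "only equal-length paths can merge"
    return [a | b for a, b in zip(path_a, path_b)]

ccc = [{'C'}, {'-'}, {'C'}, {'-'}, {'C'}]
ccn = [{'C'}, {'-'}, {'C'}, {'='}, {'N'}]
print(merge_paths(ccc, ccn))
# [{'C'}, {'-'}, {'C'}, {'-', '='}, {'C', 'N'}]  i.e. C-C[-=][CN]
```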
All of this takes some effort to tune for practical use, and of course to dig out the multitude of bugs (producing inconsistent results, terminating too soon or not at all, etc.), but it seems to be approaching the point where it can be released for use by the early and the brave. Once it is all working reliably and effectively, you should expect to see some more rigorous documentation, but until then it will be an experimental black box that might just be able to provide some value before it reaches maturity.
The objective is to make this method (which is fully automated and requires no parameters) applicable in real time to the kinds of small datasets supported by the SAR Table app (typically <100 compounds), as well as to larger databases (which might run overnight). The resulting models are concise documents (about the size of a wordy email) regardless of the source, so once constructed, they can be embedded in a datasheet and utilised whenever they are needed.
Which brings us back to the SAR Table app itself: the image on the top right shows an iteration in progress, with the predictions plotted against the original responses, and an R-squared value to give an indication of quality. The R2 value doesn’t always keep going up, because the algorithm is trying to reduce the number of subgraphs (aka descriptors) as well as maintain the best possible match, ultimately making its best effort to avoid overfitting.
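For reference, the R-squared shown is the standard coefficient of determination, which can be computed as in this short sketch (textbook formula, not code from the app itself):

```python
# Coefficient of determination between measured and predicted responses.
def r_squared(measured, predicted):
    mean = sum(measured) / len(measured)
    ss_res = sum((y - f) ** 2 for y, f in zip(measured, predicted))
    ss_tot = sum((y - mean) ** 2 for y in measured)
    return 1 - ss_res / ss_tot

print(r_squared([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # 0.97
```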
The following screenshot is taken from a table that contains mostly compounds with measured anti-TB activity, as well as a handful of compounds that were hypothesised, and so have no activity measurement:
The activity column is colour-coded using the traffic-light scale (green = highly active, yellow = mediocre, red = inactive). Entries with actual values show the number itself, together with a full rectangular colour-coded indication of activity. Those with only a wedge-shaped half-rectangle are showing a chromatic indication of the predicted activity. In this case the compounds that I chose, made up of preexisting scaffold/substituent fragments, are predicted to fare fairly poorly, but the two entries with a yellow prediction colour might be worth looking into.
Building a model with a few dozen entries can take a couple of minutes. The model building is all done by a webservice (running on molsync.com), and the technical approach is of some interest. Eager to avoid storing state on the web server, or allowing calculations to be initiated then abandoned, the interplay is done using a chunk/push sequence: the app sends down the molecules and responses to initiate a new job. The server performs the first round of the genetic algorithm, then sends back a compressed snapshot of its state. As long as the app still has an internet connection, and the user has not hit the cancel button or answered a phone call, the app receives the update, displays the results, then sends back a request to do some more iterations. The server picks up where it left off, does as many iterations as it can before a timeout, then sends back the partial results. This way the user gets visual feedback and cancellation privileges, and the server doesn’t need to store any extra data, or worry about unwanted tasks grinding away with nowhere to go when they’re finished. It hits the sweet spot for cloud services, and the modest transmission lag between iterations is well worth it.
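A client-side sketch of that exchange might look like the following; the endpoint names, JSON fields and helper stubs are all invented for illustration, not the actual molsync.com API:

```python
import requests

URL = "https://example.org/modelbuild"   # placeholder, not the real endpoint

def user_cancelled():      # stub: in the app, this checks the cancel button
    return False

def show_progress(state):  # stub: in the app, predictions get plotted
    print("R2 so far:", state["r2"])

# initiate a new job by sending the molecules and responses
state = requests.post(URL + "/start", json={
    "molecules": ["c1ccccc1O", "CCO"],   # e.g. SMILES
    "responses": [1.46, -0.31],
}).json()

# each round: display the update, then hand the compressed snapshot back
# so the server can pick up where it left off; nothing is kept server-side
while not state["done"] and not user_cancelled():
    show_progress(state)
    state = requests.post(URL + "/iterate",
                          json={"snapshot": state["snapshot"]}).json()
```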
Once complete, the model is stored inside the document, so predictions can be reissued quickly, without having to rebuild the model.
The next major revision of SAR Table (1.3) should be ready for release quite soon, and the model building feature will be available. The predictions are also displayed in the matrix view, though at the moment only for cells that have a corresponding structure that happens to lack an activity. There are quite a few new features for the app that could make use of prediction, and the obvious one would be to extend the matrix view so that all of the empty cells show predicted responses for the hypothetical structures that could be created… and to make it easy to simply tap on one of the cells to create a new composite structure (e.g. R1=methyl, R2=phenyl: create a new compound with these R-groups). Now that’s a way to illustrate interactive computer-assisted drug design on a mobile app.