The SAR Table app has had the ability to match scaffolds to molecules for a while now, but as of the latest release (1.3.4), just submitted to the App Store, it will be able to match more than one scaffold at once.
The actual process of performing the scaffold match is provided by a webservice (molsync.com). The singular scaffold matching service takes a molecule and a fragment, and finds all of the substructure matches. If the scaffold fragment has decorations (e.g. R1, R2, etc.), they are assigned, and their values are decomposed accordingly; if there aren’t enough decorations on the scaffold already (e.g. none), the substituent labels are “bushwhacked”, which means that the service will add additional annotations, like R1, R2, etc., automatically.
For a lot of structure-activity series, the scaffold has no (permutational) symmetry, and never occurs more than once in any of the constituent molecules. This being the case, it is quite reasonable to expect the algorithm to go through and assign everything. Unfortunately that’s not always the case: all it takes is a rotatable bond, e.g. a pendant phenyl group with a substituent decoration in the ortho or meta positions:
All of a sudden the total number of ways to assign the scaffolds goes from 1 to 2^N, where N is the number of matched structures. Chances are, not all of these combinatorial possibilities are equally good, so it’s a bad idea to toss a coin; for example, if it is possible to find a solution in which R1 is always methyl or ethyl, and R2 is always some other interesting funky non-hydrocarbon substituent, then the structure-activity series will likely be a great deal more meaningful. Or not: in a drug design campaign, it’s the biology that dictates what chemistry is important, and how it should be viewed.
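To make the count concrete: if each of N molecules can be matched in two equivalent orientations, every extra molecule doubles the number of joint assignments, giving 2 to the power N. The snippet below is only an illustration of that arithmetic; the orientation labels and the function name are made up for the example and have nothing to do with the actual service.

```python
from itertools import product

# Hypothetical labels for the two ways a pendant phenyl ring can be matched.
ORIENTATIONS = ("as drawn", "ring flipped")

def enumerate_joint_assignments(n_molecules):
    """List every way of picking one orientation for each molecule."""
    return list(product(ORIENTATIONS, repeat=n_molecules))

# The count doubles with every additional molecule.
for n in (1, 2, 3, 10):
    print(n, len(enumerate_joint_assignments(n)))
```

Even a modest table of ten compounds has over a thousand combinations, which is why picking one at random is unlikely to give a tidy series.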
Back to scaffold matching from SAR Table: when a single scaffold can be matched in more than one way, sometimes the algorithm can throw away some of the results, but often the most useful thing to do is instead to rank them. So in the above rotatable phenyl example, the user would be presented with two choices, e.g. R1=methyl, R2=H as the first choice, and R1=H, R2=methyl as the second choice. The ranking might be biased by a precedent of previously made assignments, e.g. there is already a case where R1=methyl, so it makes sense to offer up the result that has one common feature.
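A rough sketch of what precedent-biased ranking could look like is shown here. It is not the code behind the webservice: the scoring rule (count the R-group values that have already been seen under the same label) and the names `rank_candidates` and `precedents` are assumptions chosen to mirror the example in the text.

```python
def rank_candidates(candidates, precedents):
    """Sort candidate assignments so that those agreeing with earlier
    choices come first. Each assignment is a dict such as
    {"R1": "methyl", "R2": "H"}."""
    def score(candidate):
        # One point for every label whose value was already used
        # under that same label in a previous assignment.
        return sum(
            1
            for label, group in candidate.items()
            if any(prev.get(label) == group for prev in precedents)
        )
    # sorted() is stable, so ties keep their original order.
    return sorted(candidates, key=score, reverse=True)

# The two orientations of the rotatable phenyl example.
candidates = [
    {"R1": "H", "R2": "methyl"},
    {"R1": "methyl", "R2": "H"},
]
# An earlier row where R1 was already assigned as methyl.
precedents = [{"R1": "methyl", "R2": "chloro"}]

ranked = rank_candidates(candidates, precedents)
print(ranked[0])
```

With no precedents at all, both candidates score zero and simply stay in the order they arrived, which corresponds to the case where the user has to pick.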
When the user interface operates one scaffold at a time, this works quite well: at each step, if there is only one useful result, the user can accept it as-is. If there is more than one, a choice must be made: pick the preferred result, then continue, and take comfort in the fact that this choice will be used to bias the rankings of further results.
Doing many scaffolds at once requires layering a bit of extra logic on top of the core machinery. When there is no degeneracy, it is easy. When the database does have degeneracy, it gets tricky: it is quite viable to come up with an algorithm that will analyse an entire collection of structures and search for some kind of “global optimum” where all the R-groups are lined up in a nice tidy manner (I published a paper on one such method back in 2009). For more than a year I’ve been sitting on an algorithm, accessible only from a command line tool that has not yet been made commercially available, that does just that.
The difficulty is this: the method works best when it combines the user’s ability to make the important decisions with the algorithm’s ability to grind through thousands of possibilities and make its own decisions when the answer is obvious. The best solution involves developing an elaborate dance: when the algorithm gets stumped, the user should make the call. For example, when the first assignment is being considered, the two ways of assigning {R1,R2}={methyl,H} are completely equivalent, so the user has to choose. Perhaps the second assignment will be {R1,R2}={methyl,chloro}, and the algorithm can make a good decision, based on the user’s previous decision. Perhaps the third one is {ethyl,chloro}, which the algorithm can decide based on the inferred second result. Perhaps the fourth one is {propyl,fluoro}, which is not obvious; perhaps the algorithm should skip that result for now, and see if some of the other inferred results come up with something to help with the decision; or perhaps it should see that chloro and fluoro are conceptually related; or perhaps it would be better to stop and ask the user.
Getting this user interface to work well on a mobile app that defers to a webservice to do the heavy lifting is not entirely simple. So after some delay, the way it is now implemented uses two slightly different methods, depending on whether one scaffold is being matched, or many. For the multi-match version, each submitted case reports one of three outcomes:
- no matches
- one match
- more than one match
Only cases with one match are accepted. The matching criteria are stricter in some ways: no new R-group labels will be created, so they must already be defined on the scaffold fragments. If there are degenerate results, then only the best results are considered, so it is possible for symmetrical scaffolds to provide a definitive answer. Also, when a row’s scaffold is empty (not provided), the matcher allows it to match any of the currently available scaffolds.
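The three outcomes and the accept-only-unique rule can be written down in a few lines. This is a sketch of the bookkeeping on the client side, under the assumption that each row comes back with a plain list of candidate assignments; the `Outcome` enum and both function names are invented for the illustration, and it does not model how the service picks the best results among degenerate ones.

```python
from enum import Enum

class Outcome(Enum):
    NO_MATCH = "no matches"
    ONE_MATCH = "one match"
    MANY_MATCHES = "more than one match"

def classify(matches):
    """Map a list of candidate assignments onto one of three outcomes."""
    if not matches:
        return Outcome.NO_MATCH
    if len(matches) == 1:
        return Outcome.ONE_MATCH
    return Outcome.MANY_MATCHES

def accept_unique(results_by_row):
    """Keep only the rows that came back with exactly one match."""
    return {
        row: matches[0]
        for row, matches in results_by_row.items()
        if classify(matches) is Outcome.ONE_MATCH
    }

# Hypothetical results for three rows of a table.
results_by_row = {
    1: [],
    2: [{"R1": "methyl", "R2": "chloro"}],
    3: [{"R1": "H", "R2": "methyl"}, {"R1": "methyl", "R2": "H"}],
}
accepted = accept_unique(results_by_row)
print(accepted)
```

Rows 1 and 3 are left untouched, so they can be revisited later, either one at a time or in another bulk pass.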
The overall workflow becomes a multi-step process, where the number and type of steps vary, sometimes unexpectedly:
- for the first case, draw out the scaffold substructure, and perform a single match; select the best result, and have it define the R-group substituent labels automatically
- select all rows, resubmit to the scaffold matcher
- apply non-degenerate results: fill in scaffold/R-group assignments
- if any results were degenerate, and remain unassigned, optionally resubmit the assignment, in case the level of degeneracy was inductively reduced
- examine the still-unassigned cases; assign them by filling in the scaffold and performing a single match, selecting the preferred outcome
- continue one at a time, or repeat the multiple match, if the degeneracy has been collapsed
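The steps above amount to a loop: seed one assignment by hand, submit everything else in bulk, apply the unique results, and resubmit for as long as that makes progress. Here is a minimal sketch of that loop. Everything in it is hypothetical: `toy_match_row` is a stand-in for the webservice that only knows about two-substituent rows, and its scoring rule (prefer the orientation that agrees with earlier assignments) is an assumption, not the real matcher.

```python
def toy_match_row(groups, precedents):
    """Stand-in matcher: return the best-scoring orientation(s) for a row
    carrying two substituents. Ties mean the row is still degenerate."""
    a, b = groups
    options = [{"R1": a, "R2": b}, {"R1": b, "R2": a}]
    scores = [
        sum(prev[label] == group
            for prev in precedents
            for label, group in option.items())
        for option in options
    ]
    best = max(scores)
    return [opt for opt, s in zip(options, scores) if s == best]

def bulk_assign(rows, match_row, seed, max_rounds=10):
    """Repeat the bulk match until a round assigns nothing new."""
    assigned = dict(seed)
    pending = set(rows) - set(assigned)
    for _ in range(max_rounds):
        # Every row in a round sees the same set of earlier assignments,
        # as it would in a single bulk submission.
        snapshot = list(assigned.values())
        unique = {}
        for row_id in pending:
            results = match_row(rows[row_id], snapshot)
            if len(results) == 1:
                unique[row_id] = results[0]
        if not unique:
            break
        assigned.update(unique)
        pending -= set(unique)
    return assigned, pending

rows = {
    1: ("methyl", "H"),
    2: ("methyl", "chloro"),
    3: ("ethyl", "chloro"),
    4: ("propyl", "fluoro"),
}
# Step one: the user resolves the first row by hand.
seed = {1: {"R1": "methyl", "R2": "H"}}
assigned, pending = bulk_assign(rows, toy_match_row, seed)
print(sorted(assigned), sorted(pending))
```

In this toy run, row 2 is resolved in the first pass, row 3 only in the second pass (once row 2 has fixed chloro as an R2 value), and row 4 stays pending for the user to decide.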
To conclude, it is now possible to bulk-assign a lot of scaffolds in one shot using the SAR Table app. If the scaffolds are asymmetric and non-degenerate, it’s very quick and easy. If not, then it can often be accomplished in just a few extra steps.