Continuing the series on reaction-based cheminformatics features on the MolSync website, the final missing search piece has now been implemented: reaction similarity. It works analogously to the transform feature: if you draw just one side of the reaction, or draw both sides without providing atom-to-atom mapping information, the search simply uses the molecules as-is. If you do provide atom mapping, though, the search gets a whole lot more specific.
The query shown above, when used as a transform search, will match exactly one reaction in the current database, which happens to be from ChemSpider Synthetic Pages #43 (manually curated, with significant effort – though decreasingly so as the XMDS tool gets better):
When the similarity search type is selected, this one comes out on top, with a bunch more underneath:
The way the similarity-with-transform metrics are done is a very new method – coded up this morning, in fact. It doesn’t introduce any groundbreaking new ideas in cheminformatics, but it basically works thusly:
- the two sides of the reaction are analysed, and every atom-to-atom pair in which the atom is not found to be in the exact same environment on both sides is considered to be part of the transform
- for atoms in the transform, the method used to find the initial atom identity value for ECFP-type fingerprints is applied to both sides, and the two values are combined into a single hybrid identity
- ECFP6 fingerprints are calculated for both sides, except that atoms in the transform have their identity replaced with the hybrid identity, i.e. the central transform atoms have a unique starting point value
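The three steps above can be sketched in a few dozen lines of plain Python. This is only an illustration under stated assumptions, not the MolSync code: the `Side` class, the `h` helper (CRC32 standing in for the real hash), the definition of "same environment" as element plus immediate bonded neighbours, the two-property seed identity, and the way the two seed values are hashed together into a hybrid are all made up for the example. It also skips the duplicate-substructure removal that a full ECFP implementation performs.

```python
import zlib
from dataclasses import dataclass


def h(*parts):
    """Deterministic 32-bit hash of a tuple (stand-in for the real ECFP hash)."""
    return zlib.crc32(repr(parts).encode())


@dataclass
class Side:
    """One side of a reaction as a single (possibly disconnected) graph."""
    elements: list  # element symbol per atom
    maps: list      # atom-to-atom mapping number per atom, 0 = unmapped
    bonds: dict     # {(i, j): bond order}

    def neighbours(self, i):
        out = []
        for (a, b), order in self.bonds.items():
            if a == i:
                out.append((b, order))
            elif b == i:
                out.append((a, order))
        return out


def environment(side, i):
    # Assumption: "same environment" means same element and same bonded
    # neighbours (bond order, element, mapping number).
    nbrs = sorted((o, side.elements[j], side.maps[j]) for j, o in side.neighbours(i))
    return (side.elements[i], tuple(nbrs))


def transform_maps(lhs, rhs):
    """Step 1: mapping numbers whose environment differs between the sides."""
    env_l = {m: environment(lhs, i) for i, m in enumerate(lhs.maps) if m}
    env_r = {m: environment(rhs, i) for i, m in enumerate(rhs.maps) if m}
    return {m for m in env_l.keys() & env_r.keys() if env_l[m] != env_r[m]}


def initial_identity(side, i):
    # Simplified: real ECFP seeds use more atom properties than these two.
    return h(side.elements[i], len(side.neighbours(i)))


def ecfp(side, ids, radius=3):
    """ECFP-style growth; radius 3 corresponds to diameter 6 (ECFP6)."""
    fp = set(ids)
    for it in range(1, radius + 1):
        ids = [
            h(it, ids[i], tuple(sorted((o, ids[j]) for j, o in side.neighbours(i))))
            for i in range(len(ids))
        ]
        fp.update(ids)
    return fp


def reaction_fingerprints(lhs, rhs):
    core = transform_maps(lhs, rhs)
    base_l = [initial_identity(lhs, i) for i in range(len(lhs.elements))]
    base_r = [initial_identity(rhs, i) for i in range(len(rhs.elements))]
    by_map_l = {m: base_l[i] for i, m in enumerate(lhs.maps) if m}
    by_map_r = {m: base_r[i] for i, m in enumerate(rhs.maps) if m}
    # Step 2: hybrid identity built from the seed values on both sides.
    hybrid = {m: h("transform", by_map_l[m], by_map_r[m]) for m in core}
    # Step 3: transform atoms start from the hybrid value, the rest as usual.
    start_l = [hybrid.get(m, b) for m, b in zip(lhs.maps, base_l)]
    start_r = [hybrid.get(m, b) for m, b in zip(rhs.maps, base_r)]
    return ecfp(lhs, start_l), ecfp(rhs, start_r)


# Toy example: bromoethane -> ethanol, heavy atoms only.
bonds = {(0, 1): 1, (1, 2): 1}
lhs = Side(["C", "C", "Br"], [1, 2, 0], bonds)
rhs = Side(["C", "C", "O"], [1, 2, 0], bonds)
core = transform_maps(lhs, rhs)
fp_lhs, fp_rhs = reaction_fingerprints(lhs, rhs)
```

In the toy example only the carbon with mapping number 2 changes its neighbours (Br on one side, O on the other), so it is the only transform atom. If all mapping numbers are set to 0, `transform_maps` returns an empty set and the result falls back to ordinary per-side fingerprints, which mirrors the "molecules as-is" behaviour for unmapped queries.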
What this means is that for most of the structure, the usual ECFP6 method applies, but for the core transform, the starting identifiers are unique to the transform itself (and comparable to all other reactions with the same core transform). Atoms close enough to the core transform to share an ECFP6 environment with it (the largest environments span a diameter of six bonds) will have some of their fingerprints perturbed in a transform-specific way. The general idea is that any two reactions with the same transform will get a boost in similarity, and from then on the comparison will be up to the various other substituents and components in the scheme. A reaction that has a transform that is similar, but not quite the same, will get a smaller boost because of the transform. A reaction for which the core transform is completely different will have a set of fingerprints that is much less similar than if the structures were all treated independently.
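To make the boost-and-penalty effect concrete, here is a toy comparison. It assumes the fingerprints are compared with a Tanimoto coefficient, which the description above does not actually state, and the fingerprint members are hypothetical labels rather than real hash codes.

```python
def tanimoto(a: set, b: set) -> float:
    """Shared features divided by total distinct features."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical feature labels: "core"/"shell" stand for transform-derived
# codes, "sub" for ordinary substituent codes away from the transform.
query      = {"core:A", "shell:A+1", "sub:methyl", "sub:phenyl"}
same_core  = {"core:A", "shell:A+1", "sub:ethyl", "sub:phenyl"}
other_core = {"core:B", "shell:B+1", "sub:methyl", "sub:phenyl"}

sim_same = tanimoto(query, same_core)    # 3 shared out of 5 distinct
sim_other = tanimoto(query, other_core)  # 2 shared out of 6 distinct
```

Even though `other_core` has exactly the same substituents as the query, it scores lower than `same_core`, because none of its transform-derived features match.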
At this point I’m not entirely sure how well these effects play off against each other, e.g. is the weight of the transform part vs. the rest of the molecule too high or too low? Not sure just yet. The general thinking is that if you want to look for only things with the same transform, there’s a search type specifically for that. Transform similarity has more wiggle room.