Always-on R/S and E/Z stereo labels with XMDS

The most recent addition to the OS X Molecular DataSheet (XMDS) desktop app is the calculation of stereochemistry labels as you edit, using the Cahn-Ingold-Prelog (CIP) designations (R/S, E/Z). The labelling can be switched off with a menu option, but since most people use software with its factory settings, this is more or less equivalent to having it hardcoded permanently.

Stereochemistry, and chiral centres in particular, is one of the most pernicious problems with cheminformatics data. It is easy to draw a carbon atom and forget to put in the wedge bonds that give an algorithm the clues it needs to figure out which of the two isomers it is (R/S). That opens the door to the annoying question of whether an unspecified chiral centre implies a mixture, an unknown, or a mistake. Even with the necessary fields to capture this information, the data is still only as good as the users who enter it.

Double bond stereochemistry (E/Z) is less of a problem when sketching a molecule, because the planar 2D coordinates are an intrinsic part of the layout, and people generally get it right. It does, however, have a way of getting lost or mangled when converting into formats that use other coordinate systems, especially line notations (which have no coordinates) and 3D conformations. And while valid 3D conformations never have unknown chirality, this is a mixed blessing, because if the source data was improperly specified, the 3D version may be asserting information that does not really exist.
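To make the point about planar coordinates concrete, here is a minimal sketch of how a cis/trans decision can be read straight off a 2D layout. It is written in Python for brevity and is not the XMDS code. The function names are hypothetical, and it assumes the higher-priority substituent at each end of the double bond has already been identified, which is the job of the CIP ranking.

```python
def cross(o, a, b):
    """z-component of the 2D cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def double_bond_label(a, b, sub_a, sub_b, eps=1e-6):
    """Return 'Z', 'E' or None for the double bond a=b.

    a, b   : (x, y) positions of the two double-bonded atoms
    sub_a  : (x, y) of the higher-priority substituent on a (assumed given)
    sub_b  : (x, y) of the higher-priority substituent on b (assumed given)
    """
    side_a = cross(a, b, sub_a)  # which side of the a->b line sub_a lies on
    side_b = cross(a, b, sub_b)  # same test for sub_b
    if abs(side_a) < eps or abs(side_b) < eps:
        return None  # substituent collinear with the bond: geometry is ambiguous
    return "Z" if side_a * side_b > 0 else "E"


# Both priority substituents drawn above the bond: same side.
print(double_bond_label((0, 0), (1, 0), (-0.5, 0.87), (1.5, 0.87)))
# One above, one below: opposite sides.
print(double_bond_label((0, 0), (1, 0), (-0.5, 0.87), (1.5, -0.87)))
```

Nothing here survives a conversion that throws the coordinates away, which is why line notations have to carry the same information as explicit markers.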

Having the CIP labels calculated and displayed discreetly is a handy feature in itself, but for a software product that is designed primarily as a content creation tool, it is useful to call attention to the presence or absence of stereochemistry as early as possible, e.g.:

[Image: xmds_stereo2]

In this not-very-sensible molecule (thus far), there is an unspecified chiral centre, which is clearly labelled as “R/S” to call attention to it. Whether the omission is deliberate or not is a decision for the operator to make. The double bond also carries the label E, which would not be there if either end of the alkene had two equivalent substituents.

From an implementation perspective, the trickiest part was not the stereochemistry perception (which was ported to Swift from SketchEl and com.mmi), but rather the placement of the labels onto the diagram, which has to be done in real time. As described in an article I published a couple of years ago (“Rendering Molecular Sketches for Publication Quality Output”, Molecular Informatics 2013, and other papers), all of the glyphs, lines and symbols that make up the rendered diagram are placed with a fairly complex algorithm. It goes as far as respecting the convex polygon boundaries of individual letters, to make sure that lines are just the right distance from element symbols, and that all of the things like hydrogens, counts, mass labels, charges and so on are in just the right place without overlapping anything. The same technique is used to decide where to place all of the stereo labels, which involves an awful lot of calls to functions that test for line crossings, rectangle overlaps and point-in-polygon. The best results are achieved with a certain amount of brute force, and given that even the latest versions of the Swift programming language are modestly performant at best, some optimising and benchmarking had to happen.
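For a feel of what those geometric tests look like, the following is a simplified sketch in Python (not the app's Swift code, and not the published algorithm). It assumes labels and existing glyphs are approximated by axis-aligned rectangles `(x, y, w, h)` rather than the convex glyph outlines described above, and the brute-force search over a few radii and angles is an illustrative stand-in. All names and default values are hypothetical.

```python
import math


def segments_cross(p1, p2, q1, q2):
    """True if segment p1-p2 properly crosses segment q1-q2."""
    def orient(a, b, c):
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (orient(q1, q2, p1) * orient(q1, q2, p2) < 0 and
            orient(p1, p2, q1) * orient(p1, p2, q2) < 0)


def rects_overlap(r1, r2):
    """True if two (x, y, w, h) rectangles share interior area."""
    return (r1[0] < r2[0] + r2[2] and r2[0] < r1[0] + r1[2] and
            r1[1] < r2[1] + r2[3] and r2[1] < r1[1] + r1[3])


def point_in_polygon(pt, poly):
    """Ray-casting test: count polygon edges crossed by a ray going right."""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def segment_hits_rect(p, q, rect):
    """True if segment p-q ends inside the rectangle or crosses its border."""
    x, y, w, h = rect
    corners = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
    if point_in_polygon(p, corners) or point_in_polygon(q, corners):
        return True
    return any(segments_cross(p, q, corners[i], corners[(i + 1) % 4])
               for i in range(4))


def place_label(anchor, size, occupied, bonds, radii=(0.3, 0.5, 0.7), steps=12):
    """Brute force: return the first candidate rectangle that is collision-free."""
    w, h = size
    for r in radii:
        for k in range(steps):
            theta = 2 * math.pi * k / steps
            rect = (anchor[0] + r * math.cos(theta) - w / 2,
                    anchor[1] + r * math.sin(theta) - h / 2, w, h)
            if any(rects_overlap(rect, o) for o in occupied):
                continue
            if any(segment_hits_rect(p, q, rect) for p, q in bonds):
                continue
            return rect
    return None  # no free spot among the candidates tried


# A label next to an atom at the origin, avoiding a horizontal bond.
print(place_label((0, 0), (0.2, 0.2), [], [((0, 0), (1, 0))]))
```

Even in this toy form, every candidate position is tested against every glyph and every bond, which hints at why the number of primitive calls grows quickly and why performance becomes a concern when the whole thing reruns on each edit.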

It should also be noted that, from a cheminformatics point of view, the R/S and E/Z labels are not part of the underlying data structure; rather, they are derived on demand each time the structure is modified. The atoms, coordinates and wedge bonds determine everything. Encoding stereo parity in a molecular structure is an inconvenient but necessary evil when translating into formats without atom coordinates, but this approach has no place in an editable sketch. As another bonus, calculating the CIP labels on demand means that it is not a catastrophic data corruption event if the algorithm is imperfect. Translating the CIP priority rules into an algorithm is one of those things that is not too hard for most examples, but there are some edge cases that you’d probably rather not think about…
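The idea that coordinates and wedges determine the label can be sketched with a signed-volume test. This is an illustrative Python sketch, not the XMDS implementation. It assumes exactly four explicit neighbours, positions given relative to the centre with the y axis pointing up, wedges that start at the centre, and CIP ranks supplied by the caller (rank 1 is the highest priority). Computing those ranks is the genuinely hard part and is not attempted here. The function names are hypothetical.

```python
def signed_volume(a, b, c, d):
    """Scalar triple product (a - d) . ((b - d) x (c - d)) for 3D points."""
    ax, ay, az = a[0] - d[0], a[1] - d[1], a[2] - d[2]
    bx, by, bz = b[0] - d[0], b[1] - d[1], b[2] - d[2]
    cx, cy, cz = c[0] - d[0], c[1] - d[1], c[2] - d[2]
    return (ax * (by * cz - bz * cy)
            + ay * (bz * cx - bx * cz)
            + az * (bx * cy - by * cx))


def chirality_label(neighbours, eps=1e-6):
    """Derive a label from a list of four (rank, x, y, wedge) tuples.

    wedge is +1 for a bond rising towards the viewer, -1 for a hashed bond
    pointing away, and 0 for a plain bond. Ranks are assumed inputs.
    """
    ranks = [n[0] for n in neighbours]
    if len(neighbours) != 4 or len(set(ranks)) != 4:
        return None  # not treated as a stereocentre in this sketch
    # Lift the 2D drawing into 3D by using the wedge type as a z offset.
    pts = [(x, y, float(wedge)) for _, x, y, wedge in sorted(neighbours)]
    vol = signed_volume(*pts)
    if abs(vol) < eps:
        return "R/S"  # flat drawing, no wedges: configuration unspecified
    return "R" if vol < 0 else "S"


plain = [(1, 0, 1, 0), (2, 1, 0, 0), (3, 0, -1, 0), (4, -1, 0, 0)]
wedged = [(1, 0, 1, +1), (2, 1, 0, 0), (3, 0, -1, 0), (4, -1, 0, 0)]
print(chirality_label(plain))   # no wedge drawn
print(chirality_label(wedged))  # highest-priority neighbour on a wedge
```

Because the label is recomputed from the drawing each time, nothing stored in the structure can go stale, and a bug in the ranking shows up as a wrong label rather than as corrupted data.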
