For anyone interested in the representation of small molecules, I would implore you to take a look at my latest paper in the Journal of Chemical Information and Modeling:
This article addresses a long-standing shortcoming in the cheminformatics cottage industry: the inadequacy of common file formats to reliably represent structures of non-organic compounds.
Not to spoil the punchline, but the synopsis is that current formats like MDL Molfile (which has been obsolete for a long time, see Why Not to Use MDL MOL/SDF) were designed to represent drug-like molecules, and just coincidentally happen to work adequately for a few others. Because of their limitations, they keep the study of cheminformatics confined to a subset of organic molecules. The problem can be solved quite easily, simply by allowing an additional bond type (zero order) and adding an additional atom property to control the automatic addition of hydrogen atoms. These two simple enhancements open the door to representing pretty much any molecular species that it makes sense to compose out of a graph of atoms and bonds.
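To make the two enhancements concrete, here is a minimal Python sketch of a molecular graph that permits zero-order bonds and a per-atom hydrogen override. It is not the SketchEl or Molfile data model: the class names, the `explicit_h` field and the tiny `DEFAULT_VALENCE` table are all assumptions made up for illustration, and the hydrogen rule is deliberately simplified (it ignores charges).

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative default valences only; a real toolkit uses a much fuller table.
DEFAULT_VALENCE = {"H": 1, "B": 3, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


@dataclass
class Atom:
    element: str
    # None means "work out the hydrogens automatically";
    # an integer switches the automatic calculation off for this atom.
    explicit_h: Optional[int] = None


@dataclass
class Bond:
    a: int      # index of first atom
    b: int      # index of second atom
    order: int  # 0, 1, 2 or 3; zero-order bonds are allowed


@dataclass
class Molecule:
    atoms: List[Atom] = field(default_factory=list)
    bonds: List[Bond] = field(default_factory=list)

    def add_atom(self, element: str, explicit_h: Optional[int] = None) -> int:
        self.atoms.append(Atom(element, explicit_h))
        return len(self.atoms) - 1

    def add_bond(self, a: int, b: int, order: int) -> None:
        self.bonds.append(Bond(a, b, order))

    def hydrogen_count(self, idx: int) -> int:
        atom = self.atoms[idx]
        if atom.explicit_h is not None:
            return atom.explicit_h  # the override wins
        valence = DEFAULT_VALENCE.get(atom.element)
        if valence is None:
            return 0  # no default valence known (e.g. most metals)
        # A zero-order bond connects two atoms but uses up no valence.
        used = sum(b.order for b in self.bonds if idx in (b.a, b.b))
        return max(0, valence - used)


# Ammonia borane drawn with a zero-order N-B bond: both ends keep three hydrogens.
dative = Molecule()
n = dative.add_atom("N")
b = dative.add_atom("B")
dative.add_bond(n, b, 0)

# The same skeleton with an ordinary single bond and no charges loses a hydrogen
# at each end, which is the broken valence that charge-pushing tries to patch up.
single = Molecule()
n2 = single.add_atom("N")
b2 = single.add_atom("B")
single.add_bond(n2, b2, 1)

# A carbon with the hydrogen count pinned to 2 instead of the automatic 4.
carbene = Molecule()
c = carbene.add_atom("C", explicit_h=2)
```

With the zero-order bond, `dative.hydrogen_count(n)` and `dative.hydrogen_count(b)` both come out as 3, whereas the single-bonded version gives 2 for each atom.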
It should be noted that there are a few file formats that support these properties, though for the most part they are not advocated for this purpose. The SketchEl open source project, which I started some years ago, actively supports zero-order bonds and hydrogen atom counting, and all of the mobile apps from Molecular Materials Informatics, starting with MMDS, use the SketchEl molecule format as their native datatype.
The paper describes some simple additions to the MDL Molfile format so that it can support these extra fields. These are essentially trivial to implement, except that there are some subtleties when it comes to backward compatibility: non-organic compounds are often represented by working around the absence of a zero-order bond, pushing charges around to fix up the broken valences. Sometimes this sort of works, sometimes it doesn’t, but it does mean that an extended MDL Molfile that aims to be maximally compatible with both legacy and modern software should preferably store both styles. The paper describes an algorithm to convert zero-order bonds into charge-separated notation, in cases where it is plausible, so that the two forms can be stored in parallel.
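As a rough illustration of what converting a zero-order bond into charge-separated notation can look like, the Python sketch below handles only the simplest case. It is not the algorithm from the paper: the `DONORS` set, the `charge_separate` function and the "exactly one end is a donor" plausibility test are assumptions invented for this example, and a real implementation needs to consider much more of the structure.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Atom:
    element: str
    charge: int = 0


@dataclass(frozen=True)
class Bond:
    a: int      # index of first atom
    b: int      # index of second atom
    order: int  # 0 marks a zero-order bond


# Assumption: elements treated here as lone-pair donors. Purely illustrative.
DONORS = {"N", "P", "O", "S"}


def charge_separate(atoms: List[Atom], bonds: List[Bond]) -> Tuple[List[Atom], List[Bond]]:
    """Return copies in which convertible zero-order bonds become single bonds
    with +1 on the donor and -1 on the acceptor. Zero-order bonds that do not
    have exactly one donor end are left untouched."""
    new_atoms = list(atoms)
    new_bonds = []
    for bond in bonds:
        if bond.order != 0:
            new_bonds.append(bond)
            continue
        a_is_donor = atoms[bond.a].element in DONORS
        b_is_donor = atoms[bond.b].element in DONORS
        if a_is_donor == b_is_donor:
            # Neither or both ends look like donors: not plausible, keep as is.
            new_bonds.append(bond)
            continue
        donor, acceptor = (bond.a, bond.b) if a_is_donor else (bond.b, bond.a)
        new_atoms[donor] = replace(new_atoms[donor], charge=new_atoms[donor].charge + 1)
        new_atoms[acceptor] = replace(new_atoms[acceptor], charge=new_atoms[acceptor].charge - 1)
        new_bonds.append(replace(bond, order=1))
    return new_atoms, new_bonds


# Ammonia borane skeleton: the N-B zero-order bond becomes N(+)-B(-).
atoms = [Atom("N"), Atom("B")]
bonds = [Bond(0, 1, 0)]
legacy_atoms, legacy_bonds = charge_separate(atoms, bonds)
```

The function returns new lists rather than modifying its inputs, which matches the idea of keeping the zero-order form and the charge-separated form side by side.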
This feature is currently implemented and available from MMDS, and comes up when you initiate an outgoing email:
Selecting the extended MDL MOL option makes the app calculate and store the charge-separated form, as well as encoding the overridden fields, within the V2000 Molfile, which is included as an email attachment.
The subject of inadequate chemical file formats is also closely related to the recently reinvigorated subject of junk chemical data, about which I put in my two cents' worth earlier this year. A huge portion of the chemical data available from various sources is flawed for one reason or another, and a significant share of these problems arises from the simple fact that the molecular species being represented cannot be described using the legacy formats that have been adopted as industry standards.