Mixtures: extracting from text

mixtract_subsetRecently I described an open source tool for editing chemical mixtures, using a machine readable format. Now we have a proof of concept tool for starting with the kinds of text descriptions people use to describe mixtures, and recreating the actual components in their full glory.

Like with so much of science, chemistry has evolved common styles for describing chemical mixtures using names, quantity symbols and punctuation. A publication or a list of materials would commonly include entries like:

Lithium L-aziridine-2-carboxylate ≥97.0% 
Magnesium powder, ≥99%
Bis[tetrakis(hydroxymethyl)phosphonium] sulfate solution technical, 70-75% in H2O
(+)-Limonene 1,2-epoxide mixture of cis/trans-isomers, ≥97.0% (sum of isomers, GC)
1,3-Dimethyl-3,4,5,6-tetrahydro-2(1H)-pyrimidinone absolute, over molecular sieve (H2O ≤0.03%), ≥99.0% (GC)
Lithium aluminum hydride solution ~3.5 M in THF/toluene, suspension

… and all kinds of stuff. Often the way these descriptions are formulated is very straightforward, especially those which start with a recognisable chemical name and end with a purity fraction. These often make up the large majority of mixture descriptions from any given source (especially catalogs), but after those are dealt with, things start to disappear into the asymptotic rabbit hole of edge cases. Automatic parsing of these text descriptions has to start with a variety of major cases where understanding the constituents of a mixture is difficult:

  • solutions, which often follow patterns like solute in solvent, but can be represented in a number of ways
  • multicomponent sub-components, e.g. when the solvent is given as THF/toluene
  • isomers, which can sometimes be expressed within a CTAB structure, but other times should be split out (e.g. mixture of cis/trans-isomers)
  • ranges or indeterminate concentrations (e.g. 70-75% or ~3.5 M)
  • non-molecular structures (e.g. molecular sieve isn’t a specific thing that you can draw into a CTAB datastructure, but it is still an important chemical entity)
  • modifiers that are of secondary importance (e.g. powder or absolute)

And of course sometimes mixture descriptions are just not really applicable for cheminformatics (e.g. Behr Concrete and Masonry Bonding Primer No. 880 may well be made up of distinct chemical entities that could be captured in a machine readable mixture format, but this is not at all self-evident from the name; on the other hand, black tar residue is hard to describe with any more detail than the description itself).

The second major part of the mixtures project that we’re working on at Collaborative Drug Discovery involves building a prototype text extraction system for collecting mixtures from text, which comes from a variety of different sources. Here’s a sneak preview of some of the content that’s been captured so far:


These are rendered using the TypeScript-based toolkit that we’ve made available as open source (see GitHub), and so the hierarchical display is representative of what is inside each of the .mixfile entries and likewise for the structure/name rendering.

The actual process of parsing out the raw text descriptions involves a relatively crude brute force method, which we decided to try first, to see how far it would get: it combines a collection of text parsing rules (each of which is a regular expression, with some additional metadata) alternating with a molecular name recognition step, which makes use of OPSIN and a lookup database for compounds that aren’t matched. As a proof of concept, this approach is quite effective, but it does take some skill to customise the rule set to adapt to any specific oddities in the input… or to match chemical names that don’t follow well defined IUPAC rules.

As part of this project, we will be publishing the training data (most likely on the open source repository for the mixture editor), though our text parsing method is proprietary. We have some big ideas on how to move that part to the next level, and also for integrating the core mixture functionality into other products (like CDD Vault).

There is also a preliminary code block for turning a Mixfile into a MInChI (aka Mixtures InChI), which starts by running the standard InChI generator for each of the structures, and then gluing them together in the formally approved way. It will go through a few iterations before the standard is formally complete, but the intention is that the algorithm will be 100% compliant with what IUPAC decides.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s