Reaction Prediction Models: Chapter 10 – Training Data

The technology described in these articles is based on models that were built using a custom training set, or in the case of procedural algorithms, validated against that same content. The training set is unusual and proprietary, having been curated to a much higher degree of completeness and accuracy than any other available reaction data.

This is a series of articles about reaction prediction. The summary overview and table of contents can be found here. The website that provides this functionality is currently in a closed beta, but if you are interested in trying it out, just send an email to alex.clark@hey.com and introduce yourself.

Each of the articles in the series leading up to this one has alluded to a collection of reaction training data that isn’t like anything you can get elsewhere. Chronologically this was one of the first steps for the whole technology stack, co-evolving with a datastructure suitable for expressing the content, and with tools for curating and viewing it.

The data landscape for chemical reactions ten years ago was not all that different from what it is now: if you wanted reaction data, you had to choose between paying for high quality curated content that came with a lot of strings attached besides the price; using open data text-mined in bulk from patents, with all of its mistakes; or creating your own.

I was exploring the premise that there are some very useful things you can do with a critical mass of reactions that are fully mapped and balanced, with very few mistakes and rigorous attention to detail in all of the structures: not only present, but drawn with abbreviations fully defined, inorganic bonding fully interpretable with an implicit valence model, and all reaction component roles inferred or specified. At the very least, you would be able to ask precise, algorithmic questions of the dataset, such as: which reaction transforms use a particular catalyst? What groups of similar catalysts are observed, and what are the corresponding solvents? What are the most popular solvents for certain reagents or functional groups? What are the structure-activity trends of metals and ligands in catalysts? And so on.
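To make that concrete, here is a minimal sketch of the kind of query that becomes easy once every component carries a role and a machine-readable structure. The record layout, reaction entries, and helper functions are invented for illustration (this is not the actual datastructure), with RDKit standing in for the structure handling:

```python
# A sketch of the precision that role-tagged, machine-readable data allows.
# The record layout and reaction entries below are invented for illustration.
from collections import Counter
from rdkit import Chem

reactions = [
    {"catalyst": ["[Pd].c1ccc(P(c2ccccc2)c2ccccc2)cc1"], "solvent": ["C1CCOC1"]},
    {"catalyst": ["[Pd]"], "solvent": ["CN(C)C=O"]},
    {"catalyst": ["[Rh]"], "solvent": ["C1CCOC1"]},
]

def contains_metal(smiles, symbol):
    """True if the structure contains an atom of the given element."""
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None and any(a.GetSymbol() == symbol for a in mol.GetAtoms())

def solvents_for_metal(reactions, symbol):
    """Tally solvents across all reactions whose catalyst contains the metal."""
    tally = Counter()
    for rxn in reactions:
        if any(contains_metal(smi, symbol) for smi in rxn.get("catalyst", [])):
            tally.update(rxn.get("solvent", []))
    return tally

print(solvents_for_metal(reactions, "Pd"))
# e.g. Counter({'C1CCOC1': 1, 'CN(C)C=O': 1})
```

Queries like this only return trustworthy answers when roles are assigned and structures parse cleanly, which is the whole point of the curation standard described above.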

Such questions are relatively easy to answer if the data is flawless, and basically impossible if it is not.

Having already put in a lot of work on the inorganic coordination complex representation problem, and having subsequently developed a component-based reaction editor for the desktop (macOS specifically) around 2015, I decided to have a go at dogfooding my own curation software. I already knew that drawing chemical reactions is deceptively complicated, but I figured that I could streamline it if I sat down and drew out a lot of them. A fortuitous and legitimate internet solicitation showed up at a timely moment: Wiley (the publishing company) sent me a request to review a book proposal, with payment in the form of a book credit that was just enough to purchase a copy of Reactions and Syntheses: In the Organic Chemistry Laboratory. Over the course of several weeks after the textbook arrived, I curated each and every reaction scheme in the book, using my own tools, and got everything right to the best of my ability. It was a long process of iteratively improving my software, sometimes revising the datastructure, coming back to fix systematic mistakes, and finally getting through to the end.

That was the first major entry into my training dataset for reaction models.

Given that my own scientific interests veer toward inorganic chemistry, and that one of the major shortcomings of contemporary reaction data (private or otherwise) is the absence of any usable representation of most metal-containing compounds, I turned my efforts toward method papers that specifically explore catalysts and their applications. Some journals carry such papers frequently, with a reasonable minority being open access: ACS Catalysis and the Journal of Organic Chemistry are especially good ones.

For years, one of my idle-moment hobbies was to scan the latest issues of these journals looking for method papers to curate. Drawing out these reactions is challenging for a lot of reasons (e.g. figuring out the mechanism in order to get the atom mapping and byproducts right). One of the advantages of research designed to explore a series is that there will usually be a few dozen reactions with a lot in common: the authors use one or several catalyst structures, vary the R-groups and the conditions, and usually describe the same reaction transform. There are a lot of ways to exploit this in the reaction editing tool, yielding orders-of-magnitude speed improvements relative to starting from scratch and drawing out all of the atoms and bonds for each reaction; a sketch of the underlying idea follows.
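As a hedged illustration of why series papers curate so quickly: one shared transform plus a list of R-group variants regenerates most of a scheme series. The Suzuki-like SMARTS and substrates below are invented for the example, with RDKit standing in for the editor’s internals:

```python
# Sketch: a series paper usually shares one transform across many substrates,
# so a single template plus an R-group list covers most of the schemes.
from rdkit import Chem
from rdkit.Chem import rdChemReactions

# Hypothetical Suzuki-like transform: bromoarene + arylboronic acid -> biaryl.
transform = rdChemReactions.ReactionFromSmarts("[c:1][Br].[c:2]B(O)O>>[c:1][c:2]")
boronic = Chem.MolFromSmiles("OB(O)c1ccccc1")

# Vary the R-group on the bromoarene, as the authors of a series paper would.
for smi in ["Brc1ccc(C)cc1", "Brc1ccc(OC)cc1", "Brc1ccc(F)cc1"]:
    product = transform.RunReactants((Chem.MolFromSmiles(smi), boronic))[0][0]
    Chem.SanitizeMol(product)  # products from RunReactants need sanitising
    print(smi, "->", Chem.MolToSmiles(product))
```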

Building a training set mainly from contemporary catalyst use cases provides a lot of valuable data, but it is strongly biased towards things that are new and great. What was missing was content that is old and reliable, which is just as important, and very noticeable when absent. A significant chunk of the training data was filled out by looking up common named reactions and famous catalysts (e.g. Wilkinson’s, Grubbs’) and making sure that each was well represented.

Up to this point the catalysis bias was very strong, but there is a compelling reason to densify the training data to capture structure-activity trends for solvents as well. Unfortunately there is no journal that concentrates solvent method papers, so the only way to stock up on that data is to do a broad sweep through reaction chemistry.

Anyone who has looked into chemical reaction informatics is probably familiar with the text-mined collection of reactions from patent data up to 2016, released by Daniel Lowe. The patent collection is quite special in that millions of chemical reactions are described purely with text, with relatively few intra-document references, in a reasonably consistent way (in contrast to the general tendency of patents to obfuscate on purpose for… reasons). A great many projects use this data, under the premise that quantity can overwhelm quality.

I investigated the hypothesis that the many systematic and random flaws in this dataset might be correctable using a collection of bespoke algorithms, elimination checks, lookup lists, and a final human check that is much faster than primary curation. The reality is that the problems with the dataset are so deep that getting individual datapoints up to the standard of the rest of my training data is almost as labour intensive as drawing out content from papers. Nonetheless, a good few reactions were marked up, and these provide a nice background of rather boring organic chemistry, which is diverse in its own way, being representative of the chemistry that medicinal chemists like to do.
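As one example of what an elimination check can look like: a mass-balance test that rejects any entry whose element counts differ between the two sides. This is a hedged sketch using RDKit, not the actual pipeline:

```python
# Elimination-check sketch: reject reactions that are not atom balanced.
from collections import Counter
from rdkit import Chem

def element_counts(smiles_list):
    """Total element counts (including hydrogens) over a list of SMILES."""
    counts = Counter()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            return None  # unparseable structure: the entry fails outright
        counts.update(atom.GetSymbol() for atom in Chem.AddHs(mol).GetAtoms())
    return counts

def is_balanced(reactants, products):
    lhs, rhs = element_counts(reactants), element_counts(products)
    return lhs is not None and lhs == rhs

# Esterification with the water byproduct included: balanced.
print(is_balanced(["CC(=O)O", "CCO"], ["CC(=O)OCC", "O"]))  # True
# The same entry as patent text often reports it, byproduct dropped: rejected.
print(is_balanced(["CC(=O)O", "CCO"], ["CC(=O)OCC"]))       # False
```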

More recently I took another look at this data resource: this time ignoring the markup that was originally provided, feeding the patent text directly into a large language model, and asking it to pull out everything relevant and stuff it into a simple JSON datastructure. The requested output has the same outline as a fully described reaction (i.e. components packaged up with roles and quantities, and conditions marked up with units). Post-processing can fill in the structures referred to by name and perform mapping, balancing, etc. My early dabblings with “artificial intelligence” followed a common trajectory: at first I was incredibly impressed by how well these huge models could pull out content without any domain-specific training. That was the wow moment. But once I got my hands dirty with the results, the problems started to poke through with increasing frequency. When each entry requires 50 correct answers, a 2% per-answer failure rate compounds so that only about a third of entries come through flawless (0.98^50 ≈ 0.36), and the failures are scattered unpredictably. A number of reaction schemes were fixed up with significant manual effort for verification and correction, before I put the project back on the shelf.
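For a sense of the outline being requested, the shape might look something like the following; the field names and numbers are invented for illustration and are not the actual schema:

```python
import json

# Invented example of a components/conditions outline for one extracted reaction.
example = {
    "components": [
        {"name": "4-bromotoluene", "role": "reactant",
         "amount": {"value": 2.0, "units": "mmol"}},
        {"name": "phenylboronic acid", "role": "reactant",
         "amount": {"value": 2.4, "units": "mmol"}},
        {"name": "tetrakis(triphenylphosphine)palladium(0)", "role": "catalyst",
         "amount": {"value": 0.05, "units": "mmol"}},
        {"name": "toluene", "role": "solvent",
         "amount": {"value": 10, "units": "mL"}},
        {"name": "4-methylbiphenyl", "role": "product",
         "yield": {"value": 87, "units": "%"}},
    ],
    "conditions": {
        "temperature": {"value": 110, "units": "C"},
        "time": {"value": 12, "units": "h"},
    },
}
print(json.dumps(example, indent=2))
```

Every name, quantity, and unit in a structure like this is a separate opportunity for the model to be wrong, which is where the compounding failure rate above comes from.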

That’s where the large language model ingestion stands right now, but friends have offered advice on how to work around the problems I’ve experienced, so this part of the story is probably not over.

Another source of high-value reactions is the legendary open access journal Organic Syntheses, which has the unusual pedigree of having every submission reproduced by impartial chemists. For that reason the instructions tend to be very clear and unobfuscated (if they’re inadequate for any reason, the article won’t get published). The reactions are often exemplars of their kind, rather than a synthesis step of opportunity. The total number of reactions contained in a century of publishing is not huge, and is not necessarily going to move the needle on model-building exercises, but for explicit matches (e.g. reaction templating or reference lookups), having even one example of a reaction class with an excellent reference is incredibly valuable. Manual curation is, however, not rapid, especially given that some of the reactions have complicated mechanisms, meaning that getting the atom mapping right takes a bit of work. I have been curating this resource backwards in time from the present day, and I think I’ve made it through almost a decade of chemistry.

Every so often a research group will publish a decent-sized corpus of reaction data in a nominally well-defined format, and it is tempting to try to mark it up with a combination of scripts, models, and manual checks. On closer examination most of these have too many problems to fix (e.g. reagents referenced by internal database ID codes) or are missing something important (e.g. reaction time). But a few had enough salvageable reactions to make the grade. In retrospect the effort might have been better spent curating literature data from scratch, but what’s done is done.

Several resources that contain a lot of almost-but-not-quite reaction schemes were also curated (e.g. Organic Syntheses Using Transition Metals by Roderick Bates) and used separately as a reality check for catalyst proposals.

All of these resources, combined, deduplicated, and filtered for validity, provided ~10K single-step reactions for each of the models (more or less, depending on the model) at the time of writing. Compared to commercially available datasets or the text-mined patent data, this is a small dataset, but what it does have is quality and certain kinds of diversity. There is an argument to be made that a model built from data with half a dozen representative examples of every applicable class can compete with error-riddled slop that is orders of magnitude larger.
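The deduplication step can be sketched roughly as follows, assuming unmapped reaction SMILES as the interchange form; the canonicalisation heuristic here is an assumption for illustration, not the actual filter:

```python
# Dedup sketch: canonicalise each reaction SMILES and keep first occurrences.
from rdkit import Chem

def canonical_rxn(rxn_smiles):
    """Canonicalise a reactants>agents>products SMILES (assumed unmapped)
    by canonicalising and sorting the components of each slot."""
    return ">".join(
        ".".join(sorted(Chem.MolToSmiles(Chem.MolFromSmiles(smi))
                        for smi in part.split(".") if smi))
        for part in rxn_smiles.split(">")
    )

seen, unique = set(), []
for rxn in ["CCO.CC(=O)O>>CC(=O)OCC.O", "CC(=O)O.CCO>>O.CC(=O)OCC"]:
    key = canonical_rxn(rxn)
    if key not in seen:
        seen.add(key)
        unique.append(rxn)

print(len(unique))  # 1: both component orderings collapse to the same key
```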

Obviously more content of the same quality would be better, and manual curation does not scale well. It’s enough to achieve proof of concept, but going forward the best way to build out the training data would be to work with experimentalists who are conducting reactions at scale, and ensure that the reaction schemes and results are captured at source with all fields filled in completely. Once a system is set up right, the marginal effort would be very small and the value very high.

The data problem has the usual past/present/future distinctions: the present is about just being able to capture and use the data at all; the future is about setting up experiments that populate the datastructures as a matter of course; and the past is about remediating knowledge that has been downgraded to a mixture of scientific English and graphical figures. The wealth of chemical reaction knowledge that is locked up in these digitally intractable formats is vast, and enormously predictive. The large language models that have captured everyone’s attention lately are tantalisingly close to being able to unlock the text parts, but tables and figures dial up the challenge to a level that puts it well out there on the frontier.

The next article describes exporting as graphics.
