Adventures with combining PubMed and ChEMBL

One of the things I’ve been investigating lately is the open access segment of PubMed, which is a rather massive collection of open access medicine-relevant publications, with accompanying full text. I have also been looking at the ChEMBL database, which is focused on structure-activity data traceable back to the original literature document from which each datapoint was curated. This is all for the purpose of advancing the BioAssay Express mission of making the world’s bioassay protocols machine readable (aka FAIR).

The BioAssay Express project is a Collaborative Drug Discovery undertaking that provides a nice glide path toward using semantic web annotations to describe bioassay protocols. These protocols are normally communicated in the form of dense literary scientific jargon, or stored in-house in a variety of mutually incompatible ad hoc ways. Given the fact that there are undoubtedly tens of millions of these experiments that have been run, and that there is no consistent or reliable way to even search them (let alone do any kind of analysis), something has to give. This global resource is massive, critical to drug discovery, and pretty much completely intractable.

Our efforts to make a dent in this mountain of data can be seen online. When we started building the product, we found that the best starting point in terms of raw data came from the PubChem assay collection – or rather, a subset of it: any protocol that was submitted as part of the MLPCN (Molecular Libraries) project was mandated by its funding conditions to make the data fully available on PubChem. This gave us about 7000 assays with very complete text descriptions, a handful of pre-annotated fields, all of the structure-activity data, and lots of common patterns to data mine.

We have managed to do quite thorough annotations of more than half of these MLPCN assays, which is already a valuable resource. And later on, the EPA submitted their Tox21 assays to PubChem, which we also incorporated, and have begun to annotate.

While we have a few thousand assays left to process, we have been wondering for a while where our next big chunk of raw data will come from, as we scale up. One might be tempted to think of moving on to other assays in PubChem, but unfortunately the vast majority of them are imported directly from ChEMBL, so they don’t really count as part of the PubChem set, since ChEMBL is also a completely open and accessible resource.

ChEMBL itself includes millions of structure-activity datapoints curated from about 70,000 literature articles. For our purposes – that is to say providing somewhat detailed annotations for assay protocols – ChEMBL is simultaneously awesome and also a dead-end. The awesome part is the fact that the quality of curation is extremely high (which is correspondingly extremely rare in this public data realm). The individual assays have half a dozen high value annotations – things like target, organism, type, etc. But the catch is that if you want any more information than that, you have to look up the DOI, navigate through the journal’s paywall, enter your credit card, and then read the paper to try to figure it out. This is in contrast to the MLPCN & Tox21 assays, which were uploaded to PubChem with full text details as part of the record, making that content a suitable starting point for natural language analysis.

Now, consider another resource from the same people who brought you PubChem: PubMed. Not to be confused with each other, as they are quite different. PubMed is a gigantic collection of literature articles stored in a server room that has been battle hardened against the zombie apocalypse, among other things. There is a convenient API and download site that can be used to find the subset of PubMed articles that are open access.

One might be surprised by just how many articles there are. A script to grind through all of the files discovered that there are in fact 1.7 million unique DOIs that correspond to an open access article in PubMed. This is a lot higher than I expected, given that the moniker open access would usually be better described as author pays (i.e. the poor starving scientists who did all of the work have to pay $3000 so everyone can read it for free, rather than paying nothing and letting everyone pay to read it – a Faustian bargain if ever I heard of one).

So: ChEMBL has data that is wonderful, but has much missing detail and no full text. But it does have DOIs, and so one might think that of 70K articles, some fraction of these would be open access, and most of those would be found in the 1.7M candidate articles in PubMed.

Long story short: that number is 594. That’s about 1 in every 12 articles from the ChEMBL collection that can be reunited with its original text, using a script (with no tedious human intervention), operating on free & open data that is available anonymously to anyone via an API or by FTP.
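For the curious, the matching step boils down to a set intersection on DOIs. What follows is only a minimal sketch of that idea, not the actual script: the column names (`doi`, `DOI`), the prefix list and the tiny inline CSV samples are all assumptions made for illustration, and a real run would read the downloaded PubMed listing and a ChEMBL document export instead.

```python
import csv
import io

# Prefixes that sometimes decorate a DOI (assumed list, for illustration only).
DOI_PREFIXES = ("https://doi.org/", "http://doi.org/",
                "https://dx.doi.org/", "http://dx.doi.org/", "doi:")

def normalize_doi(raw):
    """Lower-case a DOI and strip common prefixes so both sources compare equal."""
    doi = (raw or "").strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

def load_dois(handle, column):
    """Collect the non-empty, normalized DOIs found in one column of a CSV file."""
    dois = set()
    for row in csv.DictReader(handle):
        doi = normalize_doi(row.get(column))
        if doi:
            dois.add(doi)
    return dois

def match_open_access(chembl_dois, open_access_dois):
    """Return the DOIs that appear in both collections, sorted for stable output."""
    return sorted(chembl_dois & open_access_dois)

# Hypothetical stand-ins for the two real data files.
chembl_csv = io.StringIO("doc_id,doi\n1,10.1000/ABC\n2,10.1000/def\n3,\n")
pubmed_csv = io.StringIO("PMCID,DOI\nPMC1,doi:10.1000/abc\nPMC2,10.1000/xyz\n")

matched = match_open_access(load_dois(chembl_csv, "doi"),
                            load_dois(pubmed_csv, "DOI"))
print(matched)
```

Normalizing before comparing matters more than it looks: DOIs are case-insensitive, and the same identifier can turn up with or without a resolver prefix depending on the source.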

Once these are united, we’ll have at least this many new assays that we can add to the BioAssay Express collection. Granted we already annotated more than 3.5K, so it’s not a huge increase, but given that many of these articles describe multiple assays, the quantity may turn out to be a multiple of that initial number.

So each of these will have:

  • full descriptive text from the publication
  • DOI and links to the original source material (guaranteed to be open)
  • pre-annotated fields such as target, organism, measurement type, etc.
  • structure-activity measurements for each compound mentioned in the article

The plan is to bundle up these assays, map all of the available content to the existing ontologies used by our Common Assay Template, and from there complete the annotation process. Just by joining these resources together, we already have something like 70-80% of the value of the data, and we have everything we need to apply our human/machine hybrid to fill in the rest.

The BioAssay Express is built around a core natural language/machine learning algorithm that was designed to accelerate human curation (rather than to replace the human and accept a lot of errors instead). Lately we have been improving and complementing these algorithms, with some variations, like more literal text extraction for unambiguous terms (erring on the side of minimising false positives), and following up with any axiomatic rules we can get our hands on, to use existing annotations to infer new ones (since assay protocol details are often very strongly correlated with one another: like reconstructing a partial skeleton making use of symmetry). As these algorithms improve (and with more training data), the annotation process becomes ever less demanding of human time – and we can even entertain the idea of having a “fully automatic” mode for data that is awaiting curation, with all the appropriate caveats.
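To give a flavour of the two complementary tricks mentioned above, here is a toy sketch, which is emphatically not the BioAssay Express implementation. The vocabulary, the field names and the single rule are all hypothetical, chosen only to show literal whole-word matching followed by rule-based inference.

```python
import re

# Hypothetical vocabulary: literal phrase -> (field, value).
LITERAL_TERMS = {
    "luciferase": ("detection method", "luminescence"),
    "hek293": ("cell line", "HEK293"),
    "ic50": ("result type", "IC50"),
}

# Hypothetical rules: when the premise is present, the conclusion can be added.
RULES = [
    (("cell line", "HEK293"), ("organism", "Homo sapiens")),
]

def extract_literal(text):
    """Annotate only on exact whole-word hits, to keep false positives low."""
    found = set()
    for phrase, annotation in LITERAL_TERMS.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(annotation)
    return found

def apply_rules(annotations):
    """Keep applying rules until no new annotation can be inferred."""
    result = set(annotations)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in RULES:
            if premise in result and conclusion not in result:
                result.add(conclusion)
                changed = True
    return result

text = "Inhibition was measured in HEK293 cells using a luciferase reporter."
annotations = apply_rules(extract_literal(text))
print(sorted(annotations))
```

The whole-word match is deliberately conservative: it will miss plurals and synonyms, which is the price of erring on the side of fewer false positives, and anything it misses is left for the human curator.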

Once the import tools are ironed out and functioning, this doesn’t have to be a one-off injection of 594 articles’ worth of assays: ChEMBL is an ongoing project, and has regular new releases (they’re up to release 24 now). As open access publishing becomes more common, it might be reasonable to expect the portion of open new content to increase: 1 in 12 should be the low-water mark. Each time a new ChEMBL release comes out, we can re-run the script, and upload anything that’s new.
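The "upload anything that’s new" part is just a set difference against whatever was imported last time. A minimal sketch, assuming the previously imported DOIs are kept somewhere between runs (the function name and the sample DOIs are made up for illustration):

```python
def new_matches(previous_dois, chembl_dois, open_access_dois):
    """Return open access ChEMBL DOIs that were not imported on an earlier run."""
    current = set(chembl_dois) & set(open_access_dois)
    return sorted(current - set(previous_dois))

# Hypothetical example: one article was imported last time, a new release adds two.
already_imported = {"10.1000/a"}
latest_chembl = {"10.1000/a", "10.1000/b", "10.1000/c"}
open_access = {"10.1000/a", "10.1000/b"}

to_upload = new_matches(already_imported, latest_chembl, open_access)
print(to_upload)
```

Because the open access collection also grows over time, re-running against the full ChEMBL list (rather than only the newly added documents) picks up older articles that have since become open.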

This investigation of the PubMed database is also foreshadowing a bigger project, namely the incorporation of everything in PubMed that describes an assay, which will lean much more heavily on machines than the human/machine symbiosis that we currently prefer. But more about that later.



