In the previous post, I introduced the BioAssay Express project, which is based on the idea that when assay protocols are described using semantic annotations rather than just scientific text, a whole realm of software analysis becomes possible. Whenever there’s a bioassay experiment protocol at hand, chances are there are also some molecules with activity measurements not far away, and those are the subject of this article.
The raw data from which the BioAssay Express draws is currently entirely from PubChem, on account of a large amount of data being stored in one place and accessible via one API. Unlike other open repositories (such as ChEMBL), many of the assays have the full text description of the experimental protocol included in the record, which is essential for our purposes. Fortunately, PubChem also keeps track of who uploaded each assay, which happens to have a near perfect correlation with whether a given assay is suitable.
PubChem is of course more famous for being a giant database of molecules, and many of these are connected to one or more assays via measurement data. Since it went online, the BioAssay Express project has been regularly interrogating the PubChem assay collection, looking for new entries provided by one of the whitelisted contributors. Since last week, this background task has been augmented by two more, which keep themselves busy downloading the measurements for each assay, and for each of these, making sure that all the compounds have structures.
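As a rough illustration of what those two background tasks involve, here is a minimal Python sketch based on PubChem's public PUG REST interface. It is not the BioAssay Express implementation: the function names are hypothetical, and the way the service actually schedules and stores its downloads is not described here. The sketch builds the request URLs for an assay's measurement table and for compound structures, and works out which compounds still need a structure.

```python
from urllib.request import urlopen

# Base address of PubChem's PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def assay_data_url(aid):
    """URL for the measurement table of one assay, in CSV form."""
    return f"{PUG_REST}/assay/aid/{aid}/CSV"


def compound_sdf_url(cids):
    """URL for the structures of one or more compounds, as an SDfile."""
    cid_list = ",".join(str(cid) for cid in cids)
    return f"{PUG_REST}/compound/cid/{cid_list}/SDF"


def missing_structures(measured_cids, stored_cids):
    """Compound IDs that have measurements but no stored structure yet.

    The order of first appearance is kept and duplicates are dropped.
    """
    have = set(stored_cids)
    missing = []
    for cid in measured_cids:
        if cid not in have:
            have.add(cid)  # also prevents listing the same CID twice
            missing.append(cid)
    return missing


def fetch(url, timeout=30.0):
    """Download a URL as text (needs network access, so not called here)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A background task along these lines would call `fetch(assay_data_url(aid))` for each new assay, collect the compound IDs from the table, and then request `compound_sdf_url(missing_structures(...))` in modest batches.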
The first visible consequence of this internal expansion is another section at the bottom of the assay assignment page, which you can find on pages such as: http://www.bioassayexpress.com/BioAssayExpress/assign.jsp?pubchemAID=1599:
Right now the grid view is not much more functional than it looks: you can step through one page at a time, or click on a compound for more information, which is currently shown in a detail dialog:
In addition to powering the molsync.com site, and now being added to the functionality of the BioAssay Express, this toolkit (“WebMolKit”) is actually open source. Right now the source code is on a private GitHub account, which will be opened up to the public once it has had a little bit more time to mature (it’s very green at this moment). It will be available for anyone to use under the GNU General Public License (GPL) as soon as the switch is flipped and the code is made public.
From the point of view of the BioAssay Express, adding molecules to applicable pages is a nice novelty feature, though it is not so amazing in itself, since we already have the use of PubChem’s embeddable widget, which achieves much the same effect. The real value kicks in once it is possible to actually do things with the molecules. The next step along that road is to allow some reorganising of the measurement columns (e.g. pick the activity measurement you want and rename it to “IC50”), and then to hit a download button and grab the entire thing as a single SDfile, with all the molecules and just the columns you asked for. Right now that particular step is quite difficult to accomplish using raw PubChem access, and takes a fair bit of cheminformatics know-how.
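To make the export step concrete, here is a small Python sketch of writing an SDfile that keeps only the requested columns and renames them on the way out. It assumes the structures are already in hand as MDL Molfile text; the `write_sdfile` function, the record layout and the column name `Potency` are all made up for illustration, and none of this is the BioAssay Express code.

```python
def write_sdfile(records, columns):
    """Serialise records to SDfile text.

    records: list of dicts, each with a "molfile" string (MDL Molfile text)
             and a "fields" dict mapping column name to value.
    columns: dict mapping source column name to output field name; only
             these columns are written, in this order.
    """
    chunks = []
    for record in records:
        lines = [record["molfile"].rstrip("\n")]
        for source, target in columns.items():
            value = record["fields"].get(source)
            if value is None:
                continue  # this compound has no value in that column
            lines.append(f"> <{target}>")  # data header
            lines.append(str(value))       # data value
            lines.append("")               # blank line ends the data item
        lines.append("$$$$")               # record terminator
        chunks.append("\n".join(lines) + "\n")
    return "".join(chunks)


# Placeholder structure with no atoms, used only to keep the example short.
EMPTY_MOLFILE = (
    "example\n\n\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
)

records = [{"molfile": EMPTY_MOLFILE, "fields": {"Potency": 1.5, "Outcome": "active"}}]
sdfile_text = write_sdfile(records, {"Potency": "IC50"})
```

Here `sdfile_text` contains a single record with one `IC50` field; the `Outcome` column is dropped because it was not requested.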
But even that is a relatively minor step in the greater scheme of things: it gets really interesting when this functionality is connected up to features that return more than one assay (e.g. the Browse Assays feature). Right now this is a great way to find assays according to certain criteria, but in the not too distant future, it will also be able to show a union of all the molecules for all the selected assays. The selected assays will generally have a heterogeneous collection of measurements, which can be selected, renamed and filtered prior to downloading the whole thing as one big SAR dataset. This workflow is a pretty useful way to go from a vague idea to a dataset that is ready to be plugged into a model building exercise or a visualisation tool.
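The union-then-filter idea can be sketched in a few lines of Python. The data layout below (assay ID to compound ID to measurements) and the names `merge_assays` and `filter_rows` are assumptions chosen for illustration, not a description of how the feature will actually be built.

```python
def merge_assays(assays, selection):
    """Combine several assays into one table keyed by compound.

    assays:    dict of assay ID -> dict of compound ID -> dict of column -> value.
    selection: dict of (assay ID, column) -> output column name; any
               measurement that is not selected is left out.
    Returns a dict of compound ID -> dict of output column name -> value,
    covering the union of compounds across all the assays.
    """
    merged = {}
    for aid, rows in assays.items():
        for cid, measurements in rows.items():
            row = merged.setdefault(cid, {})  # compound joins the union
            for column, value in measurements.items():
                name = selection.get((aid, column))
                if name is not None:
                    row[name] = value
    return merged


def filter_rows(merged, required):
    """Keep only compounds that have a value for every required column."""
    return {
        cid: row
        for cid, row in merged.items()
        if all(name in row for name in required)
    }


# Two made-up assays with different kinds of measurement.
assays = {
    1: {10: {"Potency": 1.0}, 11: {"Potency": 2.0}},
    2: {11: {"Inhibition": 50}, 12: {"Inhibition": 70}},
}
selection = {(1, "Potency"): "IC50", (2, "Inhibition"): "PctInhibition"}

merged = merge_assays(assays, selection)
complete = filter_rows(merged, ["IC50", "PctInhibition"])
```

In this toy case `merged` holds three compounds, while `complete` keeps only the one compound measured in both assays. Whether a modelling dataset should use the full union or only the complete rows depends on the model being built.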