BioAssay Express: downloading SAR datasets

The last post described the addition of molecules to the BioAssay Express project, and alluded to the near-term intention of making this actually useful for something. The first round of utility is now in place: the ability to select columns and download the molecules as a single SDfile, suitable for use as a structure-activity dataset.

The assay protocols and measurements in PubChem take a number of different forms, depending on the columns that the submitters decided to provide. There are often quite a few of them, e.g. a determination of active vs. inactive, a spread of concentration datapoints, a computed IC50, and any number of auxiliary details that are pertinent to the measurement of the assay. When the intention is to acquire a structure-activity collection, it is most convenient to have one structure and one activity value for each compound (and maybe a name and identifier for reference). The PubChem source data has no standard way to indicate which of these columns is the one you want, and so a user interface is necessary to let you pick them.

Whenever you see molecules underneath an assay record (e.g. for AID 346), like so:


… it means that clicking on any of the compounds brings up a detail dialog:


This has been modified so that there is a little tag icon on the right of each of the fields, and clicking on it brings up a prompt that asks you to provide your own name for the field. This can be done on any number of them, and there’s another dialog for reviewing the list of tagged fields in general:


What’s shown in the above dialog is that I have picked out 4 of the measurement fields, and given each of them a name of its own (active, curve, logAC50 and npoints). These are in the short keyword style that is suitable for viewing in tabular form, which we’ll get to in a moment.

Each of these selected names can be associated with any number of origin fields. This is less likely to be useful for single assays, but when the compounds are being combined from multiple assays, it will become very powerful, e.g. if the first set of measurements has a field called “pIC50” and the second one has a field called “-logIC50”, they can be combined together by giving them the same tag name.
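To make the merging idea concrete, here is a minimal Python sketch of how tag names might collapse differently named origin fields into a single column. The `TAGS` mapping, the `merge_tagged` function and the sample values are all invented for illustration; this is not a description of how BioAssay Express handles it internally.

```python
from typing import Any, Dict

# Hypothetical tag table: origin field name -> user-chosen tag name.
# Two differently named origin fields share the same tag.
TAGS: Dict[str, str] = {
    "pIC50": "pIC50",
    "-logIC50": "pIC50",
}


def merge_tagged(record: Dict[str, Any], tags: Dict[str, str]) -> Dict[str, Any]:
    """Keep only the tagged fields of one compound record, renamed to their tags."""
    merged: Dict[str, Any] = {}
    for field, value in record.items():
        tag = tags.get(field)
        # Untagged fields are dropped; the first value seen for a tag wins.
        if tag is not None and tag not in merged:
            merged[tag] = value
    return merged


# Records as they might come from two different assays (made-up values).
assay_one = {"pIC50": 6.2, "comment": "n/a"}
assay_two = {"-logIC50": 5.5}
combined = [merge_tagged(r, TAGS) for r in (assay_one, assay_two)]
```

Both records end up with a single `pIC50` column, which is the whole point of letting several origin fields share one tag.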


Having picked out the columns of interest and given each its own name, the next step is to hit the button labelled Download as SDfile. This does exactly what the label suggests: the server gathers the compounds and blasts them out over the wire onto your hard disk, using the MDL SDfile format. As a technical note, one of the few positive features of this file format is that it is inherently designed for streaming, and the server takes advantage of this: some of the content assembly is done while the bytes are being pushed out over the network.
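The streaming-friendly nature of the format can be illustrated with a short Python sketch. Each SDfile record is self-contained (a molfile block, then data items, then a `$$$$` terminator), so records can be emitted one at a time without ever holding the whole file in memory. The function names and the empty placeholder molfile are assumptions made for the example, not the server's actual code.

```python
from typing import Dict, Iterable, Iterator, Tuple


def sd_record(molfile: str, fields: Dict[str, str]) -> str:
    """Format one compound as an SDfile record: molfile, data items, terminator."""
    parts = [molfile.rstrip("\n")]
    for name, value in fields.items():
        parts.append(f"> <{name}>")  # data header line
        parts.append(str(value))     # data value line
        parts.append("")             # blank line closes the data item
    parts.append("$$$$")             # record terminator
    return "\n".join(parts) + "\n"


def stream_sdfile(compounds: Iterable[Tuple[str, Dict[str, str]]]) -> Iterator[str]:
    """Yield records one by one, so output can start before all compounds are ready."""
    for molfile, fields in compounds:
        yield sd_record(molfile, fields)


# Placeholder molfile with no atoms, just to keep the example short.
EMPTY_MOL = "demo\n  sketch\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END"

chunks = list(stream_sdfile([
    (EMPTY_MOL, {"active": "1", "logAC50": "-5.3"}),
    (EMPTY_MOL, {"active": "0"}),
]))
```

Because `stream_sdfile` is a generator, a web framework could write each chunk to the response as soon as it is produced, which is the general idea behind assembling content while the download is already in progress.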

Once the SDfile has been downloaded, it can be accessed by any software that can read the format, e.g. the XMDS app:


By all means try out this feature right away: some of the measurements are still busily background downloading from PubChem, but for those that are loaded already, this is a rather convenient way to grab a single assay as a SAR dataset.
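If you would rather inspect the download with a script than with an app, the tagged fields are easy to pull back out. The following is a rough Python sketch of a reader that ignores the structure block and collects only the data items of each record; the function name `read_sd_fields` is hypothetical, and the parsing is deliberately simplified (it assumes V2000-style molfiles ending in `M  END`).

```python
from typing import Dict, Iterable, Iterator, List, Optional


def read_sd_fields(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict of data fields per SDfile record, skipping the molfile block."""
    fields: Dict[str, str] = {}
    name: Optional[str] = None
    value: List[str] = []
    in_data = False  # becomes True once the molfile block has ended
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "$$$$":
            # End of record: flush any open data item and emit the fields.
            if name is not None:
                fields[name] = "\n".join(value)
            yield fields
            fields, name, value, in_data = {}, None, [], False
        elif not in_data:
            in_data = line.startswith("M  END")
        elif name is None:
            # Expect a data header such as "> <logAC50>".
            if line.startswith(">") and "<" in line:
                start = line.index("<") + 1
                end = line.find(">", start)
                name = line[start:end] if end != -1 else line[start:]
                value = []
        elif line == "":
            # A blank line closes the current data item.
            fields[name] = "\n".join(value)
            name = None
        else:
            value.append(line)


sample = (
    "demo\n  sketch\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n"
    "> <active>\n1\n\n> <logAC50>\n-5.3\n\n$$$$\n"
)
records = list(read_sd_fields(sample.splitlines()))
```

The same function accepts an open file handle in place of `sample.splitlines()`, since it only needs an iterable of lines.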

The next item on the menu is to add this same workflow to multiple assays at once: at that point the amount of informatics busywork that this interface will be able to save you is quite respectable.

