Mapping drug target ontologies in BioAssay Express: a narrow use case for Excel

bae_excel_thumbThe BioAssay Express project captures a lot of public bioassay data using public ontologies, and quite often these ontologies overlap with each other: this happens particularly often with the target concept. We needed to map common terms between the Drug Target Ontology and Uniprot (via the Protein Ontology) with a mapping file. The amount of data involved was a little too much to do manually, but not quite enough to justify writing a custom script. Because it had to be done exactly once, the task was an ideal use case for Excel.

Normally when somebody brings up the word “Excel” in a discussion about scientific data I have to suppress a kneejerk reaction to explain that by definition they’re doing it wrong. But there are real use cases, especially when it comes to disposable single instance ad hoc operations that require a crude visual preview, and a chance to make minor manual corrections before committing the results.

The background for this use case is that we have multiple concepts of target in the BioAssay Express Common Assay Template. Just the eponymous target assignment category draws from 3 sources: the BioAssay Ontology, Drug Target Ontology (two root branches) and the Protein Ontology:

bae_excel_0

In addition the NCBI taxonomy hierarchies for proteins and genes are also supported, in different assignment categories. This makes for 5 different ways to indicate which target an assay is attempting to hit (arguably more if you include disease and cell line definitions). Some of these are orthogonal and complementary, since there are multiple ways to arrange target hierarchies (e.g. by protein family, organism taxonomy, biological impact, among others) and multiple items can be asserted in a way that provides additional value. Others combinations are degenerate, and while it would be ideal for there to be one ontology to rule them all, until that time it makes sense to pair up multiple definitions when they are equivalent.

For the public data that we have been collecting within the BioAssay Express, we have used the Drug Target Ontology preferentially curing our initial curation process. Subsequent to doing the bulk of the work, however, we added the Protein Ontology into the same hierarchy. The big attraction of making this addition is that the Protein Ontology is a selected subset of Uniprot, with its own well designed hierarchy that fits well with our use cases. We have been looking for a way to incorporate Uniprot identifiers for quite some time, but progress has been hindered by the fact that it is huge, and converting all of it into an ontology presents a major scalability challenge. Being able to make use of a community-selected subset is of great benefit.

A significant number of the terms from the Drug Target Ontology happen to be an exact match for their Protein Ontology/Uniprot counterparts. We would really like to have Uniprot identifiers for our targets, and we would like to have this without doing any extra curation work. We would also like to not have to write any code. It just so happens that the Protein Ontology Consortium has provided us with a mapping for a number of cases where there is a 1:1 correlation from DTO to PRO:

bae_excel_2

The next issue is a matter of data formats. The internal representation used by the BioAssay Express is not a tabular format, nor is it stored internally as semantic web triples – though it is easy to convert it to either of these. What we do have is the ability to copy results from a search, placing the content onto the clipboard in a tab-separated form that can be pasted directly into Excel:

bae_excel_1

The interesting columns are F (target label) and G (abbreviated URI). The subsequent columns contain spillover when multiple annotations are present.

The relative value proposition for spreadsheets kicks in at this point, because we can go through the list of hundreds of entries and prune out cases that we don’t want to apply to (e.g. sorting then deleting blocks of rows).

The actual mapping process can be done by constructing some crude but effective Excel lookup macros, e.g.

=IF(G2="","",IFERROR(XLOOKUP(G2,Mapping!$A:$A,Mapping!$B:$B),""))

Once the mapping columns have been generated, the blank cells are those that were not in the DTO-to-PRO list. Filtering out the rows with no mappings, then rearranging column content to plug the gaps, can likewise be done quickly with manual use of standard spreadsheet operations.

The next step is to get this ad hoc spreadsheet manipulation product back into the BioAssay Express, and this can be accomplished by a feature that we have designed and used for importing tabular data of varying formats, and marking it up to the semantic-web compatible form that we use. It is actually almost as simple as copying the rows & columns to the clipboard and pasting them into the import page:

bae_excel_3

The above snapshot shows 4 columns transferred in this way. The leftmost column has been designated as the assay identifier, and each row has been connected to an existing assay. The other three columns (LOOKUP1 through 3) contain abbreviated URI terms, and they have been directed to the target assignment. The tool has matched the terms with content from the template, and rendered them with the corresponding labels (grey background). It also shows – for reference purposes – the content that already exists within these assignments for each assay. In this case, it’s the original annotation from the Drug Target Ontology. Since the names clearly indicate the same protein, this serves as a strong evidnece that our mapping is working.

Once the import process is accepted, each of these terms is added to the corresponding assay. This is evidenced within the assays themselves, e.g.

Note that the target has two assignments, both of which represent the same concept, but they are from two different ontologies, both with their merits. This means that we can search and analyse assay results using the Drug Target OntologyProtein OntologyUniprot identifiers, or any combination thereof.

It also establishes a working pattern for carrying out bulk changes on data via ad hoc automation, using a tool that most scientists are already familiar with, i.e. Excel. So I guess it’s not quite time to retire the spreadsheet just yet.

Leave a comment