The 1.5 release of the Mobile Molecular DataSheet (MMDS) introduced a couple of major features, including a minimum viable deployment of OpenPHACTS assay integration. That is being quite liberal with the meaning of viable: building this into a practical scientific workflow is more of an ongoing campaign than a specific new piece of functionality, and so other features are being improved in lockstep. The next one to get an upgrade is the searching capability, which has an additional preflight configuration block (shown to the right).
The search functionality relies on a middleware service called MetaSearch, which runs on the molsync.com server along with various other supporting cheminformatics functionality. Its first task is to farm out searches to PubChem, ChEBI and ChemSpider, each of which has its own strengths and weaknesses. Its second task is to reconcile the results by deduplicating them, homogenising them and applying some useful post-processing: limited in scope at first, but growing to fit the needs of various front end workflows (e.g. scaffold analysis).
The newly added options block allows selection of the underlying database. By default it searches them all, but sometimes you know what you want (or rather, where you want it from). Performance is not much affected either way, because the sub-searches are submitted concurrently. For substructure or similarity searches, though, it is often useful to restrict the search to the most relevant database, so that the precious 100-item maximum cutoff is likely to keep more of the results you want (e.g. ChEBI tends to offer the most heavily curated drug-relevant compounds, but is much smaller than the other two; PubChem has the greatest diversity, but also the most junk; ChemSpider offers uniquely crowdsourced post-curation).
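To make the fan-out and reconcile idea concrete, here is a minimal Python sketch. It is not the MetaSearch implementation, which is not shown in this post. The `Hit` class, the `search_source` stub with its canned placeholder data, and the `structure_key` used to spot duplicates are all assumptions made for illustration; only the three database names and the 100-item cutoff come from the text.

```python
import asyncio
from dataclasses import dataclass, field

MAX_RESULTS = 100  # the 100-item cutoff mentioned in the post


@dataclass
class Hit:
    # Hypothetical canonical key used to recognise the same structure
    # coming back from different databases.
    structure_key: str
    # Source identifiers folded into this result, e.g. {"PubChem": "pc-1"}.
    source_ids: dict = field(default_factory=dict)


async def search_source(name, query):
    """Stand-in for a remote search; returns canned placeholder hits."""
    await asyncio.sleep(0)  # pretend to wait on the network
    canned = {
        "PubChem": [("KEY-A", "pc-1"), ("KEY-B", "pc-2")],
        "ChEBI": [("KEY-A", "chebi-1")],
        "ChemSpider": [("KEY-B", "cs-1"), ("KEY-C", "cs-2")],
    }
    return [Hit(key, {name: ident}) for key, ident in canned.get(name, [])]


async def meta_search(query, sources=("PubChem", "ChEBI", "ChemSpider")):
    # Submit every sub-search at once, so the slowest source sets the pace.
    batches = await asyncio.gather(*(search_source(s, query) for s in sources))

    # Reconcile: one entry per structure, with all source IDs merged in.
    merged = {}
    for batch in batches:
        for hit in batch:
            if hit.structure_key in merged:
                merged[hit.structure_key].source_ids.update(hit.source_ids)
            else:
                merged[hit.structure_key] = Hit(hit.structure_key, dict(hit.source_ids))

    # Apply the cutoff after deduplication.
    return list(merged.values())[:MAX_RESULTS]
```

Calling `asyncio.run(meta_search("some query", sources=("ChEBI",)))` restricts the search to one database, which is the situation where the cutoff is spent entirely on results from the source you care about.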
A while back, the ability to look up and return vendor information for records was added. This still applies only to content that can be mapped to PubChem (though it will branch out someday). Unfortunately the lookup process is kinda slow, so making the fetch optional (and off by default) is a useful feature. For that matter, the results can also be restricted to include only those which have some available vendor information: if you only want to see compounds you can buy, select the Require option.
There is an innocuous-looking option right at the end entitled Assays. If this is switched to Require, each potential result candidate is quickly checked against the OpenPHACTS service, to make sure that the corresponding molecule identifier has been recorded in that database and that it has at least one assay record (where assay is defined fairly liberally). It does not actually return the assay information (there’s a separate feature for that), but rather it allows you to gather a collection of molecules and strip out the ones that are unknown to OpenPHACTS. The overall nefarious plan here is to make it increasingly viable (and convenient) to find a bunch of related compounds, grab all relevant known information about their properties and activities, then continue on toward various kinds of SAR analysis and model building.
There are two other important improvements. Importing the resulting content from a search now uses the more sophisticated assimilation process introduced recently to make importing of content more flexible. Also, the source identifiers from the underlying databases are imported (i.e. PubChem, ChEBI and ChemSpider ID codes for the content that was folded into each search result).
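The way the vendor and assay options prune the candidate list can be sketched in a few lines of Python. This is an illustration only: the `Mode` enum, the `filter_candidates` function and the two lookup callables are hypothetical names, and the real checks run on the server against PubChem-mapped vendor data and the OpenPHACTS service.

```python
from enum import Enum


class Mode(Enum):
    OFF = "off"          # do not look anything up (the default for vendors)
    FETCH = "fetch"      # look up and attach, but keep every candidate
    REQUIRE = "require"  # drop candidates for which the lookup finds nothing


def filter_candidates(candidates, vendor_mode, assay_mode,
                      lookup_vendors, has_assay_record):
    """Apply the vendor and assay options to a list of molecule identifiers.

    lookup_vendors(mol_id) -> list of vendor names (hypothetical callable)
    has_assay_record(mol_id) -> bool (hypothetical callable)
    """
    kept = []
    for mol_id in candidates:
        record = {"id": mol_id}

        # The vendor lookup is slow, so it is skipped entirely when off.
        if vendor_mode is not Mode.OFF:
            vendors = lookup_vendors(mol_id)
            if vendor_mode is Mode.REQUIRE and not vendors:
                continue
            record["vendors"] = vendors

        # The assay check only filters; no assay data is attached.
        if assay_mode is Mode.REQUIRE and not has_assay_record(mol_id):
            continue

        kept.append(record)
    return kept
```

Passing the lookups in as callables keeps the sketch self-contained, and it also shows why leaving the vendor option off is cheap: the slow call is never made.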
Alex, allow me to offer some record straightening, in the nicest way of course. None of the key sources you mention needs any defence, but there are misconceptions in your comments. The records in ChEBI have three tiers of expert curation but, according to the DrugBank intersects, include only 992 approved drugs among the 30708, with a self-stated natural product and metabolite focus. Moving on to the big guys, we have 474 sources > 58 million substances > 30 million compounds (just tweeted) for ChemSpider, while PubChem is 257 sources > 128 million substances > 48 million compounds. Strictly speaking, neither contains “junk”, in so far as they both use structural plausibility filtration rules. They both exhibit the big advantages and lesser disadvantages of an open, arguably low bar (neither is inclined to say “no thanks”): broad global capture and massive source information aggregation, but submitter diversity sans curation at entry. Given caveats on how tailed the submitter numbers are, they suggest PubChem is more diverse. The extent to which this diversity tails out towards (rule-passing) “junk” is a moot point, but in my observations automated patent extraction and vendors (both essential source types) are the biggest sources of this. While it’s true that ChemSpider crowdsourced post-submission curation could remediate identifiable junk, it remains to be declared how many records have been expert-fixed in this way. Like GenBank, PubChem operates submitter primacy, but many sources (e.g. DrugBank, TTD, ChEMBL) extensively fix internally between submission cycles.
We could fill a shelf with books about this subject 🙂 Broadly speaking, the three databases have three strategic philosophies for addressing the junk data problem: PubChem pushes it back to the submitters; ChemSpider adds crowd-fixing; ChEBI focuses on a smaller selection. I kinda like this oversimplification, because it emphasises that there are multiple strategies for solving this very big and very real problem. In my experience (both anecdotal and systematic), they’re all very far from any kind of reasonable minimum. There are some horrors in there that are pretty hard to believe made it through any kind of filter. I have plenty of opinions on the best way to solve the junk data problem, but anyone who’s willing to speak out and try to do something about it has my moral support.
Agreed. We and, crucially, these three (and other) source teams are singing from the same hymn sheet, but need a little more harmonisation on the lyrics.