Recent developments with Bayesian models and app data sharing

Several of the flagship apps from Molecular Materials Informatics have had major updates recently: the Mobile Molecular DataSheet, SAR Table, MolPrime+, Green Lab Notebook and Approved Drugs. Two separate groups of features have motivated these updates: (1) in-app calculation of nontrivial properties, lately supplemented by Bayesian models, and (2) the new iOS 8 API for importing and exporting data to any compatible service, which includes iCloud by default, and also Dropbox if it is installed.


Dabbling with the desktop: molecular datasheet app on the mac

This is a way-too-soon sneak preview of what might someday end up as a commercial product. Currently residing under the project acronym XMDS (i.e. X-Molecular DataSheet), it is a relative of the MMDS app, built for Mac OS X. The minimum viable feature set is intended to be approximately equivalent to the datasheet & molecule editor that is implemented by the open source SketchEl project.

The screenshot above shows the sum total of the functionality of the Mac app so far: the ability to load a datasheet file into a window, create an outline grid, and render each of the cells. Obviously the text and numbers are not too difficult to render. Of the two structures shown, the first one is very simple and plausible, while the second one is nonsense: it exists for the purpose of verifying that a variety of different rendering scenarios are handled as intended. If you look carefully at the molecules, you may also notice that they are very carefully drawn, in order to demonstrate publication-worthy aesthetics from raw connection tables. The graphical rendering layout algorithm, which has the 2D coordinates of the heavy atoms (in Angstroms) as its only visual cue, is the same one that is used by the com.mmi software stack that drives the services, and is described in detail in a paper published in Molecular Informatics. Even though the app currently does nothing other than display static content, the amount of machinery that had to be ported over is quite extensive, so this modest snapshot hides a significant amount of successful coding.

Internally, the project is interesting because it is all written using Apple’s not-really-ready Swift language. As I’ve previously written (first and second impressions), the language was a surprise announcement that Apple revealed in June 2014 after a secret incubation period, and the specifications made it seem that it was going to be a very fine programming environment right out of the gate, with some very clever and well thought out features. Unfortunately current reality is quite the opposite: the version number should probably be stated as “0.1 alpha”, and the development experience is correspondingly awkward, buggy and broken. Some of my early efforts to port cheminformatics algorithms to Swift (from their original Java implementations) revealed that the kinds of low level algorithms that ought to perform excellently are actually something like 50-100x slower than a reasonably well written Objective-C version, which is a bit of a dealbreaker. The performance is more like what you would expect from naively implementing a number crunching algorithm in a scripting language. For this reason, I subsequently accepted the reality that my mobile apps for iOS would continue to be based on Objective-C for a fair bit longer: the older language has been the victim of premature deprecation.

Nonetheless, since Apple has pretty much staked their reputation on their new language, they really stand to lose a lot if they don’t keep improving Swift so that it lives up to the hype in the reasonably near future. For what it’s worth I don’t think Apple is ready to flame out on its current winning streak, and so I’m betting that it is a good idea to start using Swift for products that may not be ready to release for a while.

And hence the idea of expanding the product lineup of Molecular Materials Informatics to span a larger variety of platforms. While the most prominent software offerings are the mobile apps (iOS and Android), backed behind the scenes by increasingly powerful serverside/non-interactive algorithms, the strategic plan has always been to cover the desktop platform, too. The lack of a clear winner in the platform/development space is one of the main reasons it is taking so long (i.e. Windows is big but past its prime, Linux continues to flail but is popular with several very important niche demographics, Mac keeps on winning but needs to hurry up and ditch Objective-C, cross platform tools like Java provide only the lowest common denominator user experience, the continued absence of a viable web runtime rules out browser options… not to mention all the issues with deployment and installation). Swift promises to resolve the stalemate, and once the worst of its shortcomings are dealt with, it will win by default even if it is ultimately quite mediocre.

A large amount of the method development work that I do, in order to create cheminformatics algorithms for mobile apps and other platforms, requires me to personally edit molecular datasheets for many different purposes. When working on the desktop, I use the SketchEl open source package, which is written in Java. Recently I improved it so that it is a bit more Mac-friendly, since that’s increasingly becoming my preferred workstation choice, but I am frequently reminded of how much better it would be to have a native Mac replacement. The SketchEl project is extremely bare-bones in terms of algorithmic capacity: the datasheet editor is the absolute minimum necessary to make data entry and collection viewing reasonably straightforward for simple cases. Since the main functionality was finished up many years ago, the algorithms that are available within the proprietary codebases of Molecular Materials Informatics have a great deal of potential to augment it.

So don’t expect XMDS to appear in the Mac appstore in the next few weeks, but depending on a variety of different factors, it may see the light of day in the conceivably near future. Stranger things have happened.


Online file sharing with iOS 8: upgrading the Green Lab Notebook app

Mobile apps for iOS have always been able to share files by a variety of different mechanisms, but many of these were limited in ways that were very detrimental to the user experience. The Green Lab Notebook app is now catching up to the new technology introduced with iOS 8: using the “document picker” interface to import and export files to document providers, which immediately makes it fully interoperable with iCloud, and file sharing services like Dropbox.


Structure property calculation in apps: MMDS

An important milestone has been reached in the migration of complicated structure-based calculations to pure mobile. The latest version of MMDS (1.5.9) is now available on the AppStore, and allows visualisation of calculated properties for individual molecules, as well as calculating new columns for entire datasheets.

The previous post described how recent porting of core technology (e.g. substructure query fragment searching) to Objective-C and iOS has opened the door to a variety of calculation types, including atom type-based contribution methods, while the post before that described how the porting of modern fingerprint types has enabled Bayesian models to be used. These progressions are significant, because the previous method of choice for carrying out difficult (or resource intensive) calculations was to hand off the data to a webservice, and await a response. The two technical arguments in favour of taking this approach are slowly but surely eroding: as device capabilities improve, the performance argument becomes less compelling, and as the difficult algorithms are migrated to Apple’s unique and incompatible-with-everything-else development tools, the argument that the code only exists on the server side stops being a problem as well.

The current version of the Mobile Molecular DataSheet has had two major features retrofitted: bringing up the calculation panel for an individual molecule now displays a wealth of calculated information, some of it in the form of numbers, some as graphically annotated structures. The calculation panel for whole datasheets offers the option for calculating scalar properties for each row, and storing the results in columns, which means that MMDS can be used as a calculation engine for other uses (e.g. QSAR or various kinds of visualisation). The individual molecule property calculation uses the same code as the MolPrime+ app, which received these new features first (but the second instalment is still awaiting review and should be available soon).

The properties that are now available and can be calculated locally on the app, with no internet connection or security concerns, include:

  • Easy-to-calculate scalar properties: molecular formula/weight, # heavy atoms, H-acceptors/donors, # rotatable bonds.
  • Log P & molar refractivity: both calculated by an atom contribution method (published by Crippen way back when), which requires implementation of substructure searching.
  • Bad valences: reviews the valence counts for each of the main group atoms and reports egregious mistakes (e.g. pentavalent carbon).
  • Stereochemistry: sites for R/S and E/Z stereochemistry are identified and their labels calculated, with unspecified or known ambiguous cases classified appropriately.
  • Tautomers: common H-shifts are identified and the complete list of tautomeric forms is enumerated, with duplicate equivalent molecules removed, and racemised stereocentres labelled accordingly.
  • PAINS filters: the original set of queries for identifying frequent hitters and other high throughput screening problem compounds is applied, and any matches are identified.
  • Mass distribution: the isotope distribution is calculated for integral masses, as well as the exact mass for the base peak.
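To give a feel for what the mass distribution calculation involves, here is a minimal Python sketch. It is not the code used in the apps: it assumes a formula supplied as a dictionary of element counts, covers only four elements with approximate natural abundances, and handles only the integral-mass distribution (not the exact mass of the base peak). The function names `convolve` and `isotope_distribution` are made up for illustration.

```python
from collections import defaultdict

# Approximate natural isotope abundances, keyed by integral mass number.
# Illustrative subset only; a real implementation needs the full periodic table.
ISOTOPES = {
    "H": {1: 0.999885, 2: 0.000115},
    "C": {12: 0.9893, 13: 0.0107},
    "N": {14: 0.99636, 15: 0.00364},
    "O": {16: 0.99757, 17: 0.00038, 18: 0.00205},
}


def convolve(dist_a, dist_b):
    """Combine two mass distributions by summing masses and multiplying odds."""
    out = defaultdict(float)
    for mass_a, prob_a in dist_a.items():
        for mass_b, prob_b in dist_b.items():
            out[mass_a + mass_b] += prob_a * prob_b
    return dict(out)


def isotope_distribution(formula, cutoff=1e-6):
    """Return {integral mass: probability} for a formula like {"C": 1, "H": 4}."""
    dist = {0: 1.0}
    for element, count in formula.items():
        for _ in range(count):
            dist = convolve(dist, ISOTOPES[element])
            # Drop negligible peaks so the table stays small for big molecules.
            dist = {m: p for m, p in dist.items() if p >= cutoff}
    return dict(sorted(dist.items()))


if __name__ == "__main__":
    methane = isotope_distribution({"C": 1, "H": 4})
    base_peak = max(methane, key=methane.get)
    print("base peak at integral mass", base_peak)
    for mass, prob in methane.items():
        print(mass, round(prob, 6))
```

Adding one atom at a time is the simplest approach and is fine for drug-sized molecules; the pruning step trades a tiny amount of total probability for a much shorter peak list.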

The presentation of these calculated properties varies for single molecule vs. whole datasheet modes. Some numeric properties are displayed as such, alongside the structure:

[Screenshots: mmds_propcalcs01, mmds_propcalcs02]

Properties that are not inherently scalar are shown as structure overlays, for example valence mistakes which are identified in red:



Stereochemistry is also shown with the labels overlaid on top of the structure diagram:


Stereochemistry also gives rise to descriptive fields: a count of the number of known/unknown stereocentres, and the concept of “stereoambiguity”, which is essentially 2^(-N), where N is the number of unresolved stereocentres (so a compound with one unlabelled chiral or alkene-like stereocentre would have a value of 0.5, whereas a compound with two would be 0.25, etc.). This is the beginning of an idea referred to as “confidence in chemistry”, which you are likely to be hearing more about soon from Chris Lipinski.

Tautomers are presented as an enumerated collection of compounds, when they apply:


Visually they are shown as a scrollable graphic, and in each case the atoms that are affected by a tautomer shift are highlighted. Any stereocentres that were distinctive in one of the tautomeric forms but have been the subject of one of the tautomer shifts are denoted as being ambiguous. For numeric purposes, a similar idea of “tautomer ambiguity” is calculated, with the confidence being 1/N, where N is the total number of tautomeric forms.
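The two ambiguity scores are simple enough to write down directly. The Python sketch below reads the stereoambiguity value as 2 to the power of minus N, which matches the worked values of 0.5 and 0.25 given earlier, and the tautomer confidence as 1/N. The function names are made up for illustration and are not taken from the app.

```python
def stereo_ambiguity(unresolved_stereocentres):
    """2^(-N), where N is the number of unresolved stereocentres."""
    if unresolved_stereocentres < 0:
        raise ValueError("count of stereocentres cannot be negative")
    return 2.0 ** -unresolved_stereocentres


def tautomer_confidence(tautomer_count):
    """1/N, where N is the total number of tautomeric forms (at least 1)."""
    if tautomer_count < 1:
        raise ValueError("a molecule always has at least one tautomeric form")
    return 1.0 / tautomer_count


if __name__ == "__main__":
    for n in range(4):
        print(n, "unresolved stereocentres:", stereo_ambiguity(n))
    for n in (1, 2, 4):
        print(n, "tautomeric forms:", tautomer_confidence(n))
```

Both scores equal 1 for a fully specified structure and shrink toward zero as the number of possible interpretations grows, which is what makes them usable as columns in a datasheet.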

It should be noted at this point that some of these properties take some time to calculate. When the property viewing dialog is opened, it initiates a background thread, which grinds away at producing the results. Some of the properties are fast (e.g. molecular formula), some of them are just slow enough that they would glitch the user interface if they were not put in the background (e.g. log P requires a number of substructure matches), and some of them can take seconds, depending on molecule size and how new your device is, which in particular applies to the PAINS filters. These are built from a collection of SMARTS strings (upconverted to connection table queries, with the meta-sub-fragments expanded out, the total is close to a thousand) which all have to be run against the current molecule. The dialog panel updates as and when each of these becomes available:
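The pattern described here (push the calculations onto a background thread, then report each result as soon as it is ready) can be sketched in Python as follows. This is only an illustration of the idea, not the Objective-C code in the app: the three calculator functions are hypothetical placeholders that sleep to imitate fast, medium and slow work, and a real user interface would dispatch the callback onto its main thread instead of blocking the caller.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


# Hypothetical stand-ins for real property calculators.
def calc_formula(molecule):
    return molecule["formula"]            # effectively instant


def calc_logp(molecule):
    time.sleep(0.01)                      # imitates a batch of substructure matches
    return "log P placeholder"


def calc_pains(molecule):
    time.sleep(0.05)                      # imitates running ~1000 queries
    return []                             # no filters matched


CALCULATORS = {"formula": calc_formula, "logP": calc_logp, "PAINS": calc_pains}


def run_calculations(molecule, calculators, on_result):
    """Run every calculator on one background thread, reporting as each finishes."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = {pool.submit(fn, molecule): name for name, fn in calculators.items()}
        for future in as_completed(pending):
            on_result(pending[future], future.result())


if __name__ == "__main__":
    run_calculations(
        {"formula": "C2H6O"},
        CALCULATORS,
        lambda name, value: print(name, "ready:", value),
    )
```

Using a single worker keeps the slow PAINS pass from competing with the quick properties, so the cheap results tend to show up first.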


As for tautomers, the PAINS matches are rendered for individual molecules as a side-scrolling collection. Most compounds hit zero-or-one of these filters, but it is possible to create molecules that hit many of them, and symmetry can crank it up further.

Calculating properties for a datasheet involves a less graphical setup dialog:


It consists of a checkbox for each of the properties that are desired. By default everything is on, except for the slowest calculation type (PAINS). Note also that the screenshot above shows an obscured section underneath, with the heading: Bayesian Models. This is the next major extension to the MMDS app, and it’s coming soon.



PAINS filters now on mobile, with MolPrime+

One of the trends that you should expect to see more of from apps produced by Molecular Materials Informatics is a shift toward performing more advanced calculations internally on the mobile device, rather than calling out to a cloud service. One of the recent demonstrations was shown with the Approved Drugs app, which can now call up Bayesian models for various predictions. The next version of MolPrime+ that is awaiting review on the AppStore incorporates internal calculation of log P, and also brings the ability to apply the PAINS filters to molecular structures.


Incorporating Bayesian models into the Approved Drugs app

Some interesting experimental features are on their way to the Approved Drugs app, which has been in the crosshairs for expanded functionality recently. The latest round of improvements involves the ability to collect custom-drawn structures, and apply Bayesian models for predicting the presence of “bad drug” properties, both for molecules overall, and for individual atoms.


2014 redux

The year of 2014 is almost over, so it’s time to write the summary. The year started by taking a few unexpected turns that I could not have predicted, but things turned out well, and a lot got done.

The original game plan for 2014 was to buckle down and finish building some of the technologies and products that had been on the roadmap for a little too long for my liking. Since the previous year had involved a great deal of travel and activities other than sitting down and building scientific software, I found myself in the unfamiliar scenario of having promised more than I delivered. Perhaps I will get used to this someday, but I still find it rather disconcerting.

One of the first new pieces of functionality for the year resulted from discussions with some of the OpenPHACTS team at the NETTAB meeting in Venice several months previous, where I found out that the project has an open API that allows searching of marked up assay data for compounds. It is now possible to use the MetaSearch feature of the Mobile Molecular DataSheet app to fetch this data, for specific molecules or collections of molecules. The interface is slightly unwieldy in cases where there is a lot of information (e.g. looking up all the data for an ancient drug like aspirin), but it is quite functional.

Following up from a project with Collaborative Drug Discovery, I finished the delivery of an implementation of the “circular fingerprints” commonly referred to by names such as “ECFP_6” or “FCFP_6”. This was made available for use in the Chemistry Development Kit, and so is now in the hands of everyone who uses this open source toolkit. This is not the first open implementation of this fingerprint type, but it is novel in that the algorithm is self-contained within a single code module, which means it is quite easy to port it to a different platform or toolkit and have it produce the exact same results. This kind of portability is a major and understated feature, because it means that models built with one software stack can be evaluated and applied with a variety of different software packages.

Following up the release of the fingerprint implementation, the algorithm was promptly ported to Objective-C and delivered as part of v2 of the TB Mobile app. This major release included a user interface overhaul (for iOS 7), a fair bit more curated data, and a similarity ordering system that uses the portable ECFP6 fingerprints. Furthermore, the data preparation stage for the app includes building a Bayesian model for each of the TB targets. The user can draw or import structures and view the predicted likelihood of activity against the various known targets. Also, a novelty feature was added to the app: dynamic clustering, in which a selected compound is represented graphically in the middle of a 2D cluster. Needless to say the similarity metric uses ECFP6 fingerprints.
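For readers who have not seen fingerprint-based similarity ordering before, here is a minimal Python sketch. It assumes each fingerprint is a set of integer feature hashes and that the comparison is the Tanimoto coefficient, which is a common choice but is my assumption here rather than something stated above; the function names and the toy library are invented for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints held as sets of hashes."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_similarity(query, library):
    """Return library keys ordered from most to least similar to the query."""
    return sorted(library, key=lambda name: tanimoto(query, library[name]), reverse=True)


if __name__ == "__main__":
    # Made-up feature hashes standing in for ECFP6 output.
    library = {"a": {1, 2, 3}, "b": {1, 2, 9}, "c": {7, 8}}
    print(rank_by_similarity({1, 2, 3}, library))
```

Because the fingerprints are just sets of integers, the same comparison gives identical answers on any platform that produces identical hashes, which is the point of having a portable implementation.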

Having delivered these tools, as well as the basis for the Bayesian modelling feature in CDD Vault, we wrote up a paper for the Journal of Cheminformatics, which describes the underlying technology and its uses. This is an open access journal, so no need to worry about getting stopped at a paywall.

Not long afterward, I skipped out on the last weeks of the Montreal winter and headed for San Francisco for a sabbatical with Collaborative Drug Discovery, who had a research project that needed to be executed with a certain degree of urgency: analysing biological assay data of the textual variety and marking it up in an ontological form. While it has been a few years since I last did any natural language analysis, it’s always good to sharpen up some old skills and try something outside of one’s comfort zone every once in a while, and access to more biological assay data is quite germane to my usual activities. The project went rather well: using a prepackaged open source library took care of the lexical parsing, and since I had been working with Bayesian models recently, that’s what I tried feeding it into first. To a man with a hammer every problem looks like a nail, and in this case it paid off. You can read all about it in the paper we published in PeerJ.

Speaking of PeerJ, it was my pleasure to contribute to this journal, which I have been watching for a while. They are a company that is trying to disrupt the scientific publishing industry by operating with low overheads, so scientists are not presented with a choice between two evils: expensive author-pays vs. all readers pay. Instead, authors pay a nominal fee, readers pay nothing, and the process has a higher degree of automation than most journals. The journal is mainly focused on biology, but this paper was able to squeeze in because it could be described as bioinformatics.

In 2014 I only attended one scientific conference, the American Chemical Society meeting in Dallas. I gave two talks, one being about cloud computing with mobile apps, and the other describing my Green Lab Notebook (GLN) app, which was at the time still vapourware. Once I returned from California, finishing the app was a high priority, and it was released shortly after. The GLN app has two main notable capabilities: it can represent multistep reactions in an informatically pure way that has a very high degree of machine readability, and it has always-on calculation of green chemistry metrics (process mass intensity, E-factor, atom economy) which should ideally become as important to chemists as characteristics like yield and scale. The GLN app is currently quite powerful, but the list of functionality still yet to implement is rather long.
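For anyone unfamiliar with the three green chemistry metrics, they follow from simple mass bookkeeping. The Python sketch below uses the standard textbook definitions; it is not taken from the GLN app, it assumes everything that goes in and is not product counts as waste, and the masses in the example are invented.

```python
def process_mass_intensity(input_masses, product_mass):
    """Total mass of everything put into the process per unit mass of product."""
    return sum(input_masses) / product_mass


def e_factor(input_masses, product_mass):
    """Mass of waste per unit mass of product (all non-product input is waste)."""
    return (sum(input_masses) - product_mass) / product_mass


def atom_economy(reactant_weights, product_weight):
    """Percentage of reactant molecular weight that ends up in the product."""
    return 100.0 * product_weight / sum(reactant_weights)


if __name__ == "__main__":
    # Invented process: 100 g of reagents, solvents, etc. giving 5 g of product.
    inputs = [10.0, 8.0, 82.0]
    print("PMI:", process_mass_intensity(inputs, 5.0))
    print("E-factor:", e_factor(inputs, 5.0))
    # Acetic acid + ethanol -> ethyl acetate (+ water), molecular weights in g/mol.
    print("Atom economy (%):", round(atom_economy([60.05, 46.07], 88.11), 2))
```

Under these definitions the E-factor is always the process mass intensity minus one, which is a handy sanity check on any implementation.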

Unfortunately I returned from San Francisco a week too soon to be able to attend the autumn American Chemical Society meeting, which was in town. The reason I couldn’t make it was the timing of my birthday, which was an important one this year, because I turned 40. Since I had not been back to my home country of New Zealand for a long time, and because the west coast is closer to the south Pacific, I figured it was time to make the pilgrimage.

Prior to my sabbatical, I had agreed to write a book chapter about green chemistry in drug discovery informatics. The deadline being what it was, I essentially had a month to write the whole thing, in between working on a major new research project, and figuring out the logistics of living in San Francisco for several months. The chapter is finished and revised, and it should be coming out sometime next year, published by the Royal Society of Chemistry (RSC).

Speaking of the RSC, the organisation commissioned an overhaul of the popular ChemSpider Mobile app, which I built for them some years ago. The original 1.0 version erred on the side of being quite simple: nobody knew whether it would take off, so we assembled it quickly just to get something out there. A portion of the summer was spent re-skinning the app to make it look nice on iOS 7, and to give it the degree of functionality it deserves: advanced searches, result saving, detailed previews, etc. Also during the upgrading of the app, we uncovered a few glitches with the underlying API which were fixed, and so the searching process is much faster and way more reliable than it was. All things considered, I’m very happy with the way it turned out.

And speaking of the RSC once again, there is another project that I have contributed to: the Medicinal Chemistry Toolkit app. This product is currently on the iTunes AppStore, supporting a physical textbook, and there will be a new version coming out soon that has chemical structure capabilities. This will allow the user to sketch out a structure and observe properties being calculated, like logP, ligand efficiency, lipophilic efficiency, and a graphical rendition of the AstraZeneca “ugly structure” filters. While the sketcher tools are taken straight out of my preexisting libraries, some of the calculations required technology to be ported from my serverside (Java) libraries into Objective-C, so that cheminformatics calculations could be done in cases where calling out to a webservice is not allowed. This took most of the time, but now it is done, and apps just got a fair bit more sophisticated: my core libraries now include substructure query searches, as well as the aforementioned circular fingerprint generation and Bayesian modelling.

Another project that got started before my sabbatical, and had to be finished after it was over, is the Valence app. This is co-designed with TouchText LLC, and is my first attempt to create an app that is designed exclusively for the education market, as opposed to dual purpose efforts to interest students and professionals both. The Valence app introduces students of any age to the concepts of the Lewis octet rule (no relation to Lisa Lewis, my partner in crime for this project). The app was announced at the International Conference on Chemistry Education (ICCE) in Toronto, where I was originally intending to present in person, but at the time I was on the other side of the continent. The Green Lab Notebook app was also preannounced, by proxy.

Halfway through 2014, Apple announced two technologies that caught my attention: Swift and iCloud Drive. Normally I tend not to be an early adopter of Apple’s latest [not necessarily greatest], and wait at least a few months to start updating tools and products. There are always plenty of other eager beavers out there with nothing better to do than break their teeth on the newest of the new, but this time around I joined them, and unfortunately regretted it. The Swift programming language looked like it would be rather awesome, and replace Objective-C, which is rather the opposite of awesome. It turns out that it needs to go back in the oven for quite a while. Same with iCloud Drive: it was promised to be basically Dropbox integrated into all of Apple’s products, but it seems to have some undocumented limitations (like not working, for example). I suspect both of these technologies will be long term winners, but that’s for the future, not the present.

In order to test the Swift language, I at least had the sense to refrain from updating my main codebase, and instead created a novelty new app, called the Beer Lab Notebook. The name is a tongue-in-cheek suggestion that it’s a companion to the Green Lab Notebook app, to the extent that these days the closest I get to my practical laboratory roots is by practicing my hobby of zymurgy.

After checking off some of the major projects that I had intended to release much sooner, I turned my attention to the first and most fundamental technology that my company created: the gesture-based chemical structure sketcher that is the basis of all the mobile products that I have created. Since the beginning I have struggled with the paradox of needing to redesign the familiar user interface paradigm in order to make sketching molecules excellent on a touchscreen. If you try to reproduce the standard toolbox from the desktop era, the result will be unusable on a phone and mediocre on a tablet. If you accept the redesigned approach that I came up with in 2010, it will be great on any form factor, but requires learning some new ideas. And even though I was under no illusions about people’s resistance to learning new software interfaces, the mobile space is even less forgiving than one might think. Over the years I’ve tried various ways to inject tips, tutorials, training modes, etc., but have never been very satisfied with the way they worked. And so these concepts (tips/tutorials/training) have now been integrated more gracefully into the sketcher interface, in order to better ease the discordance of being thrown into a different way of doing things. This will no doubt be an ongoing process.

On the subject of sketchers, I have been finding myself using a Mac more regularly as my desktop of choice, and I regularly use my own open source sketcher (SketchEl) for method development. After much procrastination, I put in some work to make it behave better with Apple’s dissenting approach to things like keyboard shortcuts, and various other issues that make Java almost (but not quite) platform independent.

While working at CDD earlier in the year, I helped out with and was included as a co-author on a paper with Sean Ekins & Chris Lipinski. One of the discussion points was regarding a collection of “probe molecules” that came out of a major NIH research effort. This paper recently got mentioned on In The Pipeline. I have made the curated probe compounds available as part of the Approved Drugs app, which now has two built-in collections of molecules (drugs and probes).

The Approved Drugs app is slated for a number of new functionalities, which will appear early next year. For one, the fingerprints have been switched over to use ECFP6, and soon there will be a 3rd collection for custom-defined molecules, similar to the way this is currently done with TB Mobile. The intention is to include a number of Bayesian models for common medicinally relevant properties, such as structure “ugliness”, toxicity, ADME, etc., and make it incredibly easy to evaluate and visualise the disagreeable molecular features. This is part of a general trend toward distilling data that requires an expert to curate, fix and model, and repackaging it as part of an easy to use mobile app.
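To show how little machinery is needed to apply a fingerprint-based Bayesian model on a device, here is a minimal Python sketch. It assumes the Laplacian-corrected naive Bayesian formulation that is widely used with circular fingerprints; whether the app uses exactly this form is my assumption. Fingerprints are represented as sets of integer hashes, and the function names and training data are invented for illustration.

```python
import math
from collections import Counter


def build_model(training):
    """training: list of (set of feature hashes, is_active) pairs.

    Each feature gets a contribution log((A + 1) / (T * P + 1)), where T is the
    number of molecules containing it, A the number of those that are active,
    and P the overall fraction of actives in the training set.
    """
    base_rate = sum(1 for _, active in training if active) / len(training)
    seen, hits = Counter(), Counter()
    for features, active in training:
        for feature in features:
            seen[feature] += 1
            if active:
                hits[feature] += 1
    return {f: math.log((hits[f] + 1) / (seen[f] * base_rate + 1)) for f in seen}


def predict(model, features):
    """Sum the contributions of known features; unseen features contribute zero."""
    return sum(model.get(f, 0.0) for f in features)


if __name__ == "__main__":
    training = [({1, 2}, True), ({1, 3}, False), ({2, 4}, True), ({3, 4}, False)]
    model = build_model(training)
    print("feature 2 only:", predict(model, {2}))
    print("feature 3 only:", predict(model, {3}))
```

The trained model is just a table of per-feature numbers, so it can be prepared on powerful hardware and shipped inside an app, and the per-feature contributions can be mapped back onto atoms to highlight which parts of a structure drive a prediction.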

As per usual, the year of 2015 includes a long list of projects that are in progress, or are just ideas. Expect mobile apps to provide more calculations, and to a large extent these will be carried out locally on the devices, using models and datasets that were prepared using powerful hardware. Interfaces will continue to improve, and the percentage of chemistry tasks that can be done without using a desktop/laptop computer will continue to increase. Expect Apple technology to be a wildcard, as with each passing year more of the deliberate restrictions on what apps can or can’t do are lifted, along with the hardware becoming more powerful at an incredible pace.


