The Journal of Cheminformatics is organising a Special Collection entitled “Biomedical Data Analyses Facilitated by Open Cheminformatics Workflows”, which encourages researchers to publish their workflows for gathering, preparing, curating and cleaning data. This resonates well with a growing explicit awareness within the community that data quality isn’t just an important thing, it’s the important thing.
I would wager that this increasingly bright spotlight on data quality has a lot to do with the grandiose promises of the current wave of machine learning tools for drug discovery (also known as artificial intelligence, which is what you call it when you’re pitching to venture capitalists). Fortunately, it is true that better model building algorithms, faster computers and bigger datasets have the potential to make significant contributions to drug discovery, but none of them can solve the garbage in/garbage out problem. If your input data is noisy, incomplete or poorly categorised, the best you can get out of it is a great model that is predictive of nothing.
Hence the focus on data preparation workflows, which everybody now understands to be the rate-limiting step. There are so many angles to consider, including but by no means limited to:
- good old fashioned experimental error
- volume asymmetry between hits vs. misses
- classification of assays – when is it OK to combine two different ones?
- stage of classification, e.g. bulk screening vs. followups, in vitro vs. in vivo
- missing metadata: sample solvents, target preparation, etc.
- chemotype diversity, medicinal chemist bias, catalog availability, activity cliffs
- structure maintenance problems, e.g. normalisation, salt treatment, tautomers
- data laundering errors, e.g. scientist to publication to curator to normalisation to format conversion
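To make a couple of those bullet points concrete, here is a minimal sketch of two of the cheapest checks a preparation workflow might run: flagging compounds whose replicate measurements disagree (experimental error) and counting hits vs. misses (volume asymmetry). Everything in it is an assumption made for illustration: the toy records, the compound identifiers, the pIC50 activity cutoff and the tolerated spread are all hypothetical, and a real workflow would draw them from the assay's own metadata.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (compound_id, pIC50). Not real data.
RECORDS = [
    ("cpd-1", 6.1),
    ("cpd-1", 6.3),
    ("cpd-2", 4.0),
    ("cpd-2", 7.5),  # replicate that disagrees badly with the one above
    ("cpd-3", 4.2),
    ("cpd-4", 4.1),
    ("cpd-5", 8.0),
]

ACTIVE_THRESHOLD = 6.0  # assumed cutoff separating hits from misses
MAX_SPREAD = 1.0        # assumed tolerated spread between replicates


def group_by_compound(records):
    """Collect all measured values for each compound."""
    grouped = defaultdict(list)
    for compound_id, value in records:
        grouped[compound_id].append(value)
    return dict(grouped)


def noisy_compounds(grouped, max_spread=MAX_SPREAD):
    """Return compounds whose replicates differ by more than max_spread."""
    return sorted(
        compound_id
        for compound_id, values in grouped.items()
        if max(values) - min(values) > max_spread
    )


def hit_miss_counts(grouped, threshold=ACTIVE_THRESHOLD, exclude=()):
    """Count hits and misses by mean value, skipping excluded compounds."""
    hits = misses = 0
    for compound_id, values in grouped.items():
        if compound_id in exclude:
            continue
        if mean(values) >= threshold:
            hits += 1
        else:
            misses += 1
    return hits, misses


if __name__ == "__main__":
    grouped = group_by_compound(RECORDS)
    noisy = noisy_compounds(grouped)
    hits, misses = hit_miss_counts(grouped, exclude=noisy)
    print("replicates disagree:", noisy)
    print("hits:", hits, "misses:", misses)
```

Note that simply averaging the two conflicting values for the noisy compound would quietly turn it into a miss, which is exactly the kind of decision a published workflow ought to make explicit rather than bury.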
Every organisation doing drug discovery, whether virtually or actually or some combination of both, needs to solve these problems at some scale. There are a lot of techniques that need to be applied, and numerous different approaches to solving each of the problems, each of which has its own nuances. There’s no standardised way to fix everything, at least not yet: this is what the Special Collection is trying to address. It would be great if we could pool our resources and learn from each other as much as possible.
The most important caveat is that the Journal of Cheminformatics insists on reproducibility of everything that is published between its online pages, which means that you do need to be willing to share. We’re hoping that most people think of data preparation as being at least somewhat pre-competitive. So give it some thought: if you have a workflow that solves the problem of getting your data into the right form to analyse/visualise/model, please do consider writing it up as a paper and submitting it!