It’s been awhile since I last had the task of maintaining a chemical data warehouse using an SQL relational database. That’s not exactly a coincidence: I put in my time and did a lot of work with Oracle and MySQL back in the day, but my takeaway conclusion was that the transactional table-based systems are profoundly unsuitable for scientific data. The recently popular wave of “noSQL” database systems (such as MongoDB) are, on the other hand, quite a natural fit.
Some of the recent developments with Open Drug Discovery Teams, and content hosting for web sharing, have necessitated that the molsync.com server trade in its former stateless purity, and run a database.
SQL, with its ACID-compliant transactions, was designed for IT infrastructure such as bank accounts or membership rolls or cell phone billing records. The table-based concepts work well, the query language can be used to simplify the programming model, the fixed table format can be carefully optimised ahead of time, and the benefits of transactionality are non-negotiable for many mission critical applications.
Science data in general, and chemistry data in particular, tends to be of a very different, non-relational form: so very often, the core database concept is a bunch of documents, the form of which can be somewhat variable, and which need to be indexed by various means, some of them not based on primitive datatypes.
A chemical database might contain a bunch of documents, which each contain some molecules, reactions, text, numeric data, and other marked-up descriptive content that is specific to higher order applications. Coming up with a relational schema to describe all of the possible ways in which the documents could be expressed is generally a nightmare, and has all kinds of problems, like for example not knowing at the beginning everything that will ever be done with it. You will find yourself fighting against SQL every step of the way, and getting very little back in return.
The schema-less approached used by database systems such as MongoDB is quite delightful: a collection is defined as a bunch of documents. Each of these documents is described by an amorphous datastructure which is analogous to a JSON-formatted variable, which is essentially a collection of dictionaries, arrays and primitive datatypes. Most datastructures can be represented in this format very naturally, with a 1:1 correspondence.
While there is no imposed schema by default, indexes and constraints can be added for performance and some rudimentary consistency. Documents can be added, located and modified as necessary. The concept of a transaction is very limited: each operation is defined to be exclusive, within its own domain, but operations are not the massive multi-table crosslinked meta-operations that are typical in the SQL world. This means that some of the burden of data concurrency is shifted to the user of the database API, rather than being implied by the nature of the queries.
This sounds like it could be a recipe for trouble, but for many scientific data storage needs, it actually is not. In fact, many cases abound when data consistency actually isn’t that important. A one in a million chance of having your bank account debited twice would be a disaster of epic proportions, but if a molecule gets duplicated every once in awhile, this may or may not be a problem – it depends on your application. And as the programmer, you can decide, rather than SQL making the decision for you.
If you’re starting a new chemical database project, and you haven’t decided on the database yet, I highly recommend you look into the noSQL option. You might just find it very worth your while.
How easy is it to set up substructure or similarity based searching?
I don’t know of any implementations, but since SQL databases don’t provide any features to help with these search types, I expect it to be same-or-better. I’ll end up building it sooner or later, so no doubt there will be a blog about it!
Funnily enough, I have come to the same conclusion, as i update my database with chemical safety data
(e.g. GHS classifications, H- and P-phrases, bp, explosion limit, storage class). The existing system run at mysql and with some tricks i try to hold the structure very flat (to avoid endless table joins), one dirty example – synonyms are in one text field separated by “;”.