5 kinds of open: source, data, algorithms, formats and protocols

Over the years I have often found myself favourably inclined towards the principles of openness in the field of software in general, and scientific software in particular. While I use many open source products on a daily basis, and have personally contributed to the free software gestalt, openness comes in many different forms, and they are not all applicable to all situations.

While openness in general is an admirable goal, it is simply not fair or reasonable to demand that people or companies practice it when it is clearly not in their best interests.

This article is a series of musings about several categories, and the extent to which I am inclined to endorse them by putting my money where my mouth is, given that my full attention these days is focused on running Molecular Materials Informatics, Inc. (MMI), which is a for-profit corporate entity.

Open Source

When the source code for computer software is made available to anyone who wants to see it, under conditions that allow people to build their own software from it, it is commonly referred to as open source. Some people prefer the term free software, which is really the same thing, except that it comes with a serving of ideology on the side. There are a lot of types of licenses that give different flavours to open source: some are free with no restrictions whatsoever (public domain), others are designed to prevent the source from ever becoming closed (copyleft), and others allow inclusion in closed products as long as credit is given (permissive).

From the point of view of someone who wants to be compensated for their hard work and innovation, the critical point is that releasing a product as open source means that you will not be able to charge customers for having the product – that is, unless they are a bit gullible, because they could legally obtain it for free. But you can charge them for quite literally anything else, including installation, customisation, support, or any other business model you can think of. You just can’t stop anyone else from doing likewise.

There are a number of companies in existence making a healthy profit after having released their products as open source, but it would probably be fair to say that most of these companies tend to be significantly less avaricious than average, as they have forfeited most of the traditional gouging opportunities. It is certainly a neat and tidy notion that open source companies have to work harder and provide better service to maintain their position, but my personal take on the matter is that business models based on open source products are more precarious: the possible ways to make a business successful, i.e. earning more than spending, are fewer, which is why proprietary software continues to thrive.

At the present time, my company does not officially provide any open source products, but as an individual, I continue to maintain one open source project that I have been working on in my spare time for some years: SketchEl. This project is not something I would describe as innovative, and I would not want it to be held up as an example of my best programming, but I find it to be very useful. Also, the open data formats that it uses have been incorporated into all MMI products, and SketchEl serves as the original open source implementation of these formats.

Open Data

Making data openly available to anyone who thinks they might have a use for it is a wonderful idea for any software domain, and particularly so in the sciences. There are some interesting ideological debates going on about how, when, and how much data should be open.

On one end of the scale, there is proprietary data that is key to a competitive advantage (e.g. drug candidates, material compositions, etc.), and there are obvious reasons for keeping it secret. On the other end there are transparency advocates, who would like to see more people making all of the scientific facts that they gather available to the world, in more or less real time.

Since my company does not presently own any data of consequence, I can speculate on this with relative impunity. For what it’s worth, I advocate being selectively open, and erring on the side of generosity. When you publish supplementary data with a paper, include everything that is likely to be useful, rather than holding back in the interests of brevity, but don’t include everything just for the sake of it. A lot of data is just not useful, and the scientist who generated it is usually the person most qualified to make that call.

As a general rule, any data that you have that could be of use to another scientist, and is not unambiguously associated with a competitive advantage that is necessary to make sure you continue to get paid to do what you do, should be a candidate for open data.

Open Algorithms

Computational chemistry is a field that includes many implemented algorithms, contained within software products with all manner of different licenses. There are all kinds of tricks and techniques that can usefully be described as distinct algorithms: for example, an algorithm to find the smallest set of smallest rings within a molecular connection table. A number of publications in the literature document algorithms for this calculation, describing them with a mixture of words, diagrams, mathematical notation and pseudocode.

This could be described as an open algorithm when anyone who has a copy of the paper is free to go ahead and implement it, using whatever programming language or toolkit suits them, without having to pay royalties, or ask for permission, or invent the whole technique from scratch.
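As a small illustration of what implementing from a published description looks like, here is a minimal Python sketch. It is not one of the published ring perception algorithms, and it does not find the rings themselves: it only computes how many rings a smallest set of smallest rings must contain, using the standard graph relation (bonds minus atoms plus connected components). The function name and the connection table layout (an atom count plus a list of bonded atom index pairs) are assumptions made for this example.

```python
def ring_count(num_atoms, bonds):
    """Return how many rings a smallest set of smallest rings contains.

    Assumed input layout (hypothetical, for illustration only):
    num_atoms is the atom count, bonds is a list of (i, j) pairs of
    0-based atom indices, with each bond listed exactly once.
    """
    parent = list(range(num_atoms))

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Count connected components by merging the two ends of every bond.
    components = num_atoms
    for i, j in bonds:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1

    # Cyclomatic number of the molecular graph.
    return len(bonds) - num_atoms + components


# Naphthalene skeleton: two six-membered rings sharing the bond 4-5.
naphthalene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
               (4, 6), (6, 7), (7, 8), (8, 9), (9, 5)]
print(ring_count(10, naphthalene))
```

Even this trivial piece has to decide how to represent the molecule and how to handle disconnected fragments, which hints at how much more work sits between a paper and a finished product.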

The open algorithm is an excellent compromise between the need to be competitive and the benefits of sharing.

Last year (2010) I published a collection of algorithms in the Journal of Cheminformatics, which describe the most important intellectual property that MMI has, putting it up for display in full gory detail for anyone to copy, borrow, be inspired by or just plain rip off. The cornerstone of my company’s business model is now formally based on open algorithms. I did not do this to be nice. I published the algorithms for a number of reasons, but the first one is: they’re hard to implement. I want other scientists to know this, and I want them to know that I am still serious about being a part of the scientific community.

An algorithm is not a product. Taking a published algorithm and implementing it takes a lot of time and work: it has to be built, tested, validated, wrapped into a user interface, tested by actual users, packaged into a final product, and finally sold. And if the person or institution who published the algorithm is already selling just such a product, the second implementation has to be better in order to gain market traction when the inventor has such a head start.

And if the second product is better, the original inventor can scramble to copy the new innovations, in order to stay in the game. The original product improves, the customers have choice, and innovation keeps happening.

That sounds a bit idealistic, and it probably is. But the bottom line is: what are you afraid of? If you have a new algorithm that is important, you should consider taking the time to publish it. If you intend to keep innovating, increased competition should not stop you from being scientifically open.

Open Formats

When it comes to the file formats used by applications to store data, my thoughts become somewhat more than conditional suggestions. In the scientific software field, it would be hard to think of a situation in which a software vendor could justify storing data in an intractable proprietary format.

By this I do not mean to include minor indiscretions like failing to document a trivial configuration file, or simply having not yet found the time to document some particular aspect of a file format that is relatively easy to figure out anyway. I would not resort to that level of hypocrisy, but as it stands, the main file formats used by MMI are open. They are documented and unconstrained: anyone can read up on them, take a look at a few examples, and implement readers or writers, for their own independent purposes, or for interaction with MMI products.

The two formats that MMI products use most extensively are the SketchEl molecule format, and the XML-based molecular datasheet format, both of which were originally developed for use with the open source SketchEl project, mentioned previously.

The scientific software community has generally been quite good about making file format specifications open and unencumbered, which stands in contrast to the rest of the software industry. Many chemical data formats are quite well documented, and I am not aware of any that come with legal restrictions.

Openness with regard to formats also comes with an obligation to be reasonably consistent: making major changes to a format with each new release of the primary application, such that all other software breaks, decreases the effective value of an open format.

Using open formats means that data is not locked into any particular application: as long as a transport mechanism exists, a user is free to use it with other applications. If the original creator of an open format produces an application that makes use of it, the user community may decide to use a product from a competitor instead of, or as well as, the original. Or they may build their own software to augment or replace it, or they may decide to contribute to an open source effort.

Making use of proprietary formats to induce lock-in was once a popular tactic in the software industry, but nowadays it is usually recognised for what it is: anti-competitive. The opposite approach – using and encouraging use of open formats – is conducive to building a software ecosystem around your product. And this makes sense for anyone, whether the objective is fame, wealth, or both.

Open Protocols

Whenever two software applications need to share data with each other, they need to involve some combination of protocols. Sometimes this is a matter of saving a file using a commonly understood format, but there are many cases where interprocess communication is necessary to join applications together to complete a workflow.

Web services are a common and useful example. There are a great many web servers sitting around listening for a client to make a request for some functionality. Many such services are restricted by password-protected accounts, for good reasons, and the level of documentation and public accessibility varies considerably.

Opening up protocols, by way of documentation, allows applications that were originally designed just to talk to each other to engage with additional applications not necessarily intended by the original inventors. Like all other facets of openness, this allows clients and servers to be swapped in or out, rather than going for the lock-in approach.

As with open formats, it is not enough just to document a protocol: it has to be robust enough to be useful, and preferably backward compatible. Its behaviour needs to be defined fairly rigorously, so that expectations are met, and software does not cease to function because the communication transport mechanism was inadequately specified.
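To make the backward compatibility point concrete, here is a minimal Python sketch of a request handler for an entirely hypothetical JSON protocol. It is not the MMI protocol: the field names ("version", "name", "status", "reason", "echo") and the versioning rule are assumptions invented for this example. The idea it shows is that a server accepts any minor revision of the major version it understands, ignores fields it does not recognise, and answers with a defined error rather than failing unpredictably.

```python
import json

SUPPORTED_MAJOR = "1"  # hypothetical: the only major version this server speaks


def error_reply(reason):
    # Every failure produces the same well-defined reply shape.
    return json.dumps({"status": "error", "reason": reason})


def handle_request(raw):
    """Parse a JSON request string and return a JSON reply string."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return error_reply("malformed request")
    if not isinstance(msg, dict):
        return error_reply("request must be a JSON object")

    # Older clients that omit the version are treated as 1.0 (an assumption).
    version = str(msg.get("version", "1.0"))
    if version.split(".")[0] != SUPPORTED_MAJOR:
        return error_reply("unsupported version " + version)

    # Only known fields are read, so fields added by newer minor
    # revisions of the protocol are silently ignored.
    name = msg.get("name", "")
    return json.dumps({"status": "ok", "version": "1.0", "echo": name})


# A newer client sends an extra field that this server does not know about.
print(handle_request('{"version": "1.3", "name": "benzene", "extra": 1}'))
```

Whether unknown fields should be ignored or rejected is a design choice that the protocol documentation needs to state explicitly, since both clients and servers depend on it.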

At the time of writing, products from MMI make use of one major open protocol, which is used for making web service functionality available to the Mobile Molecular DataSheet (MMDS). The protocol is defined rigorously, and changes are only made after careful consideration of the effects on current-generation implementations. Anyone can (and is strongly encouraged to) build web services of their own, and run them from MMDS. Anyone is also welcome to build their own client software.


The sum total of these musings is that I am highly in favour of all 5 of these types of openness when it comes to software, but only dogmatic about some of them. Open source is great, but if you need to build a business model around it, it may very well be unrealistic for your case, and if so, skip it. Same for open data: if it is your competitive edge, then you probably should not be giving it away unconditionally, but if not, it is better to be generous. Open algorithms, open formats and open protocols: chances are, if your reason for not doing these things is anything other than “haven’t gotten around to it yet”, you are probably not taking the high road!

One thought on “5 kinds of open: source, data, algorithms, formats and protocols”

  1. “Open source is great, but if you need to build a business model around it, it may very well be unrealistic for your case, and if so, skip it. Same for open data: if it is your competitive edge, then you probably should not be giving it away unconditionally, but if not, it is better to be generous.”

    Indeed, it makes no sense to put yourself out of business. This is also why I think academic researchers are in a luxurious position here. ODOSOS (or OSODOAOFOP, and I would aggregate the latter under ‘Standards’) is not core business in life sciences (the paper is, particularly about things we see around us), and can therefore be Open.

    For businesses that is indeed different, but it is indeed important to stress that commercial entities too can take OSODOAOFOP to their advantage and strengthen their business, by sharing burden, taking advantage of a community (which ‘free for academics’ does too, though, until the scholar realizes that without OAOFOP they end up in a vendor lock-in), etc.
