Open source ECFP/FCFP circular fingerprints in CDK

As of now, the latest version of the popular open source Chemistry Development Kit (CDK) has its own implementation of the highly regarded ECFP and FCFP classes of chemical structure fingerprints (sometimes referred to as circular or Morgan fingerprints). While the general recipe for this kind of fingerprint has been available for a while, and there are a number of implementations in various toolkits, this one distinguishes itself in several ways: it has been implemented as closely as possible to the original published definition (without having access to the trade secrets that were left out of the paper); it includes resolution of chiral centres; it is freely available as open source Java code; and, last but not least, the algorithm is designed to be as portable as possible, with no major dependencies on specific programming languages or cheminformatics toolkits.

This contribution to CDK has been made by Collaborative Drug Discovery, and the implementation was carried out by yours truly. It is part of a bigger picture (which I won’t reveal just yet), but besides being available as open source, the project has already had one major positive byproduct: the recently released TB Mobile app for tuberculosis research uses an identical implementation of this species of ECFP6 fingerprints.

Because the CDK version (available in our GitHub fork, or in the latest & greatest main CDK branch), written in Java, produces fingerprints that are literally identical to the version that was coded up in Objective-C for use in iOS apps, models can be created using a Java-based desktop application or webservice, and applied on the client side by the mobile app. This is how the TB Mobile app is able to provide similarity sorting, visual clustering and target activity prediction, all by mixing precalculated reference data with dynamically calculated user-supplied data.

In case you’re not familiar with the terms ECFP6 and FCFP6, in a nutshell: the chemical structure is examined for all circular subgraphs with a diameter of up to 6 bonds (i.e. start with a single atom, and do 3 breadth-first iterations). Each of these subgraphs is assigned a hash code based on the properties of the atoms, the bonds, and where applicable, chirality. These hash codes are put through several redundancy elimination steps, and eventually converted into a list of 32-bit integers. A druglike molecule typically has from dozens to hundreds of these unique hash codes. Molecules that are structurally very similar tend to share a large number of these hash codes, and so are often compared using the Tanimoto coefficient (the number of shared hash codes divided by the number of distinct hash codes across both molecules). For ECFP-class fingerprints, the atom properties are somewhat literal (e.g. atomic number, charge, hydrogen count, etc.), whereas for the FCFP-class (“F” stands for functional) the atom characteristics are swapped out for properties that relate to ligand binding (e.g. hydrogen donor/acceptor, polarity, aromaticity, etc.), which means that different atoms often start with the same value (e.g. -OH and -NH might be considered the same).
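
To make the recipe a bit more concrete, here is a toy sketch of the general idea. It is emphatically not the CDK algorithm: the `Mol` class, the atom invariant (atomic number and heavy-atom degree only) and the multiply-by-31 hash mixing are all assumptions made up for illustration, and the redundancy elimination and chirality steps are left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy ECFP-like fingerprint; a simplified illustration, not the CDK implementation. */
class ToyCircularFingerprint {

    /** Hypothetical minimal molecule: heavy-atom atomic numbers plus adjacency lists. */
    static final class Mol {
        final int[] atomicNum;
        final List<List<Integer>> nbrs = new ArrayList<>();

        Mol(int... atomicNum) {
            this.atomicNum = atomicNum;
            for (int i = 0; i < atomicNum.length; i++) nbrs.add(new ArrayList<>());
        }

        Mol bond(int a, int b) {
            nbrs.get(a).add(b);
            nbrs.get(b).add(a);
            return this;
        }
    }

    /** Collects the unique hash codes of all atom environments up to the given radius. */
    static Set<Integer> fingerprint(Mol mol, int radius) {
        int n = mol.atomicNum.length;
        int[] current = new int[n];
        Set<Integer> codes = new TreeSet<>();

        // Iteration 0: each atom is hashed from its own properties only (assumed invariant).
        for (int i = 0; i < n; i++) {
            current[i] = 31 * mol.atomicNum[i] + mol.nbrs.get(i).size();
            codes.add(current[i]);
        }

        // Each further iteration grows every environment outward by one bond.
        for (int iter = 1; iter <= radius; iter++) {
            int[] next = new int[n];
            for (int i = 0; i < n; i++) {
                List<Integer> nb = mol.nbrs.get(i);
                int[] around = new int[nb.size()];
                for (int k = 0; k < around.length; k++) around[k] = current[nb.get(k)];
                Arrays.sort(around); // makes the result independent of atom ordering

                int h = 31 * iter + current[i];
                for (int v : around) h = 31 * h + v; // int overflow simply wraps around
                next[i] = h;
                codes.add(h);
            }
            current = next;
        }
        return codes;
    }

    /** Tanimoto coefficient: shared codes divided by distinct codes across both sets. */
    static double tanimoto(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0; // convention chosen for this sketch
        Set<Integer> shared = new TreeSet<>(a);
        shared.retainAll(b);
        return (double) shared.size() / (a.size() + b.size() - shared.size());
    }

    public static void main(String[] args) {
        Mol ethanol = new Mol(6, 6, 8).bond(0, 1).bond(1, 2);
        Mol propanol = new Mol(6, 6, 6, 8).bond(0, 1).bond(1, 2).bond(2, 3);

        // Radius 3 corresponds to diameter 6, i.e. the "6" in ECFP6.
        Set<Integer> fpA = fingerprint(ethanol, 3);
        Set<Integer> fpB = fingerprint(propanol, 3);
        System.out.println("ethanol codes:  " + fpA.size());
        System.out.println("propanol codes: " + fpB.size());
        System.out.println("Tanimoto:       " + tanimoto(fpA, fpB));
    }
}
```

Sorting the neighbour codes before mixing them in is what makes the output independent of the order in which the atoms happen to be stored. A real implementation has considerably more to worry about (bond orders, duplicate environments, stereochemistry), which is exactly where the details start to matter.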

There are many different types of graph-based fingerprints that can be used as alternative choices for various kinds of structural comparisons. The ECFP and FCFP categories have been used successfully in a number of studies, particularly for Bayesian model building. The way these fingerprints are constructed provides a good balance, giving empirically good proportionality when used for the various kinds of similarity comparisons, which has made them a popular choice for drug discovery.

Multiple software vendors have implemented their own style of circular descriptors, but there is a problem: the original invention is based on an algorithm that has been published in the literature, but the publication unfortunately leaves out key details, which makes it impossible for anyone else to implement a version that is literally compatible. That may not matter if you are doing all of your modelling with software from a single vendor, but if you want to mix and match, fingerprints generated by one package cannot be compared to fingerprints generated by another, even if the input molecules are the same and the implementation follows the same basic recipe: the numbers will be completely different.
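
A contrived example shows how little it takes. The two "vendors" below are entirely hypothetical, and neither mixing function is taken from any real package: both hash exactly the same atom properties for the heavy atoms of ethanol, and differ only in how the numbers are combined.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.IntBinaryOperator;

/** Two made-up implementations hashing identical atom properties in different ways. */
class HashDetailDemo {

    // Heavy atoms of ethanol as {atomic number, heavy-atom degree}.
    static final int[][] ETHANOL = {{6, 1}, {6, 2}, {8, 1}};

    // Both mixers are invented for illustration; neither is any real package's hash.
    static final IntBinaryOperator VENDOR_A = (z, deg) -> 31 * z + deg;
    static final IntBinaryOperator VENDOR_B = (z, deg) -> 37 * deg + z;

    /** Hashes every atom with the given mixer and returns the unique codes. */
    static Set<Integer> codes(int[][] atoms, IntBinaryOperator mixer) {
        Set<Integer> out = new TreeSet<>();
        for (int[] atom : atoms) out.add(mixer.applyAsInt(atom[0], atom[1]));
        return out;
    }

    /** Counts the codes that appear in both sets. */
    static int shared(Set<Integer> a, Set<Integer> b) {
        Set<Integer> both = new TreeSet<>(a);
        both.retainAll(b);
        return both.size();
    }

    public static void main(String[] args) {
        Set<Integer> a = codes(ETHANOL, VENDOR_A);
        Set<Integer> b = codes(ETHANOL, VENDOR_B);
        System.out.println("vendor A: " + a);
        System.out.println("vendor B: " + b);
        System.out.println("shared codes: " + shared(a, b));
    }
}
```

Each set is perfectly usable on its own, but the two have no codes in common, so a similarity calculated across them is meaningless even though the molecule and the recipe are the same.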

Because the CDK project previously did not have its own implementation, we have filled this particular hole. Anyone using software in a Java runtime environment can have access to it without having to pay anyone or ask for permission. We have put in a significant amount of elbow grease to make sure that these fingerprints pass various validation tests, and perform with an enrichment rate comparable to other implementations. But perhaps more importantly, the algorithm has been very deliberately built in a way that is relatively easy to describe in words, and is based on code that is highly self-contained. Definitions like implicit hydrogen count, aromaticity, ring blocks and chirality are minimalistic, well defined, and guaranteed never to change. This means that if you generate a list of fingerprints for a structure, you can store them in a database, and use them forever; you don’t need to version them and make sure they get rebuilt whenever one of the dependencies changes (which is a major headache with many software packages). And because the implementation is quite platform agnostic, a single source file can be translated line-by-line into a different development environment. In practice, you can use the CDK implementation to generate sample results, and then check that the transplanted version produces identical output. As mentioned previously, this has already been done and is in use by the TB Mobile app.
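
The checking step for a port can be as simple as the sketch below. The `PortCheck` class, the idea of keying the reference data by SMILES string, and the hash values themselves are all placeholders invented for this example; in practice the reference lists would be exported from the CDK implementation and the function under test would be the ported code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Compares a ported fingerprinter against stored reference output (illustrative only). */
class PortCheck {

    /** Returns the keys whose ported hash list differs from the stored reference. */
    static List<String> mismatches(Map<String, int[]> reference, Function<String, int[]> ported) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, int[]> e : reference.entrySet()) {
            // Work on sorted copies so the order of the hash codes does not matter.
            int[] want = e.getValue().clone();
            int[] got = ported.apply(e.getKey()).clone();
            Arrays.sort(want);
            Arrays.sort(got);
            if (!Arrays.equals(want, got)) bad.add(e.getKey());
        }
        return bad;
    }

    public static void main(String[] args) {
        // Placeholder numbers, NOT real ECFP6 hash codes.
        Map<String, int[]> reference = new LinkedHashMap<>();
        reference.put("CCO", new int[] {101, 202, 303});
        reference.put("CCN", new int[] {101, 404});

        // Stand-in for the ported implementation; gets the second molecule wrong on purpose.
        Function<String, int[]> ported =
                smiles -> smiles.equals("CCO") ? new int[] {303, 101, 202} : new int[] {101, 405};

        System.out.println("mismatched structures: " + mismatches(reference, ported));
    }
}
```

Because the fingerprints are meant to be identical rather than merely similar, an exact comparison like this is all that is needed: no tolerances, and no versioning of the reference data.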

We intend to explicitly document the algorithm in the scientific literature in the near future, to complement the freely available source code, but you will have to wait for that. In the meantime, if you feel brave, look for the file CircularFingerprint.java in the CDK source, under the fingerprints hierarchy.

This is also the first time I have actively worked with the CDK codebase. The project appears to be in the midst of a major overhaul, so it will be interesting to see what comes out the other end. Besides an important new class of fingerprints, that is!

3 thoughts on “Open source ECFP/FCFP circular fingerprints in CDK”

  1. Hi Alex,

    Though it’s certainly useful to have another open-source implementation of ECFP/FCFP out there, particularly for CDK users, this isn’t the first one that is as close as possible to the original paper. The RDKit Morgan fingerprints are also implemented according to the publication and, as we’ve presented a few times, generate similarity values that are extremely close to those in the commercial reference implementation. In my implementation of FCFP, I opted to use more “pharmacophoric” feature definitions, but since the RDKit implementation allows client code to provide the atom invariants, it’s trivial to get these close to the reference implementation as well.

    -greg

    1. I certainly didn’t mean to suggest that we were the first to do this (other than Java + open source); if the tone suggested otherwise, I apologise, and blame the enthusiasm of the moment! There is a philosophical difference in approach, though: as far as I can tell, RDKit offers much flexibility in the way it operates, including how the molecule is preprocessed (bond style, aromaticity, H-count, rings, chirality, isotopes). The ECFP algorithm we submitted to CDK aspires to be inflexible by design; it essentially accepts the bare bones atom/bond information, and does all of its own interpretation. (It also has an interesting method for handling localised chirality.) This makes it very resistant to changing its output if somebody tweaks a file somewhere else in the source tree, makes it much easier to describe in a single paper (on the to-do list), and also makes it quite easy to port to a different language/cheminformatics toolkit (already done more than once). I’m sure you’ve experienced what it’s like to start porting a single class to a new environment, only to end up pulling a thread that causes dozens of supporting classes and library dependencies to get dragged along for the ride, which often ends up being a dealbreaker. The idea is that 20 years from now, multiple different implementations of the same algorithm will be generating the same lists of integers for the same molecules, and my assertion is that this is valuable.
