There has been a fair bit of chatter in the blogosphere lately about the perennial problem of low-quality chemical data, e.g. chemical structures that do not adequately describe the material they claim to represent, messed-up names, broken references, mistaken or underspecified accompanying data, etc.
This latest round of discussion seems to have been triggered by the recent publication of the so-called NCGC Pharmaceutical Collection, which claims to be definitive but may in fact be unusually bad, to the point that even practitioners who have no illusions about the quality of the data in general use are noticeably unimpressed (e.g. ChemConnector, In The Pipeline). While I haven’t personally examined the offending dataset, this is as good an occasion as any to weigh in with my long-held opinion on how things got so bad that chemical data wranglers routinely expect between 5 and 20% of their source data to be junk.


