
Friday, December 21, 2007

Christmas presents...

Our Christmas tree has not been decorated yet, but the presents are there: the BMC Bioinformatics paper on userscripts in life sciences, Bioclipse 1.2.0, a long list of blogs to rate, and a very nice overview from Wendy Warr on workflow environments, discussing and comparing different offerings like Pipeline Pilot, Taverna, and KNIME.

Userscripts
The paper on userscripts describes how Greasemonkey scripts can be used to combine different information sources (DOI:10.1186/1471-2105-8-487). A trailer:
    Background
    The web has seen an explosion of chemistry and biology related resources in the last 15 years: thousands of scientific journals, databases, wikis, blogs and resources are available with a wide variety of types of information. There is a huge need to aggregate and organise this information. However, the sheer number of resources makes it unrealistic to link them all in a centralised manner. Instead, search engines to find information in those resources flourish, and formal languages like Resource Description Framework and Web Ontology Language are increasingly used to allow linking of resources. A recent development is the use of userscripts to change the appearance of web pages, by on-the-fly modification of the web content. This opens possibilities to aggregate information and computational results from different web resources into the web page of one of those resources.

Peter et al. have been using this technology for CrystalEye too, but the paper was in a finalizing state when the userscript was announced, unfortunately.

Bioclipse 1.2.0
The other present is the Bioclipse 1.2.0 release, in which the QSAR feature is a great new addition (see my blog post the other day with an overview of blog items detailing my participation in that feature). Ola et al. have done a great job with the plot functionality, which is very nice for making scatter plots of calculated descriptors. This release is likely going to be the last one in the Bioclipse 1 series, except for bug fix releases, so it also means I can start contributing to the Bioclipse 2 series. Recent items in the Bioclipse blog show a bright future, with project-based resource handling and better scripting (R, Ruby, JavaScript, BeanShell?).

BTW, we never have presents under the tree; we have Sinterklaas.

Thursday, December 20, 2007

The molecular QSAR descriptors in the CDK

Pending the release of Bioclipse 1.2.0, Ola asked me to do some additional feature implementation for the QSAR feature, such as having the filenames as labels in the descriptor matrix. See also these earlier items: (How much more open notebook science can you get?)

But I ran into some trouble when both JOElib and CDK descriptors were selected, or rather, Ola did. Now, there is not much I plan to do on the JOElib code, but at least I could investigate the CDK code.

The QSAR descriptor framework has been published in the Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics. paper (DOI:10.2174/138161206777585274). However, while most molecular descriptors had JUnit tests for at least the calculate() method, full and proper module testing was not set up. This involves rough coverage testing and test methods for all methods in the classes.

So, I set up a new CDK module called qsarmolecular, and added the coverage test class QsarmolecularCoverageTest. This class is really short and basically only requires a module to be set up, as reflected by the line:
private final static String CLASS_LIST = "qsarmolecular.javafiles";
The actual functionality is inherited from the CoverageTest. The coverage testing requires, unlike tools like Emma for which reports are generated by Nightly, a certain naming scheme (explained in Development Tools. 1. Unit testing in CDK News 2.2).

Now, the testing of a lot of the methods in the IMolecularDescriptor and IDescriptor interfaces is actually identical for all descriptors. Therefore, I wrote a MolecularDescriptorTest and made all JUnit test classes for the molecular descriptors extend this new class. This means that by writing only 10 new tests, with 29 assert statements, for the 45 molecular descriptor classes, 450 new unit tests are run without special effort, making the total number of unit tests run each night by Nightly for trunk/ pass the 4500 mark.
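
The inheritance trick is not specific to JUnit. The following is a minimal sketch of the same pattern in Python's unittest, not the CDK code itself: the two descriptor classes, their method names and the two shared tests are hypothetical stand-ins. Each concrete test class only names the descriptor it covers and inherits the shared tests, so the number of tests run is the number of shared tests times the number of descriptor classes (10 x 45 = 450 in the CDK case).

```python
import io
import unittest


class AtomCountDescriptor:
    """Hypothetical descriptor: counts atoms in a list of element symbols."""

    def get_parameter_names(self):
        return []

    def calculate(self, molecule):
        return len(molecule)


class CarbonCountDescriptor:
    """Hypothetical descriptor: counts carbon atoms."""

    def get_parameter_names(self):
        return []

    def calculate(self, molecule):
        return sum(1 for symbol in molecule if symbol == "C")


class MolecularDescriptorTestBase:
    """Shared tests; deliberately not a TestCase so it is never run directly."""

    descriptor_class = None  # set by each concrete test class

    def setUp(self):
        self.descriptor = self.descriptor_class()

    def test_parameter_names_is_list(self):
        self.assertIsInstance(self.descriptor.get_parameter_names(), list)

    def test_calculate_returns_non_negative_number(self):
        value = self.descriptor.calculate(["C", "C", "O"])
        self.assertGreaterEqual(value, 0)


class AtomCountDescriptorTest(MolecularDescriptorTestBase, unittest.TestCase):
    descriptor_class = AtomCountDescriptor


class CarbonCountDescriptorTest(MolecularDescriptorTestBase, unittest.TestCase):
    descriptor_class = CarbonCountDescriptor


def run_inherited_tests():
    """Run all concrete test classes; return (tests run, failures + errors)."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for test_class in (AtomCountDescriptorTest, CarbonCountDescriptorTest):
        suite.addTests(loader.loadTestsFromTestCase(test_class))
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return result.testsRun, len(result.failures) + len(result.errors)
```

Calling run_inherited_tests() here runs two shared tests for each of the two descriptor classes. Adding a descriptor then costs one short test class rather than a new copy of every shared test.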

Now, this turned out to be necessary. I count 52 new failing tests, which should hit Nightly in the next 24 hours.

Wednesday, December 19, 2007

Test results for the CDK 1.0.x branch

The Chemistry Development Kit has never really been without any bugs, which is reflected in the number of failing JUnit tests. For trunk/ this is today 106 failing tests (live stats). For the stable cdk-1.0.x/ branch, however, the number of failing tests is not much lower: 64 failing tests today (live stats).

Overall, only a low percentage of the tests fail (<2% for cdk-1.0.x/ and <3% for trunk/), and, more importantly, it is particular algorithms that are typically broken. For example, in the structgen module 8 tests fail, for both CDK versions. In the cdk-1.0.x/ branch it is the valency checker code that causes quite a few fails, which I discussed in Atom typing in the CDK and which is the reason for the atom type perception refactoring in progress in trunk/ (see Evidence of Aromaticity). Not all code in trunk/ has been updated yet, and this causes quite a few failing tests for trunk/ in the reaction, qsarAtomic and qsarBond modules.

Back to the cdk-1.0.x/ branch. Previous CDK releases tended to have around 40 failing tests, so I was worried about the number of tests failing now. Maybe backported patches cause additional fails? To study that I had my machine run the JUnit tests for all revisions of the cdk-1.0.x/ branch since the branch was made in commit 8343. The result looks like:

Indeed, it is a number of backports that cause the clear increase in bugs between commit 9044 and 9058. Nothing particular that I can see, and worse, the intermediate revisions do not compile and do not have test results:
    104 9044 3731 84 73 979.709 0
    105 9045 0 0 0 0.000 0
    106 9046 0 0 0 0.000 0
    107 9047 0 0 0 0.000 0
    108 9048 0 0 0 0.000 0
    109 9049 0 0 0 0.000 0
    110 9050 0 0 0 0.000 0
    111 9051 0 0 0 0.000 0
    112 9052 0 0 0 0.000 0
    113 9053 0 0 0 0.000 0
    114 9054 0 0 0 0.000 0
    115 9055 0 0 0 0.000 0
    116 9056 0 0 0 0.000 0
    117 9057 0 0 0 0.000 0
    118 9058 3740 104 146 989.566 0
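
Comparing the two usable rows can be automated. The sketch below is a minimal illustration, not the script I used. It assumes the columns are a running index, the revision, the number of tests run, failures, errors and elapsed seconds (the last column is ignored), and it treats a row with zero tests as a revision that did not compile. Only an excerpt of the rows above is included.

```python
from typing import NamedTuple


class Run(NamedTuple):
    revision: int
    tests: int
    failures: int
    errors: int


def parse_runs(text):
    """Parse whitespace-separated rows; the column meaning is an assumption."""
    runs = []
    for line in text.strip().splitlines():
        fields = line.split()
        # assumed layout: index, revision, tests, failures, errors, seconds, ?
        runs.append(Run(int(fields[1]), int(fields[2]),
                        int(fields[3]), int(fields[4])))
    return runs


def compare_across_gaps(runs):
    """Yield (previous, current, skipped) for consecutive usable runs.

    A row with zero tests is treated as a revision that did not compile.
    """
    previous = None
    skipped = []
    for run in runs:
        if run.tests == 0:
            skipped.append(run.revision)
            continue
        if previous is not None:
            yield previous, run, skipped
        previous, skipped = run, []


# Excerpt of the table above: the two usable rows and two of the empty ones.
ROWS = """
104 9044 3731 84 73 979.709 0
105 9045 0 0 0 0.000 0
117 9057 0 0 0 0.000 0
118 9058 3740 104 146 989.566 0
"""

for before, after, skipped in compare_across_gaps(parse_runs(ROWS)):
    print(f"r{before.revision} -> r{after.revision}: "
          f"{after.failures - before.failures:+d} failures, "
          f"{after.errors - before.errors:+d} errors, "
          f"{len(skipped)} revisions without results in this excerpt")
```

Because the revisions in between have no results, the increase can only be attributed to the whole range, not to a single commit.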

I should have taken more care when merging in these patches, even though they are supposed to fix issues:
    Merged r8697: Add a method to the query atom container creator which creates an
    queryatomcontainer. This replaces each pseudoatom to an anyatom.
    Merged r8699 and r8700: Added test file by Volker (see cdk-user) for the shortest path problem;
    JUnit test provided by Volker Haehnke (haehnke - bioinformatik uni-frankfurt de), somewhat
    rewritten.
    Merged r8701: Renamed a variable to comply with http://en.wikipedia.org/wiki/Dijkstra's_algorithm
    Merged r8751: Bug fixes for bugs #1783367 'SmilesParser incorrectly assigns double bonds' and
    #1783381 'SmilesParser uses Molecule instead of IMolecule'. Test case for bug #1783367.
    Merged r8754 and r8773: Fix and test case for bug #1783547 and #1783546 'Lost aromaticity in
    SmilesParser with Biphenyl and Benzene'
    Merged r8774: Add a MDL RXN reader which uses the MDLV2000Reader instead of the MDLReader
    Merged r8775, r8776, r8777: bug fixes for #150354 #1783774 #1778479 in the SmilesParser,
    SmilesGenerator and MDLWriter/PseudoAtom.
    Merged r8791: Code for v,mass atom two digits mass atom and exception handeling
    Merged r8800: Fixed reading of MDL molfiles with exactly 12 columns (==valid) in the bond block
    Merged r8802: Made a little more memory efficient by removing unnesscary cloning operations
    Merged r8803: Fixed it so that we make a deep copy of the input molecule
    Merged r8809: Added code to work on a local copy of theinput molecule
    Merged r8811: Updated Javadocs
    Merged 8824 8821 8820 8819 8817 8816: Added code to properly work on a local copy

I'm quite sure it must be the deep-cloning fix ported from the commits 8800-8824. I already fixed a number of bugs in the IP calculation code, which still accounts for a good deal of the failing tests in the cdk-1.0.x/ branch (and affects trunk/ too), as can be seen by the drop in bugs just after the big increase:
    r9079 | egonw | 2007-10-15 13:24:10 +0200 (Mon, 15 Oct 2007) | 1 line

    Renamed container to localClone to clear up code. Fixed a bug where the uncloned atoms was
    searched in the cloned atomcontainer. More bugs like this are in the code. Miguel is contacted
    about this problem.
    ------------------------------------------------------------------------
    r9082 | egonw | 2007-10-15 13:48:15 +0200 (Mon, 15 Oct 2007) | 1 line

    Renamed container to localClone to clear up code. Fixed a bug where the uncloned atoms was
    searched in the cloned atomcontainer.

The big drop in number of fails is caused by the removal of the SMARTS code from the branch, which has been present since the start of the branch (see this page).

From this analysis I conclude that CDK 1.0.2 can soon be released, with the note that the ionization potential calculation is not safe to use.

Monday, December 17, 2007

Open Data getting more recognition

The OD part of ODOSOS is getting more and more attention, and it seems that Peter's Open Data battle is paying off (see his original OpenData article in Wikipedia): an open data specific license has reached the beta stage (see this announcement).

The idea behind this license seems to come down to:
    Facts are free. The Rightsholder takes the position that factual information is not covered by Copyright. This Document however covers the Work in jurisdictions that may protect the factual information in the Work by Copyright, and to cover any information protected by Copyright that is contained in the Work.

I am looking forward to seeing how this license will be picked up by the community. PubChem may be a good candidate to use this license, to formalize their dump into the public domain. Not just yet, though, because things might still change. It is said that a wiki will be set up to ask for feedback. Paul has written a nice writeup on the history of this license.

I particularly like the quote by Tim O'Reilly from this blog:
    One day soon, tomorrow's Richard Stallman will wake up and realize that all the software distributed in the world is free and open source, but that he still has no control to improve or change the computer tools that he relies on every day. They are services backed by collective databases too large (and controlled by their service providers) to be easily modified. Even data portability initiatives such as those starting today merely scratch the surface, because taking your own data out of the pool may let you move it somewhere else, but much of its value depends on its original context, now lost.

In the past I have argued for the CC-BY license, and so does Peter in this recent comment on a post by Deepak on educating people about data ownership. Interestingly, the new license proposes to remove ownership as a solution to free the data :)

Thursday, December 13, 2007

I don't blame Individuals in Commercial Chemoinformatics

The comment I left in the ChemSpider blog was probably a bit blunt. ChemSpider announced having licensed software from OpenEye. I have seen such announcements more often, but am intrigued about the nature of such announcements. Is it bad that ChemSpider is using OpenEye software? Certainly not. But it is surprising that they "announced today they had entered into an agreement that will allow the incorporation of a number of OpenEye’s products into ChemZoo’s online chemistry database and property prediction service, ChemSpider" (emphasis mine).

Is it really special that you buy software and then use it? Maybe, it increasingly is, with a number of good software products freely available. Even many proprietary products are freely available, sometimes to a selected group only, though. Or, is there some license behind this that restricts you in what you may and may not do with it?

Anyway, I made the somewhat inconsiderate comment: "Amazing! (Forgive me that I [have] not read every bit…) But, amazing! A press release for the fact that one may use software ;)".

Anthony replied with these lines: "Yes, I think it is amazing that companies of this caliber are willing to provide their tools at no cost to systems like ChemSpider". He read my sarcasm correctly. I find it absurd that the future of chemoinformatics is left to the goodwill of benevolent companies. Chemoinformatics is way too important, and in way too crappy a state, to be kept as a proprietary toy of industry; that's something I argued before.

Let me try to explain where my sarcasm is coming from.

I do Not blame Individuals in Commercial Chemoinformatics
There is nothing wrong with getting paid for what you do. I get paid for the software I develop too, though most of my contributions to the CDK, Jmol and even some of my contributions to Bioclipse I have made as a hobby, in my spare time, unpaid. Nothing wrong with a good hobby, I would say.

But I do not blame people for not doing the same. Neither do I blame myself for making a reasonable living in the Netherlands, unlike all those poor bastards who struggle to make it to the next month, like many in the United States. But I do not like the situation. Neither do I blame people for being religious, though I really dislike several of the things the Church is trying to make people believe (such as that the HIV virus can get through condoms). I hate the situation.

I do not dislike the Commercial Model
People have to make a living. I do; anyone does. I do feel, however, there is a difference between making a living because you work, and getting money because you happen to be at the right side of the money flow. There is a difference between a baker getting up at 5am every morning to feed a village, and someone selling a thin slice of bread via eBay to a poor African soul who just received his/her OLPC laptop. Not that I think this really applies to the ChemSpider/OpenEye deal; just to make a statement about commercialism.

The Bill Gates foundation spending a lot of money on scientific research is what the Dutch would call een sigaar uit eigen doos. This translates to something like getting a present you paid for yourself. Literally, 'to get a cigar from one's own box'. But that's another story.

I hate the situation
I hate the situation that research for new drugs is so expensive, and medicine likewise. I hate it that the pharmaceutical industry cannot sell these drugs cheaply to developing countries, because they will be sold expensively in western markets. But I do not blame the scientists working in the pharma industry.

I hate the situation that scientific results cannot be reproduced independently, because software is being used as black box. But I do not blame the guy who wrote the code.

I hate the situation that I cannot contribute to the excellent products around, because they disallow me to discuss my work with others. But I do not blame the guy who sold me the license.

I hate the situation that many very qualified scientists have to find post-doc after post-doc before they give up and go to industry. I hate the situation that the better scientist you are, the less science you actually do, because all time is spent on getting further funds. But I do not blame those who paid for those temporary post-doc positions.

I hate the situation that people have to use commercial models for their scientific contributions, just to make a living, even though they would have loved to contribute that to mankind. But I do not blame them for wanting to be able to fulfill their primary living requirements (and those of their families).

I hate the situation that I review papers for free for commercial publishers, just to help science progress. I do blame myself for not having stopped doing that yet.

But I do not blame ChemSpider for buying or using commercial products. I do not blame the people working at OpenEye for making a living. But I do find it absurd that we have to be amazed that scientific software is put to work.

I apologize for being blunt, but I cannot apologize for disliking the current situation chemoinformatics is in.

Monday, December 10, 2007

Tagging, thesauri or ontologies?

Controlled vocabularies, hierarchies, microformats, RDF. Nico Adams pointed me to this excellent video:



It's a really nifty piece of work, which goes into the differences between thesauri, controlled vocabularies, and, as such, ontologies, and social tagging systems. Both have their virtues; it is fuzzy logic versus ODEs all over again. Whether one is better than the other only depends on the problem at hand. For example, can you imagine social tagging in atom typing prior to performing force field calculations? Or, a 150-term ontology to annotate the scientific content of your literature archive?

More from where they come from...
The video appears to be made by the Digital Ethnography group, which has made several more movies. Certainly something I'm going to check out over the winter holidays (I guess I am quite a bit more religious about ODOSOS than about gods).

Nico wrote: As long as we appreciate that there may be more than one top node…. I am not entirely sure, but I think he refers to thesauri, which are a particular form of ontology in which basically the only relations that can be found are is-a or is-parent-of, resulting in a hierarchy of controlled terminology with one top node (such as the Gene Ontology). Ontologies can and should be much richer if we really want to take advantage of our information technologies, just like we do with any graph mining. Why mould reality in a tight hierarchy?

Chemical ontologies
Peter has not seen the movie yet, but replied with a recent comment he had on CML:
    Ebs and Michael had reviewed CML and questioned why the key concepts were atoms, molecules, electron, substances, whereas they suggested it would have been better to start from reactions. I think that’s a very clear difference in orientation between endurants and perdurants. Although chemists publish reactions, most of the emphasis is on (new) substances and their properties. CML is designed to map directly onto the way chemists seem to think - at least in their public communication - e.g. through documents. Of course we can also do reactions in CML, but even there the emphasis is often on the components.


The suggestion by Ebs and Michael is indeed quite surprising: ontologies try to capture knowledge and express it in a small set of terms, each with an accurate and non-overlapping meaning (orthogonal, if you wish). Now, the terms carbon, nitrogen, oxygen, and the other 104 elements are quite accurate and rather different from each other, at least from a chemical point of view. Sure, bonding is more difficult, and let's not start about aromaticity. But to question atoms, bonds or electrons as key concepts??

Friday, December 07, 2007

Open Source, Open Data at the European Bioinformatics Institute

I was pleased to hear that Christoph will move to the EBI early next year. Christoph has been working on Open Source and Open Data chemoinformatics since at least 1997. I first got in contact with Christoph when I wrote code for JChemPaint (which Christoph developed) to be able to read Chemical Markup Language (CML). This also got me into contact with Dan Gezelter, who is the original author of Jmol, to which I also added CML support. And, of course, with Henry and Peter, who first developed CML. This was before XML was an official recommendation, and I have worked with CML files which you would no longer recognize. It was in Dan's office that the CDK was founded, where Christoph, Dan and I designed data classes to replace the JChemPaint and Jmol data classes. Both JChemPaint and Jmol were rewritten afterwards, but for Jmol it was later decided that more tuned classes were needed to achieve the required performance for the live rendering of tens of thousands of atoms.

Well, Christoph has done much other Open Source and Open Data work, including the NMRShiftDB, Bioclipse and Seneca, a tool for computer-aided structure elucidation (CASE). The scientific impact of Christoph's work is considerable. When I realize that much of his past work was setting out foundations, and that these foundations have been found to be solid, I am happy to hear that he can now start to apply his work to life science problems, where current methods are failing.

Christoph, cheers!

Monday, December 03, 2007

Web2O, Open Chemistry, and Chemblaics

Chemistry World December issue features a nice item on the future of data in chemistry: Surfing Web2O; Peter gave an excerpt, and Peter commented on it.

The article discusses many of the things that have been happening in the field of chemical data. It touches Jean-Claude's work on Open Notebook Science, and then moves to Peter's Open Data, mentions a number of other blogs and the Chemical blogspace. Via some video efforts, it ends up with Mitch' Chemmunity, which has the coolest Captcha I have seen so far:



It also cited Rich' blog item on 32 free chemical databases, Christoph's NMRShiftDB.org, Project Prospect, and CML, which recently saw its 7th research paper.

Of course, this is the arena of chemblaics, but unfortunately my blog is not cited (though my name is mentioned). So, what is wrong with my blog??

Tuesday, November 27, 2007

Be in my Advisory Board #1: being a good Open Science citizen

I recently saw that blogger.com blogs gained a poll feature. From now on, I will try to be a bit more Open Science, in addition to Open Source. From now on, you can be in my Advisory Board. To do so, vote on my next chemblaics (aka Open Source Chemoinformatics) project. The poll can be found on the left side of this blog. Associated with each poll, which I may run more or less frequently depending on the time of year, will be one blog post where I introduce the options. Options not mentioned, or completely different things you would like to suggest I do, can be left as comments to these items.

Finishing the new JChemPaint code
Goal of this option is to use the code written by Niels in his ProgrammeerZomer project to implement a new JChemPaint based on Java2D and independent of the widget set used (Swing/AWT/SWT/...).

CML-roundtripping of the CDK data model
The goal of this project is to ensure that all information the CDK data model can hold can be roundtripped in CML.

Integrating InChI-NestedVM in Bioclipse
Rich is, besides an excellent blogger, also someone who is not afraid to try new things. Recently, he experimented with compiling the InChI library into a Java executable. Bioclipse already is able to generate InChIs, using the code written by Sam Adams for the CDK, but an InChI/NestedVM plugin for Bioclipse could make a nice showcase.

Writing CDK News articles
On the other hand, you might find that I should focus on getting a new CDK News issue out, for which we are still lacking (finished) contributions.

It's up to you. Deadline in about two weeks; still got some other things to finish :)

Monday, November 26, 2007

Metabolomics workflows in Taverna

My current job description is to speed up metabolomics data analysis, and I finally got around to making a first relevant workflow for Taverna, using the webservices just posted over at ChemSpider:


I uploaded the source to MyExperiment, so anyone can play with it. There is much to improve, such as using CDK-Taverna for further analysis of the results.

I am not sure if opening the workflow in your Taverna installation will automatically set up the WSDL scavenger for the ChemSpider services, which are available in an HTTP version too, btw. If not, right click on the Available Processors folder, and pick Add new WSDL scavenger... and point it to the URL http://www.chemspider.com/MassSpecAPI.asmx?WSDL. The result should look like:


Oh, and please note this comment:
    These services are offered free of charge to our users during this period of testing, validation and feedback. Some of these services will be made available commercially in the future and we are proactively informing you of our intention to do this. It is likely that these services will remain available to academia at no charge. Please contact us at feedbackATchemspiderDOTcom with feedback and questions.

So, I do not know when my workflow will stop working.

Thursday, November 22, 2007

MetWare: metabolomics database project started on SourceForge

The Applied Bioinformatics at PRI group where I now work in Wageningen and the group of Steffen Neumann in Halle have started the MetWare project on Sourceforge to develop opensource databases for metabolomics data.

The database design will be based on and ideally compatible with proposed standards like ArMet (DOI:10.1038/nbt1041) and those recently written up by the Metabolomics Standards Initiative (see the issue around DOI:10.1007/s11306-007-0070-6).

One important design goal is that the project will use BioMart, which will allow easy integration of the database content in data analysis programs like Taverna and R using the biomaRt package (see DOI:10.1093/bioinformatics/bti525).

Though the software will be opensource, it is as yet uncertain how much data will be open.

Tuesday, November 20, 2007

When standards fail...

Jim shows that some people do not think webservices standards are complex enough in themselves:

Monday, November 19, 2007

An R-based genetic algorithm

During my PhD I wrote a simple but effective genetic algorithm package for R. Because a bug was recently found, and there is interest in extending the functionality, I have set up a SourceForge project called genalg.

The package provides GA support for binary and real-value chromosomes (integer chromosomes will be added soon), and allows the use of custom evaluation functions. Here is some example code:
library(genalg)

# optimize two values to match pi and sqrt(50)
evaluate <- function(string=c()) {
    returnVal = NA;
    if (length(string) == 2) {
        # sum of the absolute deviations from the two target values
        returnVal = abs(string[1]-pi) + abs(string[2]-sqrt(50));
    } else {
        stop("Expecting a chromosome of length 2!");
    }
    returnVal
}

monitor <- function(obj) {
    # plot the population
    xlim = c(obj$stringMin[1], obj$stringMax[1]);
    ylim = c(obj$stringMin[2], obj$stringMax[2]);
    plot(obj$population, xlim=xlim, ylim=ylim, xlab="pi", ylab="sqrt(50)");
}

# search between (1, 1) and (5, 10), the lower and upper bounds of the two values
rbga.results = rbga(c(1, 1), c(5, 10), monitorFunc=monitor,
    evalFunc=evaluate, verbose=TRUE, mutationChance=0.01)

plot(rbga.results)
plot(rbga.results, type="hist")
plot(rbga.results, type="vars")

Friday, November 16, 2007

Molecules in Wikipedia without InChIs #3

Third in the series of blogs about molecules in Wikipedia without an InChI (see also #1 and #2). There are certainly false positives, but here's the updated list:
http://www.en.wikipedia.org/wiki/AZD2171
http://www.en.wikipedia.org/wiki/Alizarin
http://www.en.wikipedia.org/wiki/Allantoin
http://www.en.wikipedia.org/wiki/Allylamine
http://www.en.wikipedia.org/wiki/Alpha-ethyltryptamine
http://www.en.wikipedia.org/wiki/Anthraquinone
http://www.en.wikipedia.org/wiki/Aspartame
http://www.en.wikipedia.org/wiki/Barium_sulfate
http://www.en.wikipedia.org/wiki/Biotin
http://www.en.wikipedia.org/wiki/Boron_nitride
http://www.en.wikipedia.org/wiki/Botox
http://www.en.wikipedia.org/wiki/Bremelanotide
http://www.en.wikipedia.org/wiki/CAS_registry_number
http://www.en.wikipedia.org/wiki/Cadmium_sulfide
http://www.en.wikipedia.org/wiki/Carminic_acid
http://www.en.wikipedia.org/wiki/Celestine_%28mineral%29
http://www.en.wikipedia.org/wiki/Cellulose
http://www.en.wikipedia.org/wiki/Chemical
http://www.en.wikipedia.org/wiki/Chemical_file_format
http://www.en.wikipedia.org/wiki/Cheminformatics
http://www.en.wikipedia.org/wiki/Chloramine
http://www.en.wikipedia.org/wiki/Chloroethane
http://www.en.wikipedia.org/wiki/Cinnamic_acid
http://www.en.wikipedia.org/wiki/Crabtree's_catalyst
http://www.en.wikipedia.org/wiki/DDT
http://www.en.wikipedia.org/wiki/DMAP
http://www.en.wikipedia.org/wiki/Dimethicone#Applications
http://www.en.wikipedia.org/wiki/Dimethyl_amine
http://www.en.wikipedia.org/wiki/Dimethyl_sulfide
http://www.en.wikipedia.org/wiki/Dimethylethanolamine
http://www.en.wikipedia.org/wiki/Dioxine
http://www.en.wikipedia.org/wiki/Diphenylamine
http://www.en.wikipedia.org/wiki/Dmso
http://www.en.wikipedia.org/wiki/EDTA
http://www.en.wikipedia.org/wiki/Eschenmoser%27s_salt
http://www.en.wikipedia.org/wiki/Ethylene_carbonate
http://www.en.wikipedia.org/wiki/Folate
http://www.en.wikipedia.org/wiki/Formic_acid
http://www.en.wikipedia.org/wiki/HMPA
http://www.en.wikipedia.org/wiki/Hafnium(IV)_oxide
http://www.en.wikipedia.org/wiki/Heavy_water
http://www.en.wikipedia.org/wiki/Hexafluoroisopropanol
http://www.en.wikipedia.org/wiki/Hydrogen_cyanide
http://www.en.wikipedia.org/wiki/Hydrogen_cyanide#Hydrogen_cyanide_as_a_chemical_weapon
http://www.en.wikipedia.org/wiki/Hydrogen_peroxide
http://www.en.wikipedia.org/wiki/Hydroxyapatite
http://www.en.wikipedia.org/wiki/Hydroxybenzotriazole
http://www.en.wikipedia.org/wiki/IUPAC_nomenclature_of_inorganic_chemistry
http://www.en.wikipedia.org/wiki/Indole
http://www.en.wikipedia.org/wiki/Interferon_beta-1a
http://www.en.wikipedia.org/wiki/J%C3%B6ns_Jakob_Berzelius
http://www.en.wikipedia.org/wiki/Lawesson%27s_reagent
http://www.en.wikipedia.org/wiki/Lewisite
http://www.en.wikipedia.org/wiki/MTBE
http://www.en.wikipedia.org/wiki/Maitotoxin
http://www.en.wikipedia.org/wiki/Menthol
http://www.en.wikipedia.org/wiki/Merck_Index
http://www.en.wikipedia.org/wiki/Mescaline
http://www.en.wikipedia.org/wiki/Metaldehyde
http://www.en.wikipedia.org/wiki/Methionylalanylthreonyl...leucine
http://www.en.wikipedia.org/wiki/Methyl_amine
http://www.en.wikipedia.org/wiki/Methyl_salicylate
http://www.en.wikipedia.org/wiki/Molecular_Query_Language
http://www.en.wikipedia.org/wiki/N-butyllithium
http://www.en.wikipedia.org/wiki/Nafion
http://www.en.wikipedia.org/wiki/Nitrous_oxide
http://www.en.wikipedia.org/wiki/Octanitrocubane
http://www.en.wikipedia.org/wiki/Organic_chemistry
http://www.en.wikipedia.org/wiki/Organic_chemistry#Molecular_structure_elucidation
http://www.en.wikipedia.org/wiki/P4O10
http://www.en.wikipedia.org/wiki/Paraldehyde
http://www.en.wikipedia.org/wiki/Penicillin
http://www.en.wikipedia.org/wiki/Peroxyacetic_acid
http://www.en.wikipedia.org/wiki/Phenol
http://www.en.wikipedia.org/wiki/Physical_science
http://www.en.wikipedia.org/wiki/Piperidine
http://www.en.wikipedia.org/wiki/Potassium_chloride
http://www.en.wikipedia.org/wiki/Psilocybin
http://www.en.wikipedia.org/wiki/Pubchem
http://www.en.wikipedia.org/wiki/Quinine_total_synthesis
http://www.en.wikipedia.org/wiki/Resveratrol
http://www.en.wikipedia.org/wiki/Rhodamine
http://www.en.wikipedia.org/wiki/Salvia_divinorum
http://www.en.wikipedia.org/wiki/Selenium_dioxide
http://www.en.wikipedia.org/wiki/Silicon_carbide
http://www.en.wikipedia.org/wiki/Skatole
http://www.en.wikipedia.org/wiki/Skeletal_formula
http://www.en.wikipedia.org/wiki/Soman
http://www.en.wikipedia.org/wiki/Splenda
http://www.en.wikipedia.org/wiki/Standard_atomic_weight
http://www.en.wikipedia.org/wiki/Subgraph_isomorphism_problem
http://www.en.wikipedia.org/wiki/Sulfur_hexafluoride
http://www.en.wikipedia.org/wiki/Sulfur_mustard
http://www.en.wikipedia.org/wiki/TBHQ
http://www.en.wikipedia.org/wiki/Tabun_(nerve_agent)
http://www.en.wikipedia.org/wiki/Teicoplanin
http://www.en.wikipedia.org/wiki/Tetra-ethyl_lead
http://www.en.wikipedia.org/wiki/Tetraazidomethane
http://www.en.wikipedia.org/wiki/Tetrachloroethylene
http://www.en.wikipedia.org/wiki/Thiomersal
http://www.en.wikipedia.org/wiki/Titanium_dioxide
http://www.en.wikipedia.org/wiki/Tourmaline
http://www.en.wikipedia.org/wiki/Uric_acid
http://www.en.wikipedia.org/wiki/VX_%28nerve_agent%29
http://www.en.wikipedia.org/wiki/Valence_%28chemistry%29
http://www.en.wikipedia.org/wiki/benzylbromide
http://www.en.wikipedia.org/wiki/cortisone
http://www.en.wikipedia.org/wiki/epothilone
http://www.en.wikipedia.org/wiki/piperidine
http://www.en.wikipedia.org/wiki/stilbene
http://www.wikipedia.org/wiki/Phosgene

Wednesday, November 14, 2007

Last Call for Open Laboratory 2007

Pedro reminded me of the last call for Open Laboratory 2007, which prints the best blog items of 2007 in book form. The list of chemistry contributions is not so large yet, so go ahead and nominate some of the cool chemical blog items of the last year.

I will post my shortlist later this week.

Monday, November 12, 2007

Scintilla and Postgenomic.com on Linux 2.6.17+

That's why blogging works! I reported last Friday on using my Wii for reading Scintilla and Postgenomic.com. Alf replied:
    It is the Linux kernel, yes: TCP window scaling was switched on by default in kernels since about a year ago (and in Vista too, I think), and one of our routers or firewalls doesn't like it. We're trying to get them upgraded, but it takes a while...

Ah, the trick words: TCP window scaling. A quick google turned up a workaround in John's Tidbits blog:
    There are 2 quick fixes. First you can simply turn off windows scaling all together by doing

    echo 0 > /proc/sys/net/ipv4/tcp_window_scaling

    but that limits your window to 64k. Or you can limit the size of your TCP buffers back to pre 2.6.17 kernel values which means a wscale value of about 2 is used which is acceptable to most broken routers.

    echo "4096 16384 131072" > /proc/sys/net/ipv4/tcp_wmem
    echo "4096 87380 174760" > /proc/sys/net/ipv4/tcp_rmem

    The original values would have had 4MB in the last column above which is what was allowing these massive windows.

    In a thread somewhere which I can’t find anymore Dave Miller had a great quote along the lines of

    “I refuse to workaround it, window scaling has been part of the protocol since 1999, deal with it.”

That worked for me. I think Dave Miller is right, but can't resist reading Scintilla and Postgenomic.com on my desktop too ;)
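To see why the smaller buffer values lead to "a wscale value of about 2", here is a minimal sketch of the arithmetic. It assumes the scale factor is simply the smallest shift that brings the maximum receive buffer within the 16-bit window field, capped at 14 (the limit set by the window scaling RFC). The function name window_scale_for is made up for this illustration, and other kernel limits (such as net.core.rmem_max) are ignored.

```python
# Sketch of how a receive-buffer ceiling maps to a TCP window scale factor.
# Assumption: the scale is the smallest shift that brings the buffer size
# within the 16-bit window field (65535), capped at 14.

MAX_UNSCALED_WINDOW = 65535  # largest value of the 16-bit window field
MAX_WSCALE = 14              # upper limit on the shift count


def window_scale_for(buffer_bytes: int) -> int:
    """Return the shift count needed to advertise `buffer_bytes` (hypothetical helper)."""
    wscale = 0
    space = buffer_bytes
    while space > MAX_UNSCALED_WINDOW and wscale < MAX_WSCALE:
        space >>= 1   # halve the space for every extra bit of scaling
        wscale += 1
    return wscale


if __name__ == "__main__":
    for label, size in [("tcp_rmem max of 174760", 174760),
                        ("tcp_rmem max of 4 MB", 4 * 1024 * 1024)]:
        print(label, "-> wscale", window_scale_for(size))
```

Under these assumptions the 174760-byte ceiling gives a scale of 2 and the 4 MB ceiling gives 7, which fits the explanation in the quote that the large default buffers were what allowed the massive windows.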

Friday, November 09, 2007

Using the Nintendo Wii for serious science...

On my desktop, the Scintilla and Postgenomic.com websites do not work. It is not a browser problem, but has something to do with TCP/IP packets not reaching their destination: the browser. Euan told me they are aware of the problem, but apparently have not found a solution yet.

However, my Wii does not have the problem, which makes me wonder if it is a disagreement between the Nature server and my Linux kernel... Anyway, this is what the two websites look like (first Scintilla, then Postgenomic.com):





(BTW, that was one very nice piece of work by Rich! Make sure to also read the follow up.)

The only real disadvantage is that it does not integrate well with the things I do daily. If I see some interesting post, and would like to tag it on my del.icio.us account, I have to google for it on my desktop :(

(You thought I was going to talk about F@H or so, didn't you? :)

Thursday, November 08, 2007

Cytoscape in Amsterdam

Right at this moment I am listening to Andrew Hopkins from Dundee on chemical opportunities in systems biology, at the Cytoscape conference in Amsterdam. Anyone who wants to meet up over lunch or coffee break?

Wednesday, November 07, 2007

Comparing JUnit test results between CDK trunk/ and a branch

I have started using branches for non-trivial patches, like removing the HückelAromaticityDetector, in favor of the new CDKHückelAromaticityDetector. I am doing this in my personal remove-non-cdkatomtype-code branch, where I can quietly work on the patch until I am happy about it. I make sure to keep it synchronized with trunk with regular svn merge commands.

Now, the goal is that my branch only fixes failing JUnit tests, and does not create new regressions. To compare the results between two versions of the CDK, I use these commands:

$ cd cdk/trunk/cdk
$ ant -lib develjar/junit-4.3.1.jar -logfile ant.log test-all
$ cd ../../branches/egonw/remove-non-cdkatomtype-code/
$ ant -lib develjar/junit-4.3.1.jar -logfile ant.log test-all
$ cd ../../..
$ grep Testcase branches/egonw/remove-non-cdkatomtype-code/reports/*.txt | cut -d':' -f2,3 > branch.results
$ grep Testcase trunk/cdk/reports/*.txt | cut -d':' -f2,3 > trunk.results
$ diff -u trunk.results branch.results

The last diff command gives me a quick overview of what has changed. To get the statistics, I can do:

$ diff -u trunk.results branch.results | grep "^-Testcase" | wc -l
$ diff -u trunk.results branch.results | grep "^+Testcase" | wc -l

The first gives me the number of JUnit tests which are no longer failing, while the second gives me the number of tests which are newly failing. Ideally, the second is zero. Unfortunately, that is not yet the case :)
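The same bookkeeping can be sketched in a few lines of Python. This is only an illustration: the function name compare_failures and the test names in the usage example are made up, and it compares the two result lists as sets, so unlike diff it does not care about the order of the lines.

```python
def compare_failures(trunk_lines, branch_lines):
    """Compare two lists of failing-test lines (hypothetical helper).

    Returns (fixed, new): tests failing only in trunk, and tests failing
    only in the branch.
    """
    trunk = {line.strip() for line in trunk_lines if line.strip()}
    branch = {line.strip() for line in branch_lines if line.strip()}
    fixed = sorted(trunk - branch)   # failing in trunk, passing in the branch
    new = sorted(branch - trunk)     # passing in trunk, failing in the branch
    return fixed, new


if __name__ == "__main__":
    # Made-up example content for trunk.results and branch.results
    trunk_results = ["Testcase: testBenzene", "Testcase: testPyridine"]
    branch_results = ["Testcase: testPyridine", "Testcase: testAzulene"]
    fixed, new = compare_failures(trunk_results, branch_results)
    print("no longer failing:", len(fixed))
    print("newly failing:", len(new))
```

To use it on the real files, one would pass open("trunk.results") and open("branch.results") instead of the example lists.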

Tuesday, November 06, 2007

Evidence of Aromaticity

I have been working on a new atom type perception engine for the CDK, after having decided that the existing atom type lists were not sufficient for the algorithms we have in the CDK. The new list is growing in size, and basically contains four properties (besides element and formal charge):
  1. number of bonded neighbors
  2. number of pi bonds (or double bond equivalents)
  3. number of lone pairs
  4. hybridization state
This seems to be a minimal and accurate set to cover a rather good deal of chemoinformatics. I have yet to make the mappings of the new atom type list with existing lists for force fields, and radicals are missing too. However, the following algorithms in the CDK seem to translate rather well:
  • hydrogen adding
  • aromaticity detection (Hückel rules)
I still have to rework the double bond perception.

Aromaticity
Now, aromaticity is a fuzzy concept, and there is no general agreement on what it is. Some say it is smelly compounds, others say ring systems which obey the Hückel rule. Based on the new atom type list, I have rewritten the Hückel aromaticity detector and it applies these rules:
  • only single rings and two fused non-spiro rings
  • 4n+2 electrons
  • no ring atoms with double bonds not in the ring too
This approach differs in two ways from the old code. First, it no longer tries to test all ring systems, which required the CDK AllRingsFinder algorithm, which combinatorially generates all possible ring systems. The new code only considers ring systems with up to two single rings. Aromaticity beyond that is even less well defined than aromaticity in general.
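The three rules can be written down as a small check. The sketch below is not the CDK code: the function names are hypothetical, and it assumes the caller has already counted the pi electrons, counted the rings in the (non-spiro) ring system, and determined whether any ring atom has a double bond pointing out of the ring.

```python
def satisfies_huckel(pi_electrons: int) -> bool:
    """True if the electron count has the form 4n+2 with n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


def may_be_aromatic(pi_electrons: int, ring_count: int,
                    has_exocyclic_double_bond: bool) -> bool:
    """Apply the three rules from the list above (hypothetical helper)."""
    if ring_count not in (1, 2):       # only single rings and two fused rings
        return False
    if has_exocyclic_double_bond:      # no ring atoms with double bonds leaving the ring
        return False
    return satisfies_huckel(pi_electrons)


if __name__ == "__main__":
    # Electron counts are supplied by hand here; a real perception engine
    # would derive them from the atom types.
    print("benzene:", may_be_aromatic(6, 1, False))
    print("naphthalene:", may_be_aromatic(10, 2, False))
    print("cyclobutadiene:", may_be_aromatic(4, 1, False))
    # A ring that passes the electron count but has exocyclic double bonds
    print("exocyclic C=O case:", may_be_aromatic(6, 1, True))
```

The order of the checks does not matter for the result; the cheap structural tests simply come first.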

The other difference is that the ring system must not have ring atoms with a double bond that is not part of the ring too. The classical example is benzoquinone (InChI=1/C6H4O2/c7-5-1-2-6(8)4-3-5/h1-4H) which is not aromatic, even though it conforms to the 4n+2 rule (image from PubChem):



Evidence of Aromaticity
The final rule, of course, is what nature tells us is aromatic and what is not. There are many more details to aromaticity than I just covered. For example, take azulene (InChI=1/C10H8/c1-2-5-9-7-4-8-10(9)6-3-1/h1-8H). All atoms are aromatic, but not all bonds (also PubChem):



These things are complex, but the rise of Open Data helps us out, as does increasing computing power. Peter has been running two projects which may help us out: CrystalEye (Nick: no blog?) and OpenNMR.

NMR shifts will give us experimental backup on our notion of aromaticity, and so do bond lengths. I asked Peter about this, and whether OpenNMR predicted shifts could indeed confirm aromaticity of compounds, and he replied and showed that the predicted spectra could be used to distinguish between C-C and C=C bonds.

I commented the following (which was in moderation at the time of writing), and that gets us to experimental evidence for aromaticity:
    Thanx for the elaborate answer. What I had in mind was the question whether NMR shift predictions can be used to tell me if a certain ring system is aromatic or not, and in case of fused rings, which atoms and which bonds are aromatic and which not. I’m sure the prediction error for 1H NMR shifts is well below 2ppm, and more in the order of 0.2ppm.

    But maybe I should be asking, can I use CrystalEye to decide if ring systems are “aromatic”, and in case of two rings fused together (non-spiro), which atoms and bonds are aromatic and which not. Aromaticity is a fuzzy concept, with various definitions. I would be interesting in linking what the expert considers ‘aromatic’ (or SMILES, or the CDK, or …) with what the QM chemistry (via bond lengths or NMR shift predictions) and crystal structures (via bond lengths) has to teach us. The null hypothesis being that the bonds are not delocalized (bond length) and that no ring current is found (NMR shifts, 1H in particular).

    Regarding those bond lengths, ‘aromatic’ bonds show a bond length in between that of single and double bonds (e.g. see this random pick). The CrystalEye data does not reflect that really, and only a trimodal histograms shows up. Indeed, the C#C peak is *very* low, around 1.2A :) Apparently, the triple C#C bond order is underrepresented in nowadays crystallography.

    Maybe aromatic C:C bonds are underrepresented too, or can the absence of a peak around 1.40A be explained otherwise? I would at least have expected a shoulder or deviation in peak shape of the peak at 1.37A.

This is what the histogram looks like (for archival reasons):

Monday, November 05, 2007

Glueing BioMoby services together with JavaScript in Bioclipse

Ola has been doing a good job of integrating BioMoby support into Bioclipse. Earlier he completed a GUI for running BioMoby services, and more recently added a JavaScript wrapper too, using the Rhino plugin developed by Johannes.

For example:
    console = Packages.net.bioclipse.util.BioclipseConsole;
    moby = Packages.net.bioclipse.biomoby.ui.scripts.MobyServiceScripting;
    biojava = Packages.net.bioclipse.biojava.scripts.BioJavaScripting;

    prot=moby.downloadGenbank("NCBI_GI","111076");
    seq=biojava.parseString(prot);
    fasta=biojava.toFasta(seq);

    console.writeToConsole(fasta);

Today he explained how to create convenience JavaScript shortcuts, to reduce the typing.

Screenshots and the status of the Bioclipse-BioMoby work are available from the wiki.

Wednesday, October 31, 2007

Offline CDK development using git-svn

While Subversion is a significant improvement over CVS, they both require a central server. That is, they do not allow me to commit changes when I am not connected to that server. This is annoying when on a long train ride, or somewhere else without internet connectivity. I can pile up all my changes, but that would yield one big ugly patch.

Therefore, I tried Mercurial, where each client is a server too. The version I used, however, did not have the move command, so it put me back into the old CVS days where I lost the history of a file when I reorganized my archive.

Git
Then Git, the version control system developed by Linus Torvalds when he found that existing tools did not do what he wanted them to do. It seems a rather good product, though with a somewhat steeper learning curve, because of the far more flexible architecture (see this tutorial). Well, it works for the Linux kernel, so it must be good :)

Now, SourceForge does not have Git support yet, so we use Subversion. Flavio of Strigi fame, however, introduced me to git-svn. That was almost two months ago already, but I finally made some time to try it out. I think I like it.

This is what I did to make a commit to CDK's SVN repository:

$ sudo aptitude install git-svn git-core
$ mkdir -p git-svn/cdk-trunk
$ cd git-svn/cdk-trunk
$ git-svn init https://cdk.svn.sourceforge.net/svnroot/cdk/trunk/cdk
$ git-svn fetch -rHEAD
$ nano .classpath
$ git add .classpath
$ git commit
$ git-svn dcommit

The first git-svn command initializes a local Git repository based on the SVN repository. The git-svn fetch command makes a local copy of the SVN repository content defined in the previous command. Local changes are, by default, not committed, unless one explicitly git adds them to a patch. Once a patch is ready you can do all sorts of interesting things with it, among which committing it to the local Git repository with git commit.

Now, these kinds of commits are on the local repository, and I do not require internet access for that. When I am connected again, I can synchronize my local changes with the SVN repository with the git-svn dcommit command.

A final important command is git-svn rebase, which is used to update the local Git repository with changes others made to the SVN repository.

Monday, October 29, 2007

BioSpider: another molecule search engine

I just ran into BioSpider. Unlike ChemSpider, BioSpider crawls the internet (well, this list of sources really) to find information, and depending on what it finds it continues the search. Below is a screenshot of an intermediate point after starting with the InChI of methane:


After the search it generates a long HTML page with all the information it found on the molecule you queried for. This approach is much more scalable than storing all in one database.

This crawling of information is something I was working on myself a bit too, and I think this is a good approach. However, I think the use of a central website is not the right approach. Instead, the search should be distributed too: the crawling should be done on the client machine; it should be done in Taverna or Bioclipse instead.

My conclusion: excellent idea, bad implementation.

Friday, October 26, 2007

My FOAF network #1: the FOAFExplorer

In this series I will introduce the technologies behind my FOAF network. FOAF means Friend-of-a-Friend and
    [t]he Friend of a Friend (FOAF) project is creating a Web of machine-readable pages describing people, the links between them and the things they create and do.

My FOAF file (draft) will give you details on who I am, who I collaborate with (and other types of friends), which conferences I am attending, what I published etc. That is, I'll try to keep it updated. BTW, FOAF is an RDF language.

FOAFExplorer
Pierre has done some excellent FOAF work in the past, and developed the MyFOAFExplorer, and also developed a tool to create a FOAF network based on the PubMed database, called SciFOAF. The latter is neat, but does not allow putting all these personal details in the FOAF files. However, the output could be a starting point.

Back to FOAFExplorer, this is what the FOAFExplorer shows for my network:



I'm a bit lonely, even though I have linked to two friends in my FOAF file, of which one has a FOAF file too (Henry):
<foaf:knows>
<foaf:Person rdf:ID="HenryRzepa">
<foaf:name>Henry Rzepa</foaf:name>
<rdfs:seeAlso rdf:resource="http://www.ch.ic.ac.uk/rzepa/rzepa.xrdf"/>
</foaf:Person>
</foaf:knows>
<foaf:knows>
<foaf:Person rdf:ID="PeterMurrayRust">
<foaf:name>Peter Murray-Rust</foaf:name>
<foaf:mbox_sha1sum>926d6f8ed367bdded26353a05e80b4f0ce18230d</foaf:mbox_sha1sum>
</foaf:Person>
</foaf:knows>

I guess the FOAFExplorer does not browse into my network. More on that in later items in this series.
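To make concrete what "browsing into the network" would require, here is a minimal sketch that lists the friends in a FOAF fragment and the rdfs:seeAlso links a crawler could follow. The wrapping rdf:RDF element, the rdf:ID="me" person and the function name list_friends are assumptions added to make the example self-contained; this is not how FOAFExplorer works internally.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for RDF, RDF Schema and FOAF
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

# The foaf:knows statements from above, wrapped in an assumed document root
FOAF_SNIPPET = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:ID="me">
    <foaf:knows>
      <foaf:Person rdf:ID="HenryRzepa">
        <foaf:name>Henry Rzepa</foaf:name>
        <rdfs:seeAlso rdf:resource="http://www.ch.ic.ac.uk/rzepa/rzepa.xrdf"/>
      </foaf:Person>
    </foaf:knows>
    <foaf:knows>
      <foaf:Person rdf:ID="PeterMurrayRust">
        <foaf:name>Peter Murray-Rust</foaf:name>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>"""


def list_friends(rdf_xml):
    """Return (name, seeAlso URL or None) for every foaf:knows entry."""
    root = ET.fromstring(rdf_xml)
    friends = []
    for person in root.findall(".//foaf:knows/foaf:Person", NS):
        name = person.findtext("foaf:name", default="", namespaces=NS).strip()
        see_also = person.find("rdfs:seeAlso", NS)
        # Attributes are stored under their fully qualified name
        link = None if see_also is None else see_also.get("{%s}resource" % NS["rdf"])
        friends.append((name, link))
    return friends


if __name__ == "__main__":
    for name, link in list_friends(FOAF_SNIPPET):
        print(name, "->", link or "no FOAF file linked")
```

A friend without an rdfs:seeAlso link is a dead end for a crawler, which is one reason a network can look smaller than it is.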

Wednesday, October 24, 2007

One Billion Biochemical RDF Triples!

That must be a record! Eric Jain wrote on public-semweb-lifesci:

    The latest release of the UniProt protein database contains just over a
    billion triples*! PRESS RELEASE :-)

    The data is all available via the (Semantic or otherwise) Web:

    http://beta.uniprot.org/

    ...or can be bulk-downloaded from:

    ftp://ftp.uniprot.org/

    * Counting some reification statements, and assuming no overlap between
    "named graphs".

    P.S. This should be the last you'll hear from me on this topic -- I'm off
    to new adventures...

I surely hope this is not the last we hear of this huge RDF collection.

My blog turned 2

A bit over two years ago I posted my first blog item, Chem-bla-ics, introducing the topic of my blog. In January this year I explained why I like blogging.

Friday, October 19, 2007

Bob improved the POV-Ray export of Jmol

Bob has set up a new interface between the data model and the Jmol renderer, which allows him to define other types of export too. One of these is a POV-Ray export, which allows creating high-quality images for papers. Jmol has had POV-Ray export for a long time now, but it never included the secondary structures or other more recent visual features. PyMOL is well-known for its POV-Ray feature, and is often used to create publication-quality protein prints. The script command to create a POV-Ray input file takes the output image size as parameters:
write povray 400 600   # width 400, height 600

Here's a screenshot of a protein with surface:


And here a MO of water:



Note the shading. More examples are available here.

Thursday, October 18, 2007

More QSAR in Bioclipse: the JOELib extension

I added a Bioclipse plugin for JOELib (GPL, by Joerg) which comes with many QSAR descriptors, several of which are now available in the QSAR feature of Bioclipse:


Meanwhile, the Bioclipse team in Uppsala has set up the obligatory scatter plot functionality, but I leave that screenshot for them to show. Therefore, time for integration with R.

Open Data Misconception #1: you do not get cited for your contributions

The Open Data/ChemSpider debate is continuing, and Noel wondered in the ChemSpider Blog item about the Open Data spectra in ChemSpider. The spectra in ChemSpider come from four persons, two of whom released their data as Open Data (Robert and Jean-Claude) and two as proprietary data.

One of the latter two is Gary, who expressed his concerns in the ChemSpider blog that people would not cite his contributions if he released the data as Open Data:
    In principle, someone could download an assortment of spectra for a given molecule, calculate some other spectra, and then write a paper without ever recording a single NMR spectrum of their own. Would they then include the individual who deposited the spectra as a co-author or even acknowledge the source of the spectra that they used? Who knows.

It is a misconception that releasing your data as Open Data will lead to your scientific work not being acknowledged (citation statistics are the crude mechanism we use for that). First of all, using results without acknowledgment is called plagiarism (which is ethically wrong by any standard). But this is not a feature of Open Data; it is found in any form of science. Recall Herr Schön.

Some months back I advised another chemical database whose owners had similar concerns, and I pointed them, as I commented to Gary, to the CC-BY license, which has an explicit Attribution (BY) clause:
    Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

Using this license, plagiarism would not just be (scientifically) unethical, it would be illegal too, because it would break the license agreement. This even allows one to bring the case to court, if you like. (BTW, I was recently informed that the database had switched to the CC-BY license!)

Tuesday, October 16, 2007

Lunch at Nature HQ (with Euan, Joanna, Ian and Ålf)

On my way back from the Taverna workshop I visited Nature HQ, as Ian reported on Nascent. It was a (too) short meeting, but very nice to meet Euan (finally; he wrote the postgenomic.com software which I use for Chemical blogspace), Joanna (whom I met in Chicago already, where she had two presentations, and who is responsible for Second Nature), Ian (who works on Connotea, and commented on my tagging molecule blog) and Ålf (who works on Scintilla) and briefly Timo (who rules them all). BTW, I had a simple but delicious pasta.

First, let me note that if I had to name a favorite molecule, it would be acetic acid, not ascorbic acid. The reason is that acetic acid was the first organic molecule I put in the Woordenboek Organische Chemie in 1995.

We discussed a number of things, regarding the things we do. One of these was tagging molecules. Ian used http://rdf.openmolecules.net/?info:inchi/InChI=1/CH4/h1H4 instead of http://rdf.openmolecules.net/?InChI=1/CH4/h1H4. The first was not yet picked up by rdf.openmolecules.net but I fixed that.
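Accepting both URL forms comes down to stripping an optional info:inchi/ prefix from the query part. The sketch below shows one way to do that; the function name inchi_from_url is hypothetical, and this is not the code that runs on rdf.openmolecules.net.

```python
from urllib.parse import urlsplit

INFO_PREFIX = "info:inchi/"


def inchi_from_url(url):
    """Return the InChI found in the query part of the URL, or None."""
    query = urlsplit(url).query
    if query.startswith(INFO_PREFIX):
        # Ian's form: drop the info:inchi/ prefix first
        query = query[len(INFO_PREFIX):]
    return query if query.startswith("InChI=") else None


if __name__ == "__main__":
    for url in ("http://rdf.openmolecules.net/?info:inchi/InChI=1/CH4/h1H4",
                "http://rdf.openmolecules.net/?InChI=1/CH4/h1H4"):
        print(inchi_from_url(url))
```

Both URLs above resolve to the same InChI for methane, so either form can be used as a key for the same molecule.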

We also discussed linking molecular structures with scientific literature. The discussions in blogspace of this week show that doing that by using computer programs is not appreciated by publishers (see here, here, here, here, here, and here). (The publishers seem to prefer to send off a PDF to India or China.)

I proposed that the InChI would be part of the publication, for all molecules mentioned in the article. If a journal can require exact bibliography and experimental section formats, they can certainly require InChIs too. There are few programs left which cannot autogenerate an InChI, and the chemist draws the structures anyway. However, the software used in the editorial process does not support linking InChIs with a PDF (if only that software were open source ...).

So, the best current option seems to be social tagging mechanisms, and this is what we talked about. Just use Connotea (or any other service) and tag your molecule with a DOI:



and



This tagging is done manually. No machines involved in that. Nothing the publishers can do about this. No ChemRefer needed. But this will allow us to start building a database with links between papers and molecules, which we badly need. BTW, we will not have to start from scratch. The NMRShiftDB already contains many links, which is open data!

Now, you might notice the informal semantics of the doi: prefix. That's something I hereby propose, as it allows services to pick up the content more easily. You might also note the incorrect DOI in Connotea. The reason for that is that Connotea does not yet support a '/' in a tag. I reported that problem.
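As an illustration of why a fixed prefix helps, a service only needs a trivial filter to find the DOIs among the tags of a bookmark. This is a minimal sketch: the function name dois_from_tags and the example tag list are made up, and nothing here uses the Connotea API.

```python
DOI_PREFIX = "doi:"


def dois_from_tags(tags):
    """Return the DOIs from all tags that carry the doi: prefix."""
    return [tag[len(DOI_PREFIX):]
            for tag in tags
            if tag.lower().startswith(DOI_PREFIX) and len(tag) > len(DOI_PREFIX)]


if __name__ == "__main__":
    # Hypothetical tags on a bookmark for a molecule
    tags = ["methane", "doi:10.1186/1471-2105-8-487", "chemistry"]
    print(dois_from_tags(tags))
```

Without the prefix, a service would have to guess which tags are DOIs from their shape alone.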

ChemSpider: the SuSE GNU/Linux of chemical databases?

A molecular structure without any properties is meaningless. Structure generators can easily build up a database of molecules of unlimited size. 30 million in CAS, 20 million in ChemSpider or 15 million in PubChem is nothing yet. The value comes in when linking those structures with experimental properties.

Now, chemical industry, academia and publishers have done their best in the past 50 years to maintain such databases, and decided that a commercial model was the best option to do so. This was true 50 years ago, but no longer is. ICT has progressed so much that a 20M database can be stored on a local hard disc, or site repository anyway. Moreover, and more importantly, creating a database like this is much cheaper now. These ICT developments threaten the stone age chemical databases around now. Current approaches can easily build cheap and Open chemical databases, if only we all wanted to.

ChemSpider is attempting to set up the largest free chemical database, by mixing both Open data, as well as proprietary data. As such, they are attempting to achieve what SuSE and other commercial GNU/Linux distributions are trying to do: create a valuable product by complementing Open data with proprietary data when that adds value. That is, I think they are doing this. SuSE, for example, includes proprietary video drivers. ChemSpider, for example, contains proprietary molecular properties computed by ACD/Labs software (BTW, some of which can be done with Open tools too, as I will show shortly.)

Now, this poses quite a challenge: different licenses, different copyright holders, requirements to provide access to the source (for the Open data), etc, all in one system. Quite a challenge indeed, because ChemSpider is now required to track copyright and license information for each bit of information. GNU/Linux distributions do this by using a package (.deb, .rpm) approach. And, the sheer size of the database poses strong requirements if people start downloading the whole lot.

ChemSpider has had their share of critique, but they are learning, and trying to find a way to set up a sustainable environment for what they want to do. That might involve a revenue stream from clients if there is no governmental organization, academic institute or some society stepping in to provide financial means. A valid question would be why they did not set up a non-profit organization. But neither did SuSE, RedHat and Mandriva, and that has not stopped those from contributing to Open source.

I have no idea where ChemSpider will end up (consider that a request for a copy of the full set of Open Data), but am happy to help them distribute Open data, and even help them replace proprietary bits with open equivalents, which I'm sure they are open to. With respect to the proprietary bits they are redistributing, I understand they can only relay the ODOSOS message to the commercial partners from which they get those proprietary bits, and I hope they are doing so. ChemSpider has the great opportunity to show that releasing and contributing chemical data as Open Data does not conflict with a healthy self-sustainable business model.

Sunday, October 14, 2007

CompLife2007, Utrecht/NL. Day 1 and 2

CompLife 2007 was held 1.5 weeks ago in Utrecht, The Netherlands. The number of participants was much lower than last year in Cambridge. Ola and I gave a tutorial on Bioclipse, and Thorsten one on KNIME. Since a visit to Konstanz to meet the KNIME developers, I had not been able to develop a KNIME plugin, but this was a nice opportunity to finally do so. I managed to do so, and wrote up a plugin that takes InChIKeys and then goes off to ChemSpider to download MDL molfiles:


Why ChemSpider? Arbitrary. Done PubChem in the past already. Moreover, ChemSpider has the largest database of molecular structures and in that sense important to my research.

Why KNIME? Played with Taverna in the past, and expect to do much more work on Taverna in the coming year (see also this and this). Moreover, KNIME got a CDK plugin already, and the KNIME developers contributed valuable feedback to the CDK project in the last year. It was about time that I contributed something back, though the current functionality is quite limited. KNIME has a better architectural design than Taverna1, but will face tough competition from Taverna2, due next year.

The presentations
Heringa gave a presentation on network analysis, and discussed the scale-free network, hub nodes, etc, after which he gave an example on the 14-3-3 PPI family, which has both promoting and inhibiting capabilities. Fraser presented work on improving microarray data analysis, by reducing non-random background noise. Schroeter presented the use of Gaussian process modeling in QSAR studies, which allows estimation of error bars (see DOI:10.1002/cmdc.200700041). I did not feel the results were very convincing, though, but the method sounds interesting. Larhlimi presented research on network analysis of metabolic networks. His approach finds so-called minimal forward direction cuts, which identify critical parts in the network if one is interested in repressing certain metabolic processes. Hofto presented some work on the use of DFT for proteins, and pointed out that one has to do things critically to be able to reproduce binding affinities. Combinations of DFT or MM with QM are becoming popular to model binding sites. Van Lenthe presented such an approach on the second day of CompLife.

By far the most interesting talk at the conference was the insightful presentation by Paulien Hogeweg. She apparently coined the term bioinformatics. Anyway, she had an exciting presentation on feed-forward loops in relation to evolution, and showed correlation between jumps in FFL motifs and biodiversity. She also warned us about the Monster of Loch Ness syndrome, where computational models may indicate large underlying processes which do not really exist. But that should be a problem that most of my readers are aware of. She introduced evolutionary modeling, to put further restrictions on the models, to reduce the chance of finding monsters.

Hussong had an interesting presentation too, if one is interested in analysis of GC/MS or LC/MS data. He introduced a hard-modeling approach for proteomics data using wavelets technology. His angle on this was to use a wavelet that represents the isotopic pattern of a protein mass spectrum. Interestingly, the wavelet had negative intensities, something which one will never find in mass spectra. However, I seem to recall a mathematical restriction on wavelets that would forbid taking the squared version of the function. He indicated that the code is available via OpenMS.

Jensen, finally, presented his work at the UCC on Markov models for protein folding, where he uses the mean first passage time as observable to analyze processes in folding state space. This allows him to compare different modeling approaches and, for example, to predict how many time steps are needed to reach folding. Being able to measure characteristics of certain modeling methods, one is able to make an objective comparison. Something which allows a fair competition.

Why ODOSOS is important

I value ODOSOS very highly: they are a key component of science, and scientific research, though not every scientist sees this importance yet. I strongly believe that scientific progress is held back because of scientific results not being open; it's putting us back into the days of alchemy, where experiments were like black boxes and procedures kept secret. It was not until the alchemists started to properly write down procedures that it, as a science, took off. Now, with chemoinformatics in mind, we have the opportunity to write down our procedures in high detail.

I keep wondering what the state of drug research would be, if the previous generation of chemoinformaticians would have valued ODOSOS as much as I do. Now, with a close relative being diagnosed last week with a form of cancer with low five-year survival rates, I can not get more angry about those who want to make (unreasonable) money by selling scientific research. A 1M bonus is unreasonable. I can have 10 post-docs work on chemoinformatics research for the same period; I can have them work on drug design for various kinds of cancer.

Therefore, I will continue to use every opportunity to convince people of ODOSOS, and will continue to develop new methods to improve accurate exchange of scientific data and experimental results. I will help people where I can to distribute open data, even if the whole project is not 100% ODOSOS. For example, the Chemistry Development Kit is open source itself (LGPL) which does allow embedding into proprietary software. This does not mean that I will contribute to the proprietary software, and actually am proud not having done so in the last 10 years.

I will continue to advise people how to make their work more ODOSOS, even if they cannot make the full transition. I will also continue to make sure that all my scientific results are ODOSOS, as there is no other kind of science. To set a good example, and, hopefully, to lead the way.

This is why I am a proud member of the Blue Obelisk.

Monday, October 08, 2007

Taverna Workshop, Day 1 Update

The second part of the morning session featured a presentation by Sirisha Gollapudi, who spoke about mining biological graphs, such as protein-protein interaction networks and metabolic pathways. Pattern detection for nodes with only one edge, and cycles etc, using Taverna. An example data set she worked on is the Palsson human metabolism (doi:10.1073/pnas.0610772104); she mentioned that this metabolite data set contains cocaine :) Neil Chue Hong finished with an introduction on the OMII-UK, which is co-host of this meeting.

After lunch Mark Wilkinson introduced BioMoby, which we actually use in Wageningen already. I have tried to use jMoby to set up services based on the CDK, but have failed so far. I will talk with Mark about that. Next was my presentation, and I spoke about CDK-Taverna, Bioclipse and some peculiarities of chemoinformatics workflows, like the importance of intermediate interaction, the need to visualize the data, and complex, information-rich data. Bioclipse is seeing an integration of BioMoby and of Taverna.

After the coffee break Marco Roos spoke about myExperiment and his work on text mining. I unfortunately missed this presentation, as I was meeting with people from the EBI who work on the MACiE database (see this blog item).

A discussion session afterwards introduced a few more Taverna uses, and encountered technical problems. Taverna2 is actually going to be quite interesting, with a data caching system between work processors, and a powerful scheme of annotation of processors, which will allow rating, finding local services, etc. More on that tomorrow. Dinner time now :)

Taverna Workshop, Hinxton, UK

I arrived at the EBI last night for the Taverna workshop, during which the design of Taverna2 is presented and workflow examples are discussed. Several 'colleagues' from Wageningen and the SARA computing center in Amsterdam are present, along with many other interesting people. This afternoon is my presentation.

Paul Fisher just presented his PhD work on using workflows to improve the throughput of QTL matching against pathway information and phenotype. One interesting note was the role of workflows in making biological informational studies more reproducible. He has the workflow explicitly retrieve the versions of the online databases it uses, so that these get stored in the workflow output.
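The idea of storing database versions alongside the results can be sketched as follows. This is not Taverna code and not Paul's workflow; the function name, the toy workflow, and the database name and release string are hypothetical placeholders.

```python
import json


def run_with_provenance(workflow, inputs, database_versions):
    """Run a workflow callable and bundle the database versions with its output."""
    return {
        "result": workflow(inputs),
        "provenance": {"database_versions": dict(database_versions)},
    }


def toy_workflow(inputs):
    """Stand-in for a real analysis step: it just sorts its inputs."""
    return sorted(inputs)


# The database name and release string below are placeholders, not real data.
output = run_with_provenance(
    toy_workflow,
    ["gene_b", "gene_a"],
    {"example_pathway_db": "hypothetical-release-1"},
)
print(json.dumps(output, indent=2, sort_keys=True))
```

Anyone rerunning the analysis later can then see from the output alone which database releases the results depend on.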

Monday, October 01, 2007

How the blogosphere changes publishing

Peter is writing up a 1FTE grant proposal for someone to work on the question of how automatic agents and, more interestingly, the blogosphere are changing, no, improving, the dissemination of scientific literature. He wants our input. To make his work easy, I'll tag this item pmrgrantproposal and would ask everyone to do the same (Peter unfortunately did not suggest a tag himself). Here are pointers to blog items I wrote, related to the four themes Peter identifies.

The blogosphere oversees all major Open discussion

The blogosphere cares about data

Important bad science cannot hide
I do not feel much like pointing to bad scientific articles, but want to point to the enormous amount of literature being discussed in Chemical blogspace: 60 active chemical blogs discussed just over 1300 peer-reviewed papers from 213 scientific journals in less than 10 months. The top 5 journals have 133, 78, 68, 57 and 48 papers discussed in 22, 24, 10, 11 and 18 different blogs respectively. (Peter, if you need more in depth statistics, just let me know...)

Two examples where I discuss not-bad-at-all scientific literature:
Open Notebook Science
I regularly blog about the chemoinformatics research I do in my blog. A few examples from the last half year:

Update: after comments I have removed one link, which I need to confirm first.

Sunday, September 30, 2007

CompLife2007, Utrecht/NL; Taverna, EBI/Hinxton/UK

Two working days left before I'm off to two conferences. First, next Thursday/Friday, the two-day CompLife2007 in Utrecht/NL, with sessions on genomics, systems biology, medical information and data analysis. On the second day there are tutorials on KNIME and CDK/Bioclipse. I will try to orient as much as possible around MS-based metabolomics, and metabolite identity in particular. Last year the conference was very interesting.

The Monday/Tuesday after that, at the Taverna meeting, I will present the CDK-Taverna integration I worked on in 2005 (see e.g. Taverna on Classpath and CDK-Taverna fully recognized), before Thomas continued that work, leading to the cdk-taverna.de plugin website. If time permits, I will prepare an example workflow from metabolomics. Unlike previous times I went to Cambridgeshire, I won't fly in to Stansted, but take the EuroStar instead. I am very much looking forward to that. Unfortunately, I will not have time to visit Cambridge itself, this time :(

Friday, September 28, 2007

SMILES to become an Open Standard

Craig James wants to make SMILES an open standard, and this has been received with much enthusiasm. SMILES (Simplified molecular input line entry specification) is a de facto standard in chemoinformatics, but the specification is not overly clear, which Craig wants to address. The draft is CC-licensed and will be discussed on the new Blue Obelisk blueobelisk-smiles mailing list.

Illustrative is my confusion about the lower case element symbols in SMILES: do they mark sp2 hybridized atoms or, as is very often assumed, aromaticity? I have written up the arguments supporting both views in the CDK wiki. I held the position that lower case elements indicate sp2 hybridization, and the CDK SMILES parser was converted accordingly some years ago. A recent thread, however, stirred up the discussion once more (which led to the aforementioned wiki page).

You can imagine my excitement when I looked up the meaning in the new draft. It states: The formal meaning of a lowercase "aromatic" element in a SMILES string is that the atom is in the sp2 electronic state. When generating a normalized SMILES, all sp2 atoms are written using a lowercase first character of the atomic symbol. When parsing a SMILES, a parser must note the sp2 designation of each atom on input, then when the parsing is complete, the SMILES software must verify that electrons can be assigned without violating the valence rules, consistent with the sp2 markings, the specified or implied hydrogens, external bonds, and charges on the atoms.
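The first half of that parsing rule, noting the sp2 designation of each atom on input, can be illustrated with a small sketch. This is not the CDK parser and not code from the draft: the regular expression is my own simplification that only covers the organic subset plus a rough bracket-atom form, and the second half of the rule (verifying that electrons can be assigned without violating the valence rules) is deliberately left out.

```python
import re

# Simplified atom pattern (an assumption, not the full SMILES grammar):
# a bracket atom such as [nH], [Na+] or [13C@@H], or an organic-subset atom.
ATOM = re.compile(
    r"\[\d*(?P<bracket>[A-Z][a-z]?|se|as|[bcnops])[^\]]*\]"
    r"|(?P<plain>Cl|Br|[BCNOPSFI]|[bcnops])"
)


def sp2_marks(smiles):
    """Return (element, marked_sp2) pairs, one per atom, in input order.

    An atom counts as marked sp2 when its symbol starts with a lowercase
    letter. Bonds, ring closure digits and branches are skipped.
    """
    atoms = []
    for match in ATOM.finditer(smiles):
        symbol = match.group("bracket") or match.group("plain")
        atoms.append((symbol.capitalize(), symbol[0].islower()))
    return atoms


print(sp2_marks("c1ccccc1O"))   # phenol: six marked carbons, one unmarked oxygen
print(sp2_marks("c1cc[nH]c1"))  # pyrrole: the bracketed nitrogen is marked too
```

A real parser would keep these marks on the atoms and only afterwards check them against valences, hydrogens and charges, as the draft requires.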