## Wednesday, December 28, 2005

### The good, the bad and the ugly molecules

Derek Lowe is the author of the blog In the Pipeline, which is really fun to read. Derek works in the pharmaceutical industry and gives great insight into how things work in that field of molecular sciences. Yesterday he blogged about What Makes an Ugly Molecule?, and touched on the Rule-of-Five, the hydrochloric acid bath (aka stomach), and other reasons that make molecules ugly.

But there are many other interesting posts, and, something that my blog still lacks, comments by many users, discussing the ideas he posts, making his blog even nicer.

## Tuesday, December 27, 2005

### Knoppix saves the day...

After the three obligatory days of Christmas holidays (fun, especially with two children, but very exhausting), it is time to get back to business again. I'm still at my father-in-law's place with only XP installed, so I booted the Knoppix 4.0.2 DVD I burned last Friday. Eclipse is not working, but being able to use Kmail to read my email again is just what you need as an internet junkie. A computer is just not complete without a nice KDE session hanging around.

Anyway, I started Eclipse on my computer at work and tunneled the window over SSH. Not overly fast, but it seems to run fine. (If only I knew how to set up NX on that Kubuntu Breezy system!) Let's see if I can get the CDK bug count somewhat lower.

## Friday, December 23, 2005

### Subset selection: mind the complexity

In a recent JCIM article, Schuffenhauer compares a few subset selection methods, and notes that some of them reduce the average complexity of the molecules. They put this in relation to other research that states that lead compounds with high complexity have higher activities. Recommended reading material for the holidays.

## Sunday, December 18, 2005

### StatCVS on CDK

One of the Classpath developers pointed me to their CVS statistics when I asked them how actively their project is currently developed, i.e. the number of active developers.

The pages are generated with StatCVS, so I ran it on the CDK too:

I knew I did a lot of work on the CDK, but never realized that 62.7% of the commits were mine! Keep in mind, though, that a lot of these commits are for code maintenance! Next in line are steinbeck and rajarshi. In total 28 people committed patches to CVS, though other people contributed patches too, which were committed by a developer with write access. There is a jump in the commit messages somewhere this summer, which I think is the move of the data directory from cdk/data to cdk/src/data.

The full analysis results can be found here. It was generated with the StatCVS version in sid, and I will rerun it soon with a more recent StatCVS version.

## Friday, December 16, 2005

### CDK Debug classes and fixing the ModelBuilder3D bug

For some weeks now I have been thinking about bug 1309731 : "ModelBuilder3D overwrites Atom IDs". The ModelBuilder3D is a complex piece of source code, reusing many other parts of the CDK, including atom type perception.

Somewhere in October, however, I found that Taverna could not create 3D models and convert these into reasonable CML because the Atom ID's were messed up. So the question is, where did the ModelBuilder3D do this? Did it do this itself, or is it done by one of the other pieces of CDK that it uses? But due to the complex nature of this algorithm, it quickly became clear that looking at the code was not going to solve it; there was too much code to look at.

The solution was clear to me: use the new data interfaces. To identify where the IDs were messed up, I only needed to write a DebugAtom class with a method that looked like:

    public void setID(String identifier) {
      logger.debug("Setting ID: ", identifier);
      super.setID(identifier);
    }

And I would immediately see at what stage the ID was overwritten.
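The idea can be tried out without the CDK at all. The following is a minimal sketch, assuming hypothetical stand-in classes: `Atom`, `DebugAtom` and `suspiciousAlgorithm` here are illustrations, not the CDK classes, and a plain list replaces the CDK logger.

```java
import java.util.ArrayList;
import java.util.List;

public class DebugAtomSketch {

    /** Hypothetical stand-in for the CDK Atom class; only the ID is modelled. */
    static class Atom {
        private String id;

        public void setID(String identifier) {
            this.id = identifier;
        }

        public String getID() {
            return id;
        }
    }

    /** Debug variant: records every ID change, then delegates to the real setter. */
    static class DebugAtom extends Atom {
        final List<String> log = new ArrayList<>();

        @Override
        public void setID(String identifier) {
            log.add("Setting ID: " + identifier);
            super.setID(identifier);
        }
    }

    /** Hypothetical code under suspicion; it only knows the Atom type. */
    static void suspiciousAlgorithm(Atom atom) {
        atom.setID("C");
    }

    public static void main(String[] args) {
        DebugAtom atom = new DebugAtom();
        atom.setID("carbon1");
        suspiciousAlgorithm(atom);
        // Every call to setID() shows up in the log, including the unwanted one.
        for (String line : atom.log) {
            System.out.println(line);
        }
    }
}
```

Because the suspicious code only sees an `Atom`, nothing in it has to change: handing it a `DebugAtom` is enough to make each overwrite visible.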

So I started this week to implement the DebugAtom and related classes. By extending Atom, I could just add debugging stuff and reuse the code in that class. However, the DebugAtom then cannot extend DebugAtomType as well. And this is a pity, because all methods that the Atom interface inherits from the AtomType, Isotope, Element and ChemObject interfaces cannot be inherited from the DebugAtomType class. Instead, the debug classes now have to duplicate those bits of code.

This is not a clean solution, as duplicate code is a known cause of bugs. So, the next step was to write JUnit tests for the new debug classes. And for this I wanted to reuse, i.e. extend, the tests for the default data classes. This required, however, changes to those test classes.

The first thing that needed to be changed was that instantiation of data classes in the tests would now have to depend on the data classes being tested. A simple

    Atom atom = new Atom("C");

only makes sense when a specific Atom class is important. Fortunately, the new interfaces provide a solution for this: the ChemObjectBuilder implementations. These allow the following syntax to replace the hard-coded instantiation:

    Atom atom = builder.newAtom("C");

Therefore, I added a protected field to the AtomTest, which was instantiated in the setUp():

    protected ChemObjectBuilder builder;

    public void setUp() {
      builder = DefaultChemObjectBuilder.getInstance();
    }

and used this builder to instantiate all test objects, as shown for the atom above.

And then I can simply reuse this JUnit test by defining the DebugAtomTest like:

    public class DebugAtomTest extends AtomTest {

      public DebugAtomTest(String name) {
        super(name);
      }

      public void setUp() {
        super.builder = DebugChemObjectBuilder.getInstance();
      }

      public static Test suite() {
        return new TestSuite(DebugAtomTest.class);
      }
    }

The sources for these debug data classes tests are found in the new cdk.test.debug package.
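The reuse pattern itself is independent of JUnit and the CDK. Here is a self-contained sketch of it, assuming hypothetical, cut-down stand-ins: `Atom`, `DebugAtom`, `Builder`, `AtomTest` and `DebugAtomTest` below only mimic the roles of the real classes and are not the CDK API.

```java
public class BuilderTestSketch {

    /** Hypothetical stand-in for a default data class. */
    static class Atom {
        private final String symbol;

        Atom(String symbol) {
            this.symbol = symbol;
        }

        String getSymbol() {
            return symbol;
        }
    }

    /** Hypothetical stand-in for the matching debug data class. */
    static class DebugAtom extends Atom {
        DebugAtom(String symbol) {
            super(symbol);
        }
    }

    /** Hypothetical, cut-down version of the builder idea. */
    interface Builder {
        Atom newAtom(String symbol);
    }

    /** The test never calls a constructor directly; it asks the builder. */
    static class AtomTest {
        protected Builder builder;

        void setUp() {
            builder = Atom::new;
        }

        Atom testGetSymbol() {
            Atom atom = builder.newAtom("C");
            if (!"C".equals(atom.getSymbol())) {
                throw new AssertionError("unexpected symbol: " + atom.getSymbol());
            }
            return atom;
        }
    }

    /** Swapping the builder reruns every inherited test against the debug class. */
    static class DebugAtomTest extends AtomTest {
        @Override
        void setUp() {
            builder = DebugAtom::new;
        }
    }

    public static void main(String[] args) {
        AtomTest[] tests = { new AtomTest(), new DebugAtomTest() };
        for (AtomTest test : tests) {
            test.setUp();
            Atom atom = test.testGetSymbol();
            System.out.println(test.getClass().getSimpleName()
                + " exercised " + atom.getClass().getSimpleName());
        }
    }
}
```

The only thing the subclass changes is which builder is used, so every test method written once in the parent is run against both implementations.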

The number of JUnit tests for the CDK jumped from around 1250 to over 1500 tests right now. And if you think these new tests only test old code, because of all the super.bla() calls in the debug classes, you're way off. I found bugs in the new debug classes, but also many class cast bugs and several other problems in the real data classes!

Anyway. Does this help fix the ModelBuilder3D bug? Yes, it does:

    $ grep "Setting ID" reports/result.modeling.builder3d.ModelBuilder3dTest.txt
    org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: carbon1
    org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: oxygen1
    org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: C
    org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: HC
    org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: HC
    org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: HC
    org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: O
    org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: HO

This shows me where the Atom ID is overwritten to be something other than "carbon1"! I can now look at the rest of the result.modeling.builder3d.ModelBuilder3dTest.txt file to see what the ModelBuilder3D was doing at the time, and which CDK class made the setID() call. I only needed to change this line in the JUnit test for the bug to generate the above debug lines:

    Molecule methanol = new Molecule();

into

    Molecule methanol = new DebugMolecule();

## Tuesday, December 13, 2005

### Math libraries for Java?

I drop in on the #classpath channel of the freenode.net IRC network, where the #cdk channel runs too. The #classpath channel is for the Classpath project, which is developing the free Java libraries used by most open source virtual machines. A Slashdot.org item was mentioned: "Java Is So 90s". It led to a funny discussion about what that would make C/C++ and Fortran.

A more serious question was brought up: where are the efficient and super fast Java linear algebra and complex number libraries? There is Weka, but it is more aimed at data analysis. I believe it has support for principal component analysis, so it must have singular value decomposition. There is a book called Java Number Cruncher: The Java Programmer's Guide to Numerical Computing by Ronald Mak, 2003, Prentice Hall.

After some further asking about it on the channel, they mentioned the Apache commons math project, which seems promising. The website mentions complex numbers, linear algebra, statistics and numerical analysis, but I have not looked at the full API, so I am not sure how well populated these areas are. Anyone with experience in the area of numerical computing and Java?

## Saturday, December 10, 2005

### Jumbo 5.0 and the CDK

I reported earlier that the CDK has been updated in CVS to use CML from the new Jumbo 5.0. The transition actually involved a lot of changes in the CDK, some of which I would like to address in the following comments.

One thing is that CML write support (not reading!) uses the new Jumbo library, which requires Java 1.5. Thus, if Java 1.5 is not available, then CML writing should not be compiled. This is how this is done.

The JavaDoc

The CDK makes extensive use of JavaDoc taglets. CDK uses tags of type @cdk.SOMETAG. An important tag in this case is the @cdk.require tag, because it allows us to make the CDK build system aware that the class requires Java 5.0 to be compiled. Thus, we have for example this code in CVS, of which bits are:

    /**
     * Serializes a SetOfMolecules or a Molecule object to CML 2 code.
     * Chemical Markup Language is an XML based file format {@cdk.cite PMR99}.
     * Output can be redirected to other Writer objects like StringWriter
     * and FileWriter.
     *
     * @cdk.module libio-cml
     * @cdk.builddepends xom-1.0.jar
     * @cdk.depends jumbo50.jar
     * @cdk.require java1.5
     */
    public class CMLWriter extends DefaultChemObjectWriter {}

As is probably clear, this class depends on two jars, of which jumbo50.jar itself is not required for compiling the class source code. It also shows the use of the @cdk.require tag.

The build.xml

Because the CDK still does not require Java 1.5, the CDK is supposed to be buildable with Java 1.4 (the oldest supported Java release). The Ant build.xml script is quite able to conditionally leave out compiling parts of the CDK, if configured correctly using proper JavaDoc tags, as explained earlier.

First, the build.xml checks what libraries are available for compiling certain parts of the CDK. For example, the build.xml code to check for Java 1.5 looks like:

    <condition property="isJava15">
      <contains string="${java.version}" substring="1.5"/>
    </condition>

Run ant info to see what is being checked for, or look at the build.xml source code for the check target.
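In plain Java terms, the Java 1.5 check described above is nothing more than a substring test on the `java.version` system property. A minimal sketch of the same test follows; the class and method names are made up for illustration and are not part of the CDK build.

```java
public class JavaVersionCheck {

    /** Mirrors the Ant contains condition: a plain substring test. */
    static boolean isJava15(String javaVersion) {
        return javaVersion.contains("1.5");
    }

    public static void main(String[] args) {
        // The same property that Ant exposes as ${java.version}.
        String version = System.getProperty("java.version");
        System.out.println(version + " -> isJava15=" + isJava15(version));
    }
}
```

A substring test is deliberately loose: it is true for any version string that contains `1.5`, so it only does what is intended as long as version strings look like `1.4.2` or `1.5.0`.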

All compiling is done by the compile-module target, and there it includes and excludes bits of the CDK depending on the checked conditions:

    <javac srcdir="${build.src}" destdir="${build}" optimize="${optimization}"
           debug="${debug}" deprecation="${deprecation}">
      <excludesfile name="${src}/java1.4+.javafiles" if="isJava13"/>
      <excludesfile name="${src}/java1.4.javafiles" unless="isJava14"/>
      <excludesfile name="${src}/java1.5.javafiles" unless="isJava15"/>
      <excludesfile name="${src}/ant1.6.javafiles" unless="hasAnt16"/>
      <excludesfile name="${src}/r-project.javafiles" unless="rispresent"/>
      <includesfile name="${src}/${module}.javafiles"/>
    </javac>

Keep in mind that the *.javafiles are created with JavaDoc based on the CDK JavaDoc tags mentioned earlier.

The build.xml 2

While the above mechanism has been present for some time now, having jumbo50.jar in CVS made the situation a bit trickier: the jumbo50.jar uses the 49.0 class format used in Java 1.5, and cannot be processed by Java 1.4 systems. Since the classpath used when compiling CDK source code is defined in configuration files for those modules in src/META-INF, the problem did not occur when compiling the modules. However, it did show an error in the reallyRunDoclet target today, when I was creating the *.javafiles with JavaDoc. The solution was trivial:
chat channel.

## Saturday, October 15, 2005

### Single PDFs for CDK News articles

This week was the CDK5AW event, a workshop for users and developers of the Chemistry Development Kit (CDK). After talking with other developers we agreed on creating PDF and HTML versions of single articles that appeared in the CDK News newsletter. Well, I haven't figured out how to create nice HTML (latex2html does not give nice results; any ideas, anyone?), but for the PDF version I now have a pipeline.

For each article, a split.config file determines which pages from the CDK News issue PDF should be extracted. To do this, I used the PDF ToolKit, or pdftk for short (comes with Debian/Ubuntu by default). Using a Perl script to read these config files, the pipeline creates PDF files for each article. Currently, I'll only have it do the feature articles; that is, not the ChangeLog, Editorial, Literature and FAQ. For those you'll need to download the full issue. If you don't like that, let me know :)

Ok, you will probably have noticed that the almost server is down (Googling for 'CDK News' allows you to read the cache!), and the PDFs will be uploaded there asap. For those not familiar with CDK News, the articles are FDL, so feel free to copy and distribute them. If you reuse the text and update it, which is allowed too, please let us know.
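To give an idea of what such a pipeline does, here is a rough sketch in Java rather than Perl. It is not the actual script. It assumes a hypothetical config format with one `name first-last` entry per line (the real split.config layout may well differ), and it relies only on pdftk's `cat` and `output` operations to extract a page range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitArticles {

    /**
     * Builds one pdftk command per config line.
     * Assumed (hypothetical) line format: "article-name firstPage-lastPage".
     * Empty lines and lines starting with '#' are skipped.
     */
    static List<List<String>> commands(String issuePdf, List<String> configLines) {
        List<List<String>> result = new ArrayList<>();
        for (String line : configLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] parts = trimmed.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Bad config line: " + line);
            }
            // pdftk <input> cat <page range> output <file>
            result.add(Arrays.asList(
                "pdftk", issuePdf, "cat", parts[1], "output", parts[0] + ".pdf"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up article names and page ranges, for illustration only.
        List<String> config = Arrays.asList(
            "# name pages", "feature-one 3-7", "feature-two 8-12");
        for (List<String> command : commands("issue.pdf", config)) {
            System.out.println(String.join(" ", command));
            // To really run it: new ProcessBuilder(command).inheritIO().start().waitFor();
        }
    }
}
```

Keeping the page ranges in a config file means that adding an article to the set is a one-line change, with no need to touch the code.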

### Chem-bla-ics

This new blog will deal with chemblaics in the broader sense, and will not be restricted to research in this field in which I am involved personally.

Chemblaics (pronounced chem-bla-ics) is the science that uses computers to address and possibly solve problems in the area of chemistry, biochemistry and related fields. The general denominator seems to be molecules, but I might be wrong there.

The big difference between chemblaics and areas such as cheminformatics, chemoinformatics, chemometrics, proteochemometrics, etc., is that chemblaics only uses open source software, making experimental results reproducible and validatable. And this is a big difference from how research in these areas is now often done.

Egon