## Monday, September 27, 2010

### Visualizing data embedded in XHTML+RDFa

Orion picked up my challenge and made a web-based application to graph data from one of my XHTML+RDFa pages. Well done! He wrote up his work on his blog, and the result looks like this:

The text field shows the SPARQL used to aggregate the data, which is then visualized in the plot below that field. You can edit the SPARQL and, for example, plot the boiling point (t) as a function of the number of carbons (p):

This work nicely shows some interesting McPrinciples: it shows what happens if we allow reuse and share our knowledge; it shows that nice graphics and semantic access to original data are very compatible. All in all, this is an important step forward to semantic publishing of chemical data! Orion, thanx for this really nice work!

## Sunday, September 26, 2010

Tomorrow it is 10 years ago that the CDK was founded. The project has grown considerably and has a high impact on science. The CDK code base did not start from nothing 10 years ago: it was founded on the code bases of CompChem, JChemPaint, and Jmol, and has seen several reincarnations since the start.

The CDK project is by now so large that it is hardly possible to keep up, and I am very grateful, particularly to Chris and Rajarshi for actively keeping the project going, to all those who submit patches and bug reports, and to all who use the CDK in their software. This has created a healthy development and user community, as is visible from the blog aggregator Planet CDK.

But, reflecting on the past, it is also clear where the project needs help. The flow of CDK News papers has effectively dried up, the documentation needs serious updating, and we still need far more unit testing, as well as more in-depth validation of algorithm implementations. And we all know we are short on code reviewers to control the flow of patches going into the library. There is also still some functionality missing, like a simple force field (the Jmol LGPL UFF code could be ported, doi:10.1021/ja00051a040) and support for popular file formats like Symyx V3000 molfiles and the ChemDraw CDX format.

I am really positive about the future of the CDK project; its progress is mostly limited by the number of people working on maintenance, code quality, and releases. For example, I would love more frequent releases, but making a release takes about half a day. It is not merely a matter of creating the files to distribute, but also of ensuring that the branch is in a releasable state, that it has no important outstanding bugs and at least does not have more unit test failures than the previous release (preferably fewer...), and of writing a release message.

This maintenance also involves writing unit tests for reported bugs, and ensuring that someone fixes the bug. This is a second important challenge for the project: how to keep the original code authors involved, and make them feel responsible for fixing bugs in the code they wrote. Cheminformatics is very much a field of write once, go off to another job, and forget about it. This is why I insist so strongly on unit tests, proper JavaDoc, and clean code, so that others can do this required code maintenance.

If we look at the current numbers, we see about 170 open bugs out of 1115 ever reported, and 24 open patch reports out of 276 reported. Those are acceptable numbers, though they need to go further down.
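
To put those counts in proportion, here is a quick back-of-the-envelope calculation in Python. The function name is made up for illustration, and the bug count is only approximate ("about 170"), so read the outcome as a rough share, not an exact statistic.

```python
def open_share(open_count, total):
    """Percentage of reports that are still open, rounded to one decimal."""
    return round(100.0 * open_count / total, 1)


# Counts as quoted in the text above (the bug count is approximate).
bugs = open_share(170, 1115)
patches = open_share(24, 276)

print("open bugs:    %.1f%%" % bugs)
print("open patches: %.1f%%" % patches)
```

So roughly one in seven bug reports and one in twelve patch reports is still open.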

I really hope that 2011 will be the year that commercial CDK support picks up, providing value for users through dedicated support. Right now, to get something fixed, you need to wait for someone to fix the problem; however, none of the CDK developers is actually working solely on the CDK and many contributions are made in spare time. That nicely shows the power of Open Source, but also illustrates the need for proper funding. That said, this is merely limited by people actually willing to pay for such support, or even just to donate financial support to the project. If you are interested in that, please contact me offline, as we have the means in place to do this.

In short, I have no clue where the CDK will go, except that it will continue to grow. This is another power of Open Source: the accumulated effort cannot be lost. Seriously, back in 2004 I wrote a post titled "What's 2004 going to bring?", and here's a lousy attempt for 2011:
• a new stable series, 2.4 or 3.0 (versioning has not been decided on yet)
• it will be faster and support parallel computing
• we will have a UFF implementation
• more extensive chirality support (EZ, ...)
• rendering and editor will be integrated
• we will use JExample for unit testing
• cheminformatics in the webbrowser (using the CDK)
• we will have books about the CDK
• more molecular descriptors

But we will also have to overcome these issues, for which we need your help:
• CDK News needs a new editorial board
• we need a second release manager (one for stable, one for the development branch)
• we need more code reviewers
• making patches is easier than ever

### Using PubChem to create CDK unit tests

In 2008 I posted about Wicked chemistry and unit testing, and at the time I was using BeanShell to convert a structure on PubChem into CDK source code. Since I prefer Groovy now, I have updated the code. I now use CDK 1.3.6 and the PubChem XML format:
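import org.openscience.cdk.Molecule;
import org.openscience.cdk.io.*;

if (args.length == 0 || args[0] == null) {
  System.out.println("Syntax: pc2ut.groovy [CID]\n");
  System.exit(0);
}

String cid = args[0];
String urlString =
  "http://pubchem.ncbi.nlm.nih.gov/summary/" +
  "summary.cgi?disopt=SaveXML&cid=" + cid;

URL url = new URL(urlString);

// Reconstructed step (assumption): the listing as posted never defines
// 'mol'; reading the PubChem XML is the most likely missing piece.
PCCompoundXMLReader reader =
  new PCCompoundXMLReader(url.openStream());
Molecule mol = (Molecule)reader.read(new Molecule());
reader.close();

// Write the molecule out as CDK source code.
StringWriter stringWriter = new StringWriter();
CDKSourceCodeWriter writer =
  new CDKSourceCodeWriter(stringWriter);
writer.write(mol);
writer.close();

System.out.print(stringWriter.toString());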


Update: An observant reader will have noticed that the current CDKSourceCodeWriter actually produces code that does not compile. The CDK API has changed, but the generated output was not updated accordingly. Apparently, no one is actually using this class, or those who are were not interested enough in that piece of functionality to file a bug report.

## Wednesday, September 22, 2010

### Presentation at AstraZeneca

Today and tomorrow I am a guest at AstraZeneca in Mölndal. This is the presentation I gave today about the work on Bioclipse-RDF:

## Monday, September 20, 2010

### Noel's chemical ASCII art

Very short post, but this is brilliant! I am not sure if Noel posted this in reply to this Blue Obelisk eXchange question, but it certainly answers it:

Noel, very nice post!

## Friday, September 17, 2010

### The CDK 2003 paper is now cited 99 times

The 100th citing paper, or the first one that is not mine, will get a special review in my blog.

### A list of things I miss in CiteULike

AJCann posted a blog post today about what he doesn't like about Mendeley. Abhishek replied that he does not like people complaining about one tool instead of pointing out a good alternative. Mendeley has two alternatives, Zotero and CiteULike (there is also Connotea, but it has fallen behind in its evolution).

I agree with @citeulike and @abhishektiwari: for a service provider any bad news is good news too, as it provides opportunities to improve. So, as encouraged to do so, I reported my long list of things I miss in CiteULike:
• @citeulike ok, one more. wish #18: get readermeter.org to also support citeulike
• @citeulike wish #17: allow people linking between papers in their libs using CiTO to annotate how they cite papers, see http://ur.ly/lBUO
• @citeulike wish #16: I think I saw images from some papers, right? how about doing that for #biomedcentral journals too?
• @citeulike wish #15: at the same http://ur.ly/lIGn page, the tag cloud should reflect tag use with font sizing
• @citeulike wish #14: upon 'post url', the first page with extraced information should allow marking as 'I am author' (cannot find that)
• @citeulike (new) wish #12: clicking an account name should get me to a public portal, rather than just his paper list
• @citeulike good point, wish #13: be more strong on requiring people to tag papers... and use article keywords as default tags
• @citeulike wish #11: remove 'no-tag' from tag clouds
• @citeulike wish #10: support #RDF export with BIBO and/or PRISM
• @citeulike wish #9: use #foaf for the RDFa for account pages, and to mark up friends
• @citeulike wish #8: and more generally, make #citeulike part of the #linkeddata network (provide an #rdf API)
• @citeulike wish #7: start using RDFa, e.g. with the PRISM ontology
• @citeulike wish #6: on an article page (like http://ur.ly/lvWk) summarize the network that bookmarked that article, not just the acc names
• @citeulike wish #5: don't show the 'copy' button for papers that are already in my archive (really a bug)
• @citeulike indeed, but don't or do it right... wish #4: allow people to have that link automatically point to an external blog
• @citeulike wish #3: provide summaries of lists, like article count per journal and article count per year
• @citeulike well, I'll use the blog functoinality to summarize... wish #2: do not try to be a blogging platform
• @citeulike (new) wish #1: put automatically focus on text field after clicking search and select all text for easy deletion
The reports are now also available in the fora of CiteULike.

## Thursday, September 16, 2010

### xournal: free annotator for PDFs

See http://xournal.sourceforge.net/ or go for 'sudo aptitude install xournal'.

### Call for Papers: Thematic Series about RDF in Chemistry

As a follow up of the ACS RDF 2010 Symposium we just had in Boston, I can announce that we are preparing a thematic series in the Journal of Cheminformatics around this theme, and that at least six speakers will present their work in this series.

However, as it was the goal of the meeting to create an active and collaborating community, we feel it is important to open up and encourage others to submit papers to the series too, as long as they are about the use of the Resource Description Framework in chemistry. The exact scope of the series is that of the symposium, of which all abstracts and some slide sets are available here.

Therefore, it is my pleasure to send around this encouragement to submit papers:
L.S.,

As organizers of the ACS RDF 2010 Symposium held at the American Chemical Society meeting in Boston in August 2010 (http://egonw.github.com/acsrdf2010/), we would like to encourage you to submit a paper to a Thematic Series of papers around the use of Resource Description Framework (RDF) in chemistry, in the Journal of Cheminformatics (http://www.jcheminf.com/). Six speakers have already agreed to participate.

Journal of Cheminformatics was launched in March 2009 as a fully Open Access cheminformatics journal. Papers published in this journal benefit the cheminformatics community through the free, widespread and unrestricted readership that Open Access offers. As organizers of the ACS RDF 2010 symposium, we believe that it is important to share research in this area within the cheminformatics community as much as possible. The journal is peer reviewed, has unrestricted article length, and unlimited use of color illustrations, supplementary files, etc., making it an excellent platform to both give an overview of scientific progress and go into detail at the same time. Authors retain complete copyright to their published paper.

An example Thematic Series is available at: http://www.biomedcentral.com/series/ENSEMBL2010.

We would be pleased to explore this opportunity further with you, so if you have any comments or questions e.g. about the scope of the thematic issue, article type, or otherwise, please don't hesitate to get in touch.

Submission Deadline: 28 November 2010

Submission Info (process, fees, policies): http://www.jcheminf.com/info/instructions/

Looking forward to hearing from you,

with kind regards,

Egon Willighagen

Martin Braendle

## Wednesday, September 15, 2010

### CDK 1.3.6: the changes, authors, and reviewers

The list of changes is particularly long for this development release. Therefore, I will list the authors and reviewers first. Note that this release also includes the changes of the CDK 1.2.6 and 1.2.7 releases.

The authors
I am mildly impressed by this release's list of authors... it is certainly not the usual suspects anymore, and I would like to thank all contributors very much. Interestingly, the number of places that contributions come from is increasing too: we see patches from 7 different institutes (6 if you count the current ones)!
87  Egon Willighagen
42  Gilleain Torrance
9  Rajarshi Guha
4  Mark Rijnbeek
2  Andreas Truszkowski
2  Saravanaraj


The reviewers
The below list has also slightly gained in size, and I welcome Gilleain and Syed as active reviewers.
56  Rajarshi Guha
18  Egon Willighagen
4  Gilleain Torrance


The changes
The changes include bug fixes (see also the 1.2.6 and 1.2.7 release notes), but also an updated SMSD engine, a rename of the IAtomType method getHydrogenCount() into getImplicitHydrogenCount() and of MDLWriter into MDLV2000Writer, and the addition of the signature code by Gilleain, and many, many small fixes.
• Compare values not objects (fixes #3061263) 1e4adac
• Unit test to reproduce failing atom type perception with one of the options to create a -1 Integer object 87f8bfe
• Smiles parser setting to preserve aromaticity as provided in the Smiles String itself. ae21ee2
• Added two further unit tests: one to see if the descriptor properly 'ignores' hydrogens; a second to reproduce the numbers in the original Wiener paper from 1947 91c58dd
• Renabled test which was (accidentally?) outcommented when switching to JUnit4 in commit 06a1a3dd ac82b12
• Added a reference to the original Wiener paper 0c5187e
• No need to declare throws for other Exception's if the superclass is already declared thrown itself (also, no need to define java.lang explicitly) e73a291
• Minor cleanup: corrected copyright statement; simplified JavaDoc by removing empty parameter table 6292875
• Replaced inline citation by reference to the main bibliography 8ddbb1b
• Implementation of a descriptor to measure molecular complexity in terms of sp3 to sp2 ratio of carbon atoms 6a6db0b
• Implementation of a descriptor to measure molecular complexity in terms of sp3 to sp2 ratio of carbon atoms bd9fb4f
• Removed output to STDOUT 18f7d52
• Fix for branching bracket issue when generating SMILES for BrC1C(Br)C(Br)C(Br)C(Br)C1Br 526362e
• Unit test for bug #3040273. 3bf8397
• Fixed hybridization information: these are sp3 hybridized systems e8cec0f
• More missing elements for SMILES parsing problems reported in bug #3048501 bcb8432
• Unit tests for SMILES parsing bugs reported in #3048501 bb5ffe1
• Upper case the first character to also properly recognize lower cased 'aromatic' two-character element symbols (fixes SMILES parsign of things like c1[se]ccccc1 3f6056b
• JavaDoc fixes: correct @cdk.cite use, and small typo d097acf
• Updated the JavaDoc for an API changed a while ago: the getInChIToStructure() method now takes an IChemObjectBuilder as second argument (fixes #3035890) 3e1aba0
• Updated the JavaDoc for the atoms() Iterable API change (fixes #3034824) 2a5bb6b
• Added the maven build file (closes #3042475) 6209a48
• a)removed blank/unused methods and fixed imports Signed-off-by: Syed Asad Rahman 911af3f
• a)IMatch to Match and constructor call for state used TargetProperties for faster processing Signed-off-by: Syed Asad Rahman 4a2f9c1
• a)Assert cleaned and fixed b)Correct ASSERT imports used a420b61
• Removed unused code cb4048e
• Updated checking of indices which now are -1 if unset, instead of null cd8ccee
• a) Default constructor supported as per changes in the CDK b) VFMCS index error should point to -1 not null Signed-off-by: Syed Asad Rahman 9c6c830
• a)Refactored matchers, atom matcher and bond matcher b)getTanimoto score fixed for single atoms 1dae4f2
• Updates for the getImplicitHydrogenCount() renaming 0f64e4d
• The smsd module now depends on the signature module 7d5fc9d
• Latest SMSD code 1.2.0: Major changes are: 7217e82
• Use the factory to not depend on an implementation e2c8329
• Renamed get/setHydrogenCount() to get/setImplicitHydrogenCount(), per report #3020065 8ec9ad4
• Introduced a helper class with info about the CDK library: the version number, which is read from the build.props which is now included in the cdk-core.jar 5a2bc65
• Also take into account super classes 0de5777
• Renamed MDLWriter into MDLV2000Writer (implements #3029447) f0256d0
• Assert pattern has the expected value as first argument d4045cc
• Use the builder pattern to instantiate an IIsotope 6760330
• MDLV2000 reader interprets D and T without M ISO line mandatory f6fd82b
• Use the PT class to see if something can be an element and removing the redundant element symbol info 937c4c7
• Fixed annotation with TestMethod, not TestClass (fixes #3016632) 9eb04e5
• Skip inner classes for @cdk.module and @cdk.githash JavaDoc tests (fixes #3043084) d070396
• Throw a CDKException when a QUADRUPLE bond order is in the input, which is not supported by the MDL/Symyx molfile format (fixes #3029352) 32cdc48
• Deal with a special situation: pyridine N-oxide in the non-charge-separated representation, with a N.sp2.3 nitrogen, with two double bonds. Previously, any ring-outward double bond would disqualify the ring as aromatic. This compound is now an exception. c66039d
• Assert the compound is aromatic (fixes false negative) d1da527
• Exporting the signatures jar 7959947
• Removed the OpenJavaDocCheck Jazzy extension which was not supposed to go in; it's not ready for prime time 9364179
• Added missing dependencies (fixes failing unit tests) e582852
• Added missing dependencies (fixes failing unit tests) ef14ef0
• Copy the 2D templates into the source distribution 5bcd111
• Added finally blocks to MolHandler to close its input streams, with proper logging calls if it fails Fixes bug #3032568 1201e01
• UIT timeout fix 5a329d4
• Cleaned up exception handling in TemplateHandler.loadTemplates() to be more specific and less pessimistic. e1df763
• QSAR descriptor values made serializeable 2ef459c
• Fixed assertion to compare two Vector3d's c133da5
• Added test to assert that two Vector3d's are identical 6651956
• All methods now tested 004770a
• Hooking up test machinery 32155b0
• Altered the getTetrahedralDescription to be get stereo and return ITetrahedralChirality.Stereo f61266c
• Moved Stereotool to stereo package a99c918
• Convenience method that returns chirality descriptor (R/S) given four atoms in priority order 7abbdcc
• Missing docs 9a55100
• Just to be paranoid, test +/- tetrahedra above and below the XY plane e32f334
• Tests for square planar shapes, trig bipyr, and oct 8b49490
• Cleanup and fixing (I think) of the trigonal bipyramidal method 1b01c84
• Test for 3 colinear points 01ae3fc
• Better docs bb38d10
• Tests for tetrahedral sign, some basic defined vectors 897b6ee
• Test class, and name change of distanceToPlane to show that it returns signed (+/-) distance 1735551
• Initial go at the StereoTool - functions taken from Jmol's smiles package for 3D stereo checking c25591f
• Updated OB note and robustified unit tests 2b96512
• Added support for multi-molecule mol2 files. Updated source and added unit test and test file dbd6df4
• Updated cdk-1.2.x patches for master API e383f4b
• Updated the DebugBond unit test too now: new DebugBond() has zero atoms 6ef1fb1
• Backport patch, to make the patches compile with cdk-1.2.x ce9b1bd
• Additional patch to reduce atom count on setAtom(null, int) and unit tests for the setAtom(IAtom, int) behavior. 6cf95db
• Also fix the new NNBond() == 0 atoms for the nonotify module 23d9685
• Fixed Bond() constructor to create a bond with zero atoms. Also fixed setAtom(IAtom, int) to increase the atom count if a null entry is filled with a non-null IAtom. 916ab96
• Updated test to assume new Bond() creates a bond with zero atoms 4ac7111
• Added two unit tests for aromatic N-oxides that are the basis for failures in SMARTS matching 12fb96b
• Exceptions when clone atomless ISingleElectron and ILonePair too c5d4cd3
• Unit test for ArrayIndexOutOfBoundsException occuring when trying to clone an IAtomContainer with an IBond with no IAtoms d71c31c
• Added unit tests for SMILES with failing atom typing, from email on the cdk-devel mailing list June 11 2010 890d0f5
• Added the N.oxide atom type, for structures like (CH3)N=O bb431e3
• Fixed reading of SD properties: keep the first line too 3133a18
• Fixed unit test: surely there is no atom with symbol 0... how long has this been failing?? c888e4f
• Added a test class to aromaticity of three compounds: the last incorrectly fails a454ab8
• Also except N.amide as part of an aromatic ring 3357113
• Added a test class to repeat atom type perception and test consistency 0bd5b42
• Unit test fix: the molecules *is* aromatic, as we should assume it is. Fixes a big goof up cd83236
• Replace special chars where spaces are supposed to occur, fixing the fail of the unit tests every now and then 483c856
• Merge branch 'master' of ssh://cdk.git.sourceforge.net/gitroot/cdk/cdk eb6defd
• BooleanIOSetting to let MDLWriter output bond type 4 (aromatic) a586d6a
• Reorderd imports 2896226
• Added test cases to check that the code runs when faced with covalently bonded metals. Currenly on failure is due to the presence of Pt, for which we do not have a valency in AtomValence f32e9a2
• chemfilter improvised Signed-off-by: Syed Asad Rahman 2a7ee54
• SMSD Test cases updated as per new code Signed-off-by: Syed Asad Rahman 8f3c70a
• SMSD Test cases updated as per new code Signed-off-by: Syed Asad Rahman 3266460
• deleted ext Signed-off-by: Syed Asad Rahman b44c82b
• updated SMSD code Signed-off-by: Syed Asad Rahman eccc70d
• updated SMSD code Signed-off-by: Syed Asad Rahman 370a926
• Convenience constructors for AtomSignature that take IAtoms rather than atom indices 935b5f6
• Added missing jar meta file for the Signatures library 805103a
• Updated to reflect the upstream jar name d33a71a
• Abstract methods not part of API are now protected, to discourage thier use 7a92d42
• Various fixes to tests 592b920
• Added the missing assert: runCoverageTest() returns a boolean stating the success 850fd1a
• cdk.cite references for two Faulon papers e7ccc4e
• More missing javadocs - notably the class documentation for AtomSignature and MoleculeSignature 8bd664f
• Mising constructor javadoc, and inheritdoc annotation on AtomSignature 88ffdd6
• Final missing test cf148dd
• Atom signature test methods 888bf60
• Molecule signature tests b64c4e5
• Signature quotient graph test method 0d0a95e
• Missing test for Orbit::iterator 2609f1b
• Removed unused import 020fb92
• Molecule From Signature Builder tests 9d2a0c0
• All orbit methods tested 7c46a7c
• Finally correct junit annotations, and sort test fba5ff5
• BeforeClass method must be static; better clone test 7e370a2
• Fixes some problems flagged by PMD cd3a866
• Use junit's BeforeClass instead of Before annotation 2566b87
• Inherit JavaDoc where methods overwrite a super method d4a9436
• Fixed PMD warnings: more descriptive field names; use of Integer.valueOf() 02020d6
• Added test method annotation for the clone() method cabd95f
• Removed some unnecessary imports 19b21a3
• Test orbit cloning 3c85ab2
• Improved canonical label method in graph signature 0bca11f
• Integer invariants, and updated signatures jarfile 5c838b6
• Hooked in PMD testing for the signature module 7384bc3
• Defined dependencies for the test module 132badc
• Hooked the signature module into the CDK build system 4ea7d48
• Removed the dependency on the nonotify module 93e0a22
• Created a CDK style module test suite a9a4aa8
• MoleculeFromSignatureBuilder tests d7bb6f6
• Extend from CDKTestCase and use the slow running test check 6969629
• Fixed and cleaned up molecule signature tests 3998ed7
• Initial commit of signature package ff2bdeb
• specify fail behavior (returns null) in javadoc bfd06db
• Added PMD tests for detecting misuse of TestClass and TestMethod, e.g. TestClass on a method. (fixes #3014808) c94ebcf
• Removed two lines added in master to the removed doccheck target 9d679fe
• Merge branch 'cdk-1.2.x' 055f702
• Improved javadoc generation using a link tag, so that references to java library classes are resolved properly 1f8bb2d
• Update OpenJavaDocCheck to 0.5: fixing a few false positives f0c7d0c
• Removed use of the proprietary DocCheck utility 1523f66
• OpenJavaDocCheck errors are fixed for SMSD related modules Signed-off-by: Syed Asad Rahman ac9384a
• solved cdk-Bugs-3006773 : small JavaDoc errors in the smsd module f66a779
• Fix for character spacing for "APO" line in RGFile output ed510bf
• Branch open for commits for the future 1.3.6 1ab64ff
• Use the new tests in more situations e8b13b2

## Sunday, September 12, 2010

### CDK 1.2.7: the changes, the authors, and the reviewers

CDK 1.2.7 is the latest of the bug fix releases in the 1.2 series. It brings a number of JavaDoc fixes, but, importantly, also bug fixes in SMILES handling and atom type perception. I am really pleased to see the application domain of various algorithms in the CDK continuously grow: SMILES parsing for some transition metals has been fixed, and the SMILES generation for some types of ring closures too. Additionally, an important bug was fixed in the atom type perception algorithm, which failed for custom atom types with formal charges. Everyone using the CDK 1.2 series is advised to upgrade to this version.

The changes
• Compare values not objects (fixes #3061263) 324f7f5
• Unit test to reproduce failing atom type perception with one of the options to create a -1 Integer object 9c1b95a
• Removed output to STDOUT 223fc9a
• Fix for branching bracket issue when generating SMILES for BrC1C(Br)C(Br)C(Br)C(Br)C1Br b9b2272
• Unit test for bug #3040273. 6d9b3d2
• Fixed hybridization information: these are sp3 hybridized systems a01de91
• More missing elements for SMILES parsing problems reported in bug #3048501 31f7462
• Unit tests for SMILES parsing bugs reported in #3048501 bf8defd
• A few more missing elements in the SMILES two-character element symbol parsing 5cf9334
• Added missing elements, fixing several problems reported in bug #3048501 6ab74bc
• Upper case the first character to also properly recognize lower cased 'aromatic' two-character element symbols (fixes SMILES parsign of things like c1[se]ccccc1 3ec1480
• JavaDoc fixes: correct @cdk.cite use, and small typo 394f9ed
• Updated the JavaDoc for an API changed a while ago: the getInChIToStructure() method now takes an IChemObjectBuilder as second argument (fixes #3035890) be56aac
• Updated the JavaDoc for the atoms() Iterable API change (fixes #3034824) 38873dc

The authors
The below numbers are based on the number of commits, but keep in mind that some developers, like myself, need more commits for the same number of changed lines.
13  Egon Willighagen
2  Saravanaraj

The reviewers
The below list is based on who signed off the patches. Anyone who reviews patches in the patch tracker can basically do this. Ask on cdk-devel on how to do this.
8  Rajarshi Guha
4  Gilleain Torrance
2  Egon Willighagen


## Friday, September 10, 2010

### Pulling out data as JSON from XHTML+RDFa

I am keen on RDFa and RDF in general; that should not be a surprise. RDFa is a serialization of RDF triples embedded in (X)HTML. I recently posted about chemical examples of XHTML+RDFa. Now, the reason for putting data in HTML as RDFa is that we can easily pull it out again, e.g. with this distiller. But the fun goes on, and we can actually also run SPARQL directly on it, for example with RDFaDev which I recently blogged about.

Now, consider that we have all these nice visualization tools written in JavaScript which can visualize data from JSON sources: the mashup then requires a JSON serialization of the data embedded in HTML pages. I have no experience with the cool JavaScript tools, and hope someone can help me out here, but the JSON bit I already got help with before on SemanticOverflow (thanx to Comment Bot!). The service mentioned no longer works, but there are plenty of alternatives.

Now, Peter is creating this nice data set about green solvents from patents, and it would be great if that data ended up online as RDFa, so that we can easily visualize the trends in solvent use over the years. But as I do not have this data as XHTML+RDFa yet, you will have to make do with another example: boiling points.

So, let's consider the data on this page, relating paraffin molecules to boiling points, and we'll take a complexity descriptor (w0, Wiener descriptor) and the boiling point (t0). So we get this SPARQL query:
PREFIX cc: <http://github.com/egonw/cheminformatics.classics/1/#>

SELECT * {
  ?mol cc:w0 ?w ;
       cc:p0 ?p .
}

Now, we want to run this query on the aforementioned page, so we add a FROM clause:
PREFIX cc: <http://github.com/egonw/cheminformatics.classics/1/#>

SELECT *
FROM <http://www.w3.org/2007/08/pyRdfa/extract?uri=http%3A%2F%2Fegonw.github.com%2Fcheminformatics.classics%2Fclassic1.html&format=pretty-xml&warnings=false&parser=lax&space-preserve=true>
{
  ?mol cc:w0 ?w ;
       cc:p0 ?p .
}
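
The long URL in the FROM clause is nothing more than the page address, percent-encoded into the `uri` parameter of the distiller. Here is a minimal Python sketch that rebuilds it from its parts. The helper name `distiller_url` is made up for illustration, and the parameter names and values are simply copied from the URL in the query, not taken from the distiller's documentation.

```python
from urllib.parse import urlencode

# Base address of the distiller, as it appears in the FROM clause.
DISTILLER = "http://www.w3.org/2007/08/pyRdfa/extract"


def distiller_url(page):
    """Return the distiller URL that extracts the RDFa of `page`.

    Hypothetical helper: the parameters mirror the URL used in the query.
    """
    params = [
        ("uri", page),
        ("format", "pretty-xml"),
        ("warnings", "false"),
        ("parser", "lax"),
        ("space-preserve", "true"),
    ]
    # urlencode percent-encodes the ':' and '/' of the page address.
    return DISTILLER + "?" + urlencode(params)


page = "http://egonw.github.com/cheminformatics.classics/classic1.html"
from_clause = "FROM <" + distiller_url(page) + ">"
print(from_clause)
```

Swapping in the address of another XHTML+RDFa page gives a FROM clause for that page, assuming the distiller accepts the same parameters for it.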

Notice the use of the distiller here. This way, with a service like that on sparql.org, we can get JSON returned. The result is a bit verbose, but that can perhaps be tuned:
{
  "head": {
    "vars": [ "w" , "p" ]
  } ,
  "results": {
    "bindings": [
      {
        "w": { "datatype": "http://www.w3.org/2001/XMLSchema#integer" , "type": "typed-literal" , "value": "56" } ,
        "p": { "datatype": "http://www.w3.org/2001/XMLSchema#integer" , "type": "typed-literal" , "value": "4" }
      } ,
      {
        "w": { "datatype": "http://www.w3.org/2001/XMLSchema#integer" , "type": "typed-literal" , "value": "35" } ,
        "p": { "datatype": "http://www.w3.org/2001/XMLSchema#integer" , "type": "typed-literal" , "value": "3" }
      }
    ]
  }
}
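
The verbosity is easy to strip away again. Below is a small Python sketch that flattens such a result into plain rows of integers. It assumes the standard SPARQL JSON results layout, in which `vars` sits inside a `head` object, and the function name `bindings_to_rows` is made up for illustration. The embedded sample is a trimmed copy of the two bindings shown above.

```python
import json

# Trimmed copy of the result shown above (two bindings only).
RESULT = """
{
  "head": { "vars": [ "w" , "p" ] } ,
  "results": {
    "bindings": [
      { "w": { "datatype": "http://www.w3.org/2001/XMLSchema#integer" ,
               "type": "typed-literal" , "value": "56" } ,
        "p": { "datatype": "http://www.w3.org/2001/XMLSchema#integer" ,
               "type": "typed-literal" , "value": "4" } } ,
      { "w": { "datatype": "http://www.w3.org/2001/XMLSchema#integer" ,
               "type": "typed-literal" , "value": "35" } ,
        "p": { "datatype": "http://www.w3.org/2001/XMLSchema#integer" ,
               "type": "typed-literal" , "value": "3" } }
    ]
  }
}
"""


def bindings_to_rows(result_text):
    """Return (variable names, list of integer tuples) for a SELECT result."""
    doc = json.loads(result_text)
    names = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Values arrive as strings; this query only returns xsd:integer.
        rows.append(tuple(int(binding[name]["value"]) for name in names))
    return names, rows


names, rows = bindings_to_rows(RESULT)
print(names)
print(rows)
```

The rows come back in the order of the `vars` list, so each tuple here is a (w, p) pair.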

The point is, I am sure at least one of my readers knows how to visualize the data in this JSON with, for example, Google Chart, particularly because all the mashing up is embedded in the just linked-to, though obscure, URL. And, if it helps, you can otherwise use the CSV or TSV output, which is even simpler (CSV):
w,p
56,4
286,9
35,3
220,8
20,2
84,5
10,1
165,7
120,6

The first one who can use one of the above URLs to extract the data from that XHTML+RDFa page and create a scatter plot in an HTML page with some JavaScript library wins a free mention in my blog! ;)

## Wednesday, September 08, 2010

### Second #acs_boston talk: teaching scientific communication

Below are the slides of my ACS CHED talk in Boston: Teaching Scientific Communication

## Sunday, September 05, 2010

### Handbook of Chemoinformatics Algorithms

I had not blogged about this yet because I was waiting for an actual copy to arrive, which I still have not received. But earlier this year the book "Handbook of Chemoinformatics Algorithms", edited by Jean-Loup Faulon and Andreas Bender, was released, for which I wrote a chapter on 3D molecular representation. Just wanted you to know.

The full list of chapters is:
• Representing 2D Chemical Structures with Molecular Graphs
• Algorithms to Store and Retrieve 2D Chemical Structures
• 3D Molecular Representations
• Molecular Descriptors
• Ligand- and Structure-Based Virtual Screening
• Predictive Quantitative Structure–Activity Relationships Modeling: Data Preparation and the General Modeling Workflow
• Predictive Quantitative Structure–Activity Relationships Modeling: Development and Validation of QSAR Models
• Structure Enumeration and Sampling
• Computer-Aided Molecular Design: Inverse Design
• Computer-Aided Molecular Design: De Novo Design
• Reaction Network Generation
• Open Source Chemoinformatics Software and Database Technologies
• Sequence Alignment Algorithms: Applications to Glycans, Trees, and Tree-Like Structures
• Machine Learning-Based Bioinformatics Algorithms: Application to Chemicals
• Using Systems Biology Techniques to Determine Metabolic Fluxes and Metabolite Pool Sizes

This makes it quite an interesting read, I think. Let's hope the publisher gets me my copy soon.

## Saturday, September 04, 2010

### Data duplication at Mendeley

Earlier this year I gave Mendeley a try, after having been a happy JabRef user, an unhappy Connotea user (the main problem was that any URI can be bookmarked, not just papers, so it is very noisy), and a happy CiteULike user (which I still am). But the client did not bring me what I needed, and I canceled my account again.

Since then, Mendeley has undergone a transformation, and there is talk about OpenSourcing the client (or not), Open Data, and an Open Standard API. But, importantly, I no longer need the client and can do everything in the browser.

Moreover, Mendeley has momentum and is starting to provide interesting apps around the API, such as readermeter.org. And since being a scientist means playing the publishing game, one simply must add one's papers to these systems, just to advertise them:

This brings us to problem #1: author identity, which is a general problem and addressed by projects like ORCID. So, besides the page shown above, I have a second page under an entry with just my first name.

But, as the title of the post suggests, Mendeley suffers from a second problem, which was recently brought up by Duncan in his How many unique papers are there in Mendeley? post. Mendeley, apparently, claims 36M papers, but the number of unique papers is much smaller, as outlined in detail by Duncan. Mr. Gunn replied that "[d]uplicates are understandably enriched among the popular papers, such as yours, and it’s harder to go from 6 duplicates to 1 canonical document than from 2 to one, because the variability is higher" (see this comment), but I do not buy that.

I replied in the blog about that claim and also made a suggestion: this dereplication should really be a crowd-sourcing event. But I found it impossible to find a place to report duplication, so I had to use the message-to-support form with the uninformative category Other. If I were working at Mendeley, I would make this reporting a key technology behind their dereplication efforts.

Anyway, the duplication goes deep, very deep into the long tail. My papers are fairly well received in general (many of my papers in BMC journals are 'Highly Accessed'; I did request some distinction there, using the StackOverflow gold, silver, bronze system), but they are not comparable to the highly bookmarked papers in Mendeley. I know this is probably not something Mendeley likes to hear, but a majority of my papers show duplicates. A semi-exhaustive scan showed me duplication for the XMPP paper (here and here), the Blue Obelisk paper (here, here, and here; yes, three copies), the CDK-Taverna paper (here and here), the Bioclipse 2 paper (here and here), the userscripts paper (here and here), the CDK I paper (here and here), and the CDK II paper (here and here).

Hopefully, by the time you read this post, at least some of the above links no longer work. In that respect, I would also like to request URIs based on the DOI instead.

### BioMed Central is going to support Open Data

I already had a glance at the plans at the ACS meeting in Boston, but this week BioMed Central announced a draft call for Open Data in their journals:
> The decision to mandate data deposition as a condition of publication is another decision best made by the scientific community concerned rather than a single journal or publisher as, for example, has been established in the microarray and evolutionary biology communities [19]. We will, therefore, support data publication when it is mandated, but will also enable, encourage and recognize [20] data sharing and publication on a voluntary basis for scientists wishing to show leadership in their field.

Now, as the journal already allows reuse of papers (CC-BY license), this also applies to data (and in at least several countries data cannot be copyrighted at all, but we need a world-wide solution; it's the 21st century). However, earlier this year the Panton Principles were introduced, which formalize the idea behind public domain waivers and suggest the CC0 waiver as one valid approach. This is where BioMed Central wants to go too; they write:
> All research articles published in BioMed Central journals are published under the Creative Commons attribution licence [22] (CC-BY), with which authors retain the copyright to their work. This licence allows unrestricted distribution and re-use provided that the original article is cited. We support the Panton Principles for open data in science [23] and open data should therefore mean that it is freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. We encourage the use of fully open file formats wherever possible.
>
> ...
>
> Therefore, to eliminate potential legal impediments to integration and re-use of data, specifically, and to help enable long-term interoperability of data we believe an appropriate licence or waiver specific to data should be applied, and made explicit by the authors and publishers. There are a number of conformant licences [25] for open data, of which Creative Commons CC0 [26] is widely recognised. Under CC0, authors waive all of his or her rights to the work worldwide under copyright law and all related or neighboring legal rights he or she had in the work, to the extent allowable by law.

The text quoted above is extracted from the draft, and your comments are most welcome. You can leave them as a comment here, which I strongly encourage you to do, even if just to support the idea (see McPrinciple3).

The draft also touches on the issue of Open Standards, but I feel this problem will resolve itself. More interestingly, it is now time for the journal editors to make a move, and let the community know if they will require these open data waivers for their journal. For example, cheminformatics as a field would benefit very much if the Journal of Cheminformatics would make this move. But at the same time, I fully understand that a young journal may not yet be in the position to do just that.