cheminfo

=Automation Components in UsefulChem=

Key Links
[|Molecules Blog] (SMILES manual input) Automated Info (MW, InChI, eMolecules search): Docking/Molecular Modeling Pending Items
 * [|RSS Feed]
 * [|Drop-Down Page]
 * [|Spreadsheet Summary] (includes Excel format)

This page describes the evolution of software tools that process the usefulchem-molecules blog into a variety of useful formats, e.g., spreadsheets, RSS feeds, and CML for molecular visualization/manipulation tools such as Jmol, and that add further chemical information (InChIs, MWs, supplier info) for the molecules in the UsefulChem project. I will also discuss the ongoing development of an automated RSS feed reader for extracting and further processing this chemical information, and potential future work in these areas. For more information on this work, and to follow new developments, please refer to my blog entries at http://usefulchem.blogspot.com.

Initial work with Excel / Excel VBA:
Molecule entries in http://usefulchem-molecules.blogspot.com are characterized primarily by a UC number (e.g., UC0188), a SMILES notation, and an image, although other information, such as CAS number, is often added. To summarize and expand on this data in a convenient format, a program in Microsoft Excel Visual Basic for Applications (VBA) ([|MoleculeBlogInfo.zip]) was developed which downloads this page, parses out the desired information, and generates a spreadsheet ([|usefulchem-molecules.xls]) in which each row represents one blog entry. Given that the blog format itself is rather loose - for example, the SMILES entry might be prefixed by "SMILES" or "SMILES:" - and can change over time, the search criteria for fields were made fully configurable by placing them in an initialization (.ini) file.
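To illustrate the idea of configurable search criteria, here is a minimal Java sketch (not the original VBA code). It assumes the criteria are stored as `field=regex` lines, each regex with one capture group; the class name `FieldExtractor`, the field names, and the patterns themselves are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Optional;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls named fields out of a blog entry using
// patterns read from configuration text instead of hard-coded strings.
class FieldExtractor {
    private final Properties patterns = new Properties();

    // iniText holds "fieldName=regex" lines; each regex needs one capture group.
    FieldExtractor(String iniText) {
        try {
            patterns.load(new StringReader(iniText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the first captured value for the field, or empty if the
    // field is not configured or the pattern does not match.
    Optional<String> extract(String field, String entryText) {
        String regex = patterns.getProperty(field);
        if (regex == null) {
            return Optional.empty();
        }
        Matcher m = Pattern.compile(regex).matcher(entryText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

With a pattern such as `smiles=SMILES:?[ ]*([^ <]+)`, both "SMILES: CCO" and "SMILES CCO" yield the same value, and a change in the blog's conventions only requires editing the configuration text.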

Additional information beyond that provided by the blog, such as links to suppliers, was desired, and for this purpose several freely available software packages and libraries were used. Molecular weight information and molecular format files (CML, MOL) were generated from the SMILES using the CDK Java libraries, while InChI descriptors were produced by OpenBabel. Image files were at first generated using ChemSketch, although these are now simply downloaded directly from the blog itself. Supplier information was acquired by sending HTTP GET requests to chmoogle.com (now eMolecules.com) and processing the responses returned by this service.
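One practical detail when sending a SMILES string in an HTTP GET request is that characters such as `#`, `=`, `+` and brackets must be URL-encoded. The sketch below shows this with the standard Java HTTP client (Java 11 or later). The base URL and the `smiles` parameter name are placeholders, not the actual eMolecules query format, and `SupplierQuery` is a hypothetical class name.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for querying a structure search service by SMILES.
class SupplierQuery {

    // Builds the request URI; the "smiles" parameter name is an assumption.
    static URI build(String baseUrl, String smiles) {
        String encoded = URLEncoder.encode(smiles, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "?smiles=" + encoded);
    }

    // Sends the GET request and returns the response body as text.
    static String fetch(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

Without the encoding step, a triple bond written as `#` would be interpreted as the start of a URL fragment and the rest of the SMILES would never reach the server.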

In addition to the spreadsheet, this software also creates HTML and CML files for each blog entry (e.g., [|UC0088]), which in combination allow the molecules in the blog to be viewed with the Jmol applet.

RSS feeds and Automation Software in Java:
The spreadsheet format for the usefulchem-molecules blog was a useful beginning. It was, however, not very amenable to automated data processing or to other kinds of display, particularly on the web. An initial attempt to address these deficiencies involved modifying the Excel VBA software to generate an RSS 1.0 feed ([|usefulchem-molecules.rss]) of the blog data in addition to its other output. The advantage of having the data in a feed is that it can then be viewed using any number of available desktop or web-based readers, such as RSS Bandit (http://www.rssbandit.org) or Bloglines (http://www.bloglines.com). Furthermore, as RSS is simply XML, feeds can contain other XML-formatted data, such as Chemical Markup Language (CML). Thus, a feed can be downloaded and parsed for its CML by software such as Bioclipse (http://www.bioclipse.net) or Jmol (http://jmol.sourceforge.net).
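As a rough illustration of parsing a feed for its CML, the following sketch uses the namespace-aware DOM parser from the Java standard library. It assumes each RSS 1.0 item carries a CML molecule as a child element in the CML namespace; how the usefulchem-molecules feed actually embeds its CML may differ, and `CmlFeedParser` is a hypothetical name.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader for RSS 1.0 feeds whose items embed CML.
class CmlFeedParser {
    static final String RSS = "http://purl.org/rss/1.0/";
    static final String CML = "http://www.xml-cml.org/schema";

    // Maps each item title to the number of CML atom elements inside the item.
    static Map<String, Integer> atomCounts(String feedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to tell RSS and CML elements apart
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(feedXml)));
            Map<String, Integer> counts = new LinkedHashMap<>();
            NodeList items = doc.getElementsByTagNameNS(RSS, "item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                String title = item.getElementsByTagNameNS(RSS, "title")
                        .item(0).getTextContent();
                int atoms = item.getElementsByTagNameNS(CML, "atom").getLength();
                counts.put(title, atoms);
            }
            return counts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }
}
```

Because the lookup is by namespace rather than by prefix, the same code works whether the feed writes `cml:atom` or declares CML as a default namespace on the molecule element.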

A shortcoming of using Excel VBA is that it does not easily lend itself to automation. Also, it is neither truly an open source development platform nor portable to other operating systems such as Unix or Macintosh. Therefore, to address these shortcomings, I rewrote the VBA code in the Java programming language, which is both free (see http://java.sun.com/javase/downloads/index.jsp to download the Java Development Kit) and implemented on all major operating systems. Once in Java, it was straightforward to set the software up as a service to be run periodically. As a result, the RSS feed and associated files are now regenerated automatically whenever additions or changes are made to the usefulchem-molecules blog.
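One way such a periodic service could be arranged is sketched below, using `ScheduledExecutorService` from the Java standard library and a content digest to decide whether anything changed. This is an assumption about a possible design, not a description of the actual service; `BlogWatcher`, `fetchPage`, and `regenerate` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical poller: regenerates output only when the page content changes.
class BlogWatcher {
    private byte[] lastDigest = new byte[0];

    // Returns true when pageContent differs from the content seen on the previous call.
    synchronized boolean changed(String pageContent) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(pageContent.getBytes(StandardCharsets.UTF_8));
            boolean differs = !Arrays.equals(digest, lastDigest);
            lastDigest = digest;
            return differs;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
    }

    // Polls fetchPage at a fixed delay and runs regenerate only after a change.
    ScheduledExecutorService start(Supplier<String> fetchPage, Runnable regenerate,
                                   long intervalMinutes) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                if (changed(fetchPage.get())) {
                    regenerate.run();
                }
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so log and continue.
                System.err.println("Update failed: " + e);
            }
        }, 0, intervalMinutes, TimeUnit.MINUTES);
        return scheduler;
    }
}
```

The caller keeps the returned scheduler and calls `shutdown()` on it to stop polling.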

A zip file containing both the source and compiled code for the Java software to convert the usefulchem-molecules blog to an RSS feed can be found at [|MoleculeBlogInfo.zip].

CMLRSSReader:
Having an RSS feed with special fields provides a launching platform of essentially unlimited opportunities for further treatment of chemical information. Standard RSS readers, however, rarely display more than a few of the standard fields in a feed. Furthermore, they cannot be extended or configured to run additional processing, via plug-ins or "hook" programs, on a feed, its entries, or the various specialized fields it can contain. Thus, a specialized reader seemed necessary.

Writing a simple feed reader is actually not a particularly difficult software project, and there is a lot of help available in books and web sites (I used "RSS and Atom Programming" from Wrox books (Wrox.com) as a guide for all my RSS programming). I have developed such a reader, again using Java, which begins to address some of our specialized requirements for feeds containing CML and other chemical information. This reader and associated software, which can be downloaded from [|CMLRSSReader.zip], is admittedly still at an early stage of development and can currently handle only RSS 1.0 feeds (and so far has only been tested on the usefulchem-molecules and two other closely related feeds), but it demonstrates some of what can be done along the lines described above. In addition to the standard reader features of automatically downloading and managing multiple feeds, displaying the information contained in their item entries, and tracking new or changed items, the software also allows specialized programs to be executed on the feeds themselves and their contents. In its current form, programs can be configured to run after feed file download and/or processing. These programs can be written in any language, even DOS BAT files (although Java must be used on processed feeds, as they are stored via Java serialization), and can perform any processing/reporting desired, such as calculations using the CML in the feed, internet searches, database entry, and/or e-mailing results to the interested parties.
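A minimal sketch of how a reader might launch a configured "hook" program after a feed download is given below, using `ProcessBuilder` from the Java standard library. The convention of passing the feed file path as the last argument is an assumption made for illustration, not necessarily how CMLRSSReader invokes its programs, and `FeedHook` is a hypothetical name.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical launcher for external programs configured per feed.
class FeedHook {

    // Builds the full command line: the configured program and arguments,
    // followed by the path of the feed file (assumed convention).
    static List<String> command(List<String> configured, File feedFile) {
        List<String> cmd = new ArrayList<>(configured);
        cmd.add(feedFile.getPath());
        return cmd;
    }

    // Runs the hook, forwarding its output to the reader's console,
    // and returns the program's exit code.
    static int run(List<String> configured, File feedFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(configured, feedFile))
                .inheritIO()
                .start();
        return process.waitFor();
    }
}
```

Because the hook is an ordinary operating system process, the configured program can be a Java class, a script, or a batch file, which matches the language independence described above.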

Two examples of this capability are already being used to automatically generate and upload information for display on the web. One, ExtractHTMLPages, is a Java program that parses the usefulchem-molecules feed file for its item fields and generates an HTML file for each item. ExtractHTMLPages also generates an index file ([|UsefulChemistryMolecules]) of the item HTML files which, using a combination of JavaScript and HTML iframes, allows any of them to be selected for viewing from a drop-down list. When CMLRSSReader downloads a feed, which it does whenever the feed has been updated (which in the case of usefulchem-molecules, occurs whenever the blog is updated), it automatically runs ExtractHTMLPages, generating and uploading all of these files to the web server.
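The drop-down index can be pictured with the following sketch, which writes a `select` element whose `onchange` handler points an iframe at the chosen item page. It assumes item pages are named after their UC numbers (e.g., UC0088.html); the actual file naming and markup produced by ExtractHTMLPages may differ, and `IndexPage` is a hypothetical name.

```java
import java.util.List;

// Hypothetical generator for an index page with a drop-down list and an iframe.
class IndexPage {

    // Renders the index HTML; UC numbers are plain alphanumerics,
    // so no HTML escaping is applied in this sketch.
    static String render(List<String> ucNumbers) {
        StringBuilder html = new StringBuilder();
        html.append("<html><body>\n");
        // Selecting an option loads the matching item page into the iframe.
        html.append("<select onchange=\"document.getElementById('viewer').src = this.value\">\n");
        for (String uc : ucNumbers) {
            html.append("  <option value=\"").append(uc).append(".html\">")
                .append(uc).append("</option>\n");
        }
        html.append("</select>\n");
        String first = ucNumbers.isEmpty() ? "about:blank" : ucNumbers.get(0) + ".html";
        html.append("<iframe id=\"viewer\" src=\"").append(first)
            .append("\" width=\"100%\" height=\"600\"></iframe>\n");
        html.append("</body></html>\n");
        return html.toString();
    }
}
```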

The other example, ExtractNewItems, is a Java program which works with processed feeds to record and detail changes to the feed. When new items are added to the usefulchem-molecules feed, or new information about an item is added or modified, ExtractNewItems generates and uploads two files: newItems.html ([|newItems]) and newItems.xls. True to their names, these files list items that have been added or updated since the last time the program was run. Ultimately, the reason for a new listing will also be given, such as new supplier information, but this is not currently implemented.
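The core comparison behind a "new or changed items" listing can be sketched as a difference between two snapshots of the feed. The example below assumes each snapshot is a map from item identifier (for instance the UC number) to the item's content; this representation and the name `NewItemFinder` are hypothetical and do not reflect the serialized format the reader actually uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical comparison of two feed snapshots.
class NewItemFinder {

    // Returns the sorted identifiers of items that are absent from the
    // previous snapshot or whose content differs from it.
    static List<String> newOrChanged(Map<String, String> previous,
                                     Map<String, String> current) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String old = previous.get(entry.getKey());
            if (old == null || !old.equals(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

Recording why an item is listed (new entry versus changed field) would require comparing individual fields rather than whole item contents.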

Future Directions:
Quite a bit of ground has been covered, and a lot of evolution has occurred, since the initial work with Excel VBA. A certain amount of consolidation and strategic consideration would seem worthwhile at this point. To begin, the numerous web sites and pages generated would benefit from some organization. This can be done with a single page, or a small set of pages, providing links to and descriptions of the various software tools and the pages they generate.

Second, although I have tried to make the CML RSS reader software highly flexible, it needs to be tested for compatibility with other RSS 1.0 feeds containing CML if it is to become of general use to the scientific community. Additional development is almost certainly going to be needed here (no one should expect to be that lucky!). I am also eager to see how the reader might interact with other software, such as Bioclipse, for example in providing CML and other data in automated fashion. This should prove fruitful, as Bioclipse obviously provides so much more in the way of processing and visualization tools than the reader itself. Other enhancements include a replacement for Java's JEditorPane for displaying item data (JEditorPane's handling of HTML is fairly primitive), other improvements to the user interface, and more configurable program extensions and/or plug-ins.

Finally, a lot of technologies have yet to be explored in this area. One excellent candidate is the combination of Ajax in HTML pages with chemical information web services. Ajax provides the ability to dynamically query web sites and services without the overhead in time and resources of retransmitting/reloading entire pages. In conjunction with JavaScript events and dynamic HTML, this can essentially turn an ordinary browser into a full-featured software user interface. Ajax also appears quite easy to use. For some simple examples of what can be done with Ajax, see [|UsefulChemistryMolecules] and [|UsefulChemistryMolecules2] (simply hover over any of the UC numbers).
