Saxon SQL Extension: Importing XML into a Relational Database 1

Posted by bob Tue, 31 Jul 2007 14:37:00 GMT

I finally started making time for another project that Mr. Parker and I have been discussing now for well over a year. Currently we're calling it ossmosis, and the purposefully vague description of it is a contextual resource for open source software - targeting both developers and less technical project manager types.

Research has led me to discover a really useful project called FLOSSmole, which is described as a "collaborative collection and analysis of free/libre/open source project data". I was originally planning on developing some sort of a crawler to retrieve this type of data myself, but luckily came across FLOSSmole before writing a single line of code.

Part of the FLOSSmole data is all freshmeat.net projects and their associated "troves", or facets. This is what I'm currently most interested in. However, the actual trove hierarchy (facet names) is currently not available in FLOSSmole, but hosted by freshmeat.net here (this is a large file so think twice before telling your browser to view it!)

On to the specific topic of this article - importing an XML document (the trove hierarchy) into a relational database (MySQL). There are a 1001 different ways to do this. You could probably do it in three lines of Ruby code. Maybe. I chose the approach of using XSLT and SQL extension functions available within Saxon. Why? Because all that is needed is a single template matching on <descriminator> nodes, which then fires <sql:insert> calls. A script written in (Perl, Ruby, Python, etc.) would have to establish the database connection, parse the XML tree and find the <descriminator> nodes, extract the values of the child elements under <descriminator>, construct SQL insert statements using those values, and then interact with the database API to do the actual inserts. SAX is another option but the SQL inserts would still have to be constructed, as well as interaction with a database API.


Here's the stuff:

ImportFMTroveDefs.xslt (includes SQL to create target table)

Saxon jars available below:
saxon8.jar   saxon8-sql.jar

You will also need a JDBC driver. The best one for MySQL is here.

Example command line java call to run this:

java -Xmx84M -cp saxon8.jar:saxon8-sql.jar:mysql-connector-java-5.0.7-bin.jar net.sf.saxon.Transform fm-trove.rdf ImportFMTroveDefs.xslt driver="com.mysql.jdbc.Driver"
database="jdbc:mysql://localhost/ossmosis" user="ossmosis" password="ossmosis" datasourceid="81"

The JVM memory argument was necessary to boost the default due to the size of the 'fm-trove.rdf' file. Also - watch out for empty string database user passwords - passing a parameter to Saxon from the command line can be funky for empty strings.

Finally, here is the dump of the data once imported into MySQL.

Generating RSS in Java Web Frameworks 1

Posted by bob Sun, 04 Mar 2007 00:41:00 GMT

Sportsvite, a web start-up focused on 1) connecting recreational sports enthusiasts and 2) facilitating communication and scheduling of sports activities (e.g team and league games), is expanding the role of RSS across the site. The most recent addition is a feed of the Sportsvite classifieds section. For example, if I wanted a list of all of the soccer teams looking for more players in my area by zip code, the relevant url would be: http://www.sportsvite.com/xml/rss/listings?type=1&zip=20009&sportId=14

The task of generating and serving up RSS first involves initiating the appropriate query to the persistence layer from a given HTTP GET request. The query string parameters in the example above are defining the "type" of classified listing (teams looking for players), the zip code, and the associated sport identifier with the team. Sportsvite is built with the popular Java framework Struts and simply requires an Action class mapped to the endpoint url.

Hibernate is the object-relational mapping and persistence layer of choice, so the query for soccer teams looking for players in northwest Washington DC will return a set of POJOs derived from the data model for Sportsvite classifieds. For each object in the result set list, an instance of a class representing an individual RSS "item" is instantiated. This RSS object is then populated with "get" method calls on the object from the result set list. The RSS class is very basic and defined as:

public class RSS {
	public RSS() {}
	private String title;
	private String link;
	private String guid;
	private String pubDate;
	private String description;
	// (getter and setter methods...)
As an example, populating the pubDate involves creating the RFC822 date format as per the RSS specification. Using java.text.SimpleDateFormat, an RFC822 date string can be formatted with:
SimpleDateFormat RFC822DATEFORMAT = new SimpleDateFormat
("EEE', 'dd' 'MMM' 'yyyy' 'HH:mm:ss' 'Z",Locale.US);
The RSS object would then be populated with:
Date createdOn = (Date)listing.getCreatedOn();
rssItem.setPubDate(RFC822DATEFORMAT.format(createdOn));
...where 'listing' is the name of the result set object and 'rssItem' is the instance of the RSS class.

As RSS objects are created and populated, they are collected in a list. The next step is to convert this list to XML using the ever powerful Castor marshalling tool. Castorization will produce an XML representation of the list data structure containing the RSS objects, which will then be transformed to the actual RSS feed. Here is how Castor is called:

private Document marshallRSSXML(ArrayList RSSList) {
     Document doc = buildDocument();
     try {
	  Marshaller.marshal(RSSList, (Node)doc);
	} catch (MarshalException e) {
		e.printStackTrace();
	} catch (ValidationException e) {
		e.printStackTrace();
	}			
	return doc;	 
}
The Document object is of type org.w3c.dom, and the 'buildDocument' method simply creates an empty Document (using javax.xml.parsers.DocumentBuilderFactory) to accept the result of the 'marshal' call.

The last step involves passing this XML Document in memory to a separate servlet of the web application that functions as an XSLT serialization service. In this case, it is serialization in the sense of sending the output of the XSLT engine over HTTP to the web client. The XSLT servlet is very straightforward and the key lines of code are below:

 DOMSource source = new DOMSource(doc);
 transformer.transform( source, 
       new StreamResult(response.getOutputStream()));
The javax.xml.transform.dom.DOMSource instance is created using the generated Document object available in memory (stored in the session by the forwarding Action class). The javax.xml.transform.Transformer instance is grabbed from a pre-compiled cache of XSLT modules, and using the 'getOutputStream' method on the javax.servlet.http.HttpServletResponse of the main 'service' method of the servlet class allows us to send the transformation output to the requesting web client.

The XSLT is not even worth including as it just matches on '/array-list' (the root of the Castor generated XML document), creates the top level RSS 'channel' elements from passed in parameters, and applies-templates on <RSS> elements to create <item> elements.

Removing Extraneous Namespace Declarations - By Force

Posted by bob Fri, 20 Oct 2006 01:35:00 GMT

In developing XML->XML transformations (XSLT 1.x), I've frequently experienced unnecessary ancestor namespace declarations propagating to child elements of the output tree (in Saxon, and Xalan too, I believe). The ultimate consumer of the resultant XML document usually wouldn't care; I just viewed it as annoying behavior...highly annoying, but not a show stopping problem.

The exclude-result-prefixes attribute placed at the root of a stylesheet never seems to perform as advertised, that is - exclude the namespace declarations for the following space separated list of prefixes (the value of the attribute). It is working as it should...but only for literal result elements. Literal result elements, by the way, are elements explicitly placed in the XSLT module to be written to the output (e.g using <myelement> vs. <xsl:element>). There is a decent example here, with a nice comment from Michael Kay.

So, when using <xsl:copy> and <xsl:copy-of>, the processor is going to output any ancestor namespace declarations of the input document it feels it should copy (more detailed explanation of why this is the case is out there, but I'm not going to touch for now).

For various reasons, let's say you really, really wanted to prevent the extraneous namespace declarations when transforming a particular XML document.

A template matching node() and two separate templates matching attributes and text() function as a modified version of the traditional identity template. The node() template creates entirely new elements using <xsl:element>, with an explicit namespace setting of the context node's namespace URI. Thus, the newly created element will be free of any propagated namespace declarations from ancestors.

The template matching/copying only attributes is separate, as the behavior/output of the node() template should not be triggered for attributes.

The template matching text() is separate for the same reason, and requires a priority to override the selection between node() and text() templates when a text node is encountered.

<xsl:template match="node()"> 
   <xsl:element name="{local-name()}" 
           namespace="{namespace-uri()}">
      <xsl:apply-templates select="@*|node()"/>
   </xsl:element>
</xsl:template>

<xsl:template match="@*">
   <xsl:copy/>
</xsl:template>

<xsl:template match="text()" priority="1">
   <xsl:value-of select="."/>
</xsl:template>

On a slightly related note, here is a classic Michael Kay mailing list response.

XSLT 1.0 string reversal & tail recursion

Posted by bob Tue, 29 Aug 2006 16:29:00 GMT

The need or desire to reverse a string in XSLT (1.0) might be rare, but I have found it useful in the past to manipulate or deal with characters at the end. However, I will admit that I could not immediately determine the exact purpose for its use looking at some older code I had written - having the input XML would have helped.

Here was my first attempt at writing a named template to do reversal:

<xsl:template name="reverseString">
 <xsl:param name="inputStr"/>
 <xsl:variable name="strLength"
       select="string-length($inputStr)"/>
 <xsl:choose>
   <xsl:when test="$strLength &lt; 2">
     <xsl:value-of select="$inputStr"/>
   </xsl:when>
   <xsl:when test="$strLength = 2">
     <xsl:value-of select="substring($inputStr,2,1)"/>
     <xsl:value-of select="substring($inputStr,1,1)"/>
   </xsl:when>
   <xsl:otherwise>
     <xsl:value-of select="substring
       ($inputStr,$strLength,1)"/>
     <xsl:call-template name="reverseString">
       <xsl:with-param name="inputStr"
         select="substring($inputStr,1,$strLength - 1)"/>
     </xsl:call-template>
   </xsl:otherwise>
 </xsl:choose>
</xsl:template>

I was curious how XSLT 2.0 would accomplish this, and consulted my copy of O'Reilly's XSLT Cookbook, 2nd Edition by Sal Mangano (the solution, by the way, is only three lines, using codepoints-to-string and string-to-codepoints).

Included in the book is a section on string reversal - with a close to identical recursive template example to the one above that is deemed "an ineffcient tail recursive implementation". This led to me look up tail recursion as the term was not totally familiar to me. Tail recursion involves a function calling itself as the very last step in the function - allowing the existing underlying stack frame to be used instead of creating a new one (the recursion is handled by iteration). I believe most major XSLT engines are optimized for tail recursion - Saxon (my favorite) definitely is.

Doing the recursive, self call at the end of a named template seems natural and logical to me anyway - without knowing the processor is optimized for it.

Back to my inefficient reversal function - it's inefficient because each recursive call is only repositioning a single character. I failed to heed CS 101 (or maybe 201) basics of dividing and conquering, attempting to cut work in half.

Sal's most efficient solution to XSLT 1.0 string reversal is fairly similar to the one above, but with two tail recursive calls, each one reversing one-half of the input string.