Saxon SQL Extension: Importing XML into a Relational Database 1
I finally started making time for another project that Mr. Parker and I have been discussing now for well over a year. Currently we're calling it ossmosis, and the purposefully vague description of it is a contextual resource for open source software - targeting both developers and less technical project manager types.
Research has led me to discover a really useful project called FLOSSmole, which is described as a "collaborative collection and analysis of free/libre/open source project data". I was originally planning on developing some sort of a crawler to retrieve this type of data myself, but luckily came across FLOSSmole before writing a single line of code.
Part of the FLOSSmole data is all freshmeat.net projects and their associated "troves", or facets. This is what I'm currently most interested in. However, the actual trove hierarchy (facet names) is currently not available in FLOSSmole, but hosted by freshmeat.net here (this is a large file so think twice before telling your browser to view it!)
On to the specific topic of this article - importing an XML document (the trove hierarchy) into a relational database (MySQL). There are a 1001 different ways to do this. You could probably do it in three lines of Ruby code. Maybe. I chose the approach of using XSLT and SQL extension functions available within Saxon. Why? Because all that is needed is a single template matching on <descriminator> nodes, which then fires <sql:insert> calls. A script written in (Perl, Ruby, Python, etc.) would have to establish the database connection, parse the XML tree and find the <descriminator> nodes, extract the values of the child elements under <descriminator>, construct SQL insert statements using those values, and then interact with the database API to do the actual inserts. SAX is another option but the SQL inserts would still have to be constructed, as well as interaction with a database API.
Here's the stuff:
ImportFMTroveDefs.xslt (includes SQL to create target table)
Saxon jars available below:
saxon8.jar saxon8-sql.jar
You will also need a JDBC driver. The best one for MySQL is here.
Example command line java call to run this:
java -Xmx84M -cp saxon8.jar:saxon8-sql.jar:mysql-connector-java-5.0.7-bin.jar net.sf.saxon.Transform fm-trove.rdf ImportFMTroveDefs.xslt driver="com.mysql.jdbc.Driver"
database="jdbc:mysql://localhost/ossmosis" user="ossmosis" password="ossmosis" datasourceid="81"
The JVM memory argument was necessary to boost the default due to the size of the 'fm-trove.rdf' file.
Also - watch out for empty string database user passwords - passing a parameter to Saxon from the command line can be funky for empty strings.
Finally, here is the dump of the data once imported into MySQL.
I'm working on a project similar to ossmosis. I'm creating DOAP records for every project release listed on freshmeat and several other software repositories (CPAN, Ruby Forge etc.)
I ran across your site while doing research before I started crawling software repositories and thought I'd say 'hello'.
Rob
http://doapspace.gentooexperimental.org/