Pursuing an RDF Epiphany

Posted by bob Thu, 13 Dec 2007 03:24:27 GMT

I've worked with XML and related web technologies for a while now, and I've struggled to fully grasp or otherwise grok the Resource Description Framework. It seems to have an unfortunate reputation for being too complicated, esoteric, or impractical. At first glance, it can appear to be a solution looking for a problem.

The main question for me has always been why one would use RDF (or RDF/XML) instead of a plain, non-RDF XML vocabulary for data transfer, sharing, integration, etc. I'll attempt to answer that question...primarily for my own edification...and then briefly discuss a real-world example, hopefully dislodging the lingering RDF monkey from my back.

Flexibility: RDF is the Model
The plain XML vs. RDF question was posed by Leigh Dodds a while ago, with a few members of the RDF community responding. One of the most concise answers was from Shelley Powers: RDF is based on a domain-neutral model that allows one set of statements to be merged with another set of statements, even though the information contained in each set of statements may differ dramatically. Plain XML is hierarchical and only needs to be well-formed (and hopefully valid against a schema); extracting any meaning from the document depends on some shared, explicit understanding between consumers and producers of the XML. In contrast, RDF is composed of simple statements (subject-predicate-object triples) which can be consumed immediately without having to worry about structure or order (i.e., elements, child nodes, attributes, etc.).

RDF is the model. The processing of triples is highly predictable and static, reducing the effort involved when things change. Plain XML has an ever-varying model depending on the vocabulary - only its syntax remains the same. What happens if a plain XML schema changes, structurally and/or semantically? In an environment of distributed data, with multiple parties owning or generating that data, the time and effort required to accommodate modifications could be quite high.
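
To make the "RDF is the model" point a little more concrete, here is a minimal Python sketch that treats statements as bare (subject, predicate, object) tuples. The URIs and property names are made up for illustration, and a real application would use an RDF library rather than plain tuples.

```python
# Each statement is a (subject, predicate, object) triple.
# Every URI below is a hypothetical example, not a real vocabulary.
Triple = tuple[str, str, str]

WIDGET = "http://example.org/projects/widget"

source_a: set[Triple] = {
    (WIDGET, "http://example.org/vocab#name", "Widget"),
    (WIDGET, "http://example.org/vocab#language", "Java"),
}

source_b: set[Triple] = {
    (WIDGET, "http://example.org/other#license", "GPL"),
    # The same statement may appear in more than one source.
    (WIDGET, "http://example.org/vocab#name", "Widget"),
}


def merge(*graphs: set[Triple]) -> set[Triple]:
    """Merge statement sets by plain set union; order and nesting never matter."""
    merged: set[Triple] = set()
    for graph in graphs:
        merged |= graph
    return merged


def objects(graph: set[Triple], subject: str, predicate: str) -> list[str]:
    """Return every object stated for the given subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


merged = merge(source_a, source_b)
print(objects(merged, WIDGET, "http://example.org/other#license"))
```

The `merge` and `objects` functions never look at which vocabulary a statement came from, which is the sense in which the processing stays the same when the data changes.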

Efficient Integration of Decentralized Data
As alluded to above, perhaps the most significant aspect of RDF is how the basic triple model enables the merging and integration of decentralized data. The processing of triples from two or more sources (and with different RDF vocabularies altogether) can occur immediately thanks to namespaces. Integration of decentralized data also requires the ability to uniquely identify resources. RDF's reliance on URIs (which are designed to be globally unique identifiers) provides this uniqueness in a simple and elegant manner.
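
As a rough illustration of the paragraph above, the following Python sketch combines statements from two sources that use different vocabularies. The namespaces, prefixes, and data are all hypothetical; the only thing the two sources share is the URI of the resource being described.

```python
from collections import defaultdict

# Hypothetical namespaces for two independently published vocabularies.
NAMESPACES = {
    "proj": "http://example.org/ns/project#",
    "rev": "http://example.net/ns/review#",
}


def expand(qname: str) -> str:
    """Expand a prefixed name such as 'proj:name' into a full URI."""
    prefix, local = qname.split(":", 1)
    return NAMESPACES[prefix] + local


WIDGET = "http://example.org/projects/widget"
GADGET = "http://example.org/projects/gadget"

# Two sources that know nothing about each other.
site_one = [(WIDGET, expand("proj:name"), "Widget")]
site_two = [
    (WIDGET, expand("rev:rating"), "4"),
    (GADGET, expand("rev:rating"), "2"),
]


def describe(subject: str, *sources: list) -> dict:
    """Collect every property stated about one resource across all sources."""
    properties = defaultdict(list)
    for source in sources:
        for s, p, o in source:
            if s == subject:
                properties[p].append(o)
    return dict(properties)


description = describe(WIDGET, site_one, site_two)
for predicate, values in description.items():
    print(predicate, values)
```

Because each predicate expands to a full URI, properties from the two vocabularies cannot collide, and the shared subject URI is enough to line the statements up.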

Graph Based Data Models
Data sets that adhere to a basic graph model are especially well suited for representation in RDF. The simple hyperlinked characteristics of RDF allow loose coupling and late binding of resources directly in the RDF model. The object in a triple statement is often another resource with its own URI, effectively creating a relationship between resources that may not reside in the same domain. In addition, the RDF Schema property rdfs:seeAlso allows a description to be extended by linking to other sets of RDF data that exist elsewhere on the web.
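
The graph nature of triples can be sketched in a few lines of Python. In this example the rdfs:seeAlso URI is the real one from RDF Schema, while the maintainer property, the resources, and the "starts with http" test for spotting resource objects are simplifying assumptions made for illustration.

```python
from collections import deque

SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
MAINTAINER = "http://example.org/vocab#maintainer"  # hypothetical property

WIDGET = "http://example.org/projects/widget"
ALICE = "http://example.net/people/alice"

triples = [
    (WIDGET, MAINTAINER, ALICE),                       # link into another domain
    (ALICE, SEE_ALSO, ALICE + "/foaf.rdf"),            # pointer to more RDF data
    (WIDGET, "http://example.org/vocab#name", "Widget"),  # literal, not a link
]


def reachable(start: str, statements: list) -> list:
    """Breadth-first walk that follows triples whose object is itself a URI."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for s, _predicate, o in statements:
            # Simplification: treat any object starting with "http" as a resource.
            if s == node and o.startswith("http") and o not in seen:
                seen.append(o)
                queue.append(o)
    return seen


print(reachable(WIDGET, triples))
```

Starting from the project, the walk reaches the maintainer in another domain and then the document that rdfs:seeAlso points to, without any of the resources needing to know about each other in advance.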

Plays Nicely with RESTful Web APIs
RDF fits nicely with web services adhering to a REST architecture. In an article about connecting social content services, Leigh Dodds points out the complementary features of RDF and REST: as RDF uses URIs as the means of identifying resources, the API URL structure and the response format can be closely related. A plain XML vocabulary certainly has a place in RESTful web APIs, but both RDF and REST are inherently "resource" centric, and combining them can result in a more elegant and flexible service.

DOAP & the OSS Community
My interest in RDF has been piqued by the DOAP vocabulary created by Edd Dumbill, which describes open source software projects. A basic goal of DOAP is to allow people managing a project to maintain and control project metadata on their own terms and in one place...and, in theory, avoid the time and effort involved in notifying various repositories or services that updates have occurred.
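
For a sense of what such a description looks like, here is a small Python sketch that builds a minimal DOAP document in RDF/XML using only the standard library. The doap:Project class and the doap:name, doap:homepage, and doap:shortdesc properties come from the DOAP vocabulary; the project itself is invented, and a real DOAP file would usually carry many more properties.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DOAP = "http://usefulinc.com/ns/doap#"

# Ask ElementTree to use the conventional prefixes when serializing.
ET.register_namespace("rdf", RDF)
ET.register_namespace("doap", DOAP)


def doap_description(name: str, homepage: str, shortdesc: str) -> str:
    """Build a minimal DOAP project description as an RDF/XML string."""
    root = ET.Element(f"{{{RDF}}}RDF")
    project = ET.SubElement(root, f"{{{DOAP}}}Project")
    ET.SubElement(project, f"{{{DOAP}}}name").text = name
    # The homepage is a resource, so it is referenced by URI, not as text.
    ET.SubElement(project, f"{{{DOAP}}}homepage", {f"{{{RDF}}}resource": homepage})
    ET.SubElement(project, f"{{{DOAP}}}shortdesc").text = shortdesc
    return ET.tostring(root, encoding="unicode")


# Hypothetical project data, for illustration only.
xml_text = doap_description(
    "Widget",
    "http://example.org/widget",
    "A hypothetical open source project.",
)
print(xml_text)
```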

An interesting project called DOAPspace started by Rob Cakebread is a DOAP repository being actively seeded from freely available project data from sources such as Freshmeat and SourceForge. Rob also has a solution (doapurl.org) to provide an authoritative source of DOAP project URLs following the model of Persistent URLs (PURLs). DOAP URLs will essentially be permanent, allowing authorized project members to edit the PURL-like DOAP URL if the actual project URL they control ever changes. DOAPspace can then reference doapurl.org managed URLs. These services are basically a platform for enabling the decentralized nature of DOAP and allowing project members to maintain their project data. However, there is one more critical piece here - notifying interested parties and services of DOAP updates. It involves an intermediary service called Ping the Semantic Web (PTSW). When DOAP is updated, the PTSW service can be pinged and the update event will be archived and time-stamped. DOAPspace (or any other service) can then use PTSW to learn of any DOAP update events.

ossmosis, a nascent web service I've briefly mentioned before, is in the same 'semantic' realm as it were, and the role of DOAP with respect to ossmosis is evolving. The service is focused on contextual aspects of OSS projects and people, and we hope to contribute to (as well as benefit from) the emerging DOAP-friendly OSS community. I hope to write a bunch more on this in the future when various pieces and thoughts have solidified.

More on the Software Stack & Components vs. Tools

Posted by bob Mon, 24 Sep 2007 02:26:51 GMT

Steve Parker has described a simple and logical classification scheme for components and tools related to software development. In essence, most software can be associated with a location on a vertical stack that represents low to high functionality. Low in this sense means closer to the hardware level (e.g. the Linux kernel), and high means closer to a user-facing, fully functional application. The stack is composed of a limited number of generic "way-point" categories, such as data or middleware, to establish reference points as one moves up or down the stack. This single-hierarchy classification scheme should be sufficient for organizing a foundation or substrate of software without attempting to create the perfect categorization scheme (which is arguably impossible for any non-trivial topic), or relying on complex or esoteric approaches to categorization.

We have designated the highest section of the stack as "packaged applications". This would be anything that is not a software component, and would not traditionally play a role in a living system. For example: Firefox, OpenOffice, etc. However, there is quite a bit of open source software that fits into that category of full user-facing applications. It seems the stack would lose some of its value if all user-facing applications were at the top and not really differentiated in some way.

Consider MySQL Query Browser. It's a useful front-end tool for interacting with a MySQL database. If it were placed high on the stack in the miscellaneous range of "applications", its simple function as a database front-end would not be properly reflected. I would say that the MySQL Query Browser belongs near the MySQL database server itself in the "data" section, but with a slightly higher placement on the stack. This suggests that if a user-facing tool can be logically associated with a given range within the stack, then it belongs there. Ideally, the highest level "packaged applications" section would only contain applications that could not otherwise be reliably placed within lower ranges of the stack. The stack favors function over form.

There is one more simple addition to the stack model to help further classify software at all levels. We are viewing all software as a component or a tool. A component is a piece of software that is directly involved in the development process and plays a role at runtime; components participate in a living system. Examples: database servers, web development frameworks, code libraries, etc. A tool is a stand-alone, fully functional application or utility, such as MySQL Query Browser, though of course a tool does not have to be a full GUI-based application (e.g. mytop).
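
One possible way to model the stack plus the component/tool distinction is sketched below in Python. The "data", "middleware", and "packaged applications" way-points come from the description above; the "system" way-point, the numeric offsets, and the exact placements are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    COMPONENT = "component"  # participates in a living system at runtime
    TOOL = "tool"            # stand-alone application or utility


# Way-points ordered from low (near hardware) to high (user-facing).
# "system" is an assumed name for the lowest way-point.
WAYPOINTS = ["system", "data", "middleware", "packaged applications"]


@dataclass(frozen=True)
class Software:
    name: str
    waypoint: str
    offset: float  # assumed position within the way-point: 0.0 (low) up to 1.0
    kind: Kind

    @property
    def height(self) -> float:
        """Overall position on the stack: way-point index plus offset."""
        return WAYPOINTS.index(self.waypoint) + self.offset


# Illustrative placements only.
catalog = [
    Software("Firefox", "packaged applications", 0.5, Kind.TOOL),
    Software("MySQL Query Browser", "data", 0.7, Kind.TOOL),
    Software("MySQL server", "data", 0.3, Kind.COMPONENT),
    Software("Linux kernel", "system", 0.1, Kind.COMPONENT),
]


def bottom_up(items: list) -> list:
    """Order software from the bottom of the stack to the top."""
    return sorted(items, key=lambda item: item.height)


for item in bottom_up(catalog):
    print(f"{item.height:.1f}  {item.waypoint:<22} {item.kind.value:<10} {item.name}")
```

In this sketch MySQL Query Browser sorts just above the MySQL server inside the "data" range rather than alongside Firefox at the top, which reflects the "function over form" placement.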

I have not yet been able to think of something that falls into a gray area of both component and tool. Eclipse came to mind, but it is usually viewed as individual pieces of software anyway. Please let me know if you can think of something that fits both roles...

Saxon SQL Extension: Importing XML into a Relational Database

Posted by bob Tue, 31 Jul 2007 14:37:00 GMT

I finally started making time for another project that Mr. Parker and I have been discussing now for well over a year. Currently we're calling it ossmosis, and the purposefully vague description of it is a contextual resource for open source software - targeting both developers and less technical project manager types.

Research has led me to discover a really useful project called FLOSSmole, which is described as a "collaborative collection and analysis of free/libre/open source project data". I was originally planning on developing some sort of a crawler to retrieve this type of data myself, but luckily came across FLOSSmole before writing a single line of code.

Part of the FLOSSmole data covers all freshmeat.net projects and their associated "troves", or facets. This is what I'm currently most interested in. However, the actual trove hierarchy (facet names) is currently not available in FLOSSmole, but is hosted by freshmeat.net here (this is a large file so think twice before telling your browser to view it!).

On to the specific topic of this article - importing an XML document (the trove hierarchy) into a relational database (MySQL). There are 1001 different ways to do this. You could probably do it in three lines of Ruby code. Maybe. I chose the approach of using XSLT and the SQL extension functions available within Saxon. Why? Because all that is needed is a single template matching on <descriminator> nodes, which then fires <sql:insert> calls. A script written in Perl, Ruby, Python, etc. would have to establish the database connection, parse the XML tree and find the <descriminator> nodes, extract the values of the child elements under <descriminator>, construct SQL insert statements using those values, and then interact with the database API to do the actual inserts. SAX is another option, but the SQL inserts would still have to be constructed and the code would still have to interact with a database API.
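
For comparison, the script-based alternative described above might look roughly like the following Python sketch. This is not the approach used in this article: it swaps in SQLite (from the Python standard library) for MySQL so it can run anywhere, and the sample document, the <id> and <name> child elements, and the trove_defs table are all hypothetical stand-ins for the real trove file and schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the trove file; the real child elements may differ.
SAMPLE = """<trove>
  <descriminator><id>1</id><name>Topic</name></descriminator>
  <descriminator><id>2</id><name>License</name></descriminator>
</trove>"""


def import_troves(xml_text: str, connection: sqlite3.Connection) -> int:
    """Insert one row per <descriminator> node and return the row count."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS trove_defs (id INTEGER PRIMARY KEY, name TEXT)"
    )
    # Parse the tree and pull the child element values out of each node.
    root = ET.fromstring(xml_text)
    rows = [
        (int(node.findtext("id")), node.findtext("name"))
        for node in root.iter("descriminator")
    ]
    # Parameterized inserts; the database API handles quoting.
    connection.executemany("INSERT INTO trove_defs (id, name) VALUES (?, ?)", rows)
    connection.commit()
    return len(rows)


connection = sqlite3.connect(":memory:")
count = import_troves(SAMPLE, connection)
print(count, connection.execute("SELECT * FROM trove_defs ORDER BY id").fetchall())
```

Each step listed above (connecting, parsing, extracting, building inserts, calling the database API) shows up explicitly here, which is the bookkeeping the single XSLT template avoids.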


Here's the stuff:

ImportFMTroveDefs.xslt (includes SQL to create target table)

Saxon jars available below:
saxon8.jar   saxon8-sql.jar

You will also need a JDBC driver. The best one for MySQL is here.

Example command line java call to run this:

java -Xmx84M -cp saxon8.jar:saxon8-sql.jar:mysql-connector-java-5.0.7-bin.jar net.sf.saxon.Transform fm-trove.rdf ImportFMTroveDefs.xslt driver="com.mysql.jdbc.Driver" database="jdbc:mysql://localhost/ossmosis" user="ossmosis" password="ossmosis" datasourceid="81"

The JVM memory argument was necessary to boost the default due to the size of the 'fm-trove.rdf' file. Also - watch out for empty string database user passwords - passing a parameter to Saxon from the command line can be funky for empty strings.

Finally, here is the dump of the data once imported into MySQL.