Now blogging at diego's weblog. See you over there!

java xml pull and push: a comparison


Yesterday Russ pointed to an article at O'Reilly's XML.com about StAX (XML Streaming API), a Java API that allows parsing of XML through a pull-mechanism, currently in the final laps of the JSR process as JSR 173. I was immediately intrigued. While many find it common to use additional APIs to solve some problems, I tend to prefer removing layers of complexity and abstraction that aren't absolutely necessary, using only JDK-standard (or standard extension) classes as much as possible. Since this API appears to be not only frozen, but also on track to be a standard extension (and hopefully it will be included in the next JDK release!), I decided to give it a try.

While the benefits in simplicity of parsing are quite obvious, I was a little weary of the performance of this package, since it's a reference implementation and it is almost certainly not fully optimized (one of the main uses that I'd give it is to parse RSS, and I was already looking at how to improve performance of RSS checking--but that's another story). So I created two parsers, one using StAX and one using plain SAX, and compared them in terms of usability and performance.

I ran the tests agains an RSS 2 feed of 500 KB, essentially 20 copies of this morning's RSS feed for my weblog.

The results are pretty surprising. StAX wins by a mile. Check out the results:

SAX results

-----------------------------------------------
Start element count = 2109
Characters event count = 4303
-----------------------------------------------
Time elapsed = 140 (msec)
-----------------------------------------------
StAX (Streaming) results
-----------------------------------------------
Start element count = 2109
Characters event count = 4236
-----------------------------------------------
Time elapsed = 63 (msec)
-----------------------------------------------
A couple of notes: "characters event" count refers to the time that an element is of type characters in StAX or when the characters method is called in SAX by the parser. For an entry that contains text (e.g., CDATA), multiple character calls may be received, depending on new lines, etc. Apparently StAX splits the text a bit differently than plain SAX, since it finds more character elements, but that's ok (to parse those you only need to keep state and append the new characters value to the current element you're parsing). Both element counts are the same, which only proves that both are parsing the same structure properly. The results, are, of course, consistent over several runs, with the expected slight differences. (And, btw, I tried testing memory usage but the variability in initial free memory, etc, was too big to be able to measure which one is better. At a minimum, they appear to be equivalent in that sense).

Here is the code used in the tests: for the TestSAXParser class and the TestXMLStream class. Note: to run the code you'll need to download the current specification of the JSR. In that ZIP file there's the spec itself (a PDF) and a JAR file, jsr173.jar, which contains a number of classes, API docs and such. Of those jars, the only ones necessary to run the example (ie., that must be added to the classpath) are jsr173_07_api.jar and jsr173_07_ri.jar, that is, the API and the reference implementation respectively.

As it is plain to see, the StAX code is a lot simpler than the SAX code, because it doesn't require a wrapper to make it look more event-based. Add to that the fact that StAX code runs at more than twice the speed of SAX, and, as I said, StAX wins hands down. No contest.

Cool eh?

Categories: soft.dev
Posted by diego on September 19 2003 at 10:40 AM

Copyright © Diego Doval 2002-2011.
Powered by
Movable Type 4.37