Performance Tip: Boost Your SAX Life
SAX, or Simple API for XML has grown to be the standard in Java XML parsing. Numerous studies show how its performance, in the majority of circumstances, beats others such as PULL.
Does that mean you should just bang out a custom handler and expect solid performance, hardly.
This article has only two tips, but it can have a huge impact.
Sample Code
In case it has been a while since you wrote your last SAX XML handler, below is a sample method for the start of each element in an XML stream. This method will be called each and every time a new element starts, no matter how nested or trivial.
public void startElement(String uri, String name, String qName, Attributes atts) { if (name.trim().equals("category")){ inCategory = true; }else if (name.trim().equals("availability")){ inAvailability = true; }else if (name.trim().equals("delivery_formats")){ inFormats = true; }else if (name.trim().equals("queue_item")){ inQueueItem = true; }else if (name.trim().equals("id")){ inId = true; }else if (name.trim().equals("average_rating")){ inRating = true; }else if (name.trim().equals("position")){ inPosition = true; }else if (name.trim().equals("title")){ inTitle = true; }else if (name.trim().equals("box_art")){ inBoxArt = true; }else if (name.trim().equals("etag")){ inETag = true; }else if (name.trim().equals("number_of_results")){ inResultsTotal = true; }else if (name.trim().equals("results_per_page")){ inResultsPerPage = true; } }
(That is part of a handler I use for Netflix’s API responses.)
Keep it Simple SAX
So this may seem obvious to some of you, and is based on a well understood matter. String comparisons are not cheap. They’re not explicitly expensive, and having a half a dozen in one method would not usually cause any performance bottle necks.
But now imagine calling the same method 1,000′s of times. That’s 6,000 string comparisons!
Each method in a SAX parse will do just that! So how do we reduce the number of comparisons needed?
So if your not using “reasults_per_page” then there is no need to check for that element. It may be useful to create separate handlers for various needs. One can get the quick and dirty summary, while the other can capture all the gritty details.
Hooray! we just knocked off 1,000 comparisons!
Order is Everything!
So keeping the If Else blocks short makes sense, the less comparisons, the faster the parsing. But even for those blocks that require numerous element checks we can tune performance substantially by adjusting the order.
The next tip, and a hugely important one, is to carefully consider the frequency of each element called. To illustrate is a sample XML response form Netflix.
<?xml version="1.0" standalone="yes"?> <queue> <etag>26049040137</etag> <url_template>http://api.netflix.com/users/se3cret/queues/disc?{-join|&|sort|start_index|max_results} </url_template> <number_of_results>32</number_of_results> <start_index>0</start_index> <results_per_page>25</results_per_page> <queue_item> <id>http://api.netflix.com/users/secret/queues/disc/available/1/70056440 </id> <position>1</position> <link href="http://api.netflix.com/users/secret/queues/disc/available" rel="http://schemas.netflix.com/queues.available" title="available queue"></link> <category scheme="http://api.netflix.com/categories/queue_availability" label="available now" term="available now">Available Now</category> <updated>1185814282</updated> <category scheme="http://api.netflix.com/categories/title_formats" label="DVD" term="DVD"></category> <link href="http://api.netflix.com/catalog/titles/movies/70056440" rel="http://schemas.netflix.com/catalog/title" title="300"></link> <title short="300" regular="300"></title> <box_art small="http://cdn-0.nflximg.com/us/boxshots/tiny/70056440.jpg" medium="http://cdn-0.nflximg.com/us/boxshots/small/70056440.jpg" large="http://cdn-0.nflximg.com/us/boxshots/large/70056440.jpg"></box_art> <link href="http://api.netflix.com/catalog/titles/movies/70056440/synopsis" rel="http://schemas.netflix.com/catalog/titles/synopsis" title="synopsis"> <synopsis><![CDATA[<a href="http://www.netflix.com/RoleDisplay/Zack_Snyder/20057354">Zack Snyder</a> directs this faithful adaptation of Frank Miller's (Sin City) graphic novel about the storied Battle of Thermopylae, a conflict that pitted the ancient Greeks against the Persians in 480 B.C. The film, which blends live-action shots with virtual backgrounds to capture Miller's original vision, co-stars <a href="http://www.netflix.com/RoleDisplay/Gerard_Butler/20015853">Gerard Butler</a> as the Spartan King Leonidas, who leads his small band of 300 soldiers against an army of more than one million. ]]></synopsis> </link> <link href="http://api.netflix.com/catalog/titles/movies/70056440/bonus_materials" rel="http://schemas.netflix.com/catalog/titles/bonus_materials" title="bonus materials"></link> <release_year>2007</release_year> <category scheme="http://api.netflix.com/categories/mpaa_ratings" label="R" term="R"></category> <category scheme="http://api.netflix.com/categories/genres" label="Action & Adventure" term="Action & Adventure"></category> <category scheme="http://api.netflix.com/categories/genres" label="Adventures" term="Adventures"></category> <category scheme="http://api.netflix.com/categories/genres" label="Military & War Action" term="Military & War Action"></category> <category scheme="http://api.netflix.com/categories/genres" label="Epics" term="Epics"></category> <category scheme="http://api.netflix.com/categories/genres" label="Comic Books and Superheroes" term="Comic Books and Superheroes"></category> <category scheme="http://api.netflix.com/categories/genres" label="Blu-ray" term="Blu-ray"></category> <category scheme="http://api.netflix.com/categories/genres" label="Warner Home Video" term="Warner Home Video"></category> <runtime>6960</runtime> <link href="http://api.netflix.com/catalog/titles/movies/70056440/format_availability" rel="http://schemas.netflix.com/catalog/titles/format_availability" title="formats"> <delivery_formats> <availability available_from="1185840000"> <category scheme="http://api.netflix.com/categories/title_formats" label="Blu-ray" term="Blu-ray"></category> </availability> <availability available_from="1185840000"> <category scheme="http://api.netflix.com/categories/title_formats" label="DVD" term="DVD"></category> </availability> </delivery_formats> </link> <link href="http://api.netflix.com/catalog/titles/movies/70056440/screen_formats" rel="http://schemas.netflix.com/catalog/titles/screen_formats" title="screen formats"></link> <link href="http://api.netflix.com/catalog/titles/movies/70056440/cast" rel="http://schemas.netflix.com/catalog/people.cast" title="cast"></link> <link href="http://api.netflix.com/catalog/titles/movies/70056440/directors" rel="http://schemas.netflix.com/catalog/people.directors" title="directors"></link> <link href="http://api.netflix.com/catalog/titles/movies/70056440/languages_and_audio" rel="http://schemas.netflix.com/catalog/titles/languages_and_audio" title="languages and audio"></link> <average_rating>4.0</average_rating> <link href="http://api.netflix.com/catalog/titles/movies/70056440/similars" rel="http://schemas.netflix.com/catalog/titles.similars" title="similars"></link> <link href="http://www.netflix.com/Movie/300/70056440" rel="alternate" title="web page"></link> </queue_item> <queue_item> ... </queue_item> </queue>
So there are some important things to note;
- ‘queue’ occurs once, and only once
- Likewise ‘etag’, and result counts occur once and only once
- ‘queue_item’ occurs once for each movie (1 to many times)
- ‘id’, ‘position’ and movie details also occur once for each listing (1 to many times)
- ‘availability’ may occur 1 to 3 times for each movie (3 to (3*many) times )
- ‘category’ is by far the most notorious, occurring within each title multiple times, and within each availability (lots of times)
So what does that mean? It means the very first element we’ll check for is ‘category’, then ‘availability’, and work our way to the least common element, ‘queue’.
This is a very effective way to increase performance because once a match is found, we skip all the rest.
So a 12 comparison method effectively becomes a single comparison if we are in a ‘category’, and a 2 comparison method if we are in ‘availability’, etc. Until finally on the rarest ‘queue’ which occurs only once do we evaluate all 12 comparisons.
Results
This test is hardly conclusive, but you can see the average and max parse times of a randomly arranged handler and a popularity arranged handler.

Nice article.
How about nesting IF statements to further reduce # of evaluations. For instance if we are not in “Formats” then we know we can’t be in availability, and need not check so.