Showdown - Java HTML Parsing Comparison
I had to do some HTML parsing today. Unfortunately, most HTML on the web is not well-formed like the markup created here at Lumidant, and missing end tags and other broken syntax throw a wrench into the works. Luckily, others have already addressed this issue. Many times over, in fact, which leaves you wondering which solution to implement.
Once you parse HTML, you can do some cool stuff with it, like transforming it or extracting information. For that reason, parsing is often used for screen scraping. So, to test the parsing libraries, I decided to do exactly that and see if I could parse the HTML well enough to extract links from it using an XQuery expression. The contenders were NekoHTML, HtmlCleaner, TagSoup, and jTidy. I know that there are many others I could have chosen from as well, but this seemed to be a good sampling and there’s only so much time in the day.
I also chose 10 URLs to parse. Being a true Clevelander, I picked the sites of a number of local attractions. I’m right near all of the stadiums, so the Quicken Loans Arena website was my first target. I sometimes jokingly refer to my city as the “Mistake on the Lake,” and the pure awfulness of the HTML from my city did not fail me. The ten URLs I chose are:
http://www.theqarena.com
http://cleveland.indians.mlb.com
http://www.clevelandbrowns.com
http://www.cbgarden.org
http://www.clemetzoo.com
http://www.cmnh.org
http://www.clevelandart.org
http://www.mocacleveland.org
http://www.glsc.org
http://www.rockhall.com
I gave each library an InputStream created from a URL (referred to as urlIS in the code samples below) and expected an org.w3c.dom.Node in return once the parse operation completed. Each library got its own class extending an AbstractScraper, which in turn implements a Scraper interface I created. This design tip was fresh in my mind from reading my all-time favorite technical book, Effective Java by Josh Bloch. The implementation-specific code for each library is below:
NekoHTML:
// NekoHTML's DOMParser builds the DOM directly while repairing the markup
final DOMParser parser = new DOMParser();
try {
    parser.parse(new InputSource(urlIS));
    document = parser.getDocument();
} catch (SAXException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
TagSoup:
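final Parser parser = new Parser();
SAX2DOM sax2dom = null;
try {
    // SAX2DOM collects TagSoup's SAX events into a DOM tree
    sax2dom = new SAX2DOM();
    parser.setContentHandler(sax2dom);
    // disable namespace processing so queries need no namespace prefix
    parser.setFeature(Parser.namespacesFeature, false);
    parser.parse(new InputSource(urlIS));
} catch (Exception e) {
    e.printStackTrace();
}
// guard against a NullPointerException if the SAX2DOM constructor itself failed
if (sax2dom != null) {
    document = sax2dom.getDOM();
}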
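The namespace setting in the TagSoup snippet matters because of how XPath treats unprefixed names: when elements live in the XHTML namespace, a plain //a selects nothing, and the query needs a bound prefix instead (the point raised in the comments). The sketch below illustrates this with only the JDK's own parser and XPath engine on a small well-formed sample, not with TagSoup itself. The class name NamespaceDemo, its methods, and the prefix "h" are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original post.
class NamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Binds the arbitrary prefix "h" to the XHTML namespace for XPath lookups.
    static final NamespaceContext XHTML_CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "h" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String uri) {
            return XHTML_NS.equals(uri)
                    ? Collections.singletonList("h").iterator()
                    : Collections.<String>emptyIterator();
        }
    };

    // Parses well-formed markup with namespace processing switched on.
    static Document parseNamespaceAware(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("markup was not well-formed", e);
        }
    }

    // Counts the nodes selected by the expression, with "h" bound to XHTML.
    static int count(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(XHTML_CONTEXT);
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```

For a document whose root declares xmlns="http://www.w3.org/1999/xhtml", count(doc, "//a") finds no elements while count(doc, "//h:a") finds them all. Producing elements without a namespace in the first place, as the TagSoup snippet aims to, keeps the shorter query usable.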
jTidy:
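final Tidy tidy = new Tidy();
// keep jTidy from printing its summary and warnings
tidy.setQuiet(true);
tidy.setShowWarnings(false);
// still produce a document when the input contains errors
tidy.setForceOutput(true);
// null output stream: only the DOM is needed, not the tidied markup
document = tidy.parseDOM(urlIS, null);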
HtmlCleaner:
final HtmlCleaner cleaner = new HtmlCleaner(urlIS);
try {
    cleaner.clean();
    document = cleaner.createDOM();
} catch (Exception e) {
    e.printStackTrace();
}
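The Scraper interface and AbstractScraper class mentioned earlier are not shown in this post. As a rough sketch of how such an arrangement could look, each library-specific snippet above would sit inside a subclass's parse method. The method names, the error handling, and the StrictXmlScraper stand-in are all assumptions made for illustration, not the original code.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

// Hypothetical reconstruction: the post names Scraper and AbstractScraper
// but does not show them, so every member below is an assumption.
interface Scraper {
    /** Parses the stream and returns the DOM, or null if parsing failed. */
    Node scrape(InputStream urlIS);
}

abstract class AbstractScraper implements Scraper {
    // Named "document" to match the variable assigned in the snippets above.
    protected Node document;

    @Override
    public final Node scrape(InputStream urlIS) {
        document = null;
        try {
            parse(urlIS);
        } catch (Exception e) {
            // Treat any parser failure as "no document" rather than propagating.
            document = null;
        }
        return document;
    }

    /** Library-specific step: parse urlIS and assign the result to document. */
    protected abstract void parse(InputStream urlIS) throws Exception;
}

// Stand-in subclass using the JDK's strict XML parser so the sketch runs
// without third-party jars. Unlike the four libraries compared here, it
// rejects anything that is not well-formed.
class StrictXmlScraper extends AbstractScraper {
    @Override
    protected void parse(InputStream urlIS) throws Exception {
        document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(urlIS);
    }
}
```

Keeping the shared contract in the interface and the common bookkeeping in the abstract class means the comparison code only ever deals with a Scraper, and adding another contender is one more small subclass.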
Finally, to judge how well each library parsed the HTML, I ran the XQuery “//a” to grab all the <a> tags from the document. The only one of these parsing libraries I had used before was jTidy, which was able to extract the links from 5 of the 10 documents. However, the clear winner was HtmlCleaner. It was the only library to successfully clean all 10 documents. Most of the others were not able to make it past even the very first link I provided, the Quicken Loans Arena site. HtmlCleaner’s full results:
Found 87 links at http://www.theqarena.com/
Found 156 links at http://cleveland.indians.mlb.com/
Found 96 links at http://www.clevelandbrowns.com/
Found 106 links at http://www.cbgarden.org/
Found 70 links at http://www.clemetzoo.com/
Found 23 links at http://www.cmnh.org/site/
Found 27 links at http://www.clevelandart.org/
Found 51 links at http://www.mocacleveland.org/
Found 27 links at http://www.glsc.org/
Found 90 links at http://www.rockhall.com/
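The counts above come from evaluating //a against each returned DOM. The original runs used Saxon for this (see the comments below); as a dependency-free stand-in, the following sketch does the same count with the JDK's built-in javax.xml.xpath API. The LinkCounter class and its method names are hypothetical, and parseWellFormed exists only to make the sketch easy to try on a small sample.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original test code.
class LinkCounter {
    /** Returns how many <a> elements the expression //a selects in root's document. */
    static int countLinks(Node root) {
        try {
            NodeList links = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//a", root, XPathConstants.NODESET);
            return links.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    /** Builds a DOM from well-formed markup; real pages need one of the parsers above. */
    static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("markup was not well-formed", e);
        }
    }
}
```

Note that element names in XPath are case-sensitive, so whether //a also catches <A> tags depends on how a given parser normalizes element names in the DOM it returns.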
One disclaimer: I did not go out of my way to improve the performance of any of these libraries. Some of them have additional options that could possibly improve it, but I did not wade through the documentation to figure out what those options were and simply used the plain vanilla incantations. HtmlCleaner seems to offer everything I need and was quick and easy to implement.
Update: I’ve found a new winner.

anjan bacchu said,
February 5, 2008 at 10:35 am
hi there,
thanks for the post. I frequently need an ability to get hold of links from documents and download them automatically. I almost always do it manually, especially if I am behind a corporate firewall. But with HtmlCleaner and XQuery, I should be able to automate most of it NEXT time.
which xquery tool do you use ?
BR,
~A
Ben said,
February 5, 2008 at 10:12 pm
Hi Anjan,
This method will easily beat the manual one. I use Saxon to run the actual XQuery on the DOM once I’ve received it back from the parse operation. Perhaps if there is interest I can write a post on how to use Saxon.
-Ben
Lance said,
February 6, 2008 at 3:50 pm
I’m curious about your TagSoup test, as I’ve had good results with it. One thing I found was that it actually created a namespace for the nodes, so I had to preface my xquery with “h:”. Following is from some code I used to print out the movies in my Blockbuster Queue (of course I wrote helper code - not included):
Dom4jXPath xpath = new Dom4jXPath("//h:div[@class='title']");
xpath.addNamespace("h", "http://www.w3.org/1999/xhtml");
java.util.List divs = xpath.selectNodes("//h:div[@class='title']", doc);
…
later
…
Dom4jXPath xpath = new Dom4jXPath("h:a");
xpath.addNamespace("h", "http://www.w3.org/1999/xhtml");
xpath.stringValue(element);
Lance said,
February 6, 2008 at 3:53 pm
I do like how HtmlCleaner just required a constructor taking a URL, rather than all the extra code the others require. I will look into it.
Cemo said,
February 7, 2008 at 2:56 pm
Very nice article Ben. I was looking for such a helpful post.
Ben said,
February 7, 2008 at 7:13 pm
Hi Lance,
You are correct about the TagSoup namespace issue and I’ve updated the code above to reflect an easier fix.
Since I didn’t post the actual TagSoup results, I was going to rerun the test and share the results with you. Unfortunately, I’m afraid I’ve changed the code in some manner since I wrote this post. I’m not able to get very far, since I’m now being presented with an exception: “org.w3c.dom.DOMException: NOT_FOUND_ERR: An attempt is made to reference a node in a context where it does not exist.” I’m not going to spend any time debugging this since HtmlCleaner seems to work well enough for me, but if you’ve run into the same problem and know what I’m doing wrong, I’d be happy to take another look.
-Ben
casino juegos de azar said,
February 10, 2008 at 4:44 am
Nice! We rather appreciated the website
HTML Parsing Showdown - New Contender Takes Title | Lumidant said,
March 21, 2008 at 3:51 pm
[...] of my first posts was a comparison of HTML parsers. Today I found a particularly challenging document to parse. None of the parsers I had compared [...]
Bambarbia Kirkudu said,
April 11, 2008 at 9:48 am
“Only URL to nice internet shop (for beauties!) shows the difference, 144 links found with HtmlCleaner, and 116 with NekoHTML. After quick copy-paste to Excel and sorting links I found that some links are simply repeated by HtmlCleaner probably due to bug… so that all parsers behave the same, correctly parsing ugliest HTML.”

Ben said,
April 13, 2008 at 2:35 pm
Bambarbia,
I don’t see your results anywhere. Did you try any of the links I published? I found that most of the parsers did not handle these URLs well. The Mozilla Parser is the best I’ve come across for malformed xml.
-Ben
Tom said,
June 16, 2008 at 5:22 pm
Hi,
thanks for the article. Unfortunately, I am not able to use NekoHTML with Saxon. Saxon always crashes with an error that the document isn’t valid. Do you think you could leave an example of your code here?
Thank you
Tom