<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Showdown - Java HTML Parsing Comparison</title>
	<atom:link href="http://www.lumidant.com/blog/java-html-parsing-library-comparison/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lumidant.com/blog/java-html-parsing-library-comparison/</link>
	<description>The weblog of Lumidant LLC.</description>
	<pubDate>Fri, 04 Jul 2008 00:27:39 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
		<item>
		<title>By: Tom</title>
		<link>http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-341</link>
		<dc:creator>Tom</dc:creator>
		<pubDate>Mon, 16 Jun 2008 22:22:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-341</guid>
		<description>Hi,

Thanks for the article. Unfortunately, I am not able to use NekoHTML with Saxon. Saxon always crashes with an error saying that the document isn't valid. Do you think you could post an example of your code?

Thank you

Tom</description>
		<content:encoded><![CDATA[<p>Hi,</p>
<p>Thanks for the article. Unfortunately, I am not able to use NekoHTML with Saxon. Saxon always crashes with an error saying that the document isn&#8217;t valid. Do you think you could post an example of your code?</p>
<p>Thank you</p>
<p>Tom</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ben</title>
		<link>http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-169</link>
		<dc:creator>Ben</dc:creator>
		<pubDate>Sun, 13 Apr 2008 19:35:21 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-169</guid>
		<description>Bambarbia,
I don't see your results anywhere.  Did you try any of the links I published?  I found that most of the parsers did not handle these URLs well.  The &lt;a href="http://www.lumidant.com/blog/html-parsing-with-java-mozilla-html-parser/" rel="nofollow"&gt;Mozilla Parser&lt;/a&gt; is the best I've come across for malformed xml.

-Ben</description>
		<content:encoded><![CDATA[<p>Bambarbia,<br />
I don&#8217;t see your results anywhere.  Did you try any of the links I published?  I found that most of the parsers did not handle these URLs well.  The <a href="http://www.lumidant.com/blog/html-parsing-with-java-mozilla-html-parser/" rel="nofollow">Mozilla Parser</a> is the best I&#8217;ve come across for malformed xml.</p>
<p>-Ben</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bambarbia Kirkudu</title>
		<link>http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-166</link>
		<dc:creator>Bambarbia Kirkudu</dc:creator>
		<pubDate>Fri, 11 Apr 2008 14:48:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-166</guid>
		<description>"Only URL to nice internet shop (for beauties!) shows the difference, 144 links found with HtmlCleaner, and 116 with NekoHTML. After quick copy-paste to Excel and sorting links I found that some links are simply repeated by HtmlCleaner probably due to bug... so that all parsers behave the same, correctly parsing ugliest HTML."
;)</description>
		<content:encoded><![CDATA[<p>&#8220;Only URL to nice internet shop (for beauties!) shows the difference, 144 links found with HtmlCleaner, and 116 with NekoHTML. After quick copy-paste to Excel and sorting links I found that some links are simply repeated by HtmlCleaner probably due to bug&#8230; so that all parsers behave the same, correctly parsing ugliest HTML.&#8221;<br />
 <img src='http://www.lumidant.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: HTML Parsing Showdown - New Contender Takes Title &#124; Lumidant</title>
		<link>http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-106</link>
		<dc:creator>HTML Parsing Showdown - New Contender Takes Title &#124; Lumidant</dc:creator>
		<pubDate>Fri, 21 Mar 2008 20:51:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-106</guid>
		<description>[...] of my first posts was a comparison of HTML parsers. Today I found a particularly challenging document to parse. None of the parsers I had compared [...]</description>
		<content:encoded><![CDATA[<p>[...] of my first posts was a comparison of HTML parsers. Today I found a particularly challenging document to parse. None of the parsers I had compared [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ben</title>
		<link>http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-14</link>
		<dc:creator>Ben</dc:creator>
		<pubDate>Fri, 08 Feb 2008 00:13:11 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-14</guid>
		<description>Hi Lance,
You are correct about the TagSoup namespace issue and I've updated the code above to reflect an easier fix.  
Since I didn't post the actual TagSoup results, I was going to rerun the test and share the results with you.  Unfortunately, I'm afraid I've changed the code in some way since I wrote this post.  I can't get very far because I'm now getting an exception: "org.w3c.dom.DOMException: NOT_FOUND_ERR: An attempt is made to reference a node in a context where it does not exist."  I'm not going to spend any time debugging this since HtmlCleaner seems to work well enough for me, but if you've run into the same problem and know what I'm doing wrong, I'd be happy to take another look.

-Ben</description>
		<content:encoded><![CDATA[<p>Hi Lance,<br />
You are correct about the TagSoup namespace issue and I&#8217;ve updated the code above to reflect an easier fix.<br />
Since I didn&#8217;t post the actual TagSoup results, I was going to rerun the test and share the results with you.  Unfortunately, I&#8217;m afraid I&#8217;ve changed the code in some way since I wrote this post.  I can&#8217;t get very far because I&#8217;m now getting an exception: &#8220;org.w3c.dom.DOMException: NOT_FOUND_ERR: An attempt is made to reference a node in a context where it does not exist.&#8221;  I&#8217;m not going to spend any time debugging this since HtmlCleaner seems to work well enough for me, but if you&#8217;ve run into the same problem and know what I&#8217;m doing wrong, I&#8217;d be happy to take another look.</p>
<p>-Ben</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Cemo</title>
		<link>http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-13</link>
		<dc:creator>Cemo</dc:creator>
		<pubDate>Thu, 07 Feb 2008 19:56:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-13</guid>
		<description>Very nice article Ben. I was looking for such a helpful post.</description>
		<content:encoded><![CDATA[<p>Very nice article Ben. I was looking for such a helpful post.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lance</title>
		<link>http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-12</link>
		<dc:creator>Lance</dc:creator>
		<pubDate>Wed, 06 Feb 2008 20:53:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-12</guid>
		<description>I do like how HtmlCleaner just required a constructor taking a URL, rather than all the extra code the others require.  I will look into it.</description>
		<content:encoded><![CDATA[<p>I do like how HtmlCleaner just required a constructor taking a URL, rather than all the extra code the others require.  I will look into it.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lance</title>
		<link>http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-11</link>
		<dc:creator>Lance</dc:creator>
		<pubDate>Wed, 06 Feb 2008 20:50:58 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-11</guid>
		<description>I'm curious about your TagSoup test, as I've had good results with it.  One thing I found was that it actually puts the nodes in a namespace, so I had to prefix the element names in my XPath expressions with "h:".  The following is from some code I used to print out the movies in my Blockbuster Queue (of course I wrote helper code - not included):

import org.jaxen.dom4j.Dom4jXPath;

// Select every title div; TagSoup puts elements in the XHTML namespace.
Dom4jXPath xpath = new Dom4jXPath("//h:div[@class='title']");
xpath.addNamespace("h", "http://www.w3.org/1999/xhtml");
java.util.List divs = xpath.selectNodes(doc);
...
later
...
// For each div element, read the text of its child link.
Dom4jXPath linkXPath = new Dom4jXPath("h:a");
linkXPath.addNamespace("h", "http://www.w3.org/1999/xhtml");
linkXPath.stringValueOf(element);</description>
		<content:encoded><![CDATA[<p>I&#8217;m curious about your TagSoup test, as I&#8217;ve had good results with it.  One thing I found was that it actually puts the nodes in a namespace, so I had to prefix the element names in my XPath expressions with &#8220;h:&#8221;.  The following is from some code I used to print out the movies in my Blockbuster Queue (of course I wrote helper code - not included):</p>
<p>import org.jaxen.dom4j.Dom4jXPath;<br />
// Select every title div; TagSoup puts elements in the XHTML namespace.<br />
Dom4jXPath xpath = new Dom4jXPath("//h:div[@class='title']");<br />
xpath.addNamespace("h", "http://www.w3.org/1999/xhtml");<br />
java.util.List divs = xpath.selectNodes(doc);<br />
&#8230;<br />
later<br />
&#8230;<br />
// For each div element, read the text of its child link.<br />
Dom4jXPath linkXPath = new Dom4jXPath("h:a");<br />
linkXPath.addNamespace("h", "http://www.w3.org/1999/xhtml");<br />
linkXPath.stringValueOf(element);</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ben</title>
		<link>http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-9</link>
		<dc:creator>Ben</dc:creator>
		<pubDate>Wed, 06 Feb 2008 03:12:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-9</guid>
		<description>Hi Anjan,
This method will easily beat the manual one.  I use Saxon to run the actual XQuery on the DOM once I've received it back from the parse operation.  Perhaps if there is interest I can write a post on how to use Saxon.

-Ben</description>
		<content:encoded><![CDATA[<p>Hi Anjan,<br />
This method will easily beat the manual one.  I use Saxon to run the actual XQuery on the DOM once I&#8217;ve received it back from the parse operation.  Perhaps if there is interest I can write a post on how to use Saxon.</p>
<p>-Ben</p>
]]></content:encoded>
	</item>
</channel>
</rss>
