conversation finder, part 3

The conversation finder saga continues! :) (Parts one, two)

The parser's done, at least in basic form. Both parser and bot seem to be running and playing along nicely. I created a simple conversation finder site to have a fixed point of reference for all this, particularly for the bot, which should start showing up in some logs any minute now. I am keeping it under control by manually specifying which sites it can download (I know this isn't scalable, but it's an easy fix once the rest is done), and at the moment only three are active: my weblog, Don's and Tim's. Since the core of the idea came from a conversation Don was having with Tim, I'll use that as the "index case". If the finder actually finds conversations there, I'll start expanding the field, or possibly add a form so that others can "activate" spidering of a site.

But that's for later. Now, regarding the parser, a few interesting things.

As it turns out, parsing itself was a lot simpler than interpreting the information. I am using the HTML Parser in the JDK's HTMLEditorKit, which is actually quite easy to use: just define a Callback and specify what to do with each tag opening, closing, etc. But for the algorithm I'm using links between pages, which sounds simple enough... until you realize that links come in many shapes and sizes. Normalization of links into full URIs took a bit of figuring out. What I needed to do was, starting from any possible HREF, end up with a full URI, of the form: scheme + hostinfo + path + query + fragment (hostinfo actually has components, but let's leave that aside for now).
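To make the callback idea concrete, here is a minimal sketch of pulling raw HREF values out of a page with the JDK's parser. The class name `LinkCollector` and its `extract` method are made up for illustration and are not the finder's actual code; the only JDK pieces assumed are `HTMLEditorKit.ParserCallback` and `ParserDelegator`.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not the finder's real code): collects the raw,
// un-normalized HREF value of every <a> tag the parser reports.
final class LinkCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                hrefs.add(href.toString());
            }
        }
    }

    // Parses an HTML string and returns the HREFs in document order.
    static List<String> extract(String html) {
        LinkCollector collector = new LinkCollector();
        try (Reader reader = new StringReader(html)) {
            // 'true' tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(reader, collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.hrefs;
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"/d2r/\">home</a>"
                + "<a href=\"../index.html\">up</a></body></html>";
        System.out.println(extract(page));
    }
}
```

The values that come out are exactly what the page author typed, which is why the normalization step described next is needed.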

But URLs in an HREF can be either relative or absolute. Relative URLs can be absolute within the site (e.g., /d2r/) or relative to the current page (e.g., ../index.html). Absolute URLs vary in form even if they point at the same thing (e.g., www.dynamicobjects.com and dynamicobjects.com), and they can also use an IP address in place of a host name.

Then there are parsing errors, and URLs that are malformed but may be "recoverable" through fancier parsing; I decided to ignore those for the moment. Another decision was to ignore IP URLs (i.e., not attempt to match them with the site), but this is an easy fix and not that critical, I think: no weblogs that I know of use IP URLs for permalinks.

For separating the pieces I'm using elements of the following regular expression:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
This is the expression given in Appendix B of the URI RFC; I apply it with the java.util.regex.* classes.
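As a rough sketch of how that expression maps onto the pieces, the snippet below runs it through `java.util.regex` and reads the capture groups the RFC assigns to each component (2 = scheme, 4 = authority/hostinfo, 5 = path, 7 = query, 9 = fragment). The `UriParts` class and its field names are hypothetical, chosen only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder for the five components; a component that is
// absent from the input is left as null (the path may be empty instead).
final class UriParts {
    // The Appendix B expression, with backslashes doubled for a Java string.
    private static final Pattern URI_PATTERN = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    final String scheme;
    final String authority;
    final String path;
    final String query;
    final String fragment;

    private UriParts(String scheme, String authority, String path,
                     String query, String fragment) {
        this.scheme = scheme;
        this.authority = authority;
        this.path = path;
        this.query = query;
        this.fragment = fragment;
    }

    static UriParts split(String reference) {
        Matcher m = URI_PATTERN.matcher(reference);
        if (!m.matches()) {
            // Can happen, for example, when the reference contains a line break.
            throw new IllegalArgumentException("Not a URI reference: " + reference);
        }
        return new UriParts(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9));
    }

    public static void main(String[] args) {
        UriParts p = split("http://www.dynamicobjects.com/d2r/index.html?x=1#top");
        System.out.println(p.scheme + " | " + p.authority + " | " + p.path
                + " | " + p.query + " | " + p.fragment);
    }
}
```

One thing worth noticing: a bare `www.dynamicobjects.com` with no scheme comes back as a path, not as a host, so the expression only separates the pieces and does not decide what they mean.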

Here's how I'm normalizing URLs at the moment.

For absolute URLs, the main case that has to be disambiguated is something.com vs. www.something.com, which in the vast majority of cases applies only to the root domain (i.e., www.something.com might point to something.com, but for other.something.com a www.other.something.com probably won't exist at all). So there I'm doing a couple of checks and converting something.com to www.something.com when necessary.
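The exact checks aren't spelled out above, so the following is only one guess at what such a rule could look like: treat a host with exactly two labels as a root domain and prefix it with www, and leave everything else alone. The `HostNormalizer` name and the two-label heuristic are assumptions made for this sketch; the heuristic is known to be too naive for domains such as example.co.uk.

```java
import java.util.Locale;

// Hypothetical sketch of a "www" canonicalization rule. The two-label
// test is an assumption for illustration, not the finder's actual check.
final class HostNormalizer {
    static String normalize(String host) {
        String h = host.toLowerCase(Locale.ROOT);

        // Leave dotted-quad IP addresses untouched.
        if (h.matches("\\d{1,3}(\\.\\d{1,3}){3}")) {
            return h;
        }

        // Exactly one dot means two labels, e.g. something.com.
        long dots = h.chars().filter(c -> c == '.').count();
        if (dots == 1 && !h.startsWith("www.")) {
            return "www." + h;
        }

        // Subdomains (other.something.com) and hosts that already
        // start with www are returned as they are.
        return h;
    }

    public static void main(String[] args) {
        System.out.println(normalize("dynamicobjects.com"));
        System.out.println(normalize("other.something.com"));
    }
}
```

Picking the www form as the canonical one matches the direction of the conversion described above; the opposite choice would work just as well as long as it is applied consistently.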

For relative URLs I use the current site and page to generate the full URI. With the full URI in hand, I then normalize the path when the reference is relative to the current page (e.g., ../index.html), so that segments like ".." are resolved away.
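For comparison, the JDK's `java.net.URI` class can do this kind of resolution on its own. The sketch below is not how the finder does it (that is built on the regular expression above); it just shows the expected results for the two kinds of relative reference. `LinkResolver` and the page URL used in `main` are made up for the example.

```java
import java.net.URI;

// Hypothetical helper showing relative-reference resolution with java.net.URI.
final class LinkResolver {
    // Resolves href against the URL of the page it was found on and
    // collapses "." and ".." path segments. URI.create throws
    // IllegalArgumentException for malformed input, so callers that
    // want to skip bad links need to catch that.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).normalize().toString();
    }

    public static void main(String[] args) {
        // The page URL below is an invented example path.
        String page = "http://www.dynamicobjects.com/d2r/archives/page.html";
        System.out.println(resolve(page, "../index.html")); // relative to the page
        System.out.println(resolve(page, "/d2r/"));         // absolute within the site
    }
}
```

Note that this handles only the path side of the problem; the something.com vs. www.something.com question still needs its own rule.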

This solution is far, far from perfect (I don't even want to think about how many special cases I'm not covering), but it's good enough for now.

I'm now working on the algorithm to find conversations. Getting close!

Categories: soft.dev
Posted by diego on December 2, 2004 at 12:17 PM
