d2r diego's weblog
conversation finder, part 3

The conversation finder saga continues! :) (Parts one, two)

The parser's done, at least in basic form. Both parser and bot seem to be running and playing along nicely. I created a simple conversation finder site to have a fixed point of reference for all this, particularly for the bot, which should start showing up in some logs any minute now. I am keeping it under control by manually specifying which sites it can download (I know this isn't scalable, but it's an easy fix once the rest is done), and at the moment only three are active: my weblog, Don's and Tim's. Since the core of the idea came from a conversation Don was having with Tim, I'll use that as the "index case". If the finder actually finds conversations there, I'll start expanding the field, or possibly add a form so that others can "activate" spidering of a site. But that's for later.

Now, regarding the parser, a few interesting things. As it turns out, parsing itself was a lot simpler than interpreting the information. I am using the HTML parser in the JDK's HTMLEditorKit, which is actually quite easy to use: just define a Callback and specify what to do with each tag opening, closing, etc. But for the algorithm I'm using links between pages, which sounds simple enough... until you realize that links come in many shapes and sizes.

Normalization of links into full URIs took a bit of figuring out. What I needed to do was, starting from any possible HREF, end up with a full URI of the form scheme + hostinfo + path + query + fragment (hostinfo actually has components, but let's leave that aside for now). But URLs in an HREF can be either relative or absolute. Relative URLs can be absolute within the site (e.g., /d2r/) or relative to the current page (e.g., ../index.html). Absolute URLs vary in form even if they point at the same thing (e.g., www.dynamicobjects.com and dynamicobjects.com), and you can also use IP addresses.
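The link-extraction step described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the bot's actual code: the class name `LinkExtractor`, the method `extractLinks`, and the example page URL are all made up for the example. It uses `HTMLEditorKit.ParserCallback` to pick up each HREF and `java.net.URI.resolve` to turn site-absolute and page-relative references into full URIs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the links of one page as full URIs.
class LinkExtractor {

    // Parses the HTML and resolves every <a href="..."> against the page's own URI.
    static List<URI> extractLinks(String html, URI base) {
        List<URI> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.A) {
                    return;
                }
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href == null) {
                    return;
                }
                try {
                    // resolve() handles "/d2r/" and "../index.html" alike,
                    // and normalizes the resulting path.
                    links.add(base.resolve(href.toString().trim()));
                } catch (IllegalArgumentException e) {
                    // Malformed HREF: skipped here, with no attempt at recovery.
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        // Assumed page location, for illustration only.
        URI page = URI.create("http://www.dynamicobjects.com/d2r/archives/example.html");
        String html = "<a href=\"/d2r/\">home</a> <a href=\"../index.html\">index</a>";
        for (URI link : extractLinks(html, page)) {
            System.out.println(link);
        }
    }
}
```

Note that `URI.resolve` does nothing about the www.something.com vs. something.com ambiguity or about IP-based hosts; those still need separate handling.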
Then there are parsing errors, and URLs that are malformed but may be "recoverable" through fancier parsing, but I decided to ignore those for the moment. Another decision was to ignore IP URLs (i.e., not attempt to match them with the site), but this is an easy fix and not that critical, I think: no weblogs that I know of use IP URLs for permalinks.

For separating the pieces I'm using elements of the regular expression specified in Appendix B of the URI RFC, with the java.util.regex.* classes.
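For reference, here is a small sketch of splitting a reference into its pieces with the Appendix B expression and `java.util.regex`. The pattern is the one published in Appendix B of RFC 2396 (RFC 3986 carries it over unchanged); the class name `UriSplitter` and the method `split` are hypothetical, and the post does not say which groups the finder actually uses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the regular expression from Appendix B of the URI RFC.
class UriSplitter {

    // Groups: 2 = scheme, 4 = authority (hostinfo), 5 = path, 7 = query, 9 = fragment.
    private static final Pattern URI_PARTS = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Returns {scheme, authority, path, query, fragment}; components that are absent are null.
    static String[] split(String reference) {
        Matcher m = URI_PARTS.matcher(reference);
        if (!m.matches()) {
            // Only happens for input containing line terminators.
            throw new IllegalArgumentException("Not a single-line URI reference: " + reference);
        }
        return new String[] { m.group(2), m.group(4), m.group(5), m.group(7), m.group(9) };
    }

    public static void main(String[] args) {
        String[] names = { "scheme", "authority", "path", "query", "fragment" };
        String[] parts = split("http://www.dynamicobjects.com/d2r/?x=1#top");
        for (int i = 0; i < parts.length; i++) {
            System.out.println(names[i] + " = " + parts[i]);
        }
    }
}
```

A relative reference such as ../index.html comes back with a null scheme and authority and everything in the path, which is one simple way to tell relative from absolute HREFs. A bare www.dynamicobjects.com with no scheme also lands entirely in the path, so it needs special handling.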
Here's how I'm normalizing URLs at the moment. For absolute URLs, the main case that has to be disambiguated is something.com vs. www.something.com, which in the vast majority of cases applies only to the root domain (i.e., www.something.com might point to something.com, but www.other.something.com probably won't exist at all for other.something.com). So there I'm doing a couple of checks and converting something.com to www.something.com when necessary. For relative URLs I use the current site and page to generate the full URI. With the full URI, I then normalize the path when the reference is relative-relative (e.g., ../index.html).

This solution is far, far from perfect (I don't even want to think about how many special cases I'm not covering), but it's good enough for now.

I'm now working on the algorithm to find conversations. Getting close!

Categories: soft.dev
Posted by diego on December 2 2004 at 12:17 PM

Comments (please see the comments & trackback policy).
Looks like you are having fun, Diego.
Posted by: Don Park at December 2, 2004 5:36 PM

Rather than finding the conversations ... what about having them come to you? I know that finding it is part of your fun, Diego, but what about something like Ping-O-Matic telling you, "Yeah, we have an entry," which would tell you the URL to go off and scrape. You scrape the URL, sense if there are links, and go from there. That'd allow you to bypass Technorati and Feedster. But with P-O-M, you'll get killed if you can't handle the load. ;) Just tossing you an idea. GFM
Posted by: Geof F. Morris at December 2, 2004 6:18 PM

Copyright © Diego Doval 2002-2007.
