d2r diego's weblog
conversation engine: the next step

Since the recent integration of Feedster results into the conversation engine, I stopped coding for a bit, and while doing other stuff I've been thinking about how to make it more scalable, covering more weblogs and not wasting resources looking at pages with no meaning (read: make it more useful). In short: how to solve the problems I mentioned in that entry.

The crucial problem is that Feedster provides only part of the picture. Scott Rafer (Feedster CEO) mentioned in the comments that I could use the Feedster links output, which provides a list of the references to a particular weblog. This doesn't quite do what I need, however. The reason is simple: Feedster indexes RSS feeds, not entire sites, so if someone is providing summary feeds, Feedster will not be able to find links between weblogs, even if they exist. Because many, many weblogs provide summary feeds, it is clear that the only way to get the links between entries is to get the actual contents of the HTML page.

But what I can do is use Feedster as the source for the list of pages to index. Right now I am indexing everything on a given website. This has two drawbacks. First, I am forced to download, store, and analyze waaay more content than I need (which accounts for the small number of sites the bot is crawling at the moment), particularly when weblogs point to other parts of a site, including Wikis, dynamic apps, etc. Second, it slows down the processing for conversations, which depends on walking the link graph between two sites. This is a problem now, but if I move in the direction of adding multiple-participant conversations (as Don suggests in a comment to my previous conv. engine post, linked above) then this will be even more important.

So. Next step, then, is to use Feedster as the data source for the entries of a given weblog. Then download/process the pages for each entry's permalink.
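
As a rough illustration of that download/process step, here is a minimal Python sketch. It assumes the list of entries arrives as a plain RSS 2.0 document (the post does not say what Feedster's output actually looks like), and every name in it (`entry_permalinks`, `outbound_links`, `link_graph`, `fetch`) is hypothetical rather than part of the conversation engine. It only shows the general idea: take the permalinks from the feed, look at each entry's HTML page, and keep the links that point to other sites.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET


def entry_permalinks(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document (assumed format)."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def outbound_links(page_url, html):
    """Return absolute http(s) links in `html` that leave the host of `page_url`."""
    collector = _LinkCollector()
    collector.feed(html)
    own_host = urlparse(page_url).netloc
    result = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc != own_host:
            result.append(absolute)
    return result


def link_graph(rss_xml, fetch):
    """Map each entry permalink to the off-site links found on its page.

    `fetch` is any callable that takes a URL and returns that page's HTML
    as a string; it is a placeholder for whatever downloader is in use.
    """
    return {url: outbound_links(url, fetch(url))
            for url in entry_permalinks(rss_xml)}
```

Passing the downloader in as `fetch` keeps the sketch independent of any particular HTTP client; only the entry pages listed in the feed are ever requested, instead of everything on the site.
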
Then analyze that and combine the results with the Feedster information.

Stay tuned! More in the next few days.

Categories: soft.dev
Posted by diego on December 7 2004 at 10:36 AM

Comments (please see the comments & trackback policy).
A blog's RSS feed typically contains the contents of the blog's main page. Minus a pocketful of the usual gotchas, you can compare the two and apply the resulting content extraction template to the rest of the pages. The fact that RSS enables 'intelligent AND adaptive' web page scraping is under-appreciated IMHO.
Posted by: Don Park at December 7, 2004 3:56 PM

Don, sorry but I'm not sure I follow how the RSS of the index applies to the rest of the pages. The problem is that the links within the feed (the permalinks, I mean), which allow for more intelligent scraping, really don't say that much about the pattern to follow to define what is an entry as opposed to, say, an archive. Just as one example, many weblogs with WordPress use index.php with parameters to give index, archives, and posts, all with the same base URL but a different query section. And that isn't counting homegrown systems... so although in some cases it might help, it seems to me that it's not easy to generalize from there. Or am I missing something? Maybe you had another example in mind...
Posted by: Diego at December 7, 2004 4:03 PM

Sorry to confuse you Diego. I was referring to RSS feeds of blogs, not Feedster feeds. Since the same posts are present in both the RSS and HTML resources, you can use a blog's RSS feed to identify enough salient points to aid in parsing the rest of the blog's web pages.
Posted by: Don Park at December 7, 2004 4:48 PM

I think typed links would make for a much richer conversation picture: imagine if the engine had information that this post agrees, disagrees, or brings a new angle to the discussion. More about it in I gathered some links about it at It is also worth noting that the WWW proposal had a very good example of the use of typed links

Copyright © Diego Doval 2002-2007.
