atomflow


I'll probably continue babbling about EuroFoo-related ideas for a week (or a month), but this is a good one to start with.

First there was a conversation with Matt and Ben on Friday night regarding syndication and Atom. Matt was describing something he had written about a few days ago: how he'd like to have a sort of "atom storage/query core" that would allow you to a) add, remove and update entries and then b) query those entries (by post ID, date, etc., to which I immediately added keyword-based queries in my head).
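To make the idea a bit more concrete, here's a minimal sketch in Python of what such a core boils down to. Everything in it (the `Entry` and `EntryStore` names, the in-memory dict, the method names) is made up for illustration; it is not how atomflow or Matt's proposal actually does it.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Entry:
    entry_id: str
    title: str
    created: date
    content: str = ""


class EntryStore:
    """In-memory stand-in for the storage/query core (hypothetical API)."""

    def __init__(self):
        self._entries = {}

    def add(self, entry):
        # Adding an entry whose ID already exists acts as an update.
        self._entries[entry.entry_id] = entry

    def remove(self, entry_id):
        self._entries.pop(entry_id, None)

    def by_id(self, entry_id):
        return self._entries.get(entry_id)

    def by_date_range(self, start, end):
        # Inclusive on both ends, oldest first.
        hits = [e for e in self._entries.values() if start <= e.created <= end]
        return sorted(hits, key=lambda e: e.created)

    def by_keywords(self, *keywords):
        # An entry matches when every keyword appears in its title or content.
        def matches(e):
            text = (e.title + " " + e.content).lower()
            return all(k.lower() in text for k in keywords)

        return [e for e in self._entries.values() if matches(e)]
```

A real core would persist the entries and index them, but the surface area stays about this small.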

Matt had two points.

One, that by using Atom as input format, you could simplify entry into this black-box system and use it, for example, on the receiving end of a UNIX pipe. Content on the source could be either straight Atom or come in some other form that would require transforming it into Atom, but that'd be easy to do, since transforming XML is pretty easy these days.

Two, that by using Atom as the output format you'd have the same flexibility. To generate a feed if you wanted, or transform it into something else, say, a weblog.

(This is, btw, my own reprocessing of what was said; Matt's entry, to which I linked above, is a more thorough description of these ideas.)

So, for example, you'd have a command line that would look like this for storage (starting from a set of entries in text format):

cat entries.txt | transform-to-atom | store

and then you'd be able to do something like

retrieve --date-range 2004-04-08 2004-04-09 | transform-to-html

Now, here's what's interesting. I have of course been using pipes for years. And yet the power and simplicity of this approach had simply not occurred to me at all. I have been so focused on end-user products for so long that my thoughts naturally move to complex uber-systems that do everything in an integrated way. But that is overkill in this case.
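A stage like transform-to-atom can be very small. Here's a rough Python sketch, assuming (my assumption, nothing more) that entries.txt holds one entry per line as date, title and body separated by tabs, and using the Atom 0.3 namespace and its "created" element. The function name and the input format are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://purl.org/atom/ns#"  # Atom 0.3 namespace


def text_to_atom(lines):
    """Turn 'date<TAB>title<TAB>body' lines into an Atom feed string."""
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(f"{{{ATOM_NS}}}feed", {"version": "0.3"})
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        created, title, body = line.split("\t", 2)
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
        ET.SubElement(entry, f"{{{ATOM_NS}}}created").text = created
        ET.SubElement(entry, f"{{{ATOM_NS}}}content").text = body
    return ET.tostring(feed, encoding="unicode")
```

To use it as a pipe stage you'd finish the script with `sys.stdout.write(text_to_atom(sys.stdin))` (after `import sys`) and put it in the middle of the command line. The output is only a skeleton: a complete feed also needs IDs, links and feed-level metadata.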

I kept thinking that a system like that shouldn't be too hard to do.

So far so good.

Then on Saturday Ben gave a talk on "Better living through RSS" where he described some of the ways in which he uses feeds to create personal information flows that he can then read in a newsreader, and other ways in which content can be syndicated to be reprocessed and reused.

This fit perfectly with what Matt had been talking about on Friday. During the talk and after we talked a bit more about it, and by early afternoon I simply couldn't wait. So I sat down for an hour or so and coded it. By Saturday night it was pretty much ready, but by then there were other things going on so I let it be. I nibbled at it yesterday and today, improving the command line parameters to make it more usable, and now it's good enough to release.

More importantly, I settled on a name: atomflow. :)

But of course, what would be the point of releasing something without a good use example?

So after another bit of thinking I realized there was something that I thought would be a good example for atomflow, for, well, educational purposes.

News.com provides feeds, but they don't have the whole stories. Sometimes, after some time has passed, the story becomes unavailable (I'm not sure when that happens, but it has happened to me). Other times you want to look for a particular story, based on keywords, but within a certain timeframe, to avoid getting dozens of irrelevant results from more recent news items.

So I built a scraper for News.com that takes the stories on their front page and outputs them to stdout as Atom entries. Then I pipe the result into atomflow, which lets me query the result any way I need and, more interestingly, pipe the output into further atomflow calls that narrow down the content when the query parameters are not enough.
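The "narrowing down" part works because every stage speaks Atom on both ends. As an illustration only (this is not atomflow's code), a narrowing stage could look like the following Python sketch: it takes a feed, keeps the entries whose title, summary or content mention every keyword, and returns a feed again. The function name and the Atom 0.3 namespace are assumptions on my part.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://purl.org/atom/ns#"  # Atom 0.3 namespace
SEARCHED = ("title", "summary", "content")


def filter_feed(feed_xml, keywords):
    """Return the feed with only the entries that mention every keyword."""
    ET.register_namespace("", ATOM_NS)
    feed = ET.fromstring(feed_xml)
    wanted = [k.lower() for k in keywords]
    # findall returns a list, so removing entries while looping is safe.
    for entry in feed.findall(f"{{{ATOM_NS}}}entry"):
        text = " ".join(
            "".join(el.itertext())
            for name in SEARCHED
            for el in entry.findall(f"{{{ATOM_NS}}}{name}")
        ).lower()
        if not all(k in text for k in wanted):
            feed.remove(entry)
    return ET.tostring(feed, encoding="unicode")
```

Since the output is itself a feed, the same stage can be chained as many times as needed, each call with a different set of keywords.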

So, without further ado, here's the first release: atomflow.zip.

There's a README.txt in the ZIP file that explains how to use it, but here's an example:

To add items:

java -jar newscraper.jar | java -jar atomflow.jar -add -storeLocation /tmp/atomflowstore -input stdio -type feed

The previous command, run at intervals (say, every twelve hours) will "inject" the latest stories into the local database. Then, at intervals (or at your leisure! :)) the database can be queried, for example, by doing:

java -jar atomflow.jar -storeLocation /tmp/atomflowstore -query "latest 3"

which will return an Atom feed with the latest 3 items.

The following are the query parameters currently supported:

<tag> is an Atom date field, e.g. "created"
--query "range <tag> 2004-04-20 2004-04-21" //date-range
--query "day <tag> 2004-04-20" //day
--query "week <tag> 2004-04-20" //week (starting on date)
--query "month <tag> 2004-04" //month (starting on date)

looks in summary, title and content
--query "keywords <keyw1 [keyw2 ... keywn]> [n]"

tag here must be one of [content|content/type|author/name|author/email|title|summary]
--query "keywords-tag <tag> <keyw1 [keyw2 ... keywn]> [n]"
--query "id <id>"
--query "latest <n>"

additionally, you can add a --sort parameter as
--sort <created|issued|modified> [false|true]
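For the date queries, each variant comes down to a window of days. Here's how I'd sketch that mapping in Python. The inclusive end dates and the seven-day week are my reading of the listing above, not something taken from atomflow's sources, and `query_window` is a made-up name.

```python
from datetime import date, timedelta


def query_window(kind, start_text, end_text=None):
    """Return (first_day, last_day), both inclusive, for a date query."""
    if kind == "month":
        year, month = (int(p) for p in start_text.split("-"))
        first = date(year, month, 1)
        # First day of the following month, minus one day.
        next_first = date(year + (month == 12), month % 12 + 1, 1)
        return first, next_first - timedelta(days=1)
    first = date.fromisoformat(start_text)
    if kind == "day":
        return first, first
    if kind == "week":
        return first, first + timedelta(days=6)  # seven days starting on the date
    if kind == "range":
        return first, date.fromisoformat(end_text)
    raise ValueError(f"unknown query kind: {kind}")
```

With that, `query_window("week", "2004-04-20")` covers April 20 through April 26, and a stored entry matches when the chosen date field falls inside the window.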

So that's it for now. Obviously there's more to say about how the apps work, their options, and certain sticky issues that might not be so easy to solve cleanly (e.g., assigning IDs when entries don't have them: currently atomflow creates an ID itself, but in most cases you'll want to specify a certain format). Also, the "scraping" of News.com is done as a quick-and-dirty example. In a few cases the parsing of a story will fail, and the scraper will simply skip that story and move on (the sources tell the full story of how ugly that parsing is!).
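On the ID issue, one option (just a sketch of a possible approach, not what atomflow currently does) is to derive the ID from something stable, such as the story's URL, so that importing the same story twice yields the same ID. The Python below assumes the Atom 0.3 namespace; `ensure_id` and `fallback_key` are made-up names.

```python
import uuid
import xml.etree.ElementTree as ET

ATOM_NS = "http://purl.org/atom/ns#"  # Atom 0.3 namespace


def ensure_id(entry, fallback_key):
    """Give the entry an <id> if it lacks one; return the ID text."""
    existing = entry.find(f"{{{ATOM_NS}}}id")
    if existing is not None and (existing.text or "").strip():
        return existing.text.strip()
    # uuid5 is deterministic: the same key always yields the same ID,
    # so re-importing a story does not create a duplicate.
    new_id = uuid.uuid5(uuid.NAMESPACE_URL, fallback_key).urn
    if existing is None:
        existing = ET.SubElement(entry, f"{{{ATOM_NS}}}id")
    existing.text = new_id
    return new_id
```

Entries that already carry an ID are left untouched, and swapping the `uuid5` line for another scheme is how you'd plug in a custom ID format.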

Anyway, comments and questions most welcome. I hope this can be useful; it was certainly a blast to do. :)

Update (24/08): The first rev of two new atomflow-tools, atomflow-spider and atomflow-templates released.

PS: I still have the comment script disabled, pending a solution (at least partial) to comment spam. In the meantime, please send me email (address is on my about page).

PS 2: atomflow uses Lucene for indexing and kxml2 for XML parsing. (I repackaged the binaries into a single JAR to make it easier to run.)

Categories: soft.dev
Posted by diego on August 23 2004 at 10:55 PM

Copyright © Diego Doval 2002-2011.