Now blogging at diego's weblog. See you over there!

hey! it's that guy!

Random link of the day: Hey! It's that guy!, which tracks a number of actors and actresses who are (generally) instantly recognizable on screen and yet few can name. I could identify a good number of them, but when I couldn't, clicking on an unfamiliar name and seeing a familiar face was a strange experience. Recommended. :)

Posted by diego on August 26, 2004 at 2:08 PM

atomflow-templates and atomflow-spider

Before I start, there's been a good number of atomflow-related entries in the last day. To mention a few: Matt has explained further many of his ideas, as has Ben. Michael has more thoughts on it as well as links to other related tools. Matthew and Frank also added to the conversation, as did Danny and Grant.

Okay, back to the actual topic of this post.

Another hour-long session of hacking and there are two new tools in the atomflow package (download): atomflow-templates and atomflow-spider.


atomflow-spider is a simple spidering program that outputs the contents downloaded from a URL to the standard output. There are a number of other programs that do this already (wget and curl being the most prominent) but a similar tool is included with atomflow for completeness, particularly for platforms that don't have wget (e.g., Windows installs without cygwin or similar). Plus, it's good practice (for me) to keep thinking along the lines of simple, loosely coupled components that do one thing well.

The spider's commandline parameters are as follows:

java -jar atomflow-spider.jar -url <URL> [-prefsFile <PATH_TO_PREFS_FILE>]

The -prefsFile parameter is optional. When used, the preferences file stores ETag and Last-Modified information for the URL, to minimize downloads when the content hasn't changed (useful for RSS feeds; I'm not sure whether other command-line tools support this, but I don't think it's all that common).

Additionally, the spider supports GZIP- and Deflate-compressed content to speed up downloads.
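The spider itself is Java and its sources aren't shown here, but the conditional-GET trick it uses is easy to sketch. Here's a rough Python version (my own sketch, not atomflow's code: the JSON prefs-file format and the function names are made up) that stores each URL's ETag and Last-Modified values and sends them back as If-None-Match / If-Modified-Since, also asking for gzip-compressed responses:

```python
import gzip
import json
import os
import urllib.error
import urllib.request

def load_prefs(path):
    """Load stored ETag/Last-Modified values, keyed by URL."""
    if path and os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def conditional_headers(prefs, url):
    """Build the validators for a conditional GET from stored prefs."""
    headers = {"Accept-Encoding": "gzip"}
    entry = prefs.get(url, {})
    if "etag" in entry:
        headers["If-None-Match"] = entry["etag"]
    if "last_modified" in entry:
        headers["If-Modified-Since"] = entry["last_modified"]
    return headers

def fetch(url, prefs_file=None):
    """Return the response body, or None on 304 Not Modified."""
    prefs = load_prefs(prefs_file)
    req = urllib.request.Request(url, headers=conditional_headers(prefs, url))
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None  # content unchanged since the last fetch
        raise
    body = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)
    # Remember the validators the server sent for next time.
    validators = {"etag": resp.headers.get("ETag"),
                  "last_modified": resp.headers.get("Last-Modified")}
    prefs[url] = {k: v for k, v in validators.items() if v}
    if prefs_file:
        with open(prefs_file, "w") as f:
            json.dump(prefs, f)
    return body
```

Run at intervals, a fetcher like this only downloads the full feed when it has actually changed, which is exactly what you want when polling RSS from cron.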


atomflow-templates is the beginning of a templating system that can be used to transform content into (and eventually out of) atomflow, through pipes. This version supports only RSS-to-Atom conversion (basically all RSS formats are supported). I think this is a pretty important basic tool for the package, since there's lots of content out there in RSS format.

atomflow-templates reads from standard input and writes to standard output. Currently it is run as follows:

java -jar atomflow-templates.jar -input rss -output atom
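As a rough illustration of the kind of mapping such a converter performs (this is not atomflow-templates' actual code; the element choices and the Atom 0.3 namespace are my assumptions), a minimal RSS 2.0-to-Atom transform might look like:

```python
import xml.etree.ElementTree as ET

# Atom 0.3 namespace, current when this was written; Atom 1.0 later
# settled on "http://www.w3.org/2005/Atom".
ATOM_NS = "http://purl.org/atom/ns#"

def rss_to_atom(rss_text):
    """Map the common RSS 2.0 item fields onto Atom entries."""
    ET.register_namespace("", ATOM_NS)
    channel = ET.fromstring(rss_text).find("channel")
    feed = ET.Element("{%s}feed" % ATOM_NS)
    ET.SubElement(feed, "{%s}title" % ATOM_NS).text = channel.findtext("title", "")
    for item in channel.findall("item"):
        entry = ET.SubElement(feed, "{%s}entry" % ATOM_NS)
        ET.SubElement(entry, "{%s}title" % ATOM_NS).text = item.findtext("title", "")
        ET.SubElement(entry, "{%s}link" % ATOM_NS).set("href", item.findtext("link", ""))
        ET.SubElement(entry, "{%s}content" % ATOM_NS).text = item.findtext("description", "")
    return ET.tostring(feed, encoding="unicode")
```

Wrapped so it reads the RSS from standard input and writes the Atom result to standard output, a transform like this slots into a pipe the same way atomflow-templates does.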

atomflow-templates can, for example, be connected to atomflow-spider and then to the storage core through a cron job to monitor and store certain RSS feeds, as follows:

java -jar atomflow-spider.jar -url <URL>
| java -jar atomflow-templates.jar -input rss -output atom
| java -jar atomflow.jar -add -storeLocation <STORE_DIRECTORY> -input stdio -type feed

So that's it for tonight--between coding at work and then this, I'm all coded-out for the day :).

Posted by diego on August 24, 2004 at 11:14 PM


Last Saturday, during a conversation with Santiago, he showed me dasher running on his Linux laptop (here's an article on it from The Economist).

It is amazing. At first it is relatively weird to use, but that feeling goes away quite quickly. And then you start "typing" away without problems (I haven't tried adding a training file to it though, the basic package worked well enough for simple testing). But you won't really know until you try it. Go check it out--they have binaries for most platforms.

If this isn't squarely part of the future of typing on mobile and input-restricted devices (aside from Speech/Handwriting recognition and gestures), then I don't know what is.

Categories: technology
Posted by diego on August 24, 2004 at 6:36 PM


I'll probably continue babbling about EuroFoo-related ideas for a week (or a month), but this is a good one to start with.

First there was a conversation with Matt and Ben on Friday night regarding syndication and Atom. Matt was describing something he had written about a few days ago: how he'd like to have a sort of "atom storage/query core" that would allow you to a) add, remove and update entries and then b) query those entries (by post IDs, dates, etc.--to which I immediately added keyword-based queries in my head).

Matt had two points.

One, that by using Atom as the input format, you could simplify entry into this black-box system and use it, for example, on the receiving end of a UNIX pipe. Content at the source could be either straight Atom or come in some other form that would require transforming it into Atom, but that'd be easy to do, since transforming XML is pretty easy these days.

Two, that by using Atom as the output format you'd have the same flexibility. To generate a feed if you wanted, or transform it into something else, say, a weblog.

(This is, btw, my own reprocessing of what was said; Matt's entry, linked above, is a more thorough description of these ideas.)

So, for example, you'd have a command line that would look like this for storage (starting from a set of entries in text format):

cat entries.txt | transform-to-atom | store

and then you'd be able to do something like:

retrieve --date-range 2004-04-08 2004-04-09 | transform-to-html

Now, here's what's interesting. I have of course been using pipes for years. And yet the power and simplicity of this approach had simply not occurred to me at all. I have been so focused on end-user products for so long that my thoughts naturally move to complex uber-systems that do everything in an integrated way. But that is overkill in this case.

I kept thinking that a system like that shouldn't be too hard to do.

So far so good.

Then on Saturday Ben gave a talk on "Better living through RSS" where he described some of the ways in which he uses feeds to create personal information flows that he can then read in a newsreader, and other ways in which content can be syndicated to be reprocessed and reused.

This fit perfectly with what Matt had been talking about on Friday. During the talk and after it we talked a bit more, and by early afternoon I simply couldn't wait. So I sat down for an hour or so and coded it. By Saturday night it was pretty much ready, but by then there were other things going on so I let it be. I nibbled at it yesterday and today, improving the command-line parameters to make it more usable, and now it's good enough to release.

More importantly, I settled on a name: atomflow. :)

But of course, what would be the point of releasing something without a good use example?

So after a bit more thinking I realized there was a news site that would be a good example for atomflow, for, well, educational purposes. It provides feeds, but they don't carry the whole stories. Sometimes, after some time has passed, a story becomes unavailable (I'm not sure when that happens, but it has happened to me). Other times you want to look for a particular story, based on keywords, but within a certain timeframe, to avoid getting dozens of irrelevant results from more recent news items.

So I built a scraper for the site that takes the stories on its front page and outputs them to stdout as Atom entries. Then I pipe the result into atomflow, which allows me to query the result any way I need, and, more interestingly, to chain subsequent atomflow calls that can narrow down content when the query parameters are not enough.

So, without further ado, here's the first release:

There's a README.txt in the ZIP file that explains how to use it, but here's an example:

To add items:

java -jar newscraper.jar | java -jar atomflow.jar -add -storeLocation /tmp/atomflowstore -input stdio -type feed

The previous command, run at intervals (say, every twelve hours), will "inject" the latest stories into the local database. Then, at intervals (or at your leisure! :)) the database can be queried, for example, by doing:

java -jar atomflow.jar -storeLocation /tmp/atomflowstore -query "latest 3"

which will return an Atom feed with the latest 3 items.

The following are the query parameters currently supported:

<tag> in the date queries below is an Atom date field, e.g. "created"
--query "range <tag> 2004-04-20 2004-04-21" //date-range
--query "day <tag> 2004-04-20" //day
--query "week <tag> 2004-04-20" //week (starting on date)
--query "month <tag> 2004-04" //month (starting on date)

looks in summary, title and content
--query "keywords <keyw1 [keyw2 ... keywn]> [n]"

tag here must be one of [content|content/type|author/name|author/email|title|summary]
--query "keywords-tag <tag> <keyw1 [keyw2 ... keywn]> [n]"
--query "id <id>"
--query "latest <n>"

additionally, you can add a --sort parameter as
--sort <created|issued|modified> [false|true]
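To make the query semantics concrete, here's a hypothetical in-memory sketch (atomflow actually sits on a Lucene index; the entry representation, the function name, and the omission of the optional result-count argument and of the --query/--sort wrappers are all my simplifications) interpreting a few of the forms above:

```python
from datetime import date

def run_query(entries, query):
    """Interpret a few of the query forms above over an in-memory list.

    Each entry is a dict with a 'created' date plus 'title', 'summary'
    and 'content' strings. The optional trailing result count [n] and
    the remaining query forms are omitted to keep the sketch short.
    """
    parts = query.split()
    if parts[0] == "latest":
        n = int(parts[1])
        return sorted(entries, key=lambda e: e["created"], reverse=True)[:n]
    if parts[0] == "range":
        tag = parts[1]  # an Atom date field, e.g. "created"
        lo, hi = (date.fromisoformat(p) for p in parts[2:4])
        return [e for e in entries if lo <= e[tag] <= hi]
    if parts[0] == "keywords":
        words = [w.lower() for w in parts[1:]]
        # "keywords" looks in summary, title and content, per the list above.
        def text(e):
            return (e["title"] + " " + e["summary"] + " " + e["content"]).lower()
        return [e for e in entries if any(w in text(e) for w in words)]
    raise ValueError("unsupported query: " + query)
```

The day/week/month forms are just sugar over the same date-range filtering, which is why the list above groups them together.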

So that's it for now. Obviously there's more information on how the apps work, their options, and certain sticky issues that might not be so easy to solve cleanly (e.g., assigning IDs when entries don't have them; currently atomflow creates an ID itself, but in most cases you'll want to specify a certain format). Also, the "scraping" of the site is done as a quick-and-dirty example. In a few cases the parsing will fail, and the scraper will simply ignore the entry and move on (the sources tell the full story of how ugly that parsing is!).

Anyway, comments and questions most welcome. I hope this can be useful, it was certainly a blast to do it. :)

Update (24/08): The first rev of two new atomflow tools, atomflow-spider and atomflow-templates, has been released.

PS: I still have the comment script disabled, pending a solution (at least partial) to comment spam. In the meantime, please send me email (address is on my about page).

PS 2: atomflow uses lucene for indexing and kxml2 for XML parsing. (I repackaged the binaries into a single JAR to make it easier to run it.)

Posted by diego on August 23, 2004 at 10:55 PM


Damn, I'm still in Wiki mode.

Which is, by the way, more than a pun. For some reason I keep thinking of EuroFoo as a sort of meatspace Wiki. A space where people come and go, asking questions, sharing ideas, showing what they've done or are doing or are about to do, and having fun while at it.

Backtracking: I got back yesterday night, exhausted but extremely happy. Ideas swarming in my head. The urge to code. Many, many thanks to everyone and especially Tim O'Reilly for organizing it. It was truly a great experience.

Today, mostly a day of readjustment to reality (as much as that's possible). In my head there's still an image, one I actually took a picture of (I'll look for it later): the bar at the EuroFoo hotel at 1 a.m. on Saturday, soft light reflecting warmly on the wood all around us, at least fifty people still there, the tables a sea of coffee cups, beer mugs, cables and laptops (mostly PowerBooks), with something fascinating happening at every table.

Okay. Next up: atomflow.

Categories: personal
Posted by diego on August 23, 2004 at 9:11 PM

the pendulum and the eclipse

Fascinating article in this week's Economist on an apparent gravitational anomaly (first observed on a pendulum during a Solar eclipse):

“ASSUME nothing” is a good motto in science. Even the humble pendulum may spring a surprise on you. In 1954 Maurice Allais, a French economist who would go on to win, in 1988, the Nobel prize in his subject, decided to observe and record the movements of a pendulum over a period of 30 days. Coincidentally, one of his observations took place during a solar eclipse. When the moon passed in front of the sun, the pendulum unexpectedly started moving a bit faster than it should have done.

Since that first observation, the “Allais effect”, as it is now called, has confounded physicists. If the effect is real, it could indicate a hitherto unperceived flaw in General Relativity—the current explanation of how gravity works.

True or not, it's still interesting. :) And here's a recent paper on the topic.

Categories: science
Posted by diego on August 20, 2004 at 3:47 PM

live from EuroFoo

I just realized that in the madness of the last few weeks I neglected to mention that I had been invited to EuroFoo. I got here yesterday evening after a ten-hour trip (there were no delays, but since it was bus+plane+train it was nearly inevitable that it would take a bit).

I spent some 24 hours in electronic isolation, between the trip and subsequent rest. It was good.

Now there are about 50 of us gathered in what I'd say is the restaurant of the hotel that (I heard this morning) we've basically taken over. The hardware/human ratio is something to behold. Powerbooks are the majority, or nearly so.

We should "officially" start any minute now. More in a bit.

Categories: technology
Posted by diego on August 20, 2004 at 3:41 PM

'comments off for now' cont'd.

Email I got (thanks, all!) regarding my post yesterday, comments off for now, pointed to a number of solutions, most of which I already knew. I neglected to mention those yesterday, so here they go:

  • Adrian pointed me to this entry on his blog, in which he describes a solution similar to what I was discussing yesterday--adding a field to the form, whereas what I was suggesting was to change the names of the fields that already exist. Combining extra fields with field-name changes and a script-name change should be a pretty good deterrent, I think.
  • To close comments after a period of time, Tima's mt-closure is good but as far as I can see it rebuilds every entry it changes the values on, which is a problem with my slow machine. Ideally, it would save the values and then perform a complete rebuild, similar to what Tima's own mt-rebuild does.
  • Jeremy's mysql-compatible autoclose comments script--there are a couple of others with similar functionality, and in any case I haven't been able to try them since I use Berkeley DB.
  • Some suggestions centered around mt-blacklist, but I think that "blind" (i.e., depending on blacklists that are widely distributed) blacklisting of any kind (IPs, terms, etc) is a bad, bad, bad idea, inelegant to the extreme, and not very effective. If any proof is necessary, it should be enough to note that blacklisting has been done for years on email. And we can all see how well it has worked...
  • Bayesian filters, such as mt-bayesian are another alternative but again they are error-prone.
When I read all this I realize it sounds like a long list of complaints. Clearly these solutions all have something to offer, but the reason I haven't started using the ones that have been around for a while is simply time. While I might not be sure which solution is best, I am sure that I don't want to install one that not only doesn't work all the time but then demands more work to solve the edge cases, however few they may be, or that mixes legitimate and illegitimate comments. That is, in my view, a solution should never mark a legitimate comment as spam, though it may mark a spam comment as legitimate. Closing comments is an orthogonal approach to spam detection or prevention, and I haven't used it for reasons unrelated to the script (i.e., machine speed).

Okay, enough with the rant. This is all useful information for that moment at some point in the (near?) future when I will have time to deal with this properly.

Categories: technology
Posted by diego on August 17, 2004 at 5:06 PM

Internet != internet

Wired has announced that they are ditching the capital 'I' in Internet [via Dave] as well as the capital W in Web and the capital N in Net. While I don't have an opinion about the Web and Net cases, I think that the case for Internet is different.

An internet (lowercase 'i') was initially defined to be a connected set of (potentially different) networks, with the Internet (uppercase 'i') being the internet of all internets. So Internet is a particular name for a particular abstraction. True, this definition might be of historical interest more than anything, and it points at the Internet as a construct, whereas Wired is looking at the Internet as a medium. A medium implies a certain homogeneity, which, while true in practice for most of the Internet today, is not what the initial use of the term "internet" implied, is not true for research edge networks (and some commercial networks) connected to it, and was not true at the beginning, when no single protocol had yet won over the others as a standard.

In any case, I assume that in technical terms we will continue to make the distinction, since, strictly speaking, an internet is a subset of the Internet. :)

PS: Check out the various references in the History page at the Internet Society for more detail on the early terminology and naming.

Categories: technology
Posted by diego on August 17, 2004 at 4:13 PM

comments off for now

So I had planned to blog a bit 'for real' today but maybe that time is past: I spent most of my "allotted" blogging time (and then some) deleting spam comments. Twice in the last week d2r has been under what can only be described as a massive spam attack: a bot systematically going through every page (loading it with a Mozilla client ID so that it looks as if there was a read before posting a comment, and probably scanning the page for the comment ID) and then posting garbage to up the ranking of whatever crap of the day they're selling.

The way to stop it has been to simply remove the comment script for now, while I look for a solution. I've found several, mostly directed towards MySQL backends (which I don't use; maybe I should), but whenever I try them something doesn't quite work. Also, I don't want to spend much time looking for a solution (probably switching to a faster server is part of that).

One thing I was thinking is that these scripts obviously have to rely on standard MT configurations to be effective. This means field IDs, form names, things of that nature. Before, I could stop them by changing the name of the comment script, but they have (predictably, I might add) adapted to that by scanning the page before posting. But if the comment form uses non-standard names for the fields, as well as pointing to a different URL, I think the only way to post comments would be at the webpage itself, by a person who can recognize the form, since the elements that allow scripts to recognize it automatically wouldn't be there. I recoil at the idea of digging through the MT sources to find that, but maybe I'll do it. Certainly MT could come with a screen to configure the names for your setup; that way every blog would have its own form names and format, and it would be, I think, quite difficult to post comments automatically.
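The renaming idea above is simple to sketch. Assuming a per-blog secret (the secret, the "f_" prefix, and the field list below are all made up for illustration; this is not Movable Type code), field names could be derived deterministically, so the template and the comment script agree on them while a bot expecting the standard names finds nothing to fill in:

```python
import hashlib

# SECRET, the "f_" prefix, and STANDARD_FIELDS are hypothetical,
# chosen only to illustrate the scheme.
SECRET = "change-me-per-blog"
STANDARD_FIELDS = ["author", "email", "url", "text"]

def obfuscated_name(field, secret=SECRET):
    """Map a standard field name to a stable, blog-specific one."""
    digest = hashlib.sha1((secret + ":" + field).encode()).hexdigest()
    return "f_" + digest[:8]

def field_map(secret=SECRET):
    """Obfuscated name -> standard name.

    The template renders inputs under the obfuscated names; on
    submission, the comment script inverts this map to recover the
    standard fields. A bot hard-coded to post 'author'/'text' fails.
    """
    return {obfuscated_name(f, secret): f for f in STANDARD_FIELDS}
```

Since each blog picks its own secret, a form layout harvested from one blog doesn't transfer to another, which is exactly the per-setup variation described above.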

A simple concept, but I think it should work, no?

Categories: technology
Posted by diego on August 16, 2004 at 8:57 PM

the new phone

So about a week ago we got a couple of smartphones to start doing some work on them and to understand their capabilities better: a Nokia 6600, and a Sony Ericsson P900, which I got.

While I understand the technologies, capabilities, etc., and have played with them in emulators, I hadn't actually owned a smartphone until now. I'll write a more detailed review later (I wanted to play with it for a reasonable amount of time first), but here are some first impressions.

As far as phone functionality goes, the P900 is quite good. The transition from other apps to the phone is a little awkward but manageable. It comes out of the box with mini-headphones and a microphone, which is crucial considering the phone is at first a little too big to use as a normal phone (I got used to it later, though).

Then there's the connectivity: Internet and Bluetooth (why oh why doesn't it come with WiFi!). Opera comes on the tools CD, which is nice, but you actually have to install it. Once everything is set up (and I'll dwell on that particular point in a later entry) it works fairly well, usable in many situations, but I don't even want to think how much money it costs to view a simple webpage. Using mobile versions of things like the Google for Palm site helps a lot. Too bad it's not so easy to find sites that support that (Opera, btw, is amazing at fitting regular websites into such a tiny screen without completely destroying the design). Bluetooth works fine, but it's much too slow for file transfers (this is probably due to the phone rather than Bluetooth's intrinsic speed). The phone also comes with a USB cradle that doubles as a charging station, which is also good, but the transfer speed there is bad too, which confirms that it's the phone that is the bottleneck.

Storage: this phone uses the braindead Memory Stick, and Sony in all its wisdom has released a rash of incompatible versions (Regular, Duo, Pro, Duo Pro...). The P900 supports Duo, which is limited to 128 MB--OK for a phone but not for storing media. This is quite a limitation, and an unnecessary one IMO.

Software: a decent list of basic apps, and of course access to a lot of Symbian apps. Only once, after playing with it for a while, did it complain of "low memory," even though I had very few files open. Had to be a memory leak: cue Purify for the Symbian guys. A restart cleared the problem. I also got Task Manager, which has turned out to be a critical tool for looking at running apps, closing them, etc. I don't know if there's a better solution than this.

Finally, media: a built-in camera for video and stills. Quality is decent, but I won't be leaving my digital camera behind any time soon. The phone can play MP3s, but the player software that comes with it is a joke (I haven't even found a way of creating playlists), and I haven't yet looked for better software--I don't even know if there is any. This, coupled with the slow transfers, makes it hard to use the phone as your regular MP3 player, but barring any other option it does the job.

Overall, I'm pleasantly surprised on many levels, and a little disappointed at its media-handling capabilities. Considering this is one of the (if not the) highest-end devices running Symbian, I imagine the situation is similar for most other smartphones. I am probably being unfair in comparing it with devices that only perform a certain function, though. Then again, I don't think it's meant to replace those devices yet.

I got all the necessary tools to write code for it, but I haven't done anything with them--too busy. That'll have to wait for the next few days, or even next week. :)

Categories: technology
Posted by diego on August 15, 2004 at 12:36 PM


Still a little dizzy from the seemingly sudden occurrence on Friday. Thanks to everyone who sent their congratulations through blogs (like Russ and Haiko), left comments (Frank, Chris), or emailed, IM'ed, or Voice'd :).

Took yesterday off. Second day off for the year! Whew! I'm a party animal!

Did not really pay much attention to the machine except occasionally, as I was doing some housekeeping on the drive: backing up, removing duplicate files (some of them; eventually I gave up and am now thinking of an automated solution), etc. Since I didn't really get down to it, it's still not really done. But I'm getting there. A few days ago I got a Western Digital external 160 GB drive (USB2/FireWire) to actually make the backup, and even though it's not really portable (it weighs some 4 pounds with the power adapter!) it has worked out really well so far.

I've been reading books again, too. The last 3-4 months I hadn't read a single book, and I was beginning to worry about that. But then last week I picked one up and started. Then another. As it always happens, the more I read the more I want to write, but I keep that idea in check--no time for it now. Maybe in a few weeks, after I submit the final revision for my dissertation and release the new version of share.

Another random thing that happened last week was that I was under a massive blog-spam attack. Hundreds of comments. The only recourse was to restore an old copy of the DB and rebuild (it was either that or spending hours deleting comments). As a result I am now backing up the DB more frequently and often disabling the comment script altogether, until I come up with a good solution. Maybe moving off movabletype, maybe moving to another machine (since rebuilds take so long on this one), I don't know. I am not too thrilled about the idea of wasting a day on that, but I'll have to do it at some point.

Today is digital day: reply to emails, to comments from users, reply to messages in the clevercactus forums, blog some.

More later then. :)

Categories: personal
Posted by diego on August 15, 2004 at 12:12 PM


"Dr. Diego". That honestly sounds weird to me, but it was on one of the emails I got today.

Backing up. :)

In a slightly out-of-the-blue way (like this entry), it finally happened: I defended my thesis this morning (the Viva, or Viva Voce).

I passed!

With comments, that is.

That means that the examiners give me a report with bugfixes and I re-submit version 1.01 when I'm done (the estimate is a couple of weeks). The changes are generally details, like whether a certain function is an automorphism or has to satisfy less restrictive conditions, plus one slightly bigger one where I misplaced a particular section (as well as mucking it up a bit), completely breaking the flow of the explanation and making it pretty difficult to follow. One or two examples missing. Things like that.

Since we had rescheduled it some three times by now, I didn't want to get my hopes up, but on Wednesday it was pretty much confirmed (with final confirmation last night). Two years to do it. One year (almost) to defend it. In terms of defense-wait/thesis-work ratio it's gotta be some kind of record. :)

I actually didn't know what to expect. A Viva here is quite the unstructured affair. It turned out to be a two-and-a-half-hour grilling that went through all of the work. I did have in mind which parts might be more contentious or require discussion, but I was wrong; most of the conversation centered on areas other than those I expected.

But nearing the end, a feeling of exhilaration had started to build up. Big ideas being discussed matter-of-factly, formulas bandied about. Strong arguments. Not just of one or two points on a ten-page paper, but of an entire body of work, its assumptions, its implications. What science is really about. Of course, the day-to-day grind is still necessary, but this was unexpectedly refreshing.

Now for a bit of downtime. I've been pretty much offline the past few days, too busy between work (new version of share coming next week) and the thesis.

Tired, but happy. :)

Categories: personal
Posted by diego on August 13, 2004 at 8:23 PM

lost in the datastream

It's pretty obvious that I've lost all blogflow ain't it? Well. Lots of work, on a new (and important) feature of share. What are you gonna do...

I've been prowling through cyberspace lately. During breaks, or late at night before I go to sleep. And the feeling has been growing on me that we are at a threshold, where information really begins to take shape on its own.

Cyberspace. How long has it been since we've used the term as Gibson intended? Databases rising up like skyscrapers against a virtual horizon?

And it's not the web, not really, or rather, it's not just the web. Distributed networks, things like Project Gutenberg too. The web seems too much a world of its own, disjoint, bereft of solid footing. But all taken together takes on another quality, like a foundation. Not sure foundation for what, but it's definitely happening.

I have no idea why it seems to me that way now and not, say, at the peak of the "web bubble" in 1999/2000. Maybe because the web on its own appears too much like a monoculture, something too uniform in shape to be anything other than a static repository. But now, with these massive alternate mechanisms of information flow (say, BitTorrent) things are starting to look different to me. I might be wrong.

Anyway. Back to work.

PS: funny that whenever I post something like this I immediately start thinking of other things to write about. Writing begets writing. And so every entry that says that I've lost my flow ends up being the first in a stream. (Okay, enough with the blog-recursive thoughts.)

Categories: personal
Posted by diego on August 2, 2004 at 12:40 PM

Copyright © Diego Doval 2002-2011.