Now blogging at diego's weblog. See you over there!

end-of-year thoughts

What a year.

So many things happened.

In the personal sphere, I defended and submitted my PhD, released one product (share) and re-released another (pro), and then, only a month ago, we ran out of money...

We also had the US presidential election, which energized pretty much everybody, and lots of contentious issues: the ongoing struggle in Iraq, the "war on terror," and another terrible attack, this time on European soil, in Madrid on 3/11, by crazed nihilistic psychopaths who somehow think that killing produces anything aside from death and suffering. Only a few months ago, the Beslan massacre and related terrorist attacks gave Russian president Vladimir Putin a terrible excuse to tighten his grip on power in the country.

Then, this week brought us a tragedy of a different kind, the earthquake/tsunami in Asia, which will probably end up costing a quarter million lives and untold social and economic damage. This was just one region, 11 countries, and, on a global scale, comparatively "few" people affected, and even so the international aid system is under severe strain. You'd better hope that global warming doesn't materialize, or we're in for a lot more than this. Then, just a few hours ago, in Buenos Aires, a massive fire in a crowded club produced one of the worst casualty counts on record for indoor fires.

And on and on.

Other things, like tech, picked up steam this year, even as economies everywhere kept giving conflicting signals. The tech revival was partly embodied by Google's IPO, partly by some resurgence in Silicon Valley, the definitive establishment of weblogs, more WiFi everywhere, the smartphone explosion, distributed technologies, and the promise that the convergence of those things brings. But then a tsunami hits and all of that kind of goes away, doesn't it? And in the end it seems to add up to a gloomy year, and maybe it wasn't in the top ten... but...

But this week I've been thinking about how much the media is biased to report "bad things" and how we as consumers of that media seem biased by a morbid curiosity to follow them.

We don't get front-page news that says "50,000 people happy after concert ends without incident," or "millions of parents proud of children," or whatever. It's all about what went wrong, what didn't go well, and how people should be kept "safe" (when absolute safety is clearly impossible).

For everything that goes wrong there are things that go right, but we don't seem to linger on those. We tend to remember the bad, and shrug off the good. But the measure of our response to adversity has to reside partly in how we appreciate when things go our way, not just how we respond to disasters.

We still all could do better, and must do better. But that's life, both sides of it: joy, pain, love, anger, suffering, compassion, and even redemption. It wouldn't be worth living otherwise.

So: here's to the good things that happened this year, to my family and friends (past and present, "virtual" or not :)), to the spirit of good people everywhere, and to the millions of small acts of kindness and compassion that are all around us, every day.

"All things are inconstant except the faith in the soul,
which changes all things and fills their inconstancy with light"

-- James Joyce

Categories: personal
Posted by diego on December 31, 2004 at 1:09 PM

ipod!

Btw, I forgot to mention this (or rather, I've been sort of 'off the grid' for a couple of days) but I won one of the iPods in the Feedster developer contest!

Many thanks to Feedster, and Scott. Very cool!

Categories: technology
Posted by diego on December 31, 2004 at 12:02 PM

opml icon?

This webapp I'm writing for Nooked has, as one of its outputs, OPML (it outputs RSS as well, of course). Now, for the RSS output I'm using the common white-on-orange XML icon:


But for OPML there doesn't seem to be a common icon. I wanted something along similar lines, and I did this:
opml.gif

Now, my question is: does this exist already? I have the strongest feeling of having seen this somewhere... has anyone else? Why blue? I don't know; it seemed appropriate. One thing that bugs me a bit is that OPML is technically XML as well (although many OPML files out there don't validate), so you really have to look "beyond meaning" and just say "orange icon: RSS, blue icon: OPML" (or whatever) regardless of what they say. Maybe it's time to have an orange "RSS" icon? Or is the XML icon too ingrained by now?

Btw, there are many possibilities for this set of questions to spawn a flamewar, I'm really not interested in that. :) Mostly, I wonder: do people have a favorite way of representing OPML feeds, either in their own apps or as users? Any feedback will be appreciated!

Categories: soft.dev
Posted by diego on December 28, 2004 at 12:32 PM

hollywood's laws of physics (and gender)

Last night I was watching Charlie's Angels: Full Throttle (on TV, I'm glad I didn't pay for theater tickets or rent it) and I ended up spending the second half of the movie waiting for Wile E. Coyote to show up as a character in the plot (the first half of the movie was spent waiting for the plot itself, which didn't appear). I did come to a few conclusions in the meantime, among them:

  • It is possible to jump off the roof of a four-story building and land on concrete, then proceed to continue your pursuit.
  • If, after making a karate jump onto the roof of a two-story building, you are shot in the chest and then fall to the ground below, the worst you can expect is getting wet from the sprinklers that will activate just as you regain consciousness (your Kevlar vest saves you from the bullet).
  • If you are flung from a Dodge Viper GT racing at high speed and crash through a window, you'll be able not only to continue the chase, but also to catch up to the Dodge Viper GT in only a few seconds, on foot, and silently.
  • If you make karate moves while you're being shot at, the flow of time will slow down so you can see the bullets fly past you.
  • If you crash a Dodge Viper GT into a concrete wall, you can expect the wall to be destroyed and yourself to be uninjured and ready to continue the fight. If you are a good guy, however, you will have a shard of glass stuck in your abdomen. Removing it will not impede your movement, though.
  • and on and on and on and on...
I guess what I'm wondering is: when did breaking the laws of physics become fun? The Matrix is one of the earliest of its kind, accounting for the little detail that, you know, it happened in a simulated reality (and they are responsible for bullet-time, at least in live action; anime is really where it comes from). Mission: Impossible pushed things a bit, but hey, it's Mission: Impossible. M:I2, though, was way over the top, and then things started to come off the rails. Why is it that blockbusters seem to resort to CG when the script ain't working, even if they aren't dealing with aliens or twisters or whatever?

Why do they have to be just so over the top? The actresses, all of them beautiful and talented, seem to be having fun, and this is made obvious throughout. Ah-ha. Was that the point of the movie? That they'd enjoy their residuals?

Sigh. One of my favorite movies of all time is Heat. You know why? Because it was zero-bullshit. It didn't require me to suspend disbelief from here to Canarsie to buy the plot (note: movies like MIB, Armageddon and ID4 require suspension of disbelief just for entering the theater, so it's okay that they are over the top :)). One of my favorite scenes in Heat is the shootout outside the bank. Cars don't explode (it's pretty difficult to make gas tanks explode, maybe because they've been designed to avoid that). People actually run for cover in the face of M-16 fire. On the opposite end, another favorite is the typical Simpsons scene with a leaf falling off a tree, hitting a truck, and making it explode.

So, comment to Hollywood: read Newton's Principia. You know, 17th-century physics. Einstein not required. If you can't get through it, just remember:

  • A body remains at rest, or moves in a straight line (at a constant velocity), unless acted upon by a net outside force.
  • The acceleration of an object of constant mass is proportional to the resultant force acting upon it.
  • Whenever one body exerts force upon a second body, the second body exerts an equal and opposite force upon the first body.
"Body" here, btw, refers to an object, either animate or inanimate, not to the body of your co-star, be that Cameron Diaz, Lucy Liu, or Drew Barrymore.

This sounds snobbish, doesn't it? Well, maybe it does. M:I2 was over the top as well.

How about making women real protagonists, without having to behave as if they were in a casting call for Baywatch? Uh? Is this too revolutionary?

Yes, it may be that what really pissed me off was the beer-commercial aesthetics of the movie. I generally ignore the misogynistic inclinations of Bond movies, although they piss me off as well. Why do they seem to be more of an issue with Charlie's Angels? Not sure. Maybe it's that with Bond they are more of a sideshow, and Bond himself isn't a prize either (and Bond women are generally players in their own right, rather than being directed by the all-knowing, all-seeing Charlie), or maybe it's that at least the beer-commercial thing is not a big part of it.

However, Charlie's Angels: Full Throttle isn't something I'd recommend.

Unless you want to see a two-hour long beer commercial.

PS: I also watched Mystery Science Theater 3000 which is a wacky, wacky B-movie that made me laugh out loud in spite of myself. Crazy characters, no plot, and no pretense of one either. Highly recommended.

Categories: art.media
Posted by diego on December 27, 2004 at 7:31 AM

merry xmas!

Well, technically not Christmas here yet, but it's certainly the right time somewhere in Asia.

Cold day, mostly overcast, but by sunset Dublin was given a feast of lights against the cloud cover, which had broken up a bit. It looked great.

Aside from tons of phone calls, I'm spending today (and most of the next few weeks) working, as I said in the previous entry.

Also: I'm re-reading KSR's Red Mars. Such a good book. Makes exploration bubble up in my veins. :)

And with that, signing off for today.

Categories: personal
Posted by diego on December 24, 2004 at 4:54 PM

working holidays

Throughout the next few weeks I'm going to be doing freelance work for Nooked (which simplifies RSS publishing for corporations), designing and developing a new web application for them. This will help pay the bills while the search continues for the next big thing. Should be interesting. I spent most of the day yesterday and today checking up on the latest Struts changes, project settings, tests, and other preliminaries, all of which will no doubt continue through the weekend. Target release is around mid-January, so the mystery won't last long. :)

Categories: personal
Posted by diego on December 24, 2004 at 4:44 PM

javaone 2005 call for papers

The JavaOne 2005 call for papers is now open. I wonder if I could submit something. But first I wonder if I'll have the time to actually write it. :)

Categories: soft.dev
Posted by diego on December 23, 2004 at 12:13 PM

a manifold kind of day

I'm about to go out for a bit, but before that, here's what I've been talking about for days now: the manifold weblog (about my thesis work). The PDF of the dissertation is posted there (finally!), along with two new posts that explain some more things, as well as the older posts on the topic, plus a presentation and some code in the last one. Keep in mind that this is research work. Hope it's interesting! And that it doesn't instantaneously induce heavy sleep. :)

my favorite Java 5 change

I used Java 5 (with Eclipse 3.1) for the code I wrote last week to use as example for manifold, and there's no question: the enhanced for loop, combined with generics, rocks.

Aside from the basic difference of going from (note: use of an inline iterator is also common):

// strings is an ArrayList
for (int i = 0; i < strings.size(); ++i) {
    String s = (String) strings.get(i);
    // do something with s
}

to

for (String s : strings) {
    // do something with s
}
there's also the cooler use of it to iterate over the contents of a HashMap, so instead of doing

HashMap m = new HashMap();
// ...
// fill the map with String keys and Vector values
// ...
Iterator it = m.keySet().iterator();
while (it.hasNext()) {
    String key = (String) it.next();
    Vector value = (Vector) m.get(key);
}

you can do

HashMap<String, Vector> m = new HashMap<String, Vector>();
// ...
// fill the map with String keys and Vector values
// ...
for (String key : m.keySet()) {
    Vector value = m.get(key);
}
Which is much more concise and clear, IMO. Very cool.
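A related note: when you need both the key and the value, iterating over entrySet() avoids the extra get() lookup per key that the keySet() version does. A minimal sketch (the class and method names here are mine, purely for illustration):

```java
import java.util.Map;
import java.util.Vector;

public class MapLoop {
    // Walks the map via entrySet(), which hands each key/value pair over
    // directly -- no extra m.get(key) lookup inside the loop body.
    static String render(Map<String, Vector<String>> m) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Vector<String>> e : m.entrySet()) {
            sb.append(e.getKey()).append(" -> ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```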

Categories: soft.dev
Posted by diego on December 20, 2004 at 6:19 PM

a subtle problem of frontpage

As an aside, I spent some time in the last couple of days doing a favor for someone who had created a website but wanted it to look reasonably good.

Since the site had been created with Frontpage, I had to go through the usual rigamarole of removing the extraordinary amounts of garbage that Frontpage inserts into the HTML. This was problem number one.

The 'subtle' problem though, was the UI. In the process of changing the site I of course redesigned the navigation, but I realized that Frontpage was actually doing something pretty terrible: creating a bad UI.

Frontpage automatically manages the creation and maintenance of navigation on a site. You can create the site hierarchy and FP will maintain links, etc. The problem is that the UI that FP generates is hierarchical, and it doesn't really do justice to the multidimensional nature of hypertext. It is, pretty much, a directory in HTML form. In many default FP templates, sub-pages are generated with "Up" navigation links along with the rest, which is not only ridiculous in HTML but also bad UI practice, because the navigation bar changes content for every page you're on.

So my question is: can't Microsoft fix Frontpage so that a) it generates simple, CSS-based HTML, and b) the default templates include well-designed hypertextual UIs, rather than what it does today?

Or does Microsoft need a Firefox-like HTML editing app that will wake up the Frontpage team, just as Firefox itself has resuscitated the IE team?

PS: Frontpage is actually a product that Microsoft acquired in 1996 when they bought a company called Vermeer Technologies. The founder of Vermeer, Charles Ferguson, wrote a book about his experience, from founding to acquisition, called High Stakes, No Prisoners, which is fantastic. Recommended.

Categories: technology
Posted by diego on December 20, 2004 at 5:58 PM

diggin'

After the attack on clevercactus last week, most of the service is back up. Some things are still missing (e.g., forums). In the meantime, I'm doing other stuff, talking to people, and preparing the blog/site for my thesis so I can stop mixing everything up here (I can already see entries with nothing but "just posted on...").

Blogflow is clearly erratic. Must be the cold. We've had our first subzero days recently.

Categories: personal
Posted by diego on December 20, 2004 at 5:53 PM

under attack

Through the last week the clevercactus site has been sporadically unavailable, and it's down right now. This means no web, no service, no emails getting through.

If you're trying to get through to clevercactus and can't please let me know through a comment or email to my personal address.

What happened is that we were attacked (I'm not sure when) and someone left a number of scripts there that are flooding the system (they do other things too, but at least one of them is clearly written simply to flood the network and disable it). This is something obviously intended to bring down clevercactus, not just a simple hack. Why? What do they gain by bringing down the service of a small company that is going through hard times?

This kind of thing makes me sad, and is really discouraging.

I had this whole thing planned for today, getting the manifold site up and so on but now I'm going to spend time trying to see how to route around the problem for now until we can determine the extent of the hack. I don't even know how they got in yet--we constantly update our software with the latest patches. Needless to say, I'm seriously reconsidering the whole of the software I use and how to set it up so that this doesn't happen again.

Anyway. We'll see how it goes.

Categories: clevercactus, soft.dev, technology
Posted by diego on December 16, 2004 at 2:31 PM

the lord of the rings: the battle for middle earth-- a review

lotr1.jpg

Long-time readers already know that I'm not much of a gaming fan. I don't even bother with most games, and the only ones I've played with interest have been those of the Doom/Quake series and the C&C series (and Myst too). I didn't play Doom/Quake for that long, probably until I got bored of fragging zombies with BFGs, but C&C games only got repetitive after several months.

Not that I have a lot of time to play games anyway, but over the years they have gone from entertainment to a good way to get everything off my mind for a while, along with books and movies. (Not all entertainment guarantees that--neither do all books or movies or games for that matter. :))

Enter The Lord of the Rings: The Battle for Middle Earth, which I talked about in September, when I found out it was going to be released. It was finally released last week (it was delayed from its original November release date) and I got it on Friday.

First I thought: "Wow". Then "D'Oh" after I was killed.

The "Wow"/"D'Oh" sequence continued for a while, until I figured out how to actually win one game. Aside from a couple of skirmishes, I've been playing the "Good" Campaign (essentially playing the story of the books--there's also the "Evil" campaign in which you control the forces of Sauron).

The game is astonishingly well done. Great interface (the first consistently good use of circular menus I've seen anywhere), well-balanced sides, excellent graphics and sound, plus you get to actually play a story that you know, with the characters you know.

The simulation of combat is great, both on foot and horseback--A high point is to use an army of Rohirrim to run over an incoming band of Orcs. :)

The resource-gathering system (a weak point in many RTS games) is good as well, and it fits the story. Sauron's and Isengard's forces, for example, obtain resources by chopping down woods, while the good guys (Gondor, Rohan) do it by farming. (Tolkien was probably the first fantasy/SF writer to worry a lot about the ecology/technology balance, and he mostly put technology in the hands of the bad guys.)

Aside from the usual armies, there are also "heroes," which are the characters of the books. Heroes have powers that activate with rank (regular soldiers only improve at what they already do). Sometimes they deviate from the story, but that's not a big deal (in Moria, Gandalf can survive against the Balrog, for example). Then, at crucial times, the action is mixed with sections from the movies matching what you're doing (say, the arrival of the Elves at Helm's Deep while preparing for the defense). This is getting pretty close to a mix of RTS with action RPG.

Anyway, if you like wargames (or even Role-Playing Games), you should check this game out (if you like C&C, it's almost guaranteed that you will like this game too--there's even an option to activate a C&C-like input interface).

PS: it is sad that Electronic Arts treats its employees badly. The games these people create are excellent, and they deserve better.

Categories: art.media
Posted by diego on December 14, 2004 at 5:18 PM

back on planet earth

That's me yes. The one who's back, that is.

As a way of, um, relaxing, I've spent part of the last two days coming up with a simple code sample that can show how one of the algorithms of my thesis works. This is one of those things that should be simple but isn't--I had to run in circles for a while before I could unwind the thought process behind it, at least enough so that whatever code I posted wouldn't be completely incomprehensible.

I finally got it done a couple of hours ago, and I spent a bit more time adding some more information to the output. No, I'm not going to post it in this entry, I definitely want to create a site to gather all this stuff, otherwise it will end up scattered all over this weblog. That's what I'm doing now.

So, what else? Um. I have to do some more work on the conversation engine. I'm still considering what to do next. Mostly I want to do the upgrade that will allow it to actually be useful, by spidering many more sites and going further back in time to find more conversations.

That, and blog about a couple of other things. :)

Categories: personal
Posted by diego on December 14, 2004 at 5:07 PM

the final submission

Today I finally submitted the final, approved (did I say final?) version of my dissertation! There was a missing signature somewhere in the paperwork that finally showed up, and it was a go. I didn't have full confirmation that I could do it, but I just showed up at the Graduate Studies office on campus and it turned out I could. So I turned in the copies, one for the University library, one for the Department library, gave one to my advisor and kept one for myself (with the nice binding and all that).

There's one more thing that has to happen though (of course! always something else!), when "The Council" gets together and gives a final seal of approval to the theses that were finally submitted in the previous month. Or something like that. Plus I have to register for the ceremony, etc.

The whole thing has taken quite a bit longer than normal, but now it's really, really done. Next step, as I said, will be to post the full dissertation and continue talking about the algorithms, etc. I'm definitely thinking about creating a site for that to put all this stuff, including code, etc. More later.

Categories: personal
Posted by diego on December 8, 2004 at 6:51 PM

the great playlist meme of '04

Here are the instructions:

  1. Open up the music player on your computer.
  2. Set it to play your entire music collection.
  3. Hit the "shuffle" command.
  4. Tell us the titles of the next ten songs that show up (with their artists), no matter how embarrassing. That's right, no skipping that Carpenters tune that will totally destroy your hip credibility. It's time for total musical honesty. Write it up in your blog or journal and link back to at least a couple of the other sites where you saw this.
  5. If you get the same artist twice, you may skip the second (or third, etc.) occurrences. You don't have to, but since randomness could mean you end up with a list of ten songs by five artists, you can if you'd like.

Here's my list:

  1. Light My Fire - The Doors
  2. Can't Not - Alanis Morissette
  3. Help! - The Beatles
  4. Great Escape (Acoustic) - Guster
  5. Pride (In The Name Of Love) (ZooTV Live Transmission, Houston 14/10/1992) - U2
  6. Como un Bolu - Bersuit
  7. High Voltage (Hybrid Theory EP) - Linkin Park
  8. Shattered - The Rolling Stones
  9. Symphony #5 in C Minor (3rd movement) - Beethoven
  10. Learning To Fly (Live) - Pink Floyd

via Erik, Jim.

PS: for the record :), I did skip "duplicate artists." I think it definitely makes sense, considering that for some artists I have hundreds of tracks and others rate only a few, or a few dozen (There I go, trying to "make sense" out of something like this. The bane of logic). I'm surprised to discover that skipping randomly through my entire music collection is oddly addictive...

Categories: art.media
Posted by diego on December 8, 2004 at 9:51 AM

the fun side of supersymmetry

I mean: "squarks," "gravitinos," "photinos," "gluinos," "selectrons," and even "winos" (no alcohol involved there, just the superpartner, or shadow partner, of the W boson). Maybe nobody's sure of what string theory is, exactly, but the name variations are certainly entertaining. It used to be that you just needed a copy of Joyce's Finnegans Wake to name a particle (Murray Gell-Mann took "quark" from the sentence "Three quarks for Muster Mark" in that book).

Anyway, from today's New York Times: String Theory, At 20, Explains It All (or not):

"String theory, the Italian physicist Dr. Daniele Amati once said, was a piece of 21st-century physics that had fallen by accident into the 20th century.

And, so the joke went, would require 22nd-century mathematics to solve."

Albert Einstein: "God does not play dice with the universe."

Stephen Hawking: "Not only does God play dice, but... he sometimes throws them where they cannot be seen."

Niels Bohr: "How wonderful that we have met with a paradox. Now we have some hope of making progress."

Exactly.

Categories: science
Posted by diego on December 7, 2004 at 10:52 AM

conversation engine: the next step

Since the recent integration of Feedster results into the conversation engine, I've stopped coding for a bit, and while doing other stuff I've been thinking about how to make it more scalable, covering more weblogs and not wasting resources looking at pages with no meaning (read: make it more useful)--in short, how to solve the problems I mentioned in that entry.

The crucial problem is that Feedster provides only part of the picture. Scott Rafer (Feedster's CEO) mentioned in the comments that I could use the Feedster links output, which provides a list of the references to a particular weblog. This doesn't quite do what I need, however. The reason is simple: Feedster indexes RSS feeds, not entire sites, so if someone is providing summary feeds, Feedster will not be able to find links between weblogs, even if they exist. Because many, many weblogs provide summary feeds, it is clear that the only way to get the links between entries is to get the actual contents of the HTML page. But.

But what I can do is use Feedster as the source point for the list of pages to index. Right now I am indexing everything on a given website. This has two drawbacks. First, I am forced to download, store, and analyze waaay more content than I need (which accounts for the small number of sites the bot is crawling at the moment), particularly when weblogs point to other parts of a site, including wikis, dynamic apps, etc. Second, it slows down the processing for conversations, which depends on walking the link graph between two sites. This is a problem now, but if I move in the direction of adding multiple-participant conversations (as Don suggests in a comment to my previous conversation engine post, linked above) then this will be even more important.

So.

Next step, then, is to use Feedster as the data source for the entries of a given weblog. Then download and process the pages for each entry's permalink. Then analyze that and combine the results with the Feedster information.
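To make that concrete, here's a rough sketch of the pipeline, with everything hypothetical: EntrySource stands in for the Feedster query and PageAnalyzer for the fetch-and-extract-links step; neither is a real API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: permalinks come from the feed index (Feedster),
// and only those pages are downloaded and analyzed for outbound links.
public class PermalinkCrawl {
    // Stand-in for a Feedster query: entry permalinks for a weblog.
    interface EntrySource {
        List<String> permalinksFor(String weblogUrl);
    }

    // Stand-in for the HTTP fetch + link-extraction step.
    interface PageAnalyzer {
        List<String> outboundLinks(String permalink);
    }

    // Builds the per-entry link graph from permalink pages only,
    // instead of crawling the whole site.
    static Map<String, List<String>> buildLinkGraph(
            String weblogUrl, EntrySource source, PageAnalyzer analyzer) {
        Map<String, List<String>> graph = new HashMap<String, List<String>>();
        for (String permalink : source.permalinksFor(weblogUrl)) {
            graph.put(permalink, analyzer.outboundLinks(permalink));
        }
        return graph;
    }
}
```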

Stay tuned! More in the next few days.

Categories: soft.dev
Posted by diego on December 7, 2004 at 10:36 AM

microsoft and software piracy: oops

This isn't new news (it's about three weeks old), but it's still surprising enough that I'm blogging it now: Patroklos notes that Microsoft used a cracked version of the application Sound Forge to create some of the media files in Windows XP. Patroklos has a screenshot there; I verified it myself in my own copy of Windows XP. To check it, just open any of the WAV files under C:\WINDOWS\Help\Tours\WindowsMediaPlayer\Audio\Wav and go to the end; you'll see the "Deepz0ne" signature that marks the source app as cracked. This definitely seems to be what it appears to be.
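If you want to do the check programmatically rather than in a text editor, something like this sketch would work: it just scans the tail of a file for an ASCII marker string (TailScan and tailContains are my names, not part of any tool mentioned here):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class TailScan {
    // Reads the last few kilobytes of a file and checks for an ASCII marker.
    static boolean tailContains(String path, String marker) throws IOException {
        RandomAccessFile f = new RandomAccessFile(path, "r");
        try {
            long start = Math.max(0, f.length() - 4096);
            byte[] buf = new byte[(int) (f.length() - start)];
            f.seek(start);
            f.readFully(buf);
            return new String(buf, "ISO-8859-1").contains(marker);
        } finally {
            f.close();
        }
    }
}
```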

Oops.

I'm sure that MS will fix this quickly; when an organization is so large, things like these are probably bound to happen. Still, it doesn't look good. If Microsoft itself can't properly enforce licensing with its own employees and contractors (though, admittedly, on a small scale, considering the size of its products), it weakens its own ability to condemn and prosecute piracy.

I'm sure there's a lot more fortune-cookie-wisdom that could be expounded from this incident but it's too early in the day, so I'll leave it at that. :)

Categories: soft.dev
Posted by diego on December 6, 2004 at 9:11 AM

conversation engine, v0.2 (enter feedster)

I've been busy with other things, but I had a couple of hours this morning to make some mods to the conversation finder. First, I changed its name. It is now the conversation engine. This is a name that Don used and that I thought was much better than mine, so there it is.

The main change in this version is that I am now using Feedster's search results, combined with my own spidering, to find conversations. This has two consequences, one good and one bad.

The good consequence is that I can now use Feedster's stored metadata for each post, which is excellent. Check out the "canonical" :) search for conversations between Don and Tim. Now you've got pictures and everything! Great. But you'll notice there's one less conversation than there used to be, which brings me to the "bad" consequence.

The bad consequence is that, because I am at the moment using only Feedster's first 100 results (the maximum Feedster allows on queries) as a filter, I am losing the earlier data I had. The data is still in my DB, since it was spidered, but the engine is no longer finding it. This is an easy fix, though: just iterating through more Feedster results will do the trick, "activating" the older spidered pages in my DB. (BTW, this is also why some conversations that showed up before aren't showing up anymore. I'll be sure to post once I have made this fix, since it's pretty crucial.)
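The fix could look something like this sketch, which just keeps requesting pages of 100 (the stated per-query maximum) until a short page signals the end; ResultPageFetcher is a hypothetical stand-in, not Feedster's actual API:

```java
import java.util.ArrayList;
import java.util.List;

public class Paginator {
    // Hypothetical stand-in for a paged search API.
    interface ResultPageFetcher {
        // Returns up to 'limit' results starting at 'offset'.
        List<String> fetch(int offset, int limit);
    }

    // Pages through results 100 at a time until a short page comes back.
    static List<String> fetchAll(ResultPageFetcher fetcher) {
        List<String> all = new ArrayList<String>();
        int offset = 0;
        while (true) {
            List<String> page = fetcher.fetch(offset, 100);
            all.addAll(page);
            if (page.size() < 100) break; // short page: no more results
            offset += 100;
        }
        return all;
    }
}
```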

Why am I using both the Feedster DB and my own results, you ask? Because Feedster is, at the moment, returning only a parsed version of the RSS feed for that site: just HTML. And that means there are no links in the entry. And that means I can't create the conversation graph, since there are no links to follow.

Even if Feedster were providing the raw entries, it would still be a problem. The reason is summaries: many RSS feeds provide summaries, not the whole content, so it's not guaranteed that you'll be able to extract links from a feed's description element. That means that you absolutely need to crawl the entire site, and then use the combination ("intersection") of Feedster's results and the crawl to come up with the list of links you're interested in.
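The intersecting step might be sketched like this (all names are hypothetical): the crawl supplies page URL to outbound links, the feed index supplies the set of known entry permalinks, and only crawled pages the index knows about survive.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class Intersect {
    // Keeps only crawled pages that appear in the feed index, filtering
    // out archives, wikis, and other non-entry pages.
    static Map<String, Set<String>> entryLinks(
            Map<String, Set<String>> crawledLinks, Set<String> feedEntries) {
        Map<String, Set<String>> result = new HashMap<String, Set<String>>();
        for (Map.Entry<String, Set<String>> e : crawledLinks.entrySet()) {
            if (feedEntries.contains(e.getKey())) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```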

Anyway, looks much better now, doesn't it? :)

Categories: soft.dev
Posted by diego on December 5, 2004 at 6:40 PM

conversations, metadata, and the URL disambiguation problem

Since my first tack at the conversation finder experiment looks promising, I was looking at what didn't work and thinking about how it could be improved.

The first point that was clear was that metadata is crucial for this--exactly the kind of metadata that is present in RSS feeds: author, top-level link, dates for posts, etc. While it seems possible to infer a lot of this stuff from the raw HTML, one crucial component, the dates, can't be. The Last-Modified header in HTTP responses would be mildly useful, but it doesn't actually help much because the page can be rebuilt both by the author and by others (e.g., when they post comments).

The date, however, is important but not crucial. The sequencing of the "conversation" is determined by the direction of the link-graph, not by dates.

What is crucial though is solving the ambiguity/duplication of URLs. Most weblogs have archives, which repeat the information already present elsewhere. The result is that the same posts appear many times within the entire index of a single site. Archives cannot be avoided by some generic algorithm because their "shape" varies greatly. So you end up with many pages that have the same content and even appear to create loops within the conversation, particularly when a single archive page contains two posts that belong to the conversation. Right now, I am doing some analysis on the text that surrounds the link to determine whether this is a "duplicate" (see for example the "Other pages from this site with the same reference include" in this conversation list). But while putting duplicates together is clearly possible, I still don't know which was the actual original post and which are archive duplicates--and you'd want the original post of course.
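One simple (and admittedly partial) way to at least group the duplicates is to bucket pages by a fingerprint of their extracted entry text; this is a hypothetical sketch, not the surrounding-text analysis described above, and picking the canonical URL out of each bucket is still the hard part:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DupGroups {
    // Buckets URLs by a hash of their entry text, so a permalink page and
    // the archive pages repeating the same post land in the same bucket.
    static Map<Integer, List<String>> groupByContent(Map<String, String> urlToEntryText) {
        Map<Integer, List<String>> buckets = new HashMap<Integer, List<String>>();
        for (Map.Entry<String, String> e : urlToEntryText.entrySet()) {
            Integer key = Integer.valueOf(e.getValue().trim().hashCode());
            List<String> bucket = buckets.get(key);
            if (bucket == null) {
                bucket = new ArrayList<String>();
                buckets.put(key, bucket);
            }
            bucket.add(e.getKey());
        }
        return buckets;
    }
}
```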

The second problem of URLs is that multiple, completely different URLs can point to the same content. This is a notable case in weblogs where the blog has moved providers over time, or chosen a better URL, and both URLs are maintained. Scoble is a good example of this, having both http://scoble.weblogs.com/ and http://radio.weblogs.com/0001011/ pointing to exactly the same content. The only way for the software to "know" that this is actually the same site is by looking at the metadata in the feed (checking the content alone is not necessarily foolproof, since the sites could be slightly out of sync).

Then there's the simpler issue of the host (rather than the full URL) being different in different links. The simplest case is of course something.com and www.something.com pointing to the same thing. But Rui, for example, has equivalents in www.taoofmac.com and the.taoofmac.com. Again, the feed would provide a lot of information to resolve the ambiguity and realize that these two seemingly different sites are actually the same. Other approaches are possible as well (IP checks, content checks, etc.), but metadata seems to me a simpler and more effective solution.
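One way this could work, assuming the canonical host has already been learned (from the feed's top-level link, IP checks, or whatever): keep an alias table and rewrite the host portion of each URL before comparing. Purely a sketch, with hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: treat the host from a site's feed metadata as
// canonical, and rewrite known aliases (www vs. no-www, alternate
// hostnames) onto it before comparing URLs.
public class HostCanonicalizer {

    private final Map<String, String> aliasToCanonical =
        new HashMap<String, String>();

    // Register an alias discovered via feed metadata (or other checks).
    public void addAlias(String aliasHost, String canonicalHost) {
        aliasToCanonical.put(aliasHost, canonicalHost);
    }

    // Rewrite the host portion of an absolute URL if it is a known alias.
    public String canonicalize(String url) {
        int start = url.indexOf("//");
        if (start < 0) return url; // not an absolute URL; leave untouched
        int hostStart = start + 2;
        int hostEnd = url.indexOf('/', hostStart);
        if (hostEnd < 0) hostEnd = url.length();
        String host = url.substring(hostStart, hostEnd);
        String canonical = aliasToCanonical.get(host);
        if (canonical == null) return url;
        return url.substring(0, hostStart) + canonical + url.substring(hostEnd);
    }
}
```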

I have other things to do today, but this is definitely something interesting to keep thinking about in the background.

Bonus: In the comments to my previous post Don mentioned the idea of "combustible conversations," which, as he describes it, means "bringing past cluster of interactions to the present when and where it's relevant." This is a great idea! Also something to think about.

Categories: soft.dev
Posted by diego on December 4, 2004 at 3:36 PM

conversation finder v0.1

Okay, so I actually should start using "conversation engine," the name Don suggested and which I think sounds cooler. But for the moment it's still the Conversation Finder. The first version is now live!

This is a very limited version. Only a few sites are being indexed, mostly out of concerns for speed, bandwidth and such. I'll see about expanding it later.

First thing to look at is this result, the conversations the engine (finder?) discovers between Don and Tim Bray.

Interestingly enough, it finds one more aside from their recent Atom conversation, something about flowers :). This is great! It is finding actual conversations!

But... the results are just a little bit off. I keep looking at what it finds and thinking, "come on, you're so close!". Some links are loops. Some links point to index pages (which might have the content, but...). Some of the text extracts are not relevant (look at the "conversations" between the other sites it's indexing).

I think a big factor here is that the engine knows nothing about archives, or about the people that run these blogs. Archives duplicate a lot of information, and the engine gets a little confused by that. So maybe the next step is to fiddle around with some of the metadata present in weblog pages. (The metadata in RSS would be great, particularly the dates to infer sequencing; however, RSS feeds only go back a few days or posts, so all that's left is parsing the different types of metadata embedded in HTML.)

Anyway, not bad for a few hours of work and a 0.1 version. Looks promising! Now if I can just find a way of letting others enable spidering of their sites without killing my server's bandwidth... :))

PS: I wasted a couple of hours on Tomcat setup. Why? Because the JARs I was deploying in WEB-INF/lib didn't have write privileges. Tomcat wants them writable! And it was failing without any error messages, simply not loading the classes in the JARs (and yes, I tried common/lib). Anyway, all is well that ends well.

Update (5/12/2004): The Conversation Finder is now the Conversation Engine.

Categories: soft.dev
Posted by diego on December 3, 2004 at 8:13 PM

manifold, the 30,000 ft. view

As a follow-up to my thesis abstract, I wanted to add a sort of introduction-ish post to explain a couple of things in more detail. People have asked for the PDF of the thesis, which I haven't published yet, for a simple reason: everything is ready, everything's approved, and I have four copies nicely bound (two to submit to TCD) but... there's a signature missing somewhere in one of the documents, and they're trying to fix that. Bureaucracy. Yikes. Hopefully that will be fixed by next week. When that is done, right after I've submitted it, I'll post it here (or, more likely, I'll create a site for it... I want to maintain some coherency on the posts and here it gets mixed up with everything else).

Anyway, I was saying. Here's a short intro.

Resource Location, Resource Discovery

In essence, Resource Location creates a level of indirection, and therefore a decoupling, between a resource (which can be a person, a machine, a software service or agent, etc.) and its location. This decoupling can then be used for various things: mapping human-readable names to machine names, obtaining related information, autoconfiguration, supporting mobility, load balancing, etc.

Resource Discovery, on the other hand, facilitates searching for resources that match certain characteristics, allowing one then to perform a location request or to use the resulting data set directly.

The canonical example of Resource Location is DNS, while Resource Discovery is what we do with search engines. Sometimes, Resource Discovery will involve a Location step afterwards. Web search is an example of this as well. Other times, discovery on its own will give you what you need, particularly if the result of the query contains enough metadata and what you're looking for is related information.

RLD always involves search, but the lines seemed a bit blurry. When was something one and not the other? What defines it? My answer was to look at usage patterns.

It's all about the user

It's the user's needs that determine what will be used, and how. The user isn't necessarily a person: more often than not, RLD happens between systems, at the lower levels of applications. So I settled on usage patterns along two main categories: the locality of the search (local/global), and whether the search was exact or inexact. I use the term "search" as an abstract action, the action of locating something. "Finding a book I might like to read," "Finding my copy of Neuromancer among my books," and "Finding reviews of a book on the web" are all examples of search as I'm using it here.

Local/Global defines, at a high level, the "depth" that the search will have--that is, for the current search action, the context of the user in relation to what they are trying to find.

Exact/Inexact defines the "fuzziness" of the search. Inexact searches will generally return one or more matches; exact searches identify a single, unique item or set.

These categories combined define four main types of RLD.

Examples: DNS is Global/Exact. Google is Global/Inexact. Looking up my own printer on the network is Local/Exact. Looking up any available printer on the network is Local/Inexact.
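Just to make the 2x2 grid concrete, here's a toy encoding of the four patterns with the examples from the text (purely illustrative; the names are mine):

```java
// Illustrative only: the four RLD usage patterns as a grid of
// (scope, precision), labeled with the examples given in the text.
public class RldPattern {

    enum Scope { LOCAL, GLOBAL }
    enum Precision { EXACT, INEXACT }

    public static String classify(Scope scope, Precision precision) {
        if (scope == Scope.GLOBAL && precision == Precision.EXACT)
            return "global/exact (e.g. DNS)";
        if (scope == Scope.GLOBAL && precision == Precision.INEXACT)
            return "global/inexact (e.g. Google)";
        if (scope == Scope.LOCAL && precision == Precision.EXACT)
            return "local/exact (e.g. my own printer)";
        return "local/inexact (e.g. any available printer)";
    }
}
```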

Now, none of these concepts will come as a shock to anybody. But writing them down and clearly identifying them was useful: it defined what I was after, served as a way to categorize systems that did one but not the other, and marked the limits of what I was trying to achieve.

The Manifold Algorithms

With the usage patterns in hand, I looked at how to solve one or more of the problems, considering that my goal was to have something where absolutely no servers of any kind would be involved.

Local RLD is comparatively simple, since the size of the search space is going to be limited, and I had already looked at that part of the problem with my Nom system for ad hoc wireless networks. Looking at the state of the art, one thing that was clear was that every one of the systems currently existing or proposed for global RLD depends on infrastructure of some kind. In some of them the infrastructure is self-organizing to a large degree, one of the best examples being the Internet Indirection Infrastructure (i3). So I set about designing an algorithm that would work at global scales with guaranteed upper time bounds, which turned out to be an overlay network algorithm (eventually based on a hypercube virtual topology), as opposed to the broadcast type that Nom was. For a bit more on overlays vs. broadcast networks, check out my IEEE article on the topic.
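Manifold-g itself isn't described here, but the reason a hypercube virtual topology yields a guaranteed upper time bound is easy to illustrate generically: with d-bit node ids, greedy bit-fixing routing flips one differing bit per hop, so any node is reached in at most d = log2(N) hops. A small sketch (this is the textbook technique, not the actual Manifold algorithm):

```java
import java.util.ArrayList;
import java.util.List;

// Generic illustration (not Manifold-g itself): greedy routing on a
// d-dimensional hypercube. Each node has a d-bit id; each hop flips one
// bit in which the current node and the destination differ, so the path
// length is bounded by the number of differing bits (at most d).
public class HypercubeRouting {

    // Return the sequence of node ids visited from src to dst (inclusive).
    public static List<Integer> route(int src, int dst) {
        List<Integer> path = new ArrayList<Integer>();
        int current = src;
        path.add(current);
        while (current != dst) {
            int diff = current ^ dst;     // bits still to fix
            int lowestBit = diff & -diff; // pick one differing bit
            current ^= lowestBit;         // hop to the neighbor across it
            path.add(current);
        }
        return path;
    }
}
```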

Then the question was whether to use one or the other, and it occurred to me that there was no reason I couldn't use both. It is possible to embed a multicast tree in an overlay and thus use a single network, but there are other advantages to the broadcast algorithm that were pretty important in completely "disconnected" environments such as wireless ad hoc networks.

So Nom became the local component, Manifold-b, and the second algorithm became Manifold-g.

So that's about it for the intro. I know that the algorithms are pretty crucial but I want to take some time to explain them properly, and their implications, so I'll leave that for later.

As usual, comments welcome!

Categories: science, soft.dev, technology
Posted by diego on December 3, 2004 at 12:41 PM

IBM puts the PC business on the block

It's all over the wire.

Aside from the historical significance of this, my first feeling about it was a bit of sadness, and thinking "No!". My second was surprise at feeling anything about a corporate acquisition! My third :) was the realization that if this happened, Apple would be alone, as far as I'm concerned, in moving the ball forward on laptops. IBM was the king of laptop evolution on the PC side, and even though Dell and HP are respectable, they've never shown huge amounts of initiative there (HP has been much better in terms of its PocketPC work--or rather, the iPaq team from Compaq was, and HP has kept the ball rolling).

So my thought had a lot less to do with the PC business itself than with the Thinkpad business. IBM desktops were always a bit clunky IMO, and, excellent keyboards aside, they didn't do much for me. No doubt the failed PS/2-OS/2 experiment in the late 80s (remember Microchannel?) had a huge effect on the vitality of the IBM desktop PC line for a long time.

But the Thinkpads! Almost all of my laptops have been Thinkpads. Best keyboards of any laptop. Simple, good design. Amazing reliability. I got a 560e in 1998 and in 2000 I gave it to my parents so they could keep using it. And it still works fine. The battery started failing two years ago. 4 years. No doubt we take good care of the machines, but still, that's quite a long time for these things.

More than anything, what will be missed the most is the innovation that Thinkpads championed. While a bit dry in terms of design (certainly Apple has been far ahead of everyone else on that count, as usual), they've always moved forward in terms of functionality. Thinkpads were the first to include 10.4" color TFTs, introduced the TrackPoint device (invented by IBM; the trackpad was invented by Apple), were the first notebook to include a CD reader, created ultraportables (the 560), were the first to integrate DVD, and added small but useful things like the "ThinkLight," the little light at the top of the LCD that illuminates the keyboard. That, plus what didn't make it in the long run, such as the amazing Butterfly keyboard, or LCD projection (done by removing the back cover of the display and resting the now see-through LCD on top of an overhead projector), and other things like the Thinkpad TransNote.

Anyway. Maybe IBM will keep part of its research focused on that (I hope so!). But it seems more likely that, from the point this transaction happens (assuming it does), Apple will be the main flag-bearer for innovation in notebooks.

Categories: technology
Posted by diego on December 3, 2004 at 10:11 AM

conversation finder, part 3

The conversation finder saga continues! :) (Parts one, two)

Parser's done, at least in basic form. Both parser and bot seem to be running and playing along nicely. I created a simple conversation finder site to have a fixed point of reference for all this, particularly for the bot, which should start showing up in some logs any minute now. I am keeping it under control by manually specifying which sites it can download (I know this isn't scalable, but it's an easy fix once the rest is done), and at the moment only three are active: my weblog, Don's, and Tim's. Since the core of the idea came from a conversation Don was having with Tim, I'll use that as the "index case." If the finder actually finds conversations there, I'll start expanding the field, or possibly add a form so that others can "activate" spidering of a site.

But that's for later. Now, regarding the parser, a few interesting things.

As it turns out, parsing itself was a lot simpler than interpreting the information. I am using the HTML Parser in the JDK's HTMLEditorKit, which is actually quite easy to use: just define a Callback and specify what to do with each tag opening, closing, etc. But for the algorithm I'm using links between pages, which sounds simple enough... until you realize that links come in many shapes and sizes. Normalization of links into full URIs took a bit of figuring out. What I needed to do was, starting from any possible HREF, end up with a full URI, of the form: scheme + hostinfo + path + query + fragment (hostinfo actually has components, but let's leave that aside for now).

But URLs in HREF can be both relative and absolute. Relative URLs can be absolute within the site (e.g., /d2r/) or relative to the current page (e.g., ../index.html). Absolute URLs vary in form even if they point at the same thing (e.g., www.dynamicobjects.com and dynamicobjects.com), and you can also use IP numbers.

Then there are parsing errors, and URLs that are malformed but might be "recoverable" through fancier parsing; I decided to ignore those for the moment. Another decision was to ignore IP URLs (i.e., not attempt to match them with the site), but this is an easy fix and not that critical, I think--no weblogs that I know of use IP URLs for permalinks.

For separating the pieces I'm using elements of the following regular expression:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
which is specified in Appendix B of the URI RFC, with the java.util.regex.* classes.
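Applied with java.util.regex, that looks roughly like this (group numbers follow the RFC's appendix: 2 = scheme, 4 = authority/hostinfo, 5 = path, 7 = query, 9 = fragment):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Splitting a URI with the Appendix B regular expression from the URI RFC.
// Groups per the RFC: 2 = scheme, 4 = authority (hostinfo), 5 = path,
// 7 = query, 9 = fragment; absent components come back as null.
public class UriSplitter {

    private static final Pattern URI_PATTERN = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    public static String[] split(String uri) {
        Matcher m = URI_PATTERN.matcher(uri);
        if (!m.matches()) return new String[5]; // can't happen: all parts optional
        return new String[] {
            m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)
        };
    }
}
```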

Here's how I'm normalizing URLs at the moment.

For absolute URLs, the main case that has to be disambiguated is something.com vs. www.something.com, which in the vast majority of the cases applies only to the root domain (i.e., www.something.com might point to something.com, but www.other.something.com probably won't exist at all for other.something.com). So there I'm doing a couple of checks and converting something.com to www.something.com when necessary.

For relative URLs I use the current site and page to generate the full URI. With the full URI, I then normalize the path when the reference is relative-relative (e.g., ../index.html).
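A sketch of that resolution step, leaning on java.net.URI instead of hand-rolled path normalization (so this is one possible way to do it, not the actual code):

```java
import java.net.URI;

// Sketch: resolve an HREF against the page it appeared on, normalize
// ".."-style path segments, and drop the fragment so that links to the
// same document compare equal.
public class UrlNormalizer {

    public static String resolve(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        URI resolved = base.resolve(href).normalize();
        // Rebuild without the fragment part.
        String result = resolved.getScheme() + "://"
            + resolved.getAuthority() + resolved.getPath();
        if (resolved.getQuery() != null) result += "?" + resolved.getQuery();
        return result;
    }
}
```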

This solution is far, far from perfect (I don't even want to think about how many special cases I'm not covering), but it's good enough for now.

I'm now working on the algorithm to find conversations. Getting close!

Categories: soft.dev
Posted by diego on December 2, 2004 at 12:17 PM

humbled...

...is how I feel, btw, after the response to my post on clevercactus. I want to say thanks to everybody. Those that have posted about it offering help or referrals, like Erik, Russ, Scoble, Dave, Om, Frank, James, Cristian, Volker, Daniel and many others, and those that have sent emails or put me in touch with people. I don't know what to say (aside from this). Thanks again all!

Categories: personal
Posted by diego on December 1, 2004 at 5:31 PM

conversation finder, part 2

Okay, so going forward with the conversation finder thingy, I think I'm pretty much done with the bot and DB layer for it (nothing to go crazy about, just a few classes, mysql statements, and such). My recent musings on search bots have been helpful, since I had already considered a number of problems that showed up.

DB-related, I simply created a few tables in MySQL (4.x) to deal with the basic data I know I'll need, and then I'll add more as it becomes necessary. To start with, I've got:

  • Site, including main site URL, last spidered date, and robots.txt content.
  • Page, which includes content, URL of a page, pointer to a Site, parsed state, last spidered.
  • PageToPageLinks, just two fields, source PageID and target PageID.
Add to that some methods to create, delete, and query, and we're in business.

Bot-related, a simple bot using the Jakarta HttpClient. Why that and not the standard HttpURLConnection from the JDK? Because HttpURLConnection doesn't allow you to set timeouts. That simple. And when you're downloading potentially tens of thousands of links, you need tight timeouts; otherwise a single slow (or worse, non-responsive) site can throw a wrench into things. Even if you use thread pools to do simultaneous downloads, threads can stay locked for way too long, diminishing the usefulness of pooling.

Anyway, with the basics out of the way: the bot records ETag and Last-Modified values (which sounds like a good idea but may not always work--we'll see later) to download only changed things. It performs a HEAD request and then a GET if necessary.
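The validator bookkeeping might be factored like this (names are mine; the actual HTTP calls, via Jakarta HttpClient in the text, are left out):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the conditional-download bookkeeping: remember
// the ETag and Last-Modified values per URL and decide, from the headers
// of a HEAD response, whether a full GET is needed.
public class ChangeTracker {

    private final Map<String, String> etags =
        new HashMap<String, String>();
    private final Map<String, String> lastModified =
        new HashMap<String, String>();

    // Called with headers from a HEAD response; returns true if the page
    // looks changed (or was never seen) and should be fetched with a GET.
    public boolean needsDownload(String url, String etag, String lastMod) {
        if (etag != null && etag.equals(etags.get(url))) return false;
        if (etag == null && lastMod != null
                && lastMod.equals(lastModified.get(url))) return false;
        return true;
    }

    // Called after a successful GET to record the new validators.
    public void record(String url, String etag, String lastMod) {
        if (etag != null) etags.put(url, etag);
        if (lastMod != null) lastModified.put(url, lastMod);
    }
}
```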

But, since there's no parser yet, I can only download single pages I specify myself.

So, coming up: the parser. :)

PS: just to clarify again, I am doing this as a way to relax a bit. This might exist in other forms; in fact, James pointed out in a comment to the previous entry that BottomFeeder already supports this--very cool. I think it doesn't exist in this form, but even if it did, it would be good exercise for the brain anyway.

Categories: soft.dev
Posted by diego on December 1, 2004 at 5:10 PM

conversation finder

My next step is to write something different to get the gears moving in my head again, and Don's conversation category idea is appealing. So, let's do that. :)

Don described it as:

A conversation aggregator subscribes to the category feed of all the participants and merge them into a single feed and publishes a mini-website dedicated to the conversation. The 'referee' of the debate or the conversation moderator gets editorial rights over the merged feed and the mini-website. Hmm. This stuff is very close to what I am currently working on so I think I'll slip this feature in while I am at it.
Since this is probably too big to do in quick & dirty fashion, I was a little worried. But last night I thought of a different approach. How about something that finds conversations, rather than being subscribed to certain categories?

After all, we already have a mechanism to define a conversation thread: the permalink. Generally, when you're in a cross-blog thread, you point back at the last "reply" from the other person. A cross-blog thread also has the advantage of being a directed graph, with a definite starting point. So permalinks and some kind of graph-traversal thingamagic could be used to find the threads that exist, and maybe are in progress.

As Don notes, sometimes you might refer to the other party by name, or make oblique references. That could be step two, using text-based search to add some more information to the graph formation. But let's say we start with permalinks only.
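Under the simplifying assumption that each post carries one "reply-to" permalink, recovering a thread is just a backwards walk over the link graph to its starting point. An illustrative sketch (real pages link to several posts, so this is the idealized case):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the permalink-graph idea: if each post records the permalink
// of the post it replies to, a conversation is a directed chain that can
// be recovered by walking backwards from its latest post.
public class ConversationWalker {

    // replyTo maps a post's URL to the URL of the post it points back at.
    // Returns the chain oldest-first, ending at lastPost.
    public static List<String> conversationEndingAt(
            String lastPost, Map<String, String> replyTo) {
        List<String> chain = new ArrayList<String>();
        String current = lastPost;
        // The contains() check guards against loops in the link graph.
        while (current != null && !chain.contains(current)) {
            chain.add(0, current); // prepend so the chain reads oldest-first
            current = replyTo.get(current);
        }
        return chain;
    }
}
```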

Hm. Okay. So what do I need for this? First things that come to mind:

  • A crawler
  • A DB (the tables, I mean)
  • A parser (to find the set of links)
  • The algorithm to find the conversations
  • Some kind of web front end to make it more usable?
Neither the crawler nor the parser has to be super-sophisticated, so maybe they're doable in a few hours. Or a couple of days?

This sounds like a good starting point. First step should be DB & crawler. More later!

Categories: soft.dev
Posted by diego on December 1, 2004 at 6:52 AM

Copyright © Diego Doval 2002-2011.