migrating data structures


from the why-do-we-have-so-many-problems-with-program-versions dept.

Ever upgraded from, say, version 95 to version 98 of a certain program, and found that you had to "convert the files"? Ever felt the anguish of looking at that dreaded dialog window that says "Do you want to convert this file to the new version? Yes, No, Cancel" (whatever "Cancel" means)? Don't you just hate that?

Yeah, me too.

Well, these past few weeks I found myself in the unenviable position of being on the other side of the fence, having to deal with a new database format for spaces and wondering what to do about it. Sure, one option was to just wipe out the old DB and start from scratch: spaces is in alpha, and people know that they shouldn't be using it as their only app yet. In fact, this is the first road I planned to take.

But... but... as days passed I got some feedback from users, and I realized that my own data would have to be exported/imported somehow (since I am using spaces as my only PIM). I arrived at the conclusion that I had to write a migration tool.

Uh-Oh.

This is not as easy as it sounds. You need to ship a program with two incompatible formats that won't step on each other, detect the old one, and make a conversion if necessary. All without bothering users or destroying information.
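To make that sequence concrete, here is a minimal sketch of detect, back up, convert. It is not how spaces does it: the JSON file, the `version` field, the file name and the `convert_v1_to_v2` function are all assumptions invented for illustration.

```python
import json
import shutil
import tempfile
from pathlib import Path

CURRENT_VERSION = 2  # hypothetical version number of the new format


def detect_version(path: Path) -> int:
    """Assumption: files written by the old format carry no version field."""
    data = json.loads(path.read_text())
    return data.get("version", 1)


def convert_v1_to_v2(data: dict) -> dict:
    """Illustrative conversion: the old format kept a flat list of notes."""
    return {"version": 2, "items": [{"text": t} for t in data["notes"]]}


def open_store(path: Path) -> dict:
    """Load the store, converting an old file only after a backup exists."""
    if detect_version(path) == CURRENT_VERSION:
        return json.loads(path.read_text())
    backup = path.with_name(path.name + ".v1.bak")
    shutil.copy2(path, backup)             # the old data is never destroyed
    converted = convert_v1_to_v2(json.loads(path.read_text()))
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(converted))  # write the new format beside the old one
    tmp.replace(path)                      # swap it in only once the write succeeded
    return converted


# Usage: create an old-format file, open it twice.
workdir = Path(tempfile.mkdtemp())
store = workdir / "spaces.db.json"         # hypothetical file name
store.write_text(json.dumps({"notes": ["call Bob", "buy milk"]}))
migrated = open_store(store)               # converts and leaves a backup
reopened = open_store(store)               # already current, no conversion
```

The user never sees a dialog: the first launch converts, every later launch just loads, and the backup is there if the conversion turns out to be wrong.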

As I struggled to find a good way to do this (which happened, eventually: alpha 1.6 was released with an automatic conversion mechanism) so that users wouldn't have to deal with a problem I had created, I realized that this was much harder than it should be. Then I realized why. I had never done it before.

I have years of experience with databases in real world settings, in applications deployed to thousands of users, in client and in server settings. I've had to design DBs, install them, maintain them... and yes, sometimes migrate their contents, but in a painful "by-hand" process that involved converting tables, etc., and making sure that data was not lost.

Spaces, being a consumer app, was different: the conversion had to happen automatically, and fast. It might look like a small distinction, but it's not. It's a whole different ballgame.

And, I wondered, why did this seem so strange? The answer: I had never been exposed to it.

In abstract terms, a database is a persistent collection of data structures of one sort or another. Both in Computer Science courses (at college or wherever) and in the "real world" we are taught and learn how to design the best data structures, we discuss their efficiency, their tradeoffs (size, performance, etc), their APIs. But rarely, if ever, is the topic of migration discussed. The famous Object-Relational Mapping Problem is, largely, a problem of "impedance" between different paradigms, but I've become convinced that it's also been created in part by this rigid thinking of structures as creatures that never, ever change, so when change happens we don't know what to do.

The roots of this are, I think, in the idea (also taught and practiced widely) that data structures, like programs, should be designed once, and then implemented, and that's it. There is no concept of evolvability built into how data structure design is taught (and expected to be performed).

This, of course, is just plain wrong. And APIs are not enough. Well-designed APIs are an important element of any migration (and they were what finally got me out of my self-created hole). But there is more. The API itself reflects the underlying data structure in one way or another, so the data structure itself should be analyzed, at least a bit, to understand what will be involved in migrating to a possible future (different) structure.

That is, data structures should not be designed to apply to all cases, but to migrate gracefully. Similar to test-first programming (one of the practices of eXtreme Programming), where you write the test first and the code later, we should work on data migration first, making sure that one version is compatible with another in some automatic way (and that the migration process is already in place), and only then make the change.

Just like with test-first programming, this builds up confidence: since you know the data won't be lost, you are free to design better data structures as you realize what's needed for every particular case.
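A rough illustration of what "migration first" could look like in practice: a test that pins down what must survive the conversion, written before the new structure is adopted. The record layout, the `migrate` function and the planned "lines" format are hypothetical, made up only for this sketch.

```python
import unittest

# Hypothetical fixture: a record exactly as the current (old) version stores it.
OLD_RECORD = {"title": "groceries", "body": "milk, eggs"}


def migrate(record: dict) -> dict:
    """Hypothetical migration to a planned format that splits the body into lines."""
    return {
        "version": 2,
        "title": record["title"],
        "lines": [part.strip() for part in record["body"].split(",")],
    }


class MigrationTest(unittest.TestCase):
    def test_no_information_is_lost(self):
        new = migrate(OLD_RECORD)
        self.assertEqual(new["title"], OLD_RECORD["title"])
        # Every piece of the old body must be recoverable from the new form.
        self.assertEqual(", ".join(new["lines"]), OLD_RECORD["body"])


# Run the test case directly so the sketch works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MigrationTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With a test like this in place, a redesign that would silently drop data fails before it ever reaches a user's files.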

This doesn't change the data structure design process itself; it changes what comes before (the planning process) and what comes after (the implementation). The implementation in particular requires hooks that will allow you to perform the migration easily and automatically. Every database manager should include a version manager, with hooks to determine which version a particular object is and, if necessary, how to convert it.
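One possible shape for such a version manager, as a sketch only: the `VersionManager` class, its `register` decorator and the sample migrations are hypothetical names, and it assumes that records without a `version` field predate versioning.

```python
from typing import Callable, Dict

Record = dict
Migration = Callable[[Record], Record]


class VersionManager:
    """Hypothetical registry of per-version conversion hooks."""

    def __init__(self, current: int) -> None:
        self.current = current
        self._hooks: Dict[int, Migration] = {}

    def register(self, from_version: int) -> Callable[[Migration], Migration]:
        """Decorator: register a hook upgrading from_version to from_version + 1."""
        def decorator(fn: Migration) -> Migration:
            self._hooks[from_version] = fn
            return fn
        return decorator

    def version_of(self, record: Record) -> int:
        # Assumption: records without a version field predate versioning.
        return record.get("version", 1)

    def upgrade(self, record: Record) -> Record:
        """Apply hooks one step at a time until the record is current."""
        version = self.version_of(record)
        while version < self.current:
            if version not in self._hooks:
                raise LookupError(f"no migration registered from version {version}")
            # Each hook works on a copy, so the caller's record is left untouched.
            record = dict(self._hooks[version](dict(record)), version=version + 1)
            version += 1
        return record


manager = VersionManager(current=3)


@manager.register(1)
def split_name(record: Record) -> Record:
    first, _, last = record.pop("name").partition(" ")
    return {**record, "first": first, "last": last}


@manager.register(2)
def add_tags(record: Record) -> Record:
    return {**record, "tags": []}


old = {"name": "Ada Lovelace"}
new = manager.upgrade(old)  # runs split_name, then add_tags
```

Chaining single-step hooks means each new format only needs one new conversion, and data from any earlier version still reaches the current one.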

In all cases (and this is a UI problem more than anything) the user should barely notice that something has changed. If there is a long operation involved, yes, a progress bar of some sort will be necessary. But nothing more.

It's time for programs to be "responsible" and take care of things transparently instead of involving the user in decisions over things they don't want to know about. For all of us developing applications, having more practice with data structure migration (as opposed to simply data migration, since developers deal with the code) and how to automate it would go a long way towards that.

Categories: technology
Posted by diego on January 4 2003 at 4:32 PM

Copyright © Diego Doval 2002-2011.