the story of an outage

a tale of mistakes, backups, recovery (by a hair), and why permalinks are not so permanent after all

out·age (ou′tĭj) noun

  1. A quantity or portion of something lacking after delivery or storage.
  2. A temporary suspension of operation, especially of electric power.

    When I woke up yesterday after a brief sleep I started to log back in to different services, and just as I'm noticing that something's funny with my server, Jim over at #mobitopia asks "is your site down?"

    Damn.

    As I checked what was happening, I could see that all sorts of things were not working on the server. I was starting to fear the worst ("the worst" in the abstract, nothing specific) when I remembered that I had seen similar symptoms a couple of months ago, and back then it had been a disk space problem. I run "df" and sure enough, the mountpoint where a bunch of data related to the services (including logs) is stored was full (since November the number of pageviews a month has increased to over 200,000, which creates pretty big logfiles). As with last time, the logs were the culprit. Still half-asleep, I start to compress, move things around and delete files, when suddenly after a delete I stop cold: "No such file or directory".

    What? But I had just seen that file...

    I look up the console history and four rm commands had failed similarly.

    Uh-oh.

    I run "pwd". Look at the result. "That's not right...". I was not where I thought I was.

    At that point, I woke up completely. Nothing like adrenaline for shaking off sleepiness.

    I look through the command history. At some point in my switching back and forth from one directory to another, I mistyped a "cd -" command and it all went downhill from there. Adding to the confusion was the fact that I used to keep parallel structures of the same data on different partitions, "just in case". I stopped doing that once I got DSL back in May last year, opting instead to download stuff to my home machine, but the old structure, with old data, remained. On top of that, my bash configuration for root doesn't display the current directory (the first thing I did after I realized that was add $PWD to the prompt, but of course by then it was too late).
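One guard against this kind of slip, sketched in Python: delete only through a helper that resolves the target to an absolute path and refuses anything outside the directory being cleaned up. `safe_remove` and the directory names below are hypothetical, made up for illustration; this is not what was running on the server.

```python
import tempfile
from pathlib import Path


def safe_remove(path, allowed_root):
    """Delete `path` only if it resolves to somewhere under `allowed_root`."""
    target = Path(path).resolve()
    root = Path(allowed_root).resolve()
    if root not in target.parents:
        raise ValueError(f"refusing to delete {target}: not under {root}")
    target.unlink()


# Demo on throwaway directories standing in for the log and home partitions.
with tempfile.TemporaryDirectory() as tmp:
    logs = Path(tmp) / "logs"
    home = Path(tmp) / "home"
    logs.mkdir()
    home.mkdir()
    old_log = logs / "access.log.1"
    precious = home / "weblog.db"
    old_log.write_text("old log data")
    precious.write_text("weblog data")

    safe_remove(old_log, logs)        # inside the allowed root: deleted
    try:
        safe_remove(precious, logs)   # outside the allowed root: refused
        refused = False
    except ValueError:
        refused = True
    results = (old_log.exists(), precious.exists(), refused)

print(results)  # (False, True, True)
```

Because the path is resolved before the check, the current directory (and a mistyped "cd -") stops mattering: the helper either deletes the file it was asked to, or nothing.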

    I had just wiped out the movable type DB, the MT binaries (actually, all the CGI scripts), the archives, and a bunch of other stuff in my home directory.

    I took a deep breath and finished creating space, and moved on.

    First thing I did was restart the services, now that disk space was no longer an issue. Then I reinstalled the binaries that I had just wiped out, which I always keep in a separate directory with some quick instructions on how to install them. That turned out to be a lifesaver, one of the many in this little story.

    After that I put up a simple page explaining the situation (here's a copy for... err... "historical reference"), plus a hand-written feed, and worked on the problem in breaks between work.

    Then I realized that all the links that were coming in from the outside (through other weblogs, google, etc) were getting a 404. So as a temporary measure I redirected the archive traffic to the main page through a mod_rewrite clause:

    RewriteRule /d2r/archives/(.*) /d2r/ [R=307]

    That would return a temporary redirect (code 307) while I got things fixed (one fire out! 10 to go).

    So what next? The data of course. When I came back to Ireland at the beginning of January I started doing backups of different things (a "new year, new backups" sort of thing), and I backed up all the server data directories on Thursday, and then on Saturday I did what I thought was a backup of my weblog data, through MovableType's "Export" feature. As things turned out, the latter proved useless, and it was the "binary" backup that saved the day.

    Why? Well, as I started looking at things, I went to MT's "import" command in cavalier fashion and was about to start when the word "permalink" popped up in my head. Then it grew to a question: "What about the permalinks?".

    The question was valid because my permalinks are directly based on the MT entry IDs. Therefore, if an import changed the entry IDs, it would also break all the permalinks. I started cursing myself for not having switched over to entry-based strings for permalinks, but that didn't help. So I did a little digging and I realized that I was right. MT assigns entry IDs on a system-wide basis. So if you have multiple weblogs on the same DB (which I have, some of them private, some for testing, etc) OR if you have to recover the data from an export (which I had to do) you're out of luck. More likely than not, the permalinks will not work anymore. The exported file did not include IDs. Re-importing would generate the IDs again. Different IDs. Different links. Result: broken links all over the place, both within the weblog and from external sources.
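To make the failure mode concrete, here is a minimal sketch in Python. It is not MT's actual code or schema: `Database`, `add_entry`, `export_entries`, `permalink`, and the URL layout are all hypothetical. The only assumptions taken from the description above are that IDs come from one counter shared by every weblog in the database, and that the export carries no IDs.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Database:
    """Toy model: one ID counter shared by every weblog in the database."""
    next_id: count = field(default_factory=lambda: count(1))
    entries: dict = field(default_factory=dict)  # entry id -> (blog, title)

    def add_entry(self, blog, title):
        entry_id = next(self.next_id)
        self.entries[entry_id] = (blog, title)
        return entry_id


def permalink(entry_id):
    # Permalink derived directly from the entry ID (hypothetical URL layout).
    return f"/archives/{entry_id:06d}.html"


def export_entries(db, blog):
    # The export keeps the content only; the IDs are dropped.
    return [title for b, title in db.entries.values() if b == blog]


# Original database: entries from two weblogs interleave on one counter.
original = Database()
original.add_entry("main", "first post")      # gets id 1
original.add_entry("test", "scratch entry")   # gets id 2
original.add_entry("main", "second post")     # gets id 3
old_links = {title: permalink(i)
             for i, (b, title) in original.entries.items() if b == "main"}

# Recovery by import into a clean database: IDs are handed out afresh.
restored = Database()
new_links = {title: permalink(restored.add_entry("main", title))
             for title in export_entries(original, "main")}

print(old_links["second post"])  # /archives/000003.html
print(new_links["second post"])  # /archives/000002.html
```

The first post happens to keep its link, but everything after the first gap in the sequence shifts, which is why an import cannot be trusted to preserve ID-based permalinks.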

    This is clearly an issue with the MT database design, which doesn't seem too well adapted to the idea of recovery. To be fair, however, I am not sure how other blogging software deals with this problem, if at all. I think this is one big hole in the weblog infrastructure that we haven't yet completely figured out, both for recovery and for transitions between blog software (as Don noted recently).

    This is when I started thinking that things would have been much easier if I had written my own weblog software. :) That thought would return a few times over the next 24 hours, but luckily I was busy enough with other things not to indulge in it too much.

    After looking online and finding nothing on the topic, I came to the conclusion that my only chance was to do a direct restore of the "binary" copy I had from last Thursday (that is, replacing the clean database with the backup directly). I did the upload, put everything in place, and things seemed to go well: I could log in to MT and the entries up to that point were right where they had to be. So far so good. I was going to do a rebuild and I thought that maybe now was a good time to close off all comment threads in all entries (to avoid ever-increasing comment spam), so I spent some time trying to figure out how to use the various MT tools to close comments on old entries. However, they all seem to be built for MySQL rather than BerkeleyDB. It wasn't a hard decision to set it aside and move on.

    So I started a full rebuild. The first 40 entries went along fine, albeit slowly. Then nothing happened. Then, failure. I thought for a moment that, for some strange reason, the redirect I had set up yesterday was causing the problem, so I removed it, restarted the server, and tried again. Failed again. No apparent reason.

    I got angry for a second but then I remembered that the "binary" backup was of everything, including the published HTML files. Aha! I uploaded those, crossed my fingers, and did a rebuild only of the index files, and everything was up again. Actually, this was important for another reason: the uploaded images that are linked from the entries end up by default in the archives directory, so you need a backup of that or the images (and whatever else you upload into MT) will be gone if you lose the site.

    So the solution up until this point had been a lot simpler than I thought at the beginning.

    But wait! All the entries after last Thursday were missing, and I didn't have a backup for those. That was when RSS came to the rescue in three different forms: 1) I download my own feeds into my aggregator, so there I had a copy up to a point. 2) Some kind souls, along with their condolences for the problem, sent along their own copy of the latest entries (Thanks!!--and Thanks to those who sent good wishes as well). 3) Search engines (Feedster was the most up to date--btw, it was Matt who suggested yesterday, also on #mobitopia, that I check out Feedster as a source of information, a great idea that really applies to many search engines if their database is properly updated) had cached copies that I could use to check dates and content. So armed with all that information I set out to recreate the missing entries.

    Here the problem of the permalinks surfaced again. I had to be careful on the sequencing, or the IDs wouldn't match. So I re-created empty entries, one-by-one, to maintain the sequencing (leaving them unpublished), actually posted a couple of updates of what was going on, and then I published the recovered entries as I entered the content and set the right dates.
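The sequencing trick can be sketched as a toy model in Python. This is not anything MT provides: `recreate_in_order` and the ID numbers are hypothetical, and the sketch assumes new entries always receive the next ID from a single counter. Under that assumption, every ID between the last restored entry and the highest one needed has to be created in order, even if it only ever holds an empty draft.

```python
from itertools import count


def recreate_in_order(last_restored_id, missing):
    """Recreate entries so that each one lands on its original ID.

    `missing` maps original ID -> recovered content. IDs in the range with
    no known content (for example, entries that belonged to another weblog)
    still get an empty draft so the counter advances past them.
    """
    counter = count(last_restored_id + 1)  # next ID the system would hand out
    entries = {}
    if not missing:
        return entries
    for _ in range(last_restored_id, max(missing)):
        new_id = next(counter)
        # Step 1: create an empty, unpublished draft to claim the ID.
        entries[new_id] = {"content": "", "published": False}
    for entry_id, content in missing.items():
        # Step 2: fill in the recovered content and publish.
        entries[entry_id].update(content=content, published=True)
    return entries


# Hypothetical numbers: the backup ends at ID 1500; 1501 and 1503 were lost.
recovered = recreate_in_order(1500, {1501: "post A", 1503: "post B"})
for entry_id, entry in recovered.items():
    print(entry_id, entry["published"], repr(entry["content"]))
```

Separating "claim the ID" from "fill in the content" is what makes it possible to post status updates in between without disturbing the links of the entries still being recovered, as long as the drafts are all created first.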

    So. All things are restored now (except for the comments from the last week, which are truly lost--this makes me think that setting up comment feeds would be a good idea. However, that doesn't address how I would recreate the comments given what happened. Would I post them myself under the submitter's name? That doesn't seem right at all. Another problem with no obvious solution given the combination of export/ID issues with MT).

    What's strange is that there's been a slight breakdown in continuity now, because I did "post" some updates to that temporary index file, but they couldn't be part of the regular blogflow. Hopefully this entry fixes that to the extent possible.

    Okay, lessons learned?

    1. Backups do work. :) I am going to do another full backup today, and I'll try to set up something automated to that effect. (Yes, I know I should have done it before, but as usual there are no simple solutions, and then you leave it for the next day... and the next...). Plus, backups for MT installations should always be of both the DB and the published data, to make recovery quick. (I have about 1500 entries, which amount to something like 20MB of generated HTML--additionally, the images are posted directly in the archives directory, so if you're not backing that up, you've lost them).
    2. For MovableType, the export feature is not so great as far as backups are concerned. The single-ID-per-database problem is a big one IMO, and I don't think MT is alone in this. We need to start looking at recovery and transition in a big way if weblogs are going to hit the mainstream (and if we want permalinks to be really permanent).
    3. Solutions are often simpler than you think, if you have the right data. Having a full backup makes recovery in this case easy and fast.
    4. This stuff is still too hard. What would a less technically-oriented user do in this situation? Granted, it was my knowledge (since I was fixing stuff directly on the server) that actually created the problem in the first place, but there are lots of ways in which the same result could have been "achieved", starting from simple admin screwups, hardware failures, etc.

    Overall, this has been a wake-up call in more than one sense, and it has set off a number of ideas and questions in my head. How to solve these problems? I'll have to think about it more.
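For lesson 1, here is a minimal sketch of what "something automated" might look like, in Python. The directory names are placeholders rather than the real server layout, and `backup` is a hypothetical helper; the point is only that the DB and the published files (including the archives directory with the uploaded images) go into the same archive. It would still need to be scheduled, from cron for example, and the resulting archive copied off the server.

```python
import tarfile
import tempfile
import time
from pathlib import Path


def backup(sources, dest_dir):
    """Write one timestamped .tar.gz holding every directory in `sources`."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"weblog-{time.strftime('%Y%m%d-%H%M%S')}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for src in map(Path, sources):
            # Fail loudly rather than silently produce a partial backup.
            if not src.is_dir():
                raise FileNotFoundError(f"backup source missing: {src}")
            tar.add(src, arcname=src.name)
    return archive


# Demo against throwaway directories standing in for the real ones.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "db").mkdir()
    (root / "db" / "entries.db").write_text("database file")
    (root / "archives").mkdir()
    (root / "archives" / "000001.html").write_text("<html></html>")

    archive = backup([root / "db", root / "archives"], root / "backups")
    with tarfile.open(archive) as tar:
        names = sorted(tar.getnames())

print(names)
```

Raising on a missing source directory is deliberate: a backup that quietly skips the archives directory is exactly the kind that looks fine until the day it is needed.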

    Anyway. Back to work now, one less thing on my mind.

    Where was I?

    Categories: technology
    Posted by diego on January 14, 2004 at 5:45 PM

all entries restored

Okay, all entries since last Thursday have been restored. Coming up: the story of an outage.

Categories: personal
Posted by diego on January 14, 2004 at 3:53 PM

back up--partially

Okay. Getting back to normal now. Everything up to last Thursday seems to be back. Now I'm recovering the entries I had posted since then. Entries will start showing up as I update them. And I say "update them" because I had to create the entries first as empty drafts to maintain the ID sequencing. Many lessons learned from this whole thing. More later.

Categories: technology
Posted by diego on January 14, 2004 at 2:59 PM

Copyright © Diego Doval 2002-2011.