I'm so annoyed with myself.
So today was the day I was going to upgrade the disks in daedalus.
I'd finally managed to convince Dell to sell me two new 140Gb disks to replace the two 70Gb disks that were fast approaching capacity. I'd organised access to the datacentre from 10am. I arrived early.
I didn't manage to get into the datacentre until about 11am, because they were having problems remotely unlocking the door, and I had to wait until a repair guy turned up to fix some air conditioning problem.
I hadn't had a lot of time to plan how I was going to do the upgrade, but I figured it was going to be pretty straightforward. Each disk is partitioned into three partitions, all of which are mirrored using software RAID. The first partition is the root filesystem, which is just an ext3 filesystem, mirrored. The second partition is swap, also mirrored, and the third partition is an LVM physical volume. The first two partitions are only a couple of gigabytes.
So the basic plan was to remove the second disk (disk 1, aka /dev/sdb), put in the new bigger disk, load the partition table from the original disk 1, resync the three mirrors, and then boot off the new disk 1, and rinse and repeat with disk 0.
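In command terms, the plan was roughly this (reconstructed from memory, so treat it as a sketch rather than a transcript; the md and partition numbers follow the layout described above):

```shell
# Fail and remove the old disk 1 (/dev/sdb) from each array:
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2
mdadm /dev/md2 --fail /dev/sdb3 --remove /dev/sdb3

# Physically swap in the new disk, then copy the partition
# table across from the surviving disk:
sfdisk -d /dev/sda | sfdisk /dev/sdb

# Re-add the new partitions and let the mirrors resync:
mdadm /dev/md0 --add /dev/sdb1
mdadm /dev/md1 --add /dev/sdb2
mdadm /dev/md2 --add /dev/sdb3
watch cat /proc/mdstat
```

Then boot off the new disk and repeat for disk 0.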
This was all going swimmingly, when right at the end (in fact after the third partition finished syncing according to the kernel), /dev/sda decided it had a few unreadable sectors. This made the software mirroring have a dummy-spit, and decide it needed to restart the syncing from scratch. I'd seen this before with a bad disk.
So I wishfully hoped it was something to do with the new disk being bigger than the old one, and tried rolling back to the original disk 1. As pretty much expected, disk 0 exhibited the same problems at the same point. So the problem was disk 0, not the new disk 1.
So it seemed like I just managed to pick the wrong half of the mirror to work from.
I thought I'd try manually dd'ing the third partition from /dev/sda to /dev/sdb, ignoring the errors, but that didn't result in a usable half of a mirror either.
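The dd attempt looked something like this (`conv=noerror` keeps reading past the bad sectors; `sync` pads the failed blocks with zeros, which is presumably part of why the result wasn't usable as a mirror half):

```shell
# Raw copy of the LVM partition, ignoring read errors.
# Zeros get substituted wherever /dev/sda can't be read.
dd if=/dev/sda3 of=/dev/sdb3 bs=64k conv=noerror,sync
```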
Interestingly, these bad sectors hadn't made themselves known until this point. The RAID array had been otherwise healthy, and SMART hadn't uttered a peep.
I think at this point, after spending about 3 hours fighting with it, I decided to move up a layer. I could successfully sync /dev/md0 (the root filesystem) and /dev/md1 (the swap device), and the only reason I could think of for these bad sectors not having given me grief already was that they were right at the end of the disk, and I hadn't quite used all the free physical extents in LVM yet. Lucky I hadn't decided to grow things any further.
So I decided that rather than trying to convince this existing /dev/md2 to sync with the new /dev/sdb3, I'd just build a new degraded RAID-1 on /dev/sdb3 and move all the physical extents from the degraded RAID-1 on /dev/sda3 to the new degraded RAID-1 on /dev/sdb3. I tried the logical volume with /tmp on it first, since it was fairly sacrificial. It worked fine. So I tried the largest logical volume, /srv, which was 30Gb. That went fine as well, so I moved the rest. All of them went without complaint. So then I removed the now unused /dev/md2 from the volume group, and finally managed to boot with just the new disk 1. Then I was able to put in the new disk 0, and just do a standard RAID-1 rebuild onto it. Hooray for LVM saving the day yet again.
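The LVM escape route boils down to a handful of commands (a sketch only: the new array name /dev/md3 and the volume group name "vg0" are stand-ins, not what was actually on the box):

```shell
# Build a new degraded RAID-1 with only the new disk's partition:
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdb3 missing

# Make it a physical volume and add it to the volume group:
pvcreate /dev/md3
vgextend vg0 /dev/md3

# Migrate all physical extents off the dying half
# (add -n <lvname> to move one logical volume at a time):
pvmove /dev/md2 /dev/md3

# Drop the old, now-empty physical volume:
vgreduce vg0 /dev/md2
```

The nice part is that pvmove runs with the logical volumes online, so nothing needed to be unmounted.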
And this is where I should have stopped.
At this point it was about 3pm. I'd been aiming to leave by 4pm to pick up a rental car that I had to get by 5pm. I'd wanted to also upgrade to Etch while I was physically in front of the box, in case there were any nasty kernel/udev issues that made the box unbootable. The catch was, I needed to grow /var, which was quite full.
The intention had been that once I'd got the two new disks in and all synced up, I'd create a fourth mirrored partition and add it to LVM as a new physical volume, and then I could just keep merrily growing things. It felt a little bit dirty having to make a new partition to get at the extra space, and I'd been reading about mdadm's "grow" mode while I was sitting around waiting for things to sync.
So I thought to myself, I'd just delete the third partition, and recreate it using the remaining additional cylinders I now had at my disposal, then use this grow feature of mdadm to tell Linux the array was now bigger, then resize the physical volume, and go from there.
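For reference, the sequence I was attempting looks like this (reconstructed after the fact; as the rest of this post demonstrates, don't try it without doing your homework first):

```shell
# After enlarging sda3 and sdb3 in fdisk and rebooting:
mdadm --grow /dev/md2 --size=max   # grow the array to fill the partitions
pvresize /dev/md2                  # tell LVM the physical volume got bigger
# ...then lvextend/resize2fs the filesystems as needed.
```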
This is where things went a bit pear-shaped.
I deleted and recreated (larger) the third partition on each disk, and rebooted.
I was then messing around with mdadm trying to update /etc/mdadm/mdadm.conf, since the UUID of /dev/md2 had changed, when I discovered that /dev/md2 had completely disappeared.
It seems that recreating that partition made Linux completely fail to see /dev/sda3 and /dev/sdb3 as RAID-1 members, and so because of the fairly transparent nature of Linux's RAID-1, LVM had seen the underlying physical volumes, and just decided to run with /dev/sda3 as the physical volume, and everything had continued to work. I just had no redundancy any more.
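The mdadm.conf update itself is the easy part: you regenerate the ARRAY lines (which carry the new UUIDs) and fold them into the config by hand:

```shell
# Emit current ARRAY lines, with UUIDs, for mdadm.conf:
mdadm --detail --scan
# Typically you append them and then prune the stale entries:
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
```

It was while doing exactly this that I noticed /dev/md2 was gone.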
So it was about 3:20pm at this stage, and I was starting to panic, thinking I'd never get this sorted out before I had to leave, and I'd end up returning to the US with a non-redundant mess on my hands. Then I remembered that I had the backups of the partition tables of the old disks, from before I removed them. So I thought I'd go back to that state, and just have the additional cylinders unallocated. I hadn't yet attempted to grow the RAID-1 (partly because it had vanished), so in my haste, I didn't see this being a problem.
The problem was, I'd booted into multi-user mode, and things had merrily been operating on just half the mirror. I restored the partitions, and rebooted, and lo and behold, /dev/md2 magically reappeared, already 100% consistent according to /proc/mdstat. I was somewhat surprised by that, but I booted back into multi-user mode again, only to discover that the /var filesystem was panicking and remounting itself into read-only mode. This was pretty weird, given that during boot, it was considered clean by e2fsck.
I was really getting bothered by this stage, so I rebooted into emergency mode (single-user mode in Debian starts way too much), manually started up LVM, and forcibly fscked everything. There was a bit of filesystem damage to /var, /home and /srv.
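The recovery steps from emergency mode were roughly these (the volume group and logical volume names here are hypothetical, but the shape is right):

```shell
# Emergency mode doesn't activate LVM for you:
vgchange -ay

# Force a full check even though the filesystems claim to be clean:
fsck.ext3 -f /dev/vg0/var
fsck.ext3 -f /dev/vg0/home
fsck.ext3 -f /dev/vg0/srv
```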
Thinking I'd now fixed the problem, I rebooted into multi-user mode again, but again /var, which was considered clean during the fscking/mounting phase of booting, panicked shortly after mounting, and remounted itself read-only.
So I rebooted again, broke the mirror, and forcibly fscked the filesystems again (only operating on what was on /dev/sda3). Then I rebooted without incident into multi-user mode and resynced the mirror.
At this point I ran away.
So I'm really annoyed with myself. This should have been a blog post about how crap Linux's software RAID is at error recovery, and how LVM saved the day yet again, but it was overshadowed by me overreaching in my haste, and managing to cause silent data corruption.
The morals of this story:
- If things go off the rails, and you get to a reasonable milestone, and you're approaching the end of your maintenance window, quit while you're ahead.
- Don't do stuff in multi-user mode just because you can and want to minimise the length and magnitude of an outage
- Don't rush to try and do something (growing a RAID-1) without researching it properly
- Linux's software RAID-1 sucks when the underlying devices are having issues (I already knew this)
- Linux's software RAID-1 can look perfectly healthy when the underlying disk has issues, as long as the bad bits aren't getting any exercise (I think newer versions of mdadm and/or Linux address this by doing a full array check on a regular basis)
- Linux's software RAID-1 can get out of sync with reality and report that it's in-sync with itself when it isn't really, which causes silent data corruption
- e2fsck can report a filesystem is clean when it's really got issues
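On the regular-check point above: the kernel can scrub an md array on demand through sysfs, which would have flushed out those latent bad sectors long before the upgrade (Debian's mdadm package later grew a checkarray cron job that does exactly this on a schedule):

```shell
# Kick off a full read-and-compare pass over the array:
echo check > /sys/block/md2/md/sync_action
# Watch progress in /proc/mdstat; afterwards, a non-zero
# mismatch count means the halves disagree somewhere:
cat /sys/block/md2/md/mismatch_cnt
```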
I think that is all. I'm just so pissed off that what could have been a perfect (enough) upgrade with zero data loss, even considering I had a bum disk, wasn't.