Diary of a geek

Wednesday, 16 January 2008

Linux software RAID hates me

After the debacle last time I tried to grow the size of my existing RAID1 when I put new disks in daedalus, I thought this time I'd do my homework.

I did some research and found out how I should have done it. I did a practice run on a USB key. I planned the whole procedure:

mdadm /dev/md2 --fail /dev/sdb3
<delete /dev/sdb3, recreate at new full size>
reboot
mdadm /dev/md2 --add /dev/sdb3
<wait for sync>
mdadm /dev/md2 --fail /dev/sda3
<delete /dev/sda3, recreate at new full size>
reboot
mdadm /dev/md2 --add /dev/sda3
<wait for sync>
mdadm --grow /dev/md2
<wait for sync>
pvresize /dev/md2

Everything went as planned, until I went to grow the RAID1 volume. It still thought the underlying device was the same size. There was nothing to grow.
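A quick way to check whether there is anything to grow is to compare the size of the partition with the size md is actually using from it. The following is a minimal sketch of that arithmetic only. The helper name `md_headroom_kib` and the numbers are made up for illustration, and it says nothing about why the array on daedalus refused to grow.

```shell
#!/bin/sh
# Hypothetical helper: how many KiB of a partition md is not using.
# $1 = partition size in KiB, $2 = per-device size md reports, in KiB.
md_headroom_kib() {
    echo $(( $1 - $2 ))
}

# Made-up example: a 200 GiB partition where md still uses only 100 GiB.
md_headroom_kib 209715200 104857600
```

On a real system the first number could come from `blockdev --getsize64 /dev/sdb3` (bytes, so divide by 1024) and the second from the per-device size shown by `mdadm --detail /dev/md2`. A small positive difference is expected because of the md superblock. A large one suggests the array has not picked up the new partition size, in which case `mdadm --grow /dev/md2 --size=max` is the usual way to ask for all of it.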

So at this point, I decided to do something similar to what I did last time to get around the failing disk, and should have done last time anyway. I broke the mirror, created a new degraded RAID1 using the full size of the new partition on the half I pulled out of the mirror, and did a pvmove from the old non-full-sized degraded mirror to the new full-sized degraded mirror.
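The workaround described above can be sketched roughly as follows. This is an illustration under assumptions, not a record of the exact commands that were run: the volume group name `vg0` and the new array name `/dev/md3` are hypothetical, while `/dev/md2` and `/dev/sdb3` come from the plan earlier in the post. The script prints the commands by default rather than executing them.

```shell
#!/bin/sh
# Sketch only. Assumed names: volume group "vg0" and new array /dev/md3
# are hypothetical; /dev/md2 and /dev/sdb3 are taken from the post.
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

migrate_to_new_mirror() {
    vg="$1"; old_md="$2"; new_md="$3"; part="$4"
    # Pull one half out of the old mirror.
    run mdadm "$old_md" --fail "$part" --remove "$part"
    # (resize the partition by hand here, then wipe the stale md superblock)
    run mdadm --zero-superblock "$part"
    # Create a new RAID1 with one real member and one missing slot.
    run mdadm --create "$new_md" --level=1 --raid-devices=2 "$part" missing
    # Hand the new array to LVM and move all extents across.
    run pvcreate "$new_md"
    run vgextend "$vg" "$new_md"
    run pvmove "$old_md" "$new_md"
    # Retire the old array once it holds no extents.
    run vgreduce "$vg" "$old_md"
    run pvremove "$old_md"
    run mdadm --stop "$old_md"
}

migrate_to_new_mirror vg0 /dev/md2 /dev/md3 /dev/sdb3
```

Once the old array is gone, the remaining partition can be resized and added to the new mirror with `mdadm /dev/md3 --add /dev/sda3`, which starts the resync.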

All of that went swimmingly until the pvmove was around 50% complete, when the kernel decided to oops spectacularly. I had to power cycle daedalus to get it back under control, and even in single-user mode, without me doing anything, the kernel started oopsing again. Dammit.

I had to boot into emergency mode (insert standard gripe about Debian's single-user mode being far too non-singular here), and then I could resume the pvmove without any further oopsing. After that completed, I was able to ditch the old non-full-sized degraded RAID1 device and resync the new one onto the old partition.

There was still some minor filesystem corruption, most likely because I had everything mounted at the time of the crash. Yes, I still haven't learned not to do this kind of thing in multi-user mode. It seems every time I try to minimise the size and duration of an outage, it bites me in the arse. I should have been able to move open logical volumes between physical volumes, but the kernel oops seemed to be in the dm_mirror code, and daedalus is running a fairly old kernel. The annoying thing is that getting some additional disk space on board was the prerequisite for doing a general upgrade of all of the software on it. Argh.

Anyway, it's done. I hope not to have to go through this again. I just have to sit through a potentially nail-biting remote upgrade of Debian now, and I should be good for a couple more years hopefully.

[21:33] [tech] [permalink]