This tutorial walks through troubleshooting and recovering from a failed disk in a software RAID setup, paying particular attention to the mirrored boot partition, which matters when replacing a disk so that the OS still boots properly afterwards. It was written on Ubuntu 14.04, but the steps should also work on other distributions such as RHEL and CentOS.
Preliminary Note
In this example I have two hard drives, /dev/sda and /dev/sdb, with the partitions /dev/sda1, /dev/sda2 and /dev/sda3 as well as /dev/sdb1, /dev/sdb2 and /dev/sdb3.
/dev/sda1 and /dev/sdb1 make up the RAID1 array /dev/md0.
/dev/sda2 and /dev/sdb2 make up the RAID1 array /dev/md1.
/dev/sda3 and /dev/sdb3 make up the RAID1 array /dev/md2.
/dev/sda1 + /dev/sdb1 = /dev/md0
/dev/sda2 + /dev/sdb2 = /dev/md1
/dev/sda3 + /dev/sdb3 = /dev/md2
/dev/sdb has failed, and we want to replace it.
How Do I Tell If A Hard Disk Has Failed?
If a disk has failed, you will probably find a lot of error messages in the log files, e.g. /var/log/messages or /var/log/syslog.
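For example, you could search the log for messages that mention the suspect drive and, if the smartmontools package is installed, also query its SMART status (the log file name and the availability of smartctl depend on your distribution):
grep -i sdb /var/log/syslog
smartctl -a /dev/sdb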
You can also check the status of the arrays directly with:
cat /proc/mdstat
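On a degraded mirror the output looks something like the illustrative snippet below (your device names and block counts will differ). A failed member is flagged with (F), and [U_] instead of [UU] means only one of the two devices is still active:
md1 : active raid1 sda2[0] sdb2[1](F)
      524224 blocks [2/1] [U_]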
Removing The Failed Disk
To remove /dev/sdb, we will mark /dev/sdb1, /dev/sdb2 and /dev/sdb3 as failed and remove them from their respective RAID arrays (/dev/md0, /dev/md1 and /dev/md2).
First we mark /dev/sdb1 as failed:
mdadm --manage /dev/md0 --fail /dev/sdb1
Verify this from the output of:
cat /proc/mdstat
Then we remove /dev/sdb1 from /dev/md0:
mdadm --manage /dev/md0 --remove /dev/sdb1
cat /proc/mdstat
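If you want more detail than /proc/mdstat gives, mdadm can also report the state of an array and its members; after the removal, /dev/sdb1 should no longer be listed here:
mdadm --detail /dev/md0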
Now we do the same steps again for /dev/sdb2 (which is part of /dev/md1):
mdadm --manage /dev/md1 --fail /dev/sdb2
cat /proc/mdstat
mdadm --manage /dev/md1 --remove /dev/sdb2
cat /proc/mdstat
Do the same for /dev/sdb3, which is part of the /dev/md2 array:
mdadm --manage /dev/md2 --fail /dev/sdb3
cat /proc/mdstat
mdadm --manage /dev/md2 --remove /dev/sdb3
cat /proc/mdstat
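If you prefer, the three fail/remove steps above can also be done in one small shell loop. This is just a convenience sketch that fails and removes each partition in a single pass, and it assumes the array/partition pairing used in this example:
# assumes md0<->sdb1, md1<->sdb2, md2<->sdb3 as described above
for pair in md0:sdb1 md1:sdb2 md2:sdb3; do
  md=${pair%%:*}; part=${pair##*:}
  mdadm --manage /dev/$md --fail /dev/$part
  mdadm --manage /dev/$md --remove /dev/$part
done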
Now power down the system so that you can replace the failed disk. If the disk is hot-swappable, you do not need to power the system down.
poweroff
Now replace the old /dev/sdb hard drive with a new one. It must be at least as large as the old one; if it is even a few MB smaller, rebuilding the arrays will fail.
Adding the New Hard Disk
After you have changed the hard disk /dev/sdb, boot the system.
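Before copying the partition table, it is worth confirming that the replacement really is at least as large as the surviving disk. lsblk (or fdisk -l) can print the sizes in bytes:
lsblk -b -d -o NAME,SIZE /dev/sda /dev/sdb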
The first thing we must do now is to create the exact same partitioning as on /dev/sda. We can do this with one simple command:
sfdisk -d /dev/sda | sfdisk --force /dev/sdb
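Note that the sfdisk shipped with Ubuntu 14.04 only understands MBR (msdos) partition tables, as used in this example. If your disks use GPT instead, the same copy can be done with sgdisk from the gdisk package, followed by randomizing the GUIDs on the copy so they do not clash with /dev/sda (only needed in the GPT case):
sgdisk -R=/dev/sdb /dev/sda
sgdisk -G /dev/sdb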
You can run fdisk to check if both hard drives have the same partitioning now.
fdisk -l
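If the replacement disk is not brand new and was once part of an mdadm array itself, it may still carry old RAID metadata. In that case, clear the superblock on each partition before adding them (skip this step on a factory-fresh disk):
mdadm --zero-superblock /dev/sdb1
mdadm --zero-superblock /dev/sdb2
mdadm --zero-superblock /dev/sdb3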
Next we add /dev/sdb1 to /dev/md0 and /dev/sdb2 to /dev/md1 and /dev/sdb3 to /dev/md2:
mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb2
mdadm --manage /dev/md2 --add /dev/sdb3
Now all three arrays (/dev/md0, /dev/md1 and /dev/md2) will be synchronized. To see the rebuild status:
cat /proc/mdstat
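The resync can take a while on large disks. A convenient way to follow it is to let watch re-run the command every few seconds; while the rebuild is running, mdstat shows a progress bar and an estimated time to completion:
watch -n 5 cat /proc/mdstat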
Next, rebuild GRUB on both disks. This is important so that each disk carries its own copy of GRUB; if you skip it, the OS may not boot when the old disk (/dev/sda) eventually fails, because the replacement (/dev/sdb) would have no boot loader on it. To do this, re-install GRUB on both disks with the grub-install command:
grub-install /dev/sda
grub-install /dev/sdb
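On Debian and Ubuntu systems with GRUB 2 you can optionally also record both disks as install targets in the grub-pc package configuration, so that future GRUB package upgrades reinstall the boot loader on both drives automatically:
dpkg-reconfigure grub-pc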
That should be enough, and you should test it by rebooting the server. If the commands above fail, try the following hard-coded disk method from the GRUB (legacy) shell:
grub
grub> device (hd0) /dev/sda
grub> device (hd1) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> root (hd1,0)
grub> setup (hd1)
grub> quit
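After the reboot, it is worth doing a quick final check that every array is clean and that both member disks are active again:
cat /proc/mdstat
mdadm --detail /dev/md0
mdadm --detail /dev/md1
mdadm --detail /dev/md2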
Your server should now be up and running with the healthy redundancy of RAID-1 again.
– masterkenneth