This tutorial walks through troubleshooting and recovering from a failed disk in a software RAID setup, paying particular attention to the mirrored boot partition, which matters when replacing a disk so that the OS still boots properly afterwards. It was written on Ubuntu 14.04, but the steps should also work on other distributions such as RHEL and CentOS.
Preliminary Note
In this example I have two hard drives, /dev/sda and /dev/sdb, with the partitions /dev/sda1, /dev/sda2 and /dev/sda3 as well as /dev/sdb1, /dev/sdb2 and /dev/sdb3.
/dev/sda1 and /dev/sdb1 make up the RAID1 array /dev/md0.
/dev/sda2 and /dev/sdb2 make up the RAID1 array /dev/md1.
/dev/sda3 and /dev/sdb3 make up the RAID1 array /dev/md2.
/dev/sda1 + /dev/sdb1 = /dev/md0
/dev/sda2 + /dev/sdb2 = /dev/md1
/dev/sda3 + /dev/sdb3 = /dev/md2
/dev/sdb has failed, and we want to replace it.
How Do I Tell If A Hard Disk Has Failed?
If a disk has failed, you will probably find a lot of error messages in the log files, e.g. /var/log/messages or /var/log/syslog.
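For example, you could search the log for messages that mention the suspect drive and, if the smartmontools package is installed, also query its SMART status (the log file name and the availability of smartctl depend on your distribution):
grep -i sdb /var/log/syslog
smartctl -a /dev/sdb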
You can also check the status of the arrays directly with:
cat /proc/mdstat
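On a degraded mirror the output looks something like the illustrative snippet below (your device names and block counts will differ). A failed member is flagged with (F), and [U_] instead of [UU] means only one of the two devices is still active:
md1 : active raid1 sda2[0] sdb2[1](F)
      524224 blocks [2/1] [U_]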
Removing The Failed Disk
To remove /dev/sdb, we will mark /dev/sdb1, /dev/sdb2 and /dev/sdb3 as failed and remove them from their respective RAID arrays (/dev/md0, /dev/md1 and /dev/md2).
First we mark /dev/sdb1 as failed:
mdadm --manage /dev/md0 --fail /dev/sdb1
Verify this from the output of:
cat /proc/mdstat
Then we remove /dev/sdb1 from /dev/md0:
mdadm --manage /dev/md0 --remove /dev/sdb1
cat /proc/mdstat
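If you want more detail than /proc/mdstat gives, mdadm can also report the state of an array and its members; after the removal, /dev/sdb1 should no longer be listed here:
mdadm --detail /dev/md0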
Now we do the same steps again for /dev/sdb2 (which is part of /dev/md1):
mdadm --manage /dev/md1 --fail /dev/sdb2
cat /proc/mdstat
mdadm --manage /dev/md1 --remove /dev/sdb2
cat /proc/mdstat
Do the same for /dev/sdb3, which is part of the /dev/md2 array:
mdadm --manage /dev/md2 --fail /dev/sdb3
cat /proc/mdstat
mdadm --manage /dev/md2 --remove /dev/sdb3
cat /proc/mdstat
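If you prefer, the three fail/remove steps above can also be done in one small shell loop. This is just a convenience sketch that fails and removes each partition in a single pass, and it assumes the array/partition pairing used in this example:
# assumes md0<->sdb1, md1<->sdb2, md2<->sdb3 as described above
for pair in md0:sdb1 md1:sdb2 md2:sdb3; do
  md=${pair%%:*}; part=${pair##*:}
  mdadm --manage /dev/$md --fail /dev/$part
  mdadm --manage /dev/$md --remove /dev/$part
done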
Now power down the system so that you can replace the failed disk. If the disk is hot-swappable, you do not need to power the system down.
poweroff
Now replace the old /dev/sdb hard drive with a new one. It must be at least as large as the old one; if it is even a few MB smaller, rebuilding the arrays will fail.
Adding the New Hard Disk
After you have changed the hard disk /dev/sdb, boot the system.
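Before copying the partition table, it is worth confirming that the replacement really is at least as large as the surviving disk. lsblk (or fdisk -l) can print the sizes in bytes:
lsblk -b -d -o NAME,SIZE /dev/sda /dev/sdb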
The first thing we must do now is to create the exact same partitioning as on /dev/sda. We can do this with one simple command:
sfdisk -d /dev/sda | sfdisk --force /dev/sdb
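Note that the sfdisk shipped with Ubuntu 14.04 only understands MBR (msdos) partition tables, as used in this example. If your disks use GPT instead, the same copy can be done with sgdisk from the gdisk package, followed by randomizing the GUIDs on the copy so they do not clash with /dev/sda (only needed in the GPT case):
sgdisk -R=/dev/sdb /dev/sda
sgdisk -G /dev/sdb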
You can run fdisk to check if both hard drives have the same partitioning now.
fdisk -l
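If the replacement disk is not brand new and was once part of an mdadm array itself, it may still carry old RAID metadata. In that case, clear the superblock on each partition before adding them (skip this step on a factory-fresh disk):
mdadm --zero-superblock /dev/sdb1
mdadm --zero-superblock /dev/sdb2
mdadm --zero-superblock /dev/sdb3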
Next we add /dev/sdb1 to /dev/md0 and /dev/sdb2 to /dev/md1 and /dev/sdb3 to /dev/md2:
mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb2
mdadm --manage /dev/md2 --add /dev/sdb3
Now all three arrays (/dev/md0, /dev/md1 and /dev/md2) will be synchronized. To see the rebuild status:
cat /proc/mdstat
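The resync can take a while on large disks. A convenient way to follow it is to let watch re-run the command every few seconds; while the rebuild is running, mdstat shows a progress bar and an estimated time to completion:
watch -n 5 cat /proc/mdstat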
Next, rebuild GRUB on both disks. This is important so that each disk carries its own copy of GRUB; if you skip it, the OS may not boot when the old disk (/dev/sda) eventually fails, because the replacement (/dev/sdb) would have no boot loader on it. To do this, re-install GRUB on both disks with the grub-install command:
grub-install /dev/sda
grub-install /dev/sdb
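On Debian and Ubuntu systems with GRUB 2 you can optionally also record both disks as install targets in the grub-pc package configuration, so that future GRUB package upgrades reinstall the boot loader on both drives automatically:
dpkg-reconfigure grub-pc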
That should be enough, and you should test it by rebooting the server. If the commands above fail, try the following hard-coded disk method from the GRUB (legacy) shell:
grub
grub> device (hd0) /dev/sda
grub> device (hd1) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> root (hd1,0)
grub> setup (hd1)
grub> quit
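After the reboot, it is worth doing a quick final check that every array is clean and that both member disks are active again:
cat /proc/mdstat
mdadm --detail /dev/md0
mdadm --detail /dev/md1
mdadm --detail /dev/md2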
Your server should now be up and running with the healthy redundancy of RAID-1 again.
– masterkenneth