Short Tip: Replacing a failed drive in an mdadm software raid

Sometimes you check your fileserver and your raid looks like this:
# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md125 : active (auto-read-only) raid6 sdl[1] sdh[7] sdf[5] sdk[2] sdg[6] sdm[8] sdi[0] sdn[10] sde[4]
23441080320 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/9] [UUU_UUUUUU]
bitmap: 17/22 pages [68KB], 65536KB chunk

Noticed the _ in [UUU_UUUUUU]? This means your raid is degraded and one drive is missing; the [10/9] in front of it says the same thing: only 9 of 10 members are active. You had one job, hard drive! You can use the output of find /dev/sd? and lsblk to check which device is the missing one. Hopefully the drive is still visible to the system; then it should show up there. In that case it should also be possible to query it for the serial number:

smartctl -i /dev/sdj | awk '/Serial/ {print $3}'
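
If you would rather let the shell do the comparing, here is a rough sketch, not a polished tool. The function names degraded_arrays, md_members and unlisted_disks are made up for this example, and the parsing assumes the /proc/mdstat layout shown above. Keep in mind that disks which never belonged to the array (your system disk, for example) will show up as candidates too.

```shell
# Print the name of every md array that reports fewer active members
# than expected, e.g. [10/9].
# $1 = file to read (defaults to /proc/mdstat)
degraded_arrays() {
    awk '/^md/ { md = $1 }
         match($0, /\[[0-9]+\/[0-9]+\]/) {
             split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
             if (n[2] + 0 < n[1] + 0) print md
         }' "${1:-/proc/mdstat}"
}

# Print the member disks of one array, one name per line.
# $1 = array name (e.g. md125), $2 = file to read (defaults to /proc/mdstat)
md_members() {
    awk -v md="$1" '$1 == md && $2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\[[0-9]+\]/) { sub(/\[.*/, "", $i); print $i }
    }' "${2:-/proc/mdstat}"
}

# Read disk names (one per line) on stdin and print those that are NOT
# members of the given array. These are only candidates for the missing disk.
# $1 = array name, $2 = file to read (defaults to /proc/mdstat)
unlisted_disks() {
    members=$(md_members "$1" "$2")
    while read -r disk; do
        printf '%s\n' "$members" | grep -qxF "$disk" || printf '%s\n' "$disk"
    done
}
```

With the output above, ls /dev/sd? | sed 's|/dev/||' | unlisted_disks md125 should list sdj (if it is still visible) together with any other disk that is not part of md125.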

With the serial number you should now be able to identify the failed drive in the system and replace it. After that you can add the new one to the raid:

mdadm /dev/md125 --add /dev/sdj
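
One thing to watch in the output at the top: the array is shown as (auto-read-only). As far as I know, md holds off recovery in that state until the array is switched to read-write or gets its first write. The sketch below only prints the commands it would suggest, it runs nothing. print_add_plan is a made-up helper name, and the auto-read-only behaviour is an assumption you should check on your own system.

```shell
# Print (do not run) the commands for adding a replacement disk.
# $1 = array name (e.g. md125)
# $2 = replacement disk (e.g. /dev/sdj)
# $3 = file to read (defaults to /proc/mdstat)
print_add_plan() {
    md=$1 disk=$2 stat=${3:-/proc/mdstat}
    echo "mdadm /dev/$md --add $disk"
    # Assumption: an auto-read-only array will not start recovery by itself,
    # so suggest switching it to read-write.
    if grep -q "^$md : .*(auto-read-only)" "$stat"; then
        echo "mdadm --readwrite /dev/$md"
    fi
}
```

Run it as print_add_plan md125 /dev/sdj, read what it prints, and only then type the commands yourself.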

Tip1:
Hard drive failure rates follow the bathtub curve. Drives tend to die either in the first hours of use or at the end of their life, mostly after around 25,000-30,000 hours of operation (this depends on many, many factors and is just a rough estimate). Keep that in mind for the next hours while your raid rebuilds: the new drive may fail too, and then you have to replace it again.
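
To get a feeling for where your remaining drives sit on that curve, you can look at their power-on hours. This is a small sketch that assumes the usual smartctl -A attribute table of ATA drives, where the attribute is called Power_On_Hours and the raw value is in the tenth column; SAS and NVMe drives report this differently. power_on_hours is a made-up helper name.

```shell
# Read `smartctl -A` style output on stdin and print the raw
# Power_On_Hours value. "+ 0" keeps only the leading number in case
# the drive reports something like 27312h+12m.
power_on_hours() {
    awk '$2 == "Power_On_Hours" { print $10 + 0 }'
}

# Example (needs root and smartmontools):
#   smartctl -A /dev/sde | power_on_hours
```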

Tip2:
You can speed up the rebuild with:

echo 2000000000 > /proc/sys/dev/raid/speed_limit_max
echo 200000000 > /proc/sys/dev/raid/speed_limit_min

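This raises the kernel's rebuild speed limits (the values are in KB/s per device), so the rebuild may use all available I/O bandwidth even while other I/O is going on. You can watch the progress with watch cat /proc/mdstat.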
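
If you only care about the numbers, the progress line can be picked apart with awk. A minimal sketch, assuming the usual "recovery = 0.5% (...) finish=300.1min speed=...K/sec" line that /proc/mdstat shows during a rebuild; rebuild_progress is a made-up helper name.

```shell
# Print "<array> <percent done> <estimated time left>" for every array
# that is currently recovering or resyncing.
# $1 = file to read (defaults to /proc/mdstat)
rebuild_progress() {
    awk '/^md/ { md = $1 }
         /(recovery|resync) *=/ {
             pct = ""; fin = ""
             for (i = 1; i <= NF; i++) {
                 if ($i ~ /%$/) pct = $i
                 if ($i ~ /^finish=/) fin = substr($i, 8)
             }
             print md, pct, fin
         }' "${1:-/proc/mdstat}"
}
```

For example, watch with something like: while sleep 60; do rebuild_progress; done. It is also worth noting the old values of speed_limit_min and speed_limit_max (just cat the two files) before you change them, so you can put them back after the rebuild.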
