On Tue, Dec 21, 2010 at 06:42:36AM -0500, Eric Martin wrote:
On Mon, Dec 20, 2010 at 4:52 PM, John Stoffel <john@stoffel.org> wrote:
> "Eric" == Eric Martin <eric.joshua.martin@gmail.com> writes:
Eric> Now, the main reason for my email. My mythbox / home file
Eric> server runs linux soft-raid in RAID5 with 3 500GB disks. One
Eric> disk went bad a while ago and the array never rebuilt itself.
So you only had two 500GB disks left in the array?
Correct. I just received my new disk yesterday and dd_rescued the data from the bad disk onto a fresh disk. The array won't start, so here's the info you asked for.
I know it is confusing, but ddrescue and dd_rescue are two different projects that have no relation to each other, other than the fact that they do similar things. GNU ddrescue is the more advanced of the two and I would recommend it over dd_rescue--it may be able to recover more data (unless you use the cumbersome dd_rhelp helper shell script with dd_rescue). Check this out: http://www.toad.com/gnu/sysadmin/index.html#ddrescue
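For what it's worth, here is a minimal sketch of the two-pass approach I would try with GNU ddrescue. The flags (-f, -n, -d, -r3) are the ones used in the ddrescue manual's own rescue example; the function name ddrescue_plan, the device names and the mapfile name are hypothetical placeholders. The function deliberately only prints the commands rather than running them, because the destination disk gets overwritten.

```shell
# Hypothetical helper: print (do NOT run) a two-pass GNU ddrescue plan.
# $1 = failing source disk, $2 = destination disk, $3 = mapfile path.
ddrescue_plan() {
    local src="$1" dst="$2" map="$3"
    # Pass 1: copy everything that reads cleanly and skip the slow
    # scraping of bad areas (-n); -f is needed to write to a device.
    echo "ddrescue -f -n $src $dst $map"
    # Pass 2: revisit the bad areas recorded in the mapfile, using
    # direct disc access (-d) and up to 3 retry passes (-r3).
    echo "ddrescue -d -f -r3 $src $dst $map"
}
```

Calling "ddrescue_plan /dev/sdX /dev/sdY rescue.map" prints the two command lines so you can double-check source and destination before pasting them into a root shell. Keeping the same mapfile for both passes is what lets the second pass resume where the first one left off.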
BTW, this is what I was missing. There was no warning that my disk was bad! Like I said, I have backups, but the last one failed, so I want something tighter.
I run smartd (from smartmontools) and mdmonitor (mdadm running in monitor mode) as daemons to monitor for SMART errors and array issues. On Fedora/Red Hat like systems this is as simple as:

    chkconfig mdmonitor on
    chkconfig smartd on
    service mdmonitor start
    service smartd start

(and making sure that root mail goes somewhere useful)

mdmonitor will monitor the arrays that are specified in /etc/mdadm.conf (which incidentally is written automatically by the Fedora installer) and let you know if any component device of an array has trouble or drops out for any reason.

smartd will monitor devices listed in /etc/smartd.conf (or with DEVICESCAN it will scan and monitor all available ATA/SCSI devices on the system) and let you know if you start getting uncorrectable errors, reallocated sectors, etc., which can be a nice early-warning system. I usually replace disks as soon as a single sector is reallocated or "pending".

But even with these, you still should "scrub" the array--try to read every sector on all devices that make up the array--to find errors that might otherwise be missed. I believe that is what the checkarray script that John posted about does. checkarray appears to be a Debian-specific script. My Fedora system has a similar one called raid-check:

    /etc/cron.weekly/99-raid-check
    /etc/sysconfig/raid-check

    #!/bin/bash
    #
    # Configuration file for /etc/cron.weekly/raid-check
    #
    # options:
    #   ENABLED - must be yes in order for the raid check to proceed
    #   CHECK - can be either check or repair depending on the type of
    #       operation the user desires.  A check operation will scan
    #       the drives looking for bad sectors and automatically
    #       repairing only bad sectors.  If it finds good sectors that
    #       contain bad data (meaning that the data in a sector does
    #       not agree with what the data from another disk indicates
    #       the data should be, for example the parity block + the other
    #       data blocks would cause us to think that this data block
    #       is incorrect), then it does nothing but increments the
    #       counter in the file /sys/block/$dev/md/mismatch_count.
    #       This allows the sysadmin to inspect the data in the sector
    #       and the data that would be produced by rebuilding the
    #       sector from redundant information and pick the correct
    #       data to keep.  The repair option does the same thing, but
    #       when it encounters a mismatch in the data, it automatically
    #       updates the data to be consistent.  However, since we really
    #       don't know whether it's the parity or the data block that's
    #       correct (or which data block in the case of raid1), it's
    #       luck of the draw whether or not the user gets the right
    #       data instead of the bad data.  This option is the default
    #       option for devices not listed in either CHECK_DEVS or
    #       REPAIR_DEVS.
    #   CHECK_DEVS - a space delimited list of devs that the user specifically
    #       wants to run a check operation on.
    #   REPAIR_DEVS - a space delimited list of devs that the user
    #       specifically wants to run a repair on.
    #   SKIP_DEVS - a space delimited list of devs that should be skipped
    #
    # Note: the raid-check script intentionally runs last in the cron.weekly
    # sequence.  This is so we can wait for all the resync operations to complete
    # and then check the mismatch_count on each array without unduly delaying
    # other weekly cron jobs.  If any arrays have a non-0 mismatch_count after
    # the check completes, we echo a warning to stdout which will then be emailed
    # to the admin as long as mails from cron jobs have not been redirected to
    # /dev/null.  We do not wait for repair operations to complete as the
    # md stack will correct any mismatch_cnts automatically.
    #
    # Note2: you can not use symbolic names for the raid devices, such as
    # /dev/md/root.  The names used in this file must match the names seen in
    # /proc/mdstat and in /sys/block.

    ENABLED=yes
    CHECK=check
    # To check devs /dev/md0 and /dev/md3, use "md0 md3"
    CHECK_DEVS=""
    REPAIR_DEVS=""
    SKIP_DEVS=""

Finally, if you do come across bad sectors on a device, they will usually only be reallocated to a spare sector on the media when the sector is written to, rather than just read from. As a last-ditch "I really want to continue using this bad disk, please just reallocate the bad sectors and I'll deal with the data that is already lost" measure, you can try to force a reallocation with hdparm --repair-sector:

    --repair-sector
           This is an alias for the --write-sector flag.  VERY DANGEROUS.

    --write-sector
           Writes zeros to the specified sector number.  VERY DANGEROUS.
           The sector number must be given (base10) after this flag.
           hdparm will issue a low-level write (completely bypassing the
           usual block layer read/write mechanisms) to the specified
           sector.  This can be used to force a drive to repair a bad
           sector (media error).

But I would really think about replacing any disks that have developed such bad sectors, as it is only a matter of time before you run out of spare sectors to reallocate to.
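If you want to see by hand what the weekly scrub reports, the following is a rough sketch of the "warn on non-zero mismatch count" step described in the raid-check comments. It is not the Fedora script itself. The function name report_mismatches and the idea of passing the sysfs root as a parameter are my own assumptions (the parameter just makes it easy to try against a scratch directory). Note that the md driver spells the sysfs attribute mismatch_cnt, whereas the config comments write mismatch_count.

```shell
# Hypothetical helper: warn about md arrays whose last check left a
# non-zero mismatch count.  $1 = sysfs root (defaults to /sys).
report_mismatches() {
    local sysfs_root="${1:-/sys}"
    local f dev count
    for f in "$sysfs_root"/block/md*/md/mismatch_cnt; do
        [ -r "$f" ] || continue           # no md arrays, or glob did not match
        dev="${f#"$sysfs_root"/block/}"   # strip the leading path
        dev="${dev%%/*}"                  # keep only the device name, e.g. md0
        count=$(cat "$f")
        if [ "$count" != "0" ]; then
            echo "WARNING: $dev has mismatch_cnt=$count"
        fi
    done
}
```

On a live system a check can be started as root with "echo check > /sys/block/md0/md/sync_action" (md0 being an example name); once /proc/mdstat shows the check has finished, running report_mismatches with no argument prints one warning line per array that needs a closer look.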
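My "replace the disk as soon as a single sector is reallocated or pending" rule can also be checked by hand between smartd mails. Below is a small sketch that filters the attribute table printed by "smartctl -A". The function name check_smart_attrs is hypothetical, and it assumes the common ATA layout where the attribute name is the second column and the raw value is the tenth; some drives format the raw value differently, so treat it as a starting point only.

```shell
# Hypothetical helper: read "smartctl -A" output on stdin and print any
# reallocated or pending sector counts whose raw value is above zero.
# Exit status is 1 if anything was printed, 0 otherwise.
check_smart_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" {
            # Column 10 is assumed to be RAW_VALUE; "+ 0" forces a number.
            if ($10 + 0 > 0) { print $2 "=" $10; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

Typical use would be "smartctl -A /dev/sda | check_smart_attrs" as root, with /dev/sda standing in for whichever disk you care about. Reading from stdin keeps the function free of any device access, so it can be tried on saved smartctl output first.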