Re: [Wlug] Dec Meeting announcement / linux-raid problem

Dec. 23, 2010

      On Tue, Dec 21, 2010 at 06:42:36AM -0500, Eric Martin wrote:
...
On Mon, Dec 20, 2010 at 4:52 PM, John Stoffel <john@stoffel.org> wrote:
...
> "Eric" == Eric Martin <eric.joshua.martin@gmail.com> writes:
Eric> Now, the main reason for my email.  My mythbox / home file
Eric> server runs linux soft-raid in RAID5 with 3 500GB disks.  One
Eric> disk went bad a while ago and the array never rebuilt itself.
So you only had two 500GB disks left in the array?
Correct.  I just received my new disk yesterday and dd_rescued the data from
the bad disk onto a fresh disk.  The array won't start, so here's the info
you asked for.
I know it is confusing, but ddrescue and dd_rescue are two different 
projects that are unrelated to each other, apart from the fact that 
they do similar things.  GNU ddrescue is the more advanced of the two 
and I would recommend it over dd_rescue, since it may be able to 
recover more data (unless you use the cumbersome dd_rhelp helper shell 
script with dd_rescue).  Check this out:

http://www.toad.com/gnu/sysadmin/index.html#ddrescue
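
As a rough sketch of how a GNU ddrescue run usually goes: a quick 
first pass, then a retry pass over the bad areas, both sharing one 
mapfile (log file) so the second pass picks up where the first left 
off.  The device names and mapfile path below are placeholders, not 
your actual devices, and the function only prints the commands rather 
than running them.

```shell
#!/bin/bash
# Print (do not run) a two-pass GNU ddrescue invocation.
# Source, destination and mapfile are placeholders supplied by the caller.
ddrescue_plan() {
    local src=$1 dst=$2 map=$3
    # Pass 1: copy everything that reads easily, skip scraping the bad areas.
    echo "ddrescue -f -n $src $dst $map"
    # Pass 2: revisit the bad areas with direct disc access, up to 3 retries.
    echo "ddrescue -f -d -r3 $src $dst $map"
}

ddrescue_plan /dev/sdX /dev/sdY /root/sdX.map
```

Double-check which device is the source before running anything like 
this for real; -f is needed to write to a device, so a mix-up 
overwrites the wrong disk.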
...
BTW, this is what I was missing.  There was no warning that my disk was
bad!  Like I said, I have backups but the last one failed, so I want
something tighter.
I run smartd (from smartmontools) and mdmonitor (which runs mdadm 
--monitor) as daemons to monitor for SMART errors and array issues.  
On Fedora/Red Hat-like systems this is as simple as:

chkconfig mdmonitor on
chkconfig smartd on
service mdmonitor start
service smartd start

(and making sure that root mail goes somewhere useful)

mdmonitor will monitor the arrays that are specified in 
/etc/mdadm.conf (which, incidentally, is written automatically by the 
Fedora installer) and let you know if any component device of an 
array has trouble or drops out for any reason.
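
If you want a quick manual check in addition to the daemon, something 
like the following sketch lists arrays that /proc/mdstat shows as 
missing a member (an underscore in the [UU_] status).  The function 
name and the sample data are made up for illustration; I am assuming 
the usual /proc/mdstat layout.

```shell
#!/bin/bash
# List md arrays whose status line shows a missing member, e.g. [UU_].
# Reads /proc/mdstat by default; pass another file to try it on a sample.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /\[[U_]*_[U_]*\]/ { if (name != "") print name }
    ' "${1:-/proc/mdstat}"
}

# Made-up sample: a 3-disk RAID5 running on two members, plus a healthy RAID1.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[1] sda1[0]
      976767872 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
md1 : active raid1 sdb2[1] sda2[0]
      104320 blocks [2/2] [UU]

unused devices: <none>
EOF
degraded_arrays "$sample"
rm -f "$sample"
```

To confirm that alert mail actually arrives, mdadm --monitor --scan 
--oneshot --test should send a test message for each array to the 
MAILADDR given in /etc/mdadm.conf.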

smartd will monitor devices listed in /etc/smartd.conf (or with 
DEVICESCAN it will scan and monitor all available ATA/SCSI devices on 
the system) and let you know if you start getting uncorrectable 
errors, reallocated sectors, etc., which can be a nice early-warning 
system.  I usually replace disks as soon as a single sector is 
reallocated or "pending".
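
To look at those two counters by hand, here is a small sketch that 
filters "smartctl -A" output for non-zero reallocated or pending 
sector counts.  The function name and the sample table are made up, 
and I am assuming the usual ten-column attribute layout with the raw 
value in the last column.

```shell
#!/bin/bash
# Print reallocated/pending sector attributes whose raw value is non-zero.
# Expects "smartctl -A" style output on stdin.
bad_sector_counts() {
    awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" {
             if ($10 + 0 > 0) print $2 "=" $10
         }'
}

# Made-up sample; on a real system: smartctl -A /dev/sda | bad_sector_counts
bad_sector_counts <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8123
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
EOF
```

For the daemon itself, a single line such as "DEVICESCAN -a -m root" 
in /etc/smartd.conf is a common starting point; adding "-M test" 
should make smartd send a test mail at startup so you can confirm 
delivery.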

But even with these, you should still "scrub" the array (try to read 
every sector on all devices that make up the array) to find errors 
that might otherwise be missed.  I believe that is what the checkarray 
script that John posted about does.  checkarray appears to be a 
Debian-specific script.  My Fedora system has a similar one called 
raid-check:

/etc/cron.weekly/99-raid-check
/etc/sysconfig/raid-check

The configuration file, /etc/sysconfig/raid-check, looks like this:

#!/bin/bash
#
# Configuration file for /etc/cron.weekly/raid-check
#
# options:
#	ENABLED - must be yes in order for the raid check to proceed
#	CHECK - can be either check or repair depending on the type of
#		operation the user desires.  A check operation will scan
#		the drives looking for bad sectors and automatically
#		repairing only bad sectors.  If it finds good sectors that
#		contain bad data (meaning that the data in a sector does
#		not agree with what the data from another disk indicates
#		the data should be, for example the parity block + the other
#		data blocks would cause us to think that this data block
#		is incorrect), then it does nothing but increments the
#		counter in the file /sys/block/$dev/md/mismatch_count.
#		This allows the sysadmin to inspect the data in the sector
#		and the data that would be produced by rebuilding the
#		sector from redundant information and pick the correct
#		data to keep.  The repair option does the same thing, but
#		when it encounters a mismatch in the data, it automatically
#		updates the data to be consistent.  However, since we really
#		don't know whether it's the parity or the data block that's
#		correct (or which data block in the case of raid1), it's
#		luck of the draw whether or not the user gets the right
#		data instead of the bad data.  This option is the default
#		option for devices not listed in either CHECK_DEVS or
#		REPAIR_DEVS.
#	CHECK_DEVS - a space delimited list of devs that the user specifically
#		wants to run a check operation on.
#	REPAIR_DEVS - a space delimited list of devs that the user
#		specifically wants to run a repair on.
#	SKIP_DEVS - a space delimited list of devs that should be skipped
#
# Note: the raid-check script intentionally runs last in the cron.weekly
# sequence.  This is so we can wait for all the resync operations to complete
# and then check the mismatch_count on each array without unduly delaying
# other weekly cron jobs.  If any arrays have a non-0 mismatch_count after
# the check completes, we echo a warning to stdout which will then be emailed
# to the admin as long as mails from cron jobs have not been redirected to
# /dev/null.  We do not wait for repair operations to complete as the
# md stack will correct any mismatch_cnts automatically.
#
# Note2: you can not use symbolic names for the raid devices, such as
# /dev/md/root.  The names used in this file must match the names seen in
# /proc/mdstat and in /sys/block.

ENABLED=yes
CHECK=check
# To check devs /dev/md0 and /dev/md3, use "md0 md3"
CHECK_DEVS=""
REPAIR_DEVS=""
SKIP_DEVS=""
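
You can also kick off a scrub by hand instead of waiting for the cron 
job, by writing "check" to the array's sync_action file in sysfs.  A 
minimal sketch follows; the function names are made up, the sysfs 
root is overridable so the demo runs against a scratch directory 
instead of a real array, and note that on the kernels I know of the 
counter file is spelled mismatch_cnt.

```shell
#!/bin/bash
# Start a read-only scrub of one md array (e.g. md0).
scrub_start() {
    local md=$1 root=${2:-/sys/block}
    echo check > "$root/$md/md/sync_action"
}

# Show the current sync action and the mismatch counter for one array.
scrub_report() {
    local md=$1 root=${2:-/sys/block}
    local action count
    action=$(cat "$root/$md/md/sync_action")
    count=$(cat "$root/$md/md/mismatch_cnt")
    echo "$md: sync_action=$action mismatch_cnt=$count"
}

# Demo against a scratch directory that imitates the sysfs layout.
fake=$(mktemp -d)
mkdir -p "$fake/md0/md"
echo idle > "$fake/md0/md/sync_action"
echo 0 > "$fake/md0/md/mismatch_cnt"
scrub_start md0 "$fake"
scrub_report md0 "$fake"
rm -rf "$fake"
```

On a real system you would call scrub_start md0 as root and watch 
/proc/mdstat for progress; once sync_action goes back to "idle", 
mismatch_cnt holds the result of the check.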

Finally, if you do come across bad sectors on a device, they will 
usually only be reallocated to a spare sector on the media when the 
sector is written to, not when it is merely read.  As a last-ditch "I 
really want to continue using this bad disk, please just reallocate 
the bad sectors and I'll deal with the data that is already lost" 
measure, you can try to force a reallocation with hdparm 
--repair-sector:

       --repair-sector
              This is an alias for the --write-sector flag.  VERY DANGEROUS.
       --write-sector
              Writes  zeros  to  the specified sector number.  VERY DANGEROUS.
              The sector number  must  be  given  (base10)  after  this  flag.
              hdparm  will  issue  a low-level write (completely bypassing the
              usual block layer read/write mechanisms) to the  specified  sec-
              tor.   This  can be used to force a drive to repair a bad sector
              (media error).

But I would really think about replacing any disks that have developed 
such bad sectors, as it is only a matter of time before you run out of 
spare sectors to reallocate to.
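
If you do go down that road, the sequence I would expect is: read the 
sector first to confirm it really fails, overwrite it, then look at 
the SMART counters again.  The sketch below only prints the commands 
and never touches a disk.  The function name, device and sector 
number are placeholders, and as far as I know hdparm refuses 
--write-sector unless --yes-i-know-what-i-am-doing is also given.

```shell
#!/bin/bash
# Print (do not run) the commands for forcing one bad sector to be rewritten.
rewrite_sector_plan() {
    local dev=$1 lba=$2
    # hdparm wants the sector number in base 10; reject anything else.
    case $lba in
        ''|*[!0-9]*) echo "LBA must be a base-10 number" >&2; return 1 ;;
    esac
    # 1. Confirm the sector is unreadable (expect an I/O error here).
    echo "hdparm --read-sector $lba $dev"
    # 2. Overwrite it with zeros; whatever data was there is gone.
    echo "hdparm --yes-i-know-what-i-am-doing --write-sector $lba $dev"
    # 3. Re-check the reallocated/pending counters afterwards.
    echo "smartctl -A $dev"
}

rewrite_sector_plan /dev/sdX 123456789
```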

Chuck Anderson