Dec Meeting announcement / linux-raid problem
I just realized that I haven't received anything from WLUG in over a month, including the announcement for this past meeting. Did anybody else have this problem, or is it just me?

Now, the main reason for my email. My mythbox / home file server runs linux soft-raid in RAID5 with 3 500GB disks. One disk went bad a while ago and the array never rebuilt itself. The other day, the second disk went bad. Am I hosed? I've been googling for 'rebuild bad linux software raid' but all I get is the rebuild command. Also, I don't see any tools that will move bad data to another spot on the disk. This is my first time using software raid, so I'm in a bit over my head.

thanks!

-- Eric Martin
It appears there has been very little traffic as of late. Did you compare the messages you've received to the wlug archives? http://mail.wlug.org/pipermail/wlug/

As for your RAID issue, I don't have any advice, as I have virtually no experience with it. I assume that if you had backups you wouldn't have bothered writing the group.

On Fri, Dec 17, 2010 at 6:30 AM, Eric Martin <eric.joshua.martin@gmail.com> wrote:
I just realized that I haven't received anything from WLUG in over a month, including the announcement for this past meeting. Did anybody else have this problem or is it just me?
Now, the main reason for my email. My mythbox / home file server runs linux soft-raid in RAID5 with 3 500GB disks. One disk went bad a while ago and the array never rebuilt itself. The other day, the second disk went bad. Am I hosed? I've been googling for 'rebuild bad linux software raid' but all I get is the rebuild command. Also, I don't see any tools that will move bad data to another spot on the disk. This is my first time using software raid so I'm in a bit over my head.
thanks!
-- Eric Martin
I have backups of some stuff, but I have a program that isn't great about writing stuff properly, so I wanted to see if I could get more. If I can't, it's OK; the backups will suffice. I was just hoping for more :) The replacement is on its way, so this will be a moot point pretty soon.

On Sat, Dec 18, 2010 at 11:28 AM, James Gray <jamespgray@gmail.com> wrote:
It appears there has been very little traffic as of late. Did you compare the messages you've received to the wlug archives? http://mail.wlug.org/pipermail/wlug/
As for your RAID issue, I don't have any advice, as I have virtually no experience with it. I assume that if you had backups you wouldn't have bothered writing the group.
On Fri, Dec 17, 2010 at 6:30 AM, Eric Martin <eric.joshua.martin@gmail.com> wrote:
I just realized that I haven't received anything from WLUG in over a
month, including the announcement for this past meeting. Did anybody else have this problem or is it just me?
Now, the main reason for my email. My mythbox / home file server runs
linux soft-raid in RAID5 with 3 500GB disks. One disk went bad a while ago and the array never rebuilt itself. The other day, the second disk went bad. Am I hosed? I've been googling for 'rebuild bad linux software raid' but all I get is the rebuild command. Also, I don't see any tools that will move bad data to another spot on the disk. This is my first time using software raid so I'm in a bit over my head.
thanks!
-- Eric Martin
-- Eric Martin
I don't run software raid, so I'm not sure how it handles filesystem errors, but I've used ddrescue to recover drives/arrays that were written off. It's fairly easy to use but can take quite a while, depending on what the drive failure was. It will usually chew through simple bad sectors in a bad place pretty quickly, though.

Take care,

James

On 12/18/2010 1:01 PM, Eric Martin wrote:
I have backups of some stuff, but I have a program that isn't great about writing stuff properly, so I wanted to see if I could get more. If I can't, it's OK; the backups will suffice. I was just hoping for more :) The replacement is on its way, so this will be a moot point pretty soon.
On Sat, Dec 18, 2010 at 11:28 AM, James Gray <jamespgray@gmail.com> wrote:
It appears there has been very little traffic as of late. Did you compare the messages you've received to the wlug archives? http://mail.wlug.org/pipermail/wlug/
As for your RAID issue, I don't have any advice, as I have virtually no experience with it. I assume that if you had backups you wouldn't have bothered writing the group.
On Fri, Dec 17, 2010 at 6:30 AM, Eric Martin <eric.joshua.martin@gmail.com> wrote:
>
> I just realized that I haven't received anything from WLUG in over a month, including the announcement for this past meeting. Did anybody else have this problem or is it just me?
>
> Now, the main reason for my email. My mythbox / home file server runs linux soft-raid in RAID5 with 3 500GB disks. One disk went bad a while ago and the array never rebuilt itself. The other day, the second disk went bad. Am I hosed? I've been googling for 'rebuild bad linux software raid' but all I get is the rebuild command. Also, I don't see any tools that will move bad data to another spot on the disk. This is my first time using software raid so I'm in a bit over my head.
>
> thanks!
>
> --
> Eric Martin
-- Eric Martin
"Eric" == Eric Martin <eric.joshua.martin@gmail.com> writes:
Eric> Now, the main reason for my email. My mythbox / home file
Eric> server runs linux soft-raid in RAID5 with 3 500GB disks. One
Eric> disk went bad a while ago and the array never rebuilt itself.

So you only had two 500Gb disks left in the array?

Eric> The other day, the second disk went bad. Am I hosed?

Possibly, it depends on how bad the second disk is. What I would do is try to use dd_rescue to copy the 2nd bad disk onto a new disk (possibly your original bad disk if you feel brave!) and then try to re-assemble your raid 5 using that. You might or might not have corruption in the filesystem, so make sure you run an fsck on it.

Now, in the future, you should run a weekly check of the array for bad blocks or other problems, so that you get notified if a disk dies silently. I use the following crontab entry:

#
# cron.d/mdadm -- schedules periodic redundancy checks of MD devices
#
# Copyright © martin f. krafft <madduck@madduck.net>
# distributed under the terms of the Artistic Licence 2.0
#
# By default, run at 00:57 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).
57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi

and I get a nice weekly report on both arrays on my main fileserver at home.

Eric> I've been googling for 'rebuild bad linux software raid' but all
Eric> I get is the rebuild command. Also, I don't see any tools that
Eric> will move bad data to another spot on the disk. This is my
Eric> first time using software raid so I'm in a bit over my head.

The first thing is to ask for help on the linux-raid mailing list, which is hosted on vger.kernel.org.

But something you can do to help is to give us more information, like:

cat /proc/mdstat

mdadm -E /dev/sd...

or /dev/hd..., depending on whether you have SATA or IDE drives. Basically, use the devices you got from the /proc/mdstat output as your basis.

Give us this output, and we should be able to help you more.

John
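For reference, a minimal sketch of what such a periodic check boils down to on any distribution, assuming an array named /dev/md0 (the device name is an example, not one from this thread):

echo check > /sys/block/md0/md/sync_action    # start a background consistency check of the array
cat /proc/mdstat                              # watch the check progress
cat /sys/block/md0/md/mismatch_cnt            # non-zero afterwards means parity and data disagreed somewhere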
On Mon, Dec 20, 2010 at 4:52 PM, John Stoffel <john@stoffel.org> wrote:
"Eric" == Eric Martin <eric.joshua.martin@gmail.com> writes:
Eric> Now, the main reason for my email. My mythbox / home file Eric> server runs linux soft-raid in RAID5 with 3 500GB disks. One Eric> disk went bad a while ago and the array never rebuilt itself.
So you only had two 500Gb disks left in the array?
Correct. I just received my new disk yesterday and dd_rescued the data from the bad disk onto a fresh disk. The array won't start, so here's the info you asked for.
Eric> The other day, the second disk went bad. Am I hosed?
Possibly, it depends on how bad the second disk is. What I would do is try to use dd_rescue to copy the 2nd bad disk onto a new disk (possibly your original bad disk if you feel brave!) and then try to re-assemble your raid 5 using that. You might or might not have corruption in the filesystem, so make sure you run an fsck on it.
Now, in the future, you should run a weekly check of the array for bad blocks or other problems, so that you get notified if a disk dies silently. I use the following crontab entry:
BTW, this is what I was missing. There was no warning that my disk was bad! Like I said, I have backups, but the last one failed, so I want something tighter.
Eric> I've been googling for 'rebuild bad linux software raid' but all Eric> I get is the rebuild command. Also, I don't see any tools that Eric> will move bad data to another spot on the disk. This is my Eric> first time using software raid so I'm in a bit over my head.
The first thing is to ask for help on the linux-raid mailing list, which is hosted on vger.kernel.org.
But something you can do to help is to give us more information, like:
cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : inactive sda3[0](S)
      4000064 blocks

md3 : inactive sda4[0]
      483315456 blocks

unused devices: <none>
mdadm -E /dev/sd...
or /dev/hd..., depending on whether you have SATA or IDE drives. Basically, use the devices you got from the /proc/mdstat output as your basis.
Give us this output, and we should be able to help you more.
livecd / # mdadm -E /dev/sda3
/dev/sda3:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 8374ea27:6e191996:e56f6693:e45468a9
  Creation Time : Sat Jul 11 17:14:31 2009
     Raid Level : raid5
  Used Dev Size : 4000064 (3.81 GiB 4.10 GB)
     Array Size : 8000128 (7.63 GiB 8.19 GB)
   Raid Devices : 3
  Total Devices : 2
Preferred Minor : 0

    Update Time : Mon Dec 13 03:59:27 2010
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0
       Checksum : fc28b820 - correct
         Events : 0.496842

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0       8        3        0      active sync   /dev/sda3

   0     0       8        3        0      active sync   /dev/sda3
   1     1       8       19        1      active sync
   2     2       0        0        2      faulty removed

livecd / # mdadm -E /dev/sda4
/dev/sda4:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : c7b07c90:cbd50faf:bc824667:2504996b
  Creation Time : Sat Jul 11 16:52:52 2009
     Raid Level : raid5
  Used Dev Size : 483315456 (460.93 GiB 494.92 GB)
     Array Size : 966630912 (921.85 GiB 989.83 GB)
   Raid Devices : 3
  Total Devices : 2
Preferred Minor : 3

    Update Time : Thu Dec 9 11:13:25 2010
          State : clean
 Active Devices : 1
Working Devices : 1
 Failed Devices : 2
  Spare Devices : 0
       Checksum : d43b9ad8 - correct
         Events : 0.15550817

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0       8        4        0      active sync   /dev/sda4

   0     0       8        4        0      active sync   /dev/sda4
   1     1       0        0        1      faulty removed
   2     2       0        0        2      faulty removed

livecd / # mdadm -E /dev/sdb3
mdadm: cannot open /dev/sdb3: No such file or directory
livecd / # mdadm -E /dev/sdb4
mdadm: cannot open /dev/sdb4: No such file or directory

livecd / # mdadm -E /dev/sdc4
/dev/sdc4:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : c7b07c90:cbd50faf:bc824667:2504996b
  Creation Time : Sat Jul 11 16:52:52 2009
     Raid Level : raid5
  Used Dev Size : 483315456 (460.93 GiB 494.92 GB)
     Array Size : 966630912 (921.85 GiB 989.83 GB)
   Raid Devices : 3
  Total Devices : 2
Preferred Minor : 3

    Update Time : Thu Dec 9 11:13:06 2010
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0
       Checksum : d34e516a - correct
         Events : 0.15550815

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1       8       20        1      active sync

   0     0       8        4        0      active sync   /dev/sda4
   1     1       8       20        1      active sync
   2     2       0        0        2      faulty removed

livecd / # mdadm -E /dev/sdc3
/dev/sdc3:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 8374ea27:6e191996:e56f6693:e45468a9
  Creation Time : Sat Jul 11 17:14:31 2009
     Raid Level : raid5
  Used Dev Size : 4000064 (3.81 GiB 4.10 GB)
     Array Size : 8000128 (7.63 GiB 8.19 GB)
   Raid Devices : 3
  Total Devices : 2
Preferred Minor : 0

    Update Time : Mon Dec 13 03:59:27 2010
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0
       Checksum : fc28b832 - correct
         Events : 0.496842

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1       8       19        1      active sync

   0     0       8        3        0      active sync   /dev/sda3
   1     1       8       19        1      active sync
   2     2       0        0        2      faulty removed

/dev/sda is a good disk, /dev/sdc is the bad disk and /dev/sdb is the good disk that has the clone of /dev/sdc. Curiously, mdadm -E doesn't work on /dev/sdb even though the partitions are setup correctly.

thanks!
John
-- Eric Martin
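A rough sketch of the recovery steps implied by the output above and by the resolution later in the thread; the device and array names come from the /proc/mdstat and mdadm -E output, but the exact commands are illustrative, not a transcript of what was actually run (and they assume a filesystem sits directly on each md device):

partprobe /dev/sdb                                      # the missing /dev/sdb3 and /dev/sdb4 suggest the kernel never re-read the clone's partition table
mdadm --stop /dev/md2                                   # the arrays came up inactive, so stop them before re-assembling
mdadm --stop /dev/md3
mdadm --assemble --force /dev/md2 /dev/sda3 /dev/sdb3   # force assembly from the surviving member plus the clone
mdadm --assemble --force /dev/md3 /dev/sda4 /dev/sdb4
fsck -n /dev/md3                                        # read-only filesystem check before trusting the data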
On Tue, Dec 21, 2010 at 06:42:36AM -0500, Eric Martin wrote:
On Mon, Dec 20, 2010 at 4:52 PM, John Stoffel <john@stoffel.org> wrote:
> "Eric" == Eric Martin <eric.joshua.martin@gmail.com> writes:
Eric> Now, the main reason for my email. My mythbox / home file Eric> server runs linux soft-raid in RAID5 with 3 500GB disks. One Eric> disk went bad a while ago and the array never rebuilt itself.
So you only had two 500Gb disks left in the array?
Correct. I just received my new disk yesterday and dd_rescued the data from the bad disk onto a fresh disk. The array won't start, so here's the info you asked for.
I know it is confusing, but ddrescue and dd_rescue are two different projects that have no relation to each other, other than the fact that they do similar things. GNU ddrescue is the more advanced of the two and I would recommend it over dd_rescue--it may be able to recover more data (unless you use the cumbersome dd_rhelp helper shell script with dd_rescue). Check this out: http://www.toad.com/gnu/sysadmin/index.html#ddrescue
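For reference, a typical GNU ddrescue run is sketched below; the device names and logfile path are placeholders, and the target disk must be at least as large as the source:

ddrescue -f -n /dev/sdX /dev/sdY /root/rescue.log    # first pass: copy everything that reads cleanly, skip the slow fine-grained retries
ddrescue -f -r3 /dev/sdX /dev/sdY /root/rescue.log   # second pass: reuse the same logfile and retry the bad areas up to 3 times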
BTW, this is what I was missing. There was no warning that my disk was bad! Like I said, I have backups, but the last one failed, so I want something tighter.
I run smartd (from smartmontools) and mdmonitor (mdmon) as daemons to monitor for SMART errors and array issues. On Fedora/Red Hat-like systems this is as simple as:

chkconfig mdmonitor on
chkconfig smartd on
service mdmonitor start
service smartd start

(and making sure that root mail goes somewhere useful)

mdmon will monitor the arrays that are specified in /etc/mdadm.conf (which incidentally is written automatically by the Fedora installer) and let you know if any component device of an array has trouble or drops out for any reason. smartd will monitor devices listed in /etc/smartd.conf (or with DEVICESCAN it will scan and monitor all available ATA/SCSI devices on the system) and let you know if you start getting uncorrectable errors, reallocated sectors, etc., which can be a nice early-warning system. I usually replace disks as soon as a single sector is reallocated or "pending".

But even with these, you still should "scrub" the array (try to read every sector on all devices that make up the array) to find errors that might otherwise be missed. I believe that is what the checkarray script that John posted about does. checkarray appears to be a Debian-specific script. My Fedora system has a similar one called raid-check, driven by these two files:

/etc/cron.weekly/99-raid-check
/etc/sysconfig/raid-check

#!/bin/bash
#
# Configuration file for /etc/cron.weekly/raid-check
#
# options:
#   ENABLED - must be yes in order for the raid check to proceed
#   CHECK - can be either check or repair depending on the type of
#           operation the user desires. A check operation will scan
#           the drives looking for bad sectors and automatically
#           repairing only bad sectors. If it finds good sectors that
#           contain bad data (meaning that the data in a sector does
#           not agree with what the data from another disk indicates
#           the data should be, for example the parity block + the other
#           data blocks would cause us to think that this data block
#           is incorrect), then it does nothing but increments the
#           counter in the file /sys/block/$dev/md/mismatch_count.
#           This allows the sysadmin to inspect the data in the sector
#           and the data that would be produced by rebuilding the
#           sector from redundant information and pick the correct
#           data to keep. The repair option does the same thing, but
#           when it encounters a mismatch in the data, it automatically
#           updates the data to be consistent. However, since we really
#           don't know whether it's the parity or the data block that's
#           correct (or which data block in the case of raid1), it's
#           luck of the draw whether or not the user gets the right
#           data instead of the bad data. This option is the default
#           option for devices not listed in either CHECK_DEVS or
#           REPAIR_DEVS.
#   CHECK_DEVS - a space delimited list of devs that the user specifically
#           wants to run a check operation on.
#   REPAIR_DEVS - a space delimited list of devs that the user
#           specifically wants to run a repair on.
#   SKIP_DEVS - a space delimited list of devs that should be skipped
#
# Note: the raid-check script intentionally runs last in the cron.weekly
# sequence. This is so we can wait for all the resync operations to complete
# and then check the mismatch_count on each array without unduly delaying
# other weekly cron jobs. If any arrays have a non-0 mismatch_count after
# the check completes, we echo a warning to stdout which will then be emailed
# to the admin as long as mails from cron jobs have not been redirected to
# /dev/null. We do not wait for repair operations to complete as the
# md stack will correct any mismatch_cnts automatically.
#
# Note2: you can not use symbolic names for the raid devices, such as you
# /dev/md/root. The names used in this file must match the names seen in
# /proc/mdstat and in /sys/block.
ENABLED=yes
CHECK=check
# To check devs /dev/md0 and /dev/md3, use "md0 md3"
CHECK_DEVS=""
REPAIR_DEVS=""
SKIP_DEVS=""

Finally, if you do come across bad sectors on a device, they will usually only be reallocated to a spare sector on the media when the sector is written to rather than just read from. As a last-ditch "I really want to continue using this bad disk, please just reallocate the bad sectors and I'll deal with the data that is already lost" measure, you can try to force a reallocation with hdparm --repair-sector:

   --repair-sector
          This is an alias for the --write-sector flag. VERY DANGEROUS.

   --write-sector
          Writes zeros to the specified sector number. VERY DANGEROUS.
          The sector number must be given (base10) after this flag.
          hdparm will issue a low-level write (completely bypassing the
          usual block layer read/write mechanisms) to the specified
          sector. This can be used to force a drive to repair a bad
          sector (media error).

But I would really think about replacing any disks that have developed such bad sectors, as it is only a matter of time before you run out of spare sectors to reallocate to.
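A minimal sketch of what those two monitoring config files might contain for a setup like this; the mail address and smartd options are illustrative, and the ARRAY line simply reuses the UUID reported by mdadm -E earlier in the thread:

# /etc/mdadm.conf (read by mdadm --monitor / the mdmonitor service)
MAILADDR root@localhost
ARRAY /dev/md3 UUID=c7b07c90:cbd50faf:bc824667:2504996b
# ARRAY lines can also be generated with: mdadm --detail --scan >> /etc/mdadm.conf

# /etc/smartd.conf (read by smartd)
DEVICESCAN -a -m root@localhost -n standby,q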
I forgot to thank everybody for their help. I used dd_rescue to clone the disk and forced mdadm to activate the array, and everything was good. Of course, now my backup hard drive is throwing errors, so it's either heat in the cabinet that's killing hard drives or a bad motherboard. I guess I'll be pulling the back off of my entertainment center...

On Mon, Dec 20, 2010 at 4:52 PM, John Stoffel <john@stoffel.org> wrote:
"Eric" == Eric Martin <eric.joshua.martin@gmail.com> writes:
Eric> Now, the main reason for my email. My mythbox / home file Eric> server runs linux soft-raid in RAID5 with 3 500GB disks. One Eric> disk went bad a while ago and the array never rebuilt itself.
So you only had two 500Gb disks left in the array?
Eric> The other day, the second disk went bad. Am I hosed?
Possibly, it depends on how bad the second disk is. What I would do is try to use dd_rescue to copy the 2nd bad disk onto a new disk (possibly your original bad disk if you feel brave!) and then try to re-assemble your raid 5 using that. You might or might not have corruption in the filesystem, so make sure you run an fsck on it.
Now, in the future, you should run a weekly check of the array for bad blocks or other problems, so that you get notified if a disk dies silently. I use the following crontab entry:
#
# cron.d/mdadm -- schedules periodic redundancy checks of MD devices
#
# Copyright © martin f. krafft <madduck@madduck.net>
# distributed under the terms of the Artistic Licence 2.0
#
# By default, run at 00:57 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).
57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi
and I get a nice weekly report on both arrays on my main fileserver at home.
Eric> I've been googling for 'rebuild bad linux software raid' but all Eric> I get is the rebuild command. Also, I don't see any tools that Eric> will move bad data to another spot on the disk. This is my Eric> first time using software raid so I'm in a bit over my head.
The first thing is to ask for help on the linux-raid mailing list, which is hosted on vger.kernel.org.
But something you can do to help is to give us more information, like:
cat /proc/mdstat
mdadm -E /dev/sd...
or /dev/hd..., depending on whether you have SATA or IDE drives. Basically, use the devices you got from the /proc/mdstat output as your basis.
Give us this output, and we should be able to help you more.
John
-- Eric Martin
participants (5)

- Chuck Anderson
- Eric Martin
- James Gray
- John Stoffel
- soup