Tools for hard drive diagnostics, repair, and data recovery

Software RAID and bad blocks questions

February 18th, 2011, 18:12

Hi guys,

Please bear with me with the somewhat lengthy background info to the actual questions -

I currently have a Linux backup server with 2 Seagate 1TB drives in a software RAID1 configuration using EXT3. I setup rsnapshot to run nightly incremental backups of my home PCs, and everything was working fine for awhile until one day the nightly backups stopped working.

Since this was several months ago, I forget now what tipped me off to the problem, but needless to say, I ran an e2fsck immediately on the /dev/md0 device. There appeared to be massive filesystem corruption with orphaned inodes, block bitmap differences, incorrect directory counts, incorrect free inode counts, etc. I wrote a script that would run e2fsck -y repeatedly until it ran cleanly twice in a row. A couple of days and 20 or so e2fsck runs later, it finally came up clean.
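
The script was essentially the following (a simplified reconstruction, not the exact original - device path and details from memory):

#!/bin/bash
# Keep running "e2fsck -y" on the (unmounted) array until two consecutive
# runs come back clean, i.e. exit status 0. Simplified sketch; /dev/md0
# assumed, no error handling.
DEV=/dev/md0
clean=0
while [ "$clean" -lt 2 ]; do
    if e2fsck -y "$DEV"; then
        clean=$((clean + 1))
    else
        clean=0
    fi
done
echo "Two consecutive clean e2fsck runs on $DEV"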

At the time, I didn't think to check for bad blocks, and instead, I continued on using my drives (though with much more caution and w/ less critical information stored on it). A few months after the first incident, it happened again, with similar results. Double-digit e2fscks until the RAID array was considered clean. At this point, I did do a badblocks check on each of the drives in the raid array and was surprised to see that neither drive had any bad blocks. Regardless of this result, I decided to buy 2 new drives because I just didn't trust my 1TB Seagate drives any more. I hadn't discovered MHDD yet.

I recently bought 2 new Hitachi 5K3000 2TB drives to run in a RAID1 setup in my Linux backup server, and I learned about MHDD around the same time. The first thing I did with these drives (before setting them up in any way) was run MHDD on them. Both came up with reasonably acceptable numbers (everything was under 150ms).

I want to avoid these headaches with the new setup, and the original experience left me with a few questions:

1.) When my RAID1 array had all those filesystem issues, I realized I didn't quite understand how bad blocks would affect a RAID1 array. I was concerned that perhaps the bad block from one drive would be synced/replicated on to the other drive, possibly copying corrupt data to the "good drive" and causing the FS corruptions. After doing a bit of reading, it is now my understanding that each drive handles the remapping internally, and in most cases, it should not affect the RAID array?

2.) When I had those filesystem issues, I ran e2fsck against the raid device (/dev/md0) but the badblocks command against the actual drives in the array (/dev/sda and /dev/sdb). I wasn't sure if this was correct, but it made sense to me at the time. Was this the correct thing to do?

3.) I ran MHDD on the old 1TB Seagate drives, and while the SMART ATT results and the surface scan confirmed there were no bad blocks on either drive, both drives had about a dozen brown slow blocks. SMART ATT reported 0 for reallocated sectors and Reallocation Event Count, but seek error rate and hardware ECC recovered had numbers like '824678' when the thresholds were something like 100. I assume this isn't a good sign?

4.) After the first filesystem corruption, I changed my backup script to mount the RAID1 array right before the backup process and umount the array when the backup process completed. I also ran e2fsck every couple days via cron, just to keep an eye out for additional corruptions. Could the repeated mount/unmount and much more frequent e2fscks shorten the lifespan of a drive?

5.) Any additional advice to keep my drives healthy and running well? I'm concerned about the ext3 filesystem corruptions, so I'm thinking of trying ext4 this time around...

Thanks!

Re: Software RAID and bad blocks questions

February 18th, 2011, 22:17

Several (most) of your questions are really Linux sysadmin questions and therefore out of scope for this board. Also, it would need lots more detail - command logs, messages files etc. - before I could even start to suggest a specific hypothesis, and that level of work isn't something I can offer for free.

Despite that, I've given some analysis for you below.

nerdbot wrote:I setup rsnapshot to run nightly incremental backups of my home PCs, and everything was working fine for awhile

How long is "awhile" i.e. how long was this working for, without apparent problems? I ask that question because this is a very intermittent issue - you mention below that it then worked again for months after the first "event".

nerdbot wrote:until one day the nightly backups stopped working.

The Linux messages files from that specific time would be needed for whoever you get to investigate this to perform any kind of review.

nerdbot wrote:I forget now what tipped me off to the problem, but needless to say, I ran an e2fsck immediately on the /dev/md0 device.

My guess would be that whatever error messages you were getting, pointed you in the direction of a filesystem problem - otherwise why would you start by running (e2)fsck? As I said, we're missing all the detail of the history of this (application errors, messages files etc.), which prevents an efficient analysis now :(

nerdbot wrote:There appeared to be massive filesystem corruption with orphaned inodes, block bitmap differences, incorrect directory counts, incorrect free inode counts, etc. I wrote a script that would run e2fsck -y repeatedly until it ran cleanly twice in a row. A couple of days and 20 or so e2fsck runs later, it finally came up clean.

Was the filesystem unmounted when you were doing this? Also, the exact messages reported by e2fsck would be needed to perform an efficient analysis now. I realise you're trying to help by giving that summary of the types of errors, but without the full output the chances of anyone finding the root cause of that event are much reduced.

Also, although you may not have realised it, you actually had a choice before you ran e2fsck to "fix" the filesystem - either (a) collect data first to allow possible later analysis, or (b) try to correct the filesystem, but in doing so destroy evidence which may have been needed to get to root cause. By running e2fsck to fix the filesystem, you chose the latter.

nerdbot wrote:At the time, I didn't think to check for bad blocks, and instead, I continued on using my drives (though with much more caution and w/ less critical information stored on it). A few months after the first incident, it happened again, with similar results. Double-digit e2fscks until the RAID array was considered clean.

Again, the output from e2fsck and Linux messages files would be needed for this event, to allow for further analysis.

nerdbot wrote:At this point, I did do a badblocks check on each of the drives in the raid array and was surprised to see that neither drive had any bad blocks.

I'm not surprised. In my experience, unreadable blocks would not cause the specific symptoms you describe - but the e2fsck output and Linux messages files would be needed to see if there was something hidden that you didn't mention.

nerdbot wrote:I want to avoid these headaches with the new setup, and the original experience left me with a few questions:

As I said, most of these are Linux sysadmin questions and are better directed to a forum or support provider in that area.

nerdbot wrote:1.) When my RAID1 array had all those filesystem issues, I realized I didn't quite understand how bad blocks would affect a RAID1 array. I was concerned that perhaps the bad block from one drive would be synced/replicated on to the other drive, possibly copying corrupt data to the "good drive" and causing the FS corruptions. After doing a bit of reading, it is now my understanding that each drive handles the remapping internally, and in most cases, it should not affect the RAID array?

This can be a complex area, but is irrelevant to this case since you explained there were no bad blocks. Yes, remapping of blocks is handled internally to modern disks, but that's not the whole story, since remapping behaviour (for read errors) depends on the level of difficulty reading a block and whether that block is later written to. It's complicated...

nerdbot wrote:2.) When I had those filesystem issues, I ran e2fsck against the raid device (/dev/md0) but the badblocks command against the actual drives in the array (/dev/sda and /dev/sdb). I wasn't sure if this was correct, but it made sense to me at the time. Was this the correct thing to do?

In answer to your specific question, yes you chose the correct devices, but did you unmount the filesystem before running e2fsck and badblocks? Was your choice of badblocks command a read-only or a destructive one? You're likely to get more advice from Linux sysadmin folks.

nerdbot wrote:3.) I ran MHDD on the old 1TB Seagate drives, and while the SMART ATT results and the surface scan confirmed there were no bad blocks on either drive, both drives had about a dozen brown slow blocks. SMART ATT reported 0 for reallocated sectors and Reallocation Event Count, but seek error rate and hardware ECC recovered had numbers like '824678' when the thresholds were something like 100. I assume this isn't a good sign?

This is likely to be normal. You don't provide the hard data that you refer to, for anyone to review, hence it's impossible to be certain. However you seem to be reporting "raw" values, and several raw values encode rate information on Seagate drives which results in large decimal values, even when the behaviour is normal. Again, you'd need to provide hard data to get a full analysis.

nerdbot wrote:4.) After the first filesystem corruption, I changed my backup script to mount the RAID1 array right before the backup process and umount the array when the backup process completed. I also ran e2fsck every couple days via cron, just to keep an eye out for additional corruptions. Could the repeated mount/unmount and much more frequent e2fscks shorten the lifespan of a drive?

No, that level of I/O is insignificant. It would need a deeper review of what changes you made to system usage, and when you made them (i.e. a "timeline") for your support provider to see whether there are any clues in the system behaviour over time, about what the underlying problem(s) might be.

nerdbot wrote:5.) Any additional advice to keep my drives healthy and running well? I'm concerned about the ext3 filesystem corruptions, so I'm thinking of trying ext4 this time around...

In the absence of any hardware error messages, my experience of FS corruption is that the host (or human error) is much more likely to be the cause, than the disks. There are ways in which disks could cause FS corruption, but they are very rare, and by running e2fsck to fix the reported corruption, you destroyed any evidence of the exact FS corruption that your support provider could have used for deeper analysis.

Even if your Hitachi disk software RAID1 does not appear to experience the same issues, that does not actually prove that the Seagate drives themselves were faulty - but that sort of non-intuitive conclusion is something that I teach on troubleshooting courses, and too long for here.

In summary: My gut feeling (but without any actual hard data being provided from your system), is that I would be more concerned about your PC (including HBA, HBA drivers, SATA cabling, power etc.) than your disks. Is your PC memory parity-checked or ECC or neither? Are you getting any other unusual / unexplained behaviour, other than the "apparent" FS corruptions? As I said earlier, hard evidence would need to be carefully examined to look for clues - a "problem rate" of once every few months, as you report, will make this a challenge (though not necessarily impossible) for investigation by you or your support provider.

If you haven't been doing it already, IMHO you need to keep total records from your system (host messages, logs of command output when you run e2fsck, badblocks (read-only), smartctl etc.) from now onwards, to allow for further investigation when the next "problem" occurs. Running a regular read-only e2fsck (i.e. without fixing, and with the FS unmounted) will allow you to narrow down the time period in which any FS corruption occurred. However, as soon as you try to fix reported corruption without taking a dd copy of the disks first, you destroy evidence that would be helpful in later analysis.
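
To be concrete about the dd point: if corruption is ever reported again, imaging each member disk before any repair attempt is the evidence-preserving step. A minimal sketch (the target path is only an example and needs at least as much free space as the disk):

# Image one RAID member to a spare disk/file before running any repair.
# conv=noerror,sync keeps dd going past unreadable sectors and pads them,
# so the image stays aligned with the original.
dd if=/dev/sda of=/mnt/spare/sda.img bs=1M conv=noerror,sync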

I'm glad I was able to answer some of your questions and I hope this helps. Good luck.

Re: Software RAID and bad blocks questions

February 19th, 2011, 0:33

Hi Vulcan, thank you so much for your responses, I greatly appreciate it.

Vulcan wrote:Several (most) of your questions are really Linux sysadmin questions and therefore out of scope for this board.

I kind of suspected as much, but wasn't quite sure, since it crosses a bit into low-level hardware territory as well.


Vulcan wrote:Also, it would need lots more detail - command logs, messages files etc. - before I could even start to suggest a specific hypothesis, and that level of work isn't something I can offer for free.


Completely understandable. Originally, I wasn't looking for an in-depth analysis of my exact problem and was "merely" hoping for some general answers, but after reading your response I understand now that with such a complicated situation, there are no easy answers.

Actually, I do have all the logs of the e2fsck runs as well as the other Linux log files. I made my original post while at work, so I didn't have them on hand to include in the post, but in retrospect it's probably a good thing I didn't. As you mentioned, this sort of analysis isn't easy, nor should I expect it for free. If you're curious, I can post a few blocks of errors as examples of what I saw, but by no means do I expect you to do an in-depth analysis for free.

Vulcan wrote:
nerdbot wrote:I setup rsnapshot to run nightly incremental backups of my home PCs, and everything was working fine for awhile

How long is "awhile" i.e. how long was this working for, without apparent problems? I ask that question because this is a very intermittent issue - you mention below that it then worked again for months after the first "event".


About 6 months.

Vulcan wrote:
nerdbot wrote:I forget now what tipped me off to the problem, but needless to say, I ran an e2fsck immediately on the /dev/md0 device.

My guess would be that whatever error messages you were getting, pointed you in the direction of a filesystem problem - otherwise why would you start by running (e2)fsck? As I said, we're missing all the detail of the history of this (application errors, messages files etc.), which prevents an efficient analysis now :(


Now that I've had more time to think about it, I remember exactly what happened. When rsnapshot went to rotate the previous day's backups by renaming the "daily.0" directory to "daily.1", the move would fail with an error message about being unable to move a specific file in "daily.0". When I went to look at that specific file, an "ls -al" showed something like this:

[user@server ~]$ ls -al some_file
?????????? ? ??????? ??????? 891 Apr 21 2007 some_file


In addition to that, the filesystem had been marked read-only. It has been my experience (from a sysadmin standpoint) that this indicates FS corruption and that e2fsck is the "fix".

Vulcan wrote:Was the filesystem unmounted when you were doing this?


Yes, I ran all the e2fscks with the filesystem unmounted.

Vulcan wrote:This can be a complex area, but is irrelevant to this case since you explained there were no bad blocks. Yes, remapping of blocks is handled internally to modern disks, but that's not the whole story, since remapping behaviour (for read errors) depends on the level of difficulty reading a block and whether that block is later written to. It's complicated...


My original, uneducated fear was that a bad block in a RAID array was like a "virus", and would spread bad blocks to other devices just by existing on one device and then being synced across by the RAID mechanism to the other devices. Glad to know this is not the case, though I wonder if that could cause other higher-level problems (like filesystem corruption)?

Vulcan wrote:In answer to your specific question, yes you chose the correct devices, but did you unmount the filesystem before running e2fsck and badblocks? Was your choice of badblocks command a read-only or a destructive one? You're likely to get more advice from Linux sysadmin folks.


Yup, I ran the e2fsck and badblocks with the filesystem unmounted. As I mentioned earlier, I ran "e2fsck -y". As I failed to mention earlier :(, I ran badblocks in read-only mode. Basically, I was just trying to determine whether there actually were any bad blocks.
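
For the record, the invocation was basically this (reconstructed from memory - device names as they were on my system):

# Read-only (default, non-destructive) badblocks scan of each member drive;
# -s shows progress, -v reports errors as they are found.
badblocks -sv /dev/sda
badblocks -sv /dev/sdb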

Vulcan wrote:This is likely to be normal. You don't provide the hard data that you refer to, for anyone to review, hence it's impossible to be certain. However you seem to be reporting "raw" values, and several raw values encode rate information on Seagate drives which results in large decimal values, even when the behaviour is normal. Again, you'd need to provide hard data to get a full analysis.


Ok, I'll try to get the actual values.

Vulcan wrote:
nerdbot wrote:4.) After the first filesystem corruption, I changed my backup script to mount the RAID1 array right before the backup process and umount the array when the backup process completed. I also ran e2fsck every couple days via cron, just to keep an eye out for additional corruptions. Could the repeated mount/unmount and much more frequent e2fscks shorten the lifespan of a drive?

No, that level of I/O is insignificant.


Ah, that's really good to know. I was worried I was going overboard with the constant monitoring.


Vulcan wrote:In the absence of any hardware error messages, my experience of FS corruption is that the host (or human error) is much more likely to be the cause, than the disks. There are ways in which disks could cause FS corruption, but they are very rare, and by running e2fsck to fix the reported corruption, you destroyed any evidence of the exact FS corruption that your support provider could have used for deeper analysis.


Yeah, I know I could've done a DD of the drive to another disk to save the data, though at the time I didn't have another 2TB drive on hand. And, because I wasn't clear on exactly how bad blocks in a RAID1 configuration affect the physical devices, I wasn't sure if one or both drives were at fault (if at all). Ultimately, at the time, I decided that since this was "backup" data, with the original data still intact on the original machines, the data wasn't completely lost yet, and I focused more on fixing the problem and less on the "why" (short-sighted, I admit). Now that I'm getting ready to start again, I'm more interested in the "why", in hopes that it'll help prevent the problem this time around.

Vulcan wrote:In summary: My gut feeling (but without any actual hard data being provided from your system), is that I would be more concerned about your PC (including HBA, HBA drivers, SATA cabling, power etc.) than your disks. Is your PC memory parity-checked or ECC or neither?


It's actually really interesting that you mention this, and it never occurred to me until literally just now. I left out another part of the story because I thought it was irrelevant, since it had nothing to do with the hard drives directly. When I had the last round of problems with the 1TB Seagate drives, I decided to stop using my backup server entirely until I could replace the drives (despite the fact that I didn't find any bad blocks). I turned off the server (at least I think I did), planning to rebuild my RAID1 array once 2TB drives became cheap enough. Well, that was about 3-4 months ago, and I just purchased my 2TB drives about a week ago. When I turned my server back on, it wouldn't even POST. After much debugging, I determined the northbridge chipset had been fried - long story short, the motherboard's LED diagnostic code indicated an error on the AGP/PCI bus, which I read is controlled by the northbridge chipset, and the cooling fan on the northbridge chipset wasn't working anymore either. The thing probably overheated at some point.

So, I also just purchased a new motherboard/cpu/memory combo. Your comment just made me realize that perhaps it was my other hardware (that ultimately failed) that caused my problems.


Vulcan wrote:If you haven't been doing it already, IMHO you need to keep total records from your system (host messages, logs of command output when you run e2fsck, badblocks (read-only), smartctl etc.) from now onwards, to allow for further investigation when the next "problem" occurs.


Yup, like I said earlier, I do have a significant amount of all sorts of logs. However, I was only able to make sense of the Linux logs that told me "I had a problem" - I didn't have enough knowledge about HDD hardware to understand the other logs (e2fsck, SMART information, etc.) and work out the "why". I hadn't discovered MHDD or these forums at that point. ;)

Vulcan wrote:I'm glad I was able to answer some of your questions and I hope this helps. Good luck.

Believe me, I am too! :) Thank you again for taking the time to answer my questions.

Re: Software RAID and bad blocks questions

February 19th, 2011, 11:13

nerdbot wrote:Hi Vulcan, thank you so much for your responses, I greatly appreciate it.

You're very welcome and thanks for the extra info in your reply - I have a better picture of the overall situation now. Brief follow-up below...

nerdbot wrote:
Vulcan wrote:
nerdbot wrote:I setup rsnapshot to run nightly incremental backups of my home PCs, and everything was working fine for awhile

How long is "awhile" i.e. how long was this working for, without apparent problems? I ask that question because this is a very intermittent issue - you mention below that it then worked again for months after the first "event".


About 6 months.

Interesting - shame it wasn't running OK for longer, but it is what it is :) That sort of time frame is not different enough from "a few months" (which is the time between the 1st and 2nd problems) for me to draw conclusions with confidence - and given that the underlying system hardware has been replaced, hypothesising now is probably a waste of time.

nerdbot wrote:My original, uneducated fear was that a bad block in a RAID array was like a "virus", and would spread bad blocks to other devices just by existing on one device and then being synced across by the RAID mechanism to the other devices.

Understandable concern - typically the answer is no. Again, the exact details vary depending on the hardware and on the history of that system (e.g. whether a full resync was done after replacing a disk), and can be complicated - in my experience. You'd do better asking on a Linux sysadmin forum, since at the OS level, an unreadable block in a software RAID is something that has to be coped with. I don't pretend to keep up with the very latest behaviour of SVM/LVM.

However, the point is that Linux can either read a block from at least one side of the mirror, or it can't. If it can read from either side of the mirror then that won't cause FS corruption, because it got the data; if it can't read a given LBA from either side of the mirror, then if that LBA contains FS metadata, the FS will be marked read-only immediately and you'd see errors about the unreadable block in the messages file. That's why bad blocks on their own don't cause the specific behaviour you're reporting: apparent massive FS corruption without an obvious cause.
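
As an aside, if you ever want to check that the two halves of an md mirror actually agree with each other, the md layer can be asked to compare them. A hedged example, assuming the array is md0 and your kernel exposes the md sysfs interface:

# Trigger a background consistency check; md reads both halves of the
# mirror and compares them (progress is visible in /proc/mdstat).
echo check > /sys/block/md0/md/sync_action
# When it finishes, a non-zero count here means the two halves
# disagreed somewhere:
cat /sys/block/md0/md/mismatch_cnt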

nerdbot wrote:
Vulcan wrote:In the absence of any hardware error messages, my experience of FS corruption is that the host (or human error) is much more likely to be the cause, than the disks. There are ways in which disks could cause FS corruption, but they are very rare, and by running e2fsck to fix the reported corruption, you destroyed any evidence of the exact FS corruption that your support provider could have used for deeper analysis.


Yeah, I know I could've done a DD of the drive to another disk to save the data, though at the time I didn't have another 2TB drive on hand.

Completely understandable not to have 2TB disks lying around the place :) Unfortunately it remains a fact that this level of data collection would be needed each time, for you or your support provider to be able to examine the details of the corruption, if you were trying to get to root cause efficiently.

nerdbot wrote:
Vulcan wrote:In summary: My gut feeling (but without any actual hard data being provided from your system), is that I would be more concerned about your PC (including HBA, HBA drivers, SATA cabling, power etc.) than your disks. Is your PC memory parity-checked or ECC or neither?


It's actually really interesting that you mention this, and it never occurred to me until literally just now. I left out another part of the story because I thought it was irrelevant, since it had nothing to do with the hard drives directly. When I had the last round of problems with the 1TB Seagate drives, I decided to stop using my backup server entirely until I could replace the drives (despite the fact that I didn't find any bad blocks). I turned off the server (at least I think I did), planning to rebuild my RAID1 array once 2TB drives became cheap enough. Well, that was about 3-4 months ago, and I just purchased my 2TB drives about a week ago. When I turned my server back on, it wouldn't even POST. After much debugging, I determined the northbridge chipset had been fried - long story short, the motherboard's LED diagnostic code indicated an error on the AGP/PCI bus, which I read is controlled by the northbridge chipset, and the cooling fan on the northbridge chipset wasn't working anymore either. The thing probably overheated at some point.

So, I also just purchased a new motherboard/cpu/memory combo. Your comment just made me realize that perhaps it was my other hardware (that ultimately failed) that caused my problems.

Very interesting! :) Yes, that other hardware would certainly be my suspicion, based on the type of problem you've been seeing.

Since you're replacing all the hardware, further analysis of the original data seems less useful now, especially since further tests can't be run on that old (now dead) motherboard etc.

I guess you don't want to delay having the new system available for backups for too long, but I'd recommend a concerted effort at burn-in on that new hardware, since there is the potential for the new hardware to have a latent fault, as with anything new.

The motherboard manufacturer might have their own diags, or you might have your own favourites :) but whatever you choose to use, I'd suggest not putting the new hardware into use without getting some confidence in it first.

Running memtest86 on the base hardware would be good. I'd also suggest getting Linux installed and then running as many VMs as you can squeeze into physical RAM, with each VM running some sort of test (I don't use memtest86 for this, but that might work too). The point is to cause lots of significant memory address changes, which multi-tasking of memory tests in different VMs certainly does - I've seen that expose latent problems that were not reported when running memtest86 on the underlying hardware.
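
As a rough sketch of what I mean (QEMU/KVM with a memtest86+ ISO is just one way to do it - the ISO path, VM count and memory size are placeholders you'd adjust to fill your physical RAM):

#!/bin/bash
# Launch several VMs, each booting a memory-test ISO, so that most of
# physical RAM is being exercised from different address ranges at once.
ISO=./memtest86+.iso
for i in 1 2 3; do
    qemu-system-x86_64 -enable-kvm -m 2048 -cdrom "$ISO" -boot d &
done
wait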

As with any test - success does not prove everything is OK; failure does mean cause for concern and requires further investigation, but could also be a misreport by the diagnostics.

It's good to hear that you've been keeping the logs of previous problems; don't lose those, in case they are useful for comparison in future.

Regarding the disks - I can't give a flowchart for each possible result, but this would be my basic approach:

Collect their SMART data using a recent version of smartmontools (I'd run "smartctl -a >> file1" and "smartctl -x >> file1" in order not to completely rely on the relatively untested 48bit LBA SMART support in both smartmontools and disk f/w itself). Keep a record of this and all future SMART data collection. Check whether, on your config, running smartctl causes any (benign) error messages - some HBA drivers do report messages, so check and then you know to expect those.
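
For one disk that would be along these lines (device name and log file are examples; add whatever -d option your HBA needs, if any):

date                 >> smart_sda.log   # timestamp each collection
smartctl -a /dev/sda >> smart_sda.log   # standard SMART report
smartctl -x /dev/sda >> smart_sda.log   # extended report, appended to the same history file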

After gathering a baseline of the SMART data, and before using the disks for the first time, read them completely (dd to /dev/null with a reasonably big block size, e.g. 1MB; beyond about 4MB there is little throughput benefit from bigger block sizes, in my experience). Those Hitachi disks probably have a background media scan built in, but a read of the full disk is always a good idea.
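
For example (the destination is /dev/null, so this is a pure read; device name is an example):

# Read the whole disk once, discarding the data - touches every sector
# without writing anything to the disk.
dd if=/dev/sda of=/dev/null bs=1M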

Once the disks are in use, then after each backup to those disks, unmount the filesystem and run those 2 smartctl commands (above) on each disk, looking for any increase in things like reallocated sectors, current pending sectors and interface CRC errors, and check that the temperature is not a concern (e.g. < 40 deg C).
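
For a quick look at the counters of interest, something like this per disk helps (attribute names as smartctl reports them; device name is an example):

# Re-collect with the two smartctl commands above, then eyeball the
# attributes that matter most against the previous snapshot:
smartctl -A /dev/sda | egrep 'Reallocated|Current_Pending|CRC_Error|Temperature'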

Then you can run the badblocks read-only test (or just another dd - they amount to the same thing) every so often (could be each night, or each weekend, depending on available time), and also your read-only e2fsck, to see if you can detect the original problem occurring again - but I bet that it won't, as you've changed so much hardware :)
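
A cron-driven version of that could be as simple as this (devices, paths and the schedule are only examples):

#!/bin/bash
# Illustrative weekly read-check, e.g. dropped into /etc/cron.weekly/.
badblocks -sv /dev/sda >> /var/log/badblocks-sda.log 2>&1   # read-only scan
badblocks -sv /dev/sdb >> /var/log/badblocks-sdb.log 2>&1
umount /mnt/backup
e2fsck -n /dev/md0 >> /var/log/e2fsck-md0.log 2>&1          # check only, never fix
mount /dev/md0 /mnt/backup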

That's all I've got time to think about on this right now. Hope that's some help. Good luck :)

Re: Software RAID and bad blocks questions

February 21st, 2011, 12:53

Vulcan wrote:Interesting - shame it wasn't running OK for longer, but it is what it is :) That sort of time frame is not different enough from "a few months" (which is the time between the 1st and 2nd problems) for me to draw conclusions with confidence - and given that the underlying system hardware has been replaced, hypothesising now is probably a waste of time.


I guess I should say "6 months running as a backup server". The original hardware was my previous desktop PC, which I had running for probably 2-3 years. When I built a new desktop PC, I used the old hardware to run my backup server. Teaches me to be cheap ;)

Vulcan wrote:I guess you don't want to delay having the new system available for backups for too long, but I'd recommend a concerted effort at burn-in on that new hardware, since there is the potential for the new hardware to have a latent fault, as with anything new.
<snip>
That's all I've got time to think about on this right now. Hope that's some help. Good luck :)


Wow, lots of great information in there - I'll definitely look into doing those things you mentioned. Again, thank you so much for taking the time to respond and for your advice and suggestions, it has been really helpful!

Re: Software RAID and bad blocks questions

February 21st, 2011, 14:30

nerdbot wrote:
Vulcan wrote:Interesting - shame it wasn't running OK for longer, but it is what it is :) That sort of time frame is not different enough from "a few months" (which is the time between the 1st and 2nd problems) for me to draw conclusions with confidence - and given that the underlying system hardware has been replaced, hypothesising now is probably a waste of time.


I guess I should say "6 months running as a backup server". The original hardware was my previous desktop PC, which I had running for probably 2-3 years.

Aha, that's interesting :)

While it's not possible to be conclusive (given that there have been only 2 failures, so the sample size is small), this could be interpreted as a system that worked OK for 2-3 years (a relatively long time), then had a (new) problem, and then had a repeat of that problem within a few months (a relatively short time).

IMHO that 2nd, noticeably shorter time-to-failure (compared to the 2-3 years of previously successful running) could fit with the Northbridge fan failing a few months before the first problem. Of course it's a guess, but the change from 2-3 years of successful running to only a few months of successful running (between the two problem events) fits with that possibility.

nerdbot wrote:Again, thank you so much for taking the time to respond and for your advice and suggestions, it has been really helpful!

No problem - it's been a pleasure to think about a well-described problem, where you've obviously put effort into thinking about things too :) Good luck!