February 18th, 2011, 18:12
February 18th, 2011, 22:17
nerdbot wrote:I setup rsnapshot to run nightly incremental backups of my home PCs, and everything was working fine for awhile
nerdbot wrote:until one day the nightly backups stopped working.
nerdbot wrote:I forget now what tipped me off to the problem, but needless to say, I ran an e2fsck immediately on the /dev/md0 device.
nerdbot wrote:There appeared to be massive filesystem corruption: orphaned inodes, block bitmap differences, incorrect directory counts, incorrect free inode counts, etc. I wrote a script that would run e2fsck -y repeatedly until it came back clean twice in a row. A couple of days and 20 or so e2fsck runs later, it finally did.
nerdbot wrote:At the time, I didn't think to check for bad blocks; instead, I continued using the drives (though with much more caution and with less critical information stored on them). A few months after the first incident, it happened again, with similar results: double-digit e2fsck runs before the RAID array was considered clean.
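nerdbot wrote:For what it's worth, the loop was roughly like the sketch below (reconstructed from memory, so the details are approximate; /dev/md0 is my array and the filesystem was unmounted while it ran):

#!/bin/bash
# Rough reconstruction of my "run e2fsck until clean twice" loop.
# Assumes the filesystem on /dev/md0 is NOT mounted.
DEV=/dev/md0
clean=0
while [ "$clean" -lt 2 ]; do
    if e2fsck -f -y "$DEV"; then   # exit code 0 = no errors found
        clean=$((clean + 1))
    else
        clean=0                    # errors were found/fixed, start counting again
    fi
done
echo "Two consecutive clean e2fsck passes on $DEV"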
nerdbot wrote:At this point, I did do a badblocks check on each of the drives in the raid array and was surprised to see that neither drive had any bad blocks.
nerdbot wrote:I want to avoid these headaches with the new setup, and the original experience left me with a few questions:
nerdbot wrote:1.) When my RAID1 array had all those filesystem issues, I realized I didn't quite understand how bad blocks would affect a RAID1 array. I was concerned that perhaps the bad block from one drive would be synced/replicated on to the other drive, possibly copying corrupt data to the "good drive" and causing the FS corruptions. After doing a bit of reading, it is now my understanding that each drive handles the remapping internally, and in most cases, it should not affect the RAID array?
nerdbot wrote:2.) When I had those filesystem issues, I ran e2fsck against the raid device (/dev/md0) but the badblocks command against the actual drives in the array (/dev/sda and /dev/sdb). I wasn't sure if this was correct, but it made sense to me at the time. Was this the correct thing to do?
nerdbot wrote:3.) I ran MHDD on the old 1TB Seagate drives, and while the SMART ATT results and the surface scan confirmed there were no bad blocks on either drive, both drives had about a dozen brown (slow-response) blocks. SMART ATT reported 0 for Reallocated Sectors and Reallocation Event Count, but the raw values for Seek Error Rate and Hardware ECC Recovered were numbers like 824678 when the thresholds were something like 100. I assume this isn't a good sign?
nerdbot wrote:4.) After the first filesystem corruption, I changed my backup script to mount the RAID1 array right before the backup process and umount it when the backup completed (roughly as in the sketch after this list). I also ran e2fsck every couple of days via cron, just to keep an eye out for further corruption. Could the repeated mounting/unmounting and the much more frequent e2fsck runs shorten the lifespan of a drive?
nerdbot wrote:5.) Any additional advice to keep my drives healthy and running well? I'm concerned about the ext3 filesystem corruptions, so I'm thinking of trying ext4 this time around...
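nerdbot wrote:For reference (re: point 4), the wrapper is roughly like this; the device, mount point and rsnapshot interval are placeholders for my actual setup:

#!/bin/bash
# Mount the array, run the nightly rsnapshot increment, then unmount again.
set -e
DEV=/dev/md0
MNT=/mnt/backup
trap 'umount "$MNT" 2>/dev/null || true' EXIT   # unmount on exit, even if rsnapshot fails
mount "$DEV" "$MNT"
rsnapshot daily            # "daily" must match a backup interval defined in rsnapshot.conf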
February 19th, 2011, 0:33
Vulcan wrote:Several (most) of your questions are really Linux sysadmin questions and therefore out of scope for this board.
Vulcan wrote:Also, it would need lots more detail - command logs, messages files etc. - before I could even start to suggest a specific hypothesis, and that level of work isn't something I can offer for free.
Vulcan wrote:nerdbot wrote:I setup rsnapshot to run nightly incremental backups of my home PCs, and everything was working fine for awhile
How long is "awhile", i.e. how long was this working without apparent problems? I ask because this is a very intermittent issue - you mention below that it then worked again for months after the first "event".
Vulcan wrote:nerdbot wrote:I forget now what tipped me off to the problem, but needless to say, I ran an e2fsck immediately on the /dev/md0 device.
My guess would be that whatever error messages you were getting pointed you in the direction of a filesystem problem - otherwise why would you start by running (e2)fsck? As I said, we're missing all the detail of the history of this (application errors, messages files etc.), which prevents an efficient analysis now. A classic symptom of this kind of corruption, for example, is a listing where the inode can't be read and everything shows up as question marks:
[user@server ~]$ ls -al some_file
?????????? ? ??????? ??????? 891 Apr 21 2007 some_file
Vulcan wrote:Was the filesystem unmounted when you were doing this?
Vulcan wrote:This can be a complex area, but is irrelevant to this case since you explained there were no bad blocks. Yes, remapping of blocks is handled internally to modern disks, but that's not the whole story, since remapping behaviour (for read errors) depends on the level of difficulty reading a block and whether that block is later written to. It's complicated...
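Vulcan wrote:If you want to keep an eye on that behaviour yourself, the remapping-related SMART counters can be pulled with smartctl; the device names here are just examples:

# A sector that can't be read shows up in Current_Pending_Sector; it only moves
# into Reallocated_Sector_Ct once the drive remaps it, typically after a rewrite.
for d in /dev/sda /dev/sdb; do
    echo "== $d =="
    smartctl -A "$d" | grep -Ei 'Reallocated_Sector|Reallocated_Event|Current_Pending|Offline_Uncorrectable'
done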
Vulcan wrote:In answer to your specific question, yes, you chose the correct devices, but did you unmount the filesystem before running e2fsck and badblocks? And was your badblocks run a read-only or a destructive one? You would be more likely to get further advice on this from Linux sysadmin folks.
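Vulcan wrote:For reference, a safe, read-only version of that pass would look something like this (device names taken from your description):

umount /dev/md0               # the filesystem must not be mounted while checking
e2fsck -f -n /dev/md0         # -n answers "no" to every fix, so it only reports
badblocks -sv /dev/sda        # badblocks' default mode is a read-only scan
badblocks -sv /dev/sdb
# (badblocks -w is the destructive write test - only for disks holding nothing you care about)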
Vulcan wrote:This is likely to be normal. You don't provide the hard data you refer to for anyone to review, so it's impossible to be certain. However, you seem to be reporting "raw" values, and on Seagate drives several raw values encode rate information, which results in large decimal values even when the behaviour is normal. Again, you'd need to provide hard data to get a full analysis.
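Vulcan wrote:To see the difference, compare the normalised VALUE/WORST/THRESH columns against RAW_VALUE in the smartctl attribute table (again, the device name is only an example):

# The drive is only considered failing when VALUE drops to or below THRESH;
# RAW_VALUE for Seek_Error_Rate / Hardware_ECC_Recovered is vendor-encoded and
# is normally a huge number on Seagates.
smartctl -A /dev/sda | grep -E 'ATTRIBUTE_NAME|Raw_Read_Error_Rate|Seek_Error_Rate|Hardware_ECC_Recovered'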
Vulcan wrote:nerdbot wrote:4.) After the first filesystem corruption, I changed my backup script to mount the RAID1 array right before the backup process and umount the array when the backup process completed. I also ran e2fsck every couple days via cron, just to keep an eye out for additional corruptions. Could the repeated mount/unmount and much more frequent e2fscks shorten the lifespan of a drive?
No, that level of I/O is insignificant.
Vulcan wrote:In the absence of any hardware error messages, my experience of FS corruption is that the host (or human error) is much more likely to be the cause, than the disks. There are ways in which disks could cause FS corruption, but they are very rare, and by running e2fsck to fix the reported corruption, you destroyed any evidence of the exact FS corruption that your support provider could have used for deeper analysis.
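Vulcan wrote:Next time, capture the damage read-only before you let anything repair it, e.g. along these lines (the file names are just suggestions):

# Read-only snapshot of the filesystem's state before any repair is attempted,
# so there is still evidence left to analyse afterwards.
umount /dev/md0
e2fsck -f -n /dev/md0 2>&1 | tee fsck-report-$(date +%F).log   # reports only, fixes nothing
dumpe2fs -h /dev/md0 > superblock-$(date +%F).txt              # superblock/feature summary
dmesg > dmesg-$(date +%F).txt                                  # any kernel I/O errors at the time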
Vulcan wrote:In summary: My gut feeling (but without any actual hard data being provided from your system), is that I would be more concerned about your PC (including HBA, HBA drivers, SATA cabling, power etc.) than your disks. Is your PC memory parity-checked or ECC or neither?
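Vulcan wrote:If you're not sure what sort of memory it is, dmidecode will usually tell you (run as root; the exact wording varies a little by BIOS):

# "Error Correction Type" in the Physical Memory Array section reads
# None / Single-bit ECC / Multi-bit ECC etc.
dmidecode -t memory | grep -i 'error correction'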
Vulcan wrote:If you haven't been doing it already, IMHO you need to keep total records from your system (host messages, logs of command output when you run e2fsck, badblocks (read-only), smartctl etc.) from now onwards, to allow for further investigation when the next "problem" occurs.
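Vulcan wrote:A simple nightly cron job is enough to build up that record; the paths and device names below are only an illustration:

#!/bin/bash
# Nightly health log: SMART data, md array state and recent kernel messages,
# written to dated files so there is a history to look back through.
LOGDIR=/var/log/disk-health
STAMP=$(date +%F)
mkdir -p "$LOGDIR"
for d in /dev/sda /dev/sdb; do
    smartctl -a "$d" >> "$LOGDIR/smart-$STAMP.log" 2>&1
done
mdadm --detail /dev/md0 >> "$LOGDIR/md-$STAMP.log" 2>&1
dmesg | tail -n 200 >> "$LOGDIR/dmesg-$STAMP.log"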
Vulcan wrote:I'm glad I was able to answer some of your questions and I hope this helps. Good luck.
February 19th, 2011, 11:13
nerdbot wrote:Hi Vulcan, thank you so much for your responses, I greatly appreciate it.
nerdbot wrote:Vulcan wrote:nerdbot wrote:I setup rsnapshot to run nightly incremental backups of my home PCs, and everything was working fine for awhile
How long is "awhile", i.e. how long was this working without apparent problems? I ask because this is a very intermittent issue - you mention below that it then worked again for months after the first "event".
About 6 months.
nerdbot wrote:My original, uneducated fear was that a bad block in a RAID array was like a "virus" that would spread to the other devices simply by existing on one drive and then being synced across to the others by the RAID mechanism.
nerdbot wrote:Vulcan wrote:In the absence of any hardware error messages, my experience of FS corruption is that the host (or human error) is much more likely to be the cause, than the disks. There are ways in which disks could cause FS corruption, but they are very rare, and by running e2fsck to fix the reported corruption, you destroyed any evidence of the exact FS corruption that your support provider could have used for deeper analysis.
Yeah, I know I could've taken a dd image of the drive to another disk to save the data, but at the time I didn't have another 2TB drive on hand.
nerdbot wrote:Vulcan wrote:In summary: My gut feeling (but without any actual hard data being provided from your system), is that I would be more concerned about your PC (including HBA, HBA drivers, SATA cabling, power etc.) than your disks. Is your PC memory parity-checked or ECC or neither?
It's actually really interesting that you mention this, and it never occurred to me until literally just now. I left out another part of the story because I thought it was irrelevant, since it had nothing to do with the hard drives directly. When I had the last round of problems with the 1TB Seagate drives, I decided to stop using my backup server entirely until I could replace the drives (despite the fact that I didn't find any bad blocks). I turned off the server (at least I think I did) and planned to rebuild my RAID1 array once 2TB drives became cheap enough. That was about 3-4 months ago, and I just purchased my 2TB drives about a week ago. When I turned my server back on, it wouldn't even POST. After much debugging, I determined the northbridge chipset had been fried - long story short, the motherboard's LED diagnostic code indicated an error on the AGP/PCI bus, which I read is controlled by the northbridge, and the cooling fan on the northbridge wasn't working anymore either. The thing probably overheated at some point.
So, I also just purchased a new motherboard/cpu/memory combo. Your comment just made me realize that perhaps it was my other hardware (that ultimately failed) that caused my problems.
February 21st, 2011, 12:53
Vulcan wrote:Interesting - shame it wasn't running OK for longer, but it is what it is. That sort of time frame is not different enough from "a few months" (which is the time between the 1st and 2nd problems) for me to draw conclusions with confidence - and given that the underlying system hardware has been replaced, hypothesising now is probably a waste of time.
Vulcan wrote:I guess you don't want to delay having the new system available to do backups for too long, but I'd recommend a concerted burn-in effort on that new hardware, since anything new has the potential to carry a latent fault.
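Vulcan wrote:For the disks, a reasonable burn-in before trusting them with real data would be something like the following - note that badblocks -w is destructive, so this is for brand-new, empty drives only:

# Full write/read pattern test of each new drive, then a long SMART self-test.
# badblocks -w DESTROYS all data on the device - new/empty disks only!
for d in /dev/sda /dev/sdb; do
    badblocks -svw "$d"
    smartctl -t long "$d"          # runs inside the drive firmware; check results later with:
    # smartctl -l selftest "$d"
done
# And give the new RAM a few full passes of memtest86+ before going live.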
<snip>
That's all I've got time to think about on this right now. Hope that's some help. Good luck.
February 21st, 2011, 14:30
nerdbot wrote:Vulcan wrote:Interesting - shame it wasn't running OK for longer, but it is what it is. That sort of time frame is not different enough from "a few months" (which is the time between the 1st and 2nd problems) for me to draw conclusions with confidence - and given that the underlying system hardware has been replaced, hypothesising now is probably a waste of time.
I guess I should say "6 months running as a backup server". The original hardware was my previous desktop PC, which I had running for probably 2-3 years.
nerdbot wrote:Again, thank you so much for taking the time to respond and for your advice and suggestions, it has been really helpful!