Software RAID and bad blocks questions
Posted: February 18th, 2011, 18:12
Hi guys,
Please bear with me through the somewhat lengthy background info before the actual questions.
I currently have a Linux backup server with 2 Seagate 1TB drives in a software RAID1 configuration using ext3. I set up rsnapshot to run nightly incremental backups of my home PCs, and everything was working fine for a while until one day the nightly backups stopped working.
Since this was several months ago, I forget now what tipped me off to the problem, but I ran an e2fsck immediately on the /dev/md0 device. There appeared to be massive filesystem corruption: orphaned inodes, block bitmap differences, incorrect directory counts, incorrect free inode counts, etc. I wrote a script that would run e2fsck -y repeatedly until it ran cleanly twice. A couple of days and 20 or so e2fscks later, it finally came up clean.
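For anyone curious what such a loop looks like, here is a minimal sketch. It is not the original script: the function name fsck_until_clean, the run limit, and the reading of "twice cleanly" as two clean runs in a row are all assumptions made for illustration. It relies only on the fact that e2fsck exits with status 0 when it finds no errors and non-zero otherwise (1 means errors were corrected).

```shell
#!/bin/sh
# Hypothetical helper (illustrative, not the original script):
# rerun a checker command until it exits 0 twice in a row,
# giving up after MAX_RUNS attempts.
# Usage: fsck_until_clean MAX_RUNS command [args...]
fsck_until_clean() {
    max_runs=$1
    shift
    clean_streak=0
    run=0
    while [ "$run" -lt "$max_runs" ]; do
        run=$((run + 1))
        "$@"                      # run the checker exactly as given
        status=$?
        if [ "$status" -eq 0 ]; then
            clean_streak=$((clean_streak + 1))
        else
            clean_streak=0        # any non-zero status resets the streak
        fi
        echo "run $run: exit status $status, clean streak $clean_streak"
        if [ "$clean_streak" -ge 2 ]; then
            return 0              # two consecutive clean runs
        fi
    done
    return 1                      # never got two clean runs in a row
}
```

It would be called with the filesystem unmounted, for example as `fsck_until_clean 30 e2fsck -f -y /dev/md0`. The -f flag is an assumption on my part: without it, e2fsck skips a filesystem that is already marked clean, so the second "clean" run would not really check anything.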
At the time, I didn't think to check for bad blocks, and instead, I continued using my drives (though with much more caution and with less critical information stored on them). A few months after the first incident, it happened again, with similar results: double-digit e2fscks until the RAID array was considered clean. At this point, I did do a badblocks check on each of the drives in the RAID array and was surprised to see that neither drive had any bad blocks. Regardless of this result, I decided to buy 2 new drives because I just didn't trust my 1TB Seagate drives any more. I hadn't discovered MHDD yet.
I recently bought 2 new Hitachi 5K3000 2TB drives to run in a RAID1 setup in my Linux backup server, and also learned about MHDD around the same time. The first thing I did with these drives (before setting them up in any way) was run MHDD on them. Both came up with reasonably acceptable numbers (everything fell under 150 ms).
I want to avoid these headaches with the new setup, and the original experience left me with a few questions:
1.) When my RAID1 array had all those filesystem issues, I realized I didn't quite understand how bad blocks would affect a RAID1 array. I was concerned that perhaps the bad block from one drive would be synced/replicated on to the other drive, possibly copying corrupt data to the "good drive" and causing the FS corruptions. After doing a bit of reading, it is now my understanding that each drive handles the remapping internally, and in most cases, it should not affect the RAID array?
2.) When I had those filesystem issues, I ran e2fsck against the RAID device (/dev/md0) but the badblocks command against the actual drives in the array (/dev/sda and /dev/sdb). I wasn't sure if this was correct, but it made sense to me at the time. Was this the correct thing to do?
3.) I ran MHDD on the old 1TB Seagate drives, and while the SMART ATT results and the surface scan confirmed there were no bad blocks on either drive, both drives had about a dozen brown (slow) blocks. SMART ATT reported 0 for Reallocated Sectors and Reallocation Event Count, but Seek Error Rate and Hardware ECC Recovered had numbers like '824678' when the thresholds were something like 100. I assume this isn't a good sign?
4.) After the first filesystem corruption, I changed my backup script to mount the RAID1 array right before the backup process and unmount it when the backup process completed. I also ran e2fsck every couple of days via cron, just to keep an eye out for additional corruption. Could the repeated mount/unmount and much more frequent e2fscks shorten the lifespan of a drive?
5.) Any additional advice to keep my drives healthy and running well? I'm concerned about the ext3 filesystem corruptions, so I'm thinking of trying ext4 this time around...
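To make the arrangement in question 4 concrete, here is a minimal sketch of a mount-before, unmount-after wrapper. It is illustrative only, not the script actually in use: the function name run_backup_mounted and the mount point /mnt/backup are hypothetical, and it assumes the mount point has an /etc/fstab entry so that `mount /mnt/backup` works without naming the device.

```shell
#!/bin/sh
# Hypothetical wrapper (illustrative): mount a filesystem, run a command,
# then unmount again, whether or not the command succeeded.
# Usage: run_backup_mounted MOUNT_POINT command [args...]
run_backup_mounted() {
    mount_point=$1
    shift
    # Assumes an /etc/fstab entry exists for this mount point.
    mount "$mount_point" || return 1
    "$@"                              # the backup command itself
    backup_status=$?
    umount "$mount_point" || return 1
    return "$backup_status"           # report the backup's own result
}
```

A nightly cron job could then call something like `run_backup_mounted /mnt/backup rsnapshot daily`, where `daily` stands for whatever interval name is defined in rsnapshot.conf.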
Thanks!