Several (in fact most) of your questions are really Linux sysadmin questions and are therefore out of scope for this board. It would also need a lot more detail - command logs, messages files, etc. - before I could even start to suggest a specific hypothesis, and that level of work isn't something I can offer for free.
Despite that, I've given some analysis for you below.
nerdbot wrote:
I setup rsnapshot to run nightly incremental backups of my home PCs, and everything was working fine for awhile
How long is "awhile" i.e. how long was this working for, without apparent problems? I ask that question because this is a very intermittent issue - you mention below that it then worked again for months after the first "event".
nerdbot wrote:
until one day the nightly backups stopped working.
The Linux messages files from that specific time would be needed by whoever you get to investigate this, to perform any real review.
nerdbot wrote:
I forget now what tipped me off to the problem, but needless to say, I ran an e2fsck immediately on the /dev/md0 device.
My guess would be that whatever error messages you were getting pointed you in the direction of a filesystem problem - otherwise why would you start by running (e2)fsck? As I said, we're missing all the detail of the history (application errors, messages files etc.), which prevents an efficient analysis now.

nerdbot wrote:
There appeared to be massive filesystem corruption with orphaned inodes, block bitmap differences, incorrect directory counts, incorrent free inodes, etc. I wrote a script that would run e2fsck -y consecutively until it ran twice cleanly. A couple days and 20 or so e2fscks later, it finally came up clean.
Was the filesystem unmounted when you were doing this? Also, the exact messages reported by e2fsck would be needed to perform an efficient analysis now. I realise you're trying to help by summarising the types of errors, but without the full output the chances of anyone finding the root cause of that event are much reduced.
Also, although you may not have realised it, you actually had a choice before you ran e2fsck to "fix" the filesystem: either (a) collect data first to allow possible later analysis, or (b) try to correct the filesystem but, in doing so, destroy evidence that may have been needed to get to root cause. By running e2fsck to fix the filesystem, you chose the latter.
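For example, a minimal sketch of option (a) - /dev/md0 is the device you quoted, but the mount point and destination path below are just placeholders - is to image the array to separate spare storage before letting e2fsck change anything:
[code]
# Sketch only: preserve evidence BEFORE any repair attempt.
# The image must go to separate storage with enough free space.
umount /mnt/backup          # example mount point - adjust to your system
dd if=/dev/md0 of=/mnt/spare/md0-before-fsck.img bs=1M conv=noerror,sync
[/code]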
nerdbot wrote:
At the time, I didn't think to check for bad blocks, and instead, I continued on using my drives (though with much more caution and w/ less critical information stored on it). A few months after the first incident, it happened again, with similar results. Double-digit e2fscks until the RAID array was considered clean.
Again, the output from e2fsck and Linux messages files would be needed for this event, to allow for further analysis.
nerdbot wrote:
At this point, I did do a badblocks check on each of the drives in the raid array and was surprised to see that neither drive had any bad blocks.
I'm not surprised. In my experience, unreadable blocks would not cause the specific symptoms you describe - but the e2fsck output and Linux messages files would be needed to see if there was something hidden that you didn't mention.
nerdbot wrote:
I want to avoid these headaches with the new setup, and the original experience left me with a few questions:
As I said, most of these are Linux sysadmin questions and are better directed to a forum or support provider in that area.
nerdbot wrote:
1.) When my RAID1 array had all those filesystem issues, I realized I didn't quite understand how bad blocks would affect a RAID1 array. I was concerned that perhaps the bad block from one drive would be synced/replicated on to the other drive, possibly copying corrupt data to the "good drive" and causing the FS corruptions. After doing a bit of reading, it is now my understanding that each drive handles the remapping internally, and in most cases, it should not affect the RAID array?
This can be a complex area, but it is irrelevant to this case since you explained there were no bad blocks. Yes, remapping of blocks is handled internally by modern disks, but that's not the whole story, since remapping behaviour (for read errors) depends on how difficult a block is to read and whether that block is later written to. It's complicated...
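If you want to see what the drives themselves have remapped (or are queueing to remap), smartctl reports the relevant counters. This is only a generic illustration - the attribute list and device names will vary on your system:
[code]
# Read the SMART attribute table from each member disk (read-only)
smartctl -A /dev/sda
smartctl -A /dev/sdb

# Counters of interest for remapping:
#   Reallocated_Sector_Ct   - sectors already remapped to spare areas
#   Current_Pending_Sector  - sectors the drive failed to read and will
#                             remap the next time they are written
#   Offline_Uncorrectable   - sectors that failed an offline surface scan
[/code]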
nerdbot wrote:
2.) When I had those filesystem issues, I ran e2fsck against the raid device (/dev/md0) but the badblocks command against the actual drives in the array (/dev/sda and /dev/sdb). I wasn't sure if this was correct, but it made sense to me at the time. Was this the correct thing to do?
In answer to your specific question, yes, you chose the correct devices - but did you unmount the filesystem before running e2fsck and badblocks? And was your badblocks run read-only or destructive? You're more likely to get further advice on this from Linux sysadmin folks.
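For reference, the different badblocks modes matter a lot here. This is just a sketch using the device names you quoted:
[code]
# Default mode is read-only and safe on disks that hold data
badblocks -sv /dev/sda
badblocks -sv /dev/sdb

# -n (non-destructive read-write) stresses the disk more, and
# -w (write-mode test) DESTROYS all data - never use -w on live disks.
[/code]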
nerdbot wrote:
3.) I ran MHDD on the old 1TB Seagate drives, and while the SMART ATT results and the surface scan confirmed there were no bad blocks on either drive, both drives had about a dozen brown slow blocks. SMART ATT reported 0 for reallocated sectors and Reallocation Event Count, but seek error rate and hardware ECC recovered had numbers like '824678' when the thresholds were something like 100. I assume this isn't a good sign?
This is likely to be normal. You don't provide the hard data that you refer to for anyone to review, so it's impossible to be certain. However, you seem to be reporting "raw" values, and on Seagate drives several raw values encode rate information, which produces very large decimal numbers even when the behaviour is normal. Again, you'd need to provide the hard data to get a full analysis.
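The thing to compare is the normalised value against the threshold, not the raw column. A hedged example of what to collect (the device name is just an example):
[code]
# Full SMART report, suitable for attaching to a support request
smartctl -a /dev/sda

# In the attribute table, an attribute is only "failing" when the
# normalised VALUE column drops to or below THRESH.  On Seagate drives,
# Seek_Error_Rate and Hardware_ECC_Recovered pack operation counts into
# RAW_VALUE, so huge raw numbers are expected and not a fault by themselves.
[/code]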
nerdbot wrote:
4.) After the first filesystem corruption, I changed my backup script to mount the RAID1 array right before the backup process and umount the array when the backup process completed. I also ran e2fsck every couple days via cron, just to keep an eye out for additional corruptions. Could the repeated mount/unmount and much more frequent e2fscks shorten the lifespan of a drive?
No, that level of I/O is insignificant. What would be needed is a deeper review of the changes you made to system usage and when you made them (i.e. a "timeline"), so that your support provider can see whether the system behaviour over time holds any clues about the underlying problem(s).
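Since you already wrap the backup in a script, it costs nothing to have that script log what it does and when - which gives you exactly that timeline for free. A rough sketch, assuming an rsnapshot "daily" interval and an example mount point of /mnt/backup:
[code]
#!/bin/sh
# Hypothetical nightly wrapper - paths and interval name are examples.
LOG=/var/log/backup-wrapper.log

echo "$(date '+%F %T') mounting /dev/md0" >> "$LOG"
mount /dev/md0 /mnt/backup || { echo "$(date '+%F %T') mount FAILED" >> "$LOG"; exit 1; }

rsnapshot daily >> "$LOG" 2>&1

umount /mnt/backup
echo "$(date '+%F %T') backup finished, array unmounted" >> "$LOG"
[/code]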
nerdbot wrote:
5.) Any additional advice to keep my drives healthy and running well? I'm concerned about the ext3 filesystem corruptions, so I'm thinking of trying ext4 this time around...
In the absence of any hardware error messages, my experience of FS corruption is that the host (or human error) is much more likely to be the cause than the disks. There are ways in which disks can cause FS corruption, but they are very rare, and by running e2fsck to fix the reported corruption you destroyed the evidence of the exact FS corruption that your support provider could have used for deeper analysis.
Even if your new Hitachi software RAID1 does not appear to experience the same issues, that does not actually prove that the Seagate drives themselves were faulty - but that sort of non-intuitive conclusion is something I teach on troubleshooting courses, and it's too long to go into here.
In summary: my gut feeling (without any actual hard data from your system) is that I would be more concerned about your PC (including HBA, HBA drivers, SATA cabling, power etc.) than your disks. Is your PC memory parity-checked, ECC, or neither? Are you getting any other unusual / unexplained behaviour, other than the "apparent" FS corruptions? As I said earlier, hard evidence would need to be carefully examined for clues - a "problem rate" of once every few months, as you report, will make investigation a challenge (though not necessarily impossible) for you or your support provider.
If you haven't been doing it already, IMHO you need to keep complete records from your system (host messages, logs of command output when you run e2fsck, badblocks (read-only), smartctl etc.) from now onwards, to allow further investigation when the next "problem" occurs. Running a regular read-only e2fsck (i.e. without fixing, and with the FS unmounted) will allow you to narrow down the time period in which any FS corruption occurred. However, as soon as you try to fix reported corruption without first taking a dd copy of the disks, you destroy evidence that would be helpful in later analysis.
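As a concrete (hypothetical) sketch of that record-keeping, a small cron-driven script that writes dated, read-only check output to a log directory would be enough to bracket when the next corruption appears - the file locations are only illustrative:
[code]
#!/bin/sh
# Hypothetical health-check logger - run from cron while the array is NOT mounted.
OUT=/var/log/raid-health/$(date +%F)
mkdir -p "$OUT"

# Read-only filesystem check: reports problems, repairs nothing
e2fsck -fn /dev/md0      > "$OUT/e2fsck.log"    2>&1

# SMART data from each member disk
smartctl -a /dev/sda     > "$OUT/smart-sda.log" 2>&1
smartctl -a /dev/sdb     > "$OUT/smart-sdb.log" 2>&1

# Software RAID status
mdadm --detail /dev/md0  > "$OUT/mdadm.log"     2>&1
cat /proc/mdstat        >> "$OUT/mdadm.log"
[/code]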
I'm glad I was able to answer some of your questions and I hope this helps. Good luck.