Hello all,
I've had a rather unexpected and, to be honest, a bit strange encounter with my file server. Apparently I've got at least 3 totally faulty and maybe even 5 faulty disks (4 of which are ST2000DM001s). My server runs ZFS and the storage pool consists of two striped raidz2 vdevs (the RAID 6 equivalent). The whole pool is now offline because one of the raidz2 sets has three failed devices. A Supermicro AOC-USAS2-L8I (IT firmware) and an HP SAS expander are also in use.
I first noticed a problem after my network drives went down due to a high number of failed writes on all disks (monitoring, anyone?):
Code:
NAME                       STATE    READ WRITE CKSUM
data                       UNAVAIL     8   164     0
  raidz2-0                 UNAVAIL     6    30     0
    c0t5000C5005365ECBAd0  UNAVAIL     0    21     0
    c0t5000C500507AA260d0  UNAVAIL     0    21     0
    c0t5000C500507ACBE0d0  UNAVAIL     0    22     0
    c0t5000C5005070B6DFd0  UNAVAIL     0     4     0
    c0t5000C5005070B21Dd0  UNAVAIL     0    22     0
    c0t5000C5005071BF4Fd0  UNAVAIL     0    22     0
    c0t5000C5005071F13Ed0  UNAVAIL     0    22     0
    c0t5000C5005071F40Ed0  UNAVAIL     0    22     0
    c0t5000C50050575E4Fd0  UNAVAIL     2    22     0
    c0t5000C50050780BDEd0  UNAVAIL     0    22     0
  raidz2-1                 UNAVAIL    10   134     0
    c0t50024E92060B16AFd0  UNAVAIL     0    95     0
    c0t50024E92060B16BBd0  UNAVAIL     0    97     0
    c0t50024E92060B1607d0  UNAVAIL     0    95     0
    c0t50024E92060B1693d0  UNAVAIL     0    98     0
    c0t5000C5002A7FB551d0  UNAVAIL     0    98     0
    c0t5000C5002A87CBFFd0  UNAVAIL     0    94     0
    c0t5000C50053495485d0  UNAVAIL     0    96     0
    c0t5000C5004007FE66d0  UNAVAIL     0    96     0
    c0t5000C5005CDA7FA7d0  UNAVAIL     0    15     0
    c0t5000C5005CDA7D59d0  UNAVAIL     2    96     1
This obviously led me to suspect that either my SAS expander or my hard disk controller was faulty. Also, the disks were apparently not completely broken at this point, as all of them came up after I rebooted the server (but then started giving errors and went offline one by one). After this I powered off the virtual machine, but the disks were left powered on for a couple of weeks.
Now, after replacing the hard disk controller (the SAS expander is still on its way, as the eBay replacement was nonfunctional...), the printout is as follows.
Code:
NAME                       STATE    READ WRITE CKSUM
data                       UNAVAIL     0     0     0
  raidz2-0                 UNAVAIL     0     0     0
    c0t5000C5005365ECBAd0  ONLINE      0     0     0
    c0t5000C500507AA260d0  UNAVAIL     0     0     0
    c0t5000C500507ACBE0d0  ONLINE      0     0     0
    c0t5000C5005070B6DFd0  UNAVAIL     0     0     0
    c0t5000C5005070B21Dd0  ONLINE      0     0     0
    c0t5000C5005071BF4Fd0  UNAVAIL     0     0     0
    c0t5000C5005071F13Ed0  ONLINE      0     0     0
    c0t5000C5005071F40Ed0  ONLINE      0     0     0
    c0t5000C50050575E4Fd0  ONLINE      0     0     0
    c0t5000C50050780BDEd0  ONLINE      0     0     0
  raidz2-1                 DEGRADED    0     0     0
    c0t50024E92060B16AFd0  ONLINE      0     0     0
    c0t50024E92060B16BBd0  ONLINE      0     0     0
    c0t50024E92060B1607d0  ONLINE      0     0     0
    c0t50024E92060B1693d0  UNAVAIL     0     0     0
    c0t5000C5002A7FB551d0  ONLINE      0     0     0
    c0t5000C5002A87CBFFd0  ONLINE      0     0     0
    c0t5000C50053495485d0  ONLINE      0     0     0
    c0t5000C5004007FE66d0  ONLINE      0     0     0
    c0t5000C5005CDA7FA7d0  UNAVAIL     0     0     0
    c0t5000C5005CDA7D59d0  ONLINE      0     0     0
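The states in this printout come down to raidz2 arithmetic: each raidz2 vdev tolerates two missing members, and because the pool is striped across both vdevs, a single vdev with three missing members takes the whole pool down. Below is a minimal Python sketch of that check, using the device states from the printout above. The function names are made up for illustration, and the script only parses pasted text; it does not talk to ZFS.

```python
import re

# Device names and states copied from the printout above (error counters omitted).
STATUS = """\
NAME STATE
data UNAVAIL
raidz2-0 UNAVAIL
c0t5000C5005365ECBAd0 ONLINE
c0t5000C500507AA260d0 UNAVAIL
c0t5000C500507ACBE0d0 ONLINE
c0t5000C5005070B6DFd0 UNAVAIL
c0t5000C5005070B21Dd0 ONLINE
c0t5000C5005071BF4Fd0 UNAVAIL
c0t5000C5005071F13Ed0 ONLINE
c0t5000C5005071F40Ed0 ONLINE
c0t5000C50050575E4Fd0 ONLINE
c0t5000C50050780BDEd0 ONLINE
raidz2-1 DEGRADED
c0t50024E92060B16AFd0 ONLINE
c0t50024E92060B16BBd0 ONLINE
c0t50024E92060B1607d0 ONLINE
c0t50024E92060B1693d0 UNAVAIL
c0t5000C5002A7FB551d0 ONLINE
c0t5000C5002A87CBFFd0 ONLINE
c0t5000C50053495485d0 ONLINE
c0t5000C5004007FE66d0 ONLINE
c0t5000C5005CDA7FA7d0 UNAVAIL
c0t5000C5005CDA7D59d0 ONLINE
"""


def missing_per_vdev(status_text):
    """Return {vdev_name: [members that are not ONLINE]} for each raidz vdev."""
    missing = {}
    current = None
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        name, state = fields[0], fields[1]
        if re.fullmatch(r"raidz[123]?-\d+", name):
            current = name          # start of a new raidz vdev
            missing[current] = []
        elif current is not None and state != "ONLINE":
            missing[current].append(name)
    return missing


def parity(vdev_name):
    """Number of members a raidz vdev can lose: raidz=1, raidz2=2, raidz3=3."""
    match = re.match(r"raidz(\d)?", vdev_name)
    return int(match.group(1) or 1)


def survives(vdev_name, missing_devices):
    """True if the vdev still has enough members to serve data."""
    return len(missing_devices) <= parity(vdev_name)


for vdev, gone in missing_per_vdev(STATUS).items():
    verdict = "usable" if survives(vdev, gone) else "NOT usable"
    print(f"{vdev}: {len(gone)} missing, tolerates {parity(vdev)}, {verdict}")
```

By this count, getting a single member of raidz2-0 back would be enough for that vdev to run degraded again.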
The following disks are 'dead'. They spin up normally and make no unusual noises (as far as I can tell) but are not detected by my SATA<->USB adapter.
c0t5000C5005CDA7FA7d0
c0t5000C500507AA260d0
c0t5000C5005070B6DFd0
These disks are detected and SMART shows no faults. I wonder why Solaris decided to offline them. Also, these two disks are on the same SAS expander port. Coincidence?
c0t50024E92060B1693d0
c0t5000C5005071BF4Fd0
Another strange thing to note: when running
Code:
sudo dd if=/dev/dsk/c0t5000C5004007FE66d0 of=/dev/null bs=10240
the LED for either c0t5000C500507AA260d0 or c0t5000C5005070B6DFd0 was lit (another disk once again on the same SAS expander port). Could there be something strange going on with my SAS expander? Can a faulty expander brick drives? Can the Norco 4224 backplanes be at fault?
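The dd read test above can also be done with a small script that records where reads fail instead of stopping at the first error, which might help tell a bad disk from a bad slot or cable. This is only a sketch: `read_check` and its parameters are made-up names, the 1 MiB block size is an arbitrary choice, and it assumes a Unix-like system where Python's `os.pread` is available.

```python
import os
import sys
import time


def read_check(path, block_size=1024 * 1024, max_bytes=None, max_errors=16):
    """Read `path` sequentially and return (bytes_read, seconds, errors).

    `errors` is a list of (offset, message) pairs for blocks that raised
    OSError. Reading stops at end of device, at `max_bytes`, or once
    `max_errors` failed blocks have been seen.
    """
    bytes_read = 0
    errors = []
    offset = 0
    start = time.monotonic()
    fd = os.open(path, os.O_RDONLY)
    try:
        while max_bytes is None or offset < max_bytes:
            try:
                chunk = os.pread(fd, block_size, offset)
            except OSError as exc:
                errors.append((offset, str(exc)))
                if len(errors) >= max_errors:
                    break           # give up on a device that keeps failing
                offset += block_size  # skip the unreadable block and carry on
                continue
            if not chunk:
                break               # end of device or file
            bytes_read += len(chunk)
            offset += len(chunk)
    finally:
        os.close(fd)
    return bytes_read, time.monotonic() - start, errors


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python read_check.py /dev/dsk/c0t5000C5004007FE66d0
    total, seconds, errs = read_check(sys.argv[1], max_bytes=1024 ** 3)
    print(f"read {total} bytes in {seconds:.1f}s, {len(errs)} failed blocks")
    for off, msg in errs:
        print(f"  offset {off}: {msg}")
```

Running it against each slot with a known-good drive, one at a time, while watching which LED lights up would show whether the activity LED really belongs to the disk being read.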
I've been searching the forums and Google in general, but the information I've been able to find is rather limited. Do we have any experts here who could share their insight on the matter? How probable is it that a professional data recovery company could restore one disk fully or two disks partially, so that the pool can be brought back up and I can back up the data I need? I would think it highly probable, as there appears to be no physical damage apart from (maybe) slightly excessive heat.
Also, are there any cheap premade serial adapters which would allow me to access the disks via terminal to monitor what's happening?
My plan of action is
1) When I've received a new SAS expander I'll test the faulty drive slots with my old 500GB drives and see if everything is fine (trying to isolate the faulty backplanes).
2) If everything seems to be fine, I'll try to bring the two offlined but, according to SMART, healthy drives back up in Solaris and see if I can start resilvering/replacing the dead drives with new ones.
3) If I cannot bring the pool up, I'll be willing to invest in professional data recovery services (assuming the price remains under ~2k). Maybe I'll try to take a look with console access before proceeding with professional services as I understand just connecting the console and monitoring the output would reveal quite a bit?
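Step 2 of the plan could be written down as a dry run before touching the pool. The sketch below only builds and prints command lines; it runs nothing. `zpool online`, `zpool clear`, `zpool status` and `zpool replace` are real subcommands, but the order used here is an assumption, the replacement device names are hypothetical placeholders, and whether `zpool online` is enough for a pool in the UNAVAIL state (rather than an export and import) is not something the post establishes.

```python
POOL = "data"

# Detected and SMART-clean, but marked UNAVAIL (from the post).
HEALTHY_BUT_OFFLINE = [
    "c0t50024E92060B1693d0",
    "c0t5000C5005071BF4Fd0",
]

# Not detected at all (from the post), mapped to HYPOTHETICAL new device names.
DEAD_TO_NEW = {
    "c0t5000C5005CDA7FA7d0": "NEW_DISK_1",
    "c0t5000C500507AA260d0": "NEW_DISK_2",
    "c0t5000C5005070B6DFd0": "NEW_DISK_3",
}


def recovery_commands(pool, reonline, replacements):
    """Build (but do not run) the zpool command lines for plan step 2."""
    cmds = [["zpool", "online", pool, dev] for dev in reonline]
    cmds.append(["zpool", "clear", pool])          # reset error counters
    cmds.append(["zpool", "status", "-v", pool])   # check state before replacing
    cmds += [
        ["zpool", "replace", pool, old, new]
        for old, new in replacements.items()
    ]
    return cmds


for cmd in recovery_commands(POOL, HEALTHY_BUT_OFFLINE, DEAD_TO_NEW):
    print(" ".join(cmd))
```

Keeping the list as data makes it easy to review each command before typing it by hand.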
Thanks, and remember guys: RAID is not a backup! (Most of my irreplaceable data was backed up, *phew*.) Also, I'm willing to give a small tip in bitcoin for any worthy replies.
