All times are UTC - 5 hours [ DST ]




 Post subject: Multiple drives exhibiting same symptoms -- hardware??
PostPosted: April 27th, 2012, 10:33 

Joined: April 27th, 2012, 9:57
Posts: 3
Location: Singapore
Hey guys, bit of a weird one:

I have a total of five Samsung HDDs as follows:

1TB - single drive config
2x1.5TB - RAID0
2x2TB - RAID0

And a Crucial M4 128GB SSD for my OS. Other potentially relevant system specs:

Asus P5Q P45 / ICH10R
Windows 7 x64
Latest Intel storage system / drivers (plus tried on a clean install using MS' AHCI drivers).

Recently I noticed that copying files to either of my stripes was intermittently slow. When copying large numbers of large files, after a while the speed would drop from the usual ~150MB/s to a fairly consistent 30MB/s, sometimes speeding up again later. So I fired up resmon to take a look. The first weird thing I noticed was that the per-disk activity graphs, which usually sit on the right side of the Disks tab in Windows 7 resmon, were not there. Only the top 'Overall' disk I/O graph was shown. I figured it was a driver issue / OS screw-up and was probably related to my slow RAID I/O speed, so I reinstalled Windows 7. This fixed the resmon issue but the performance issue was still there. In resmon it manifested as a maximum queue length and >98% busy time on the affected drive(s).

Next thing I figured was that one or more of my HDDs had errors, so I began scanning them with MHDD. I used to work for a system builder where we used MHDD on an industrial scale to check for faulty drives, so I know what a failing drive looks like as well as what a healthy drive should look like. In my opinion, my results are kinda strange. All of the drives that have been in RAID show approximately the same distribution of read times: a very large number of <10ms and <50ms blocks (far more than a new / healthy drive should have), quite a few <150ms, and a small number of <500ms (but not as many as I'd expect to see from a mechanically failing drive). If I'd seen these results for one of the drives, or even one from each RAID stripe, I could accept that I had an impending faulty drive or two on my hands and replace them. However, all four RAID drives are like this, while the 1TB drive, which has not been in RAID, tests almost like a brand new drive: almost completely <3ms with just a very small number of <10ms and <50ms. Considering this is by far the oldest of the five drives, I find this very strange.
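To compare drives on more than eyeballing, the read-time distribution described above could be tallied per drive. This is only a minimal sketch: the bucket limits are the ones named in the post, the `>500ms` catch-all is an added assumption, and `bucket_counts` is a hypothetical helper that expects per-block read times you have collected yourself (MHDD does not export them in this form as far as this sketch assumes).

```python
from bisect import bisect_right

# Upper bounds (in ms) of the read-time buckets mentioned in the post.
BOUNDS_MS = [3, 10, 50, 150, 500]
# ">500ms" is an assumed catch-all for anything slower than the last bound.
LABELS = ["<3ms", "<10ms", "<50ms", "<150ms", "<500ms", ">500ms"]


def bucket_counts(read_times_ms):
    """Count how many block reads fall into each latency bucket."""
    counts = {label: 0 for label in LABELS}
    for t in read_times_ms:
        # bisect_right puts a value equal to a bound into the next (slower) bucket,
        # so a 3.0 ms read counts as "<10ms", not "<3ms".
        counts[LABELS[bisect_right(BOUNDS_MS, t)]] += 1
    return counts


def slow_fraction(counts):
    """Fraction of blocks that took 10 ms or longer."""
    total = sum(counts.values())
    slow = total - counts["<3ms"] - counts["<10ms"]
    return slow / total if total else 0.0
```

Running `bucket_counts` on the timings from each drive and comparing `slow_fraction` between the RAID members and the 1TB drive would put a number on the difference.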

I have to say I think the odds of all four drives failing pretty much simultaneously are astronomical. I'm therefore wondering if something the RAID controller has done could have screwed up the drives at the block level, causing slow access to certain areas of the drive?

I've bought an additional 2TB drive and backed up everything off the 2x1.5TB stripe. Is there anything you guys can suggest that I could try running / doing with those drives to kinda zero every sector and restore the drives to an unused state (low-level format??)? Bear in mind the alternative is to replace another three HDDs (expensive) and I have a sure-fire way of gauging success (run MHDD again), so I might as well try anything that might work. I've got a bootable USB with Hiren's boot CD so I have a wide range of tools at my disposal, so fire away with suggestions ;)
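For reference, "zero every sector" boils down to one full write pass over the device. The sketch below shows the idea only; it is not a vendor low-level format, `zero_fill` is a hypothetical helper, and it assumes a platform where the raw device can be opened like a file (for example a Linux live USB, with a placeholder path such as `/dev/sdX`). It destroys everything on the target, so it belongs only on a drive that is already backed up.

```python
import os


def zero_fill(path, chunk_size=1024 * 1024):
    """Overwrite an existing file or block device with zeros.

    WARNING: this destroys all data on the target.
    Returns the number of bytes written.
    """
    zeros = bytes(chunk_size)
    # Unbuffered binary mode so each write goes straight to the OS.
    with open(path, "r+b", buffering=0) as target:
        # Seeking to the end reports the size of a regular file and,
        # on Linux, of a block device as well (assumption for other OSes).
        size = target.seek(0, os.SEEK_END)
        target.seek(0)
        written = 0
        while written < size:
            remaining = size - written
            written += target.write(zeros[: min(chunk_size, remaining)])
        os.fsync(target.fileno())  # make sure the data has left the OS cache
    return written
```

A full write pass like this may give the drive firmware a chance to remap weak sectors, which is why re-running MHDD afterwards is a reasonable way to gauge success.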

Cheers!


 Post subject: Re: Multiple drives exhibiting same symptoms -- hardware??
PostPosted: April 27th, 2012, 20:51 

Joined: August 18th, 2010, 17:35
Posts: 3630
Location: Massachusetts, USA
What are the raw smart values for each drive? Are these consistent or similar among all "faulty" drives?

_________________
Hard Disk Drive, SSD, USB Drive and RAID Data Recovery Specialist in Massachusetts


 Post subject: Re: Multiple drives exhibiting same symptoms -- hardware??
PostPosted: May 1st, 2012, 0:24 

Joined: April 27th, 2012, 9:57
Posts: 3
Location: Singapore
Thanks for the reply. Here are the SMART values for the two 1.5TB drives. All claim to be OK but some of the values don't look too healthy, particularly 5, 196 and 200 from the first one.

Code:
SMART ATTRIBUTES:
ID   Description                            Status       Value        Worst        Threshold    Raw Value    TEC                 
---------------------------------------------------------------------------------------------------------------------------------------------
  1   Raw Read Error Rate                    OK           100          100          51           0            N.A.               
  3   Spin Up Time                           OK           72           72           11           9320         N.A.               
  4   Start/Stop Count                       OK           96           96           0            3860         N.A.               
  5   Reallocated Sector Count               OK           100          100          10           4            N.A.               
  7   Seek Error Rate                        OK           253          253          51           0            N.A.               
  8   Seek Time Performance                  OK           97           97           15           16174        N.A.               
  9   Power On Time                          OK           99           99           0            5059         N.A.               
 10   Spin Retry Count                       OK           100          100          51           0            N.A.
 11   Calibration Retry Count                OK           100          100          0            0            N.A.
 12   Power Cycle Count                      OK           99           99           0            1293         N.A.
 13   Soft Read Error Rate                   OK           100          100          0            0            N.A.
183   SATA Downshift Error Count             OK           100          100          0            0            N.A.               
184   End-to-End error                       OK           100          100          0            0            N.A.               
187   Reported Uncorrectable Errors          OK           100          100          0            0            N.A.               
188   Command Timeout                        OK           100          100          0            0            N.A.               
190   Temperature Difference from 100        OK           79           62           0            353697813    N.A.               
194   Temperature                            OK           78           61           0            22 C         N.A.               
195   Hardware ECC Recovered                 OK           100          100          0            142592       N.A.               
196   Reallocation Event Count               OK           100          100          0            4            N.A.               
197   Current Pending Sector Count           OK           100          100          0            0            N.A.               
198   Uncorrectable Sector Count             OK           100          100          0            0            N.A.               
199   UltraDMA CRC Error Count               OK           99           99           0            7            N.A.               
200   Write Error Count                      OK           100          100          0            45           N.A.               
201   Off Track Errors                       OK           253          253          0            0            N.A.


Code:
SMART ATTRIBUTES:
ID   Description                            Status       Value        Worst        Threshold    Raw Value    TEC                 
---------------------------------------------------------------------------------------------------------------------------------------------
  1   Raw Read Error Rate                    OK           100          100          51           0            N.A.               
  3   Spin Up Time                           OK           73           73           11           8880         N.A.               
  4   Start/Stop Count                       OK           96           96           0            4143         N.A.               
  5   Reallocated Sector Count               OK           100          100          10           0            N.A.               
  7   Seek Error Rate                        OK           253          253          51           0            N.A.               
  8   Seek Time Performance                  OK           100          100          15           0            N.A.               
  9   Power On Time                          OK           99           99           0            3954         N.A.               
 10   Spin Retry Count                       OK           100          100          51           0            N.A.
 11   Calibration Retry Count                OK           100          100          0            1            N.A.
 12   Power Cycle Count                      OK           99           99           0            1418         N.A.
 13   Soft Read Error Rate                   OK           100          100          0            0            N.A.
183   SATA Downshift Error Count             OK           100          100          0            0            N.A.               
184   End-to-End error                       OK           100          100          0            0            N.A.               
187   Reported Uncorrectable Errors          OK           100          100          0            0            N.A.               
188   Command Timeout                        OK           100          100          0            0            N.A.               
190   Temperature Difference from 100        OK           80           72           0            336855060    N.A.               
194   Temperature                            OK           79           68           0            21 C         N.A.               
195   Hardware ECC Recovered                 OK           100          100          0            59720        N.A.               
196   Reallocation Event Count               OK           100          100          0            0            N.A.               
197   Current Pending Sector Count           OK           100          100          0            0            N.A.               
198   Uncorrectable Sector Count             OK           100          100          0            0            N.A.               
199   UltraDMA CRC Error Count               OK           99           99           0            7            N.A.               
200   Write Error Count                      OK           100          100          0            0            N.A.               
201   Off Track Errors                       OK           253          253          0            0            N.A.


I've not yet figured out the best way to back up the stuff on the 4TB stripe, so I don't want to break the stripe to run DiskCheckup. I can disable RAID and run one of the Hiren's boot CD tools from DOS, but I'll have to take a photo of my screen and type out the results - I'll get back to you on that. I figure I'll see if the two 1.5TB drives are salvageable by any means first or if I'm going to have to bite the bullet and buy at least one more drive...
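The two SMART dumps above are easier to compare once the raw values are pulled out programmatically. A minimal sketch, assuming the dump keeps the whitespace-aligned eight-column layout shown above: `parse_smart` and `flag_nonzero` are hypothetical helpers, and the watch list (5, 196 and 200 from the post, plus 197, 198 and 199) is an assumption about which counters are worth a second look, not an official list.

```python
import re

# Attribute IDs to check for a non-zero raw value (assumed watch list).
WATCH_IDS = {5, 196, 197, 198, 199, 200}

# A few rows copied from the first drive's dump above.
SAMPLE = """\
ID   Description                            Status       Value        Worst        Threshold    Raw Value    TEC
  5   Reallocated Sector Count               OK           100          100          10           4            N.A.
196   Reallocation Event Count               OK           100          100          0            4            N.A.
197   Current Pending Sector Count           OK           100          100          0            0            N.A.
199   UltraDMA CRC Error Count               OK           99           99           0            7            N.A.
200   Write Error Count                      OK           100          100          0            45           N.A.
"""


def parse_smart(text):
    """Return {id: (description, raw_value)} from a whitespace-aligned dump."""
    rows = {}
    for line in text.splitlines():
        # Columns are separated by two or more spaces; single spaces
        # inside a description or a raw value such as "22 C" are kept.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == 8 and fields[0].isdigit():
            rows[int(fields[0])] = (fields[1], fields[6])
    return rows


def flag_nonzero(rows, watch_ids=WATCH_IDS):
    """Return the watched attributes whose raw value is not "0"."""
    return {i: rows[i] for i in sorted(watch_ids) if i in rows and rows[i][1] != "0"}


if __name__ == "__main__":
    for attr_id, (name, raw) in flag_nonzero(parse_smart(SAMPLE)).items():
        print(attr_id, name, raw)
```

Pasting each drive's full dump into `parse_smart` and comparing the flagged sets shows at a glance whether the "faulty" drives share the same non-zero counters.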


 Post subject: Re: Multiple drives exhibiting same symptoms -- hardware??
PostPosted: May 1st, 2012, 6:12 

Joined: December 14th, 2011, 8:24
Posts: 60
Location: Cyberspace
See if SMART attributes change after one more slowdown.
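One way to do that check is to record the raw values before and after a slowdown and list only what moved. A rough sketch, assuming each snapshot is a plain dict of attribute ID to raw value; `smart_diff` is a hypothetical helper and the "after" numbers in the usage example are invented purely for illustration.

```python
def smart_diff(before, after):
    """Return {id: (old, new)} for attributes whose raw value changed."""
    return {
        attr: (before.get(attr), after.get(attr))
        for attr in sorted(set(before) | set(after))
        if before.get(attr) != after.get(attr)
    }


if __name__ == "__main__":
    # "before" uses raw values from the first 1.5TB drive in this thread;
    # "after" is a made-up example, not a real reading.
    before = {5: 4, 197: 0, 199: 7, 200: 45}
    after = {5: 4, 197: 0, 199: 9, 200: 45}
    print(smart_diff(before, after))
```

A counter that climbs only during slowdowns would narrow things down: a rising 199 would point towards cabling, while rising 5 or 197 would point towards the platters.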

Also, I'd suspect the power supply. If you are doing all your tests on the same machine, what is installed (full list), what is the power supply's rated output, and how old is it?


 Post subject: Re: Multiple drives exhibiting same symptoms -- hardware??
PostPosted: May 1st, 2012, 6:34 

Joined: April 27th, 2012, 9:57
Posts: 3
Location: Singapore
All testing carried out on the same machine. The PSU is a CoolerMaster SilentPro 600W. It's an 80+ Gold rated, $150US+ PSU and is less than a year old. I think if it was PSU related I'd be experiencing other issues such as random hard reboots / power-offs.

Full list of components:

Asus P5Q
Core2 Quad Q9550 @ 3.4GHz
8GB Corsair XMS2 DDR2 800MHz
AMD Radeon HD6950
5x mechanical HDDs plus Crucial SSD

Well within spec for a decent 600W PSU I'd say.
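The "well within spec" claim can be sanity-checked with a simple power budget. This is a back-of-envelope sketch only: every wattage below is a placeholder guess for illustration, not a measurement or a datasheet figure, and `power_budget` is a hypothetical helper. Swap in real numbers for your own parts.

```python
# ASSUMED sustained draw in watts per component (placeholders, not measured).
ASSUMED_WATTS = {
    "cpu_q9550_overclocked": 130,
    "gpu_hd6950": 200,
    "board_and_ram": 60,
    "ssd": 5,
}
# ASSUMED draw of one mechanical HDD while active (spin-up peaks are higher).
ASSUMED_HDD_WATTS = 8


def power_budget(psu_watts, hdd_count=5):
    """Return (estimated total draw, remaining headroom) in watts."""
    total = sum(ASSUMED_WATTS.values()) + hdd_count * ASSUMED_HDD_WATTS
    return total, psu_watts - total


if __name__ == "__main__":
    total, headroom = power_budget(600)
    print(f"estimated draw: {total} W, headroom: {headroom} W")
```

Note that this only covers total wattage. It says nothing about ripple or how the load is split across rails, which a budget like this cannot rule out.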

Also, since I've broken the 3TB stripe and tried copying stuff to the two 1.5TB drives individually I've noticed pretty much the same behaviour. I can copy from my new 2TB to either of them and it starts off fast (~100MB/s) then slows down to as little as 2-3MB/s. Cancel the copy and restart it and often it'll go fast again -- maybe the OS / controller is writing to a different area of the drive? So I do think there's something wrong with both of these drives; I'm just not sure it's mechanical.
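To make the fast-then-slow pattern reproducible rather than anecdotal, the write speed could be logged chunk by chunk. A minimal sketch, assuming a scratch file on the drive under test; `timed_write` is a hypothetical helper, and the path in the usage comment is a placeholder. Each chunk is forced to disk with `os.fsync` so the OS write cache does not mask slow periods.

```python
import os
import time


def timed_write(path, total_mb=64, chunk_mb=4):
    """Write total_mb of data in chunk_mb pieces; return MB/s for each chunk."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    rates = []
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            start = time.perf_counter()
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # wait until this chunk has reached the drive
            elapsed = max(time.perf_counter() - start, 1e-9)
            rates.append(chunk_mb / elapsed)
    return rates


# Example (placeholder path on the drive being tested):
# rates = timed_write(r"E:\speedtest.bin", total_mb=4096, chunk_mb=64)
# print(min(rates), max(rates))
```

If the slow chunks cluster at the same offsets on repeated runs, that supports the "certain areas of the drive" theory; if they move around, something other than specific sectors is more likely.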


 Post subject: Re: Multiple drives exhibiting same symptoms -- hardware??
PostPosted: May 1st, 2012, 7:41 

Joined: December 14th, 2011, 8:24
Posts: 60
Location: Cyberspace
Cables? Several SATA cables bundled together, as in crosstalk? Or all the cables coming from the same batch with the same manufacturing defect?


 Post subject: Re: Multiple drives exhibiting same symptoms -- hardware??
PostPosted: May 5th, 2012, 21:09 

Joined: May 5th, 2012, 8:29
Posts: 2
Location: Brisbane
Well, the initial fast speed could just be due to caching. Could you change the drive setting from "Optimise for performance" to "Optimise for quick removal" (or whatever it is called in Win 7)?
Does that have any effect on the initial speed?
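The caching theory can also be probed without touching the drive policy, by timing the same write with and without forcing it to disk. A rough sketch: `write_seconds` is a hypothetical helper, the sizes are arbitrary, and the timings depend heavily on the OS and free RAM, so treat the result as a hint rather than proof.

```python
import os
import time


def write_seconds(path, size_mb=32, sync=False):
    """Time writing size_mb of data; with sync=True, wait for it to hit the disk."""
    data = bytes(1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(data)
        if sync:
            f.flush()
            os.fsync(f.fileno())  # block until the OS has flushed to the device
    # Without sync, closing the file only hands the data to the OS cache.
    return time.perf_counter() - start


# Example (placeholder path on the drive being tested):
# cached = write_seconds(r"E:\cachetest.bin", size_mb=512)
# synced = write_seconds(r"E:\cachetest.bin", size_mb=512, sync=True)
# print(cached, synced)
```

If the buffered run is much faster than the synced one, the initial burst of speed is probably the cache filling up rather than the drive itself keeping pace.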

