I've been playing with software raid (Raid 5) for some 2 weeks now. I've gone between ZFS setup with Freenas and now Linux software raid with mdadm. With ZFS I kept getting errors thrown at me, I wasn't sure what the deal was so in an effort to do some troubleshooting I moved to Linux software raid. No more errors but odd behaviour is still around.
here is example output of one of my drives
nas01:/# smartctl -a /dev/sdb
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is
http://smartmontools.sourceforge.net/Warning! Drive Identity Structure error: invalid SMART checksum.
=== START OF INFORMATION SECTION ===
Device Model: ST3750528AS
Serial Number: 5VP0CM59
Firmware Version: CC34
User Capacity: 750,156,374,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Thu Aug 6 01:49:00 2009 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 600) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 144) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 115 099 006 Pre-fail Always - 85044048
3 Spin_Up_Time 0x0003 096 095 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 22
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 063 060 030 Pre-fail Always - 2410597
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 301
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 22
183 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
184 Unknown_Attribute 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 097 097 000 Old_age Always - 3
190 Airflow_Temperature_Cel 0x0022 073 067 045 Old_age Always - 27 (Lifetime Min/Max 26/29)
194 Temperature_Celsius 0x0022 027 040 000 Old_age Always - 27 (0 21 0 0)
195 Hardware_ECC_Recovered 0x001a 036 016 000 Old_age Always - 85044048
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 246741576188240
241 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 4147851726
242 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 1190436351
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
----
With google I couldn't come up with anything (Except this forum) here is my hdparm output
nas01:/# hdparm -iI /dev/sdb
/dev/sdb:
Model=ST3750528AS , FwRev=CC34 , SerialNo= 5VP0CM59
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
BuffType=unknown, BuffSize=0kB, MaxMultSect=16, MultSect=?16?
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1465149168
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 udma6
AdvancedPM=no WriteCache=enabled
Drive conforms to: unknown: ATA/ATAPI-4,5,6,7
* signifies the current active mode
ATA device, with non-removable media
Model Number: ST3750528AS
Serial Number: 5VP0CM59
Firmware Revision: CC34
Transport: Serial
Standards:
Used: unknown (minor revision code 0x0029)
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 1465149168
device size with M = 1024*1024: 715404 MBytes
device size with M = 1000*1000: 750156 MBytes (750 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* DOWNLOAD_MICROCODE
Power-Up In Standby feature set
SET_FEATURES required to spinup after power up
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
Write-Read-Verify feature set
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* SATA-I signaling speed (1.5Gb/s)
* SATA-II signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
Device-initiated interface power management
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT Long Sector Access (AC1)
* SCT LBA Segment Access (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
unknown 206[12] (vendor specific)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
124min for SECURITY ERASE UNIT. 124min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5000c50014414df2
NAA : 5
IEEE OUI : c50
Unique ID : 014414df2
Checksum: correct
HOWEVER for some of my devices the hdparm Checksum will say
Security:
Master password revision code = 65502
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
142min for SECURITY ERASE UNIT. 142min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5000c50013e8f1aa
NAA : 5
IEEE OUI : c50
Unique ID : 013e8f1aa
Checksum: incorrect (0xe0), expected 0x20
I have a total of 4 harddrives, 3 give that checksum error in smartctl and 2 show a bad checksum in hdparm. Are there any other tools I can use to check or diagnose? Are these numbers normal with the high Raw_Read_Error_Rate? Possible I have bad harddrives or could it be the raid/sata controller? (I'm not using it for raid just to connect the harddrives)
Computer info:
as01:/# cat proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) processor
stepping : 2
cpu MHz : 996.330
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 1995.14
clflush size : 32
power management:
nas01:/# free -m
total used free shared buffers cached
Mem: 789 781 8 0 2 732
-/+ buffers/cache: 46 743
Swap: 976 0 976
nas01:/# uname -a
Linux nas01 2.6.26-2-486 #1 Sun Jul 26 20:43:17 UTC 2009 i686 GNU/Linux
I'm running Debian +lenny
Thanks!