Well, after a few months I'm returning here to document this issue and my findings just in case someone else has the same issue.
In my first post I forgot to mention that I'm running BTRFS on this drive. Since then I've received a dozen "Checksum Mismatch" errors on random files. I have confirmed that these files were in fact corrupted, and I had to restore them from backup.
The interesting thing is that no bad sectors or SMART errors were reported. I even ran an "Extended SMART Test" that took several hours, and it completed without errors. As you can see below, the standard SMART values are all OK:
Code:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 076 064 006 - 37021224
3 Spin_Up_Time PO---- 096 096 000 - 0
4 Start_Stop_Count -O--CK 100 100 020 - 35
5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
7 Seek_Error_Rate POSR-- 080 060 045 - 96894512
9 Power_On_Hours -O--CK 087 087 000 - 11750 (99 52 0)
10 Spin_Retry_Count PO--C- 100 100 097 - 0
12 Power_Cycle_Count -O--CK 100 100 020 - 35
183 Runtime_Bad_Block -O--CK 100 100 000 - 0
184 End-to-End_Error -O--CK 100 100 099 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
188 Command_Timeout -O--CK 100 100 000 - 0 0 0
189 High_Fly_Writes -O-RCK 100 100 000 - 0
190 Airflow_Temperature_Cel -O---K 069 058 040 - 31 (Min/Max 25/33)
191 G-Sense_Error_Rate -O--CK 100 100 000 - 0
192 Power-Off_Retract_Count -O--CK 100 100 000 - 449
193 Load_Cycle_Count -O--CK 100 100 000 - 508
194 Temperature_Celsius -O---K 031 042 000 - 31 (0 22 0 0 0)
195 Hardware_ECC_Recovered -O-RC- 076 064 000 - 37021224
197 Current_Pending_Sector -O--C- 100 100 000 - 0
198 Offline_Uncorrectable ----C- 100 100 000 - 0
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0
240 Head_Flying_Hours ------ 100 253 000 - 8813h+14m+19.236s
241 Total_LBAs_Written ------ 100 253 000 - 8106501321
242 Total_LBAs_Read ------ 100 253 000 - 33554198868
(...)
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 11744 -
# 2 Short offline Completed without error 00% 11644 -
# 3 Short offline Completed without error 00% 10924 -
# 4 Short offline Completed without error 00% 10180 -
# 5 Short offline Completed without error 00% 9477 -
# 6 Short offline Completed without error 00% 8733 -
# 7 Short offline Completed without error 00% 8061 -
# 8 Short offline Completed without error 00% 7317 -
# 9 Extended offline Completed without error 00% 6824 -
#10 Short offline Completed without error 00% 6810 -
#11 Short offline Completed without error 00% 6573 -
#12 Short offline Completed without error 00% 5853 -
#13 Short offline Completed without error 00% 5109 -
#14 Short offline Completed without error 00% 4389 -
#15 Short offline Completed without error 00% 3645 -
#16 Short offline Completed without error 00% 2901 -
#17 Short offline Completed without error 00% 2181 -
#18 Short offline Completed without error 00% 1439 -
#19 Short offline Completed without error 00% 722 -
(...)
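For anyone wondering what "all OK" means here: as far as I understand it, an attribute only counts as failing when its normalized VALUE drops to or below THRESH (and THRESH is non-zero). Below is a minimal Python sketch of that check. The function name `failing_attributes` and the `SAMPLE` text are my own illustration, with rows copied from the table above; this is not how smartctl itself decides.

```python
# A few rows copied from the smartctl attribute table above.
SAMPLE = """\
1 Raw_Read_Error_Rate POSR-- 076 064 006 - 37021224
5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
197 Current_Pending_Sector -O--C- 100 100 000 - 0
"""


def failing_attributes(table_text):
    """Return names of attributes whose normalized VALUE is at or below THRESH.

    Hypothetical helper: assumes whitespace-separated columns in the order
    ID NAME FLAGS VALUE WORST THRESH FAIL RAW...
    """
    failing = []
    for line in table_text.splitlines():
        fields = line.split()
        # Skip headers and anything that does not look like an attribute row.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        # A threshold of 0 means the attribute is informational only.
        if thresh > 0 and value <= thresh:
            failing.append(name)
    return failing


print(failing_attributes(SAMPLE))
```

On the rows above nothing is flagged, which matches what smartctl reports for this drive.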
However, this drive is IN FACT FAILING, as shown by the btrfs "checksum mismatch" errors and by the fact that the reported files really were corrupt. And as I mentioned in the original post, the "Number of Reported Uncorrectable Errors" in the SMART Device Statistics kept increasing. It is now at 20 (more or less consistent with the number of "checksum mismatch" errors from btrfs):
Code:
(...)
Device Statistics (GP Log 0x04)
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 35 --- Lifetime Power-On Resets
0x01 0x010 4 11750 --- Power-on Hours
0x01 0x018 6 8106535977 --- Logical Sectors Written
0x01 0x020 6 24955511 --- Number of Write Commands
0x01 0x028 6 33554198868 --- Logical Sectors Read
0x01 0x030 6 38612416 --- Number of Read Commands
0x01 0x038 6 - --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 11749 --- Spindle Motor Power-on Hours
0x03 0x010 4 11746 --- Head Flying Hours
0x03 0x018 4 508 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 0 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 449 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 20 --- Number of Reported Uncorrectable Errors <<<<<< !!!!!!!!!!!!!!!!!!!!!
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 31 --- Current Temperature
0x05 0x010 1 31 --- Average Short Term Temperature
0x05 0x018 1 28 --- Average Long Term Temperature
0x05 0x020 1 42 --- Highest Temperature
0x05 0x028 1 0 --- Lowest Temperature
0x05 0x030 1 39 --- Highest Average Short Term Temperature
0x05 0x038 1 23 --- Lowest Average Short Term Temperature
0x05 0x040 1 36 --- Highest Average Long Term Temperature
0x05 0x048 1 27 --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 60 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 0 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 0 --- Number of Hardware Resets
0x06 0x010 4 0 --- Number of ASR Events
0x06 0x018 4 0 --- Number of Interface CRC Errors
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value
(...)
I have another drive of the exact same model, age and usage and this other drive has "0" in the "Number of Reported Uncorrectable Errors" value.
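In case someone wants to keep an eye on this counter on their own drives, here is a minimal Python sketch that pulls one value out of the Device Statistics text. It assumes the output was saved from something like `smartctl -x` or `smartctl -l devstat` and has the same column layout as the dump above. The names `device_statistic`, `ROW` and `SAMPLE` are hypothetical, mine only.

```python
import re

# Three rows in the same layout as the Device Statistics dump above.
SAMPLE = """\
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x04 0x008 4 20 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
"""

# Columns: Page, Offset, Size, Value, Flags, Description.
# Rows whose value is "-" or section headers ("=====") do not match.
ROW = re.compile(
    r"^0x[0-9a-f]{2}\s+0x[0-9a-f]{3}\s+\d+\s+(\d+)\s+\S+\s+(.+?)\s*$"
)


def device_statistic(text, description):
    """Return the integer value of the first row whose description starts
    with `description`, or None if no such row is found."""
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match and match.group(2).startswith(description):
            return int(match.group(1))
    return None


print(device_statistic(SAMPLE, "Number of Reported Uncorrectable Errors"))
```

Running this periodically against both drives and comparing the numbers would have shown the difference early: one drive stays at 0 while the other keeps climbing.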
At this point I just replaced this drive and will try an RMA, since I cannot trust it anymore.
And it is really strange. I only detected this problem because of the BTRFS checksum feature. Otherwise I would just be getting corrupted files without any notification from the drive. I would have put it down to bit rot and never suspected the drive if the "Number of Reported Uncorrectable Errors" statistic had not kept increasing...