Hi there,
I have already discussed this issue extensively on the ServeTheHome forums (here), and was pointed in your direction as a more specialized bunch.
Here is a summary of the issue:
My WD RED WD80EFAX HDD suddenly died a few weeks ago: I shut down my Proxmox server, booted it up again, and the drive started "clicking". It clicked for a while, then stopped and has not done so since. I did not receive any SMART warnings ahead of time, and looking back at the attrlog under /var/lib/smartmontools/, I don't think there was anything to worry about there (I couldn't copy and paste the table here; see the original post at ServeTheHome for reference if needed).
The HDD was connected through an external USB enclosure, so I first tested with another USB enclosure to make sure the problem persists, and unfortunately it does. This is what I was seeing in dmesg:
Code:
[25343.421737] usb 2-3: new SuperSpeed USB device number 8 using xhci_hcd
[25343.442848] usb 2-3: New USB device found, idVendor=152d, idProduct=1561, bcdDevice= 1.04
[25343.442854] usb 2-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[25343.442857] usb 2-3: Product: SABRENT
[25343.442858] usb 2-3: Manufacturer: SABRENT
[25343.442860] usb 2-3: SerialNumber: DB98765432143
[25343.446053] scsi host1: uas
[25343.446591] scsi 1:0:0:0: Direct-Access SABRENT 0104 PQ: 0 ANSI: 6
[25343.448532] sd 1:0:0:0: Attached scsi generic sg0 type 0
[25353.377987] sd 1:0:0:0: [sda] 1953506646 4096-byte logical blocks: (8.00 TB/7.28 TiB)
[25353.378144] sd 1:0:0:0: [sda] Write Protect is off
[25353.378147] sd 1:0:0:0: [sda] Mode Sense: 53 00 00 08
[25353.378427] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[25353.378658] sd 1:0:0:0: [sda] Preferred minimum I/O size 32768 bytes
[25353.378662] sd 1:0:0:0: [sda] Optimal transfer size 268431360 bytes not a multiple of preferred minimum block size (32768 bytes)
[25384.996385] sd 1:0:0:0: [sda] tag#22 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN
[25384.996393] sd 1:0:0:0: [sda] tag#22 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[25385.016413] scsi host1: uas_eh_device_reset_handler start
[25385.148590] usb 2-3: reset SuperSpeed USB device number 8 using xhci_hcd
[25385.174465] scsi host1: uas_eh_device_reset_handler success
[25417.783354] scsi host1: uas_eh_device_reset_handler start
[25417.783528] sd 1:0:0:0: [sda] tag#24 uas_zap_pending 0 uas-tag 1 inflight: CMD
[25417.783535] sd 1:0:0:0: [sda] tag#24 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[25417.915763] usb 2-3: reset SuperSpeed USB device number 8 using xhci_hcd
[25417.937381] scsi host1: uas_eh_device_reset_handler success
[25450.530389] scsi host1: uas_eh_device_reset_handler start
[25450.530552] sd 1:0:0:0: [sda] tag#26 uas_zap_pending 0 uas-tag 1 inflight: CMD
[25450.530556] sd 1:0:0:0: [sda] tag#26 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[25450.658774] usb 2-3: reset SuperSpeed USB device number 8 using xhci_hcd
[25450.680523] scsi host1: uas_eh_device_reset_handler success
[25453.039632] sd 1:0:0:0: [sda] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=99s
[25453.039639] sd 1:0:0:0: [sda] tag#9 Sense Key : Aborted Command [current]
[25453.039641] sd 1:0:0:0: [sda] tag#9 Add. Sense: No additional sense information
[25453.039644] sd 1:0:0:0: [sda] tag#9 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[25453.039646] I/O error, dev sda, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[25453.039650] Buffer I/O error on dev sda, logical block 0, async page read
[25483.301277] sd 1:0:0:0: [sda] tag#10 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN
[25483.301299] sd 1:0:0:0: [sda] tag#10 CDB: Read(10) 28 00 00 00 00 00 00 00 01 00
[25483.345279] scsi host1: uas_eh_device_reset_handler start
[25483.477571] usb 2-3: reset SuperSpeed USB device number 8 using xhci_hcd
[25483.499402] scsi host1: uas_eh_device_reset_handler success
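To make the pattern in the log above easier to see, here is a minimal Python sketch that decodes the repeated Read(10) CDB and measures the spacing between the device resets. The function names are my own, not part of any tool, and the timestamps are copied by hand from the dmesg output. The Read(10) layout used here (opcode 0x28, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) is the standard SCSI one.

```python
import re

# "reset handler start" lines copied from the dmesg output above.
LOG = """\
[25385.016413] scsi host1: uas_eh_device_reset_handler start
[25417.783354] scsi host1: uas_eh_device_reset_handler start
[25450.530389] scsi host1: uas_eh_device_reset_handler start
[25483.345279] scsi host1: uas_eh_device_reset_handler start
"""


def decode_read10(cdb_hex):
    """Return (lba, block_count) for a Read(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length
    return lba, blocks


def reset_intervals(log):
    """Return the seconds between consecutive UAS device resets."""
    pattern = r"^\[\s*(\d+\.\d+)\] .*uas_eh_device_reset_handler start"
    stamps = [float(m.group(1)) for m in re.finditer(pattern, log, re.M)]
    return [round(b - a, 3) for a, b in zip(stamps, stamps[1:])]


print(decode_read10("28 00 00 00 00 00 00 00 01 00"))  # (0, 1)
print(reset_intervals(LOG))
```

Every failing command is the same one-block read of LBA 0, and the resets come roughly 33 seconds apart. That would be consistent with each read running into the default 30-second SCSI command timeout before the bridge is reset, i.e. the drive never returns the very first block.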
While the disk appears to report its capacity (7.28 TiB), I cannot get smartctl to show anything at all: when the disk is connected via the USB bridge, issuing -c, -i or -a ends with smartctl hanging. The disk does, however, "tick" rhythmically and rather quietly while smartctl remains stuck; it is not the louder "clicking" sound.
Additionally, I connected the drive directly via SATA on another system and am now getting different errors again, which I thought could be further evidence of the electronics failing. Correct me if I am wrong, but don't errors such as "failed to enable AA", "Read log 0x00 page 0x00 failed", etc. suggest a communication error with the disk? Or would these also appear when the disk fails to, e.g., operate the heads to read from the HPA?
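Since smartctl hangs rather than failing, one way to probe without losing the terminal is to wrap each call in a timeout. The following is only a sketch under assumptions: /dev/sda and the `-d sat` device type (SCSI-to-ATA translation, which many USB bridges need) are guesses for this setup, and `run_with_timeout` and `smartctl_cmd` are hypothetical helper names.

```python
import subprocess


def smartctl_cmd(device, flag, dev_type="sat"):
    """Build a smartctl command line; dev_type 'sat' is an assumption."""
    return ["smartctl", "-d", dev_type, flag, device]


def run_with_timeout(cmd, timeout_s=60):
    """Run cmd and return (returncode, output); returncode is None on timeout."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        return proc.returncode, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired as exc:
        # Partial output may arrive as bytes even when text=True was requested.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return None, out


# Example use (needs root and smartmontools; device name is an assumption):
#   for flag in ("-i", "-c", "-a"):
#       rc, out = run_with_timeout(smartctl_cmd("/dev/sda", flag))
#       print(flag, "timed out" if rc is None else f"exit status {rc}")
```

One caveat: if smartctl is blocked in uninterruptible I/O wait inside the kernel, killing it on timeout may not take effect until the pending command itself times out, so this helps with bookkeeping more than it guarantees a quick return.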
Code:
2023-11-28T20:36:43.823710+01:00 proxmox kernel: [ 1687.533870] ata6: link is slow to respond, please be patient (ready=0)
2023-11-28T20:36:48.059737+01:00 proxmox kernel: [ 1691.769848] ata6: COMRESET failed (errno=-16)
2023-11-28T20:36:49.027742+01:00 proxmox kernel: [ 1692.737841] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
2023-11-28T20:36:49.027770+01:00 proxmox kernel: [ 1692.739092] ata6.00: failed to read native max address (err_mask=0x100)
2023-11-28T20:36:49.027773+01:00 proxmox kernel: [ 1692.739769] ata6.00: HPA support seems broken, skipping HPA handling
2023-11-28T20:36:54.607898+01:00 proxmox kernel: [ 1698.317931] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
2023-11-28T20:36:54.607905+01:00 proxmox kernel: [ 1698.318902] ata6.00: ATA-9: WDC WD80EFAX-68LHPN0, 83.H0A83, max UDMA/133
2023-11-28T20:36:54.607906+01:00 proxmox kernel: [ 1698.319966] ata6.00: failed to enable AA (error_mask=0x1)
2023-11-28T20:36:54.611989+01:00 proxmox kernel: [ 1698.322029] ata6.00: Read log 0x00 page 0x00 failed, Emask 0x1
2023-11-28T20:36:54.611993+01:00 proxmox kernel: [ 1698.322786] ata6.00: NCQ Send/Recv Log not supported
2023-11-28T20:36:54.611994+01:00 proxmox kernel: [ 1698.323423] ata6.00: Read log 0x00 page 0x00 failed, Emask 0x40
2023-11-28T20:36:54.611994+01:00 proxmox kernel: [ 1698.324056] ata6.00: NCQ Send/Recv Log not supported
2023-11-28T20:36:54.611994+01:00 proxmox kernel: [ 1698.324707] ata6.00: Read log 0x00 page 0x00 failed, Emask 0x40
2023-11-28T20:36:54.611995+01:00 proxmox kernel: [ 1698.325353] ata6.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 32)
2023-11-28T20:36:54.611995+01:00 proxmox kernel: [ 1698.326043] ata6.00: failed to set xfermode (err_mask=0x40)
2023-11-28T20:36:54.615803+01:00 proxmox kernel: [ 1698.326712] ata6: limiting SATA link speed to 3.0 Gbps
2023-11-28T20:36:54.615824+01:00 proxmox kernel: [ 1698.327359] ata6.00: limiting speed to UDMA/133:PIO3
2023-11-28T20:37:00.063709+01:00 proxmox kernel: [ 1703.776061] ata6: SATA link down (SStatus 0 SControl 320)
2023-11-28T20:37:00.063743+01:00 proxmox kernel: [ 1703.776851] ata6.00: disable device
2023-11-28T20:37:00.803705+01:00 proxmox kernel: [ 1704.513792] ata6: SATA link down (SStatus 0 SControl 300)
Also, when connected directly like that, the system simply stops attempting to initialize the disk (ata6.00: disable device), so there's no access to SMART at all.
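The err_mask / Emask values in the SATA log above are bit masks, and decoding them at least shows which side libata blames for each failure. Below is a minimal sketch; the bit names follow the `ata_completion_errors` enum in the Linux kernel's include/linux/libata.h as far as I know it, and `decode_err_mask` is a hypothetical helper name.

```python
# Bit positions per the kernel's ata_completion_errors enum (libata.h).
AC_ERR_BITS = {
    0: "AC_ERR_DEV (device reported error)",
    1: "AC_ERR_HSM (host state machine violation)",
    2: "AC_ERR_TIMEOUT (timeout)",
    3: "AC_ERR_MEDIA (media error)",
    4: "AC_ERR_ATA_BUS (ATA bus error)",
    5: "AC_ERR_HOST_BUS (host bus error)",
    6: "AC_ERR_SYSTEM (system error)",
    7: "AC_ERR_INVALID (invalid argument)",
    8: "AC_ERR_OTHER (unknown)",
    9: "AC_ERR_NODEV_HINT (polling device detection hint)",
    10: "AC_ERR_NCQ (marker for offending NCQ command)",
}


def decode_err_mask(mask):
    """Return the names of all error bits set in a libata err_mask."""
    return [name for bit, name in AC_ERR_BITS.items() if mask & (1 << bit)]


# Masks as they appear in the log above.
for label, mask in [
    ("failed to read native max address", 0x100),
    ("failed to enable AA", 0x1),
    ("Read log 0x00 page 0x00 failed", 0x40),
    ("failed to set xfermode", 0x40),
]:
    print(f"{label}: {decode_err_mask(mask)}")
```

Going by this, 0x1 means the drive itself answered the command with an error status, while 0x40 and 0x100 are the kernel's generic "system" and "unknown" buckets. None of the masks in this log carries the timeout, media, or bus error bits, though that alone does not settle PCB versus heads.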
Lastly, I removed the PCB and did not see any immediate damage to it. I also cleaned it up a bit, including the spring joints, but that didn't do anything.
At this point I am wondering whether these SATA link errors could be indicative of a PCB failure. It would be an odd one, since the disk *does* spin up and partially reports itself (the model, at least). I am asking because I am not sure whether I should go to the effort of replacing the PCB. Donor PCBs for this model are readily available on AliExpress at a reasonable price, but it would take quite some work to re-solder the BIOS SMD chip.
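On the "partially reports itself" point, a quick arithmetic cross-check of the two logs: the USB bridge reports 1953506646 blocks of 4096 bytes, and the direct SATA connection reports 15628053168 sectors. The sketch below assumes the SATA figure counts 512-byte logical sectors, which the log does not state explicitly.

```python
# Figures taken from the two kernel logs above.
USB_BLOCKS, USB_BLOCK_SIZE = 1953506646, 4096
SATA_SECTORS = 15628053168
SATA_SECTOR_SIZE = 512  # assumption: libata counts 512-byte logical sectors


def capacity_bytes(count, unit_size):
    """Total capacity in bytes for `count` units of `unit_size` bytes."""
    return count * unit_size


usb = capacity_bytes(USB_BLOCKS, USB_BLOCK_SIZE)
sata = capacity_bytes(SATA_SECTORS, SATA_SECTOR_SIZE)

print(usb, sata, usb == sata)  # 8001563222016 8001563222016 True
print(f"{usb / 1e12:.2f} TB, {usb / 2**40:.2f} TiB")  # 8.00 TB, 7.28 TiB
```

So besides the model string and firmware revision, the drive returns the same capacity over both connections, meaning the identification data is at least self-consistent.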