@MrSpiffdifilous,
MrSpiffdifilous wrote:
Thank you very much Vulcan.
You're very welcome

MrSpiffdifilous wrote:
It looks like I'm going to have to start recording every single event going forward to try to get a better idea of what's going on here.
If the historical data isn't available, or you can't be confident in its accuracy, then good recording of the data from now would certainly be part of my plan too. The pattern of which systems are affected, compared to which are not affected (assuming that those 2 groups have similar numbers of drives being tested, but note my comment below about the "not affected" group) should help to produce a list of suspects for further investigation. If you appear to find that every test station is affected, then you're either seeing multiple causes or one common cause affecting them all...
FYI, as with any intermittent problem, just be careful of believing that a test station is not affected, especially with a small sample size of test results. It's impossible to say with 100% confidence that any non-trivial system is "not affected" by a problem. For example (using some made-up numbers), all we can say is that test station X is probably not affected since after 100 tests, there had been no TVS events, whereas on test station Y there had been 10 TVS events during 100 tests. My point is that we can never know whether (in this example) the 101st test on station X would have caused the first TVS event.
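To put rough numbers on that point, here is a minimal Python sketch. It assumes every test is independent and has the same fixed chance of causing a TVS event, which may well not hold for an intermittent fault, so treat the output as a guide only. The function names are mine, and the 100 tests and 10% rate are the made-up numbers from the example above.

```python
def zero_event_upper_bound(n_tests: int, confidence: float = 0.95) -> float:
    """Largest per-test event probability still consistent (at the given
    confidence) with seeing zero events in n_tests independent tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Solve (1 - p) ** n_tests = 1 - confidence for p.
    return 1.0 - (1.0 - confidence) ** (1.0 / n_tests)


def prob_zero_events(n_tests: int, per_test_rate: float) -> float:
    """Chance of seeing no events at all in n_tests independent tests."""
    return (1.0 - per_test_rate) ** n_tests


if __name__ == "__main__":
    # Station X: 0 events in 100 tests (made-up numbers).
    bound = zero_event_upper_bound(100)
    print(f"Station X per-test rate could still be up to {bound:.1%}")

    # If station X really had station Y's rate (10 in 100), how likely
    # is a clean run of 100 tests?
    print(f"Chance of a clean run at 10%: {prob_zero_events(100, 0.10):.6f}")
```

Under those assumptions, a clean run of 100 tests only rules out per-test rates above roughly 3%. It does make it very unlikely that station X is as bad as station Y, but it can't show that station X is fault-free.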
That new testing & recording process will take time and cost money, and perhaps lead to more damaged drives, which is a worry, so obviously I was hoping to use data from previous events, since that testing (and damage) has already happened. Therefore, to reduce further testing time (and damage), if you can find any test station where you're sure an attached drive's TVS has previously conducted without human error being involved (whether the PSU shut down, or stayed powered on and caused the TVS to overheat), I would start investigating that test station with a DSO as a priority, right now.

My guess is that you'll (perhaps intermittently? perhaps depending on the length of time the PSU was left powered off?) see an overshoot on one or both power rails when the PSU powering the drive is switched on.
One piece of historical data which you might have, and which may help you to narrow down your investigation, is which TVS diode was damaged on previous drives. Was it always the 12V TVS? Always the 5V TVS? A mixture of both? For example, if it was always the 5V TVS, then that would be the DC output rail to start looking for evidence with the DSO, IMHO.
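If you do have those records, even a simple tally by rail and by test station would show any pattern quickly. A minimal Python sketch follows. The record layout and the station names are hypothetical placeholders of mine, not real data, so substitute whatever you have actually logged.

```python
from collections import Counter

# Hypothetical placeholder records: (test_station, damaged_rail).
# Replace these with the real entries from your own logs.
events = [
    ("station_A", "5V"),
    ("station_A", "5V"),
    ("station_B", "12V"),
]


def tally(records):
    """Count TVS events per damaged rail and per test station."""
    by_rail = Counter(rail for _, rail in records)
    by_station = Counter(station for station, _ in records)
    return by_rail, by_station


by_rail, by_station = tally(events)
print("Events per rail:   ", by_rail.most_common())
print("Events per station:", by_station.most_common())
```

If one rail or one station dominates the counts, that's where I'd point the DSO first.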
MrSpiffdifilous wrote:
I have found several drives where the TVS has actually broken one of its solder joints on the board, but that seems to be more the exception than the rule in the overall issue.
I'd be interested to see a close-up photo, if you have time to take one, but in general, TVS (and surrounding area) damage depends on the length of time that the TVS diode is conducting. If a drive's PCB doesn't include some kind of fusible link in the supply rail with a TVS (whether it's a fuse, a polyswitch, a zero Ohm resistor or similar) then when the TVS diode conducts, if the PC PSU doesn't shut down quickly, overheating of the TVS diode & surrounding drive PCB is very likely. This is more likely to happen with higher wattage PSUs. I don't want to prejudge whether those 550W PSUs are causing your original problem or not - DSO investigation is needed to check - but such high wattage PSUs do increase the risk that they won't shut down when a TVS conducts, depending on exactly how the TVS behaves when abused in that way. Personally I wouldn't be using that type of PSU, but other options are a whole new conversation...
I see I've overlapped with some similar comments from SAjunky while I've been typing this.

I've got to get back to my day job now - I've got some code to debug... Hope the above comments help, or at least give you some ideas for investigating the problem. Good luck!