Greetings,
Most of you have probably read Charles Sobey's paper "Recovering Unrecoverable Data" (the link to it was posted in this forum several months ago).
In the paper, the author states that:
"It is not widely known, but many modern drives routinely check for thermal decay of bits in the field and rewrite the sectors in which degradation is identified"
I've contacted the author and asked the following questions about this:
Q1) What does "routinely check" mean? Take, for example, an 80GB disk. Scanning it sector by sector for thermal decay would take a long time, so I suppose the drive doesn't do it that way. Is the check made while reading data?
Q2) Servo data is written on the same magnetic media, so it will also suffer thermal decay. If the read/write heads rewrite degraded sectors, what about servo data?
I believe this subject will interest most of you, so I asked him for permission to post his answer in this forum, and he kindly agreed. I hope you enjoy it.
******************************************************
Q1: Hard disk drives check themselves for any degradation -- regardless of the cause -- not exclusively thermal decay. The check involves reading an entire sector and checking the ECC (Error Correction Coding).
An ECC "check" has two parts. The first is a "simple" calculation to determine if errors
are DETECTED. This involves calculating a "syndrome." If the syndrome is not 0, then a not-so-simple calculation must be performed to determine where the error is located and how to CORRECT it. Any source of read error will result in a non-zero syndrome.
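As an illustration of the detect/correct split (not part of the author's reply), here is a minimal sketch using a toy Hamming(7,4) code. This is an assumption made purely for brevity: real drives use far more powerful codes, and the function names below are hypothetical. The point is only that a zero syndrome means "nothing detected," while a non-zero syndrome triggers the extra work of locating and fixing the error.

```python
# Toy Hamming(7,4) code, used ONLY as an illustration.
# Real drive ECC is much stronger and works on whole sectors.

def encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def syndrome(codeword):
    """Detection step: 0 means no error detected."""
    c = codeword
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    return s1 + 2 * s2 + 4 * s3

def correct(codeword):
    """Correction step: for this toy code the syndrome value is the
    1-based position of a single flipped bit."""
    s = syndrome(codeword)
    fixed = list(codeword)
    if s != 0:
        fixed[s - 1] ^= 1
    return fixed

good = encode([1, 0, 1, 1])
damaged = list(good)
damaged[4] ^= 1  # simulate one decayed bit
print(syndrome(good), syndrome(damaged), correct(damaged) == good)
```

In this toy code, locating the error is trivial; in the codes drives actually use, that second step is the "not-so-simple" calculation mentioned above.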
If the syndrome is non-zero, but the data is still correctable (because of
the very powerful ECC used by hard disk drives), the corrected data can be re-written to the sector. This re-starts the thermal decay "clock." It is
good practice to verify the rewrite by checking its ECC. If the sector
repeatedly has errors, the drive's defect management routines will map out the bad sector and logically replace it with a spare sector.
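The rewrite, verify, and remap sequence described above can be sketched roughly as follows. This is a simulation under stated assumptions, not any manufacturer's firmware: the `Sector` model, the ECC strength `MAX_CORRECTABLE`, and the `REMAP_THRESHOLD` count are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

MAX_CORRECTABLE = 4   # assumed ECC strength (bit errors per sector)
REMAP_THRESHOLD = 3   # assumed failed rewrites tolerated before remapping

@dataclass
class Sector:
    bit_errors: int = 0         # errors the ECC would see on the next read
    media_defect: bool = False  # if True, a rewrite does not clear the errors
    failed_rewrites: int = 0
    remapped: bool = False

def scrub(sector):
    """Check one sector and return the action taken."""
    if sector.remapped:
        return "skip"
    if sector.bit_errors == 0:
        return "ok"                      # zero syndrome, nothing to do
    if sector.bit_errors > MAX_CORRECTABLE:
        return "unrecoverable"           # beyond what the ECC can correct
    # Correctable: rewrite the corrected data, then verify by re-reading.
    if not sector.media_defect:
        sector.bit_errors = 0            # fresh write restarts the decay "clock"
        return "rewritten"
    sector.failed_rewrites += 1          # verify failed: errors came back
    if sector.failed_rewrites >= REMAP_THRESHOLD:
        sector.remapped = True           # replace with a spare sector
        sector.bit_errors = 0
        return "remapped"
    return "rewrite_failed"

bad = Sector(bit_errors=2, media_defect=True)
print([scrub(bad) for _ in range(3)])
```

A sector that merely decayed is fixed by one rewrite, while a sector with a persistent defect keeps failing verification until it is mapped out.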
How often does the drive do this? I must use qualifying words like
"routinely" and "in general" and "typically" because drive manufacturers do not publicly define exactly what they are doing inside their drives. A large customer of a drive company (e.g., Dell) can probably get whatever
information they want, but it will only be under strict non-disclosure
agreements. Furthermore, every drive manufacturer does things a little
differently. Differences are found even in different drives from the same
manufacturer.
This check can happen on-the-fly, but there are also modes in which the
drive reads every sector (typically after an idle time of > 10 minutes) and
re-writes or re-allocates as needed. This is what I referred to in the
paper. At the Western Digital website (http://www.wdc.com), I believe that there is information about their "data lifeguard" feature that may discuss some of this behavior.
Sure, the check takes a long time, but if your drive is idle (and it is not
in a mobile device) there is little to lose and much to gain by making the
check. In an always-accessing server application, this is not a good option
and other system-level data integrity steps can be taken (e.g., RAID and
mirroring).
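To put a rough number on "a long time" for the 80GB disk in the question, here is a back-of-the-envelope sketch. The sustained read rate of 40 MB/s is an assumption picked only for illustration (it is not from the author's reply), and the function name is hypothetical. The estimate also ignores seeks, retries, and any rewrites.

```python
def full_scan_minutes(capacity_bytes, throughput_bytes_per_s):
    """Time to read every sector once at a constant sustained rate."""
    return capacity_bytes / throughput_bytes_per_s / 60

# 80 GB disk, ASSUMED 40 MB/s sustained read rate
minutes = full_scan_minutes(80e9, 40e6)
print(round(minutes, 1))
```

Under that assumption, a full pass is on the order of half an hour, which is affordable for an idle desktop drive but not for a server that is always busy.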
It is also possible to check the bits themselves without using the ECC, or
even in conjunction with the ECC, to gauge decay BEFORE an ECC-detectable error occurs. I do not know if any company uses such methods in the field, although they are used in manufacturing as a test.
The bottom line to your question is, "routinely check" means whatever the
drive manufacturer has decided for that model. If you are a big purchaser,
the drive company will probably tell you. If not, you may be able to monitor the behavior of a group of idle drives and see (that is, hear) what happens and when. In some cases, the manufacturer may provide this level of information in its extensive "product manuals." These are different from its spec sheets or data sheets.
Q2: You are very observant to note that servo presents unique challenges.
All data on the disk are subject to thermal decay; however, some bits are
more susceptible than others.
For example, in addition to temperature causing bits to decay, magnetic
fields weaken bits also. These weakened bits are then more likely to be
affected by thermal decay. Ignoring stray external fields as an obvious
issue, magnetic fields come from other, in-drive, sources. Tightly spaced
transitions (high density bits in the "down-the-track" direction) weaken
each other (this causes a lot of other problems and is fundamentally linked
to the maximum capacity of a disk surface). Also, writing to a sector
results in a side-fringing field that can weaken the transitions on adjacent
tracks.
Servo sectors are typically (a qualifying word again) written at a much
lower density than data sectors. Plus, the servo portions of tracks are
aligned radially next to each other and are never written in the field.
Therefore, they are not subject to the side-erase effect due to
side-fringing fields during writing.
These two facts make servo information more stable than data -- in general. Furthermore, robust drive designs should be able to handle having a few servo sectors in a row (<=3) be corrupted -- but only at a few locations. To my knowledge, no drive re-writes servo information in the field. It is theoretically possible, but there are many practical issues.
One main issue is that if your lower-density, never-written-next-to servo
bits are decaying ... your data bits are probably long gone!
********************************************************
Regards,
Daniel