Tools for hard drive diagnostics, repair, and data recovery

Re: Silent corruption due to software defragmenter?

February 25th, 2015, 4:39

xviruz wrote: An update: I did a scan through both my 4TBs, and each had one corrupt file with the exact same symptom.

Perhaps the corruption isn't really random.

Instead of looking for corrupt files, perhaps we should be looking for corrupt sectors. In other words, maybe the GPT metadata are consistently corrupting the same sectors of the HDD. That would explain why only one file per HDD is corrupt. I would use a disc editor to locate these sectors. I suspect that the corruption might occur around the 2TiB point.

Re: Silent corruption due to software defragmenter?

February 25th, 2015, 6:38

I have a theory that might explain the corruption, but it's very speculative.

I'm thinking that the OS may test the consistency of the GPT partition structure at bootup or whenever a drive is attached to a USB port. Specifically, the OS would compare the GPT information at the beginning of the drive (sectors 1 - 33) with the backup copy at the end of the drive. If the backup GPT doesn't match the primary GPT, then the backup would be reconstructed and rewritten to the drive. This would explain why the corrupted data are always the same.

However, it's clear that the original backup GPT data are not corrupt, so ISTM that the OS may be looking for it in the wrong place. Not only that, but the OS also appears to be restoring the corrected GPT data to the wrong place. Sometimes this wrong place falls within an existing file, resulting in insidious corruption.

I'm wondering whether the OS is being affected by a 32-bit LBA limitation during this time. This would mean that only 3TB or 4TB drives would be affected whereas 2TB drives would be immune. One place to look for the bogus GPT copies would be at the 4TB - 2TiB point on the 4TB drive and the 3TB - 2TiB point on the 3TB drive.

For example, the last sector of the 4TB drive is 7814037167 (= 0x1d1c0beaf).

If we ignore the 33rd bit, this becomes 3519069871 (= 0xd1c0beaf).

Therefore I would use a disc editor (e.g. DMDE) to examine sector 3519069871 of the 4TB physical drive, plus the previous 32 sectors.

Similarly, a 3TB drive would have a sector range of 0 - 5860533167 (= 0x15D50A3AF).

In this case one would examine sector 1565565871 (= 0x5D50A3AF).
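The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the speculative 32-bit truncation theory, not a description of what Windows or the controller actually does; the function name and constants are made up for the example, and 512-byte sectors are assumed.

```python
SECTOR_MASK_32 = 0xFFFFFFFF  # keep only the low 32 bits of an LBA
GPT_BACKUP_SECTORS = 33      # 32 sectors of partition entries + 1 header sector


def truncated_backup_range(last_lba: int) -> tuple[int, int]:
    """Return (first, last) sector where a 32-bit-truncated backup GPT would land."""
    wrapped_last = last_lba & SECTOR_MASK_32  # "ignore the 33rd bit"
    return wrapped_last - (GPT_BACKUP_SECTORS - 1), wrapped_last


# Last LBAs taken from the 4TB and 3TB examples above.
for name, last_lba in (("4TB", 7814037167), ("3TB", 5860533167)):
    first, last = truncated_backup_range(last_lba)
    print(f"{name}: last LBA {last_lba:#x} -> examine sectors {first}-{last}")
```

The returned range covers the wrapped sector plus the previous 32 sectors, which is the span to inspect in the disc editor.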

Re: Silent corruption due to software defragmenter?

February 25th, 2015, 8:53

Thanks for the information.

I can't find similar events in my logs but this one appears quite frequently:
"Reset to device, \Device\RaidPort2, was issued."

Yes, I will focus on the Marvell controller.
It would not be the first time a Marvell controller is unreliable.
See http://www.tomshardware.co.uk/answers/i ... oller.html

If anyone knows a good SATA controller card, with 4 internal ports, I'm really interested.
Performance is not a criterion as I will only connect slow HDDs to it.
My Adaptec card was actually my first choice but it slows down boot time way too much (2 min), prevents HDDs from sleeping and crashes on computer sleep.

Re: Silent corruption due to software defragmenter?

February 25th, 2015, 12:38

Nice, thanks for the explanation and references.

I find it interesting that the backup GPT header is being misplaced but its contents still have the right sector count. The 4TB, for example, has "MyLBA" at 0x01D1C0BEAF (= 7814037167, the last sector of the drive) and "LastUsableLBA" at 0x01D1C0BE8E (= 7814037134 = last sector - 33). Perhaps that's just an artifact of copying the primary GPT header.

However, three things don't make sense if the OS issued a real backup header write to the wrong address: (1) the CRC in this backup header (last 4 bytes) is always 0x0, which doesn't match the primary header; (2) the remainder of the sector occupied by the backup GPT header is not zeroed out; and (3) backup partition entries have yet to corrupt a file. Presumably (3) is because the backup partition entries are not created at all, which is why the CRC in the header is always 0x0. As for (2), the regular file data always resumes right after the end of the GPT header (relative offset 0x5B). In contrast, correct primary/backup headers are zero-filled to the end of the (512 byte in my case) sector.
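For anyone who wants to repeat these checks on a sector dumped from a disc editor, here is a rough Python sketch. The field layout follows the 92-byte GPT header as I understand it from the UEFI spec; the function name, the report keys, and the sample sector are invented for illustration and only mimic the symptoms described above.

```python
import struct

GPT_HEADER_FORMAT = "<8sIIIIQQQQ16sQIII"  # 92-byte GPT header, little-endian
GPT_HEADER_SIZE = struct.calcsize(GPT_HEADER_FORMAT)


def inspect_gpt_sector(sector: bytes) -> dict:
    """Decode the header fields discussed above from one raw sector."""
    (signature, _revision, _header_size, _header_crc, _reserved,
     my_lba, _alternate_lba, _first_usable, last_usable, _disk_guid,
     _entries_lba, _num_entries, _entry_size, entries_crc) = struct.unpack_from(
        GPT_HEADER_FORMAT, sector)
    return {
        "is_gpt_header": signature == b"EFI PART",
        "my_lba": my_lba,
        "last_usable_lba": last_usable,
        "entries_crc_is_zero": entries_crc == 0,                   # point (1)
        "tail_is_zero_filled": not any(sector[GPT_HEADER_SIZE:]),  # point (2)
    }


# Synthetic sector: a header with a zero entries CRC, followed by stand-in
# "file data" (0xAA bytes) instead of the zero fill a correct header has.
header = struct.pack(GPT_HEADER_FORMAT, b"EFI PART", 0x00010000, 92, 0, 0,
                     7814037167, 1, 34, 7814037134, bytes(16),
                     7814037135, 128, 128, 0)
sector = header + b"\xAA" * (512 - GPT_HEADER_SIZE)
report = inspect_gpt_sector(sector)
print(report)
```

A correct primary or backup header should report a non-zero entries CRC and a zero-filled tail.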

The 32-bit LBA corruption came to mind as well, as it seems pretty reasonable. The corrupt files on the two 4TBs were (thankfully) different, but they're located at sectors 6391200192 and 6407760664 respectively, close to the end of the drive. I have not seen any corruption in the file (different on each drive) occupying sectors 3519069839-3519069871...

Re: Silent corruption due to software defragmenter?

February 25th, 2015, 16:27

Actually, daltonwide's corrupt header does have a non-zero CRC for the partition entries, so I shouldn't be so confident with my "observations".

Re: Silent corruption due to software defragmenter?

February 25th, 2015, 16:53

Maybe the CRC calculation is different between Windows 7 and Windows 8.1?

I would be surprised if the bug came from the OS. I still suspect the controller/driver layer.
Maybe we see partition data just because the controller copies data from the last sectors (or an unerased buffer) by mistake, instead of reading the correct source?
Or does the specific data size of 16476 mean it can only come from an OS instruction?

Re: Silent corruption due to software defragmenter?

February 25th, 2015, 17:01

daltonwide wrote: I'm quite sure the files were correct when I copied them on the drives because I verified the copies and I did the backups from the now-corrupt files.
And I did not touch the files at all! No explicit defrag or whatever, just read only operations!
This happened on several files on different drives.

ISTM that the file corruption occurs in situ rather than during file transfers. This would suggest that particular sectors are being targeted rather than the files themselves.

xviruz wrote: The corrupt files on the two 4TBs were (thankfully) different, but they're located at sectors 6391200192 and 6407760664 respectively, close to the end of the drive.

There doesn't seem to be anything significant about those sector numbers. Moreover, they appear to be random. :?

xviruz wrote: (2) the remainder of the sector occupied by the backup GPT header is not zeroed out.

As for (2), the regular file data always resumes right after the end of the GPT header (relative offset 0x5B).

This would suggest that the relevant sectors are first read, then amended, and then written back to their original locations. The trailing bytes in the "EFI PART" sector could be "don't care" bytes. If these trailing bytes really do belong to the corrupted file instead of being introduced from somewhere else, then that would confirm that the affected sector is being edited rather than simply overwritten. Moreover, if the corrupt data were originating from a host side buffer, then we would expect to see junk data in the trailing bytes rather than the original file data.
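The distinction between "edited" and "overwritten" can be shown with a toy Python model. It is only a sketch of the reasoning, with made-up function names and stand-in data; it says nothing about which component actually performs the write.

```python
SECTOR_SIZE = 512
HEADER_SIZE = 92  # GPT header occupies bytes 0x00-0x5B of its sector


def read_modify_write(original: bytes, header: bytes) -> bytes:
    """Patch only the header bytes; the rest of the sector keeps its old content."""
    return header + original[len(header):]


def full_overwrite(header: bytes) -> bytes:
    """Write a freshly built, zero-padded sector, as a correct GPT header would be."""
    return header.ljust(SECTOR_SIZE, b"\x00")


def tail_matches_original(sector: bytes, original: bytes) -> bool:
    """True if everything after the header is still the original file data."""
    return sector[HEADER_SIZE:] == original[HEADER_SIZE:]


# Stand-in file data (all bytes non-zero) and a stand-in 92-byte header.
original = bytes((i * 7) % 251 + 1 for i in range(SECTOR_SIZE))
header = b"EFI PART".ljust(HEADER_SIZE, b"\x00")

edited = read_modify_write(original, header)
overwritten = full_overwrite(header)
print(tail_matches_original(edited, original))
print(tail_matches_original(overwritten, original))
```

Comparing the tail of a corrupt sector against a known-good backup of the same file would show which of the two cases applies.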

Re: Silent corruption due to software defragmenter?

February 26th, 2015, 3:18

daltonwide wrote: Maybe the CRC calculation is different between Windows 7 and Windows 8.1?

I don't think so... They're both a CRC of the partition entries array (the ~16K before or after the header). Plus it seems like a pretty bad idea to change GPT header formats for no good reason between OS iterations.
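If anyone wants to check the CRCs by hand, here is a minimal Python sketch. It assumes the GPT checksums are the ordinary CRC-32 that `zlib.crc32` computes (that is my understanding of the UEFI spec), with the header CRC calculated while its own field at offset 0x10 is zeroed. The helper `make_header` and all field values are hypothetical, used only to build test data.

```python
import struct
import zlib

GPT_HEADER_FORMAT = "<8sIIIIQQQQ16sQIII"  # 92-byte GPT header, little-endian


def header_crc_ok(header: bytes) -> bool:
    """Recompute the header CRC with the CRC field (offset 0x10) zeroed."""
    header_size, stored = struct.unpack_from("<II", header, 0x0C)
    zeroed = header[:0x10] + bytes(4) + header[0x14:header_size]
    return zlib.crc32(zeroed) == stored


def entries_crc_ok(header: bytes, entries: bytes) -> bool:
    """Recompute the CRC of the partition entries array (count * entry size bytes)."""
    count, size, stored = struct.unpack_from("<III", header, 0x50)
    return zlib.crc32(entries[:count * size]) == stored


def make_header(entries_crc: int) -> bytes:
    """Build a synthetic primary header (hypothetical values) with a valid header CRC."""
    fields = [b"EFI PART", 0x00010000, 92, 0, 0, 1, 7814037167, 34,
              7814037134, bytes(16), 2, 128, 128, entries_crc]
    fields[3] = zlib.crc32(struct.pack(GPT_HEADER_FORMAT, *fields))
    return struct.pack(GPT_HEADER_FORMAT, *fields)


# 128 entries of 128 bytes each: the ~16K array next to the header.
entries = b"hypothetical partition entry".ljust(128 * 128, b"\x00")
good = make_header(zlib.crc32(entries))
zeroed_crc = make_header(0)  # mimics a header whose entries CRC is all zeros
print(header_crc_ok(good), entries_crc_ok(good, entries))
print(header_crc_ok(zeroed_crc), entries_crc_ok(zeroed_crc, entries))
```

Running the same two functions on headers written by Windows 7 and Windows 8.1 would settle whether the calculation differs between them.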

daltonwide wrote: Or does the specific data size of 16476 mean it can only come from an OS instruction?

It's not so much the data size, but rather that this is the disk/partition metadata (GPT header + the location of the partitions) and therefore something that user-level programs will almost never access (see below).

As fzabkar mentioned, it seems like this corruption should only happen when connecting the drive or booting up the computer. The reason is that when a physical drive (let's assume a non-boot drive) is connected or powered on, the first thing Windows does is read the first three sectors of the drive: the first sector is the MBR (for backwards "compatibility"), followed by the GPT header (which indicates that this drive uses GUID partitioning and gives the sector number of the first partition entry), followed by the third sector, which contains the first few partition entries. These partition entries indicate the type of each partition as well as its starting and ending sector numbers. Using this, Windows can discover where the "basic data partition" is located. The actual file system (NTFS metadata and all your files) begins at the starting sector of that partition (e.g., if the data partition starts at sector 0x40800, the NTFS metadata will begin there, followed by your files). When Windows reads this, it will start locating all your files, folders and so on; this is when the drive appears and becomes accessible.
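To make the partition-entry part concrete, here is a small Python sketch that decodes one 128-byte entry. The layout and the two type GUIDs are the standard ones as far as I know; `parse_entry`, the sample entry, and its ending sector are made up for illustration.

```python
import struct
import uuid

ENTRY_FORMAT = "<16s16sQQQ72s"  # type GUID, unique GUID, first LBA, last LBA, attributes, name
KNOWN_TYPES = {
    uuid.UUID("EBD0A0A2-B9E5-4433-87C0-68B6B72699C7"): "Basic data partition",
    uuid.UUID("E3C9E316-0B5C-4DB8-817D-F92DF00215AE"): "Microsoft reserved partition",
}


def parse_entry(entry: bytes):
    """Decode one partition entry; return None for an unused (all-zero type) slot."""
    type_guid, _unique, first_lba, last_lba, _attrs, name = struct.unpack(
        ENTRY_FORMAT, entry[:128])
    if type_guid == bytes(16):
        return None
    guid = uuid.UUID(bytes_le=type_guid)  # GPT stores GUIDs in mixed-endian form
    return {
        "type": KNOWN_TYPES.get(guid, str(guid)),
        "first_lba": first_lba,
        "last_lba": last_lba,
        "name": name.decode("utf-16-le").rstrip("\x00"),
    }


# Synthetic entry: a data partition starting at sector 0x40800 (illustrative end sector).
basic = uuid.UUID("EBD0A0A2-B9E5-4433-87C0-68B6B72699C7")
sample = struct.pack(ENTRY_FORMAT, basic.bytes_le, bytes(range(16)),
                     0x40800, 0x1D10CB7FF, 0,
                     "Basic data partition".encode("utf-16-le"))
parsed = parse_entry(sample)
print(parsed)
```

The `first_lba` value is where the file system of that partition begins.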

So basically, the only time Windows is touching or reading the GPT headers is during this initial phase. Since both our errors have GPT headers with the correct sector info and valid partition entries, this seems like a logical place for the corruption to happen: this GPT/partition-entry data must first have been read from the drive. This is also why I believe most user-level programs will never read, let alone modify, the GPT header: programs interact with the file system, which is confined to a single partition (i.e., from sector 0x40800 to some ending sector, say, 0x1D10CB7FF). But that's all assuming that the partition sector numbers are kept in memory: if for some reason the GPT header is rescanned, there's another opportunity for corruption.

daltonwide wrote: Maybe we see partition data just because the controller copies data from the last sectors (or an unerased buffer) by mistake, instead of reading the correct source?

This might be possible: the controller has the partition entries and GPT header cached (for whatever reason) and is told to do a read... but instead selectively dumps its cache somewhere. It depends on how large such a cache is though---possible if it's several MB, unlikely if it's only a few KB. Also, it's not zeroing out the rest of the corrupt sector, so the controller may be doing a read followed by a write, which suggests something more complex.

Additionally, I made a mistake saying my "partition entries are not written": they're right before the header in the same file---that's what the "Microsoft reserved partition" and "Basic data partitions" are. My affected drives have exactly those two partitions. In your case, only having "Basic data partition" indicates that that physical drive doesn't have a reserved partition. Having two entries would mean the physical drive has two partitions. So this means our symptoms are actually slightly different: you have a proper CRC for the partition entries (haven't checked if it's actually correct though), whereas the CRC in my header is flat out wrong: the entries exist (non-zero) but the CRC is garbage (all 0s).

fzabkar wrote: This would suggest that the relevant sectors are first read, then amended, and then written back to their original locations. The trailing bytes in the "EFI PART" sector could be "don't care" bytes. If these trailing bytes really do belong to the corrupted file instead of being introduced from somewhere else, then that would confirm that the affected sector is being edited rather than simply overwritten. Moreover, if the corrupt data were originating from a host side buffer, then we would expect to see junk data in the trailing bytes rather than the original file data.

The sources you linked say it must be zeroed. Interesting point you raise though: are writes to drives always issued in 512 byte (sector sized) chunks? If so, then it is being read in, corrupted and not zeroed out, and then written back.

Guess it's difficult to fully diagnose it without repeatability and a closer inspection of the full stack.

Re: Silent corruption due to software defragmenter?

February 26th, 2015, 13:23

OK, I've finished recovering all my data, and I double-checked that every file is currently clean.
I am now trying to reproduce a corruption.
I will start with my motherboard controller, then I will test the Marvell card.

If I want to cover all possible scenarios, I should:
- boot/shutdown the PC or plug/unplug the drive
- copy files to the drive
- defragment the drive

The first two are easy, but do you know a simple (and reliable) way to fragment a drive (so that the Windows defragmenter has some work to do)?!

Anything else I should test?
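To tell whether one of these scenarios silently changed a file, something like the following Python sketch could record checksums before a test and compare them afterwards. The function names are invented, and the temporary directory only stands in for a real drive. One caveat I am assuming rather than certain about: files should be re-read after a reboot or reconnect, otherwise the OS cache may hide what is actually on disk.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {str(p.relative_to(root)): file_digest(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Files whose digest differs, or that vanished, between two snapshots."""
    return sorted(name for name in before if after.get(name) != before[name])


# Demonstration on a temporary directory standing in for a test drive.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.bin").write_bytes(b"original data")
    (root / "b.bin").write_bytes(b"untouched")
    before = snapshot(root)
    (root / "a.bin").write_bytes(b"EFI PART data")  # simulate a silent corruption
    after = snapshot(root)
    damaged = changed_files(before, after)
print(damaged)
```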

Re: Silent corruption due to software defragmenter?

February 26th, 2015, 17:01

daltonwide wrote: The first two are easy, but do you know a simple (and reliable) way to fragment a drive (so that the Windows defragmenter has some work to do)?!

Probably deleting some (large) old files and then copying in new ones to the drive should be enough to trigger fragmentation: deleting old files leaves "holes" that get filled by parts of new files. NTFS might be smart enough to prevent some types of fragmentation, so no guarantees.

daltonwide wrote: Anything else I should test?

Nothing really comes to mind...

It'd be useful to find out when Windows (or the BIOS/UEFI) accesses the GPT header and partition entries and how it caches that information. Then you could craft tests to trigger those scenarios, under the assumption that they would be more likely to cause corruption, and repeat them to see if there are any patterns (same sector? same file? etc.). For example, if you set a physical drive to "offline" and then back "online" again (in Disk Management), does that trigger a GPT header read? If so, you could do that instead of rebooting or physically disconnecting on every test.
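To check for the "same sector?" pattern after each test, one option is to scan the drive (or an image of it) for "EFI PART" signatures anywhere other than sector 1 and the last sector. Below is a rough Python sketch of that idea; the function name is made up, the in-memory image is a stand-in for a real disk opened read-only, and reading one sector at a time would be slow on a 4TB drive, so a real scan would want larger reads.

```python
import io

SECTOR_SIZE = 512
GPT_SIGNATURE = b"EFI PART"


def find_stray_gpt_headers(device, total_sectors: int) -> list[int]:
    """Return LBAs that start with the GPT signature but are not LBA 1 or the last LBA."""
    expected = {1, total_sectors - 1}
    stray = []
    for lba in range(total_sectors):
        sector = device.read(SECTOR_SIZE)
        if len(sector) < SECTOR_SIZE:
            break  # ran off the end of the image
        if sector.startswith(GPT_SIGNATURE) and lba not in expected:
            stray.append(lba)
    return stray


# Tiny synthetic 64-sector "drive": legitimate headers at sectors 1 and 63,
# plus a misplaced copy at sector 40.
total = 64
image = bytearray(total * SECTOR_SIZE)
for lba in (1, 40, total - 1):
    image[lba * SECTOR_SIZE:lba * SECTOR_SIZE + len(GPT_SIGNATURE)] = GPT_SIGNATURE
stray = find_stray_gpt_headers(io.BytesIO(bytes(image)), total)
print(stray)
```

A file that legitimately contains a disk image would also show up here, so any hit still needs to be checked by hand.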

Re: Silent corruption due to software defragmenter?

March 19th, 2015, 7:45

Since my last post, I did many operations on all my drives connected to the motherboard SATA controller. Absolutely no corruption occurred.
Yesterday, I did a backup to the drives connected to the Marvell controller. I had two corruptions: a silent one with partition data, and a controller crash that left the end of a file filled with zeros. Both files were part of my copy; no existing file was damaged.

The Marvell controller crashed when I tried to access two of its drives simultaneously (which usually works fine).
Windows started to log the Raid Reset event and both drives became unresponsive.
I have already had this kind of crash when hot-plugging or unplugging drives, but this time it happened during normal access.

So, my current conclusion is that the Marvell controller is highly unreliable and is the culprit for the silent corruptions.
The drives connected to it are in my hot-swap bays. That could also be a factor, but I don't think so, because I have used these bays for a long time and never had corruptions before.
I'm still using the default Windows driver for the Marvell card, and maybe the problem would disappear with the OEM driver. But for me this would not be acceptable.

I'm going to buy a card based on a Silicon Image controller and do other backups to confirm my conclusion.

Re: Silent corruption due to software defragmenter?

April 7th, 2015, 7:01

In the end, I replaced my Marvell card with two cards based on an ASM1061 controller.
This is the only change I made and I do not have corruptions anymore.