I recovered the contents of a 4TB HDD, using both filesystem-based recovery (with R-Studio) and so-called “raw” recovery (with R-Studio and Photorec, specifically for MPG files). For most file types, it's relatively easy to sort out the mess afterward : the files found by way of “raw” recovery are either exact duplicates of files extracted based on filesystem information, or partial duplicates (if the file was fragmented, or if the carving method adds or removes a small chunk of data at the end), or have no matching counterpart (if the corresponding file was no longer referenced by a file record – those are the ones I want to isolate). A duplicate remover like DoubleKiller can detect partial duplicates based on, for instance, the checksum of the first 100KB.
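To make the “checksum of the first 100KB” idea concrete, here is a minimal Python sketch of that kind of grouping. It is not how DoubleKiller works internally (that is unknown to me) ; the function names, the choice of SHA-1 and the 100KB constant are assumptions for illustration only.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

PREFIX_LEN = 100 * 1024  # assumed prefix size, matching the 100KB example


def prefix_digest(path, length=PREFIX_LEN):
    """Return the SHA-1 hex digest of the first `length` bytes of a file."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(length)).hexdigest()


def group_by_prefix(paths, length=PREFIX_LEN):
    """Map each prefix digest to the files sharing it (groups of 2+ only)."""
    groups = defaultdict(list)
    for p in paths:
        groups[prefix_digest(p, length)].append(Path(p))
    return {digest: files for digest, files in groups.items() if len(files) > 1}
```

Note that a file shorter than the prefix length is hashed in full, so a carved copy truncated below 100KB would not be grouped with its longer counterpart.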
It gets much more complicated for files like MPG / VOB / MTS, which have numerous headers at regular intervals, and can therefore be extracted as small fragments, depending on which of these intermediate headers the carving algorithm considers to be the beginning of a new file. (And the version of R-Studio I used at the time did not indicate whether any given “extra found file” was entirely included in a file from “Root”. I don't know if this has changed recently ; I've read about such a feature somewhere.)
So it looks like this :
https://www.cjoint.com/c/JIgfwgSZIyA
All files in orange and red on the right side are fragments of MPG and MTS files, and most of them must be part of files located somewhere on the left side, within ISO DVD images, DVD folders with VOB files, MPG files, MTS / M2TS files (and also some files with no common extension). I'm trying to find a method to match them all, in a way that is reliable and as unattended as possible. There are about 42000 files to analyse.
What I tried so far :
– extracting a short sequence of bytes from each file to a text file ;
– using that text file as a list of search terms to run a “Simultaneous search” with WinHex (in “logical” mode, which scans on a file-by-file basis within the open volume, deals with fragmented / compressed files, and reports the offset of each search hit relative to the beginning of the file, in addition to the absolute offset ; it also makes it possible to exclude irrelevant files, in this case a lot of videos of other types like MKV / MP4, which reduced the amount of data to scan to about 1.7TB).
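For reference, the extraction step above could be sketched in Python as follows. This is only an illustration, not the PowerShell I actually used : the function names and the tab-separated output layout are made up, and the default of 64 bytes at offset 0 is simply one of the settings discussed in the caveats. Each line carries the signature both as plain hex and in the \x15\x1A\x6D... notation, together with the size and path of the fragment.

```python
from pathlib import Path


def extract_signature(path, offset=0, length=64):
    """Read up to `length` bytes at `offset` (shorter if the file is small)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def to_grep_hex(data):
    r"""Format bytes in \xNN notation, e.g. \x15\x1A\x6D."""
    return "".join(f"\\x{b:02X}" for b in data)


def write_search_terms(fragment_dir, out_file, offset=0, length=64):
    """Write one line per fragment: plain hex, \\xNN form, size, path.

    Fragments too short to yield a full signature are skipped.
    Returns the number of lines written.
    """
    count = 0
    with open(out_file, "w", encoding="utf-8") as out:
        for p in sorted(Path(fragment_dir).rglob("*")):
            if not p.is_file():
                continue
            sig = extract_signature(p, offset, length)
            if len(sig) < length:
                continue
            size = p.stat().st_size
            out.write(f"{sig.hex().upper()}\t{to_grep_hex(sig)}\t{size}\t{p}\n")
            count += 1
    return count
```

Keeping the signature as hex text means null bytes and line breaks in the data never reach the text file as raw characters, and the plain hex column can serve as a key to join this list with any other list keyed on the same bytes.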
Caveats :
– the sequence must be specific enough : at first I thought it would be fine to extract 16 bytes at offset 1024, for instance, but many files have empty or redundant data at that spot, and others at other spots, so it doesn't seem possible to find a “sweet” spot which is specific enough for all files ; extracting 64 bytes at offset 0 seems to work well for MPG files (few false positives), but not for MTS files (not specific enough) ; a much longer sequence would be impractical ;
– running the search as ASCII text is unreliable, because of null bytes, line breaks and whatnot, and apparently there's no strict “1-to-1 mapping” between ASCII text and raw bytes (ASCII only defines the values 00 to 7F) ; I tried using the provided “filter” to convert the strings to ISO-8859-1 (apparently the only text encoding which can be reliably translated to hex and back, if I understand it correctly), but so far I can't get this to work (my current knowledge and experience of PowerShell is very limited ; just getting a single line to give me the expected result is exhausting, and I've spent a good chunk of the night staring at those mysterious snippets of code for which I can't even find a proper explanation in a 1000-page PowerShell PDF, it's quite humbling...) ;
– WinHex can perform a search in “Grep Hex”, with bytes represented as such : \x15\x1A\x6D... ; I managed to extract strings in this format with PowerShell, but the problem is that, when WinHex lists the search hits, the search terms are translated to ASCII text instead of being copied in the same format, and if a term begins with “00” (as it does with MPG headers) it appears totally empty ; in a later step, to verify whether a matched file is indeed exactly and completely included in a valid file, I want to calculate the checksum of that file and compare it with that of the corresponding segment of the valid file (based on the relative offset provided in the search result), and only the search term would make it possible to connect matching items from list A (search term + name of the file fragment + size) and list B (list of search hits) ; there doesn't seem to be an option in WinHex to display the search terms in hexadecimal ; the alternative would be to run the search with ASCII search terms, but then it goes back to the issue above ;
– then there's the issue of performance : I tried to run a search with WinHex and a list of 42378 search terms (one per file), and it choked big time ; then with about 5000 it worked, and after 4-5 hours it got more than 20000 hits (some from the fragments themselves, some from valid files), but, as mentioned above, since the field where the search terms should be is empty, it's useless as it is (even within the list of search results there's no way to sort the results so that two files found with the same search term get grouped).
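To illustrate what the search and verification steps would look like outside of WinHex, here is a rough Python sketch, under stated assumptions. It indexes all fragments by their first 64 bytes in a dictionary, scans a candidate parent file for a common anchor (for instance b"\x00\x00\x01\xBA", the pack start code of an MPEG program stream), looks up the bytes at each anchor position in the dictionary, and confirms a hit by comparing the checksum of the fragment with that of the same-sized segment of the parent. All names are hypothetical ; it only finds fragments whose signature starts with the chosen anchor, it works on recovered files rather than on a volume, and it has not been tried on 1.7TB of data.

```python
import hashlib
import mmap
import os
from collections import defaultdict

SIG_LEN = 64  # assumed signature length (first 64 bytes of each fragment)


def sha1_of_range(path, start, size, chunk=1 << 20):
    """SHA-1 hex digest of `size` bytes of `path`, starting at `start`."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        f.seek(start)
        remaining = size
        while remaining > 0:
            block = f.read(min(chunk, remaining))
            if not block:
                break
            h.update(block)
            remaining -= len(block)
    return h.hexdigest()


def build_index(fragments, sig_len=SIG_LEN):
    """Map each signature to the list of (fragment path, size) sharing it."""
    index = defaultdict(list)
    for path in fragments:
        size = os.path.getsize(path)
        if size < sig_len:
            continue  # too short to have a full signature
        with open(path, "rb") as f:
            index[f.read(sig_len)].append((path, size))
    return index


def find_contained(parent, index, anchor, sig_len=SIG_LEN):
    """Yield (fragment path, offset) for fragments exactly contained in `parent`."""
    parent_size = os.path.getsize(parent)
    if parent_size == 0:
        return  # an empty file cannot be memory-mapped
    with open(parent, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(anchor)
        while pos != -1:
            for frag, size in index.get(mm[pos:pos + sig_len], ()):
                # Confirm the whole fragment, not just its signature.
                if (pos + size <= parent_size and
                        sha1_of_range(parent, pos, size) == sha1_of_range(frag, 0, size)):
                    yield frag, pos
            pos = mm.find(anchor, pos + 1)
```

With a dictionary lookup, the number of search terms no longer matters much : each parent file is read once, whether there are 5000 signatures or 42378. The weak point is the same as in the caveats above : if many fragments share a non-specific signature, each candidate position triggers many checksum comparisons (caching the fragment checksums would help).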
So, is there anything else I could try to get this done ? Am I trying hard to reinvent the wheel, i.e. is there an existing tool designed for that purpose ? By the way, what is the standard practice in the data recovery business, regarding filesystem-based recovery vs. “raw” recovery ? Is the latter performed on a case-by-case basis, only if the filesystem is too damaged, or if the client specifically requests that the recovery be as thorough as possible ? Is there recovery software which can cleverly determine that an MPG signature found within an MPG / VOB file is normal and shouldn't be displayed in the list of “raw” / “extra” files, and which won't display a gazillion small fragments for what used to be a single contiguous file ?
I hope that it's clear enough...