All times are UTC - 5 hours [ DST ]




 Post subject: Finding fragment of files (MPG-MTS) inside of valid files
PostPosted: September 6th, 2020, 2:46 
Joined: November 22nd, 2017, 21:47
Posts: 293
Location: France
I recovered a 4TB HDD using both filesystem-based recovery (with R-Studio) and so-called “raw” recovery (with R-Studio and Photorec, the latter specifically for MPG files). For most file types, it's relatively easy to sort out the mess afterward: the files found by “raw” recovery are either exact duplicates of files extracted based on filesystem information, or partial duplicates (if the file was fragmented, or if the carving method adds or removes a small chunk of data at the end), or have no matching counterpart (if the corresponding file was no longer referenced by a file record; those are the ones I want to isolate). A duplicate remover like DoubleKiller can detect partial duplicates based, for instance, on the checksum of the first 100KB.
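
A minimal Python sketch of that kind of partial-duplicate check, assuming nothing about how DoubleKiller works internally: it groups files whose first 100 KB have the same MD5. The function names and the `PREFIX_SIZE` default are only illustrative.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

PREFIX_SIZE = 100 * 1024  # compare only the first 100 KB of each file


def prefix_digest(path, size=PREFIX_SIZE):
    """Return the MD5 hex digest of the first `size` bytes of a file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(size)).hexdigest()


def group_by_prefix(paths, size=PREFIX_SIZE):
    """Group files whose first `size` bytes are identical.

    Only groups containing more than one file are returned.
    """
    groups = defaultdict(list)
    for p in paths:
        groups[prefix_digest(p, size)].append(Path(p))
    return [g for g in groups.values() if len(g) > 1]
```

Files shorter than the prefix size are hashed in full, so two short files only group together when they are identical.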

It gets much more complicated for files like MPG / VOB / MTS, which have numerous headers at regular intervals and can therefore be extracted as small fragments, depending on which of these intermediate headers the carving algorithm treats as the beginning of a new file. (The version of R-Studio I used at the time did not indicate whether a given “extra found file” was entirely contained in a file from “Root”. I don't know if that has changed recently; I've read about such a feature somewhere.)
So it looks like this:
https://www.cjoint.com/c/JIgfwgSZIyA
All files in orange and red on the right side are fragments of MPG and MTS files, and most of them must be part of files located somewhere on the left side: ISO DVD images, DVD folders with VOB files, MPG files, MTS / M2TS files (and also some files with no common extension). I'm trying to find a method to match them all that is as reliable and as unattended as possible. There are about 42000 files to analyse.
What I tried so far:
– extracting a short sequence of bytes from each file to a text file;
– using that text file as a list of search terms to run a “Simultaneous search” with WinHex. I used “logical” mode, which scans file by file within the open volume, deals with fragmented / compressed files, and reports the offset of each search hit relative to the beginning of the file in addition to the absolute offset. It also allows excluding irrelevant files, in this case a lot of videos of other types like MKV / MP4, which reduced the amount of data to scan to about 1.7TB.
Caveats:
– The sequence must be specific enough. At first I thought it would be fine to extract 16 bytes at offset 1024, for instance, but many files have empty or redundant data at that spot, and others at other spots; it doesn't seem possible to find a “sweet” spot which is specific enough for all files. Extracting 64 bytes at offset 0 seems to work well for MPG files (few false positives), but not for MTS files (not specific enough), and a much longer sequence would be impractical.
– Running the search as ASCII text is unreliable because of null bytes, line breaks and whatnot, and apparently there is no strict “1-to-1 mapping” between ASCII and hexadecimal. I tried using the provided “filter” to convert the strings to ISO-8859-1 (apparently the only text encoding which can be reliably translated to hex and back, if I understand correctly), but so far I can't get it to work. My knowledge and experience of PowerShell are very limited: just getting a single line to give me the expected result is exhausting, and I've spent a good chunk of the night staring at mysterious snippets of code for which I can't even find a proper explanation in a 1000-page PowerShell PDF. It's quite humbling...
– WinHex can perform a search in “Grep Hex”, with bytes represented like this: \x15\x1A\x6D... I managed to extract strings in this format with PowerShell, but when WinHex lists the search hits, the search terms are shown as ASCII text instead of being copied in the same format, and a term that begins with “00” (as MPG headers do) appears totally empty. This matters for a later step: to verify that a matched file is exactly and completely included in a valid file, I want to calculate the checksum of that file and compare it with that of the corresponding segment of the valid file (based on the relative offset provided in the search result), and only the search term can connect matching items from list A (search term + name of the file fragment + size) and list B (list of search hits). There doesn't seem to be an option in WinHex to display the search terms in hexadecimal. The alternative would be to run the search with ASCII search terms, but that leads back to the issue above.
– Then there's the issue of performance. I tried to run a search with WinHex and a list of 42378 search terms (one per file), and it choked big time. With about 5000 it worked, and after 4-5 hours it had more than 20000 hits (some from the fragments themselves, some from valid files). But, as mentioned above, since the field where the search terms should be is empty, the result is useless as it is (even within the list of search results there's no way to sort them so that two files found with the same search term get grouped).
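
One possible way around both the hex display problem and the performance problem, sketched in Python under stated assumptions rather than offered as a tested tool: keep the first bytes of each fragment as a raw byte key in a dictionary, then walk through each valid file and look up the bytes found at every aligned offset. The cost then no longer depends on the number of search terms, and the key can be written out as hex to join the two lists. The sketch assumes that every fragment starts on a 512-byte boundary of the file it came from (plausible when a carver scans sector by sector, but worth checking on a few known cases); `BLOCK`, `KEY_LEN` and the function names are illustrative.

```python
from collections import defaultdict

BLOCK = 512    # assumed alignment of fragment starts inside the original files
KEY_LEN = 64   # number of bytes taken from the start of each fragment


def build_index(fragment_paths, key_len=KEY_LEN):
    """Map the first `key_len` bytes of each fragment to the fragment paths."""
    index = defaultdict(list)
    for path in fragment_paths:
        with open(path, "rb") as f:
            key = f.read(key_len)
        if len(key) == key_len:  # ignore fragments shorter than the key
            index[key].append(path)
    return index


def find_candidates(reference_path, index, key_len=KEY_LEN, block=BLOCK,
                    chunk_blocks=32768):
    """Yield (fragment_path, offset, key_hex) for every aligned hit."""
    if key_len > block:
        raise ValueError("key_len must not exceed block")
    chunk_size = block * chunk_blocks  # read in large, block-aligned chunks
    base = 0
    with open(reference_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Test only offsets that are multiples of `block`
            for pos in range(0, len(chunk) - key_len + 1, block):
                key = chunk[pos:pos + key_len]
                for fragment in index.get(key, ()):
                    yield fragment, base + pos, key.hex()
            base += len(chunk)
```

Each hit is only a candidate: the whole fragment still has to be compared with the data found at that offset before anything is deleted.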

So, is there anything else I could try to get this done? Am I trying hard to reinvent the wheel, i.e. is there an existing tool designed for that purpose? By the way, what is the standard practice in the data recovery business regarding filesystem-based recovery vs. “raw” recovery? Is the latter performed on a case-by-case basis, only if the filesystem is too damaged or if the client specifically requests that the recovery be as thorough as possible? Is there recovery software which can cleverly determine that an MPG signature found within an MPG / VOB file is normal and shouldn't be displayed in the list of “raw” / “extra” files, and which won't display a gazillion small fragments for what used to be a single contiguous file?

I hope that it's clear enough...


 Post subject: Re: Finding fragment of files (MPG-MTS) inside of valid file
PostPosted: September 6th, 2020, 12:13 
Joined: May 13th, 2019, 7:50
Posts: 221
Location: Nederland
Step back. What happened to the drive in the first place and what was the original file system?

_________________
www.disktuna.com - photo repair service


 Post subject: Re: Finding fragment of files (MPG-MTS) inside of valid file
PostPosted: September 6th, 2020, 14:27 
Joined: November 22nd, 2017, 21:47
Posts: 293
Location: France
Drive was re-formatted. Was originally formatted in Ext4 (in some kind of NAS). But how would it be relevant now ? (I did this recovery some months ago, the original drive has been repurposed since then, so all I have is the recovery. Again, most of the drive's contents could be recovered from the filesystem analysis; what I want is to delete as many redundant files as possible. There's about 1TB worth of MPG / MTS fragments, most of them redundant, and I want to keep only those files among that bunch which are truly “extra”, if any, and delete the rest.)

The method I'm looking for would be useful in all situations where both a filesystem based recovery and a “raw” recovery are performed on the same storage device. I've had quite a few recovery situations where I had to painstakingly sort out the actual “extra” files from exact or partial copies of files which were correctly recovered through the filesystem analysis. But in this case, because of the sheer volume of extra files, and because of that specific issue with MPG / MTS files (which prevents a standard duplicate remover from detecting matching files), a manual check would be totally impractical and far too time-consuming.

I did manually check some of the bigger files. Those were too large anyway for the intended method to work: Photorec extracted MPG files with a size near 2GB for what was originally a DVD folder, hence with VOB files of 1GB at most, so the string search would have worked but the checksum comparison would then have been negative. It went like this:
Code:
f2175681376.mpg 0-1073565695 = DANSE_AVEC_LOUPS\VIDEO_TS\VTS_01_1.VOB
                1073565696-1073741823 = fragment of another movie, probably deleted
                1073741824-1140850687 (end of file) = DANSE_AVEC_LOUPS\VIDEO_TS\VTS_01_2.VOB 0-67108863
f2178168944.mpg 0-1433599 = fragment of another movie, probably deleted
                1433600-1007890431 = DANSE_AVEC_LOUPS\VIDEO_TS\VTS_01_2.VOB 67108864-1073565695
                1007890432-1008066559 = too short, unidentified
                1008066560-2014699519 (end of file) = DANSE_AVEC_LOUPS\VIDEO_TS\VTS_01_3.VOB 0-1006632959
f2182356824.mpg 0-4722687 = too short, unidentified
                4722688-71655423 = DANSE_AVEC_LOUPS\VIDEO_TS\VTS_01_3.VOB 1006632960-1073565695
                71655424-71831551 = unidentified
                71831552-1145397247 = DANSE_AVEC_LOUPS\VIDEO_TS\VTS_01_4.VOB 0-1073565695
                1145397248-1145573375 = unidentified
                1145573376-2017988607 (end of file) = DANSE_AVEC_LOUPS\VIDEO_TS\VTS_01_5.VOB 0-872415231
f2186554552.mpg 0-2969599 = unidentified
                2969600-204120063 = DANSE_AVEC_LOUPS\VIDEO_TS\VTS_01_5.VOB 872415232-1073565695
                204120064-204296191 = unidentified
                204296192-1277861887 = DANSE_AVEC_LOUPS\VIDEO_TS\VTS_01_6.VOB
                1277861888-1278038015 = unidentified
                1278038016-2016235519 (end of file) = DANSE_AVEC_LOUPS\VIDEO_TS\VTS_01_7.VOB 0-738197503
f2190754504.mpg 0-77823 = unidentified
                77824-335446015 = DANSE_AVEC_LOUPS\VIDEO_TS\VTS_01_7.VOB 738197504-1073565695
                335446016-335622143 = unidentified
                335622144-808830975 (end of file) = DANSE_AVEC_LOUPS\VIDEO_TS\VTS_01_8.VOB

In this particular case, one file recovered by Photorec contains data belonging to several distinct original files, but the vast majority are much smaller and must be entirely included within one original file. So the idea is to identify which file each fragment belongs to (if any) and at which offset relative to the beginning of that file the matching chunk starts, then to verify that the checksum of the whole “carved” file matches that of the corresponding chunk of data from the original file. And to do that in batch, in as few unattended steps as possible.
For instance, suppose a 16-byte string is extracted from all files at offset 1024, and WinHex finds the string from file f123456.mpg, which has a size of 31457280, at offset 52428800 of file WHATEVER_DVD\VIDEO_TS\VTS_01_2.VOB. Since the string sits 1024 bytes into the fragment, the fragment itself would start at 52428800 - 1024 = 52427776, so the MD5 of f123456.mpg should match the MD5 of the chunk of data from VTS_01_2.VOB between 52427776 and 83885055. If it does match, the file can be safely deleted; if not, a manual check could be performed later, but the majority of files could be deleted without having to open them manually.
I have parts of a working solution, but I'm struggling to get the whole thing to work reliably.
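
A minimal Python sketch of that verification step, as an illustration only: it hashes a byte range of the original file and compares it with the hash of the whole fragment. The start of the range is the hit offset minus the offset at which the search string was taken from the fragment, so a hit at 52428800 for a string taken at offset 1024 gives a start of 52427776. The function and parameter names are hypothetical.

```python
import hashlib


def md5_of_range(path, start, length, chunk_size=1 << 20):
    """MD5 hex digest of `length` bytes of `path` starting at byte `start`.

    Returns None if the file ends before `length` bytes could be read.
    """
    digest = hashlib.md5()
    remaining = length
    with open(path, "rb") as f:
        f.seek(start)
        while remaining > 0:
            data = f.read(min(chunk_size, remaining))
            if not data:
                return None
            digest.update(data)
            remaining -= len(data)
    return digest.hexdigest()


def fragment_is_contained(fragment_path, fragment_size, reference_path,
                          hit_offset, key_offset):
    """Check whether the whole fragment matches a slice of the reference file.

    hit_offset: where the search string was found in the reference file.
    key_offset: where the search string was taken from in the fragment.
    """
    start = hit_offset - key_offset
    if start < 0:
        return False
    whole = md5_of_range(fragment_path, 0, fragment_size)
    part = md5_of_range(reference_path, start, fragment_size)
    return whole is not None and whole == part
```

A fragment that runs past the end of the original file (as with the large Photorec files listed above) yields `None` for the range and is reported as not contained.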


 Post subject: Re: Finding fragment of files (MPG-MTS) inside of valid file
PostPosted: September 6th, 2020, 14:41 
Joined: May 13th, 2019, 7:50
Posts: 221
Location: Nederland
Quote:
Drive was re-formatted. Was originally formatted in Ext4 (in some kind of NAS). But how would it be relevant now ?


Because, depending on the file system and on the accident, one can tell whether a better recovery should be achievable, so that you don't have to bother with raw recovery.

Quote:
I did this recovery some months ago, the original drive has been repurposed since then, so all I have is the recovery


You mean you don't have disk image?

Quote:
The method I'm looking for would be useful in all situations where both a filesystem based recovery and a “raw” recovery are performed on the same storage device.


I think this is default behavior in ReclaiMe and probably other tools as well. If file is detected using both meta data and RAW scan, file is removed from RAW scan.

In general if file system scan delivers good results, don't bother with RAW results.

Quote:
In this particular case, one file recovered by Photorec contains data belonging to several distinct original files


This is why I, and some others too, regard PhotoRec, or carving in general, as a last-resort option. It's the main drawback of carving.

_________________
www.disktuna.com - photo repair service


 Post subject: Re: Finding fragment of files (MPG-MTS) inside of valid file
PostPosted: September 6th, 2020, 15:45 
Joined: November 22nd, 2017, 21:47
Posts: 293
Location: France
Quote:
You mean you don't have disk image?

It was a 4TB drive, with no physical defect, so I did the recovery operations directly. (Even if I had created an image, keeping a 4TB image in addition to the nearly 4TB recovery would have been too cumbersome.)

Quote:
In general if file system scan delivers good results, don't bother with RAW results.

Depending on what happened during the re-formatting, quite a few previously existing file records could have been overwritten, so, to be thorough rather than sorry, I generally extract “extra found files” for common file types and then remove the clutter. But I can understand that this is not a viable use of one's time in a professional environment with many new cases each week.

Quote:
I think this is default behavior in ReclaiMe and probably other tools as well. If file is detected using both meta data and RAW scan, file is removed from RAW scan.

Normally R-Studio displays “extra found files” which were also identified by filesystem analysis as “virtual hard links” (and extracts them as actual hard links if the same file is checked in both virtual directories in the recovery tree), but with MPG / VOB / MTS files this is unreliable. In some situations (at least that was the case with R-Studio 8.7 I was using for this recovery, it may have improved since then) there can be for instance 8 identified VOB files, of slightly decreasing sizes, with all the smaller ones being actually entirely contained within the bigger one (with no indication on that matter), which is silly... and the total size of the VOB virtual directory can therefore be gigantic, far greater than the total capacity of the drive. Photorec behaves better with VOB / MPG, that's why I ran it specifically for that file type in that situation, but it also has trouble extracting those particular files, even when they are not fragmented, whereas carving of other types of video files (AVI, MKV, MP4...) is generally much more reliable (they have a single header / signature, and some have a size field in their header).

But, regardless of how useful or not what I want to achieve is, would you have some clue as to how I could achieve it as intended ?
I found this thread yesterday before posting this request, it seems like a similar kind of task.


 Post subject: Re: Finding fragment of files (MPG-MTS) inside of valid file
PostPosted: September 7th, 2020, 7:42 
Joined: November 22nd, 2017, 21:47
Posts: 293
Location: France
(Follow-up of a short bit of discussion in this other thread.)

A compromise nonetheless might be to use that tool to quickly identify “bad” ASCII search terms within a huge list. For instance, let's say I extract 16 bytes at offset 1024 from all those 42000+ files with this script :
Code:
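$offset = 1024
$length = 16
$outFile = "G:\HGST 4To MPG-VOB-MTS logical search -- 1024 16.txt"
foreach ($file in gci *.mpg, *.vob, *.mts) {
    # Read $length bytes at $offset from the current file
    $buffer = [Byte[]]::new($length)
    $stream = [System.IO.FileStream]::new($file.FullName, 'Open', 'Read')
    try {
        $stream.Position = $offset
        $readSize = $stream.Read($buffer, 0, $length)
    }
    finally {
        $stream.Dispose()
    }
    # Skip files too short to provide a full-length string
    if ($readSize -eq $length) {
        # Convert the bytes to text using the system default encoding
        $ascii = [System.Text.Encoding]::Default.GetString($buffer)
        $name = $file.FullName
        $size = $file.Length
        # One line per file: search string, full path, size
        Add-Content -Path $outFile -Value "$ascii   $name   $size"
    }
}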

Then with Ted Notepad (a very powerful text editor) I can remove the $name and $size fields (they should be useful later), and the idea would be to parse that list of 16-byte strings with bgrep to detect those which contain “problem” bytes, i.e. null bytes (00), line breaks (0D, 0A), any character which doesn't reliably translate to and from its hexadecimal or binary equivalent, or too many identical characters. Then I would somehow set those files aside, move the ones for which the search string looks good, run the string extraction script again with $offset = 2048 on the rest, and so on, until there's a good ASCII search string for every file.
Are there other problematic bytes? And how could I isolate those lines and match them back to the corresponding files?
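
The screening described above could also be done directly on the bytes, before anything is written to a text file, which keeps each search string attached to its file. A minimal Python sketch, with assumptions labelled: the list of problem bytes is the one given above (00, 0A, 0D), the 50% repetition threshold and the list of offsets to try are arbitrary illustrative values, and `ascii_only` additionally rejects bytes 0x80-0xFF, which have no ASCII equivalent.

```python
from collections import Counter

PROBLEM_BYTES = {0x00, 0x0A, 0x0D}  # NUL, LF, CR


def is_usable_key(key, max_repeat_ratio=0.5, ascii_only=False):
    """Return True if `key` has no problem bytes and is varied enough."""
    if not key:
        return False
    if any(b in PROBLEM_BYTES for b in key):
        return False
    if ascii_only and any(b >= 0x80 for b in key):
        return False
    # Reject keys dominated by a single repeated byte value
    most_common_count = Counter(key).most_common(1)[0][1]
    return most_common_count / len(key) <= max_repeat_ratio


def pick_key(path, offsets=(1024, 2048, 4096, 8192), length=16):
    """Return (offset, key) for the first offset giving a usable key, else None."""
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            key = f.read(length)
            if len(key) == length and is_usable_key(key):
                return offset, key
    return None
```

Because `pick_key` returns the offset along with the key, files whose string could only be taken at 2048 or later can still be verified with the correct offset afterwards.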
Also, would it make sense to use 28591 / ISO-8859-1 when extracting the strings, or is it irrelevant at this stage? I mean, does the “Default.GetString” call reliably copy all bytes to the output exactly as they are in the input, or not? And if not, how would I modify the script to use that specific codepage? (Based on the article linked above, it is not included in the default options; based on the script provided by the author I tried a few things, but it didn't work.)
Code:
        $Encoding = [Text.Encoding]::GetEncoding(28591)
        # Fails: this command apparently expects a path and won't accept the previously defined variable $buffer
        $StreamReader = New-Object IO.StreamReader -ArgumentList $buffer, $Encoding

Then, when performing the actual search, would it make sense to select that specific codepage / encoding? Which characters would match in ISO-8859-1 but not in regular ASCII? Or am I completely misunderstanding the whole thing?
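
The difference between the encodings can be checked empirically. The Python sketch below (an illustration, not a fix for the PowerShell script) decodes all 256 byte values and encodes them back: ISO-8859-1 assigns one character to every byte value, ASCII only defines 0x00-0x7F, and most byte sequences containing values of 0x80 and above are not valid UTF-8. The helper `to_grep_hex` is a hypothetical name; it writes bytes in the `\x15\x1A\x6D` notation mentioned earlier, which avoids text encodings altogether.

```python
ALL_BYTES = bytes(range(256))


def round_trips(data, encoding):
    """Return True if decoding then re-encoding `data` gives back the same bytes."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        return False


def to_grep_hex(data):
    """Format bytes as a sequence of backslash-x escapes with uppercase hex digits."""
    return "".join(f"\\x{b:02X}" for b in data)


print(round_trips(ALL_BYTES, "latin-1"))  # ISO-8859-1: one character per byte value
print(round_trips(ALL_BYTES, "ascii"))    # ASCII only defines 0x00-0x7F
print(round_trips(ALL_BYTES, "utf-8"))    # bytes >= 0x80 rarely form valid UTF-8
print(to_grep_hex(b"\x15\x1a\x6d"))
```

This prints `True`, `False`, `False` and `\x15\x1A\x6D`. So the characters that survive in ISO-8859-1 but not in ASCII are exactly those for byte values 0x80 to 0xFF; null bytes and line breaks remain awkward in a text-based search list in either encoding.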


Now, regarding partial MD5 calculation, I know a small CLI tool which works well : dsfo (from the dsfok pack). With this command :
Code:
dsfo "G:\dummy file.ext" 1024 32 $

it calculates the MD5 of the 32-byte chunk of "dummy file.ext" starting at offset 1024. Here is the output for an actual file:
Code:
H:\>dsfo "G:\Cover 20200905.png" 1024 32 $
OK, 32 bytes, 0.000s, MD5 = 47a6148e5e71d111a231b4e1322ae5f9

So I can create a list of dsfo commands and get a list of matching or non-matching MD5 values, then somehow edit the list to keep only the files which do not match.
But would there be a better way to do what I want with PowerShell? That is: once I know that, for instance, file f123456.mpg at offset 1024 matches file WHATEVER_DVD\VIDEO_TS\VTS_01_2.VOB at offset 52428800, the script would have to 1) calculate the MD5 of f123456.mpg (whole file); 2) calculate the MD5 of the chunk of WHATEVER_DVD\VIDEO_TS\VTS_01_2.VOB starting at offset 52428800 - 1024 = 52427776, with a size equal to the size of f123456.mpg; 3) if the checksums match, move f123456.mpg to an “OK” subdirectory, and if not, move it to a “NO” subdirectory.
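
For illustration, the three steps could look like this in Python rather than PowerShell. One substitution should be stated plainly: instead of computing two MD5 values, the sketch compares the fragment with the slice of the original file byte for byte, which gives the same verdict when both files are available locally and stops at the first difference. The function names and the directory arguments are hypothetical.

```python
import os
import shutil


def same_bytes(fragment_path, reference_path, start, chunk_size=1 << 20):
    """Return True if the whole fragment equals the reference data at `start`."""
    with open(fragment_path, "rb") as frag, open(reference_path, "rb") as ref:
        ref.seek(start)
        while True:
            a = frag.read(chunk_size)
            if not a:
                return True  # end of fragment reached without any difference
            if ref.read(len(a)) != a:
                return False


def sort_fragment(fragment_path, reference_path, hit_offset, key_offset,
                  ok_dir, no_dir):
    """Move the fragment to ok_dir if fully contained in the reference, else to no_dir."""
    start = hit_offset - key_offset  # e.g. 52428800 - 1024 = 52427776
    matched = start >= 0 and same_bytes(fragment_path, reference_path, start)
    target_dir = ok_dir if matched else no_dir
    os.makedirs(target_dir, exist_ok=True)
    shutil.move(fragment_path,
                os.path.join(target_dir, os.path.basename(fragment_path)))
    return matched
```

Moving rather than deleting keeps the step reversible, so the “OK” directory can be spot-checked before anything is actually removed.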


 Post subject: Re: Finding fragment of files (MPG-MTS) inside of valid file
PostPosted: September 8th, 2020, 15:17 
Joined: March 7th, 2009, 12:43
Posts: 987
Location: Angel Data Recovery
You did a mess between normal recovery and RAW and want to dig into it.
Step back and do it properly.
1) Scan and find all structured data, map it.
2) Make RAW scan only unmapped area to avoid duplications.
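
In terms of byte ranges, the two steps above amount to subtracting the mapped extents from the whole device and carving only what is left. A minimal Python sketch of that subtraction, as an illustration of the idea only and not of how any particular recovery product implements it; the function name and the (start, end) representation are assumptions.

```python
def unmapped_ranges(mapped, total_size):
    """Return the gaps not covered by `mapped` within [0, total_size).

    `mapped` is an iterable of (start, end) byte ranges, end exclusive;
    ranges may overlap and may be given in any order.
    """
    gaps = []
    cursor = 0  # first byte not yet known to be mapped
    for start, end in sorted(mapped):
        if start > cursor:
            gaps.append((cursor, min(start, total_size)))
        cursor = max(cursor, end)
        if cursor >= total_size:
            break
    if cursor < total_size:
        gaps.append((cursor, total_size))
    return gaps
```

For example, `unmapped_ranges([(0, 100), (50, 200), (300, 400)], 1000)` returns `[(200, 300), (400, 1000)]`.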

_________________
Angel Data Recovery


 Post subject: Re: Finding fragment of files (MPG-MTS) inside of valid file
PostPosted: September 21st, 2020, 16:18 
Joined: November 22nd, 2017, 21:47
Posts: 293
Location: France
Sorry for replying a bit late...

Quote:
You did a mess between normal recovery and RAW and want to dig into it.
Step back and do it properly.
1) Scan and find all structured data, map it.
2) Make RAW scan only unmapped area to avoid duplications.

As I wrote above: “the original drive has been repurposed since then, so all I have is the recovery”, so this is not an option. That's why I wanted to try the method described above. However crude it may look, it should work; I was merely requesting a few technical tips to make it more streamlined.
What kind of software allows “mapping” structured data? I used R-Studio (v. 8.7), which doesn't have such a feature as far as I know (unless the “technician” license provides something like this). R-Studio does the signature search and the filesystem analysis during the same scan, so it should be aware that an MPG / VOB file found by file signature search within an area allocated to an MPG / VOB file identified through the filesystem analysis obviously belongs to that file, instead of listing those hundreds of fragments with no indication of the redundancy. (I read something in the release notes for a more recent version about “overlapping files”, but it was for the “technician” license only.)


 Post subject: Re: Finding fragment of files (MPG-MTS) inside of valid file
PostPosted: September 22nd, 2020, 3:09 
Joined: March 7th, 2009, 12:43
Posts: 987
Location: Angel Data Recovery
UFS Pro and Data extractor.

_________________
Angel Data Recovery

