Last month I received a warning that my Google account was almost out of free storage, so I decided to delete all files in Google Photos.
But before that, I downloaded all files via Google Takeout. I decided to copy a file to my local backup if I didn't already have a local copy of it, for example, panorama photos generated by Google Photos. However, I didn't know exactly which files I should copy.
There are ~27k files in total. I processed them in several steps; with each step, I expected to rule out more files that I did not need to copy.
Binary Comparison
I can easily find binary-identical files by computing hashsums of all files.
Unfortunately, only ~7k files have a binary-identical local backup.
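A minimal sketch of this step, assuming the Takeout export lives in `takeout/` and the local backup in `backup/` (both paths are hypothetical):

```sh
# Hash both trees, then join on the hash column: every match is a
# Takeout file that has a binary-identical local copy.
find takeout -type f -print0 | xargs -0 sha256sum | sort > takeout.sums
find backup -type f -print0 | xargs -0 sha256sum | sort > backup.sums
join -j 1 takeout.sums backup.sums
```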
Binary Content Comparison
It is possible that two files have slightly different metadata but exactly the same content (e.g. the same video stream).
Using `ffmpeg -f md5 -`, I can compute a hashsum of a file's content.
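For example, a sketch of one such invocation; `-map 0` and `-c copy` are my additions (an assumption about the intended comparison) so that ffmpeg hashes the encoded stream packets rather than decoded frames, which is much faster:

```sh
# Print the MD5 of all streams of a file to stdout; container-level
# metadata does not contribute to the hash.
ffmpeg -v error -i input.mp4 -map 0 -c copy -f md5 -
```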
Unfortunately, only one more file was found in this step. Quite disappointing!
Of course, in hindsight I could have skipped the first step and just used ffmpeg from the beginning.
ImageMagick Comparison
Using `compare -metric mse`, I can compute a "similarity metric" between two files. But before that, I need to scale both input files to the same resolution.
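A minimal sketch of one comparison, assuming a 512x512 working size and temporary PNG files (both choices are mine, not from the original workflow):

```sh
# Scale both images to a fixed size ("!" ignores aspect ratio), then
# compare; the MSE is written to stderr, and "null:" drops the diff image.
size='512x512!'
convert takeout.jpg -resize "$size" /tmp/a.png
convert backup.jpg -resize "$size" /tmp/b.png
mse=$(compare -metric mse /tmp/a.png /tmp/b.png null: 2>&1)
echo "MSE: $mse"
```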
To identify "visually identical" image files, I need to set a threshold on the MSE metric. I also use other signals like EXIF datetime and filenames. For example, if two files have exactly the same filename and EXIF datetime, I can use a very permissive MSE threshold.
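The decision logic might look roughly like this; the threshold values and helper variables are invented for illustration:

```sh
# $mse holds the normalized MSE from compare (the value in parentheses);
# $same_name / $same_exif are 1 when the filenames / EXIF datetimes match.
threshold=0.001
if [ "$same_name" = 1 ] && [ "$same_exif" = 1 ]; then
  threshold=0.01   # permissive when the other signals agree
fi
if [ "$(echo "$mse < $threshold" | bc -l)" = 1 ]; then
  echo "visually identical"
fi
```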
After a few rounds of tuning, I eventually found a good set of thresholds. The end result was pretty good: I found visually identical local files for ~18k files.
These files are not binary-identical, because Google Photos rescales and compresses uploaded files.
In general, ImageMagick works pretty well, but there are some issues:
- It cannot handle CR2 files: it depends on `ufraw-batch`, which is no longer available in Debian.
- It does not handle video files well: it first converts them to something huge, which is quite inefficient.
Manual Comparison
For about 500 files, potential candidates (as local duplicates) were found, but I was not sure whether they were valid. So I created an HTML file that presents each file and its candidates side by side.
It proved quite helpful: most candidates were actual duplicates, and I was able to identify the rest.
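A minimal sketch of generating such a page, assuming `pairs.txt` lists `takeout-path candidate-path` pairs one per line, with no whitespace in the paths (the file name and format are hypothetical):

```sh
{
  echo '<!DOCTYPE html><html><body>'
  # One row per pair: the Takeout file next to its local candidate.
  while read -r takeout candidate; do
    printf '<div><img src="%s" height="300"> <img src="%s" height="300"></div>\n' \
      "$takeout" "$candidate"
  done < pairs.txt
  echo '</body></html>'
} > review.html
```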
Final Step
After all these steps, there were ~500 files left. I just copied all of them to my local backup.
Other Tools
There are other tools I found during the process, but didn't use: