Find Duplicate Photos and Videos

Last month I received a warning that my Google account had almost no free space left, so I decided to delete all files in Google Photos.

But before that, I downloaded all files via Google Takeout. I wanted to copy to my local backup only the files that have no local copy, for example panorama photos generated by Google Photos. However, I didn't know exactly which files I should copy.

There are ~27k files in total. I processed them in several steps, expecting each step to find more files that I do not need to copy.

Binary Comparison

I can easily find binary identical files by computing the hashsums of all files.
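A minimal sketch of this step in Python, assuming SHA-256 as the hash and two directory trees (the Takeout export and the local backup). The function names and directory layout are made up for illustration; this is not the exact script I used.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_by_hash(root):
    """Map each digest to the list of files under root with that digest."""
    index = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            index[file_sha256(path)].append(path)
    return index


def split_by_local_copy(takeout_root, backup_root):
    """Split Takeout files into (already backed up, still missing)."""
    backup = index_by_hash(backup_root)
    found, missing = [], []
    for digest, paths in index_by_hash(takeout_root).items():
        (found if digest in backup else missing).extend(paths)
    return found, missing
```

Matching on the digest alone means a renamed file in the backup still counts as a local copy.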

Unfortunately, only ~7k files have a binary identical local backup.

Binary Content Comparison

It is possible that two files have slightly different metadata but exactly the same content (i.e. video stream).

Using `ffmpeg -f md5 -` I can compute a hashsum of the content of a file.
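Here is a rough sketch of how that command could be wrapped in Python. The `-i` input argument and `-v error` (to silence the banner) are my additions to the short form above, and the helper names are hypothetical. ffmpeg's md5 muxer prints a single line of the form `MD5=<hex digest>`, which the sketch parses.

```python
import re
import subprocess


def content_md5_command(path):
    """Build an ffmpeg command that prints an MD5 of the decoded streams."""
    # "-f md5" selects ffmpeg's md5 muxer; "-" sends its output to stdout.
    return ["ffmpeg", "-v", "error", "-i", str(path), "-f", "md5", "-"]


def parse_md5(output):
    """Extract the digest from a line like 'MD5=0123...'; None if absent."""
    match = re.search(r"MD5=([0-9a-fA-F]{32})", output)
    return match.group(1).lower() if match else None


def content_md5(path):
    """Run ffmpeg and return the content hash, or None if ffmpeg failed."""
    result = subprocess.run(
        content_md5_command(path), capture_output=True, text=True
    )
    if result.returncode != 0:
        return None
    return parse_md5(result.stdout)
```

Two files whose `content_md5` values match have the same decoded content even if their container metadata differs.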

Unfortunately, only 1 file was found in this step. Quite disappointing!

Of course, I should probably skip the first step, and just use ffmpeg from the beginning.

ImageMagick Comparison

Using `compare -metric mse`, I can compute a "similarity metric" between two files. But before that I need to scale the two input files to the same resolution.
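A sketch of the two ImageMagick calls from Python, under a few assumptions: the ImageMagick 6 command names `convert` and `compare`, output of the form `1563.27 (0.0238539)` on stderr, and the normalized value in parentheses as the number to threshold. The helper names are hypothetical.

```python
import re
import subprocess


def resize_command(src, dst, width, height):
    """Build a convert command that forces src to exactly width x height."""
    # The trailing "!" tells ImageMagick to ignore the aspect ratio.
    return ["convert", str(src), "-resize", f"{width}x{height}!", str(dst)]


def compare_command(a, b):
    """Build a compare command that reports MSE and writes no diff image."""
    return ["compare", "-metric", "MSE", str(a), str(b), "null:"]


def parse_mse(stderr_text):
    """Return the normalized value from output like '1563.27 (0.0238539)'."""
    match = re.search(r"\(([-+0-9.eE]+)\)", stderr_text)
    return float(match.group(1)) if match else None


def mse(a, b):
    """Run compare on two same-sized images and return the normalized MSE."""
    # compare exits with 1 when the images differ, so the exit code alone
    # is not treated as a failure; the metric is printed on stderr.
    result = subprocess.run(compare_command(a, b), capture_output=True, text=True)
    return parse_mse(result.stderr)
```

Resizing both files to a common fixed size first (with `resize_command`) keeps `compare` from failing on mismatched dimensions.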

To identify "visually identical" image files, I need to set a threshold for the MSE metric. I also use other signals like EXIF datetime and filenames. For example, if two files have exactly the same filename and EXIF datetime, I can use a very permissive MSE threshold.
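The idea of combining signals can be sketched as follows. The threshold values are placeholders on a normalized 0..1 MSE scale, not the ones I ended up with, and the three-tier rule and the `Photo` type are simplifications made up for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder thresholds (assumed values, for illustration only).
STRICT, DEFAULT, PERMISSIVE = 0.001, 0.005, 0.02


@dataclass
class Photo:
    filename: str
    exif_datetime: Optional[str]  # None when the file has no EXIF timestamp


def mse_threshold(a, b):
    """Pick an MSE threshold: the more signals agree, the more permissive."""
    same_name = a.filename == b.filename
    same_time = a.exif_datetime is not None and a.exif_datetime == b.exif_datetime
    if same_name and same_time:
        return PERMISSIVE
    if same_name or same_time:
        return DEFAULT
    return STRICT


def visually_identical(a, b, mse):
    """Decide whether two photos count as duplicates for a measured MSE."""
    return mse <= mse_threshold(a, b)
```

For example, an MSE of 0.01 would be accepted for two files that share both filename and EXIF datetime, but rejected for two files that share neither.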

With a few rounds of tuning, eventually I found a good set of thresholds. The end result was pretty good. I found local visually identical files for ~18k files.

These files are not binary identical, because Google Photos would rescale and compress the input files.

In general ImageMagick works pretty well, but there are some issues:

- It cannot handle CR2 files: it depends on `ufraw-batch`, which is no longer available on Debian.

- It cannot handle video files well: it converts video files to something huge, which is quite inefficient.

Manual Comparison

For ~500 files, some potential candidates (local duplicates) were found, but I was not sure whether they were valid. So I created an HTML file that presents all files and their candidates side by side.
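Such a page can be generated with a few lines of Python. This is only a sketch: it assumes the files are images that the browser can display and that their paths work as relative URLs from wherever the HTML file is saved. The function name and table layout are made up.

```python
import html


def render_review_page(candidates):
    """Return an HTML page with one table row per file and its candidates.

    candidates maps a Takeout file path to a list of local candidate paths.
    """
    rows = []
    for takeout_file, local_files in candidates.items():
        cells = [takeout_file, *local_files]
        tds = "".join(
            '<td><img src="{0}" width="400"><br>{0}</td>'.format(
                html.escape(str(cell), quote=True)
            )
            for cell in cells
        )
        rows.append(f"<tr>{tds}</tr>")
    return (
        '<!DOCTYPE html>\n<html><body><table border="1">\n'
        + "\n".join(rows)
        + "\n</table></body></html>\n"
    )
```

Writing the returned string to a file (say `review.html`) and opening it in a browser shows each Takeout file in the first column, with its candidates to the right.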

It proved quite helpful. Most candidates were actual duplicates, and I was able to identify the ones that were not.

Final Step

After all the steps, there are ~500 files left. I just copied all of them to my local backup.

Other Tools

There are other tools I found during the process, but didn't use:
