Find Duplicate Photos and Videos

Last month I received a warning that my Google account was almost out of free storage, so I decided to delete all my files from Google Photos.

But before that, I downloaded everything via Google Takeout. I wanted to copy a file to my local backup only if I didn't already have a local copy of it, for example, panorama photos generated by Google Photos. However, I didn't know exactly which files I needed to copy.

There are ~27k files in total. I processed them in several steps; with each step I expected to find more files that I did not need to copy.

Binary Comparison

I can easily find binary identical files by computing hashsums of all files.
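A minimal sketch of this step in Python, assuming SHA-256 as the hash and two directory trees, one for the Takeout export and one for the local backup. The function names are hypothetical and any strong hash would work just as well.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_by_hash(root):
    """Map each digest to the list of files under `root` that have it."""
    index = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            index[file_sha256(path)].append(path)
    return index


def split_by_backup(takeout_root, backup_root):
    """Split Takeout files into (already backed up, still unmatched)."""
    backup_hashes = set(index_by_hash(backup_root))
    matched, unmatched = [], []
    for digest, paths in index_by_hash(takeout_root).items():
        (matched if digest in backup_hashes else unmatched).extend(paths)
    return matched, unmatched
```

Only the `unmatched` list needs to go on to the later, more expensive steps.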

Unfortunately, only ~7k files have a binary identical local backup.

Binary Content Comparison

It is possible for two files to have slightly different metadata but exactly the same content (i.e. the same video stream).

Using `ffmpeg -f md5 -` I can compute a hashsum of the content of a file.
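As a rough sketch, the command can be wrapped in Python like this. The helper names are hypothetical, and `-loglevel error` is my addition to keep the output quiet. With these options ffmpeg decodes the streams and prints a single `MD5=<hex>` line to stdout, so container metadata does not affect the result.

```python
import subprocess


def content_md5_command(path):
    """Build the ffmpeg command that hashes decoded content, not the container."""
    return ["ffmpeg", "-loglevel", "error", "-i", str(path), "-f", "md5", "-"]


def parse_md5_output(stdout):
    """Extract the digest from ffmpeg's `MD5=<hex>` output line."""
    for line in stdout.splitlines():
        if line.startswith("MD5="):
            return line[len("MD5="):].strip()
    raise ValueError("no MD5= line in ffmpeg output")


def content_md5(path):
    """Run ffmpeg (must be on PATH) and return the MD5 of the decoded streams."""
    result = subprocess.run(
        content_md5_command(path), capture_output=True, text=True, check=True
    )
    return parse_md5_output(result.stdout)
```

Two files with the same `content_md5` value can then be treated as duplicates in the same way as files with the same binary hashsum.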

Unfortunately, this step found only 1 more file. Quite disappointing!

In hindsight, I could probably have skipped the first step and just used ffmpeg from the beginning.

ImageMagick Comparison

Using `compare -metric mse`, I can compute a "similarity metric" between two files. But before that I need to scale the two input files to the same resolution.
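A sketch of how the scale-then-compare step could look when driven from Python. The 256x256 working size, the temporary PNG files, and the helper names are assumptions for illustration, not necessarily what I ran. `compare` prints the metric to stderr in the form `1105.51 (0.016869)`, where the value in parentheses is normalized to the 0..1 range, and it exits with 1 (not an error) when the images differ.

```python
import re
import subprocess
import tempfile
from pathlib import Path


def parse_mse(stderr):
    """Return the normalized MSE from output such as '1105.51 (0.016869)'."""
    match = re.search(r"\(([-+0-9.eE]+)\)", stderr)
    if match is None:
        raise ValueError(f"unexpected compare output: {stderr!r}")
    return float(match.group(1))


def normalized_mse(path_a, path_b, size="256x256!"):
    """Scale both images to the same size, then compare them with ImageMagick."""
    with tempfile.TemporaryDirectory() as tmp:
        scaled = []
        for i, src in enumerate((path_a, path_b)):
            dst = Path(tmp) / f"{i}.png"
            # The trailing '!' forces the exact size, ignoring aspect ratio.
            subprocess.run(
                ["convert", str(src), "-resize", size, str(dst)], check=True
            )
            scaled.append(str(dst))
        result = subprocess.run(
            ["compare", "-metric", "MSE", *scaled, "null:"],
            capture_output=True,
            text=True,
        )
        # Exit code 0 = similar, 1 = dissimilar, anything else = real error.
        if result.returncode not in (0, 1):
            raise RuntimeError(result.stderr)
        return parse_mse(result.stderr)
```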

To identify "visually identical" image files, I need to set a threshold for the MSE metric. I also use other signals like EXIF datetime and filenames. For example, if two files have exactly the same filename and EXIF datetime, I can use a very permissive MSE threshold.
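The idea of picking a threshold based on the other signals can be sketched as follows. The three threshold values and the `Pair` type are made up for illustration; the real numbers came out of several rounds of tuning and are not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical thresholds on the normalized (0..1) MSE.
STRICT = 0.001      # nothing else matches: require near-identical pixels
MEDIUM = 0.005      # one of filename / EXIF datetime matches
PERMISSIVE = 0.02   # both filename and EXIF datetime match


@dataclass(frozen=True)
class Pair:
    """One Takeout file compared against one local candidate."""
    mse: float
    same_filename: bool
    same_exif_datetime: bool


def mse_threshold(pair):
    """The more metadata signals agree, the more MSE we tolerate."""
    if pair.same_filename and pair.same_exif_datetime:
        return PERMISSIVE
    if pair.same_filename or pair.same_exif_datetime:
        return MEDIUM
    return STRICT


def is_visual_duplicate(pair):
    """Decide whether the pair counts as visually identical."""
    return pair.mse <= mse_threshold(pair)
```

For example, `is_visual_duplicate(Pair(0.01, True, True))` accepts the pair, while the same MSE with no matching metadata is rejected.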

After a few rounds of tuning, I eventually found a good set of thresholds. The end result was pretty good: I found visually identical local copies for ~18k files.

These files are not binary identical, because Google Photos would rescale and compress the input files.

In general ImageMagick works pretty well, but there are some issues:
- It cannot handle CR2 files: it depends on `ufraw-batch`, which is no longer available on Debian.
- It does not handle video files well: it converts them to something huge, which is quite inefficient.

Manual Comparison

For ~500 files, some potential candidates (local duplicates) were found, but I wasn't sure whether they were valid. So I created an HTML file that presents each file and its candidates side by side.

It proved quite helpful. Most candidates were actual duplicates, and I was able to identify the ones that were not.
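Such a review page takes only a few lines to generate. This is a sketch with a hypothetical `render_review_page` helper and a plain table layout, assuming the paths are relative to wherever the HTML file is saved.

```python
import html
from urllib.parse import quote


def render_review_page(candidates):
    """Render one table row per Takeout file: the file first, then its candidates.

    `candidates` maps a Takeout file path to a list of local candidate paths.
    """
    rows = []
    for takeout_file, local_files in candidates.items():
        cells = []
        for path in [takeout_file, *local_files]:
            src = html.escape(quote(str(path)), quote=True)  # safe URL for <img>
            label = html.escape(str(path))                   # visible caption
            cells.append(f'<td><img src="{src}" width="320"><br>{label}</td>')
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return (
        "<!DOCTYPE html>\n<html><body><table>\n"
        + "\n".join(rows)
        + "\n</table></body></html>\n"
    )
```

Writing the returned string to a file and opening it in a browser is enough to eyeball each row and note which candidates are real duplicates.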

Final Step

After all the steps, there are ~500 files left. I just copied all of them to my local backup.

Other Tools

There are other tools I found during the process, but didn't use:
