Four strategies to resolve duplicate music files
October 22, 2013 in digital music by Dan Gravell
As you acquire music for your computer audio collection from multiple sources, it's inevitable that inconsistencies creep in and mistakes are made. One type of mistake, some might argue, is collecting duplicate music files.
On the other hand, some might say duplicate tracks don't matter. If you are sensitive to the extra storage space expended and the additional "noise" in your music library, what can you do?
There's a brute-force but enjoyable solution: listen to all of your music! But this isn't practical for a library of any non-trivial size. Here are four approaches to finding duplicates.
1. Use checksums
A checksum is a piece of data that is calculated from source data. It is normally much smaller than the source data. There are many pieces of software available to generate checksums. Importantly, two different pieces of data can produce the same checksum (a clash), but this is extremely rare; how rare depends on the algorithm.
In theory, you could run a checksum generator against every file in your music library, then compare the checksums to see which are equal, before reviewing the actual files and then deleting if they are duplicates.
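To make that concrete, here's a minimal sketch of the checksum approach in Python. It assumes SHA-256 as the checksum and a small, made-up list of file extensions; the function names are only for illustration. It only lists candidates for review, it doesn't delete anything.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumption: adjust this set to the formats in your own library.
AUDIO_EXTENSIONS = {".mp3", ".flac", ".m4a", ".ogg"}


def file_sha256(path, chunk_size=1 << 20):
    """Checksum a whole file, reading it in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical_files(root):
    """Return lists of files under root whose checksums are equal."""
    by_checksum = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
            by_checksum[file_sha256(path)].append(path)
    return [paths for paths in by_checksum.values() if len(paths) > 1]
```

Calling `find_identical_files("/path/to/music")` returns groups of files that share a checksum. Note that the checksum covers the entire file, tags and all, not just the audio.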
But there are likely show stoppers to this approach, over and above the possibility of a checksum clash. The main problem is false negatives. For the checksum to produce the same result, the files must be exactly the same, byte for byte. This is unlikely, given slightly different silences before and after tracks, different file formats and so on.
However, even if you could isolate the same audio data in the same format, you also have to consider the rest of the music file, comprising metadata and more. All this must also be exactly the same.
In reality, checksumming is unlikely to work for detecting duplicate tracks in a music library.
2. Analyse your metadata
This is more likely to work. By comparing your music files' internal metadata, including data such as the track name, position and containing release, you can quickly spot duplicates.
Most tag editors allow you to sort your tracks by track name, release name and so on. By first organising by artist name, then track name, you are likely to be able to quickly list all duplicates together. They may have slightly different names, but it should be a simple process of checking they are duplicates and then deleting one of them.
Watch out for special versions of the tracks which you may want to keep. For example: live versions, acoustic versions and the like. You can normally see if a track is one of these by reviewing the containing release name.
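If you'd rather script the sorting than do it by eye, the same idea can be sketched in a few lines of Python. This assumes you have already read the tags into plain dictionaries with hypothetical `artist`, `title`, `release` and `path` keys (how you read the tags is up to your tag library of choice). Names are lower-cased and stripped of punctuation so that slightly different spellings still group together.

```python
import re
from collections import defaultdict


def normalise(text):
    """Lower-case, drop punctuation and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def group_by_artist_and_title(tracks):
    """Group track dictionaries by normalised (artist, title).

    Only groups with more than one member are returned, sorted by
    artist then title, ready for manual review.
    """
    groups = defaultdict(list)
    for track in tracks:
        key = (normalise(track["artist"]), normalise(track["title"]))
        groups[key].append(track)
    return {key: items for key, items in sorted(groups.items()) if len(items) > 1}
```

Each group keeps the full track dictionaries, so you can still check the release name of each candidate before deciding whether it's a live or acoustic version worth keeping.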
Other tags exist that may provide an even faster result. You can try looking for MusicBrainz track, recording or work IDs, which are globally unique identifiers for a particular track, recording or work respectively.
A duplicate track ID is almost certainly a candidate for deletion, because it identifies a given song on a given release. Duplicate recording IDs can often be treated likewise, with the additional possibility that the recording has been used on other releases too. It's MusicBrainz works that you have to be careful with; these identify a given composition, so deleting all tracks with the same work ID may lead you to delete different versions of the same song, which is probably not something you wanted.
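Here's a rough sketch of how that check might look in Python. As before, it assumes the tags are already loaded into dictionaries; the `mb_track_id` and `mb_recording_id` keys are hypothetical names, since the real tag names depend on the file format and the tagger that wrote them.

```python
from collections import defaultdict


def duplicates_by_id(tracks, id_field):
    """Map each identifier found in id_field to the paths that share it.

    Tracks with no value for id_field are skipped, and only identifiers
    that appear on more than one file are returned.
    """
    groups = defaultdict(list)
    for track in tracks:
        identifier = track.get(id_field)
        if identifier:
            groups[identifier].append(track["path"])
    return {ident: paths for ident, paths in groups.items() if len(paths) > 1}
```

Running it with `"mb_track_id"` first gives the safest candidates; running it with `"mb_recording_id"` casts a wider net that needs more review. For the reasons above, it's probably best not to run it against work IDs with deletion in mind.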
Using metadata is a good approach, but it has an obvious drawback... it requires metadata! What if you have no metadata?
3. Get fingerprinting
Audio fingerprinting is a method of producing a piece of data by analysing the actual audio data in a music file. This means there's no requirement for metadata, plus only the audio data is analysed, unlike with checksums, so the likelihood of finding a duplicate is higher.
bliss uses a fingerprinting service called Acoustid (more accurately, the fingerprinter is called Chromaprint). Lukáš Lalinský, the founder of the Acoustid project, wrote about comparing Chromaprint fingerprints, with an example image showing where two fingerprints differ. Basically, the less white the better.
The downside of fingerprinting is that it requires some work to get it to reliably identify duplicates. Fingerprinting includes periods of lead-in and lead-out silence in its calculation, so where silence has been added to otherwise duplicate tracks, a naive comparison will fail to match. More work is required to "align" the fingerprints to see if a match can be found.
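As a rough illustration of what "aligning" might involve, here's a sketch in Python. It assumes each fingerprint has already been decoded into a list of 32-bit integers (which is how raw Chromaprint fingerprints are represented), slides one against the other, and keeps the offset with the fewest differing bits. The function names and the `max_offset` and `min_overlap` defaults are arbitrary assumptions, and this is not necessarily how Acoustid or bliss does it.

```python
def overlap(fp_a, fp_b, offset):
    """Pair up items of fp_a and fp_b with fp_b shifted by offset items."""
    if offset >= 0:
        return list(zip(fp_a[offset:], fp_b))
    return list(zip(fp_a, fp_b[-offset:]))


def bit_error_rate(pairs):
    """Fraction of bits that differ across the paired 32-bit items."""
    differing = sum(bin((a ^ b) & 0xFFFFFFFF).count("1") for a, b in pairs)
    return differing / (32 * len(pairs))


def best_alignment(fp_a, fp_b, max_offset=80, min_overlap=10):
    """Try every offset in range and return (error_rate, offset) for the best.

    Offsets leaving fewer than min_overlap paired items are ignored, since
    a tiny overlap can match by chance. Returns None if no offset qualifies.
    """
    best = None
    for offset in range(-max_offset, max_offset + 1):
        pairs = overlap(fp_a, fp_b, offset)
        if len(pairs) < min_overlap:
            continue
        candidate = (bit_error_rate(pairs), offset)
        if best is None or candidate < best:
            best = candidate
    return best
```

An error rate near zero at some offset suggests the same audio with extra lead-in or lead-out; unrelated tracks should hover around 0.5. Where to draw the line between the two is something you'd need to tune against your own library.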
4. Use an app
There are applications out there which claim to find duplicates for you. In reality, they use combinations of the above methods (apart from checksumming, which would be madness) but make it easier for you by providing a nice GUI, automated actions to delete duplicates and so on.
Take a look at TuneUp for iTunes users and Jaikoz for others.
Providing a means to identify duplicates is a long-standing and now much-supported suggestion on the bliss ideas forum. If you have any comments on this idea, or simply want to vote for the suggestion, you should drop by!
Thanks to JD Hancock for the image above.