Need help to verify duplicate detection

  • vagskal
    Senior Member
    • Oct 2008
    • 643

    #16
    Below are the results on 7.5.2 latest build and the first part of the server.log where the plugin seemed to have an issue:
    Code:
    Detecting using (number of bytes): 1000000
    Detected: 115777
    Checksum duplicates: 975 Show checksum duplicates  (Incorrect duplicates: 52) 
    Duplicates: 923 Show duplicates 
    
    Detecting using (number of bytes): 500000
    Detected: 115777
    Checksum duplicates: 975 Show checksum duplicates  (Incorrect duplicates: 52) 
    Duplicates: 923 Show duplicates 
    
    Detecting using (number of bytes): 250000
    Detected: 115777
    Checksum duplicates: 1001 Show checksum duplicates  (Incorrect duplicates: 78) 
    Duplicates: 923 Show duplicates 
    
    
    
    
    [10-09-05 09:54:27.4051] main::init (323) Starting Squeezebox Server (v7.5.2, r31264, Sat Aug 28 02:06:44 PDT 2010) perl 5.010000
    [10-09-05 09:54:36.6707] Slim::Utils::Strings::parseStrings (351) Error: Parsing line 1: # Max Spicer, May 2007
    [10-09-05 09:54:47.9365] Slim::Utils::Misc::msg (1165) Warning: [09:54:47.9362] "my" variable $dbh masks earlier declaration in same scope at C:\Documents and Settings\All Users\Application Data\Squeezebox\Cache\InstalledPlugins/Plugins/DuplicateDetector/Plugin.pm line 319.
    [10-09-05 09:54:47.9368] Slim::Utils::Misc::msg (1165) Warning: [09:54:47.9366] "my" variable $sth masks earlier declaration in same scope at C:\Documents and Settings\All Users\Application Data\Squeezebox\Cache\InstalledPlugins/Plugins/DuplicateDetector/Plugin.pm line 320.
    [10-09-05 09:54:53.5301] Slim::Schema::Storage::throw_exception (82) Error: DBI Exception: DBD::mysql::db do failed: Unknown column 'audiosize' in 'field list'
    [10-09-05 09:54:53.5305] Slim::Schema::Storage::throw_exception (82) Backtrace:
    
       frame 0: Slim::Utils::Log::logBacktrace (/<C:\Program\SQUEEZ~1\server\SQUEEZ~3.EXE>Slim/Schema/Storage.pm line 82)
       frame 1: Slim::Schema::Storage::throw_exception (/<C:\Program\SQUEEZ~1\server\SQUEEZ~3.EXE>DBIx/Class/Storage/DBI.pm line 957)
       frame 2: DBIx::Class::Storage::DBI::__ANON__ (C:\Documents and Settings\All Users\Application Data\Squeezebox\Cache\InstalledPlugins/Plugins/DuplicateDetector/Plugin.pm line 104)
       frame 3: (eval) (C:\Documents and Settings\All Users\Application Data\Squeezebox\Cache\InstalledPlugins/Plugins/DuplicateDetector/Plugin.pm line 104)
       frame 4: Plugins::DuplicateDetector::Plugin::initDatabase (C:\Documents and Settings\All Users\Application Data\Squeezebox\Cache\InstalledPlugins/Plugins/DuplicateDetector/Plugin.pm line 81)
       frame 5: Plugins::DuplicateDetector::Plugin::initPlugin (/<C:\Program\SQUEEZ~1\server\SQUEEZ~3.EXE>Slim/Utils/PluginManager.pm line 328)
       frame 6: (eval) (/<C:\Program\SQUEEZ~1\server\SQUEEZ~3.EXE>Slim/Utils/PluginManager.pm line 328)
       frame 7: Slim::Utils::PluginManager::load (slimserver.pl line 507)
       frame 8: main::init (slimserver.pl line 578)
       frame 9: main::main (slimserver.pl line 99)
       frame 10: PerlSvc::Interactive (/<C:\Program\SQUEEZ~1\server\SQUEEZ~3.EXE>PerlSvc.pm line 99)
       frame 11: PerlSvc::_interactive (slimserver.pl line 0)
       frame 12: (eval) (slimserver.pl line 0)
    [10-09-05 09:54:53.8575] Plugins::DuplicateDetector::Plugin::initDatabase (111) Duplicate Detector: Creating database tables
    [10-09-05 09:54:53.9546] Plugins::DuplicateDetector::Plugin::createIndex (137) No smdidIndex index found in duplicatedetector_tracks, creating index...
    Let me know if you want me to try with an even higher setting or if you would like to see the duplicates lists or the entire server.log.
    2 x SB3 (wired), Receiver (wired), Boom (wireless), Controller, iPeng on iPhone 4 & iPad, muso on remote computer running Win 7 64-bit | 7.7.3 on Win XP


    • erland
      Senior Member
      • Jan 2006
      • 11323

      #17
      Originally posted by vagskal
      Below are the results on 7.5.2 latest build and the first part of the server.log where the plugin seemed to have an issue:
      You can ignore the SQL exception in the server.log; I didn't find any way to hide it. It's harmless, and it's only output once for users who had the previous version of the plugin installed.

      Originally posted by vagskal
      Let me know if you want me to try with an even higher setting or if you would like to see the duplicates lists or the entire server.log.
      It's the "Incorrect duplicates" lists that I'm mostly interested in. Could you please:
      - Post the incorrectduplicates.txt file for one of the executions; it doesn't matter which one.
      - See if there is anything special about those tracks that could cause incorrect duplicates, for example a lot of silence at the beginning or something similar.

      If one or several of the rows in the incorrectduplicates.txt file start with "NOCHECKSUM-", that indicates that no checksum calculation could be performed for those files. In that case it's very interesting to know the file format of those files and to verify that they can be played through SBS. I've seen issues like this with m4a files from another user.
      Erland Lindmark (My homepage)
      Developer of many plugins/applets
      Starting with LMS 8.0 I no longer support my plugins/applets (see here for more information )


      • MrSinatra

        #18
        hey Erland,

        as always, you impress. questions for you on this, although i can see you're still beta testing it:

        what is defined as a duplicate? same title, artist, etc? some kind of audio fingerprint? what if bitrates or formats are different of the same song? what if the source is different, like a remastered cd?

        will you be able to checkmark dupes you want to delete?


        • erland
          Senior Member
          • Jan 2006
          • 11323

          #19
          Originally posted by MrSinatra
          what is defined as a duplicate? same title, artist, etc? some kind of audio fingerprint? what if bitrates or formats are different of the same song? what if the source is different, like a remastered cd?
           It needs to be the exact same rip, as it checks that the compressed audio section of the file is byte-identical. So a remastered CD, a different format, or a different bitrate will not be detected as a duplicate.

          Originally posted by MrSinatra
          will you be able to checkmark dupes you want to delete?
          No, for three reasons:
          1. SBS might not have write access to the file system where the music files are.
          2. I don't want to make it easy for users to accidentally delete their music files.
          3. The intention of this plugin is to verify the algorithm that's used to identify a specific music file even if it has been re-tagged or renamed/moved. The algorithm will later be used to connect metadata and statistics to a specific music file and make sure that relation survives a rename/move or retagging of the file.

          It is possible to export all duplicates to a text file.
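For readers curious how such an identity can work in practice, here is a minimal Python sketch (the plugin itself is written in Perl; the function name, parameters, and defaults here are illustrative assumptions, not the plugin's API). It hashes the first N bytes of the compressed audio section, skipping the tag header, and appends the audio size in hex, matching the `checksum-size` identities that appear in the result listings later in this thread:

```python
import hashlib

def audio_identity(path, audio_offset, audio_size, num_bytes=10000):
    # Hash only the compressed audio section, skipping any tag/metadata
    # header before it, so retagging the file does not change the identity.
    with open(path, "rb") as f:
        f.seek(audio_offset)
        data = f.read(min(num_bytes, audio_size))
    # Combine the checksum with the audio size (hex) so two files whose
    # first bytes match but whose lengths differ still get distinct identities.
    return "%s-%08x" % (hashlib.md5(data).hexdigest(), audio_size)
```

Because only the audio section is read, moving or retagging the file leaves the identity unchanged, which is exactly the property the plugin is trying to verify.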
          Erland Lindmark (My homepage)
          Developer of many plugins/applets
          Starting with LMS 8.0 I no longer support my plugins/applets (see here for more information )


          • Andy Grundman
            Former Squeezebox Guy
            • Jan 2006
            • 7395

            #20

            On Sep 5, 2010, at 11:36 PM, erland wrote:

            >
            > MrSinatra;574572 Wrote:
            >>
            >> what is defined as a duplicate? same title, artist, etc? some kind of
            >> audio fingerprint? what if bitrates or formats are different of the
            >> same song? what if the source is different, like a remastered cd?
            >>

            > It needs to be the exact same rip as it checks that the compressed
            > audio bytes section of the file is the same. So remastered cd,
            > different formats, bitrates will not be detected as duplicates.


            Right, this is not doing audio fingerprinting, just a checksum. Fingerprinting is hard and slow, as you have to decode every type of audio format to PCM and then process the raw audio data.


            • audiomuze
              Senior Member
              • Oct 2009
              • 1427

              #21
              Hi Erland

              Would it not be possible to limit the md5 hash to say 1000 bytes or something similarly small if you read forward from the midpoint of the audio portion of the file, regardless of file format?
              puddletag - now packaged in most Linux distributions.


              • erland
                Senior Member
                • Jan 2006
                • 11323

                #22
                Originally posted by audiomuze
                Would it not be possible to limit the md5 hash to say 1000 bytes or something similarly small if you read forward from the midpoint of the audio portion of the file, regardless of file format?
                 Yes, possibly. Andy plans to try whether that works better, so we will know as soon as he has implemented a new version of the Audio::Scan module that supports this, which people can then try.

                 A possible issue is that we are talking about compressed data, which means the compression algorithm might cause problems. I don't have any detailed knowledge about this, but I suspect the real data might be stored at the beginning of the file, while the later part of the file might just be instructions about where to insert the different data sections when uncompressing. If you know about compression algorithms, you know that most of them try to store a common data section once and keep a list of all occurrences of that section in the uncompressed file. Of course, the list of pointers might be as good as a real data section from a checksum perspective.

                 The 0.2 version combines the MD5 checksum with the number of compressed audio bytes in the file, which made it a lot better than the previous approach that used only the MD5 checksum. In the results, "Duplicates" shows the files that have both the same checksum and the same number of compressed audio bytes. "Incorrect duplicates" is the list of files that have the same checksum but not the same number of compressed audio bytes.
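As a rough illustration of that grouping (a Python sketch with hypothetical names, not the plugin's Perl code): files are first bucketed by checksum; buckets whose members also agree on audio size are real duplicates, while mixed-size buckets correspond to the "incorrect duplicates" list:

```python
from collections import defaultdict

def classify(tracks):
    # tracks: list of (path, md5, audio_size) tuples.
    by_md5 = defaultdict(list)
    for track in tracks:
        by_md5[track[1]].append(track)
    duplicates, incorrect = [], []
    for group in by_md5.values():
        if len(group) < 2:
            continue  # unique checksum, not a duplicate candidate
        if len({t[2] for t in group}) == 1:
            duplicates.append(group)   # same checksum AND same audio size
        else:
            incorrect.append(group)    # same checksum, different sizes
    return duplicates, incorrect
```

A checksum group with mixed sizes is simplified here to "incorrect" as a whole; the real plugin may report it with finer granularity.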

                 Since the intention is to use this later on to connect manually entered metadata/statistics to individual music files, it really needs to be as close to 100% as possible.
                Erland Lindmark (My homepage)
                Developer of many plugins/applets
                Starting with LMS 8.0 I no longer support my plugins/applets (see here for more information )


                • MrSinatra

                  #23
                  thx for the info...

                  so if i understand what you guys are saying, the intention is to devise a system where a file can be uniquely identified to SBS and continuously tracked even if it were to be moved around, retagged, etc, correct?

                  and to stress test this system, you are seeing how best to identify duplicates, so that later you can be confident that SBS won't misidentify one file as being another file that in reality, it really isn't... yes?

                  once confident that SBS can uniquely identify a given file, SBS can store stats and other data about the file that don't have to be in the file or its tag. right?

                  i'm all for that, but i just hope there will be some kind of "Cache clearing" method to be able to tell SBS to forget the info collected if the user desires, and/or a way to import/export it to other installs.

                  neat work!


                  • erland
                    Senior Member
                    • Jan 2006
                    • 11323

                    #24
                    Originally posted by MrSinatra
                    so if i understand what you guys are saying, the intention is to devise a system where a file can be uniquely identified to SBS and continuously tracked even if it were to be moved around, retagged, etc, correct?
                    Yes

                    Originally posted by MrSinatra
                    and to stress test this system, you are seeing how best to identify duplicates, so that later you can be confident that SBS won't misidentify one file as being another file that in reality, it really isn't... yes?
                     Yes, the purpose is to get help testing it with all possible variations of encodings/file formats and to make sure it's unique enough in a large library so that two songs don't get the same identity.

                    Originally posted by MrSinatra
                    once confident that SBS can uniquely identify a given file, SBS can store stats and other data about the file that don't have to be in the file or its tag. right?
                    Yes, I'm not sure standard SBS is going to do it but there are plans to use it in some third party stuff.

                    Originally posted by MrSinatra
                    i'm all for that, but i just hope there will be some kind of "Cache clearing" method to be able to tell SBS to forget the info collected if the user desires, and/or a way to import/export it to other installs.
                     The reason to have it as a third party add-on is that you don't need to use it if you don't like it. Assuming it's fast enough, it should be safe to recalculate the identity each time a track is scanned. That way SBS can keep deleting its database contents during a full rescan, while we keep additional tables with persistent metadata/statistics that can be reconnected to the correct tracks during or after the scanning.

                    If it isn't fast enough, we need to implement some caching mechanism or optimize the performance in some other way but let's handle that after we know it's needed. At the moment the focus is to make sure the identification process is good enough.

                     Export/import possibilities for manually entered metadata/statistics are always important, but the main reason for them is to make it possible to take a backup or to export the data for use in some other application. In addition, it of course also has to be possible to clear all metadata/statistics if you want to start over from the beginning.
                    Erland Lindmark (My homepage)
                    Developer of many plugins/applets
                    Starting with LMS 8.0 I no longer support my plugins/applets (see here for more information )


                    • Phil Meyer
                      Senior Member
                      • Apr 2005
                      • 5610

                      #25

                      >1. Goto "Extras/Duplicate Detector" in SBS web interface and start
                      >detection. You need to hit the "Refresh" link to see the current
                      >progress.
                      >

                       Running this now - will confirm how long it takes to complete against a library of 35871 tracks when it's finished.

                       I note, though, that SBS reports I have 35852 songs, but Duplicate Detector says it is checking 35871 songs. Could this be because it is picking up non-track records (e.g. playlist names, URLs to other resources, etc.)?

                      >2. The default is to detect using the first 10000 audio bytes.
                      >

                       Not sure that it is going to be safe to take a subset of audio bytes (at least not without looking at other audio attributes too). There could be different edited versions of songs, e.g. where the first part of the song is identical but one version is a bit longer. Perhaps the checksum also needs to include the audio length, or take n bytes at the start AND n bytes at the end for the checksum calculation.
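The head-and-tail variant suggested here could look roughly like the following Python sketch (purely illustrative; nothing like this is implemented by the plugin, and the function name and parameters are invented for the example). It hashes n bytes from the start and, without overlap, up to n bytes from the end of the audio section, so a version that only differs near the end still hashes differently:

```python
import hashlib

def head_tail_md5(path, audio_offset, audio_size, n=10000):
    h = hashlib.md5()
    with open(path, "rb") as f:
        f.seek(audio_offset)
        h.update(f.read(min(n, audio_size)))   # first n audio bytes
        tail = max(audio_size - n, 0)          # bytes not covered by the head
        tail_len = min(n, tail)
        if tail_len:
            f.seek(audio_offset + audio_size - tail_len)
            h.update(f.read(tail_len))         # last bytes, no overlap with head
    return h.hexdigest()
```

Two files with identical first 10,000 audio bytes but a different ending would collide under a head-only checksum, yet produce different hashes here.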

                      I can think of one example, where I have an album in two formats - the original where songs are distinct tracks, and an enhanced version where songs are cross-faded.

                      What about support for cue sheets? i.e. a single audio file that is chopped up into songs via a cue sheet - each song references the same audio file, but with a different sub-set of the data. Does the duplicate checker take account of this; treating each track as if it were a separate file chopped into segments?

                      So far, it has detected 113 checksum duplicates (all incorrect duplicates) out of 6856 songs (Duplicates: 0).

                      I looked at the results, and there are groups that I assume it believes are duplicates, such as:

                      64b80a25505a34d0c723dce617ced261-00d02343 M:\Music\Phil's Music\Blues\Canned Heat\Uncanned\01 - On The Road Again (Alternate Take).mp3
                      64b80a25505a34d0c723dce617ced261-0079bc14 M:\Music\Phil's Music\Blues\Canned Heat\Uncanned\02 - Nine Below Zero.mp3
                      [...in this case, all 45 songs on the album are listed...]
                      64b80a25505a34d0c723dce617ced261-001930fa M:\Music\Phil's Music\Blues\Canned Heat\Uncanned\45 - Huh.mp3

                      The first part up to the "-" is always the same, the second part of the number is different. What does this mean?

                      Phil


                      • erland
                        Senior Member
                        • Jan 2006
                        • 11323

                        #26
                        Originally posted by Philip Meyer
                        The first part up to the "-" is always the same, the second part of the number is different. What does this mean?
                         The checksum (first part) is the same while the audio length (last part) isn't; combining them makes the identity unique.
                        Erland Lindmark (My homepage)
                        Developer of many plugins/applets
                        Starting with LMS 8.0 I no longer support my plugins/applets (see here for more information )


                        • Phil Meyer
                          Senior Member
                          • Apr 2005
                          • 5610

                          #27

                          >Running this now - will confirm how long it takes to complete against a library of 35871 tracks when its finished.
                          >

                           It finished (I left it going unattended). Not sure how long it took - I couldn't find anything in the log to indicate when it had actually finished, but it did take a significant amount of time. This will add to rescan times.

                          Detecting using (number of bytes): 10000
                          Detected: 35871
                          Checksum duplicates: 1405 duplicates (Incorrect duplicates: 1325)
                          Duplicates: 80

                           It actually highlighted a handful of actual duplicates that I wasn't aware of (e.g. where I have remix versions from old CD singles that are named differently but are actually the same thing).

                           After I ignored duplicates that obviously weren't duplicates (e.g. due to cue sheets - see below), there were actually only two false positives. The checksum from the first 10,000 bytes matches, and the song length is identical, but the songs are certainly not the same.

                          >What about support for cue sheets? i.e. a single audio file that is chopped up into songs via a cue sheet - each song references the same audio file, but with a different sub-set of the data. Does the duplicate checker take account of this; treating each track as if it were a separate file chopped into segments?
                          >

                           The answer to this is that every track that comes from the same source file in a cue sheet is detected as a duplicate. E.g. in the duplicates log file, I have 26 lines of:

                          71eb89d6770416109fca35e97cde57e8-34d5db7e M:\Music\Surround Sound\The Beatles\Love\Love.dts.flac


                          I have a single .mov file that SBS understands (and just plays the audio content). This appears in the Duplicates log as:

                          NOCHECKSUM-NOSIZE M:\Music\Phil's Music\Progressive Rock\Thom Yorke\Other\Rabbit In Your Headlights.mov

                          I guess your code doesn't understand .mov files, but SBS can.


                          • erland
                            Senior Member
                            • Jan 2006
                            • 11323

                            #28
                            Originally posted by Philip Meyer
                            >Running this now - will confirm how long it takes to complete against a library of 35871 tracks when its finished.
                            >

                            It finished (left it going unattended). Not sure how long it took - couldn't find anything in the log to indicate when it had actually finished, but it did take a significant amount of time. This will add to rescan times.
                             As long as the "Duplicates: 80" number doesn't increase, you can try decreasing the number of bytes used on the "Duplicate Detector" settings page. In my small FLAC library, I was able to go as low as 680 bytes without any duplicates.


                            Originally posted by Philip Meyer

                            >What about support for cue sheets? i.e. a single audio file that is chopped up into songs via a cue sheet - each song references the same audio file, but with a different sub-set of the data. Does the duplicate checker take account of this; treating each track as if it were a separate file chopped into segments?
                            >

                            Answer to this is that every track that comes from the same source file in a cue sheet is detected as a duplicate. eg. in the duplicates log file, I have 26 lines of:

                            71eb89d6770416109fca35e97cde57e8-34d5db7e M:\Music\Surround Sound\The Beatles\Love\Love.dts.flac
                            Ok, this is a problem we need to solve.

                             Andy, if you are reading this, is there any way to handle this in the Audio::Scan module?

                             If not, is there some other metadata that could be added to the MD5, in a similar way as I did with the audio size, to make sure each track on a cue sheet gets a unique identity? Some track offset maybe?

                            Originally posted by Philip Meyer
                            I have a single .mov file that SBS understands (and just plays the audio content). This appears in the Duplicates log as:

                            NOCHECKSUM-NOSIZE M:\Music\Phil's Music\Progressive Rock\Thom Yorke\Other\Rabbit In Your Headlights.mov

                            I guess your code doesn't understand .mov files, but SBS can.
                             Andy, if you are reading this, is there any way to handle .mov files in the Audio::Scan module?
                            Erland Lindmark (My homepage)
                            Developer of many plugins/applets
                            Starting with LMS 8.0 I no longer support my plugins/applets (see here for more information )


                            • Andy Grundman
                              Former Squeezebox Guy
                              • Jan 2006
                              • 7395

                              #29

                              On Sep 6, 2010, at 11:47 PM, erland wrote:
                              > Ok, this is a problem we need to solve.
                              >
                              > Andy, if you are reading this, is there any way to handle this in the
                              > Audio::Scan module ?
                              >
                              > If not, is there some other metadata that could be added to the MD5 in
                              > similar way as I did with audio size to make sure each track on a cue
                              > sheet get a unique identity ? Some track offset maybe ?


                              OK, Audio::Scan could let you provide an alternate starting byte offset for the calculation, then you could run it on the same file with the start offset of each track (which is already stored in the database) and get different checksums.

                              How about:

                              Audio::Scan->scan_info( $file, { md5_size => 1024, md5_offset => $track_start } );

                              FYI in the next version of Audio::Scan the default md5_offset is determined by:

                              audio_offset + (audio_size / 2) - (md5_size / 2);

                              So a user-supplied md5_offset would just override that default.
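In other words, the default centers the checksum window in the middle of the compressed audio section. A small Python sketch of the quoted formula (using integer arithmetic, which is an assumption on my part about how the module rounds):

```python
def default_md5_offset(audio_offset, audio_size, md5_size):
    # Center an md5_size-byte window in the middle of the audio section:
    # audio_offset + (audio_size / 2) - (md5_size / 2)
    return audio_offset + audio_size // 2 - md5_size // 2
```

For a file whose audio starts at byte 100 with 10000 audio bytes and md5_size 1024, the window would start at byte 4588.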

                              > Philip Meyer;574862 Wrote:
                              >>
                              >> I have a single .mov file that SBS understands (and just plays the
                              >> audio content). This appears in the Duplicates log as:
                              >>
                              >> NOCHECKSUM-NOSIZE M:\Music\Phil's Music\Progressive Rock\Thom
                              >> Yorke\Other\Rabbit In Your Headlights.mov
                              >>
                              >> I guess your code doesn't understand .mov files, but SBS can.
                              >>

                              > Andy, if you are reading this, is there any way to handle mov files in
                              > the Audio::Scan module ?


                              Hmm, if it's handled OK by SBS that means Audio::Scan is being used. We don't use any other format modules now. Can you send me the file that doesn't work?




                              • erland
                                Senior Member
                                • Jan 2006
                                • 11323

                                #30
                                Originally posted by andyg
                                On Sep 6, 2010, at 11:47 PM, erland wrote:
                                > Ok, this is a problem we need to solve.
                                >
                                > Andy, if you are reading this, is there any way to handle this in the
                                > Audio::Scan module ?
                                >
                                > If not, is there some other metadata that could be added to the MD5 in
                                > similar way as I did with audio size to make sure each track on a cue
                                > sheet get a unique identity ? Some track offset maybe ?


                                OK, Audio::Scan could let you provide an alternate starting byte offset for the calculation, then you could run it on the same file with the start offset of each track (which is already stored in the database) and get different checksums.

                                How about:

                                Audio::Scan->scan_info( $file, { md5_size => 1024, md5_offset => $track_start } );

                                FYI in the next version of Audio::Scan the default md5_offset is determined by:

                                audio_offset + (audio_size / 2) - (md5_size / 2);
                                 Does this default mean that it will work with cue sheets without me specifying a specific md5_offset parameter?

                                 If not, is the offset in the database an offset into the compressed or the uncompressed audio data? If it's an offset into the compressed audio data, this should work.

                                 Is the audio_size of an individual cue sheet track also available in the database? I suppose I might need this if I want to ensure that I don't include data from the previous or next track in the checksum calculation.
                                Erland Lindmark (My homepage)
                                Developer of many plugins/applets
                                Starting with LMS 8.0 I no longer support my plugins/applets (see here for more information )
