Need help to verify duplicate detection

  • vagskal
    Senior Member
    • Oct 2008
    • 643

    #16
    Below are the results on 7.5.2 latest build and the first part of the server.log where the plugin seemed to have an issue:
    Code:
    Detecting using (number of bytes): 1000000
    Detected: 115777
    Checksum duplicates: 975 Show checksum duplicates  (Incorrect duplicates: 52) 
    Duplicates: 923 Show duplicates 
    
    Detecting using (number of bytes): 500000
    Detected: 115777
    Checksum duplicates: 975 Show checksum duplicates  (Incorrect duplicates: 52) 
    Duplicates: 923 Show duplicates 
    
    Detecting using (number of bytes): 250000
    Detected: 115777
    Checksum duplicates: 1001 Show checksum duplicates  (Incorrect duplicates: 78) 
    Duplicates: 923 Show duplicates 
    
    
    
    
    [10-09-05 09:54:27.4051] main::init (323) Starting Squeezebox Server (v7.5.2, r31264, Sat Aug 28 02:06:44 PDT 2010) perl 5.010000
    [10-09-05 09:54:36.6707] Slim::Utils::Strings::parseStrings (351) Error: Parsing line 1: # Max Spicer, May 2007
    [10-09-05 09:54:47.9365] Slim::Utils::Misc::msg (1165) Warning: [09:54:47.9362] "my" variable $dbh masks earlier declaration in same scope at C:\Documents and Settings\All Users\Application Data\Squeezebox\Cache\InstalledPlugins/Plugins/DuplicateDetector/Plugin.pm line 319.
    [10-09-05 09:54:47.9368] Slim::Utils::Misc::msg (1165) Warning: [09:54:47.9366] "my" variable $sth masks earlier declaration in same scope at C:\Documents and Settings\All Users\Application Data\Squeezebox\Cache\InstalledPlugins/Plugins/DuplicateDetector/Plugin.pm line 320.
    [10-09-05 09:54:53.5301] Slim::Schema::Storage::throw_exception (82) Error: DBI Exception: DBD::mysql::db do failed: Unknown column 'audiosize' in 'field list'
    [10-09-05 09:54:53.5305] Slim::Schema::Storage::throw_exception (82) Backtrace:
    
       frame 0: Slim::Utils::Log::logBacktrace (/<C:\Program\SQUEEZ~1\server\SQUEEZ~3.EXE>Slim/Schema/Storage.pm line 82)
       frame 1: Slim::Schema::Storage::throw_exception (/<C:\Program\SQUEEZ~1\server\SQUEEZ~3.EXE>DBIx/Class/Storage/DBI.pm line 957)
       frame 2: DBIx::Class::Storage::DBI::__ANON__ (C:\Documents and Settings\All Users\Application Data\Squeezebox\Cache\InstalledPlugins/Plugins/DuplicateDetector/Plugin.pm line 104)
       frame 3: (eval) (C:\Documents and Settings\All Users\Application Data\Squeezebox\Cache\InstalledPlugins/Plugins/DuplicateDetector/Plugin.pm line 104)
       frame 4: Plugins::DuplicateDetector::Plugin::initDatabase (C:\Documents and Settings\All Users\Application Data\Squeezebox\Cache\InstalledPlugins/Plugins/DuplicateDetector/Plugin.pm line 81)
       frame 5: Plugins::DuplicateDetector::Plugin::initPlugin (/<C:\Program\SQUEEZ~1\server\SQUEEZ~3.EXE>Slim/Utils/PluginManager.pm line 328)
       frame 6: (eval) (/<C:\Program\SQUEEZ~1\server\SQUEEZ~3.EXE>Slim/Utils/PluginManager.pm line 328)
       frame 7: Slim::Utils::PluginManager::load (slimserver.pl line 507)
       frame 8: main::init (slimserver.pl line 578)
       frame 9: main::main (slimserver.pl line 99)
       frame 10: PerlSvc::Interactive (/<C:\Program\SQUEEZ~1\server\SQUEEZ~3.EXE>PerlSvc.pm line 99)
       frame 11: PerlSvc::_interactive (slimserver.pl line 0)
       frame 12: (eval) (slimserver.pl line 0)
    [10-09-05 09:54:53.8575] Plugins::DuplicateDetector::Plugin::initDatabase (111) Duplicate Detector: Creating database tables
    [10-09-05 09:54:53.9546] Plugins::DuplicateDetector::Plugin::createIndex (137) No smdidIndex index found in duplicatedetector_tracks, creating index...
    Let me know if you want me to try with an even higher setting or if you would like to see the duplicates lists or the entire server.log.
    2 x SB3 (wired), Receiver (wired), Boom (wireless), Controller, iPeng on iPhone 4 & iPad, muso on remote computer running Win 7 64-bit | 7.7.3 on Win XP


    • erland
      Senior Member
      • Jan 2006
      • 11323

      #17
      Originally posted by vagskal
      Below are the results on 7.5.2 latest build and the first part of the server.log where the plugin seemed to have an issue:
      You can ignore the SQL exception in the server.log; I didn't find any way to hide it. It's harmless, and it's only output once for users who had the previous version of the plugin installed.

      Originally posted by vagskal
      Let me know if you want me to try with an even higher setting or if you would like to see the duplicates lists or the entire server.log.
      It's the "Incorrect duplicates" lists that I'm mostly interested in. Could you please:
      - Post the incorrectduplicates.txt file for one of the executions; it doesn't matter which one.
      - See if there is anything special about those tracks that could cause incorrect duplicates, for example a lot of silence at the beginning or something similar.

      If one or several of the rows in the incorrectduplicates.txt file start with "NOCHECKSUM-", that indicates that no checksum calculation could be performed for those files. In that case it's very interesting to know the file format of those files and to verify that they can be played through SBS. I've seen issues like this with m4a files from another user.
      Erland Lindmark (My homepage)
      Developer of many plugins/applets
      Starting with LMS 8.0 I no longer support my plugins/applets (see here for more information )


      • MrSinatra

        #18
        hey Erland,

        as always, you impress. questions for you on this, although i can see you're still beta testing it:

        what is defined as a duplicate? same title, artist, etc? some kind of audio fingerprint? what if bitrates or formats are different of the same song? what if the source is different, like a remastered cd?

        will you be able to checkmark dupes you want to delete?


        • erland
          Senior Member
          • Jan 2006
          • 11323

          #19
          Originally posted by MrSinatra
          what is defined as a duplicate? same title, artist, etc? some kind of audio fingerprint? what if bitrates or formats are different of the same song? what if the source is different, like a remastered cd?
           It needs to be the exact same rip, as it checks that the compressed audio section of the file is byte-identical. So a remastered CD, a different format, or a different bitrate will not be detected as a duplicate.

          Originally posted by MrSinatra
          will you be able to checkmark dupes you want to delete?
          No, for three reasons:
          1. SBS might not have write access to the file system where the music files are.
          2. I don't want to make it easy for users to accidentally delete their music files.
          3. The intention of this plugin is to verify the algorithm that's used to identify a specific music file even if it has been re-tagged or renamed/moved. The algorithm will later be used to connect metadata and statistics to a specific music file and make sure that relation survives a rename/move or retagging of the file.

          It is possible to export all duplicates to a text file.
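For readers curious how such an identity can work in practice, here is a minimal Python sketch (the plugin itself is written in Perl; the function name, parameters, and defaults here are illustrative assumptions, not the plugin's API). It hashes the first N bytes of the compressed audio section, skipping the tag header, and appends the audio size in hex, matching the `checksum-size` identities that appear in the result listings later in this thread:

```python
import hashlib

def audio_identity(path, audio_offset, audio_size, num_bytes=10000):
    # Hash only the compressed audio section, skipping any tag/metadata
    # header before it, so retagging the file does not change the identity.
    with open(path, "rb") as f:
        f.seek(audio_offset)
        data = f.read(min(num_bytes, audio_size))
    # Combine the checksum with the audio size (hex) so two files whose
    # first bytes match but whose lengths differ still get distinct identities.
    return "%s-%08x" % (hashlib.md5(data).hexdigest(), audio_size)
```

Because only the audio section is read, moving or retagging the file leaves the identity unchanged, which is exactly the property the plugin is trying to verify.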
          Erland Lindmark (My homepage)
          Developer of many plugins/applets
          Starting with LMS 8.0 I no longer support my plugins/applets (see here for more information )


          • Andy Grundman
            Former Squeezebox Guy
            • Jan 2006
            • 7395

            #20

            On Sep 5, 2010, at 11:36 PM, erland wrote:

            >
            > MrSinatra;574572 Wrote:
            >>
            >> what is defined as a duplicate? same title, artist, etc? some kind of
            >> audio fingerprint? what if bitrates or formats are different of the
            >> same song? what if the source is different, like a remastered cd?
            >>

            > It needs to be the exact same rip as it checks that the compressed
            > audio bytes section of the file is the same. So remastered cd,
            > different formats, bitrates will not be detected as duplicates.


            Right, this is not doing audio fingerprinting, just a checksum. Fingerprinting is hard and slow, as you have to decode every type of audio format to PCM and then process the raw audio data.


            • audiomuze
              Senior Member
              • Oct 2009
              • 1427

              #21
              Hi Erland

              Would it not be possible to limit the md5 hash to say 1000 bytes or something similarly small if you read forward from the midpoint of the audio portion of the file, regardless of file format?
              puddletag - now packaged in most Linux distributions.


              • erland
                Senior Member
                • Jan 2006
                • 11323

                #22
                Originally posted by audiomuze
                Would it not be possible to limit the md5 hash to say 1000 bytes or something similarly small if you read forward from the midpoint of the audio portion of the file, regardless of file format?
                 Yes, possibly. Andy plans to try whether that works better, so we will know as soon as he has implemented a new version of the Audio::Scan module that supports this, which people can then try.

                 A possible issue is that we are talking about compressed data, which means the compression algorithm might cause problems. I don't have any detailed knowledge about this, but I suspect the real data might be stored at the beginning of the file, while the later part of the file might just be instructions about where to insert the different data sections when uncompressing. If you know about compression algorithms, you know that most of them try to store a common data section once and keep a list of all occurrences of that section in the uncompressed file. Of course, the list of pointers might be as good as a real data section from a checksum perspective.

                 The 0.2 version combines the MD5 checksum with the number of compressed audio bytes in the file, which made it a lot better than the previous approach that used only the MD5 checksum. In the results, "Duplicates" shows the files that have both the same checksum and the same number of compressed audio bytes. "Incorrect duplicates" is the list of files that have the same checksum but not the same number of compressed audio bytes.
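As a rough illustration of that grouping (a Python sketch with hypothetical names, not the plugin's Perl code): files are first bucketed by checksum; buckets whose members also agree on audio size are real duplicates, while mixed-size buckets correspond to the "incorrect duplicates" list:

```python
from collections import defaultdict

def classify(tracks):
    # tracks: list of (path, md5, audio_size) tuples.
    by_md5 = defaultdict(list)
    for track in tracks:
        by_md5[track[1]].append(track)
    duplicates, incorrect = [], []
    for group in by_md5.values():
        if len(group) < 2:
            continue  # unique checksum, not a duplicate candidate
        if len({t[2] for t in group}) == 1:
            duplicates.append(group)   # same checksum AND same audio size
        else:
            incorrect.append(group)    # same checksum, different sizes
    return duplicates, incorrect
```

A checksum group with mixed sizes is simplified here to "incorrect" as a whole; the real plugin may report it with finer granularity.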

                 Since the intention is to use this later on to connect manually entered metadata/statistics to individual music files, it really needs to be as close to 100% as possible.
                Erland Lindmark (My homepage)
                Developer of many plugins/applets
                Starting with LMS 8.0 I no longer support my plugins/applets (see here for more information )


                • MrSinatra

                  #23
                  thx for the info...

                  so if i understand what you guys are saying, the intention is to devise a system where a file can be uniquely identified to SBS and continuously tracked even if it were to be moved around, retagged, etc, correct?

                  and to stress test this system, you are seeing how best to identify duplicates, so that later you can be confident that SBS won't misidentify one file as being another file that in reality, it really isn't... yes?

                  once confident that SBS can uniquely identify a given file, SBS can store stats and other data about the file that don't have to be in the file or its tag. right?

                  i'm all for that, but i just hope there will be some kind of "Cache clearing" method to be able to tell SBS to forget the info collected if the user desires, and/or a way to import/export it to other installs.

                  neat work!


                  • erland
                    Senior Member
                    • Jan 2006
                    • 11323

                    #24
                    Originally posted by MrSinatra
                    so if i understand what you guys are saying, the intention is to devise a system where a file can be uniquely identified to SBS and continuously tracked even if it were to be moved around, retagged, etc, correct?
                    Yes

                    Originally posted by MrSinatra
                    and to stress test this system, you are seeing how best to identify duplicates, so that later you can be confident that SBS won't misidentify one file as being another file that in reality, it really isn't... yes?
                     Yes, the purpose is to get help testing it with all possible variations of encodings/file formats and to make sure it's unique enough in a large library so that two songs don't get the same identity.

                    Originally posted by MrSinatra
                    once confident that SBS can uniquely identify a given file, SBS can store stats and other data about the file that don't have to be in the file or its tag. right?
                    Yes, I'm not sure standard SBS is going to do it but there are plans to use it in some third party stuff.

                    Originally posted by MrSinatra
                    i'm all for that, but i just hope there will be some kind of "Cache clearing" method to be able to tell SBS to forget the info collected if the user desires, and/or a way to import/export it to other installs.
                     The reason to have it as a third party add-on is that you don't need to use it if you don't like it. Assuming it's fast enough, it should be safe to recalculate the identity each time a track is scanned. That way SBS can keep deleting its database contents during a full rescan, while we keep additional tables with persistent metadata/statistics that can be reconnected to the correct tracks during or after the scanning.

                    If it isn't fast enough, we need to implement some caching mechanism or optimize the performance in some other way but let's handle that after we know it's needed. At the moment the focus is to make sure the identification process is good enough.

                     Export/import possibilities for manually entered metadata/statistics are always important, but the main reason for them is to make it possible to take a backup or to export the data for use in some other application. In addition, it of course also has to be possible to clear all metadata/statistics if you want to start over from the beginning.
                    Erland Lindmark (My homepage)
                    Developer of many plugins/applets
                    Starting with LMS 8.0 I no longer support my plugins/applets (see here for more information )


                    • Phil Meyer
                      Senior Member
                      • Apr 2005
                      • 5610

                      #25

                      >1. Goto "Extras/Duplicate Detector" in SBS web interface and start
                      >detection. You need to hit the "Refresh" link to see the current
                      >progress.
                      >

                       Running this now - will confirm how long it takes to complete against a library of 35871 tracks when it's finished.

                       I note, though, that SBS reports I have 35852 songs, but Duplicate Detector says it is checking 35871 songs. Could this be because it is picking up non-track records (e.g. playlist names, URLs to other resources, etc.)?

                      >2. The default is to detect using the first 10000 audio bytes.
                      >

                       Not sure that it is going to be safe to take a subset of audio bytes (at least not without looking at other audio attributes too). There could be different edited versions of songs, e.g. where the first part of the song is identical but one version is a bit longer. Perhaps the checksum also needs to include the audio length, or take n bytes at the start AND n bytes at the end for the checksum calculation.
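The head-and-tail variant suggested here could look roughly like the following Python sketch (purely illustrative; nothing like this is implemented by the plugin, and the function name and parameters are invented for the example). It hashes n bytes from the start and, without overlap, up to n bytes from the end of the audio section, so a version that only differs near the end still hashes differently:

```python
import hashlib

def head_tail_md5(path, audio_offset, audio_size, n=10000):
    h = hashlib.md5()
    with open(path, "rb") as f:
        f.seek(audio_offset)
        h.update(f.read(min(n, audio_size)))   # first n audio bytes
        tail = max(audio_size - n, 0)          # bytes not covered by the head
        tail_len = min(n, tail)
        if tail_len:
            f.seek(audio_offset + audio_size - tail_len)
            h.update(f.read(tail_len))         # last bytes, no overlap with head
    return h.hexdigest()
```

Two files with identical first 10,000 audio bytes but a different ending would collide under a head-only checksum, yet produce different hashes here.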

                      I can think of one example, where I have an album in two formats - the original where songs are distinct tracks, and an enhanced version where songs are cross-faded.

                      What about support for cue sheets? i.e. a single audio file that is chopped up into songs via a cue sheet - each song references the same audio file, but with a different sub-set of the data. Does the duplicate checker take account of this; treating each track as if it were a separate file chopped into segments?

                      So far, it has detected 113 checksum duplicates (all incorrect duplicates) out of 6856 songs (Duplicates: 0).

                      I looked at the results, and there are groups that I assume it believes are duplicates, such as:

                      64b80a25505a34d0c723dce617ced261-00d02343 M:\Music\Phil's Music\Blues\Canned Heat\Uncanned\01 - On The Road Again (Alternate Take).mp3
                      64b80a25505a34d0c723dce617ced261-0079bc14 M:\Music\Phil's Music\Blues\Canned Heat\Uncanned\02 - Nine Below Zero.mp3
                      [...in this case, all 45 songs on the album are listed...]
                      64b80a25505a34d0c723dce617ced261-001930fa M:\Music\Phil's Music\Blues\Canned Heat\Uncanned\45 - Huh.mp3

                      The first part up to the "-" is always the same, the second part of the number is different. What does this mean?

                      Phil


                      • erland
                        Senior Member
                        • Jan 2006
                        • 11323

                        #26
                        Originally posted by Philip Meyer
                        The first part up to the "-" is always the same, the second part of the number is different. What does this mean?
                         The checksum (first part) is the same while the audio length (last part) isn't; combining them makes the identity unique.
                        Erland Lindmark (My homepage)
                        Developer of many plugins/applets
                        Starting with LMS 8.0 I no longer support my plugins/applets (see here for more information )


                        • Phil Meyer
                          Senior Member
                          • Apr 2005
                          • 5610

                          #27

                          >Running this now - will confirm how long it takes to complete against a library of 35871 tracks when its finished.
                          >

                           It finished (I left it going unattended). Not sure how long it took - I couldn't find anything in the log to indicate when it had actually finished, but it did take a significant amount of time. This will add to rescan times.

                          Detecting using (number of bytes): 10000
                          Detected: 35871
                          Checksum duplicates: 1405 duplicates (Incorrect duplicates: 1325)
                          Duplicates: 80

                           It actually highlighted a handful of actual duplicates that I wasn't aware of (e.g. where I have remix versions from old CD singles that are named differently but are actually the same thing).

                           After I ignored duplicates that obviously weren't duplicates (e.g. due to cue sheets - see below), there were actually only two false positives. The checksum from the first 10,000 bytes matches, and the song length is identical, but the songs are certainly not the same.

                          >What about support for cue sheets? i.e. a single audio file that is chopped up into songs via a cue sheet - each song references the same audio file, but with a different sub-set of the data. Does the duplicate checker take account of this; treating each track as if it were a separate file chopped into segments?
                          >

                           The answer to this is that every track that comes from the same source file in a cue sheet is detected as a duplicate. E.g. in the duplicates log file, I have 26 lines of:

                          71eb89d6770416109fca35e97cde57e8-34d5db7e M:\Music\Surround Sound\The Beatles\Love\Love.dts.flac


                          I have a single .mov file that SBS understands (and just plays the audio content). This appears in the Duplicates log as:

                          NOCHECKSUM-NOSIZE M:\Music\Phil's Music\Progressive Rock\Thom Yorke\Other\Rabbit In Your Headlights.mov

                          I guess your code doesn't understand .mov files, but SBS can.


                          • erland
                            Senior Member
                            • Jan 2006
                            • 11323

                            #28
                            Originally posted by Philip Meyer
                            >Running this now - will confirm how long it takes to complete against a library of 35871 tracks when its finished.
                            >

                            It finished (left it going unattended). Not sure how long it took - couldn't find anything in the log to indicate when it had actually finished, but it did take a significant amount of time. This will add to rescan times.
                             As long as the "Duplicates: 80" number doesn't increase, you can try decreasing the number of bytes used on the "Duplicate Detector" settings page. In my small FLAC library, I was able to go as low as 680 bytes without any duplicates.


                            Originally posted by Philip Meyer

                            >What about support for cue sheets? i.e. a single audio file that is chopped up into songs via a cue sheet - each song references the same audio file, but with a different sub-set of the data. Does the duplicate checker take account of this; treating each track as if it were a separate file chopped into segments?
                            >

                            Answer to this is that every track that comes from the same source file in a cue sheet is detected as a duplicate. eg. in the duplicates log file, I have 26 lines of:

                            71eb89d6770416109fca35e97cde57e8-34d5db7e M:\Music\Surround Sound\The Beatles\Love\Love.dts.flac
                            Ok, this is a problem we need to solve.

                             Andy, if you are reading this, is there any way to handle this in the Audio::Scan module?

                             If not, is there some other metadata that could be added to the MD5, in a similar way as I did with the audio size, to make sure each track on a cue sheet gets a unique identity? Some track offset maybe?

                            Originally posted by Philip Meyer
                            I have a single .mov file that SBS understands (and just plays the audio content). This appears in the Duplicates log as:

                            NOCHECKSUM-NOSIZE M:\Music\Phil's Music\Progressive Rock\Thom Yorke\Other\Rabbit In Your Headlights.mov

                            I guess your code doesn't understand .mov files, but SBS can.
                             Andy, if you are reading this, is there any way to handle .mov files in the Audio::Scan module?
                            Erland Lindmark (My homepage)
                            Developer of many plugins/applets
                            Starting with LMS 8.0 I no longer support my plugins/applets (see here for more information )


                            • Andy Grundman
                              Former Squeezebox Guy
                              • Jan 2006
                              • 7395

                              #29

                              On Sep 6, 2010, at 11:47 PM, erland wrote:
                              > Ok, this is a problem we need to solve.
                              >
                              > Andy, if you are reading this, is there any way to handle this in the
                              > Audio::Scan module ?
                              >
                              > If not, is there some other metadata that could be added to the MD5 in
                              > similar way as I did with audio size to make sure each track on a cue
                              > sheet get a unique identity ? Some track offset maybe ?


                              OK, Audio::Scan could let you provide an alternate starting byte offset for the calculation, then you could run it on the same file with the start offset of each track (which is already stored in the database) and get different checksums.

                              How about:

                              Audio::Scan->scan_info( $file, { md5_size => 1024, md5_offset => $track_start } );

                              FYI in the next version of Audio::Scan the default md5_offset is determined by:

                              audio_offset + (audio_size / 2) - (md5_size / 2);

                              So a user-supplied md5_offset would just override that default.
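In other words, the default centers the checksum window in the middle of the compressed audio section. A small Python sketch of the quoted formula (using integer arithmetic, which is an assumption on my part about how the module rounds):

```python
def default_md5_offset(audio_offset, audio_size, md5_size):
    # Center an md5_size-byte window in the middle of the audio section:
    # audio_offset + (audio_size / 2) - (md5_size / 2)
    return audio_offset + audio_size // 2 - md5_size // 2
```

For a file whose audio starts at byte 100 with 10000 audio bytes and md5_size 1024, the window would start at byte 4588.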

                              > Philip Meyer;574862 Wrote:
                              >>
                              >> I have a single .mov file that SBS understands (and just plays the
                              >> audio content). This appears in the Duplicates log as:
                              >>
                              >> NOCHECKSUM-NOSIZE M:\Music\Phil's Music\Progressive Rock\Thom
                              >> Yorke\Other\Rabbit In Your Headlights.mov
                              >>
                              >> I guess your code doesn't understand .mov files, but SBS can.
                              >>

                              > Andy, if you are reading this, is there any way to handle mov files in
                              > the Audio::Scan module ?


                              Hmm, if it's handled OK by SBS that means Audio::Scan is being used. We don't use any other format modules now. Can you send me the file that doesn't work?




                              • erland
                                Senior Member
                                • Jan 2006
                                • 11323

                                #30
                                Originally posted by andyg
                                On Sep 6, 2010, at 11:47 PM, erland wrote:
                                > Ok, this is a problem we need to solve.
                                >
                                > Andy, if you are reading this, is there any way to handle this in the
                                > Audio::Scan module ?
                                >
                                > If not, is there some other metadata that could be added to the MD5 in
                                > similar way as I did with audio size to make sure each track on a cue
                                > sheet get a unique identity ? Some track offset maybe ?


                                OK, Audio::Scan could let you provide an alternate starting byte offset for the calculation, then you could run it on the same file with the start offset of each track (which is already stored in the database) and get different checksums.

                                How about:

                                Audio::Scan->scan_info( $file, { md5_size => 1024, md5_offset => $track_start } );

                                FYI in the next version of Audio::Scan the default md5_offset is determined by:

                                audio_offset + (audio_size / 2) - (md5_size / 2);
                                 Does this default mean that it will work with cue sheets without me specifying a specific md5_offset parameter?

                                 If not, is the offset in the database an offset into the compressed or the uncompressed audio data? If it's an offset into the compressed audio data, this should work.

                                 Is the audio_size of an individual cue sheet track also available in the database? I suppose I might need this if I want to ensure that I don't include data from the previous or next track in the checksum calculation.
                                Erland Lindmark (My homepage)
                                Developer of many plugins/applets
                                Starting with LMS 8.0 I no longer support my plugins/applets (see here for more information )
