  1. #1
    Senior Member erland's Avatar
    Join Date
    Dec 2005
    Location
    Sweden
    Posts
    10,915

    Need help to verify duplicate detection

    Is there someone with a large music library, or a library with a lot of different file formats, who could help me a bit?

    It works in my small 3,500-track library, but I would like to try it with something larger.

    I would like you to test the "Duplicate Detector" plugin, which is available in my testing repository:
    Code:
    http://erlandplugins.googlecode.com/svn/repository/trunk/testing.xml
    You will need the latest Squeezebox Server 7.5 nightly or 7.6 nightly release, revision 31264 or later.

    I would like you to install the plugin and:
    1. Go to "Extras/Duplicate Detector" in the SBS web interface and start detection. You need to hit the "Refresh" link to see the current progress.
    2. The default is to detect using the first 10,000 audio bytes. If you don't get any incorrect duplicates reported, try decreasing this number in the Duplicate Detector settings page and see how low you can go.
    3. Look in the server.log and see if you get any strange errors during the detection.

    The plan for this is a lot more than detecting duplicates; this is just an initial experiment to make sure it's possible to uniquely identify music files by ignoring the tags and looking only at the audio data. The long-term intention is to be able to use this to connect statistics and metadata to a track and still handle the file being moved, renamed or re-tagged.

    It won't detect duplicate files if they use different file formats; it only looks at the audio data, and since that differs between the FLAC and MP3 versions of the same track, it won't consider those to be duplicates.
    It should consider two files duplicates if they have the same audio data but different tags.
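
    Roughly, the idea looks something like this (a simplified sketch only, not the plugin's actual code; it assumes Audio::Scan reports the start of the audio data as audio_offset in its info hash):
    Code:
    use strict;
    use warnings;
    use Audio::Scan;
    use Digest::MD5 qw(md5_hex);

    sub audio_checksum {
        my ( $file, $md5_size ) = @_;
        $md5_size ||= 10_000;    # the default discussed above

        # Let Audio::Scan parse the headers and report where the audio
        # data starts, i.e. after any ID3/metadata blocks.
        my $s      = Audio::Scan->scan($file);
        my $offset = $s->{info}->{audio_offset};
        return unless defined $offset;

        # Hash only audio bytes, so re-tagging or renaming the file
        # does not change the checksum.
        open my $fh, '<:raw', $file or return;
        seek $fh, $offset, 0;
        read $fh, my $buf, $md5_size;
        close $fh;

        return md5_hex( $buf // '' );
    }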

    Report back how low/high you had to set the setting without getting any incorrect duplicates reported.
    Also report which operating system you have verified it on and what kind of music files you have (FLAC, MP3, ...).

    The performance difference between 7.5 and 7.6 was very big in my library; 7.6 was many, many times faster.
    Erland Isaksson (My homepage)
    (Developer of many plugins/applets (both free and commercial).
    If you like to encourage future presence on this forum and/or third party plugin/applet development, consider purchasing some plugins)

    Interested in the future of music streaming? ickStream - A world of music at your fingertips.

  2. #2
    Senior Member vagskal's Avatar
    Join Date
    Oct 2008
    Location
    Sweden
    Posts
    643
    I just tried the plugin with the default setting. I only get "Unable to calculate checksum" error messages. The web UI reports as many duplicates as there are files detected. The .txt file for viewing the duplicate files is empty.

    I guess there must be something wrong.

    Version: 7.5.2 - r31223 @ Thu Aug 19 02:06:58 PDT 2010
    Hostname: musik
    Server IP Address: 192.168.1.101
    Server HTTP Port Number: 9000
    Operating system: Windows XP - EN - cp1252
    Platform Architecture: 586
    Perl Version: 5.10.0 - MSWin32-x86-multi-thread
    MySQL Version: 5.0.22-community-nt-log
    Total Players Recognized: 4
    Attached Files
    2 x SB3 (wired), Receiver (wired), Boom (wireless), Controller, iPeng on iPhone 4 & iPad, muso on remote computer running Win 7 64-bit | 7.7.3 on Win XP

  3. #3
    Senior Member
    Join Date
    Dec 2009
    Location
    Germany
    Posts
    732
    Quote Originally Posted by vagskal View Post
    I just tried the plugin with the default setting. I only get "Unable to calculate checksum" error messages. The web UI reports as many duplicates as there are files detected. The .txt file for viewing the duplicate files is empty.

    I guess there must be something wrong.

    Version: 7.5.2 - r31223 @ Thu Aug 19 02:06:58 PDT 2010
    ...
    Your SBS version seems to be too old; r31264 or later is needed for it to work.

  4. #4
    Administrator andyg's Avatar
    Join Date
    Jan 2006
    Location
    Pittsburgh, PA
    Posts
    7,395
    For some reason the "Show duplicates" link downloads a file called duplicates.txt.rdp (Safari/OSX 10.6). Is there a reason you used a .bin file extension instead of just .txt?

    I'm getting a lot of duplicates even with the default of 10,000 bytes. I will look into that; it makes me worry I did something wrong in Audio::Scan.

  5. #5
    Administrator andyg's Avatar
    Join Date
    Jan 2006
    Location
    Pittsburgh, PA
    Posts
    7,395
    Yeah it's LAME padding causing the problem, hmm...

  6. #6
    Senior Member erland's Avatar
    Join Date
    Dec 2005
    Location
    Sweden
    Posts
    10,915
    Quote Originally Posted by andyg View Post
    For some reason the "Show duplicates" link downloads a file called duplicates.txt.rdp (Safari/OSX 10.6). Is there a reason you used a .bin file extension instead of just .txt?
    I didn't want it to open in the browser if the file got huge, so I wanted the content type to be set to "application/octet-stream". It works for me with Safari on OSX 10.6, but maybe that's because I'm running SBS on a separate Linux machine?

    If you have any ideas why it appends .rdp, let me know.

    Quote Originally Posted by andyg View Post
    I'm getting a lot of duplicates even with the default of 10,000 bytes. I will look into that; it makes me worry I did something wrong in Audio::Scan.

    ...

    Yeah it's LAME padding causing the problem, hmm...
    Do you mean that LAME-encoded files will cause incorrect duplicates? Is this independent of the md5_size setting?

    I've got reports from a user with a 100,000-track library, and so far he seems to get a different number of duplicates with different md5_size settings. He has tested a couple of settings between 10,000 and 10,000,000, and they all report a different number of duplicates.

    Is this an indication that MD5 might not be good enough in a large library?
    Erland Isaksson (My homepage)
    (Developer of many plugins/applets (both free and commercial).
    If you like to encourage future presence on this forum and/or third party plugin/applet development, consider purchasing some plugins)

    Interested in the future of music streaming? ickStream - A world of music at your fingertips.

  7. #7
    Administrator andyg's Avatar
    Join Date
    Jan 2006
    Location
    Pittsburgh, PA
    Posts
    7,395
    MD5 is fine; the problem is that the first 10,000 bytes of these files are identical. I think the easiest way to deal with it is not to take the bytes from the very beginning of the file but from somewhere in the middle.
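
    As a rough sketch on top of the checksum idea from the first post (assuming Audio::Scan also reports audio_size for the format), something like this would shift the hash window into the middle of the audio data, so identical encoder padding at the start of different tracks no longer matches:
    Code:
    # Illustrative only: take the md5_size window from the middle of the
    # audio data instead of its start (assumes audio_size is reported).
    my $s      = Audio::Scan->scan($file);
    my $offset = $s->{info}->{audio_offset};
    my $size   = $s->{info}->{audio_size};

    if ( defined $size && $size > $md5_size ) {
        # keep a full md5_size worth of bytes available after the new start
        $offset += int( ( $size - $md5_size ) / 2 );
    }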

  8. #8
    Senior Member erland's Avatar
    Join Date
    Dec 2005
    Location
    Sweden
    Posts
    10,915
    Quote Originally Posted by andyg View Post
    MD5 is fine; the problem is that the first 10,000 bytes of these files are identical. I think the easiest way to deal with it is not to take the bytes from the very beginning of the file but from somewhere in the middle.
    It just felt strange that using an md5_size of 100,000 reports a different number of duplicates than a setting of 500,000. Shouldn't the padding be irrelevant when using a larger md5_size setting?
    Erland Isaksson (My homepage)
    (Developer of many plugins/applets (both free and commercial).
    If you like to encourage future presence on this forum and/or third party plugin/applet development, consider purchasing some plugins)

    Interested in the future of music streaming? ickStream - A world of music at your fingertips.

  9. #9
    Administrator andyg's Avatar
    Join Date
    Jan 2006
    Location
    Pittsburgh, PA
    Posts
    7,395

    Quote Originally Posted by erland View Post
    It just felt strange that using an md5_size of 100,000 reports a different number of duplicates than a setting of 500,000. Shouldn't the padding be irrelevant when using a larger md5_size setting?

    Yeah, any false-positive duplicates need to be investigated as to why so many bytes are identical.

  10. #10
    Senior Member vagskal's Avatar
    Join Date
    Oct 2008
    Location
    Sweden
    Posts
    643
    Quote Originally Posted by copperstate View Post
    Your SBS version seems to be too old; r31264 or later is needed for it to work.
    Thanks, that did it.

    I scanned my 115k+ library and 9,548 duplicates were found with the default setting. When I clicked "Show duplicates", SBS stalled (music playing stopped, mysql at 50% CPU) and I left it that way overnight. There was no change the morning after, so I had to force SBS to quit. The same thing happened when I tried showing the duplicates again after restarting SBS.

    When I checked the show-duplicates .txt file near the beginning of the scan, it worked, but it reported duplicates that were not duplicates (there was perhaps only half a page of data then).

    Log file is attached.
    Attached Files
    2 x SB3 (wired), Receiver (wired), Boom (wireless), Controller, iPeng on iPhone 4 & iPad, muso on remote computer running Win 7 64-bit | 7.7.3 on Win XP
