  1. #21 · audiomuze · Senior Member · Joined Oct 2009 · Posts: 1,001
    Hi Erland

    Would it not be possible to limit the MD5 hash to, say, 1000 bytes or something similarly small, reading forward from the midpoint of the audio portion of the file, regardless of file format?
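
    In rough Perl, I'm picturing something like this (just a sketch; the offset and size values below are made up and would really come from whatever scans the file):

        use Digest::MD5 ();

        my $file         = 'track.flac';   # hypothetical file
        my $audio_offset = 4096;           # start of the audio portion (after tags/headers)
        my $audio_size   = 5_000_000;      # number of audio bytes in the file
        my $chunk        = 1000;           # small, fixed number of bytes to hash

        open my $fh, '<:raw', $file or die "Cannot open $file: $!";
        seek $fh, $audio_offset + int($audio_size / 2), 0;   # jump to the midpoint
        read $fh, my $buf, $chunk;
        close $fh;

        my $checksum = Digest::MD5::md5_hex($buf);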
    SqueezeWand | Vivere DAC MKI | ATC SCA2 | ATC SCM100ASLT

    Linux finally gets a great audio tagger: puddletag - now packaged in most Linux distributions.

  2. #22 · erland · Senior Member · Sweden · Joined Dec 2005 · Posts: 10,942
    Quote Originally Posted by audiomuze
    Would it not be possible to limit the MD5 hash to, say, 1000 bytes or something similarly small, reading forward from the midpoint of the audio portion of the file, regardless of file format?
    Possibly, yes. Andy plans to test whether that works better, so we will know as soon as he has implemented a new version of the Audio::Scan module that supports it, which people can then try.

    A possible issue is that we are talking about compressed data, which means the compression algorithm might cause problems. I don't have detailed knowledge of the formats, but I suspect the real data might be stored at the beginning of the file, while later parts might just be instructions about where to insert the different data sections when uncompressing. If you know about compression algorithms, you know that most of them try to store a common data section once and keep a list of all occurrences of that section in the uncompressed file. Of course, the list of pointers might be as good as a real data section from a checksum perspective.

    The 0.2 version combines the MD5 checksum with the number of compressed audio bytes in the file, which made it a lot better than the previous approach that used only the MD5 checksum. In the result, "Duplicates" shows the files that have both the same checksum and the same number of compressed audio bytes. "Incorrect duplicates" is the list of files that have the same checksum but not the same number of compressed audio bytes.
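
    In Perl terms the identity is built roughly like this (a sketch; the exact hex formatting of the size part is an illustration, but it matches the checksum-size pairs shown in the result lists):

        use Digest::MD5 ();

        # $audio_data holds the compressed audio bytes used for the checksum and
        # $audio_size is the total number of compressed audio bytes in the file
        # (placeholder values here).
        my ($audio_data, $audio_size) = ('...compressed audio bytes...', 13_640_515);

        my $identity = sprintf '%s-%08x',
            Digest::MD5::md5_hex($audio_data),
            $audio_size;   # same shape as e.g. 64b80a25505a34d0c723dce617ced261-00d02343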

    Since the intention is to later use this to connect manually entered metadata/statistics to individual music files, it really needs to be as close to 100% accurate as possible.
    Erland Isaksson (My homepage)
    Lead platform developer of ickStream Music Platform - A world of music at your fingertips

    (Also developer of many plugins/applets, both free and commercial. If you'd like to encourage future presence on this forum and/or third-party plugin/applet development, consider purchasing some plugins.)

  3. #23 · MrSinatra · Guest
    Thanks for the info...

    So if I understand what you guys are saying, the intention is to devise a system where a file can be uniquely identified by SBS and continuously tracked even if it is moved around, retagged, etc. Correct?

    And to stress-test this system, you are seeing how best to identify duplicates, so that later you can be confident that SBS won't misidentify one file as another file that it really isn't... yes?

    Once confident that SBS can uniquely identify a given file, SBS can store stats and other data about the file that don't have to live in the file or its tags. Right?

    I'm all for that, but I just hope there will be some kind of "cache clearing" method to tell SBS to forget the collected info if the user desires, and/or a way to import/export it to other installs.

    Neat work!

  4. #24 · erland · Senior Member · Sweden · Joined Dec 2005 · Posts: 10,942
    Quote Originally Posted by MrSinatra
    So if I understand what you guys are saying, the intention is to devise a system where a file can be uniquely identified by SBS and continuously tracked even if it is moved around, retagged, etc. Correct?
    Yes

    Quote Originally Posted by MrSinatra
    And to stress-test this system, you are seeing how best to identify duplicates, so that later you can be confident that SBS won't misidentify one file as another file that it really isn't... yes?
    Yes, the purpose is to get help testing it with all possible variations of encodings/file formats, and to make sure it's unique enough in a large library so that two songs don't get the same identity.

    Quote Originally Posted by MrSinatra
    Once confident that SBS can uniquely identify a given file, SBS can store stats and other data about the file that don't have to live in the file or its tags. Right?
    Yes. I'm not sure standard SBS is going to do it, but there are plans to use it in some third-party stuff.

    Quote Originally Posted by MrSinatra
    I'm all for that, but I just hope there will be some kind of "cache clearing" method to tell SBS to forget the collected info if the user desires, and/or a way to import/export it to other installs.
    The reason to have it as a third-party add-on is that you don't need to use it if you don't like it. Assuming it's fast enough, it should be safe to recalculate the identity each time a track is scanned, so SBS can keep deleting its database contents during a full rescan while we maintain additional tables with persistent metadata/statistics that can be reconnected to the correct tracks during or after scanning.

    If it isn't fast enough, we will need to implement some caching mechanism or optimize performance in some other way, but let's handle that once we know it's needed. At the moment the focus is on making sure the identification process is good enough.
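
    As a sketch of what I mean (the table and column names here are made up, not an actual schema), the persistent data would live in its own table keyed by the calculated identity rather than by SBS's internal track id, which is thrown away on a full rescan:

        use DBI ();

        my $dbh = DBI->connect('dbi:SQLite:dbname=persistent.db', '', '',
            { RaiseError => 1 });

        # Keyed by the "md5-size" identity, so the rows survive full rescans
        # and can be reconnected to tracks after SBS rebuilds its database.
        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS track_statistics (
                identity    TEXT PRIMARY KEY,  -- e.g. "64b80a25...-00d02343"
                rating      INTEGER,
                play_count  INTEGER,
                last_played INTEGER
            )
        });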

    Export/import possibilities for manually entered metadata/statistics are always important, but the main reason for that is to make it possible to take a backup or to export the data for use in some other application. In addition, it of course also has to be possible to clear all metadata/statistics if you'd like to start over from the beginning.

  5. #25 · Philip Meyer · Senior Member · UK · Joined Apr 2005 · Posts: 5,587

    Need help to verify duplicate detection

    >1. Go to "Extras/Duplicate Detector" in the SBS web interface and start
    >detection. You need to hit the "Refresh" link to see the current
    >progress.
    >

    Running this now - will confirm how long it takes to complete against a library of 35871 tracks when it's finished.

    I note, though, that SBS reports I have 35852 songs, but Duplicate Detector says it is checking 35871 songs. Could this be because it is picking up non-track records (e.g. playlist names, URLs to other resources, etc.)?

    >2. The default is to detect using the first 10000 audio bytes.
    >

    I'm not sure it is going to be safe to take a subset of audio bytes (at least not without looking at other audio attributes too). There could be different versions of songs that are edited, e.g. the first part of the song is identical, but one version is a bit longer. Perhaps the checksum also needs to include the audio length, or take n bytes at the start AND n bytes at the end for the checksum calculation.
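
    The start-AND-end variant could look something like this in rough Perl (a sketch; the offset and size values are invented and would come from the scanner):

        use Digest::MD5 ();

        my $file = 'track.mp3';
        my ($audio_offset, $audio_size) = (4096, 5_000_000);  # would come from the scanner
        my $n = 10_000;                                       # bytes from each end

        open my $fh, '<:raw', $file or die "Cannot open $file: $!";
        seek $fh, $audio_offset, 0;                       # first n audio bytes
        read $fh, my $head, $n;
        seek $fh, $audio_offset + $audio_size - $n, 0;    # last n audio bytes
        read $fh, my $tail, $n;
        close $fh;

        my $checksum = Digest::MD5->new->add($head)->add($tail)->hexdigest;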

    I can think of one example: I have an album in two formats - the original, where songs are distinct tracks, and an enhanced version, where the songs are cross-faded.

    What about support for cue sheets, i.e. a single audio file that is chopped up into songs via a cue sheet? Each song references the same audio file, but with a different subset of the data. Does the duplicate checker take account of this, treating each track as if it were a separate file chopped into segments?

    So far, it has detected 113 checksum duplicates (all incorrect duplicates) out of 6856 songs (Duplicates: 0).

    I looked at the results, and there are groups that I assume it believes are duplicates, such as:

    64b80a25505a34d0c723dce617ced261-00d02343 M:\Music\Phil's Music\Blues\Canned Heat\Uncanned\01 - On The Road Again (Alternate Take).mp3
    64b80a25505a34d0c723dce617ced261-0079bc14 M:\Music\Phil's Music\Blues\Canned Heat\Uncanned\02 - Nine Below Zero.mp3
    [...in this case, all 45 songs on the album are listed...]
    64b80a25505a34d0c723dce617ced261-001930fa M:\Music\Phil's Music\Blues\Canned Heat\Uncanned\45 - Huh.mp3

    The first part, up to the "-", is always the same; the second part of the number is different. What does this mean?

    Phil

  6. #26 · erland · Senior Member · Sweden · Joined Dec 2005 · Posts: 10,942
    Quote Originally Posted by Philip Meyer
    The first part, up to the "-", is always the same; the second part of the number is different. What does this mean?
    The checksum (first part) is the same and the audio size (last part) isn't; combining them makes the identity unique.

  7. #27 · Philip Meyer · Senior Member · UK · Joined Apr 2005 · Posts: 5,587

    Need help to verify duplicate detection

    >Running this now - will confirm how long it takes to complete against a library of 35871 tracks when it's finished.
    >

    It finished (I left it running unattended). Not sure how long it took - I couldn't find anything in the log to indicate when it had actually finished, but it did take a significant amount of time. This will add to rescan times.

    Detecting using (number of bytes): 10000
    Detected: 35871
    Checksum duplicates: 1405 duplicates (Incorrect duplicates: 1325)
    Duplicates: 80

    It also highlighted a handful of genuine duplicates that I wasn't aware of (e.g. remix versions from old CD singles that are named differently but are actually the same thing).

    After I ignored duplicates that obviously weren't duplicates (e.g. due to cue sheets - see below), there were actually only two false positives: the checksum from the first 10,000 bytes matches and the song length is identical, but the songs are certainly not the same.

    >What about support for cue sheets, i.e. a single audio file that is chopped up into songs via a cue sheet? Each song references the same audio file, but with a different subset of the data. Does the duplicate checker take account of this, treating each track as if it were a separate file chopped into segments?
    >

    The answer to this is that every track that comes from the same source file in a cue sheet is detected as a duplicate. E.g., in the duplicates log file I have 26 lines of:

    71eb89d6770416109fca35e97cde57e8-34d5db7e M:\Music\Surround Sound\The Beatles\Love\Love.dts.flac


    I have a single .mov file that SBS understands (it just plays the audio content). It appears in the Duplicates log as:

    NOCHECKSUM-NOSIZE M:\Music\Phil's Music\Progressive Rock\Thom Yorke\Other\Rabbit In Your Headlights.mov

    I guess your code doesn't understand .mov files, but SBS can.

  8. #28 · erland · Senior Member · Sweden · Joined Dec 2005 · Posts: 10,942
    Quote Originally Posted by Philip Meyer
    >Running this now - will confirm how long it takes to complete against a library of 35871 tracks when it's finished.
    >

    It finished (I left it running unattended). Not sure how long it took - I couldn't find anything in the log to indicate when it had actually finished, but it did take a significant amount of time. This will add to rescan times.
    As long as the "Duplicates: 80" number doesn't increase, you can try decreasing the number of bytes used on the "Duplicate Detector" settings page. In my small FLAC library I was able to go as low as 680 bytes without any duplicates.


    Quote Originally Posted by Philip Meyer

    >What about support for cue sheets, i.e. a single audio file that is chopped up into songs via a cue sheet? Each song references the same audio file, but with a different subset of the data. Does the duplicate checker take account of this, treating each track as if it were a separate file chopped into segments?
    >

    The answer to this is that every track that comes from the same source file in a cue sheet is detected as a duplicate. E.g., in the duplicates log file I have 26 lines of:

    71eb89d6770416109fca35e97cde57e8-34d5db7e M:\Music\Surround Sound\The Beatles\Love\Love.dts.flac
    OK, this is a problem we need to solve.

    Andy, if you are reading this, is there any way to handle this in the Audio::Scan module?

    If not, is there some other metadata that could be added to the MD5, in a similar way as I did with the audio size, to make sure each track on a cue sheet gets a unique identity? Some track offset, maybe?

    Quote Originally Posted by Philip Meyer
    I have a single .mov file that SBS understands (it just plays the audio content). It appears in the Duplicates log as:

    NOCHECKSUM-NOSIZE M:\Music\Phil's Music\Progressive Rock\Thom Yorke\Other\Rabbit In Your Headlights.mov

    I guess your code doesn't understand .mov files, but SBS can.
    Andy, if you are reading this, is there any way to handle .mov files in the Audio::Scan module?

  9. #29 · andyg · Administrator · Pittsburgh, PA · Joined Jan 2006 · Posts: 7,395

    Need help to verify duplicate detection

    On Sep 6, 2010, at 11:47 PM, erland wrote:
    > OK, this is a problem we need to solve.
    >
    > Andy, if you are reading this, is there any way to handle this in the
    > Audio::Scan module?
    >
    > If not, is there some other metadata that could be added to the MD5, in
    > a similar way as I did with the audio size, to make sure each track on a
    > cue sheet gets a unique identity? Some track offset, maybe?


    OK, Audio::Scan could let you provide an alternate starting byte offset for the calculation; you could then run it on the same file with the start offset of each track (which is already stored in the database) and get different checksums.

    How about:

    Audio::Scan->scan_info( $file, { md5_size => 1024, md5_offset => $track_start } );

    FYI, in the next version of Audio::Scan the default md5_offset is determined by:

    audio_offset + (audio_size / 2) - (md5_size / 2);

    So a user-supplied md5_offset would just override that default.
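
    For cue sheets you could then loop over the tracks in a file and pass each one's stored start offset, something like this (a sketch; the offsets are invented, and I'm assuming the digest comes back in the info hash as audio_md5):

        use Audio::Scan ();

        my $file = 'album.flac';
        # Hypothetical per-track byte offsets, as stored in the database.
        my %track_start = ( 1 => 8_192, 2 => 3_145_728, 3 => 6_291_456 );

        for my $track ( sort { $a <=> $b } keys %track_start ) {
            my $s = Audio::Scan->scan_info( $file,
                { md5_size => 1024, md5_offset => $track_start{$track} } );
            printf "track %d: %s\n", $track, $s->{info}{audio_md5};
        }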

    > Philip Meyer wrote:
    >>
    >> I have a single .mov file that SBS understands (it just plays the
    >> audio content). It appears in the Duplicates log as:
    >>
    >> NOCHECKSUM-NOSIZE M:\Music\Phil's Music\Progressive Rock\Thom
    >> Yorke\Other\Rabbit In Your Headlights.mov
    >>
    >> I guess your code doesn't understand .mov files, but SBS can.
    >>

    > Andy, if you are reading this, is there any way to handle .mov files in
    > the Audio::Scan module?


    Hmm, if it's handled OK by SBS, that means Audio::Scan is being used - we don't use any other format modules now. Can you send me the file that doesn't work?

    http://wiki.slimdevices.com/index.php/Large_File_Upload


  10. #30 · erland · Senior Member · Sweden · Joined Dec 2005 · Posts: 10,942
    Quote Originally Posted by andyg
    On Sep 6, 2010, at 11:47 PM, erland wrote:
    > OK, this is a problem we need to solve.
    >
    > Andy, if you are reading this, is there any way to handle this in the
    > Audio::Scan module?
    >
    > If not, is there some other metadata that could be added to the MD5, in
    > a similar way as I did with the audio size, to make sure each track on a
    > cue sheet gets a unique identity? Some track offset, maybe?

    OK, Audio::Scan could let you provide an alternate starting byte offset for the calculation; you could then run it on the same file with the start offset of each track (which is already stored in the database) and get different checksums.

    How about:

    Audio::Scan->scan_info( $file, { md5_size => 1024, md5_offset => $track_start } );

    FYI, in the next version of Audio::Scan the default md5_offset is determined by:

    audio_offset + (audio_size / 2) - (md5_size / 2);
    Does this default mean that it will work with cue sheets without me specifying a specific md5_offset parameter?

    If not, is the offset in the database an offset into the compressed or the uncompressed audio data? If it's an offset into the compressed audio data, this should work.

    Is the audio_size of an individual cue sheet track also available in the database? I suppose I might need it to ensure that I don't include data from the previous or next track in the checksum calculation.
