Home of the Squeezebox™ & Transporter® network music players.
Results 1 to 3 of 3
  1. #1
    John A. Tamplin
    Guest

    Re: Primary key for Song: Strawman SQL databaseintegration thoughts

    On Fri, 13 Aug 2004, Pat Farrell wrote:

    > > However remote, there is the possibility that two
    > >different songs will hash to the same value.

    >
    > Yes, the probability is non-zero. But very small.
    > There are only about 500,000 songs on all the pop/rock/
    > r+b/jazz/americana/.... CDs in print. Classical is another
    > world, I don't have the numbers handy, but I expect that
    > it is less than the pop stuff.
    >
    > A 120 bit hash's collision expectation, including the birthday paradox,
    > is too small to worry about. But, I changed the schema just
    > to make you and Dale happy.


    I used MD5 to clean up an image collection and remove duplicates. I was
    surprised to find about a dozen collisions with non-identical images out
    of about 100,000 images. What I discovered was that very small images had
    a significantly higher chance of colliding since there were fixed headers
    that were largely the same and in these cases only 30-40 bytes of actual
    data that varied. I doubt there are any really short songs to cause this
    sort of problem -- however, I still think the database should allow
    duplicates in the hash. The import interface can warn the user of the
    duplication and ask what they want to do with the song, but you can
    imagine a user being very frustrated if they can't store some song just
    because the hash matches something else.

    > More importantly, and less theologically,
    > I want to know which songs are identical.
    > If I want to shuffle all my, say Crosby and Nash songs,
    > I don't want three copies of Marrakesh Express in the mix
    > just because it is on six of their greatest hits albums,
    > plus the original album, plus the box set.


    The problem is these six different versions of the song are unlikely to be
    bit-for-bit identical. Even ignoring the issue of reading the bits from
    the disc, the production of the audio is likely to not use the same bits.
    In my case, I have found the length to even vary slightly. So, you have
    to solve this another way anyway.

    > I have no problem with this approach. Given the changed
    > schema, are you happy?


    Yes, other than I would still like to see a non-unique index on the hash
    (or equivalently add the autoincrement key as the last component of that
    index).

    > I don't trust the meta data (specifically the tags) in any way
    > for any of my interesting music. For pop, sure. Classic rock
    > is nearly always right in both Freedb and CDDB. Jazz is
    > usually wrong, and classical works are a disaster.


    I clean the tags up myself as I find things that are incorrect, plus I
    have a lot of aliases that normalize the data and correct misspellings
    during ripping. If you don't fix the tag data and don't trust it, then
    you might as well not have it.

    > I don't see that in your initial posting, or don't understand what you mean.
    > To my mind, bootlegs are just like different masterings or mixing,
    > they are different songs. Of course, you can point
    > all the data in the artist, performer, composer, engineer and
    > other tables to one or many songs.


    You are mixing someone else's posting -- I never talked about separating
    performance from the idea of a song. I think it is reasonable to track
    that (for example, covers of a song), but I don't know if it is worth the
    extra level in the data to cover something that will have to be manually
    built since none of the metadata currently has the information necessary
    to build it.

    > You also posted there:
    > >I know some databases support strings of non-fixed length.
    > >For file paths that would seem like a good idea. (I don't recall
    > >if MySQL is one of those that does.) If there is a fixed limit
    > >it had better be as long as the filesystem allows a path to be.

    >
    > I know MySql has varchar, that is what the sql standard for variable
    > length specifies. The schemas reflect varchar where
    > it made sense to me. I don't know the details of the package Vidur
    > prefered, but any decent RDBMS handles them.


    Likewise, this wasn't me. Any database always has a fixed limit on a
    variable string, even if it 16G or something. Some databases will
    actually allocate the maximum space and varchar just means that it keeps
    track of the length for you, so you don't want to make it outrageously
    large.

    Some databases also pay a penalty for using varchar (for example,
    fixed-length portions are stored in the main record, variable-length
    portions are allocated elsewhere and accessed via a pointer in that row),
    and on those you may want to specify a minimum reserved length that will
    get 90% of the rows. However, the DDL is probably going to need to be
    database-specific anyway, so that can be handled at schema creation time
    for the appropriate database. Also, if a user needs to increase it they
    can.

    > Some DBMS packages have arbitrary limits to the
    > size of varchar fields. Maximums of 255 and 511 are common,
    > I think Oracle allows 2047.
    > At some point, it becomes a doctor, doctor, problem.


    Yes, I would suggest a maximum length of any varchar as 254 (some DBs
    limit it to 255 including the length byte and use the length value 255 to
    denote null [as opposed to 0 being a string containing no characters]).

    --
    John A. Tamplin jat (AT) jaet (DOT) org
    770/436-5387 HOME 4116 Manson Ave
    Smyrna, GA 30082-3723

  2. #2
    Peter Speck
    Guest

    Re: Primary key for Song: Strawman SQL databaseintegration thoughts

    Hi,

    Warning, here comes a nitpick email...


    > Yes, I would suggest a maximum length of any varchar as 254 (some DBs
    > limit it to 255 including the length byte and use the length value 255
    > to
    > denote null [as opposed to 0 being a string containing no characters]).


    Sounds like the "640K ought to be enough for everyone" syndrome.

    Checking my music collection for path name lengths, I get:
    Length of pathname / number of files:
    279: 1
    253: 1
    216: 1
    211: 1
    208: 2
    207: 1
    206: 1
    204: 1
    203: 1
    202: 1
    197: 1
    196: 1
    195: 2
    194: 1
    191: 1
    190: 1
    189: 1
    187: 1
    186: 2
    182: 2
    181: 4
    176: 1
    ....
    find /Volumes/frede/Music/ | perl -e '$h{length($_)-1}++ while (<>);
    print "$_: $h{$_}\n" foreach ( sort { $b <=> $a } keys %h );'

    Longest path (utf-8), created by iTunes:
    /Volumes/frede/Music/Woldemar Nelson, Orchester der Bayreuther
    Festspiele and Chor der Bayreuther Festspiele/Der fliegende
    Hollńnder/1-05 Rezitativ und Arie_ Wie oft in Meeres tiefsten Schlund _
    Dich frage ich, gepriesener Engel Gottes _ Nur eine Hoffnung soll mir
    bleiben.aif

    Vidur chose SQLite. The documentation says: "SQLite is typeless. The
    datatype for the column does not restrict what data may be put in that
    column. All information is stored as null-terminated strings."
    http://www.sqlite.org/datatype3.html
    http://www.sqlite.org/lang.html#createtable

    It seems like it treats char(n), varchar(n) and text as the same. I
    can't find anywhere in the documentation where it says if it enforces
    the length restriction, and I haven't looked into the source, nor
    installed/tested it, so I don't know.

    AFAIK PostgreSQL and MySQL stores text and varchar(n) very efficient
    but DB2 is horrible ("variable-length portions are allocated
    elsewhere") and Oracle doesn't handle text in where clauses.

    So my suggestion would be to make it "text" so it just works perfectly
    with SlimDevices preferred database (and be as simple as possible), and
    people wanting to use clustered Oracle etc would have to run the schema
    files through a converter... (that would be the simplest part of using
    an Oracle database...) No need to overdesign.

    > create table song (
    > songKey int(10) auto_increment primary key,
    > songHash char (28) unique,


    Would then become:
    create table song (
    songKey int auto_increment primary key,
    songHash text unique,
    somePath text,

    Note that SQLite resizes an integer column automatically depending on
    the value, so no size specification is needed. Negative values uses 9
    bytes!

    My last nit-pick would be the inconsistent naming of columns: songKey
    versus performerid (both int pritmary key).

    To make it easier to read when made all lower/uppercased (SQL databases
    are caseinsensitive), I would suggest using underscore for
    word-delimiter and always use _id for unique keys: song_id,
    performer_id, song_hash, file_path

    ----
    - Peter Speck

  3. #3
    Jack Coates
    Guest

    Re: Primary key for Song: Strawman SQLdatabaseintegration thoughts

    > Hi,
    >
    > Warning, here comes a nitpick email...
    >
    >
    >> Yes, I would suggest a maximum length of any varchar as 254 (some DBs
    >> limit it to 255 including the length byte and use the length value 255
    >> to
    >> denote null [as opposed to being a string containing no characters]).

    >
    > Sounds like the "640K ought to be enough for everyone" syndrome.

    ....
    > Longest path (utf-8), created by iTunes:
    > /Volumes/frede/Music/Woldemar Nelson, Orchester der Bayreuther
    > Festspiele and Chor der Bayreuther Festspiele/Der fliegende
    > Hollńnder/1-05 Rezitativ und Arie_ Wie oft in Meeres tiefsten Schlund _
    > Dich frage ich, gepriesener Engel Gottes _ Nur eine Hoffnung soll mir
    > bleiben.aif
    >


    This is going to be a lot more common with classical I'd think, but your
    point stands. FWIW, my collection yields:
    [jack@felix jack]$ find /mnt/music/ | perl -e '$h{length($_)-1}++ while
    (<>); print "$_: $h{$_}\n" foreach ( sort { $b <=> $a } keys %h );' | sort
    -n | tail
    173: 3
    174: 1
    177: 1
    178: 1
    180: 1
    182: 1
    184: 1
    189: 1
    199: 1
    200: 2
    [jack@felix jack]$
    ....
    ....
    >> create table song (
    >> songKey int(10) auto_increment primary key,
    >> songHash char (28) unique,

    >
    > Would then become:
    > create table song (
    > songKey int auto_increment primary key,
    > songHash text unique,
    > somePath text,
    >


    As someone else mentioned, using "track" instead of "song" makes a lot of
    sense. Not worth losing sleep over, but this is the time to make the
    change.

    > Note that SQLite resizes an integer column automatically depending on
    > the value, so no size specification is needed. Negative values uses 9
    > bytes!
    >
    > My last nit-pick would be the inconsistent naming of columns: songKey
    > versus performerid (both int pritmary key).
    >
    > To make it easier to read when made all lower/uppercased (SQL databases
    > are caseinsensitive), I would suggest using underscore for
    > word-delimiter and always use _id for unique keys: song_id,
    > performer_id, song_hash, file_path
    >


    Not to pick nits in your nit-pick, but MS SQL case-sensitivity is an
    option, and maybe half of the databases I've seen in the wild have it
    turned on.
    --
    Jack At Monkeynoodle.Org: It's A Scientific Venture...
    "Believe what you're told; there'd be chaos if everyone thought for
    themselves." -- Top Dog hotdog stand, Berkeley, CA

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •