PDA

View Full Version : Re: Primary key for Song: Strawman SQL databaseintegration thoughts



John A. Tamplin
2004-08-13, 05:25
On Fri, 13 Aug 2004, Pat Farrell wrote:

> > However remote, there is the possibility that two
> >different songs will hash to the same value.
>
> Yes, the probability is non-zero. But very small.
> There are only about 500,000 songs on all the pop/rock/
> r+b/jazz/americana/.... CDs in print. Classical is another
> world, I don't have the numbers handy, but I expect that
> it is less than the pop stuff.
>
> A 120 bit hash's collision expectation, including the birthday paradox,
> is too small to worry about. But, I changed the schema just
> to make you and Dale happy.

I used MD5 to clean up an image collection and remove duplicates. I was
surprised to find about a dozen collisions with non-identical images out
of about 100,000 images. What I discovered was that very small images had
a significantly higher chance of colliding since there were fixed headers
that were largely the same and in these cases only 30-40 bytes of actual
data that varied. I doubt there are any really short songs to cause this
sort of problem -- however, I still think the database should allow
duplicates in the hash. The import interface can warn the user of the
duplication and ask what they want to do with the song, but you can
imagine a user being very frustrated if they can't store some song just
because the hash matches something else.

> More importantly, and less theologically,
> I want to know which songs are identical.
> If I want to shuffle all my, say Crosby and Nash songs,
> I don't want three copies of Marrakesh Express in the mix
> just because it is on six of their greatest hits albums,
> plus the original album, plus the box set.

The problem is these six different versions of the song are unlikely to be
bit-for-bit identical. Even ignoring the issue of reading the bits from
the disc, the production of the audio is likely to not use the same bits.
In my case, I have found the length to even vary slightly. So, you have
to solve this another way anyway.

> I have no problem with this approach. Given the changed
> schema, are you happy?

Yes, other than I would still like to see a non-unique index on the hash
(or equivalently add the autoincrement key as the last component of that
index).

> I don't trust the meta data (specifically the tags) in any way
> for any of my interesting music. For pop, sure. Classic rock
> is nearly always right in both Freedb and CDDB. Jazz is
> usually wrong, and classical works are a disaster.

I clean the tags up myself as I find things that are incorrect, plus I
have a lot of aliases that normalize the data and correct misspellings
during ripping. If you don't fix the tag data and don't trust it, then
you might as well not have it.

> I don't see that in your initial posting, or don't understand what you mean.
> To my mind, bootlegs are just like different masterings or mixing,
> they are different songs. Of course, you can point
> all the data in the artist, performer, composer, engineer and
> other tables to one or many songs.

You are mixing someone else's posting -- I never talked about separating
performance from the idea of a song. I think it is reasonable to track
that (for example, covers of a song), but I don't know if it is worth the
extra level in the data to cover something that will have to be manually
built since none of the metadata currently has the information necessary
to build it.

> You also posted there:
> >I know some databases support strings of non-fixed length.
> >For file paths that would seem like a good idea. (I don't recall
> >if MySQL is one of those that does.) If there is a fixed limit
> >it had better be as long as the filesystem allows a path to be.
>
> I know MySql has varchar, that is what the sql standard for variable
> length specifies. The schemas reflect varchar where
> it made sense to me. I don't know the details of the package Vidur
> prefered, but any decent RDBMS handles them.

Likewise, this wasn't me. Any database always has a fixed limit on a
variable string, even if it 16G or something. Some databases will
actually allocate the maximum space and varchar just means that it keeps
track of the length for you, so you don't want to make it outrageously
large.

Some databases also pay a penalty for using varchar (for example,
fixed-length portions are stored in the main record, variable-length
portions are allocated elsewhere and accessed via a pointer in that row),
and on those you may want to specify a minimum reserved length that will
get 90% of the rows. However, the DDL is probably going to need to be
database-specific anyway, so that can be handled at schema creation time
for the appropriate database. Also, if a user needs to increase it they
can.

> Some DBMS packages have arbitrary limits to the
> size of varchar fields. Maximums of 255 and 511 are common,
> I think Oracle allows 2047.
> At some point, it becomes a doctor, doctor, problem.

Yes, I would suggest a maximum length of any varchar as 254 (some DBs
limit it to 255 including the length byte and use the length value 255 to
denote null [as opposed to 0 being a string containing no characters]).

--
John A. Tamplin jat (AT) jaet (DOT) org
770/436-5387 HOME 4116 Manson Ave
Smyrna, GA 30082-3723

Peter Speck
2004-08-13, 07:52
Hi,

Warning, here comes a nitpick email...


> Yes, I would suggest a maximum length of any varchar as 254 (some DBs
> limit it to 255 including the length byte and use the length value 255
> to
> denote null [as opposed to 0 being a string containing no characters]).

Sounds like the "640K ought to be enough for everyone" syndrome.

Checking my music collection for path name lengths, I get:
Length of pathname / number of files:
279: 1
253: 1
216: 1
211: 1
208: 2
207: 1
206: 1
204: 1
203: 1
202: 1
197: 1
196: 1
195: 2
194: 1
191: 1
190: 1
189: 1
187: 1
186: 2
182: 2
181: 4
176: 1
....
find /Volumes/frede/Music/ | perl -e '$h{length($_)-1}++ while (<>);
print "$_: $h{$_}\n" foreach ( sort { $b <=> $a } keys %h );'

Longest path (utf-8), created by iTunes:
/Volumes/frede/Music/Woldemar Nelson, Orchester der Bayreuther
Festspiele and Chor der Bayreuther Festspiele/Der fliegende
Holländer/1-05 Rezitativ und Arie_ Wie oft in Meeres tiefsten Schlund _
Dich frage ich, gepriesener Engel Gottes _ Nur eine Hoffnung soll mir
bleiben.aif

Vidur chose SQLite. The documentation says: "SQLite is typeless. The
datatype for the column does not restrict what data may be put in that
column. All information is stored as null-terminated strings."
http://www.sqlite.org/datatype3.html
http://www.sqlite.org/lang.html#createtable

It seems like it treats char(n), varchar(n) and text as the same. I
can't find anywhere in the documentation where it says if it enforces
the length restriction, and I haven't looked into the source, nor
installed/tested it, so I don't know.

AFAIK PostgreSQL and MySQL stores text and varchar(n) very efficient
but DB2 is horrible ("variable-length portions are allocated
elsewhere") and Oracle doesn't handle text in where clauses.

So my suggestion would be to make it "text" so it just works perfectly
with SlimDevices preferred database (and be as simple as possible), and
people wanting to use clustered Oracle etc would have to run the schema
files through a converter... (that would be the simplest part of using
an Oracle database...) No need to overdesign.

> create table song (
> songKey int(10) auto_increment primary key,
> songHash char (28) unique,

Would then become:
create table song (
songKey int auto_increment primary key,
songHash text unique,
somePath text,

Note that SQLite resizes an integer column automatically depending on
the value, so no size specification is needed. Negative values uses 9
bytes!

My last nit-pick would be the inconsistent naming of columns: songKey
versus performerid (both int pritmary key).

To make it easier to read when made all lower/uppercased (SQL databases
are caseinsensitive), I would suggest using underscore for
word-delimiter and always use _id for unique keys: song_id,
performer_id, song_hash, file_path

----
- Peter Speck

Jack Coates
2004-08-13, 09:22
> Hi,
>
> Warning, here comes a nitpick email...
>
>
>> Yes, I would suggest a maximum length of any varchar as 254 (some DBs
>> limit it to 255 including the length byte and use the length value 255
>> to
>> denote null [as opposed to being a string containing no characters]).
>
> Sounds like the "640K ought to be enough for everyone" syndrome.
....
> Longest path (utf-8), created by iTunes:
> /Volumes/frede/Music/Woldemar Nelson, Orchester der Bayreuther
> Festspiele and Chor der Bayreuther Festspiele/Der fliegende
> Holländer/1-05 Rezitativ und Arie_ Wie oft in Meeres tiefsten Schlund _
> Dich frage ich, gepriesener Engel Gottes _ Nur eine Hoffnung soll mir
> bleiben.aif
>

This is going to be a lot more common with classical I'd think, but your
point stands. FWIW, my collection yields:
[jack@felix jack]$ find /mnt/music/ | perl -e '$h{length($_)-1}++ while
(<>); print "$_: $h{$_}\n" foreach ( sort { $b <=> $a } keys %h );' | sort
-n | tail
173: 3
174: 1
177: 1
178: 1
180: 1
182: 1
184: 1
189: 1
199: 1
200: 2
[jack@felix jack]$
....
....
>> create table song (
>> songKey int(10) auto_increment primary key,
>> songHash char (28) unique,
>
> Would then become:
> create table song (
> songKey int auto_increment primary key,
> songHash text unique,
> somePath text,
>

As someone else mentioned, using "track" instead of "song" makes a lot of
sense. Not worth losing sleep over, but this is the time to make the
change.

> Note that SQLite resizes an integer column automatically depending on
> the value, so no size specification is needed. Negative values uses 9
> bytes!
>
> My last nit-pick would be the inconsistent naming of columns: songKey
> versus performerid (both int pritmary key).
>
> To make it easier to read when made all lower/uppercased (SQL databases
> are caseinsensitive), I would suggest using underscore for
> word-delimiter and always use _id for unique keys: song_id,
> performer_id, song_hash, file_path
>

Not to pick nits in your nit-pick, but MS SQL case-sensitivity is an
option, and maybe half of the databases I've seen in the wild have it
turned on.
--
Jack At Monkeynoodle.Org: It's A Scientific Venture...
"Believe what you're told; there'd be chaos if everyone thought for
themselves." -- Top Dog hotdog stand, Berkeley, CA