John A. Tamplin
2004-08-13, 05:25
On Fri, 13 Aug 2004, Pat Farrell wrote:
> > However remote, there is the possibility that two
> >different songs will hash to the same value.
>
> Yes, the probability is non-zero. But very small.
> There are only about 500,000 songs on all the pop/rock/
> r+b/jazz/americana/.... CDs in print. Classical is another
> world, I don't have the numbers handy, but I expect that
> it is less than the pop stuff.
>
> A 120 bit hash's collision expectation, including the birthday paradox,
> is too small to worry about. But, I changed the schema just
> to make you and Dale happy.
I used MD5 to clean up an image collection and remove duplicates. I was
surprised to find about a dozen collisions with non-identical images out
of about 100,000 images. What I discovered was that very small images had
a significantly higher chance of colliding since there were fixed headers
that were largely the same and in these cases only 30-40 bytes of actual
data that varied. I doubt there are any really short songs to cause this
sort of problem -- however, I still think the database should allow
duplicates in the hash. The import interface can warn the user of the
duplication and ask what they want to do with the song, but you can
imagine a user being very frustrated if they can't store some song just
because the hash matches something else.
> More importantly, and less theologically,
> I want to know which songs are identical.
> If I want to shuffle all my, say Crosby and Nash songs,
> I don't want three copies of Marrakesh Express in the mix
> just because it is on six of their greatest hits albums,
> plus the original album, plus the box set.
The problem is these six different versions of the song are unlikely to be
bit-for-bit identical. Even ignoring the issue of reading the bits from
the disc, the production of the audio is likely to not use the same bits.
In my case, I have found the length to even vary slightly. So, you have
to solve this another way anyway.
> I have no problem with this approach. Given the changed
> schema, are you happy?
Yes, other than I would still like to see a non-unique index on the hash
(or equivalently add the autoincrement key as the last component of that
index).
> I don't trust the meta data (specifically the tags) in any way
> for any of my interesting music. For pop, sure. Classic rock
> is nearly always right in both Freedb and CDDB. Jazz is
> usually wrong, and classical works are a disaster.
I clean the tags up myself as I find things that are incorrect, plus I
have a lot of aliases that normalize the data and correct misspellings
during ripping. If you don't fix the tag data and don't trust it, then
you might as well not have it.
> I don't see that in your initial posting, or don't understand what you mean.
> To my mind, bootlegs are just like different masterings or mixing,
> they are different songs. Of course, you can point
> all the data in the artist, performer, composer, engineer and
> other tables to one or many songs.
You are mixing someone else's posting -- I never talked about separating
performance from the idea of a song. I think it is reasonable to track
that (for example, covers of a song), but I don't know if it is worth the
extra level in the data to cover something that will have to be manually
built since none of the metadata currently has the information necessary
to build it.
> You also posted there:
> >I know some databases support strings of non-fixed length.
> >For file paths that would seem like a good idea. (I don't recall
> >if MySQL is one of those that does.) If there is a fixed limit
> >it had better be as long as the filesystem allows a path to be.
>
> I know MySql has varchar, that is what the sql standard for variable
> length specifies. The schemas reflect varchar where
> it made sense to me. I don't know the details of the package Vidur
> prefered, but any decent RDBMS handles them.
Likewise, this wasn't me. Any database always has a fixed limit on a
variable string, even if it 16G or something. Some databases will
actually allocate the maximum space and varchar just means that it keeps
track of the length for you, so you don't want to make it outrageously
large.
Some databases also pay a penalty for using varchar (for example,
fixed-length portions are stored in the main record, variable-length
portions are allocated elsewhere and accessed via a pointer in that row),
and on those you may want to specify a minimum reserved length that will
get 90% of the rows. However, the DDL is probably going to need to be
database-specific anyway, so that can be handled at schema creation time
for the appropriate database. Also, if a user needs to increase it they
can.
> Some DBMS packages have arbitrary limits to the
> size of varchar fields. Maximums of 255 and 511 are common,
> I think Oracle allows 2047.
> At some point, it becomes a doctor, doctor, problem.
Yes, I would suggest a maximum length of any varchar as 254 (some DBs
limit it to 255 including the length byte and use the length value 255 to
denote null [as opposed to 0 being a string containing no characters]).
--
John A. Tamplin jat (AT) jaet (DOT) org
770/436-5387 HOME 4116 Manson Ave
Smyrna, GA 30082-3723
> > However remote, there is the possibility that two
> >different songs will hash to the same value.
>
> Yes, the probability is non-zero. But very small.
> There are only about 500,000 songs on all the pop/rock/
> r+b/jazz/americana/.... CDs in print. Classical is another
> world, I don't have the numbers handy, but I expect that
> it is less than the pop stuff.
>
> A 120 bit hash's collision expectation, including the birthday paradox,
> is too small to worry about. But, I changed the schema just
> to make you and Dale happy.
I used MD5 to clean up an image collection and remove duplicates. I was
surprised to find about a dozen collisions with non-identical images out
of about 100,000 images. What I discovered was that very small images had
a significantly higher chance of colliding since there were fixed headers
that were largely the same and in these cases only 30-40 bytes of actual
data that varied. I doubt there are any really short songs to cause this
sort of problem -- however, I still think the database should allow
duplicates in the hash. The import interface can warn the user of the
duplication and ask what they want to do with the song, but you can
imagine a user being very frustrated if they can't store some song just
because the hash matches something else.
> More importantly, and less theologically,
> I want to know which songs are identical.
> If I want to shuffle all my, say Crosby and Nash songs,
> I don't want three copies of Marrakesh Express in the mix
> just because it is on six of their greatest hits albums,
> plus the original album, plus the box set.
The problem is these six different versions of the song are unlikely to be
bit-for-bit identical. Even ignoring the issue of reading the bits from
the disc, the production of the audio is likely to not use the same bits.
In my case, I have found the length to even vary slightly. So, you have
to solve this another way anyway.
> I have no problem with this approach. Given the changed
> schema, are you happy?
Yes, other than I would still like to see a non-unique index on the hash
(or equivalently add the autoincrement key as the last component of that
index).
> I don't trust the meta data (specifically the tags) in any way
> for any of my interesting music. For pop, sure. Classic rock
> is nearly always right in both Freedb and CDDB. Jazz is
> usually wrong, and classical works are a disaster.
I clean the tags up myself as I find things that are incorrect, plus I
have a lot of aliases that normalize the data and correct misspellings
during ripping. If you don't fix the tag data and don't trust it, then
you might as well not have it.
> I don't see that in your initial posting, or don't understand what you mean.
> To my mind, bootlegs are just like different masterings or mixing,
> they are different songs. Of course, you can point
> all the data in the artist, performer, composer, engineer and
> other tables to one or many songs.
You are mixing someone else's posting -- I never talked about separating
performance from the idea of a song. I think it is reasonable to track
that (for example, covers of a song), but I don't know if it is worth the
extra level in the data to cover something that will have to be manually
built since none of the metadata currently has the information necessary
to build it.
> You also posted there:
> >I know some databases support strings of non-fixed length.
> >For file paths that would seem like a good idea. (I don't recall
> >if MySQL is one of those that does.) If there is a fixed limit
> >it had better be as long as the filesystem allows a path to be.
>
> I know MySql has varchar, that is what the sql standard for variable
> length specifies. The schemas reflect varchar where
> it made sense to me. I don't know the details of the package Vidur
> prefered, but any decent RDBMS handles them.
Likewise, this wasn't me. Any database always has a fixed limit on a
variable string, even if it 16G or something. Some databases will
actually allocate the maximum space and varchar just means that it keeps
track of the length for you, so you don't want to make it outrageously
large.
Some databases also pay a penalty for using varchar (for example,
fixed-length portions are stored in the main record, variable-length
portions are allocated elsewhere and accessed via a pointer in that row),
and on those you may want to specify a minimum reserved length that will
get 90% of the rows. However, the DDL is probably going to need to be
database-specific anyway, so that can be handled at schema creation time
for the appropriate database. Also, if a user needs to increase it they
can.
> Some DBMS packages have arbitrary limits to the
> size of varchar fields. Maximums of 255 and 511 are common,
> I think Oracle allows 2047.
> At some point, it becomes a doctor, doctor, problem.
Yes, I would suggest a maximum length of any varchar as 254 (some DBs
limit it to 255 including the length byte and use the length value 255 to
denote null [as opposed to 0 being a string containing no characters]).
--
John A. Tamplin jat (AT) jaet (DOT) org
770/436-5387 HOME 4116 Manson Ave
Smyrna, GA 30082-3723