Re: Strawman SQL database integration thoughts



Pat Farrell
2004-08-12, 10:45
At 01:29 PM 8/12/2004, "Jack Coates" <jack (AT) monkeynoodle (DOT) org> wrote:
>all three seem unacceptable to me, for reasons you note (and additionally,
>the performance of option 1 will be awful).

I disagree. Hashing is no slower than reading the file,
and it is only done occasionally. Library maintenance is a very low
frequency occurrence. The hash function is so fast that the
I/O time (especially in Perl or Java) dominates it.

Try it, you'll see it isn't a major issue.

If you want to optimize things, you can do a SHA1 of the directory
information, keep it, and don't recalculate the songs if the directory
info is unchanged.
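
A minimal Python sketch of that directory-level shortcut. The function names, the in-memory cache dict, and the choice of name/size/mtime as the "directory information" are all assumptions for illustration, not part of the strawman schema:

```python
import hashlib
import os


def directory_fingerprint(path):
    """Return a SHA-1 hex digest over the names, sizes and mtimes in one directory.

    Assumption: name, size and modification time are enough to notice
    that a song file was added, removed or rewritten.
    """
    digest = hashlib.sha1()
    with os.scandir(path) as entries:
        # Sort by name so the digest does not depend on listing order.
        for entry in sorted(entries, key=lambda e: e.name):
            info = entry.stat()
            record = f"{entry.name}\0{info.st_size}\0{info.st_mtime_ns}\n"
            digest.update(record.encode("utf-8", "surrogateescape"))
    return digest.hexdigest()


def needs_rescan(path, cache):
    """Return True (and remember the new fingerprint) if the directory changed."""
    current = directory_fingerprint(path)
    if cache.get(path) == current:
        return False
    cache[path] = current
    return True
```

Only directories for which needs_rescan() returns True would have their songs rehashed. This sketch looks at one directory level only; subdirectories would each need their own fingerprint.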


The problem with an auto-increment key for the song ID (which I use
in several other places in the strawman schema) is that you want
identical songs to have the same songID. We can argue about
what is "identical", and I threw out three possible definitions.

Seems to me that if the Greatest Hits album contains the same (exactly) song
as the main album, then it is only one song, and one songID.
If it is another take of the song, another mix, master, etc. then the
bits will be different and the hash will be different.
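
For illustration, here is one way the "same bits, same songID" idea could look in Python. The function name song_id and the use of SHA-1 over the whole file are assumptions; the thread does not settle which hash or which bytes define identity:

```python
import hashlib


def song_id(path, chunk_size=1 << 20):
    """Return a hex SHA-1 of the file's bytes, read in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two byte-identical files get the same ID. Note that hashing the whole file also covers any embedded tags, so editing a tag would change the ID under this definition.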


Pat

Jack Coates
2004-08-12, 11:05
> At 01:29 PM 8/12/2004, "Jack Coates" <jack (AT) monkeynoodle (DOT) org> wrote:
>>all three seem unacceptable to me, for reasons you note (and
>> additionally,
>>the performance of option 1 will be awful).
>
> I disagree. The performance is no slower than reading the file
> and is only done occasionally. Library maintenance is a very low
> frequency occurrence. The hash function is so fast that the
> IO time (especially in Perl or Java) overwhelms it.
>

[jack@felix jack]$ cd /mnt/music/They_Might_Be_Giants/Mink\ Car/
[jack@felix Mink Car]$ time md5sum 03_Man\,\ It\'s\ So\ Loud\ In\ Here.mp3
a990c8b1e883c3da078ca1517eb49452 03_Man, It's So Loud In Here.mp3
0.11user 0.02system 0:00.12elapsed 101%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (143major+16minor)pagefaults 0swaps
[jack@felix Mink Car]$ cd ../../dj_Cheb_i_Sabbah/Shri_Durga/
[jack@felix Shri_Durga]$ time md5sum dj_Cheb_i_Sabbah_-_Shri_Durga.mp3
1a83aee34339ab9ed6908946e157c161 dj_Cheb_i_Sabbah_-_Shri_Durga.mp3
0.17user 0.04system 0:00.27elapsed 75%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (142major+16minor)pagefaults 0swaps
[jack@felix Shri_Durga]$ cd ../../Henryk_Gorecki/Symphony_No_3_Op_36_1976/
[jack@felix Symphony_No_3_Op_36_1976]$ time md5sum Henryk_Gorecki_-_Lento__CantabileSemplice.mp3
64927f03b3d25d93581b01771bdeeb37 Henryk_Gorecki_-_Lento__CantabileSemplice.mp3
0.36user 0.04system 0:00.43elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (143major+16minor)pagefaults 0swaps
> Try it, you'll see it isn't a major issue.

Okay, I can be convinced. Those are a 3:30, an 11:30, and a 30:00 track,
hashed in 0.12, 0.27, and 0.43 seconds elapsed.

>
> If you want to optimize things, you can do a SHA1 of the directory
> information, keep it, and don't recalculate the songs if the directory
> info is unchanged.
>
>
> The problem with an auto increment key for the song id [ which I use
> in several other places in the strawman schema], is that you want
> identical songs to have the same songID. We can argue about
> what is "identical" and I threw out three possible definitions.
>
> Seems to me that if the Greatest Hits album contains the same (exactly)
> song
> as the main album, then it is only one song, and one songID.
> If it is another take of the song, another mix, master, etc. then the
> bits will be different and the hash will be different.
>

What if they were ripped at different bitrates though?

--
Jack At Monkeynoodle.Org: It's A Scientific Venture...
"Believe what you're told; there'd be chaos if everyone thought for
themselves." -- Top Dog hotdog stand, Berkeley, CA

Dale E Martin
2004-08-12, 11:09
> The problem with an auto increment key for the song id [ which I use
> in several other places in the strawman schema], is that you want
> identical songs to have the same songID. We can argue about
> what is "identical" and I threw out three possible definitions.

Is a FLAC rip of a song (or "track" as was suggested) equivalent to an mp3
rip? It's still the same song, but the bytes will look very different.
Same with varying bit rates and a variety of other factors. I think it's
going to be up to a person to decide in the long run.

Later,
Dale
--
Dale E. Martin, Clifton Labs, Inc.
Senior Computer Engineer
dmartin (AT) cliftonlabs (DOT) com
http://www.cliftonlabs.com
pgp key available

Michael Brouwer
2004-08-12, 11:32
FLAC files actually contain the MD5 hash of the original uncompressed
audio as metadata, which would be perfect for uniquely identifying a
particular song (though I think track is a better name).
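
As a rough sketch of reading that value without any FLAC library: the MD5 sits in the last 16 bytes of the 34-byte STREAMINFO block, which is the first metadata block after the "fLaC" marker. The function name is an assumption, and the sketch assumes the file starts directly with the marker (no ID3 tag prepended):

```python
def flac_audio_md5(path):
    """Return the hex MD5 of the unencoded audio stored in a FLAC STREAMINFO block."""
    with open(path, "rb") as handle:
        if handle.read(4) != b"fLaC":
            raise ValueError("not a FLAC stream (missing fLaC marker)")
        header = handle.read(4)
        if len(header) != 4:
            raise ValueError("truncated metadata block header")
        block_type = header[0] & 0x7F              # top bit is the last-block flag
        length = int.from_bytes(header[1:4], "big")
        if block_type != 0 or length < 34:
            raise ValueError("first metadata block is not STREAMINFO")
        body = handle.read(length)
        if len(body) != length:
            raise ValueError("truncated STREAMINFO block")
    # Bytes 0-17 hold block sizes, frame sizes, sample rate, channels,
    # bits per sample and total samples; bytes 18-33 hold the MD5.
    return body[18:34].hex()
```

An all-zero result means the encoder did not store a signature, so it should not be treated as an identifier.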

Besides tracks and albums I'd really like to see the addition of a
group of tracks to the schema. Many classical CDs consist of multiple
tracks that together form a piece/work/group (call it what you like).
This is also true for some pop CDs though it's less common there.

Michael


On Aug 12, 2004, at 11:09 AM, Dale E Martin wrote:

>> The problem with an auto increment key for the song id [ which I use
>> in several other places in the strawman schema], is that you want
>> identical songs to have the same songID. We can argue about
>> what is "identical" and I threw out three possible definitions.
>
> Is a flac rip of song (or "track" as was suggested) equivalent to an
> mp3
> rip? It's still the same song, but the bytes will look very different.
> Same with varying bit rates and a variety of other factors. I think
> it's
> going to be up to a person to decide in the long run.
>
> Later,
> Dale
> --
> Dale E. Martin, Clifton Labs, Inc.
> Senior Computer Engineer
> dmartin (AT) cliftonlabs (DOT) com
> http://www.cliftonlabs.com
> pgp key available
>

Robert Moser
2004-08-12, 11:39
Dale E Martin wrote:
>>The problem with an auto increment key for the song id [ which I use
>>in several other places in the strawman schema], is that you want
>>identical songs to have the same songID. We can argue about
>>what is "identical" and I threw out three possible definitions.
>
>
> Is a flac rip of song (or "track" as was suggested) equivalent to an mp3
> rip? It's still the same song, but the bytes will look very different.
> Same with varying bit rates and a variety of other factors. I think it's
> going to be up to a person to decide in the long run.
>
> Later,
> Dale
Not only that, but if the CD is not perfect, you could possibly get
different data for two rips of the same song, even with everything else
left the same. Even if the exact same song is on two different CDs, I
really doubt they are going to be bit-exact after ripping.

Dale E Martin
2004-08-12, 11:51
> Flac files actually contain the md5 hash of the original uncompressed
> bytes as metadata which would be perfect to uniquely identify a
> particular song (though I think track is a better name).

Even in that case, the bytes of the original won't always be equivalent.
What if you downloaded from a source that watermarked the audio after
ripping it?

Also, two rips of the same track using the same drive twice in a row won't
always give you identical data. It depends on the drive - in particular
if your media is not perfect when you do the ripping. Here's one reference
about this from the cdparanoia FAQ:
http://www.xiph.org/paranoia/faq.html#diff

I've got a CD-Rom drive here at the office that will rip through a severely
scratched CD and cdparanoia won't see a single error - and the track will
sound very bad when it's done. Presumably, on multiple reads, the drive is
just giving back (bad) buffered data.

Other rippers don't even try to get it perfect.

I don't think that comparing bytes (or hashed bytes) is going to do what
you want.

Thanks,
Dale
--
Dale E. Martin, Clifton Labs, Inc.
Senior Computer Engineer
dmartin (AT) cliftonlabs (DOT) com
http://www.cliftonlabs.com
pgp key available

Michael Brouwer
2004-08-12, 15:40
That's really because cdparanoia isn't bit accurate. If you use EAC
in secure mode to rip your CDs and you calibrate the drive with accurip
first, you will actually get the exact same track with the exact same
MD5 hash each time (even if you rip it on different machines/drives),
assuming EAC reported no read errors. I've verified this
myself.

Michael

On Aug 12, 2004, at 11:51 AM, Dale E Martin wrote:

>> Flac files actually contain the md5 hash of the original uncompressed
>> bytes as metadata which would be perfect to uniquely identify a
>> particular song (though I think track is a better name).
>
> Even in that case, the bytes of the original won't always be
> equivalent.
> What if you downloaded from a source that watermarked the audio after
> ripping it?
>
> Also, two rips of the same track using the same drive twice in a row
> won't
> always give you identical data. It depends on the drive - in
> particular
> if your media is not perfect when you do the ripping. Here's one
> reference
> about this from the cdparanoia FAQ:
> http://www.xiph.org/paranoia/faq.html#diff
>
> I've got a CD-Rom drive here at the office that will rip through a
> severely
> scratched CD and cdparanoia won't see a single error - and the track
> will
> sound very bad when it's done. Presumably, on multiple reads, the
> drive is
> just giving back (bad) buffered data.
>
> Other rippers don't even try to get it perfect.
>
> I don't think that comparing bytes (or hashed bytes) is going to do
> what
> you want.
>
> Thanks,
> Dale
> --
> Dale E. Martin, Clifton Labs, Inc.
> Senior Computer Engineer
> dmartin (AT) cliftonlabs (DOT) com
> http://www.cliftonlabs.com
> pgp key available
>

Dale E Martin
2004-08-12, 16:56
> That's really because cdparanoia isn't bit accurate. If you use EAC
> in secure mode to rip your CDs and you calibrate the drive with accurip
> first you will actually get the exact same track with the exact same
> MD5 hash each time (even if you rip it on different machines/drives).
> Assuming there were no read errors reported by EAC. I've verified this
> myself.

Not if we have the same album, but mine is remastered and yours isn't.
Then we will have the same album, same songs, same tracklist, etc. and we still
won't be able to automatically say "the same song" by comparing bytes.
Perhaps this is a pathological case, but I think that assuming everyone
would/could/will use EAC in secure mode having calibrated the drive with
accurip is pathological too. I think the point is that calculating a
primary key off of the data will cause more trouble than it's worth.

Take care,
Dale
--
Dale E. Martin, Clifton Labs, Inc.
Senior Computer Engineer
dmartin (AT) cliftonlabs (DOT) com
http://www.cliftonlabs.com
pgp key available

Jack Coates
2004-08-12, 21:11
> That's really because cdparanoia isn't bit accurate. If you use EAC
> in secure mode to rip your CDs and you calibrate the drive with accurip
> first you will actually get the exact same track with the exact same
> MD5 hash each time (even if you rip it on different machines/drives).
> Assuming there were no read errors reported by EAC. I've verified this
> myself.
>

That's fine, but possible != widely done. I think it's pretty clear from
the discussion so far that the primary key needs to be separate from any hash
of the track's attributes. Next argument, please.
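
To make that concrete, a small Python/SQLite sketch: an auto-increment surrogate key identifies each track, and the hash is kept as an ordinary, non-unique, indexed column that can suggest candidates for a person to confirm. The table name, column names, and helper functions are hypothetical and are not taken from the strawman schema:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS track (
    track_id     INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, never derived from the audio
    path         TEXT NOT NULL UNIQUE,
    title        TEXT,
    content_hash TEXT                                 -- optional fingerprint, deliberately not unique
);
CREATE INDEX IF NOT EXISTS track_content_hash ON track (content_hash);
"""


def open_library(database=":memory:"):
    """Open (or create) the library database and make sure the schema exists."""
    conn = sqlite3.connect(database)
    conn.executescript(SCHEMA)
    return conn


def add_track(conn, path, title, content_hash=None):
    """Insert one track and return its generated track_id."""
    cursor = conn.execute(
        "INSERT INTO track (path, title, content_hash) VALUES (?, ?, ?)",
        (path, title, content_hash),
    )
    return cursor.lastrowid


def possible_duplicates(conn, content_hash):
    """Return the track_ids sharing a hash: candidates only, not proof of identity."""
    rows = conn.execute(
        "SELECT track_id FROM track WHERE content_hash = ? ORDER BY track_id",
        (content_hash,),
    )
    return [row[0] for row in rows]
```

With this layout a remaster, a different bitrate, or an imperfect rip simply gets its own row, and deciding that two rows are "the same song" stays a separate step.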

"Oh, you wanted arguments? This is abuse, arguments are down the hall.
Stupid git."
--
Jack At Monkeynoodle.Org: It's A Scientific Venture...
"Believe what you're told; there'd be chaos if everyone thought for
themselves." -- Top Dog hotdog stand, Berkeley, CA