help with script to scrape album year data



bklaas
2007-10-29, 10:32
Over the weekend I spent a little time trying to figure out a way to populate the year field in ID3 tags where it's missing.

I wrote a Perl script (on Linux, but I think it's portable to other OSes) to zip through my collection, find files with missing year tags, and output the results, one line per artist/album, to a file.
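A rough sketch of that filtering step, in Python rather than Perl for brevity. It assumes the tags have already been read into plain dictionaries by some ID3 library; the key names ("artist", "album", "year") and the tab-separated output are illustrative choices, not what the actual script does.

```python
def albums_missing_year(tracks):
    """Return one 'artist<TAB>album' line per album that has a track with no year."""
    missing = set()
    for tags in tracks:
        # Treat an absent, None, or whitespace-only year as missing.
        year = str(tags.get("year") or "").strip()
        if not year:
            missing.add((tags.get("artist") or "", tags.get("album") or ""))
    # A set removes duplicates, so each album appears once however many tracks it has.
    return [f"{artist}\t{album}" for artist, album in sorted(missing)]


if __name__ == "__main__":
    sample = [
        {"artist": "Beck", "album": "Sea Change", "year": ""},
        {"artist": "Beck", "album": "Sea Change"},
        {"artist": "Beck", "album": "Odelay", "year": "1996"},
    ]
    for line in albums_missing_year(sample):
        print(line)
```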

Now I'm looking for a way to scrape the missing year data from the web somewhere by supplying an artist-album tuple. For example, I send "artist=Beck&album=Sea+Change" to TBD website/whatever, and I parse the year data from the result and write the tag accordingly.
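For what it's worth, building that query string safely is the easy part. A minimal Python sketch (the parameter names "artist" and "album" just mirror the example above; whatever service ends up being used will have its own):

```python
from urllib.parse import urlencode


def build_query(artist, album):
    """Encode an artist/album pair as a URL query string."""
    # urlencode escapes reserved characters and turns spaces into '+'.
    return urlencode({"artist": artist, "album": album})


if __name__ == "__main__":
    print(build_query("Beck", "Sea Change"))  # artist=Beck&album=Sea+Change
```

Letting the library do the escaping matters for names like "AC/DC", where the slash would otherwise break the URL.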

Has anyone done this, or does anyone have advice on where to look for this data? I'm looking for something that returns structured data in a format that is as simple as possible to parse.

cheers,
#!/ben

btw-- would be happy to share what I've got so far, but figure I'll wait until I add the scraping code.

snarlydwarf
2007-10-29, 10:38
MusicBrainz often has that information; the catch is that you may not agree with their definition of release date.

(i.e., re-issues and remasters are supposed to get a new release date...)

Allmusic.com may also have it, but they want a license signed before using their data.

Benway
2007-10-29, 10:51
CPAN has lots of modules for querying FreeDB.

The problem you'll find is that they tend to need the DiscID, or the CD-ROM available as /dev/cdrom, etc.

Searching CPAN for "freedb id3" finds WebService::FreeDB, which looks like it will do the trick.

However, accuracy may be a problem when multiple records are found.

bklaas
2007-10-29, 13:30
Digging my old CDs out for this would take WAY more time than just manually searching Google for each album's release year. CDDB is not a good solution here, especially because I don't have DiscIDs saved in the tags either. I started ripping music well before I understood why verbose tag metadata was a good idea. I'm trying to come up with a solution that doesn't involve the physical media.

I will give MusicBrainz a shot, though my prior experience with that service has not been good.

cheers,
#!/ben

radish
2007-10-29, 13:37
Discogs.com? They have a fairly Google-like search engine and the DB is pretty damn extensive. In fact, the biggest problem you're likely to run into is narrowing the results down. Even a pretty specific query like this:

http://www.discogs.com/search?type=all&q=change+or+die+artist%3Asunscreem+country%3AUK+format%3Acd

gives 3 results. I guess you'd have to come up with something to pull each result and take a best guess of the correct year.
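One possible "best guess" rule, sketched in Python: take the earliest year among the candidates, on the assumption that reissues and remasters always come later than the original release. That assumption is mine, and it gives the wrong answer if you actually want the year of the pressing you own. The input here is just a list of date strings pulled from the results, in whatever format they come in.

```python
import re

# Four-digit years starting with 19 or 20.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def earliest_year(release_dates):
    """Return the earliest plausible year found in the date strings, or None."""
    years = []
    for date in release_dates:
        match = YEAR.search(date)
        if match:
            years.append(int(match.group(0)))
    return min(years) if years else None


if __name__ == "__main__":
    print(earliest_year(["2002-09-24", "2009", "unknown"]))
```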

bklaas
2007-10-29, 13:46
fantastico, radish! That looks like it might just work.

cheers,
#!/ben

radish
2007-10-29, 14:23
Thinking about it, Amazon Web Services might work well too. I know a lot of apps use it for getting cover art; you can give it some pretty vague queries and it will do its best to match. The advantage would be that parsing neat XML is typically easier than scraping HTML, in my experience.

http://www.amazon.com/E-Commerce-Service-AWS-home-page/b/ref=sc_fe_l_2/105-5797087-3222059?ie=UTF8&node=12738641&no=342430011&me=A36L942TSJ2AJA
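To illustrate the XML point, here is a small Python sketch using the standard library parser. The element names (results, item, title, releasedate) are made up for the example and are not the real Amazon response format; the real schema would need to be checked against their documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body, invented for illustration only.
SAMPLE = """<results>
  <item><title>Sea Change</title><releasedate>2002-09-24</releasedate></item>
  <item><title>Sea Change (Reissue)</title><releasedate>2009-06-16</releasedate></item>
  <item><title>Sea Change (Promo)</title></item>
</results>"""


def release_years(xml_text):
    """Return the year of every <releasedate> found under an <item> element."""
    root = ET.fromstring(xml_text)
    years = []
    for item in root.iter("item"):
        date = item.findtext("releasedate", default="")
        # Accept only dates that start with four digits (YYYY or YYYY-MM-DD).
        if len(date) >= 4 and date[:4].isdigit():
            years.append(int(date[:4]))
    return years


if __name__ == "__main__":
    print(release_years(SAMPLE))
```

Compared with scraping HTML, there is no guessing about where the date sits in the page: you ask for the element by name and skip items that lack it.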

bklaas
2007-10-29, 14:30
Discogs has an API. Doesn't look too bad.

http://www.discogs.com/help/api

vrobin
2007-10-29, 14:43
You can "fuzzy search" CDDB or MusicBrainz using the media files themselves, even if the results are less exact than with the real disc/DiscID.

As I recall, the release dates on MusicBrainz and Discogs are not that bad, and they may even include an "original release date". If your albums are not too rare, you could also look at Wikipedia.

But if I were you, I would write a bot to query Google with an algorithm like this:



search Google for "full album name"
do:
    fetch the Nth result page
    look for NNNN (four-digit year) patterns near the album name in that page
    collect every NNNN found in the page
until at least XX NNNN values have been collected

Then analyze the collected NNNN values statistically: if one year accounts for 95% of them, keep it silently; at 75%, keep it with a notice; at 50%, keep it with a warning.


This algorithm can be fooled by re-release dates, but with good pattern detection you can get good results...
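A minimal Python sketch of the idea, leaving out the actual Google fetching: it takes the text of already-downloaded pages. The 200-character window, the restriction to years starting with 19 or 20, and the function names are all assumptions made for the example. Note that overlapping windows can count the same year twice, which a more careful version would avoid.

```python
import re
from collections import Counter

# "NNNN" narrowed to plausible release years.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def years_near(text, album, window=200):
    """Collect years found within `window` characters of each album mention."""
    found = []
    for hit in re.finditer(re.escape(album), text, flags=re.IGNORECASE):
        start = max(0, hit.start() - window)
        found.extend(int(y) for y in YEAR.findall(text[start:hit.end() + window]))
    return found


def pick_year(pages, album, min_samples=5):
    """Return (year, level); level is 'ok', 'notice', 'warning', or None."""
    years = []
    for page in pages:  # stands in for "fetch the Nth result page"
        years.extend(years_near(page, album))
        if len(years) >= min_samples:
            break  # enough samples collected, stop fetching
    if not years:
        return None, None
    year, count = Counter(years).most_common(1)[0]
    share = count / len(years)
    if share >= 0.95:
        return year, "ok"
    if share >= 0.75:
        return year, "notice"
    if share >= 0.50:
        return year, "warning"
    return None, None  # no year is dominant enough to trust


if __name__ == "__main__":
    pages = ["Beck - Sea Change (2002) review. Sea Change was released in 2002."]
    print(pick_year(pages, "Sea Change", min_samples=2))
```

If the pages run out before min_samples is reached, this version still votes on whatever it collected; returning nothing in that case would be the more cautious choice.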