still no presentation-layer dedupe of albums



chroma
2010-01-20, 01:36
I do not understand why it is necessary for me to get five indistinguishable first-level results if I have five copies (of various quality or encoding method, or from multiple users' libraries) of the same album in my collection.

I want a basic heuristic on the presentation layer that says:

If for a search Album Name, Artist, Year are all the same between multiple entries, show the one with the largest aggregate bitrate or filesize. Don't show the others.

This solves multiple presentation-layer problems at once. If you have a better copy of the same album, you will always get that and only that. If you have a more complete copy of the same album, you will always get that and only that. You won't have the continuous annoyance of having the same album show more than once in search results.
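A minimal sketch of the heuristic described above, in Python. It assumes each album is already available as a record with a title, artist, year and a total byte count; the field names and the sample data are made up for illustration and are not taken from the server's schema.

```python
from collections import defaultdict

def pick_best_albums(albums):
    """Return one album per (title, artist, year) key: the copy with the most bytes."""
    groups = defaultdict(list)
    for album in albums:
        key = (album["title"], album["artist"], album["year"])
        groups[key].append(album)
    # max() returns the first of several equal maxima, so ties go to the earliest entry.
    return [max(copies, key=lambda a: a["total_bytes"]) for copies in groups.values()]

# Hypothetical sample data: two copies of one album, plus one unrelated album.
albums = [
    {"id": 1, "title": "Album A", "artist": "Artist X", "year": 2001, "total_bytes": 60000000},
    {"id": 2, "title": "Album A", "artist": "Artist X", "year": 2001, "total_bytes": 310000000},
    {"id": 3, "title": "Album B", "artist": "Artist Y", "year": 1999, "total_bytes": 45000000},
]
shown = pick_best_albums(albums)
shown_ids = [a["id"] for a in shown]
```

Only the larger copy of "Album A" and the single copy of "Album B" remain in `shown`; the smaller copy is simply not presented.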

At this point, as the maintainers of the priorities at Logitech don't seem to think this is a real problem, I'd be willing to pay someone directly to make the software change to fix this. All of my various SB devices are becoming idle because nobody wants to deal with the trouble.

thanks

DaveWr
2010-01-20, 02:40
I don't understand why you have that many duplicates in the first place. If you want the highest bit rate, just delete the rest (well, move them out of your music folder).

IMHO this is not a show stopper. There are many more important searching / playlisting issues.

Dave

chroma
2010-01-20, 17:08
I have a collection of collections used by other individuals and software. They have organization which is important to retain for their own reasons.

The place to clean them up is at the presentation level, or at the very least at the scanner level.

I declare a $250 starting bounty on getting this done.

dc

DaveWr
2010-01-21, 00:44
I agree with you about the logic, but the commercial need for this must be small. You need a third party developer, or maybe one of Erland's plug ins might help, Custom Scan + Custom Browse.

Dave

erland
2010-01-22, 10:24
Are you talking about searching or browsing ?

Are the sub collections stable ? If they are you can just create your collection by creating a directory that just contains links/shortcuts to the albums you want to include.

Because the SBS scanner auto detects files on the disk sometimes, for example when you browse by music folder, it's hard to do this as a third party plugin. I've done some experiments with Multi Library plugin but if I remember correctly it only selected lossless instead of lossy compressed track if both existed.

Are you sure your tags are exactly the same on the albums or do you expect it to also detect albums that have almost the same title ?

Theoretically you can probably do it with Multi Library plugin but we are talking about pretty advanced SQL queries.

A solution that might work as long as you don't use the "Music Folder" browse method is to edit the schema_optimize.sql script so it removes the albums you don't want in the library. This is probably the easiest way to accomplish the functionality but it only works as long as you don't want to browse by "Music Folder".
It might not work in a future version of SBS when auto scan of your library is supported. So if you want something future proof this might not be the best solution.

Zaragon
2010-01-23, 07:26
I would ask the question how does the system know that the multiple sets are different except for the bit-rate or encoding? There are lots of things that seem obvious but that when you try to code them become absolutely horrendous.

Bearing in mind the SC deals in tracks and it is the tags and disk structure that allow it to 'understand' albums.

So if you have the same track in two different directories with identical tags but encoded differently (different bit rates or codec), are they the same track musically, or are they different but just happen to have the same tags?

Whilst you might say that that could never happen then consider that there are no official tags for any track just what you put on there yourself, perhaps with the help of an online source. So any user might choose to accidentally make identical tags. Also consider a scenario where an album is updated, perhaps with remix, but is otherwise identical to the old one then it might get identical tags.

What happens if you have different album art in the two directories, does this make the two tracks different? What about if in one directory you have an additional track does that make the two original tracks different?

Now one can easily say "of course it does" to any of the questions, but can everyone say the same thing? One only looks at it from the point of view of one's own library.

Whilst it might be possible to do complex analysis of the tracks to try to guess if they are similar musically, you have two problems: first, speed and processing power, and second, which one does it choose to display and play. I can envisage situations where two tracks are determined to be identical but aren't, and you can't ever get one of them to play.

I seem to recall this has been discussed before, and one solution was that if two identically tagged tracks (differing only by codec/bit-rate) are in the same directory then they could be considered to be the same track, thus allowing the system to choose the most appropriate to play to a given device. Though I'm sure that there are some ripping/management apps that might unintentionally create that situation for different tracks.

However, that wouldn't work in the OP's situation because he wishes to maintain an existing directory structure.

In the OP's case perhaps he should look at the custom library plugins, then create some bespoke tags in the files. Then the individual devices can look at a tailored bespoke library.

bobkoure
2010-01-23, 08:59
So maybe the answer is to move the encoding and bitrate into the tags, either as separate tags or maybe just as one of the multiple genres.

If you've got everything sorted into directory trees, you could just use an MP3Tags action to either add a tag (stupidly easy) or to add a genre (slightly more difficult - look at "format values").

If you can't add tags, then, as previously mentioned (I think) look at multiple library support - control which folder trees are in which library by manipulating shortcuts/links.

As far as not understanding why you have this issue, well, you're way out under that long tail...

chroma
2010-01-23, 14:44
i think you guys are making this entirely too complicated...

this is an album level, not a track level, function. the only comparison i care about is at the album level. i'll take whatever art and individual track versions exist with the album that is selected.

i require retention of browse music folder, yes.. that is how some people get access to the collections that they have put together and understand the structure of.

the higher level browses & searches though should only select the more information dense / more complete version of an album.

for example, the following query does 95% of what is required:

select C2.id, C2.title, C2.contributor, C2.year, SUM(tracks.filesize)
FROM albums as C1, albums as C2
JOIN tracks WHERE C1.title = C2.title AND C1.contributor = C2.contributor AND (c1.year = c2.year
OR c1.year = 0 OR c2.year = 0) AND C1.id < C2.id AND tracks.album = c2.id
group by tracks.album order by title;

i'm not talking about eliminating piddly differences in the tags. if someone wants to build independent heuristics for that, it would certainly be cool to allow them to build comparison regexps or something, but thats not the point here. my collections are meticulously organized and the tags normalized.

as demonstrated in that query, we're just looking at albums which have the same title, artist, year, and then adding up all of the filesizes within that album, and ultimately (although this step is not in the query above) selecting the largest response per similar album, and perhaps tagging the rest with a new schema boolean "hideThis".
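the missing "largest per group" step could look roughly like the sketch below. it is an illustration only: it uses Python's built-in sqlite3 as a stand-in for the server's MySQL database, the tables contain just the columns needed here, the sample rows are made up, and it matches on exact year (it does not reproduce the "year = 0" wildcard from the query above). hideThis is the flag column proposed above, not something that exists in the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums (id INTEGER PRIMARY KEY, title TEXT, contributor INTEGER,
                         year INTEGER, hideThis INTEGER DEFAULT 0);
    CREATE TABLE tracks (id INTEGER PRIMARY KEY, album INTEGER, filesize INTEGER);
    INSERT INTO albums (id, title, contributor, year) VALUES
        (1, 'Album A', 10, 2001),
        (2, 'Album A', 10, 2001),
        (3, 'Album B', 11, 1999);
    INSERT INTO tracks (album, filesize) VALUES
        (1, 3000000), (1, 3500000),
        (2, 30000000), (2, 35000000),
        (3, 5000000);
""")

# Total bytes per album, ordered so the largest copy of each
# (title, contributor, year) group comes first.
rows = conn.execute("""
    SELECT a.id, a.title, a.contributor, a.year, SUM(t.filesize) AS total
    FROM albums AS a JOIN tracks AS t ON t.album = a.id
    GROUP BY a.id
    ORDER BY a.title, a.contributor, a.year, total DESC, a.id
""").fetchall()

seen = set()
hidden = []
for album_id, title, contributor, year, total in rows:
    key = (title, contributor, year)
    if key in seen:
        hidden.append(album_id)   # a larger copy of this album was already kept
    else:
        seen.add(key)             # first row of the group is the largest copy

conn.executemany("UPDATE albums SET hideThis = 1 WHERE id = ?", [(i,) for i in hidden])
visible = [r[0] for r in conn.execute("SELECT id FROM albums WHERE hideThis = 0 ORDER BY id")]
```

with this sample data the small copy (album 1) gets flagged and albums 2 and 3 stay visible. nothing is deleted, so a directory browse would still see every file.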

i agree that normalization and correction of tags is an external problem. but if they are already corrected, and i have two versions of the same album, i want only the best copy, as defined by a very simple and inexpensive heuristic which can be applied to existing tags, i want only one response to my searches and browses.

clear now?

chroma
2010-01-23, 15:22
say you have a collective living situation. common recreation areas and sound systems. squeezeboxes in each of them. people come in with their own collections of stuff. there is a fair amount of overlap because they have similar taste. each collection is very well groomed on its own. the collection drives are added to the common server / NAS. under normal circumstances, we want to search and browse across all collections. but many dupes occur. in individuals own spaces, they can run multilibrary or itunes or whatever they want across their own root directory on the NAS. (most people still choose to buy their own squeezeboxes or run a software client and run across the entire library, though, even in their own rooms.)

several things occur regularly:

- seven+ hits for the same album search. no quick way to determine which one has been ripped at 64k mono vs flac. in the common spaces, where the sound system is of very high quality, the more information-dense version is universally desirable.
- multiple hits for the same album search. multiple copies of the same quality. only one is useful.
- multiple hits for the same album search. one copy is incomplete or incorrectly tagged. the most complete one is still desirable.
- search by artist. have to scroll through 3-10 copies of the (apparently) same result per album name.
- search by track name. have to scroll through 3-15 copies of the (apparently) same result per track name. [even in this case, by removing at the album level all but the most information dense version, we get where we want.. a single result from the best album choice]

again, i'll certainly agree that the algorithm used for selection may change per individual taste, but as long as it assumes that the data has been externally normalized (which i agree is not slimserver's problem), it should do a reasonable job of making the presentation usable. the only other major selection criterion i can think of which might be more desirable for some people is "select album copy with more items". in my case, i've found that total byte count works in almost all cases for accomplishing that as a side-effect.

MelonMonkey
2010-01-24, 09:24
You mentioned "corrected" tags, yet your music collection is anything but correct. IMO, you're looking for a solution to be implemented universally for a problem that is singular to your system that you yourself have created.

You can, as suggested, fix your music collection. Much faster than a database query. If you only want the BEST copy to exist in SBS, then learn how to use file system links to show SBS what you want while hiding the rest. Much cleaner and definitely not ambiguous like your proposed solution. If you don't see the issues in the solution you're proposing then you're not looking at anything close to the whole picture.

Your issue is an extreme corner case but you believe that it can be fixed by a simple query without actually looking at how SBS currently handles tracks and populates its own DB. Then you ignore additional common and corner cases that will most definitely affect others when implementing your suggestion.

You may want to look at the source code - which you may find easy enough to insert your query into, thereby solving your unique issue. Just a suggestion.

Making something more convoluted and complex is generally not the way to also make it stable and usable.

bobkoure
2010-01-24, 09:58
i think you guys are making this entirely too complicated...

Or maybe we're trying to think of a group of users (other than just you and your house/apartment/dorm mates) who might need something like this. Actually, it's not clear that they care about this much, so it may be just you.

I'd suggest hand-crafting a directory of shortcuts, pointing to the music you want, and letting it go at that. Note that you can have multiple layers of shortcuts. Combine that with multiple library support and you're done.
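A rough sketch of what building such a directory of links could look like in Python. The paths are throwaway temp directories standing in for real collections, `build_link_tree` is a hypothetical helper name, and whether the server's scanner follows symlinks on a given platform is an assumption that would need checking.

```python
import os
import tempfile

def build_link_tree(album_dirs, link_root):
    """Create one symlink per chosen album directory inside link_root."""
    os.makedirs(link_root, exist_ok=True)
    links = []
    for src in album_dirs:
        link = os.path.join(link_root, os.path.basename(src.rstrip(os.sep)))
        # Skip links that already exist; two albums with the same folder
        # name would collide here and need a disambiguating name.
        if not os.path.lexists(link):
            os.symlink(src, link, target_is_directory=True)
        links.append(link)
    return links

# Demo with temporary directories in place of real music folders.
base = tempfile.mkdtemp()
flac_copy = os.path.join(base, "alice", "Album A (FLAC)")
os.makedirs(flac_copy)
links = build_link_tree([flac_copy], os.path.join(base, "shared_music"))
```

The original folders stay untouched; only the separate link directory would be pointed at by the server.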

Mnyb
2010-01-24, 10:11
But browse music folder is just that. It's specifically designed not to read tags or implement any intelligence. It's there so people can browse and use untagged or badly tagged files. Anything but a simple file browser for browse music folder will create a mess for countless other users. You browse to a file and play it; that's what browse music folder does (and it adds it to the DB, which I think should be optional).

Properly tagged files can be accessed through artist, album, genre, year etc.

In that case I *think* (I have not tried it myself) a combination of the Multi Library and Custom Browse plugins can sort things out. Not the way you want it, but in some more manageable way.

Otherwise better file organization is the correct solution. I would never keep eleven versions of the same album in the dir/path that SBS can scan.
That would obviously create a mess.
Disk space is cheap: just copy and aggregate the wanted files in the music directory and keep the excess outside the server.
A clever use of shortcuts/symlinks, then?

erland
2010-01-24, 10:53
for example, the following query does 95% of what is required:

select C2.id, C2.title, C2.contributor, C2.year, SUM(tracks.filesize)
FROM albums as C1, albums as C2
JOIN tracks WHERE C1.title = C2.title AND C1.contributor = C2.contributor AND (c1.year = c2.year
OR c1.year = 0 OR c2.year = 0) AND C1.id < C2.id AND tracks.album = c2.id
group by tracks.album order by title;

If you are familiar with SQL, take a look at the Multi Library and Custom Browse plugins. I'm pretty sure you should be able to do a custom library and corresponding browse menu that matches your requirements.

You won't be able to search with this solution, but you will probably be able to create browse menus that only includes the albums you want.

chroma
2010-01-27, 21:33
i appreciate that some of you are trying to be helpful.

let me clarify the requirements again:

- the filesystem must remain unmodified. no symlinks. no changing whats there. it is a correct set of collections of collections. it is correctly tagged.
- users want to be able to use the "browse music folder" and see everything, as would be natural per point 1.
- some number of items are duplicates. these should not show up in a browse or search, but should continue to show up in a directory browse.
- lets leave it to me to decide what should be hidden or not. i've already done that as shown below. i do understand the logic applied to determine album alignment, and i do not consider this a difficult problem. in fact, i believe application of the following logic will be generally useful for some people, and not harmful for others. its alright if you think that my requirements are a corner case. i find this hard to believe, but lets not belabor that now.
- once items are tagged to be hidden, they should not show up in any search or browse views, except when viewing the real directory structure.

i've written the following code to do what i want, albeit in a very messy and expensive way:



#!/usr/bin/python
# presentation layer album dedupe

import sys
import MySQLdb
from collections import defaultdict

try:
    conn = MySQLdb.connect (host = "127.0.0.1",
                            user = "root",
                            passwd = "",
                            port = 9092,
                            db = "slimserver")
except MySQLdb.Error, e:
    print "Error %d: %s" % (e.args[0], e.args[1])
    sys.exit (1)

cursor = conn.cursor ()

cursor.execute ("select C2.id, C2.titlesort, C2.contributor, C2.year, SUM(tracks.filesize) \
    FROM albums as C1, albums as C2 \
    JOIN tracks WHERE C1.titlesort = C2.titlesort AND C1.contributor = C2.contributor \
    AND (c1.year = c2.year OR c2.year = 0) \
    AND C1.id != C2.id AND tracks.album = c2.id \
    group by tracks.album order by titlesort")

d = defaultdict(list)
hidelist = []

while (1):
    row = cursor.fetchone ()
    if row == None:
        break

    d[row[1]].append((row[4],row[0]))

for k, v in d.items():
    if len(v) > 1:
        print "%s:" % k
        largest = 0
        selected = 0
        # first pass gets largest value
        for l, i in v:
            if largest < l:
                largest = l

        # next pass determines rows to hide
        for l, i in v:
            if l != largest:
                if not l:
                    print "hide: <Null> %d" % (i);
                else:
                    print "hide: %d %d" % (l, i)
                hidelist.append(i)
            else:
                if selected:
                    # duplicate size, arbitrary selection
                    print "hide: %d %d" % (l, i)
                    hidelist.append(i)
                else:
                    selected+=1
                    print "keep: %d %d" % (l, i)


print "hiding %d duplicate albums" % (len(hidelist))

for i in hidelist:
    updatestr = "UPDATE albums SET invisible = 1 WHERE id = " + str(i)
    cursor.execute (updatestr)

# hacky, but works for now: just blow away records of bad dupes in the db
# everything above this line is fine. everything below this line should be
# replaced by a server side SELECT exception rather than this expensive
# set of deletes

cursor.execute ("delete tracks.* from tracks LEFT JOIN albums on tracks.album = albums.id \
    where albums.invisible = 1");
cursor.execute ("delete from albums where albums.invisible = 1");

cursor.close ()
conn.commit ()
conn.close ()



As I have an intermediate step wherein I flag each album as invisible by adding a column to the table for albums i don't want, and could easily add that column to tracks as well, it seems much more reasonable to just select those out when doing searches/browses. That way I'm not in a race with the scanner, and don't have to run expensive deletes.
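To illustrate the intent (filter at query time instead of deleting), here is a small sketch using Python's built-in sqlite3 as a stand-in for the server's MySQL database. The tables are cut down to the columns needed, the rows are made up, and `search_albums` / `search_tracks` are hypothetical helpers, not functions from the server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums (id INTEGER PRIMARY KEY, title TEXT, invisible INTEGER DEFAULT 0);
    CREATE TABLE tracks (id INTEGER PRIMARY KEY, title TEXT, album INTEGER);
    INSERT INTO albums (id, title, invisible) VALUES (1, 'Album A', 1), (2, 'Album A', 0);
    INSERT INTO tracks (id, title, album) VALUES (1, 'Song', 1), (2, 'Song', 2);
""")

def search_albums(conn, term):
    """Album search that skips rows flagged as invisible."""
    return conn.execute(
        "SELECT id, title FROM albums WHERE title LIKE ? AND invisible = 0 ORDER BY id",
        ("%" + term + "%",),
    ).fetchall()

def search_tracks(conn, term):
    """Track search that skips tracks whose album is flagged as invisible."""
    return conn.execute(
        "SELECT t.id, t.title FROM tracks AS t JOIN albums AS a ON a.id = t.album "
        "WHERE t.title LIKE ? AND a.invisible = 0 ORDER BY t.id",
        ("%" + term + "%",),
    ).fetchall()

album_hits = search_albums(conn, "Album")
track_hits = search_tracks(conn, "Song")
total_albums = conn.execute("SELECT COUNT(*) FROM albums").fetchone()[0]
```

Both searches return only the unflagged copy, while the flagged rows are still in the tables for anything that reads them directly. One thing to watch: if the added column allows NULL and unflagged rows are left as NULL, a comparison such as `invisible != 1` will also drop those rows, so the column needs a default of 0 or the condition needs to handle NULL.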

I tried to add this logic in Slim::Control::Queries.pm in the following places, but don't see it reflected in the SQL queries being passed to the server, so I apparently have misplaced my strategy somewhat:

in albumsQuery:
$where->{'me.invisible'} = {'!=' => '1'};

in artistsQuery:
$where->{'me.invisible'} = {'!=' => '1'};
$where_va->{'me.invisible'} = {'!=' => '1'};

in titlesQuery:
$where->{'album.invisible'} = {'!=' => '1'};
push @{$attr->{'join'}}, 'album';

So, without any more telling me why I don't know what I actually want, can someone kindly help me figure out where I should place modifications with this intent, so that my searches and browses run against the db leave out rows flagged as invisible?

Thanks.

erland
2010-01-28, 00:05
Don't the removed albums and tracks appear again when you use Music Folder browsing?
I thought the SBS scanner rescanned automatically when browsing a folder with the browse Music Folder mechanism.

Letten
2010-01-28, 02:03
I have a collection of collections used by other individuals and software.

Forgive me, but this sounds illegal. Like copying friends' music collections and not wanting to bother deleting duplicates! (I'm not saying you are.)

The usual scenario is that you use SC on your own collection of music with no duplicates. Some people have lossless and lossy copies (for portable players), keep those in separate folders and just use SC on the lossless version. That's it, no big problem.

Even if your scenario is completely legal, it is rare, and I don't think the limited developer resources should cater for this scenario. There are much more important issues and wishes.

chroma
2010-01-28, 02:18
Don't the removed albums and tracks appear again when you use Music Folder browsing?
I thought the SBS scanner rescanned automatically when browsing a folder with the browse Music Folder mechanism.

yes, which is why i want them removed at the presentation layer rather than having to remove them entirely from the DB. now that I have a flag set on each track that i don't want to appear in search or browse, that should be easy, just need to find the right entry point.

JJZolx
2010-01-28, 04:30
yes, which is why i want them removed at the presentation layer rather than having to remove them entirely from the DB. now that I have a flag set on each track that i don't want to appear in search or browse, that should be easy, just need to find the right entry point.

Ugh. I want nothing of the sort. Remove them yourself if it's so important to you. I get sick of these "the software needs to figure out what I want" requests. They're like an endless running bad joke.

It's an incidental concern, of course, because it will never happen in Squeezebox Server, no matter how lazy you choose to be.

chroma
2010-01-28, 13:17
I've provided the heuristic, the code, the driving use case.

Rather than telling me further that I don't know what I want or that I am part of some 'bad joke', there are three reasonable options here:

1) provide a working model for the existing software which meets the stated requirements
2) provide the minor assistance I am requesting, so that I, or any other user who decides not only that behavior should be different but also chooses to implement flagging of that behavior themselves, can meet the requirements
or
3) shut up and let other people interact who will provide useful means to reaching 1 or 2

It's clear that this request creates strong reactions in some of you. Those reactions have been presented. I still have a problem to solve and a significant investment in the related product.

At this point all I need is generally useful. Do not show items in search or browse which have a boolean flag set in the db. I've added the columns, set the flags. That logic may or may not be interesting to others. Perhaps others may implement their own heuristics, most can use none at all.

Now, no more opinions please. That's already been overdone. Let's get down to actually solving the problem.

Thanks

Phil Leigh
2010-01-28, 14:50
Lets get down to actually solving the problem.

Thanks

I guess you know that your OP set the tone for this thread. There is no "the problem" - there is just "your problem".

Please will someone tell Chroma how to patch his code into his installation (this will NEVER be a feature of SBS) so we can all sleep easily in our beds...

aubuti
2010-01-28, 14:57
I declare a $250 starting bounty on getting this done.
...maybe you need to increase the bounty

chrisla
2010-01-28, 19:14
I'll match his $250.

-Chris

On Thu, Jan 28, 2010 at 1:57 PM, aubuti <
aubuti.45iren1264715881 (AT) no-mx (DOT) forums.slimdevices.com> wrote:

>
> chroma;508694 Wrote:
> > I declare a $250 starting bounty on getting this done.
> ...maybe you need to increase the bounty
>
>
> --
> aubuti
> ------------------------------------------------------------------------
> aubuti's Profile: http://forums.slimdevices.com/member.php?userid=2074
> View this thread: http://forums.slimdevices.com/showthread.php?t=74288
>
>