FLAC tags and international characters



Sam Doshi
2004-05-11, 14:19
FLAC files with accented or other international characters in their tags
show up bizarrely on both my SqueezeBox and the web interface. As an
example Fauré shows up as Faur**, where ** are some odd glyphs. Now I
know the software can handle the characters properly as mp3 tags and
file names are displayed fine.

So....,

a) Am I being thick :) Have I just missed something?
b) Is it because FLAC uses unicode for its tags?
c) Is there something I can do about it?
d) Will it be fixed?
e) errrr, dunno.


Sam

Pat Farrell
2004-05-11, 14:47
At 05:19 PM 5/11/2004, you wrote:
>FLAC files with accented or other international characters in their tags
>show up bizarrely on both my SqueezeBox and the web interface. As an
>example Fauré shows up as Faur**, where ** are some odd glyphs. Now I know
>the software can handle the characters properly as mp3 tags and file names
>are displayed fine.
>a) Am I being thick :) Have I just missed something?

Nope, it's a feature.
For example, I had
Andres Segovia tagged as Andrés Segovia, and the é causes problems.

>b) Is it because FLAC uses unicode for its tags?

I'm not sure it is really Unicode as much as non-American characters.

>c) Is there something I can do about it?

Go to my SlimServer-oriented utility page and look at the stuff for flac.
http://www.pfarrell.com/music/slimsoftware.html
The "strange" command will at least identify files with FLAC tags
(really Ogg tags) that are likely to cause problems.

I tweaked the "bad" files by hand, although you clearly could
expect a program like mine to actually fix them for you :-)

Pat

Pat Farrell
2004-05-12, 08:48
At 05:35 AM 5/12/2004, Dolf Dijkstra wrote:
>As far as I know, the flac (ogg) comments are UTF-8. This means that flac
>is capable of storing extended characters.


A little known, and to my knowledge never implemented, feature of ID3V2
tags is that they also support both UTF-8 and real unicode.

>The last step is the font in the browser.

This is almost the last step.
The remaining problem is that the SqueezeBox display has a very limited
font set. It can't display a whole lot of things.


>IMHO changing the flac comments by replacing the extended characters by
>ascii is not a good way out.

No argument that it is not correct, but I see no alternative to the
problem with the SqueezeBox display itself. It simply doesn't have
the ability to display even western European character sets.
German, French, the Scandinavian countries all add a couple of
extra glyphs.

This is much more of a problem with classical-style genres, where
the artists and piece names are often not English or American.

Pat

Phil Barrett
2004-05-12, 09:22
On 12 May 2004, at 16:48, Pat Farrell wrote:
> A little known, and to my knowledge never implemented, feature of
> ID3V2
> tags is that they also support both UTF-8 and real unicode.

A quick Google shows a number of (claimed) implementations, including
for example www.id3editor.com.

NB what do you mean by "UTF-8 and real unicode"? UTF-8 is as much "real
unicode" as UTF-16 and UTF-16BE which are the other encodings supported
by ID3V2.

Phil

Sam Doshi
2004-05-12, 09:27
Actually the SqueezeBox can display western european character sets.
Just not with FLAC/Ogg tags. Try it with an mp3.

Sam.

S. Ben Melhuish
2004-05-12, 09:43
Sam Doshi wrote:
> Actually the SqueezeBox can display western european character sets.
> Just not with FLAC/Ogg tags. Try it with an mp3.

I don't think the problem is the format of the audio file; I think the
problem is the encoding of the tags, or actually the SlimServer's (lack
of) handling of the encoding.

The description "an A followed by some weird character" sounds an awful
lot like UTF8 encoding, but interpreted as ISO 8859 or some similar
single-byte encoding.
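
A minimal sketch of that misreading, in Python purely for illustration (SlimServer itself is Perl, and this is not its code). It assumes the tag holds UTF-8 bytes and the reader treats them as ISO 8859-1:

```python
# Sketch: reproduce the "Faur + two odd glyphs" symptom.
name = "Fauré"

# What a UTF-8 tag would store on disk: the é becomes two bytes.
utf8_bytes = name.encode("utf-8")

# What a reader sees if it wrongly assumes one byte per character.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes)  # b'Faur\xc3\xa9'
print(misread)     # FaurÃ©
```

The single é turns into an accented A followed by a second stray character, which matches the symptom described above.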

Here's how I think the SlimServer should behave, to help address these
problems:

* When reading text (e.g. from a tag), it should deduce, as best it can,
the encoding -- raw ASCII, UTF-8, UTF-16, whatever. In an ideal world,
the encoding would always be indicated somehow -- in the audio file
spec, or as part of the contents of the tag itself -- but it's possible
that some guesswork might be needed.

* It should store everything internally in some consistent format (say,
Unicode encoded as UTF8), converting input on the fly as necessary.

* When sending it out to some output (HTML, XML, a SliMP3 or
Squeezebox), it should know what encoding is supported by that output.
(E.g., web browsers can generally handle UTF8, as long as that encoding
is indicated in the HTTP response headers; SliMP3 seems to only take (a
subset of?) ISO 8859.) If transcoding is necessary, it should do so,
gracefully degrading characters as possible (e.g., if the output can't
handle an em-dash, it should be converted to a -- (double dash)).
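
The three steps above can be sketched roughly as follows. This is an illustration in Python, not SlimServer code: `decode_tag`, `encode_for_output` and the `FALLBACKS` table are hypothetical names, and the "try UTF-8, else assume Latin-1" guess is only one possible heuristic.

```python
import unicodedata

# Hypothetical fallback table for characters an output cannot show.
FALLBACKS = {"\u2014": "--", "\u2013": "-", "\u2018": "'", "\u2019": "'"}


def decode_tag(raw: bytes) -> str:
    """Step 1: guess the encoding. Try strict UTF-8, else assume Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def encode_for_output(text: str, encoding: str) -> bytes:
    """Step 3: transcode for one output, degrading what it cannot represent."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)  # the output can show this character as-is
        except UnicodeEncodeError:
            if ch in FALLBACKS:
                out.append(FALLBACKS[ch])
            else:
                # Decompose and drop combining accents (e.g. ř becomes r).
                decomposed = unicodedata.normalize("NFKD", ch)
                out.append("".join(c for c in decomposed
                                   if not unicodedata.combining(c)))
    # Anything still unrepresentable becomes "?".
    return "".join(out).encode(encoding, errors="replace")


# Step 2 is simply: keep the decoded str as the one internal format.
title = decode_tag(b"Faur\xc3\xa9 \xe2\x80\x94 Requiem")
print(encode_for_output(title, "latin-1"))  # b'Faur\xe9 -- Requiem'
print(encode_for_output("Dvořák", "ascii"))  # b'Dvorak'
```

The point of the design is that each input and each output declares its own encoding, so nothing is passed through blindly.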

As near as I can tell, the SlimServer generally has little or no idea
how the tags it reads are encoded; similarly, it doesn't know what
encoding its output devices can take. As a result, they're just passed
through blindly, with the weird results people see on their players.

This is a project I've been planning to work on for a while, but other
projects (including life) have taken higher priority. I still plan on
doing it, if nobody else has by the time I get to it.

-- S. Ben Melhuish
sben (AT) pile (DOT) org

Pat Farrell
2004-05-12, 10:22
At 12:22 PM 5/12/2004, Phil Barrett wrote:
>NB what do you mean by "UTF-8 and real unicode"? UTF-8 is as much "real
>unicode" as UTF-16 and UTF-16BE which are the other encodings supported by
>ID3V2.

Don't get too upset. I was thinking about it from a programmer centric view.
UTF-8 allows null octet delimited strings, aka C-style strcpy, to work.
Real unicode support expects 16 bit values. Any Latin-1 characters have
the high octet zero'd so C-style code dies horribly.

UTF-8 is a nice persistence model for Unicode, as it is more compact
than 16 bit encodings. But I personally hate null delimited string formats,
there are too many places where it can lead to buffer overflows,
subtle bugs, and security holes. YMMV
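
To illustrate the zero-octet point, here is a small Python sketch. The helper `c_string_view` is a hypothetical stand-in for what a C routine such as strcpy or strlen would see, namely everything before the first zero octet:

```python
def c_string_view(buf: bytes) -> bytes:
    """Return the bytes a null-terminated C string routine would see."""
    return buf.split(b"\x00", 1)[0]


name = "Andrés"
utf8 = name.encode("utf-8")        # no zero octets for this text
utf16 = name.encode("utf-16-be")   # each Latin-1 char has a zero high octet

print(utf8)                  # b'Andr\xc3\xa9s'
print(utf16)                 # b'\x00A\x00n\x00d\x00r\x00\xe9\x00s'
print(c_string_view(utf8))   # b'Andr\xc3\xa9s'
print(c_string_view(utf16))  # b''
```

The UTF-8 bytes survive intact, while the 16-bit form is cut off at its very first octet.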

Pat


Pat Farrell pfarrell (AT) pfarrell (DOT) com
http://www.pfarrell.com

Phil Barrett
2004-05-12, 10:46
On 12 May 2004, at 18:22, Pat Farrell wrote:
> At 12:22 PM 5/12/2004, Phil Barrett wrote:
>
> NB what do you mean by "UTF-8 and real unicode"? UTF-8 is as much
> "real unicode" as UTF-16 and UTF-16BE which are the other encodings
> supported by ID3V2.
>
> Don't get too upset. I was thinking about it from a programmer
> centric view.
> UTF-8 allows null octet delimited strings, aka C-style strcpy, to
> work.

A C-programmer-centric view. Slimserver is Perl, remember.

> Real unicode support expects 16 bit values.

Out-of-date; Unicode is a 32-bit space now, IIRC.

> Any Latin-1 characters have
> the high octet zero'd so C-style code dies horribly.

But you shouldn't be passing 16-bit data to strcpy anyway.

> UTF-8 is a nice persistence model for Unicode, as it is more compact
> than 16 bit encodings. But I personally hate null delimited string
> formats,
> there are too many places where it can lead to buffer overflows,
> subtle bugs, and security holes. YMMV

Just because UTF-8 *can* be used with null-termination doesn't mean it
has to be.

Phil (not upset, just pedantic)

Johan Bolmsjo
2004-05-12, 12:52
Tue 2004-05-11 at 23:47, Pat Farrell wrote:
> At 05:19 PM 5/11/2004, you wrote:
> > FLAC files with accented or other international characters in their
> > tags show up bizarrely on both my SqueezeBox and the web interface.
> > As an example Fauré shows up as Faur**, where ** are some odd
> > glyphs. Now I know the software can handle the characters properly
> > as mp3 tags and file names are displayed fine.
> > a) Am I being thick :) Have I just missed something?
>
> Nope, it's a feature.
> For example, I had
> Andres Segovia tagged as Andrés Segovia, and the é causes problems.
>
> > b) Is it because FLAC uses unicode for its tags?
>
> I'm not sure it is really Unicode as much as non-American characters.

OggVorbis uses utf8 and displays funny characters too. FLAC is probably
the same.

Peter Nõu
2004-05-12, 13:04
At 11:35 2004-05-12, you wrote:
>I have this same issue. I made a previous post
>(http://thread.gmane.org/gmane.music.equipment.slimdevices.general/5857).
>
>As far as I know, the flac (ogg) comments are UTF-8. This means that flac
>is capable of storing extended characters.
>
>The next step where it could break is slim's handling of the characters. If
>they expect iso8859 (latin-1) in flac comments, then the byte-to-char
>conversion would break here, assuming that they use the flac comments and
>not the mp3 tags.
>
>The next step is the built-in slim webserver. It outputs iso-8859 encoded
>bytes, or that is what it tells the browser. Actually it tells the browser
>nothing in the HTTP header, so the browser defaults to iso-8859, and in
>the html it explicitly sets the charset to iso-8859. In the past I made
>slim indicate that it sends UTF-8 to the browser. That worked for some
>parts of the skin (the extended characters were displayed correctly), but
>in other places the skin broke completely (it did not render at all). But
>this test was on 5.1 with the 'old' template engine.
>
>The last step is the font in the browser. Any character needs a glyph (a
>graphical representation of a character) to be displayed in the browser.
>If you see squared boxes in your browser, this is most likely because the
>font that is selected to display the characters in the html page does not
>have the glyph for the character.
>
>I have not noticed the square boxes as you have, but I have seen that the
>browser displayed the extended characters incorrectly as multiple
>characters. This would indicate that the data sent by slim is not
>correct (according to the specified character set).
>
>IMHO changing the flac comments by replacing the extended characters by
>ascii is not a good way out.
>
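
The web-server step in the quoted message comes down to the declared charset matching the bytes actually sent. A minimal Python sketch, assuming a hypothetical `build_response` helper (this is not the SlimServer HTTP code):

```python
def build_response(html: str, charset: str = "utf-8") -> bytes:
    """Build a raw HTTP response whose header names the body's real encoding."""
    body = html.encode(charset)
    headers = (
        "HTTP/1.1 200 OK\r\n"
        f"Content-Type: text/html; charset={charset}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    # Header lines are plain ASCII; only the body uses the chosen charset.
    return headers.encode("ascii") + body


response = build_response("<p>Fauré</p>")
print(response.decode("utf-8"))
```

If the header says one charset and the body is encoded in another, the browser shows the multiple-character garbage described above.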

I just had to try, since I know I've had problems with flac-encoded Swedish song
titles. Found some, and the (Swedish) characters in the album listing are
screwed up, all right. Trying to access any one of those songs individually
in the web interface crashes the slimserver application completely. Repeated
the problem thrice, restarting the server in between. Latest nightly (May 12th)
on a Win2K server.

/peter

Dan Sully
2004-05-12, 13:36
* Johan Bolmsjo <johan (AT) nocrew (DOT) org> shaped the electrons to say...

>> I'm not sure it is really Unicode as much as non-American characters.
>
>OggVorbis uses utf8 and displays funny characters too. FLAC is probably the same.

FLAC uses Vorbis tags, so it is the same.

The SlimServer does not yet support UTF8/Unicode character sets.

-D
--
<weezyl> $6.66: The Value Meal of the Beast.

Peter Speck
2004-05-12, 14:36
On 12/5-2004, at 22:36, Dan Sully wrote:

> The SlimServer does not yet support UTF8/Unicode character sets.

Except for the iTunes interface, which does support some UTF-8: the
windows-latin-1 subset.

----
- Peter Speck

dean
2004-05-12, 20:27
On May 12, 2004, at 1:36 PM, Dan Sully wrote:

> * Johan Bolmsjo <johan (AT) nocrew (DOT) org> shaped the electrons to say...
>
>>> I'm not sure it is really Unicode as much as non-American characters.
>>
>> OggVorbis uses utf8 and displays funny characters too. FLAC is
>> probably the same.
>
> FLAC uses Vorbis tags, so it is the same.
>
> The SlimServer does not yet support UTF8/Unicode character sets.

That's true, but SlimServer does support the Latin-1 subset. For
iTunes and MP3 files, SlimServer takes the unicode text and extracts
the Latin-1 versions from that. We should update SlimServer to, first,
do the same with Ogg & FLAC tags, then later, add support for full
UTF8.
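
As a rough illustration of that first step (Python, not the actual SlimServer change), a Vorbis-style `FIELD=value` comment can be decoded from UTF-8 and reduced to Latin-1. The function name `vorbis_comment_to_latin1` is hypothetical, and replacing unsupported characters with "?" is just one assumed policy:

```python
from typing import Tuple


def vorbis_comment_to_latin1(raw: bytes) -> Tuple[str, bytes]:
    """Split a UTF-8 'FIELD=value' comment and reduce the value to Latin-1."""
    field, _, value = raw.decode("utf-8").partition("=")
    # Field names are case-insensitive, so normalise them to upper case.
    # Characters outside Latin-1 are replaced with "?" (assumed policy).
    return field.upper(), value.encode("latin-1", errors="replace")


print(vorbis_comment_to_latin1("artist=Gabriel Fauré".encode("utf-8")))
# ('ARTIST', b'Gabriel Faur\xe9')
print(vorbis_comment_to_latin1("composer=Dvořák".encode("utf-8")))
# ('COMPOSER', b'Dvo?\xe1k')
```

Western European names come through intact; anything beyond Latin-1 would have to wait for the full UTF-8 support mentioned above.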

-dean

Bob Myers
2004-05-12, 23:27
Pat Farrell wrote:
> At 05:35 AM 5/12/2004, Dolf Dijkstra wrote:
> >As far as I know, the flac (ogg) comments are UTF-8. This means that flac
> >is capable of storing extended characters.
>
> A little known, and to my knowledge never implemented, feature of ID3V2
> tags is that they also support both UTF-8 and real unicode.

Unicode is a glyph mapping, and UTF-8 is one particular encoding of
Unicode. Perhaps by "real unicode" you mean UCS-2. This may sound
pedantic, but I think clarity and accuracy are important in discussing
these issues.

Publicly available 16x16 Unicode fonts are available which should be
able to be displayed on the SB display, although perhaps only in LARGE
mode.

--
Bob M.

Pat Farrell
2004-05-13, 00:02
At 02:27 AM 5/13/2004, Bob Myers wrote:
>pfarrell wrote:
> > tags is that they also support both UTF-8 and real unicode.
>Unicode is a glyph mapping, and UTF-8 is one particular encoding of
>Unicode. Perhaps by "real unicode" you mean UCS-2. This may sound
>pedantic, but I think clarity and accuracy are important in discussing
>these issues.


Sigh.
I think this is pushing past "discuss" and belongs on the "developer" list,
but that is just IMHO.

Back when I wrote code that cared about this, Unicode was 16 bit
data, and UTF-8 was a way to store 16-bit codes in less than
two octets per character, for "typical data".
Java deals with Characters as 16 bit data.
UTF-8 is handy for Americans and English because it encodes
every US-ASCII character in one octet, so there is no
enlargement and you can use normal text editors like vi or
notepad. Of course, if you have non-US-ASCII Latin-1 characters, they
need 8 bits, which UTF-8 stores in two octets. And for
non-Roman alphabets, it can take three or more octets.
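
A quick Python check of those sizes. The sample characters are chosen here for illustration and are not from the thread:

```python
def utf8_len(ch: str) -> int:
    """Number of octets UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


# ASCII letter, Latin-1 letter, Greek letter, euro sign, CJK ideograph.
for ch in ["A", "é", "α", "€", "語"]:
    print(f"U+{ord(ch):04X}  utf-8: {utf8_len(ch)}  "
          f"utf-16: {len(ch.encode('utf-16-be'))}")
```

ASCII stays at one octet, accented Latin and Greek letters take two, and the euro sign and the CJK character take three, while all five take two octets in UTF-16.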

I have no idea how Perl handles Unicode or any of the UTF-* /UCS*
classifications. I'm not a professional Perl dude. The little Perl
that I've seen plays pretty fast and loose with bytes, characters and octets.
Which is consistent with the Perl "more than one way to do it" concept.

C# is Unicode smart, and C++ can be, if the programmer wants to be.
More than a few C++ programmers, especially in the old Windows SDK days
were wedded to "a character is eight bits" as a style. There is a non-trivial
amount of that code in use today, a decade after Windows NT came out.

What I do know is that proper internationalization is hard, and it is
very hard to retro-fit it into code that was written with the
assumption that you were dealing with American or English ASCII.

While Windows NT (and W2K and XP) all support Unicode at the
operating system level, there are many, many utilities and programs
that do not. I know that my SlimServer utilities do not, even though
they are written in Java. I detect and whine about encodings with the
idea that if there is sufficient demand from users, I can look at adding
support. I've not seen any ID3 tags with non-ASCII characters in the tags.

I think it is reasonable to request that "some developer" make the FLAC
and Ogg tag reader and usage at least be as smart as the MP3 ID3 reader.
Changing the displays for alternative character sets, changing the network
protocol to support them, etc. may be unrealistic short term goals.
Whether they are realistic or not, I sure have other things I'd rather
have done first.

It may be faster to change the tags using something like my utility
program than it is to wait until the mainstream SlimServer product
gets changed.

Pat

Peter Nõu
2004-05-13, 00:47
At 05:27 2004-05-13, you wrote:
>>FLAC uses Vorbis tags, so it is the same.
>>
>>The SlimServer does not yet support UTF8/Unicode character sets.
>
>That's true, but SlimServer does support the Latin-1 subset. For iTunes
>and MP3 files, SlimServer takes the unicode text and extracts the Latin-1
>versions from that. We should update SlimServer to, first, do the same
>with Ogg & FLAC tags, then later, add support for full UTF8.

Great! That's great news. (And this is a lame "me too" comment, but I do
subscribe to the view that for every complaint I give air time, at least
the same amount of worthy praise should be allowed to fly.)

/peter

Phil Barrett
2004-05-13, 10:19
On 13 May 2004, at 07:27, Bob Myers wrote:
> Publicly available 16x16 Unicode fonts are available which should be
> able to be displayed on the SB display, although perhaps only in LARGE
> mode.

Not possible. The display only supports 8 custom characters at once.

So you could only show two 16x16 arbitrary characters at once, with the
rest of the display blank or showing standard characters.

Phil

Phil Barrett
2004-05-13, 10:52
Sorry to reply twice to two halves of the same email, but...

On 13 May 2004, at 07:27, Bob Myers wrote:
> Unicode is a glyph mapping, and UTF-8 is one particular encoding of
> Unicode. Perhaps by "real unicode" you mean UCS-2. This may sound
> pedantic, but I think clarity and accuracy are important in discussing
> these issues.

If you think clarity and accuracy are important, then please be clear
and accurate yourself.

Unicode is most definitely *not* a glyph mapping; it's a character
mapping.

UCS-2 is a pre-Unicode 2.0 encoding which "should now be avoided"
according to unicode.org. UTF-8, UTF-16 and UTF-32 (a subset of UCS-4)
are the encodings now.
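
A short Python sketch of the practical difference. The example character is chosen here for illustration: it lies outside the 16-bit range, so a fixed two-octet scheme like UCS-2 cannot represent it, while UTF-16 uses a surrogate pair:

```python
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, code point above U+FFFF

utf16 = clef.encode("utf-16-be")
# Split the UTF-16 bytes into 16-bit code units.
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

print([hex(u) for u in units])        # ['0xd834', '0xdd1e']
print(len(clef.encode("utf-8")))      # 4
print(len(clef.encode("utf-32-be")))  # 4
```

One character, two 16-bit units: code that assumes "one character is always 16 bits" breaks on it in the same way that 8-bit code breaks on é.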

Phil