Problem with character-sets on the CLI



jimwillsher
2007-02-23, 12:07
Hi,

Not sure if this is the right place to post...

Slimserver 6.5.1 on Ubuntu 6.06. Accessing the command line from a Vista PC (telnet 192.168.1.150 9090)

I have a track called "7. Tráthnóna Beag Aréir" by Clannad. When I look at the track on the slimserver interface (IE7) it appears correctly. When I paste that same trackname from IE7 into a hex editor I get:

7. Tráthnóna Beag Aréir
37 2E 20 54 72 E1 74 68 6E F3 6E 61 20 42 65 61 67 20 41 72 E9 69 72

The interesting characters (for me) here are:
á E1
ó F3
é E9


When I access the CLI and enter "title ?" I get:

00%3A04%3A20%3A06%3A1c%3A47 title Tr%C3%A1thn%C3%B3na%20Beag%20Ar%C3%A9ir

The spaces have been converted to %20, which I would expect. However if we take the same three characters from earlier we get:

á %C3%A1
ó %C3%B3
é %C3%A9

Could somebody help me to understand what is being returned? According to the CLI documentation the strings are returned as UTF-8, and the encoding on the IE7 web page is also UTF-8. So, why two characters instead of one, and is there a direct conversion between them? Why does the CLI return %C3%A1 when it should probably return just E1 (or %E1)?

Many thanks!


Jim

PS For info, I'm querying the CLI via a C++ program (Unicode), but the DOS command line returns the same.

mherger
2007-02-23, 12:13
> So, why two characters
> instead of one, and is there a direct conversion between them? Why does
> the CLI return %C3%A1 when it should probably return just E1 (or %E1) ?

The CLI does additionally url-encode the strings. You should find plenty
of resources on the internet on how to decode these strings (if your
language doesn't offer a function/method yet).
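
For anyone without a ready-made decoder, here is a minimal sketch of the URL-decoding step in portable C++. The names `percentDecode` and `hexValue` are made up for illustration and are not part of SlimServer or any library. The sketch assumes the CLI escapes bytes only as %XX and leaves malformed escapes as they are. Note that the output is still a UTF-8 byte string, so %C3%A1 becomes the two bytes C3 A1, not the single byte E1.

```cpp
#include <string>

// Returns the value of one hex digit, or -1 if c is not a hex digit.
static int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Replaces every %XX escape with the single byte it names.
// The result is still UTF-8 encoded; malformed escapes are copied unchanged.
std::string percentDecode(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] == '%' && i + 2 < in.size())
        {
            int hi = hexValue(in[i + 1]);
            int lo = hexValue(in[i + 2]);
            if (hi >= 0 && lo >= 0)
            {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;   // skip the two hex digits just consumed
                continue;
            }
        }
        out.push_back(in[i]);
    }
    return out;
}
```

Feeding it the player ID from the first post, `percentDecode("00%3A04%3A20%3A06%3A1c%3A47")`, should give back the plain MAC address with colons.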

--

Michael

-----------------------------------------------------------------
http://www.herger.net/SlimCD - your SlimServer on a CD
http://www.herger.net/slim - AlbumReview, Biography, MusicInfoSCR

jimwillsher
2007-02-23, 13:00
Many thanks Michael. I must still be missing something though.

When I paste:

title Tr%C3%A1thn%C3%B3na%20Beag%20Ar%C3%A9ir

into website http://www.simplelogic.com/Developer/URLDecode.asp

I get:

title Tráthnóna Beag Aréir

instead of

title Tráthnóna Beag Aréir


i.e. two characters instead of a single accented character.

When I paste

title Tr%E1thn%F3na%20Beag%20Ar%E9ir

into the same decoder website, I get the answer I'm expecting. So as well as being URLEncoded, it looks like there's something else going on in there. Any ideas? Possibly a 7-bit/8-bit issue, or a UTF-8 issue? I still can't see how the CLI returns %C3%A1 when I'm expecting %E1.

Many thanks for your time.



Jim

mherger
2007-02-23, 13:09
> title Tráthnóna Beag Aréir

I ended up adding "charset:iso-8859-1" to the CLI requests.

> Possibly a 7-bit/8-bit issue, or a UTF-8 issue?

I'm sorry, I'm no expert in that area. I took the pragmatic approach
described above as it seemed to do what I expected...

--

Michael


jimwillsher
2007-02-23, 13:50
Hmm...unfortunately the "title ?" and "artists ?" etc. parameters do not seem to accept the charset argument. Unless I'm doing it wrong?


Jim

peterw
2007-02-23, 14:57
> 7. Tráthnóna Beag Aréir
> 37 2E 20 54 72 E1 74 68 6E F3 6E 61 20 42 65 61 67 20 41 72 E9 69 72
>
> The interesting characters (for me) here are:
> á E1
> ó F3
> é E9
>
> á %C3%A1
> ó %C3%B3
> é %C3%A9



See http://en.wikipedia.org/wiki/Utf8

UTF-8 requires at least 16 bits to represent a character with an ordinal value greater than 0x7F. The CLI appears to be URI-escaping individual bytes of a UTF-8 encoded sequence.

0xE1 == binary 00011-100001 (dashes for readability)
In UTF-8, that would be 110-00011 10-100001
Binary 110-00011 == 0xC3
Binary 10-100001 == 0xA1
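
The same bit shuffling can be written out as code. This is only a sketch in portable C++11; the function name `encodeUtf8` is invented for the example, and it assumes the input is a valid Unicode code point (no surrogates, nothing above 0x10FFFF). The two-byte branch is exactly the 110-xxxxx 10-xxxxxx split shown above.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (at most 0x10FFFF, not a surrogate).
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out.push_back(static_cast<char>(cp));                          // 0xxxxxxx
    }
    else if (cp < 0x800)
    {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));            // 110xxxxx
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));          // 10xxxxxx
    }
    else if (cp < 0x10000)
    {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));           // 1110xxxx
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else
    {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));           // 11110xxx
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

With this, `encodeUtf8(0xE1)` yields the bytes C3 A1, `encodeUtf8(0xF3)` yields C3 B3, and `encodeUtf8(0xE9)` yields C3 A9, matching what the CLI returns once the % escapes are removed.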

jimwillsher
2007-02-23, 15:57
Hmmm...interesting.

I can see the logic. But how on earth would I convert it BACK to a sensible value - in any language!? I'm not sure it's a practical conversion. e.g. how does C3A1 get to become E1?

Confused....

peterw
2007-02-23, 16:33
$ perl -e 'use CGI; use Encode; $in = "%C3%A1"; $out = decode("utf8", CGI::unescape($in)); \
print "\n\"$in\" decodes to \"$out\"\n";'

"%C3%A1" decodes to "á"

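For those not using Perl, the second half of that one-liner (turning UTF-8 bytes back into code points, i.e. how C3 A1 becomes E1) can be sketched in portable C++11 as well. The name `decodeUtf8` is hypothetical, the input is assumed to be already URL-decoded, and this is not a full validator: malformed or truncated sequences become U+FFFD, while overlong forms and surrogates are not rejected.

```cpp
#include <string>

// Decodes UTF-8 bytes into Unicode code points.
// Sketch only: malformed or truncated sequences are replaced with U+FFFD;
// overlong forms and surrogates are not rejected.
std::u32string decodeUtf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;   // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }   // stray continuation byte

        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k)
        {
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80)
            {
                ok = false;   // sequence ended early
                break;
            }
            // Append the low six payload bits of the continuation byte.
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

For the bytes C3 A1 the lead byte contributes 00011 and the continuation byte contributes 100001, which joined together give 11100001, i.e. 0xE1.
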
jimwillsher
2007-02-23, 17:07
Hmmm....perl. Not really a solution. But thanks anyway.

jimwillsher
2007-02-23, 17:19
Okay, sorted it.

For the C++ people (not this awful Perl stuff) the solution is:



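#include <vector>

// Converts a UTF-8 byte buffer (already URL-decoded) to a CString.
CString CSqueezeboxStatus::FromUTF8(LPBYTE pUTF8, int nSize)
{
    CString cResult;
    if (pUTF8 == NULL || nSize <= 0)
        return cResult;

    // First call: ask how many wide characters the conversion needs.
    int nChars = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)pUTF8, nSize, NULL, 0);
    if (nChars == 0)
    {
        printf("MultiByteToWideChar() failed -> %lu\n", GetLastError());
        return cResult;
    }

    // Second call: convert into a buffer of that size, plus a terminator.
    std::vector<WCHAR> wszResult(nChars + 1);
    int iRes = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)pUTF8, nSize, &wszResult[0], nChars);
    if (iRes == 0)
    {
        printf("MultiByteToWideChar() failed -> %lu\n", GetLastError());
        return cResult;
    }

    wszResult[iRes] = L'\0';
    cResult = &wszResult[0];
    return cResult;
}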



Many thanks to the contributors to this thread who pointed me in the right direction. This problem has nagged me for months!



Jim

peter
2007-02-24, 01:34
jimwillsher wrote:
> Okay, sorted it.
>
> For the C++ people (not this awful Perl stuff) the solution is:
>

You shouldn't be using slimserver then ;)

Regards,
Peter

jimwillsher
2007-02-24, 01:43
Don't get me wrong, I think SlimServer is excellent. I have an SB3, having previously had an SB2 and an SB1. But Perl is almost exclusively the realm of *nix platforms; very very little Perl stuff is ever written for Windows systems. Windows development is typically done in C/C++, which is what I'm using.

But fortunately there are standards, and in this case the standard is UTF-8. As such, I can now successfully interface my C++ code with the output from the CLI.


Jim

peter
2007-02-24, 02:09
jimwillsher wrote:
> Don't get me wrong, I think SlimServer is excellent. I have an SB3,
> having previously had an SB2 and an SB1. But Perl is almost exclusively
> the realm of *nix platforms; very very little Perl stuff is ever
> written for Windows systems. Windows development is typically done in
> C/C++, which is what I'm using.
>

VB is still quite popular as I understand. I've actually done some
Windows stuff in Perl, it wasn't so bad really, but I'd like to see a
good C++ slimserver interface!

> But fortunately there are standards, and in this case the standard is
> UTF-8. As such, I can now successfully interface my C++ code with the
> output from the CLI.
>

Make sure you check it works well in Chinese ;)

Regards,
Peter (character sets always give me a headache)

jimwillsher
2007-02-24, 02:33
Hmmm...Chinese, yes. I have plenty of Gaelic songs, but not Chinese.

Anyway, new version of SqueezeMSN about to be uploaded (http://www.jimwillsher.co.uk/Site/Software/SqueezeMSN.php), now that I've fixed the Unicode issue :-)


Jim

peter
2007-02-24, 02:43
jimwillsher wrote:
> Hmmm...Chinese, yes. I have plenty of Gaelic songs, but not Chinese.
>
> Anyway, new version of SqueezeMSN about to be uploaded
> (http://www.jimwillsher.co.uk/Site/Software/SqueezeMSN.php), now that
> I've fixed the Unicode issue :-)
>

Great. Unfortunately I use GAIM to connect to MSN so it's not for me ;)

Perhaps you should get together with this guy:
http://software.johnroark.net/

Regards,
Peter

Marc Sherman
2007-02-24, 13:52
jimwillsher wrote:
> Hmmm...Chinese, yes. I have plenty of Gaelic songs, but not Chinese.
>
> Anyway, new version of SqueezeMSN about to be uploaded
> (http://www.jimwillsher.co.uk/Site/Software/SqueezeMSN.php), now that
> I've fixed the Unicode issue :-)

If you really want to stress test your Unicode implementation, try
Turkish. There's essentially a bug in the Unicode spec, where the
following Java code will fail if run in a Turkish locale:

assert("SIMPLE".toLowerCase().contains("i"));

For a more topical example:

if(filename.toLowerCase().endsWith("aiff")) {
// ... handle aiff audio file
}
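
Since the client code in this thread is C++, here is a rough sketch of one way to sidestep that trap: fold case for the ASCII letters A-Z only, so the comparison never consults the current locale. The function names `asciiToLower` and `endsWithIgnoreAsciiCase` are invented for this example. In Java, the usual equivalent is to pass an explicit locale, e.g. `toLowerCase(Locale.ROOT)`.

```cpp
#include <string>

// Lowercases only the ASCII letters A-Z; every other byte is left untouched.
// The result does not depend on the current locale.
std::string asciiToLower(std::string s)
{
    for (char& c : s)
    {
        if (c >= 'A' && c <= 'Z')
            c = static_cast<char>(c - 'A' + 'a');
    }
    return s;
}

// True if name ends with suffix, comparing ASCII letters case-insensitively.
bool endsWithIgnoreAsciiCase(const std::string& name, const std::string& suffix)
{
    if (suffix.size() > name.size())
        return false;
    return asciiToLower(name.substr(name.size() - suffix.size())) == asciiToLower(suffix);
}
```

Because only plain ASCII letters are touched, `endsWithIgnoreAsciiCase("track.AIFF", "aiff")` gives the same answer whatever locale the program runs under, and UTF-8 multi-byte sequences in the file name pass through unchanged.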

- Marc