Home of the Squeezebox™ & Transporter® network music players.
  1. #1
    Senior Member
    Join Date
    Apr 2005
    Location
    Scotland
    Posts
    393

    Problem with character-sets on the CLI

    Hi,

    Not sure if this is the right place to post...

    SlimServer 6.5.1 on Ubuntu 6.06. Accessing the command line from a Vista PC (telnet 192.168.1.150 9090)

    I have a track called "7. Tráthnóna Beag Aréir" by Clannad. When I look at the track on the slimserver interface (IE7) it appears correctly. When I paste that same trackname from IE7 into a hex editor I get:

    7. Tráthnóna Beag Aréir
    37 2E 20 54 72 E1 74 68 6E F3 6E 61 20 42 65 61 67 20 41 72 E9 69 72

    The interesting characters (for me) here are:
    á E1
    ó F3
    é E9


    When I access the CLI and enter "title ?" I get:

    00%3A04%3A20%3A06%3A1c%3A47 title Tr%C3%A1thn%C3%B3na%20Beag%20Ar%C3%A9ir

    The spaces have been converted to %20, which I would expect. However if we take the same three characters from earlier we get:

    á %C3%A1
    ó %C3%B3
    é %C3%A9

    Could somebody help me to understand what is being returned? According to the CLI documentation the strings are returned as UTF-8, and the encoding on the IE7 web page is also UTF-8. So, why two characters instead of one, and is there a direct conversion between them? Why does the CLI return %C3%A1 when it should probably return just E1 (or %E1)?

    Many thanks!


    Jim

    PS For info, I'm querying the CLI via a C++ program (Unicode), but the DOS command line returns the same.

  2. #2
    Babelfish's Best Boy mherger
    Join Date
    Apr 2005
    Location
    Switzerland
    Posts
    20,355

    Problem with character-sets on the CLI

    > So, why two characters
    > instead of one, and is there a direct conversion between them? Why does
    > the CLI return %C3%A1 when it should probably return just E1 (or %E1) ?


    The CLI additionally URL-encodes the strings. You should find plenty
    of resources on the internet on how to decode these strings (if your
    language doesn't already offer a function/method for it).
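
As a rough illustration of the URL-decoding step in C++, here is a minimal sketch. `percentDecode` and `hexValue` are hypothetical helper names, not part of SlimServer or any library. The sketch assumes plain `%XX` escapes as in the CLI output above, and copies malformed escapes through unchanged.

```cpp
#include <string>

// Value of one hexadecimal digit, or -1 if c is not a hex digit.
static int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: turns each %XX escape back into the single byte it
// stands for. The result is a string of raw bytes; no character-set
// conversion is done here.
std::string percentDecode(const std::string& in)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        if (in[i] == '%' && i + 2 < in.size())
        {
            int hi = hexValue(in[i + 1]);
            int lo = hexValue(in[i + 2]);
            if (hi >= 0 && lo >= 0)
            {
                out += static_cast<char>(hi * 16 + lo);
                i += 2; // skip the two hex digits just consumed
                continue;
            }
        }
        out += in[i]; // ordinary character or malformed escape: copy as-is
    }
    return out;
}
```

Applied to `Tr%C3%A1thn%C3%B3na%20Beag%20Ar%C3%A9ir`, this gives back the raw bytes with C3 A1, C3 B3 and C3 A9 in place of the escapes. Those bytes are still UTF-8, so a separate UTF-8 decoding step is needed before each pair becomes one accented character.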

    --

    Michael

    -----------------------------------------------------------------
    http://www.herger.net/SlimCD - your SlimServer on a CD
    http://www.herger.net/slim - AlbumReview, Biography, MusicInfoSCR

  3. #3
    Senior Member
    Join Date
    Apr 2005
    Location
    Scotland
    Posts
    393
    Many thanks Michael. I must still be missing something though.

    When I paste:

    title Tr%C3%A1thn%C3%B3na%20Beag%20Ar%C3%A9ir

    into website http://www.simplelogic.com/Developer/URLDecode.asp

    I get:

    title TrÃ¡thnÃ³na Beag ArÃ©ir

    instead of

    title Tráthnóna Beag Aréir


    i.e. two characters instead of a single accented character.

    When I paste

    title Tr%E1thn%F3na%20Beag%20Ar%E9ir

    into the same decoder website, I get the answer I'm expecting. So as well as being URLEncoded, it looks like there's something else going on in there. Any ideas? Possibly a 7-bit/8-bit issue, or a UTF-8 issue? I still can't see how the CLI returns %C3%A1 when I'm expecting %E1.

    Many thanks for your time.



    Jim

  4. #4
    Babelfish's Best Boy mherger
    Join Date
    Apr 2005
    Location
    Switzerland
    Posts
    20,355

    Problem with character-sets on the CLI

    > title Tráthnóna Beag Aréir

    I ended up adding "charset:iso-8859-1" to the CLI requests.

    > Possibly a 7-bit/8-bit issue, or a UTF-8 issue?


    I'm sorry, I'm no expert in that area. I took the pragmatic approach
    described above as it seemed to do what I expected...

    --

    Michael

    -----------------------------------------------------------------
    http://www.herger.net/SlimCD - your SlimServer on a CD
    http://www.herger.net/slim - AlbumReview, Biography, MusicInfoSCR

  5. #5
    Senior Member
    Join Date
    Apr 2005
    Location
    Scotland
    Posts
    393
    Hmm...unfortunately the "title ?" and "artists ?" etc. parameters do not seem to accept the charset argument. Unless I'm doing it wrong?


    Jim

  6. #6

    UTF-8 + URI encoding

    Quote: Originally posted by jimwillsher

    7. Tráthnóna Beag Aréir
    37 2E 20 54 72 E1 74 68 6E F3 6E 61 20 42 65 61 67 20 41 72 E9 69 72

    The interesting characters (for me) here are:
    á E1
    ó F3
    é E9


    á %C3%A1
    ó %C3%B3
    é %C3%A9
    See http://en.wikipedia.org/wiki/Utf8

    UTF-8 requires at least 16 bits to represent a character with an ordinal value greater than 0x7F. The CLI appears to be URI-escaping individual bytes of a UTF-8 encoded sequence.

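    0xE1 == binary 00011-100001 (zero-padded to the 11 payload bits of a two-byte sequence; dashes for readability)
    In UTF-8, that would be 110-00011 10-100001 (the 110 and 10 prefixes mark the lead byte and the continuation byte)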
    Binary 110-00011 == 0xC3
    Binary 10-100001 == 0xA1
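
The bit arithmetic above can be sketched in C++ as follows. `encodeUtf8` is a hypothetical helper written only for illustration. It assumes a valid code point and does not check for surrogates or values above 0x10FFFF.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid code point; no range or surrogate checks are done.
std::string encodeUtf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                         // 0xxxxxxx
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));           // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));          // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));          // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F)); // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
    }
    return out;
}
```

With this sketch, `encodeUtf8(0xE1)` produces the two bytes C3 A1, `encodeUtf8(0xF3)` produces C3 B3, and `encodeUtf8(0xE9)` produces C3 A9, matching the escapes the CLI returns.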

  7. #7
    Senior Member
    Join Date
    Apr 2005
    Location
    Scotland
    Posts
    393
    Hmmm...interesting.

    I can see the logic. But how on earth would I convert it BACK to a sensible value - in any language!? I'm not sure it's a practical conversion. e.g. how does C3A1 get to become E1?

    Confused....

  8. #8
    Code:
    $ perl -e 'use CGI; use Encode; $in = "%C3%A1"; $out = decode("utf8", CGI::unescape($in)); \
    print "\n\"$in\" decodes to \"$out\"\n";'
    
    "%C3%A1" decodes to "á"
    Last edited by peterw; 2007-02-23 at 16:47. Reason: Encode.pm preferred over utf8::decode
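
For readers not using Perl, the UTF-8 decoding half of that one-liner can be sketched in portable C++. `decodeUtf8` is a hypothetical name. The sketch assumes the %XX escapes have already been turned back into raw bytes, replaces malformed sequences with U+FFFD, and does not reject overlong encodings or surrogates.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: decodes a UTF-8 byte string into Unicode code points.
// Malformed input is replaced with U+FFFD; overlong forms are not rejected.
std::vector<unsigned long> decodeUtf8(const std::string& bytes)
{
    std::vector<unsigned long> out;
    std::string::size_type i = 0;
    while (i < bytes.size())
    {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        unsigned long cp;
        int extra; // number of continuation bytes expected after the lead byte
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else
        {
            out.push_back(0xFFFD); // stray continuation byte or invalid lead
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (int k = 0; k < extra; ++k)
        {
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80)
            {
                ok = false; // sequence cut short or wrong byte type
                break;
            }
            // Append the low six bits of each continuation byte.
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : 0xFFFD);
    }
    return out;
}
```

For the two bytes C3 A1 this computes ((0xC3 & 0x1F) << 6) | (0xA1 & 0x3F) = 0xE1, which is how C3 A1 gets back to E1.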

  9. #9
    Senior Member
    Join Date
    Apr 2005
    Location
    Scotland
    Posts
    393
    Hmmm....perl. Not really a solution. But thanks anyway.

  10. #10
    Senior Member
    Join Date
    Apr 2005
    Location
    Scotland
    Posts
    393
    Okay, sorted it.

    For the C++ people (not this awful Perl stuff) the solution is:

    Code:
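    CString CSqueezeboxStatus::FromUTF8(LPBYTE pUTF8, int nSize)
    {
    	CString cResult;
    	if (pUTF8 == NULL || nSize <= 0)
    		return cResult;
    
    	// First call with no output buffer: returns the number of wide characters needed,
    	// so the input may hold a whole string rather than a single character.
    	int nChars = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)pUTF8, nSize, NULL, 0);
    	if (nChars == 0)
    	{
    		DWORD dwErr = GetLastError();
    		printf("MultiByteToWideChar() failed -> %lu\n", dwErr);
    		return cResult;
    	}
    
    	// Second call: convert into a buffer of exactly that size.
    	CStringW wide;
    	LPWSTR pBuf = wide.GetBuffer(nChars);
    	int iRes = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)pUTF8, nSize, pBuf, nChars);
    	wide.ReleaseBuffer(iRes);	// sets the length and terminates the string
    	cResult = wide;
    	return cResult;
    }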

    Many thanks to the contributors to this thread who pointed me in the right direction. This problem has nagged me for months!



    Jim
