Further investigation into mixed encodings in web pages



Renyao
2005-08-14, 19:33
I previously reported mixed encodings in web pages. I wondered whether this happens during generation of the web pages, during extraction while scanning music folders, or both.

I installed the mysql database and the perl DBD::mysql module and configured slimserver to use mysql as its database back end. From
the mysql client navicat, I can clearly see what's really in the database.

When Slimserver extracts artist, album, etc. from id3 tags, it stores them in table fields in "utf8 from latin-1" encoding.

When Slimserver extracts artist, album, etc. from directory/file names, it stores them in table fields in "raw" encoding (that
is, latin-1 or cp936, which can be considered the same here because the bytes are stored unchanged).


Let's have a look at the attached utf8fromlatin_vs_cp936_pict1.bmp.

Look at the title field in the row where id = 31 in the albums@slimdata table. The original data is a string of Chinese characters, but the content here is twice as long as it should be.
Apparently, the Chinese characters were treated as latin-1 and wrongly converted to utf8, and they will never display correctly without
being converted back. To confirm, I cut the content and pasted it into Notepad, saved it as aaa.txt, and looked at aaa.txt
with debug. The bytes are a series of "11......10......" pairs (utf8
of latin-1!!!). Utf8 converted from cp936 should usually be a series of "111.....10......10......" triples.
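
The byte patterns described above can be reproduced with a small sketch. It is an illustration only: the string "中文" is a made-up sample rather than a title from the database, and the variable names are hypothetical. It assumes the bad rows were produced by reading cp936 bytes as latin-1 and then encoding the result as utf8.

```python
# Illustration only: "中文" is a sample string, not data from the slimdata tables.
original = "中文"

raw = original.encode("cp936")        # bytes as they sit in a tag or file name
mojibake = raw.decode("latin-1")      # wrong assumption: one byte = one latin-1 character
stored = mojibake.encode("utf-8")     # the "utf8 from latin-1" form
expected = original.encode("utf-8")   # the "utf8 from cp936" form


def bits(data: bytes) -> str:
    """Render each byte as eight binary digits."""
    return " ".join(f"{b:08b}" for b in data)


print(len(raw), len(stored), len(expected))
print("utf8 from latin-1:", bits(stored))
print("utf8 from cp936:  ", bits(expected))

# Converting back: undo the utf8 step, restore the raw bytes, decode as cp936.
repaired = stored.decode("utf-8").encode("latin-1").decode("cp936")
print(repaired == original)
```

In this sample the wrongly converted form is twice as long as the raw cp936 bytes, and every character becomes a two-byte pair, which matches the "11......10......" pattern seen with debug.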

The title field in the row where id = 29 shows the Chinese characters correctly. It was extracted from the directory name using
"guess tags". However, uc (and lc) without "use locale" are misapplied
to Chinese or utf8 characters.

Picture utf8fromlatin_vs_cp936_pict2.bmp is a snapshot of the
contributers@slimdata table. The name fields in rows where id<=56
are characters wrongly converted from latin-1. The name fields in the rows where id>57 are correct cp936 characters.

During installation of mysql, the latin-1 charset was chosen. I have
not experimented with the utf8 charset in mysql.

What's important is that the character encoding in the database should be uniform.

I hope Slimdevices will address this problem soon.

Dan Sully
2005-08-14, 20:53
* Renyao shaped the electrons to say...

>During installation of mysql, the latin-1 charset is chosen. I have
>not experimented with utf8 charset with mysql.
>
>What's important is that the character encoding in the database should
>be uniform.

Yes - that is correct. I was looking over your previous patch, and wondered
how you had gotten non-UTF8 data into the database. Thanks for the pointer.

>I hope Slimdevices will address this problem soon.

I'll be taking a look into it.

-D
--
<phone> i am a sausage fan

Dan Sully
2005-08-14, 21:18
Does this help out?

--- Slim/Music/Info.pm (revision 3972)
+++ Slim/Music/Info.pm (working copy)
@@ -1011,7 +1011,7 @@
         $::d_info && msg("$tags[$i] => $match\n");
         $match =~ tr/_/ / if (defined $match);
         $match = int($match) if $tags[$i] =~ /TRACKNUM|DISC{1,2}/;
-        $taghash->{$tags[$i++]} = $match;
+        $taghash->{$tags[$i++]} = Slim::Utils::Unicode::utf8decode_locale($match);
     }
     return;
 }

-D
--
<jwb> burning substations is manifestly the desire of the free market. hooray for utility deregulation
<jwb> the government would never be able to set fires with such brutal efficiency

Renyao
2005-08-15, 02:59
Hi Dan,

Thank you very much for your patch.

I commented out my own patch temporarily and applied yours.

The artist, album and contributor information extracted from directory/file names is basically correct now. However, Chinese characters with "_" in them are missing.

The artist, album and contributor information extracted from mp3 id3 tags is still utf8_from_latin-1 and cannot be displayed correctly.

I will now explain the attached screenshots.

In picture browse_artists.bmp, the upper part of the left frame is the
"utf8_from_latin-1" area (wrong) and the lower part of the left frame is the "utf8_from_cp936" area (right).

In picture browse_albums.bmp, the upper part of the left frame is the
"utf8_from_cp936" area (right) and the lower part of the left frame is the "utf8_from_latin-1" area (wrong).

Picture songinfo_1.bmp shows that mp3 tags are displayed wrongly.

Picture songinfo_2.bmp shows that dir/file tags are displayed correctly, but the location display is wrong.

Picture songinfo_3.bmp shows that dir/file tags are displayed almost correctly, and the location display is right. Because the last Chinese character contains "_", this character is missing in "Song Info for" and "Title", but it is
present in "Location" (the one just before .wav).

The difference between songinfo_2.bmp and songinfo_3.bmp is that
the file name in songinfo_3.bmp contains "_" while the one in songinfo_2.bmp doesn't. This leads to different decoding and therefore a different display of Location (there must be a bug here).
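
One possible explanation for the missing characters, sketched below as an assumption rather than a confirmed diagnosis: in cp936 the second byte of a character can be 0x5F, the same value as "_", so replacing "_" with a space on the raw bytes before decoding breaks the character. The file name and both function names here are hypothetical.

```python
# Hypothetical file name: one cp936 character whose second byte is 0x5F ("_"),
# followed by ".wav". Which character it is does not matter for the sketch.
raw = b"\x81\x5f.wav"
char = raw[:2].decode("cp936")


def tidy_bytes_first(name: bytes) -> str:
    """Replace "_" on the raw bytes, then decode (the order that loses characters)."""
    return name.replace(b"_", b" ").decode("cp936", errors="replace")


def decode_first(name: bytes) -> str:
    """Decode to text first, then replace "_" (safe for multi-byte characters)."""
    return name.decode("cp936").replace("_", " ")


print(char in tidy_bytes_first(raw))   # the character does not survive
print(decode_first(raw) == char + ".wav")
```

If this is what happens, doing the "_" replacement after the string has been decoded would keep such characters intact.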

Thank you again for your patch and endeavour.

Dan Sully
2005-08-15, 11:08
* Renyao shaped the electrons to say...

>In picture browse_artists.bmp, the upper part of left frame is
>"utf8_from_latin-1" area (wrong) and the lower part of left frame is
>"utf8_from_cp936" area (right).
>
>In picture browse_albums.bmp, the upper part of left frame is
>"utf8_from_cp936" area (right) and the lower part of left frame is
>"utf8_from_latin-1" area (wrong).
>
>Picture songinfo_1.bmp shows that mp3 tags are wrongly displayed.
>
>Picture songinfo_2.bmp shows that dir/file tags are correctly
>displayed. But the location display is wrong.

Renyao - can you send me some of the MP3 files that are being displayed as
'utf8_from_latin-1' ?

Thanks.

-D
--
<faisal> my life is collapsing to what will soon be NEGATIVE INTEGER degrees of separation.

Renyao
2005-08-16, 01:50
Hi Dan,

I sent an mp3 file to you by e-mail. I cannot upload it here. It says "File is too large".

Dan Sully
2005-08-16, 09:25
* Renyao shaped the electrons to say...

>I sent an mp3 file to you by e-mail. I cannot upload it here. It says
>"File is too large".

Thanks - I'll take a look.

What should the encoding be? Big5? EUC-CN?

-D
--
<noah> I used to be indecisive, but now I'm not sure.

Dan Sully
2005-08-16, 12:07
* Dan Sully shaped the electrons to say...

>>I sent an mp3 file to you by e-mail. I cannot upload it here. It says
>>"File is too large".
>
>Thanks - I'll take a look.
>
>What should the encoding be? Big5? EUC-CN?

I've checked in a change as r3983 which might help with this, by using your
locale in the Encode::Guess suspects list.

But really what needs to happen here is that your tags need to be updated to
use ID3 v2.3/UTF-16 or v2.4/UTF-8. Right now you have a multi-byte encoding
stuffed into what the ID3 spec specifies as a Latin1 only field.
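
As a rough illustration of the idea of trying the locale's encoding as a suspect: the sketch below uses a simple first-match strategy, which is not how Encode::Guess itself decides, and the function name and default suspect list are assumptions made for the example.

```python
def decode_with_suspects(raw: bytes, suspects=("utf-8", "cp936")) -> str:
    """Return the text from the first suspect encoding that decodes cleanly.

    Hypothetical helper: "cp936" stands in for the locale's encoding.
    Falls back to latin-1, which accepts any byte sequence.
    """
    for encoding in suspects:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")


# A tag holding cp936 bytes in a field the spec defines as latin-1 only.
tag_bytes = "中文".encode("cp936")
print(decode_with_suspects(tag_bytes))

# Genuine latin-1 data still decodes through the fallback.
print(decode_with_suspects(b"caf\xe9"))
```

Guessing like this can still pick the wrong encoding for short or ambiguous strings, which is why retagging the files with a declared Unicode encoding is the more reliable fix.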

-D
--
It's the wrong trousers Gromit, and they've gone wrong!

Renyao
2005-08-16, 13:50
>But really what needs to happen here is that your tags need to be updated to
>use ID3 v2.3/UTF-16 or v2.4/UTF-8. Right now you have a multi-byte encoding
>stuffed into what the ID3 spec specifies as a Latin1 only field.

Sorry about that. The mp3 files were downloaded from the internet by
my son, who loves them and likes playing them through the SB2.

Renyao
2005-08-17, 11:42
Hi Dan,

Now Chinese characters are displayed correctly. The strings in the mysql database are all "utf8_from_cp936". The annoying list of "Malformed UTF-8 character ..." messages has disappeared.

BTW, when will "Search For Songs" and "Advanced Search" behave
correctly?

Thank you again for your effort.

Dan Sully
2005-08-17, 11:50
* Renyao shaped the electrons to say...

>Now Chinese characters are displayed correctly. The strings in mysql
>database are all "utf8_from_cp936". The annoying list of "Malformed
>UTF-8 character ..." disappears.

Great!

>BTW, when will "Search For Songs" and "Advanced Search" behave correctly?

What exactly doesn't work? Can you file a bug on it?

http://bugs.slimdevices.com/

Thanks.

-D
--
"It has become appallingly obvious that our technology has exceeded our humanity." - Albert Einstein