ASX parsing - XML vs ASX case sensitivity issue



Triode
2005-07-16, 05:28
I've been trying to work out why certain asx files don't parse properly - specifically the Virgin Radio ones such as:
http://www.smgradio.com/core/audio/wmp/live.asx?service=vrbb

It turns out that this contains a case mismatch in the opening and closing tags for a couple of lines:
mismatched tag at line 13, column 44, byte 573 at /usr/local/slimserver/CPAN/XML/Parser.pm line 187

It looks to me that XML specifies that case matters for tags, but Microsoft in their wisdom specify the reverse:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/wcewmt/html/_wcesdk_asx_asx_elements_reference.asp
"The opening and closing tags used in an element also can be different case"

Hence, is there an easy way to relax XML::Simple to allow this? Otherwise, could someone suggest a generic search & replace which
would change the case of all tags to a given case prior to calling XML::Simple, without changing the case of any of the rest of the
XML? [my regex is not up to this, but I am sure someone else's is...]

Adrian

Steve Baumgarten
2005-07-18, 08:04
Triode wrote:

> Hence is there an easy way to relax XML::Simple to allow this?
> Otherwise, could someone suggest a generic search & replace which would
> change the case of all tags to a given case prior to calling XML::Simple
> without changing the case of any of the rest of the XML? [my regex is
> not up to this, but I am sure someone else's is...]

This might do it:

$xml =~ s#(</?[^ >]+)#uc($1)#meg;

I've tested it a bit and it seems to catch all tags and only tags in any
random XML/HTML. It assumes you have the entire contents of the ASX file
(i.e., the XML) in a variable $xml.

Breaking that s/// operation apart a bit, it means take anything that
starts with a left angle bracket, has a '/' character following
(optionally) and then any series of characters up to a space or a right
angle bracket. For each such occurrence, uppercase what was found and
substitute it for the match.

The options on the end of the s/// operation tell perl to treat the
string as 'm'ultiline; 'g'lobal replacement; 'e'val the replacement
string rather than use it verbatim. In this case "uc($1)" tells perl to
run the uc() function on the match to turn it into uppercase.
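
For anyone who wants to see the pieces separately, here is the same substitution written with the /x flag so each part can carry a comment. This is only a sketch: the sub name uppercase_tags and the sample string are made up for illustration, and the {} delimiters are used purely so that # can start a comment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern and replacement as the one-liner, spread out with /x.
sub uppercase_tags {
    my ($xml) = @_;
    $xml =~ s{
        (           # capture group 1:
          <         #   a left angle bracket
          /?        #   an optional slash, for closing tags
          [^ >]+    #   one or more characters that are neither a space nor '>'
        )
    }{uc($1)}xmeg;
    return $xml;
}

# Mixed-case sample input, invented for this example.
print uppercase_tags(qq{<Entry><ref href="x"/></ENTRY>\n});
# prints <ENTRY><REF href="x"/></ENTRY>
```

Note that the space inside [^ >] stays literal even under /x, because /x does not ignore whitespace inside a bracketed character class. The 'm' flag only changes how ^ and $ behave, so it has no effect on this particular pattern.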

Like I said, I haven't exhaustively tested this, but it does seem to
work correctly on the various test cases I've thrown at it. Here's my
sample script with test data included in the body of the script:

--------------------------------------------
#! /usr/bin/perl

use strict;
undef $/;

my $xml = <DATA>;
$xml =~ s#(</?[^ >]+)#uc($1)#meg;
print $xml;
exit 0;

__DATA__
<Testing>
<This>
Some stuff
<a
tag=whatever>
something
</a>
</This>
</Testing>
--------------------------------------------

The output:

<TESTING>
<THIS>
Some stuff
<A
tag=whatever>
something
</A>
</THIS>
</TESTING>

I'm sure there are cases that get messed up by the s/// operation as it
is, but if it's used only when XML::Simple complains (that is, only when
normal parsing fails), then there's not much to lose in giving it a try.
The alternative is to reject the ASX file entirely, which is pretty much
the worst case (as far as the user is concerned) anyway.
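
A rough sketch of that "only when parsing fails" idea follows. The sub parse_with_case_fallback and the toy $strict parser are hypothetical names invented for this example, not SlimServer code; in real use the code ref would presumably wrap the XML::Simple call.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not SlimServer code. $parse is any code ref that
# returns a parsed structure or dies on malformed input, for example
# sub { XML::Simple::XMLin($_[0]) }.
sub parse_with_case_fallback {
    my ($xml, $parse) = @_;

    # First attempt: hand the document over untouched.
    my $result;
    return $result if eval { $result = $parse->($xml); 1 };

    # Parsing died: uppercase everything matched as a tag and retry once.
    # A second failure is left to propagate to the caller.
    (my $fixed = $xml) =~ s#(</?[^ >]+)#uc($1)#meg;
    return $parse->($fixed);
}

# Toy stand-in for a strict parser: dies unless the single element's
# opening and closing names match exactly, case included.
my $strict = sub {
    my ($doc) = @_;
    my ($open, $close) = $doc =~ m{\A<(\w+)>.*</(\w+)>\z}s
        or die "not a single element\n";
    die "mismatched tag\n" unless $open eq $close;
    return { root => $open };
};

my $parsed = parse_with_case_fallback('<Title>Virgin Radio</TITLE>', $strict);
print "$parsed->{root}\n";    # prints TITLE
```

Well-formed input never reaches the substitution, so files that already parse are returned exactly as the parser sees them.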

SBB






Grotus
2005-07-18, 10:01
Steve Baumgarten wrote:
> This might do it:
>
> $xml =~ s#(</?[^ >]+)#uc($1)#meg;
>
> Breaking that s/// operation apart a bit, it means take anything that
> starts with a left angle bracket, has a '/' character following
> (optionally) and then any series of characters up to a space or a right
> angle bracket. For each such occurrence, uppercase what was found and
> substitute it for the match.

Some suggestions:
The /? is unnecessary, since / matches [^ >].
Use \s instead of a literal space so that tabs and newlines will be caught.
Use \U$1\E instead of uc($1) so that you don't need to use the e qualifier.
Use something other than # as your regex delimiter, since # confuses
syntax highlighters; | works well, as does {}.

That results in the regex being:
$xml =~ s{(<[^\s>]+)}{\U$1\E}mg;
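
To show what the \s change buys, here is a small comparison of the two patterns on a tag whose name is followed by a newline rather than a space. The sub names upcase_original and upcase_revised are just labels for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern as first posted: stops only at a literal space or '>'.
sub upcase_original {
    my ($xml) = @_;
    $xml =~ s#(</?[^ >]+)#uc($1)#meg;
    return $xml;
}

# The revised pattern: stops at any whitespace or '>'.
sub upcase_revised {
    my ($xml) = @_;
    $xml =~ s{(<[^\s>]+)}{\U$1\E}mg;
    return $xml;
}

# A tag name followed directly by a newline, then an attribute.
my $sample = "<a\ntag=whatever>";

# With [^ >] the newline is part of the match, so the attribute is
# uppercased as well: "<A", then "TAG=WHATEVER>" on the next line.
print upcase_original($sample), "\n";

# With [^\s>] the match ends at the newline, so only the tag name
# changes: "<A", then "tag=whatever>" on the next line.
print upcase_revised($sample), "\n";
```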

Triode
2005-07-18, 11:23
> That results in the regex being:
> $xml =~ s{(<[^\s>]+)}{\U$1\E}mg;

Thanks Steve and Robert.

Vidur - are you happy with the resulting diff that makes these asx files parse (attached)? It seems to mean the problem asx files work
OK.

I still get a couple of warnings which we may want to get rid of:

untie attempted while 3 inner references still exist at /usr/local/slimserver/CPAN/IO/String.pm line 80, <GEN177> line 32.
Warning: 'ParserOpts' is deprecated, contact the author if you need it at /usr/local/slimserver/Slim/Formats/Parse.pm line 744

The first one appears to come from closing the file too early at the start of Slim::Formats::Parse::readASX (I can remove it by
commenting out the file close). Is there an elegant way to fix this?

For the second one, do we need to use:
ParserOpts => [ ProtocolEncoding => 'ISO-8859-1' ]

Adrian

Steve Baumgarten
2005-07-18, 11:33
Robert Moser wrote:

> Some suggestions:
> The /? is unnecessary, since / matches [^ >].
> Use \s instead of a literal space so that tabs and newlines will be caught.
> Use \U$1\E instead of uc($1) so that you don't need to use the e qualifier.
> Use something other than # as your regex delimiter, since # confuses
> syntax highlighters, | works well, as does {}.
>
> That results in the regex being:
> $xml =~ s{(<[^\s>]+)}{\U$1\E}mg;

Those are all good improvements. I'm still hoping that the regex itself
actually does the trick in enough cases to help...

SBB






Triode
2005-07-18, 11:53
>
> Those are all good improvements. I'm still hoping that the regex itself actually does the trick in enough cases to help...
>

Well, it works for my failing case. However, I missed the bit where you suggested it only be used when XML::Simple complains. I'll
think about this again, unless anyone else is confident in using it on all ASX files?
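
One case worth checking before running it on every file: the pattern as written also matches a leading <?xml ...?> declaration and would turn it into <?XML, which a strict parser may reject. Below is a minimal sketch of a narrower variant, assuming only element names should change. The sub name upcase_element_names is invented for this example, and it has not been tried against real ASX files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Uppercase element names only. The name must start with a letter or
# underscore, so "<?xml", "<!--" and "<!DOCTYPE" are left untouched.
sub upcase_element_names {
    my ($xml) = @_;
    $xml =~ s{<(/?)([A-Za-z_][^\s>/]*)}{<$1\U$2\E}g;
    return $xml;
}

my $doc = qq{<?xml version="1.0"?><Asx><Title>x</TITLE></asx>};
print upcase_element_names($doc), "\n";
# prints <?xml version="1.0"?><ASX><TITLE>x</TITLE></ASX>
```

The name also stops at '/', so a self-closing tag such as <br/> becomes <BR/> without the slash being part of the captured name.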

Adrian

vidurapparao
2005-07-18, 12:02
I'm good with it being used always (not only when XMLin fails). The
ProtocolEncoding option is necessary, despite the warning - otherwise
XML::Parser barfs on high-bit characters. I think the warning is bogus -
the author's rationale for it is that you should be able to "fix your
XML". I suppose we do have the option of prefixing the ASX file with an
XML declaration specifying ISO-8859-1, but I'm OK with the warning.
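
For reference, the prefixing option could look roughly like this. It is a sketch only: the sub name with_latin1_declaration is made up, and I have not checked it against the ASX files in question. It would need to run after any tag-case rewriting, since the regex discussed earlier in the thread would also uppercase the <?xml part.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Prepend an XML declaration naming ISO-8859-1, unless the document
# already appears to start with a declaration of its own.
sub with_latin1_declaration {
    my ($xml) = @_;
    return $xml if $xml =~ /\A\s*<\?xml\b/i;
    return qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n} . $xml;
}

print with_latin1_declaration("<ASX version=\"3.0\"></ASX>\n");
```

Calling it twice on the same string adds the declaration only once.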

Triode wrote:

> ...
> Vidur - are you happy with the resulting diff that makes these asx
> files parse (attached)? It seems to mean the problem asx files work OK.