
Re: [syndication] Re: syndication and i18n



On Tue, May 22, 2001 at 05:01:31PM +0100, hpyle@agora.co.uk wrote:
> 
> My take: use a decent XML parser and you'll have all the parse-side 
> encoding issues completely handled for you, and your Python code will just 
> see Unicode.  It might mean you end up with a stricter aggregator than 
> some (eg. you won't be able to accept <item>stuff<img 
> src="something"></item> because it's badly formed), but IMHO that's not a 
> bad thing.

That's what I'm already doing; unfortunately, it's not that easy in
practice, because unicode handling (in Python, at least) isn't that
transparent. For example, there are non-ASCII characters in both the
Standard and the W3C's RSS feeds right now, which cause Python to
raise an error unless I .encode('utf-8') them into strings.
Parse-side isn't a problem; it's doing something with the output that
is.
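
A minimal sketch of that split between parsing and output, assuming a
modern Python 3 and the standard library's xml.etree.ElementTree in place
of PyXML (the one-element document is a made-up stand-in for a feed
entry, not the real feed):

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document standing in for a feed entry.
doc = (b'<?xml version="1.0" encoding="utf-8"?>'
       b'<title>Philippe Le H\xc3\xa9garet</title>')

# Parse side: the parser reads the declared encoding and returns text.
title = ET.fromstring(doc).text

# Output side: an ASCII-only sink cannot represent U+00E9.
try:
    title.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Choosing an encoding explicitly at the output boundary works.
utf8_bytes = title.encode('utf-8')
```

The parse step never fails here; only the implicit or explicit ASCII
encode on the way out does.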

For those interested in the minute details...

In the W3C feed, the source HTML (the home page) is charset=us-ascii,
and the offending bit of markup is encoded:
  Philippe Le H&#233;garet
which renders fine in Mozilla.

In the XML RSS file, the XML has an encoding of 'utf-8', and the
offending markup is:
  Philippe Le Hégaret

So, PyXML will spit this out as unicode. If I try to print that to
anything, or combine it with other strings in certain ways, I get
  UnicodeError: ASCII encoding error: ordinal not in range(128)
unless I .encode('utf-8') it, in which case I get something that
prints in ascii as
  Philippe Le HÃ©garet

which seems to render correctly, as long as I set the charset to
utf-8. Fine.
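
As a rough illustration of the W3C case, assuming Python 3 (html.unescape
stands in for whatever resolves the character reference in the HTML):
both spellings are the same code point, and the two-byte utf-8 form only
looks wrong when it is read back under a single-byte charset.

```python
import html

# The numeric reference from the us-ascii HTML page.
from_html = html.unescape('Philippe Le H&#233;garet')

# The raw bytes as they would appear in the utf-8 RSS file.
from_xml = b'Philippe Le H\xc3\xa9garet'.decode('utf-8')

# Both routes arrive at the same text (U+00E9 for the accented e).
same = from_html == from_xml

# Encoding for output gives two bytes for that one character...
utf8_bytes = from_xml.encode('utf-8')

# ...which show up as two characters if the reader assumes iso-8859-1.
misread = utf8_bytes.decode('iso-8859-1')
```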

The Standard's feed has encoding="ISO-8859-1". The offending markup
is
  Net 21 <96> The Survivors
which, as a Python unicode string, looks like
  u'Net 21 \x96 The Survivors'

If I .encode('utf-8') it, I get
  'Net 21 \xc2\x96 The Survivors'
which doesn't look correct at all when rendered in Mozilla with utf-8
(it's supposed to be an en dash). If I change the charset to 8859-1, the
original renders correctly, but the utf-8-encoded string does not (it
has an extra character prepended, understandably).
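
One possible explanation, offered as an assumption rather than anything
confirmed about the feed: byte 0x96 is a dash (U+2013) in Windows-1252
but a C1 control character in real ISO-8859-1, so the file may have been
written as Windows-1252 and labeled ISO-8859-1. A sketch in Python 3 of
what happens under that assumption, and one way to undo it:

```python
# The bytes as they appear in the feed (title taken from the example above).
raw = b'Net 21 \x96 The Survivors'

# Decoding with the declared encoding yields U+0096, a control character.
as_declared = raw.decode('iso-8859-1')

# Encoding that to utf-8 gives the \xc2\x96 pair seen above.
as_utf8 = as_declared.encode('utf-8')

# ASSUMPTION: the source was really Windows-1252. Going back to the
# original bytes and decoding as cp1252 recovers the dash.
repaired = as_declared.encode('iso-8859-1').decode('cp1252')
repaired_utf8 = repaired.encode('utf-8')
```

The round trip through iso-8859-1 is lossless because that codec maps
every byte value to the code point with the same number.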

I think the root of the problem is that I have no apparent way to
determine the encoding of a unicode string coming out of the XML
parser, or a way to consolidate several different encodings into one
document (although I thought this was what unicode was supposed to
enable).
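
For what it's worth, a decoded string carries no encoding of its own; it
is a sequence of code points, and an encoding only comes into play when
bytes go in or out. A hedged sketch of consolidating sources under that
model, in Python 3 (to_text and the second sample item are hypothetical,
made up for illustration):

```python
def to_text(raw: bytes, declared: str) -> str:
    """Decode one source's bytes using the encoding that source declared."""
    return raw.decode(declared)

# (bytes, declared encoding) pairs; the second item is invented sample data.
items = [
    (b'Philippe Le H\xc3\xa9garet', 'utf-8'),
    (b'Caf\xe9 news', 'iso-8859-1'),
]

# Decode everything to text first, combine, then encode exactly once.
combined = '\n'.join(to_text(raw, enc) for raw, enc in items)
page = combined.encode('utf-8')
```

This only works if each declared encoding is accurate; a mislabeled
source has to be corrected before the decode step.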

I should probably take this to the Python XML group...

-- 
Mark Nottingham
http://www.mnot.net/