
Re: [syndication] XML Character encoding (again)

To: syndication@yahoogroups.com
Subject: Re: [syndication] XML Character encoding (again)
From: Doug Ransom <doug.ransom@alumni.uvic.ca>
Date: Tue, 15 Apr 2003 18:54:13 -0700
In-reply-to: <N0zMtdBSIIn+EALt@jblaptop.voidstar.com>
References: <N0zMtdBSIIn+EALt@jblaptop.voidstar.com>
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4a) Gecko/20030401

Sounds like it might be a parser bug and not your bug.

Try UTF-16 long enough to see if the problem goes away. Or Latin-1 to see if the foreign characters are successfully trashed.



Doug

Julian Bond wrote:

I feel like I should have solved this years ago. But it's still causing me trouble.
I have a situation where a motley crew of users are using all sorts of tools to enter blog text. Which means that they cut and paste in £ pound signs, MS Smart quotes and the occasional foreign character (even Euros). This text can also contain embedded HTML. I get this as POST data and store it in a database. Later I read this out into the <description> section of RSS. This is all done with PHP code.
At the moment, I'm wrapping this in a CDATA section with the whole XML block using UTF-8 encoding. This appears to break. It's invalid in Mark Pilgrim's validator. IE6 complains about "An invalid character was found in text content". People tell me that other validating XML parsers complain as well, including the one used by Livejournal. Which is puzzling when Mark's help text advises this as a technique. I thought a CDATA block would protect against this and it's presumably why MT started using this. But looking at the W3C comments on CDATA it only protects against XML special characters being unescaped. It doesn't appear to protect against bad character encoding.
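To illustrate the point about CDATA, here is a minimal sketch in Python (not the PHP used on the site). It assumes the pasted pound sign arrives as the single Windows-1252 byte 0xA3; the helper name `parses` and the sample documents are hypothetical. It suggests that a parser rejects that byte inside CDATA when the document is declared as UTF-8, while the same character encoded as real UTF-8 parses fine.

```python
import xml.etree.ElementTree as ET

HEADER = b'<?xml version="1.0" encoding="UTF-8"?>'


def parses(doc: bytes) -> bool:
    """Return True if the document is accepted by the XML parser."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Assumption: the browser sent the pound sign as Windows-1252 (one byte).
pound_cp1252 = "\u00a3".encode("cp1252")  # b'\xa3'
# The same character as genuine UTF-8 (two bytes).
pound_utf8 = "\u00a3".encode("utf-8")     # b'\xc2\xa3'

# CDATA hides markup characters such as < and >, but not invalid bytes.
bad = HEADER + b"<description><![CDATA[" + pound_cp1252 + b"5 <b>bold</b>]]></description>"
good = HEADER + b"<description><![CDATA[" + pound_utf8 + b"5 <b>bold</b>]]></description>"

bad_ok = parses(bad)
good_ok = parses(good)
good_text = ET.fromstring(good).text
```

Under these assumptions `bad_ok` is false and `good_ok` is true, which matches the reading that CDATA only changes how markup characters are treated, not how bytes are decoded.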
Previously, I've used UTF-8 with no CDATA but using the htmlspecialchars() function in PHP to escape the reserved 5 XML characters. This is otherwise fine, and deals with embedded HTML but still fails with some invalid characters.
I've also tried using PHP's htmlentities() function to encode the text and an ENTITY statement pointing at http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent. This contains the same entity translation as the original Netscape RSS DTD. Again this almost works, but some PHP translations are slightly different. In particular ' is missing in PHP and there may be others.
I've also tried alternate character sets and more MS friendly char sets, but some Mac and Linux users have managed to enter data that breaks those. (I think).
Next? Do I have to convert all high order characters into character number form? What will parsers make of this? Especially the ultra-liberal Regex ones? (like mine...)
I really thought that UTF-8 would just treat single byte characters as single bytes and not complain. But I'm not looking at the wire to see what PHP and MySQL are actually passing.
There has to be a way to put arbitrary bytes into a defined block within the <description> element without having to explicitly encode each one. Hasn't there?
Aaaargh! If anyone has a real answer to this, I'd really appreciate a fairly detailed recipe. I suspect I'm not alone.
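One way the "character number form" idea could look is sketched below, in Python rather than PHP. It assumes the stored bytes are Windows-1252, which is a guess about the input and not something the thread confirms. The function name `to_ascii_xml` and the sample bytes are hypothetical. The sketch decodes the bytes, escapes the XML markup characters, then replaces every non-ASCII character with a numeric character reference, so the result is pure ASCII and should be safe whatever encoding the feed declares.

```python
from xml.sax.saxutils import escape


def to_ascii_xml(raw: bytes, source_encoding: str = "cp1252") -> str:
    """Turn raw stored bytes into ASCII-only text for an XML element.

    Assumption: the bytes really are in source_encoding. Bytes that are
    undefined in that encoding become U+FFFD instead of raising.
    """
    text = raw.decode(source_encoding, errors="replace")
    # Escape &, < and > so embedded HTML travels as escaped text.
    escaped = escape(text)
    # Replace every non-ASCII character with a &#NNNN; reference.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Pound sign, smart quotes and a Euro sign as Windows-1252 bytes.
sample = b"\xa35 \x93quoted\x94 & <b>\x80</b>"
result = to_ascii_xml(sample)
```

A regex-based reader would then see only ASCII and would need to decode the `&#NNNN;` references itself, which is the trade-off the question raises.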
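The single-byte assumption can be checked without looking at the wire. The Python sketch below is an illustration, not a description of what PHP or MySQL actually pass; the helper name `is_valid_utf8` is hypothetical. It shows that only ASCII characters are single bytes in UTF-8, so a lone high byte such as a Latin-1 pound sign is not valid UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


pound = "\u00a3"
latin1_bytes = pound.encode("latin-1")  # one byte: b'\xa3'
utf8_bytes = pound.encode("utf-8")      # two bytes: b'\xc2\xa3'

ascii_ok = is_valid_utf8(b"plain ASCII text")
latin1_ok = is_valid_utf8(latin1_bytes)
utf8_ok = is_valid_utf8(utf8_bytes)
```

Running a check like this over the stored rows would show whether the database holds UTF-8 or some single-byte encoding, which in turn indicates whether the text needs converting before it goes into the feed.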

References:
- XML Character encoding (again)
  - From: Julian Bond <julian_bond@voidstar.com>
