
Re: [syndication] XML Character encoding (again)



Sounds like it might be a parser bug and not your bug.
Try UTF-16 just long enough to see if the problem goes away, or Latin-1 to see if the foreign characters are successfully trashed.


Doug

Julian Bond wrote:

I feel like I should have solved this years ago. But it's still causing me trouble.

I have a situation where a motley crew of users are using all sorts of tools to enter blog text, which means that they cut and paste in £ pound signs, MS Smart quotes and the occasional foreign character (even Euros). This text can also contain embedded HTML. I get this as POST data and store it in a database. Later I read it out into the <description> section of RSS. This is all done with PHP code.

At the moment, I'm wrapping this in a CDATA section, with the whole XML block using UTF-8 encoding. This appears to break. It's invalid in Mark Pilgrim's validator. IE6 complains that "An invalid character was found in text content". People tell me that other validating XML parsers complain as well, including the one used by LiveJournal. This is puzzling, because Mark's help text advises this as a technique. I thought a CDATA block would protect against this, and that's presumably why MT started using it. But looking at the W3C comments on CDATA, it only protects against XML special characters being unescaped. It doesn't appear to protect against bad character encoding.
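
The point about CDATA can be checked with a small experiment. The sketch below is in Python rather than PHP, purely for illustration, and the feed fragment is a made-up example, not real data from the site. It assumes the pound sign arrived as the single Windows-1252 byte 0xA3, which is one plausible cause of the reported error but not a confirmed one.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment: a pound sign stored as the single byte 0xA3
# (Windows-1252 / Latin-1), wrapped in CDATA, in a document declaring UTF-8.
bad = (b'<?xml version="1.0" encoding="UTF-8"?>'
       b'<description><![CDATA[Price: \xa35 <b>only</b>]]></description>')

# The same fragment with the pound sign as genuine UTF-8 (bytes 0xC2 0xA3).
good = bad.replace(b'\xa3', '\u00a3'.encode('utf-8'))


def parses(doc: bytes) -> bool:
    """Return True if the XML parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


print(parses(bad))                # CDATA does not rescue the invalid byte
print(parses(good))               # correctly encoded text is accepted
print(ET.fromstring(good).text)   # the markup inside CDATA survives as text
```

The parser decodes the bytes before it ever looks for CDATA markers, so CDATA only changes how `<` and `&` are treated, not which byte sequences are legal.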

Previously, I've used UTF-8 with no CDATA, using the htmlspecialchars() function in PHP to escape the five reserved XML characters. This is otherwise fine and deals with embedded HTML, but it still fails with some invalid characters.

I've also tried using PHP's htmlentities() function to encode the text, together with an ENTITY declaration pointing at http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent which contains the same entity translation as the original Netscape RSS DTD. Again this almost works, but some PHP translations are slightly different. In particular, &apos; is missing in PHP and there may be others.

I've also tried alternate character sets, including more MS-friendly ones, but some Mac and Linux users have managed to enter data that breaks those (I think).

Next? Do I have to convert all high order characters into character number form? What will parsers make of this? Especially the ultra-liberal Regex ones? (like mine...)
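
As a rough illustration of the "character number form" idea, here is a minimal sketch in Python (not PHP). The function name `to_ncr` is hypothetical, and the sketch assumes the text has already been decoded correctly into Unicode; it does nothing for bytes in an unknown encoding, nor for control characters that XML forbids outright.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ncr(text: str) -> str:
    """Escape &, < and >, then turn every non-ASCII character into a
    decimal numeric character reference such as &#163;."""
    escaped = escape(text)
    return escaped.encode('ascii', 'xmlcharrefreplace').decode('ascii')


sample = '\u00a35 \u201cquoted\u201d \u20ac & <b>bold</b>'
encoded = to_ncr(sample)
print(encoded)

# A conforming parser turns the references back into the original text.
roundtrip = ET.fromstring('<description>' + encoded + '</description>').text
print(roundtrip == sample)
```

The output is pure ASCII, so it is the same byte sequence whether the feed is labelled UTF-8, Latin-1 or Windows-1252. How regex-based parsers treat the references is a separate question that this sketch does not answer.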

I really thought that UTF-8 would just treat single-byte characters as single bytes and not complain. But I'm not looking at the wire to see what PHP and MySQL are actually passing.
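
On the single-byte question: in UTF-8 only bytes below 0x80 (plain ASCII) stand alone, and a lone byte such as 0xA3 is an error. One possible way to normalise stored text is sketched below in Python (not PHP). The helper name `to_utf8` is hypothetical, and the fallback to Windows-1252 is a guess about what the pasting tools produce, not something established in this thread.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw as valid UTF-8 bytes.

    Input that already decodes as UTF-8 is kept unchanged. Anything else is
    ASSUMED to be Windows-1252; bytes undefined there become U+FFFD.
    """
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError:
        text = raw.decode('cp1252', errors='replace')
    return text.encode('utf-8')


print(to_utf8(b'\xa35'))           # lone 0xA3 is re-encoded as 0xC2 0xA3
print(to_utf8(b'\x93hi\x94'))      # MS smart quotes become U+201C / U+201D
print(to_utf8(b'\xc2\xa35'))       # already valid UTF-8 passes through
```

Trying UTF-8 first is reasonably safe because text in a legacy single-byte encoding rarely happens to form valid multi-byte UTF-8 sequences, though the check is a heuristic and can misjudge short inputs.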

There has to be a way to put arbitrary bytes into a defined block within the <description> element without having to explicitly encode each one. Hasn't there?

Aaaargh! If anyone has a real answer to this, I'd really appreciate a fairly detailed recipe. I suspect I'm not alone.