Re: [syndication] XML Character encoding (again)
On Wednesday, 16 April 2003 at 18:16, Julian Bond wrote:
> Programming hours are too short to start figuring out client browser
> capability, UTF-8 conversion from arbitrary encodings and so on.
But this is what you have to do in _any_ web application that accepts
input from multiple sources if you care about interoperability.
When the user's browser submits the POST, it also sends a Content-Type
header, which may carry a charset parameter. You have to assume that the
browser has encoded the POST content in the stated character set. You
then use this information to decode the input and convert it to
whatever standard character encoding you are using to store the data.
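As a rough illustration of that step, here is a minimal sketch in Python
(chosen only for brevity; it is not one of the languages discussed in
this thread). The function names and the ISO-8859-1 fallback are
assumptions made for the example, and it assumes the body bytes have
already been percent-decoded.

```python
from email.message import Message


def charset_from_content_type(header, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` (an assumption for this sketch) when the
    header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header
    return msg.get_content_charset(failobj=default)


def to_utf8(raw, content_type):
    """Decode `raw` bytes using the declared charset, re-encode as UTF-8."""
    charset = charset_from_content_type(content_type)
    try:
        text = raw.decode(charset)
    except LookupError:
        # Unknown charset label: fall back to ISO-8859-1, which can
        # decode any byte sequence (though possibly to the wrong text).
        text = raw.decode("iso-8859-1")
    return text.encode("utf-8")
```

For example, to_utf8(b"caf\xe9", "text/plain; charset=ISO-8859-1")
yields the UTF-8 bytes for "café". Bytes that are invalid in the
declared charset raise UnicodeDecodeError here; whether to reject or
replace them is a policy decision left to the application.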
In Perl you'd use the Encode module and its decode function. It
doesn't look like PHP has the same breadth of encoding functions -
utf8_encode only seems to convert from ISO-8859-1 to UTF-8. The Perl
ones aren't complete by any means and they've been in development for
years - it's hard work, so I'm not holding my breath for the equivalent
in the PHP world. I don't know about Java, but I suspect that
environment is at least as well developed as Perl in the area of
character encodings.
If the majority of the errors are MS Word related, then take a look at
http://www.fourmilab.ch/webtools/demoroniser/ which strips out the
strange characters from documents authored in MS environments. I
believe it's just a big list of regular expressions, which should be
portable to PHP.
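To give a feel for the kind of substitution involved, here is a small
Python sketch. It is not the demoroniser's actual rule list: the
replacement table is a hand-picked subset for illustration, the function
name is made up, and it assumes the input bytes are really Windows-1252.

```python
# Illustrative subset only: a few Windows-1252 punctuation characters
# mapped to plain ASCII stand-ins.
REPLACEMENTS = {
    "\u2018": "'",    # left single quote  (byte 0x91 in Windows-1252)
    "\u2019": "'",    # right single quote (0x92)
    "\u201c": '"',    # left double quote  (0x93)
    "\u201d": '"',    # right double quote (0x94)
    "\u2013": "-",    # en dash            (0x96)
    "\u2014": "--",   # em dash            (0x97)
    "\u2026": "...",  # ellipsis           (0x85)
}
TABLE = str.maketrans(REPLACEMENTS)


def demoronise(raw):
    """Decode Windows-1252 bytes and replace 'smart' punctuation."""
    # The five byte values undefined in Windows-1252 become U+FFFD.
    text = raw.decode("cp1252", errors="replace")
    return text.translate(TABLE)
```

A fixed translation table is enough for single-character replacements
like these; regular expressions would only be needed for
context-dependent fixes.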
- Ian <iand@internetalchemy.org>
"Science is organized knowledge. Wisdom is organized life."