
Re: [syndication] html parsing as a horror story



> Here is an html parsing horror story...
> Radio and RSS 0.92
> AHHH!!!!

You mean aaiiiieeeee!!! /runs screaming into the night.../

In Userland's defense, they do try to resolve the stuff eventually.  I hounded
them for weeks to get on the encoding bandwagon.  It wasn't until the statistics
page at Syndic8 revealed the extent to which encoding errors were unfairly
causing feeds to fail that they acted.  They (Jake) were good about making
attempts to handle the bugs.  Not as quickly as some might like, of course.
There is still an outstanding bug in how Radio creates and parses ampersands.
It ends up double-encoding all over the place because it doesn't use regexp
lookups or entity tables.  Even UTF-8 stuff gets bastardized, with &#999
wrongly turned into &999.  There's hope, sort of, in that the xml.entityEncode
function within Radio is the bottleneck.  Getting them to fix that script would
solve the output side of their encoding mistakes.  Doing likewise on the
decoding side would presumably take care of the rest.
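
For what it's worth, here is a rough sketch of the regexp-lookup idea on the
output side.  It's in Python, not Radio's own scripting language, and it is
not Userland's code; the name encode_ampersands is made up for illustration.
The assumption is simply that an ampersand should be escaped only when it
does not already begin an entity or character reference.

```python
import re

# Hypothetical helper, not Radio's xml.entityEncode.
# Matches an "&" that is NOT already the start of a named entity (&amp;),
# a decimal reference (&#999;) or a hex reference (&#x3E7;).
_BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def encode_ampersands(text):
    """Escape bare ampersands, leaving existing references untouched."""
    return _BARE_AMPERSAND.sub("&amp;", text)

print(encode_ampersands("AT&T"))        # bare ampersand gets escaped
print(encode_ampersands("AT&amp;T"))    # already encoded, left alone
print(encode_ampersands("&#999;"))      # numeric reference, left alone
```

Because references that are already well formed are skipped, running the
function twice gives the same result as running it once, which is the whole
point.  The catch is that literal text that merely looks like a reference
(say, someone actually typing "&amp;") is treated as already encoded, so
this only makes sense where the input is known to be partly escaped.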

-Bill Kearney