Re: [syndication] XML validation with XSD
> I agree in principle that brute forcing your way around bad XML is generally
> a bad idea. Far better to flag & report the problem, and help the
> content-generator fix the problem. There are times however when this isn't
> an option - for example, if you are building a reader then you can't always
> expect the end user to provide feedback to the generator of the dodgy feed.
> A possibility here is to use something like the W3C's HTML Tidy to clean up
> the markup before the rest of your app gets to see it. I'm trying this now
> myself, and the actual code needed for the filtering is pretty small. I'm
> not yet certain about the time overhead needed for processing, but I suspect
> that won't be significant compared with the time taken to get the data.
The custom-written RSS/RDF parser behind The Snewp and RSSEngine cleans
up the XML, makes some assumptions based on known data, etc., and adds
almost no overhead by comparison. Granted, when parsing 15,000 feeds, it
/does/ add to the total time, but it's worth it for the cleaner data.
I have been very careful to not over-parse the data -- a hard line to
define in some cases.
A couple of examples of what it does:
If the generator agent can be identified as Radio, the parser knows that
the feed A) probably doesn't have item titles, and B) uses encoded HTML
(often double encoded). It runs an extra html-decode routine before even
parsing the actual data. If it can confirm that the item titles are indeed
missing, it looks for an HTML link in the item description and pulls the
title from that.
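
To make the Radio case concrete, here is a minimal Python sketch of the two steps: an extra decode pass for double-encoded HTML, then a fallback title taken from the first link in the description. It is not the parser's actual code. The names decode_entities and title_from_description, the two-pass limit, and the regular expressions are all assumptions for illustration, and the decode is applied to a single description here rather than to the whole feed.

```python
import html
import re

# Hypothetical helpers; the real parser's routines are not shown in this post.

def decode_entities(text, max_passes=2):
    """Unescape HTML entities up to max_passes times, stopping early
    once a pass no longer changes the text."""
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text

# Matches the first <a href="...">...</a> and captures the URL and link text.
LINK_RE = re.compile(
    r'<a\s[^>]*href=["\']([^"\']+)["\'][^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r'<[^>]+>')

def title_from_description(description):
    """Return the text of the first HTML link in the description,
    or None if there is no usable link."""
    match = LINK_RE.search(description)
    if not match:
        return None
    # Drop any markup nested inside the link text.
    text = TAG_RE.sub('', match.group(2)).strip()
    return text or None

if __name__ == "__main__":
    raw = '&amp;lt;a href="http://example.com/"&amp;gt;Example story&amp;lt;/a&amp;gt; more text'
    print(title_from_description(decode_entities(raw)))
```

Capping the number of decode passes is one way to avoid over-parsing: text that was only encoded once is left alone after the first pass.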
If the format is RSS 1.0 (RDF), the parser normalizes namespace
identifiers before parsing. From time to time, I have to add to the list of
"bad" names that are normalized - you might be surprised by some of the
namespace identifiers -- "rss091" has been found to be assigned to the
"default" namespace, etc.
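
As a rough illustration of the namespace step, the Python sketch below rewrites unexpected prefixes to canonical ones on the raw text, before any XML parsing. It assumes one reading of the "rss091" example, namely a feed that binds that prefix to the RSS 1.0 namespace URI. The CANONICAL_PREFIXES table and normalize_prefixes are hypothetical names, and the real parser's list of "bad" names is not shown here. The sketch also ignores edge cases such as a feed that already declares a default namespace, or prefixed names that appear inside text content.

```python
import re

# Assumed mapping from well-known namespace URIs to the prefix we want.
# An empty string means "make this the default namespace".
CANONICAL_PREFIXES = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
    "http://purl.org/dc/elements/1.1/": "dc",
    "http://purl.org/rss/1.0/": "",
}

XMLNS_RE = re.compile(r'xmlns:([A-Za-z_][\w.-]*)\s*=\s*(["\'])(.*?)\2')

def normalize_prefixes(xml_text):
    """Rename namespace prefixes to their canonical form, working on raw text."""
    renames = {}
    for prefix, _quote, uri in XMLNS_RE.findall(xml_text):
        canonical = CANONICAL_PREFIXES.get(uri)
        if canonical is not None and canonical != prefix:
            renames[prefix] = canonical

    for old, new in renames.items():
        esc = re.escape(old)
        # Rewrite the declaration itself.
        decl = f'xmlns:{new}=' if new else 'xmlns='
        xml_text = re.sub(rf'xmlns:{esc}\s*=', decl, xml_text)
        # Rewrite start and end tags that use the old prefix.
        tag_prefix = f'{new}:' if new else ''
        xml_text = re.sub(rf'(</?){esc}:', rf'\g<1>{tag_prefix}', xml_text)
        # Rewrite prefixed attributes; skipped when moving to the default
        # namespace, because unprefixed attributes are not in any namespace.
        if new:
            xml_text = re.sub(
                rf'(\s){esc}:([\w.-]+\s*=)', rf'\g<1>{new}:\g<2>', xml_text
            )
    return xml_text

if __name__ == "__main__":
    feed = (
        '<r:RDF xmlns:r="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
        'xmlns:rss091="http://purl.org/rss/1.0/">'
        '<rss091:channel r:about="http://example.com/">'
        '<rss091:title>T</rss091:title></rss091:channel></r:RDF>'
    )
    print(normalize_prefixes(feed))
```

A namespace-aware XML parser resolves prefixes by URI and would not need this step. It is useful mainly when the downstream code matches on literal prefixed names.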
James
PS: The parser code will be available sometime in the next couple of months.