
Re: [syndication] html parsing as a horror story



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Morten Frederiksen <mof-syndication@mfd-consult.dk> writes:

> Hi there,
> 
> On Saturday 20 July 2002 00:00, Kevin wrote:
> <snip context="RSS 0.92 description encoding"/>
> > I won't bring up the security issues present with the possible syndication
> > of encoded <script> elements...
> I don't suppose this problem will go away with mod_content?

It does make this easier to deal with in some situations.  If you are using a
Literal parseType then you can deal with the data directly as XML (instead of
CDATA).

Then you can do:

<xsl:template match="xhtml:script">

    <!-- don't do anything for scripts -->

</xsl:template>

With SAX you can drop the script data as you copy the document.  You could also
write an XSLT extension to handle this, or do a search/replace for script
sections, but having the data as XML makes this a lot more elegant.
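
For illustration, the SAX approach could look roughly like the sketch below.
This is only a sketch using Python's standard xml.sax module, not code from any
existing aggregator.  The names ScriptStripper and strip_scripts are made up for
this example, and it assumes the xhtml prefix is bound to the usual XHTML
namespace (http://www.w3.org/1999/xhtml).

```python
import io
from xml.sax import make_parser, handler
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

# Assumption: the xhtml: prefix in the feed maps to this namespace.
XHTML_NS = "http://www.w3.org/1999/xhtml"


class ScriptStripper(XMLFilterBase):
    """Hypothetical filter: copy SAX events, skipping xhtml:script subtrees."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._depth = 0  # greater than 0 while inside a script element

    def startElementNS(self, name, qname, attrs):
        if self._depth or name == (XHTML_NS, "script"):
            self._depth += 1  # swallow the script and anything nested in it
            return
        super().startElementNS(name, qname, attrs)

    def endElementNS(self, name, qname):
        if self._depth:
            self._depth -= 1
            return
        super().endElementNS(name, qname)

    def characters(self, content):
        if not self._depth:
            super().characters(content)


def strip_scripts(xml_text):
    """Return xml_text re-serialized without any xhtml:script elements."""
    reader = make_parser()
    reader.setFeature(handler.feature_namespaces, True)
    out = io.StringIO()
    stripper = ScriptStripper(reader)
    stripper.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    stripper.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()
```

Matching on the namespace URI rather than the literal prefix means the filter
still works if a feed binds XHTML to some other prefix.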

> > On the peerfear link, notice the use of images for each <item>. This is
> > done with a mod_itemimage RSS 1.0 module I am about to propose.

> This does look interesting. Suggestion: It looks as if there's just the one
> image element defined in this module, how about adding at least a link element
> as well, this way it could be used for category links, /.-style.  Or maybe
> this belongs in a categorization module?  /thinking out loud/... maybe a full
> rss:image container could be used in some way, instead of several separate
> elements?

Right now I am thinking of building a more generic mod_image module for channel
images of multiple sizes, item images, etc.  I hope to have something published
by next week or so.

The one on peerfear.org was just a quick hack so that it could work on my site.
That code will be upgraded to the proposal I make to the rss-dev team.

> > RSS 0.92 feeds (notice the lack of title with all structure encoded within a
> > <description> element as HTML)

> This is indeed ugly and close to unusable, but your point pushes me to point
> out an issue I have with your feed: The item description contains the entire
> item content, although HTML is stripped, but what is the point of this when
> you use mod_content? Isn't the description element - in any case - supposed to
> contain a *description* of the item, an abstract of sorts, not the item
> itself?

This is just a bug in records-mode:

http://www.peerfear.org/records-mode

records-mode does not yet support building descriptions out of the body of a
record, so right now we are just using the whole content.  This will be fixed
before we go 1.0 (and soon), and since I am the only one using the code base
right now it isn't too big of a deal.

I do agree this isn't very elegant, but it will be fixed.
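
One simple way to build a description out of a record body is to keep only the
leading words up to some length.  The sketch below is purely illustrative: it
is not how records-mode does or will do it, and the function name
make_description and the 200-character default are assumptions made for the
example.

```python
import re


def make_description(body_text, limit=200):
    """Hypothetical helper: return a short abstract cut at a word boundary."""
    # Collapse runs of whitespace so the length check is meaningful.
    text = re.sub(r"\s+", " ", body_text).strip()
    if len(text) <= limit:
        return text
    # Drop the trailing partial word, then mark the truncation.
    cut = text[:limit].rsplit(" ", 1)[0]
    return cut + "..."
```

The full item would then stay in the mod_content element, with only this
abstract going into the description.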

> I realize this is not against any formal rules or specs, but semantically I
> think it's wrong - and a waste of good bits, it currently can be derived from
> the mod_content content.

yup...

> BTW: Kevin, I noticed you complained that Gordon Mohr doesn't have a weblog.
> As far as I can tell, he has two [1] [2]!
<snip/>

Ah... yes.  He pointed them out to me ;)

"Gordon pointed out that he does have a weblog (I was really just giving him a
hard time!)."

http://www.peerfear.org/rss/permalink/1027129043.shtml

Kevin

- -- 
Kevin A. Burton ( burton@apache.org, burton@openprivacy.org, burton@peerfear.org )
             Location - San Francisco, CA, Cell - 415.595.9965
        Jabber - burtonator@jabber.org,  Web - http://www.peerfear.org/
        GPG fingerprint: 4D20 40A0 C734 307E C7B4  DCAA 0303 3AC5 BD9D 7C4D
         IRC - openprojects.net #infoanarchy | #p2p-hackers | #reptile

All the great empires of the future will be the empires of the mind.
  -- Winston Churchill
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)
Comment: Get my public key at: http://relativity.yi.org/pgpkey.txt

iD8DBQE9OeTYAwM6xb2dfE0RAgJRAJwKptp5Jhfr20cRuuKb7bMvtJN9kwCfQFzR
d3gW/JMXV/gKoD4IIQmVAWc=
=er1U
-----END PGP SIGNATURE-----