
Re: [syndication] Bad entities



In article <Pine.SOL.4.21.0110061409580.15777-100000@ic-unix.ic.utoronto.ca>,
Ian Graham <ian.graham@utoronto.ca> writes
>1) include the needed single-character entity definitions from the
>xhtml-lat1.ent file _directly_ inside the DTD at the start of an XML (RSS)
>message, as in:

>2) include an external entity declaration in the DTD (one that references
>the complete xhtml-lat1.ent resource) and then include that entire entity
>into the DTD, as in:

Ok. I've done a bit more digging and this is what I think is happening.
1) The RSS 1.0 spec[1] gives an example:-

<?xml version="1.0"?>

<!DOCTYPE rdf:RDF [
<!ENTITY % HTMLlat1 PUBLIC
 "-//W3C//ENTITIES Latin 1 for XHTML//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;
]>

<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://purl.org/rss/1.0/"
>
etc ...
  
Any RSS(DF) 1.0 feed should include this if it might have HTML
entities in the <item><description> element. I don't think many of
them do.

2) When the Netscape RSS 0.91 DTD disappeared off the net temporarily, a
lot of us just removed the <!DOCTYPE entry. But this DTD contains the
HTML entity declarations. So removing it is fine as long as the reader
doesn't validate the XML and we don't use HTML entities in our feeds.
But of course we do, so the entities are now undeclared and the feed is
no longer valid XML. The short term solution is to put the entry back
in. The Netscape spec for 0.91[2] suggests using this:

<?xml version="1.0"?>
<!DOCTYPE rss SYSTEM "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
   <channel>
     etc...

3) Manila RSS (at /xml/rss.xml) seems to use this.
<?xml version="1.0"?>
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
"http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
  <channel>
    etc ...

This appears to have the same effect. However, Radio's 0.92 [3] (pretty
much the only source of 0.92 apart from Drupal[6] and RSSify[7]) doesn't
have a <!DOCTYPE entry at all. Neither the Userland specs for 0.91[4]
and 0.92[5] nor their example files mention it. From the 0.92 spec:
"Further, 0.92 allows entity-encoded HTML in the <description> of an
item, to reflect actual practice ... " This is dangerous if named HTML
entities are included without a DTD that declares them, as we've
discovered.
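
To make the failure concrete, here is a minimal Python sketch. The feed
text is a made-up example, not taken from any real feed, and I'm assuming
the standard library's expat-based parser is representative of a typical
non-validating reader. A named HTML entity in a DOCTYPE-less feed is
rejected as undefined, while the equivalent numeric character reference
is accepted.

```python
import xml.etree.ElementTree as ET

# Hypothetical 0.92-style feed with no <!DOCTYPE entry at all.
BARE = """<?xml version="1.0"?>
<rss version="0.92">
  <channel>
    <item><description>Caf&eacute; news</description></item>
  </channel>
</rss>"""


def parses(xml_text):
    """Return True if the text parses, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# &eacute; is not declared anywhere, so the parser refuses the feed.
print(parses(BARE))                                  # False
# A numeric character reference needs no declaration.
print(parses(BARE.replace("&eacute;", "&#233;")))    # True
```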

4) So for RSS 0.9x I'm uncomfortable with depending on the Netscape DTD.
The best solution I can see is to depend on w3.org instead, as its
entity definitions are less likely to disappear. So (assuming I've got
it right) we need to add these lines to the top of the files, after the
<?xml version="1.0"?> line:

<!DOCTYPE rss [<!ENTITY % HTMLlat1 PUBLIC 
"-//W3C//ENTITIES Latin 1 for XHTML//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;]>
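
For comparison, option 1 quoted at the top (declaring just the needed
entities inline) can be sketched in Python. This may matter because, as
far as I understand the XML spec, a non-validating parser is not obliged
to fetch an external .ent file. The helper name add_inline_entities and
the sample feed are hypothetical, and I'm assuming the code points in
Python's html.entities table agree with xhtml-lat1.ent for the names used.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser already knows; never redeclare these.
XML_BUILTINS = {"amp", "lt", "gt", "quot", "apos"}


def add_inline_entities(feed, root="rss"):
    """Prepend a DOCTYPE whose internal subset declares the named
    HTML entities that actually occur in the feed text."""
    used = set(re.findall(r"&([A-Za-z][A-Za-z0-9]*);", feed)) - XML_BUILTINS
    decls = "".join(
        '<!ENTITY %s "&#%d;">\n' % (name, name2codepoint[name])
        for name in sorted(used)
        if name in name2codepoint  # unknown names are left undeclared
    )
    doctype = "<!DOCTYPE %s [\n%s]>\n" % (root, decls)
    # The DOCTYPE has to come after the XML declaration, if there is one.
    match = re.match(r"<\?xml[^>]*\?>\s*", feed)
    pos = match.end() if match else 0
    return feed[:pos] + doctype + feed[pos:]


# Hypothetical feed fragment for illustration only.
feed = ('<?xml version="1.0"?>\n'
        '<rss version="0.91"><channel>'
        '<description>Caf&eacute; &amp; bar</description>'
        '</channel></rss>')
fixed = add_inline_entities(feed)
text = ET.fromstring(fixed).find("channel/description").text
print(text)  # Café & bar
```

The trade-off is that the feed no longer depends on any remote server
being up, at the cost of the generator having to know the entity table.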

Phew!

[1] http://groups.yahoo.com/group/rss-dev/files/specification.html
[2] http://my.netscape.com/publish/formats/rss-spec-0.91.html
[3] eg http://wolk.datashed.net/users/adam@curry.com/curryCom.xml or
    http://www.ourfavoritesongs.com/users/dave@userland.com/rss/xml.xml
[4] http://backend.userland.com/rss091
[5] http://backend.userland.com/rss092
[6] http://www.drupal.org
[7] http://www.voidstar.com/rssify.php

-- 
Julian Bond    email: julian_bond@voidstar.com
CV/Resume:         http://www.voidstar.com/cv/
WebLog:               http://www.voidstar.com/
HomeURL:      http://www.shockwav.demon.co.uk/ 
M: +44 (0)77 5907 2173  T: +44 (0)192 0412 433
ICQ:33679568 tag:So many words, so little time