
Re: [syndication] Proper use of DOCTYPE?



3/22/02 1:35:45 PM, "Bill Kearney" <wkearney99@hotmail.com> wrote:
>So what you're saying is that the use of this entity would require using the
>DOCTYPE.  Provided, of course, that the DTD presented in that file had the
>entity.  The one for Netscape's simple RSS does have this entity.
>
>http://my.netscape.com/publish/formats/rss-0.91.dtd
>
>So would they be best informed to reference the DTD or to use the encoding
>attribute?  Or both?

There are two separate issues here: encoding and entity definitions.  As you probably know, every 
character in an XML document has an assigned numeric value in Unicode.  "Encoding" refers to how 
these numbers are represented as bit patterns in the physical file, data stream, etc. that 
represents the logical document.  For example, in the ISO-8859-1 encoding, a character whose 
Unicode number is less than 256 is represented as a single byte whose value is the Unicode number, 
and a character whose Unicode number is 256 or greater can't be *directly* represented at all.  In 
the UTF-8 encoding, characters whose numbers are less than 128 are represented as single bytes, 
128 through 2047 as two bytes, and so on.  The most important thing to remember is that the Unicode number 
for a particular character is the same *regardless* of the encoding used.
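
As a rough illustration of that point, here is a minimal Python sketch (the variable names are 
just for this example) that encodes the same characters in different encodings.  The byte 
patterns differ, but the Unicode number does not:

```python
# One character, one Unicode number, several byte representations.
e_acute = "\u00e9"           # lowercase e with acute accent
number = ord(e_acute)        # 233, whatever encoding is used later

latin1_bytes = e_acute.encode("iso-8859-1")   # one byte whose value is 233
utf8_bytes = e_acute.encode("utf-8")          # two bytes, since 233 >= 128

# Decoding either byte pattern gives back the same Unicode number.
assert ord(latin1_bytes.decode("iso-8859-1")) == number
assert ord(utf8_bytes.decode("utf-8")) == number

# A character numbered 256 or greater has no direct ISO-8859-1 representation.
euro = "\u20ac"              # euro sign, Unicode number 8364
try:
    euro.encode("iso-8859-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

print(number, list(latin1_bytes), list(utf8_bytes), euro_fits_latin1)
```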

Now there are two ways to include a particular character in an XML document.  The first way is to 
determine what bit pattern the document's encoding says to use for that character's Unicode number, 
and include that pattern in the document itself.  The second way is to use a numeric character 
reference: "&#" followed by the Unicode number in decimal, or "&#x" followed by it in hex, then a semicolon.  
This allows you to include Unicode characters that don't have a direct representation in the 
document's encoding (such as characters over 255 if you're using ISO-8859-1 encoding) or that are 
awkward to insert with your editing tools.
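
To make the second way concrete, here is a small sketch using Python's standard 
xml.etree.ElementTree parser.  The <title> element and its text are invented for the example.  
The document is declared as ISO-8859-1, which cannot represent the euro sign (number 8364) 
directly, yet numeric character references still get it through:

```python
import xml.etree.ElementTree as ET

# An ISO-8859-1 document containing the euro sign only as numeric
# character references: once in decimal, once in hex.
doc = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<title>&#8364;100 or &#x20AC;100</title>"
)

text = ET.fromstring(doc).text

# Both references resolve to the same character, Unicode number 8364.
print([ord(c) for c in text if ord(c) > 255])
```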

Now where the DTD comes in is that XML allows you to define simple "text macros" called general 
entities, which let you write one thing in a document and have it automatically replaced with 
something else.  Apart from the five that XML predefines (&lt;, &gt;, &amp;, &apos; and &quot;), 
these have to be defined in a DTD.  Since it can be hard to remember the Unicode numbers for 
various characters, a common use of general entities is to give various characters names that 
get replaced by numeric character references.

For example, a capital A with a grave accent *always* has the Unicode number 192.  You can *always* 
include it in a document, *regardless* of the encoding, by writing "&#192;".  If your document is 
encoded in ISO-8859-1, you can include it by inserting a byte whose value is 192.  If your document 
is encoded in UTF-8, you can include it by inserting two bytes with the values 195 and 128 (hex 
C3 80).  But if you want to include it by writing "&Agrave;" you need to have a DTD that defines 
"Agrave" as expanding to "&#192;".

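The Agrave case can be tried out with a short sketch like the one below.  It assumes Python's 
standard xml.etree.ElementTree parser and declares the entity in an internal DTD subset rather 
than referencing the Netscape DTD; the <title> element is made up for the example.  It also 
prints the two UTF-8 bytes for the character:

```python
import xml.etree.ElementTree as ET

# With a DOCTYPE whose internal subset defines Agrave, the name works.
with_dtd = (
    '<!DOCTYPE title [<!ENTITY Agrave "&#192;">]>'
    "<title>&Agrave; la carte</title>"
)
# Without any DTD, the same reference is an undefined entity.
without_dtd = "<title>&Agrave; la carte</title>"

expanded = ET.fromstring(with_dtd).text

try:
    ET.fromstring(without_dtd)
    undefined_rejected = False
except ET.ParseError:
    undefined_rejected = True

# The numeric reference needs no DTD at all.
numeric = ET.fromstring("<title>&#192; la carte</title>").text

# Capital A with grave accent as UTF-8: two bytes.
agrave_utf8 = "\u00c0".encode("utf-8")

print(expanded == numeric, undefined_rejected, list(agrave_utf8))
```
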
If your document doesn't have an encoding declaration, it's assumed to be in either UTF-8 or UTF-16 
(the latter only if the very first two bytes of the document are a UTF-16 Byte Order Mark).
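
That default rule can be sketched as a tiny function.  The name default_xml_encoding is 
hypothetical, and this is a simplification: it only looks at the first two bytes and ignores 
anything a transport such as HTTP might say about the encoding:

```python
import codecs

def default_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document that has no encoding
    declaration: UTF-16 if it starts with a UTF-16 Byte Order Mark,
    otherwise UTF-8.  (Hypothetical helper, simplified.)"""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"

source = "<title>\u00c0 la carte</title>"
utf8_doc = source.encode("utf-8")     # no BOM
utf16_doc = source.encode("utf-16")   # Python's "utf-16" codec prepends a BOM

for doc in (utf8_doc, utf16_doc):
    encoding = default_xml_encoding(doc)
    # Decoding with the guessed encoding recovers the same characters.
    assert doc.decode(encoding) == source
    print(encoding, len(doc))
```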

The distinctions between characters, how they're numbered, how they're encoded, and how they're 
represented textually are, admittedly, tricky to understand initially, but they're absolutely 
essential to understand when you're writing international documents.