Re: [syndication] Proper use of DOCTYPE?
3/22/02 1:35:45 PM, "Bill Kearney" <wkearney99@hotmail.com> wrote:
>So what you're saying is that the use of this entity would require using the
>DOCTYPE. Provided, of course, that the DTD presented in that file had the
>entity. The one for Netscape's simple RSS does have this entity.
>
>http://my.netscape.com/publish/formats/rss-0.91.dtd
>
>So would they be best informed to reference the DTD or to use the encoding
>attribute? Or both?
There are two separate issues here: encoding and entity definitions. As you probably know, every
character in an XML document has an assigned numeric value in Unicode. "Encoding" refers to how
these numbers are represented as bit patterns in the physical file, data stream, etc. that
represents the logical document. For example, in the ISO-8859-1 encoding, a character whose
Unicode number is less than 256 is represented as a single byte whose value is the Unicode number,
and a character whose Unicode number is 256 or greater can't be *directly* represented at all. In
the UTF-8 encoding, characters whose numbers are less than 128 are represented as single bytes,
128-2047 as two bytes, and so on. The most important thing to remember is that the Unicode number
for a particular character is the same *regardless* of the encoding used.
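To make that concrete, here is a small Python sketch (my choice of language and of sample character, nothing specific to RSS) that encodes one character both ways and checks that its Unicode number is unchanged:

```python
# U+00C0 (capital A with grave accent) has Unicode number 192.
ch = "\u00c0"
code_point = ord(ch)

# Same character, two different byte representations.
latin1_bytes = ch.encode("iso-8859-1")  # a single byte
utf8_bytes = ch.encode("utf-8")         # two bytes

# Decoding either byte sequence gives back the same Unicode number.
assert ord(latin1_bytes.decode("iso-8859-1")) == code_point
assert ord(utf8_bytes.decode("utf-8")) == code_point

# A character numbered 256 or higher has no ISO-8859-1 byte at all.
try:
    "\u0100".encode("iso-8859-1")
    latin1_can_encode_256 = True
except UnicodeEncodeError:
    latin1_can_encode_256 = False

print(code_point, latin1_bytes.hex(), utf8_bytes.hex(), latin1_can_encode_256)
```

The bytes differ (0xC0 versus 0xC3 0x80), but the character number is 192 in both cases.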
Now there are two ways to include a particular character in an XML document. The first way is to
determine what bit pattern the document's encoding says to use for that character's Unicode number,
and include that pattern in the document itself. The second way is to use a numeric character
reference: "&#" followed by the Unicode number as a decimal string, or "&#x" followed by the number
as a hex string, and then a semicolon.
This allows you to include Unicode characters that don't have a direct representation in the
document's encoding (such as characters over 255 if you're using ISO-8859-1 encoding) or that are
awkward to insert with your editing tools.
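As a rough illustration, the following Python sketch parses a document declared as ISO-8859-1 whose bytes are all plain ASCII, with the interesting characters written only as numeric character references. The element name and the sample characters are made up for the example; the parser is the one in Python's standard library.

```python
import xml.etree.ElementTree as ET

# The file itself contains only ASCII bytes. The three references are:
# decimal 192, hex C0 (the same character), and decimal 8364, which is
# above 255 and so has no direct ISO-8859-1 representation.
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><t>&#192; &#xC0; &#8364;</t>'

text = ET.fromstring(doc).text
code_points = [ord(c) for c in text.split()]
print(code_points)
```

The parser hands back the real characters, so the application never sees the references themselves.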
The DTD comes in because XML allows you to define simple "text macros" called general entities,
which let you write one thing in a document and have it automatically replaced with something
else. Apart from the five predefined ones (&lt;, &gt;, &amp;, &apos; and &quot;), these have to be
defined in a DTD. Since it can be hard to remember the Unicode numbers for various characters, a
common use of general entities is to give various characters names that get replaced by numeric
character references.
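Here is a minimal sketch of such a definition in Python, assuming the entity is declared in the document's own internal DTD subset rather than in an external file like Netscape's. The "note" element is hypothetical; the "Agrave" name follows the usual convention for a capital A with a grave accent.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset that defines Agrave as the character reference for 192.
with_dtd = (
    '<!DOCTYPE note [<!ENTITY Agrave "&#192;">]>'
    '<note>&Agrave;</note>'
)
# Same body, but nothing defines the entity.
without_dtd = '<note>&Agrave;</note>'

expanded = ET.fromstring(with_dtd).text

try:
    ET.fromstring(without_dtd)
    undefined_ok = True
except ET.ParseError:
    # The parser rejects a reference to an entity it has no definition for.
    undefined_ok = False

print(ord(expanded), undefined_ok)
```

With the definition in place the named entity expands to character 192; without it, the document fails to parse.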
For example, a capital A with a grave accent *always* has the Unicode number 192. You can *always*
include it in a document, *regardless* of the encoding, by writing "&#192;". If your document is
encoded in ISO-8859-1, you can include it by inserting a byte whose value is 192. If your document
is encoded in UTF-8, you can include it by inserting two bytes with values that I don't remember
offhand. But if you want to include it by writing "&Agrave;" you need to have a DTD that defines
"Agrave" as expanding to "&#192;".
If your document doesn't have an encoding declaration, it's assumed to be in either UTF-8 or UTF-16
(the latter only if the very first two bytes of the document are a UTF-16 Byte Order Mark).
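The rule above can be sketched in Python as follows. The function name sniff_default_encoding is my own invention, and it deliberately handles only the no-declaration case described here, not the full detection procedure a real parser goes through.

```python
import codecs

def sniff_default_encoding(data: bytes) -> str:
    """Hypothetical helper: guess the encoding of an XML document
    that carries no encoding declaration."""
    # A UTF-16 Byte Order Mark is the two bytes FF FE or FE FF.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # Python's utf-16 codec reads the BOM to pick byte order
    return "utf-8"

sample = "<t>\u00c0</t>"
utf8_doc = sample.encode("utf-8")
utf16_doc = sample.encode("utf-16")  # Python's encoder prepends a BOM

for raw in (utf8_doc, utf16_doc):
    enc = sniff_default_encoding(raw)
    print(enc, raw.decode(enc) == sample)
```

Either way the decoded text is the same document; only the bytes on disk differ.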
The distinctions between characters, how they're numbered, how they're encoded, and how they're
represented textually are, admittedly, tricky to understand initially, but they're absolutely
essential to understand when you're writing international documents.