CHWP B.12 Lancashire, "English Renaissance Knowledge Base"

3. What are the Differences between COCOA Tags and SGML Tags?

3.1. Form and Span

COCOA tags normally have three parts in their form: (1) delimiter characters (the diamond brackets, or some other symbols not found in the text), (2) a variable or type name, and (3) a value or token name. It is a convention that the variable or type name may be dropped if the delimiters themselves can carry that meaning. For example,

      <AUTHOR John Palsgrave>
      <<John Palsgrave>>
amount to the same thing. In the first form, single diamond brackets are the delimiters that separate the variable-type AUTHOR and the value-token John Palsgrave from the text. The variable or type may take any form. For example, other tags could be TITLE, DATE, PUBLISHER, etc. The value or token following it, John Palsgrave, may change, say, into other authors such as Cotgrave, Florio or Thomas. In the second form, the double diamond brackets are understood to stand for <AUTHOR > rather than <TITLE > or <DATE >. Any other tags must use different and unique delimiters.

COCOA tags of a given type, like <AUTHOR John Palsgrave>, hold until another tag of the same type occurs. That is, every word in the text following <AUTHOR John Palsgrave> would be tagged as being written by Palsgrave until a subsequent <AUTHOR > tag occurred. The span of such COCOA tags, then, is indefinite.
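
The two forms and the indefinite span described above can be illustrated with a minimal Python sketch. It is not taken from any COCOA-reading program: the function name tag_words, the SHORTHAND table, and the assumption that type names are written in capitals are all hypothetical choices made for the example.

```python
import re

# Hypothetical convention for this sketch: doubled diamond brackets
# are shorthand for the AUTHOR type, as in the example above.
SHORTHAND = {"<<": "AUTHOR"}

# Matches either <<value>> or <TYPE value>; type names are assumed to be capitals.
TAG = re.compile(r"<<(?P<short>[^<>]+)>>|<(?P<type>[A-Z]+) (?P<value>[^<>]+)>")

def tag_words(text):
    """Return (word, state) pairs; state maps each tag type to its current value."""
    state = {}
    out = []
    pos = 0
    for m in TAG.finditer(text):
        # Words before this tag still carry the previous state.
        for word in text[pos:m.start()].split():
            out.append((word, dict(state)))
        if m.group("short") is not None:
            state[SHORTHAND["<<"]] = m.group("short").strip()
        else:
            state[m.group("type")] = m.group("value").strip()
        pos = m.end()
    # Words after the last tag keep its value: the span is indefinite.
    for word in text[pos:].split():
        out.append((word, dict(state)))
    return out

sample = ("<AUTHOR John Palsgrave> lesclarcissement de la langue "
          "<<Randle Cotgrave>> a dictionarie")
for word, state in tag_words(sample):
    print(word, state)
```

Each tag simply overwrites the current value for its type, so a value holds for every following word until the next tag of the same type, however that tag is written.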

SGML/TEI tags differ in five principal ways, for my purposes.[8]

  1. SGML/TEI tags do not have an abbreviated form, in which the delimiters stand for the tag itself. The variable name must always be present.
  2. SGML/TEI tags may have an indefinite span, prevailing until they are replaced by another tag of the same kind (e.g., <page.break>), but they also may take a closing tag, normally the opening tag with a forward slash preceding the tag variable. Thus the SGML tag <col> would be concluded by </col>. This is so for the following reason.
  3. All two-part tags of this kind surround text, and this text itself, not an editorial word or phrase inside the tag, becomes the value or token for the tag's variable or type. For example, the title page of Cotgrave's dictionary states Compiled by RANDLE COTGRAVE. This would be tagged in SGML/TEI as Compiled by <author> RANDLE COTGRAVE </author>.
  4. SGML/TEI tags may take attributes inside their delimiters. For instance, the <col> tag could have the attributes location=, width=, etc.
  5. SGML/TEI allows for special tags called "entities", which function as string substitutions for what is in the text. The form of an entity is always an ampersand "&", followed by the entity's name, and closed by a semicolon ";". For our purposes, entities serve to represent infrequent or unusual strings within an established ISO character set. For example, &shy; and &emacr; could stand respectively for a soft hyphen and an e with a macron over it.

By permitting closing tags, SGML, unlike COCOA-style tagging, employs the text itself to tag the text, whereas the value or the token in a COCOA tag must always be added to the text by the editor. Another way of putting this is that SGML recognizes the difference between tags authorized by the text, and ones created by the editor from scratch. SGML tagging is textually 'conservative'.
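
Points 3 and 5 can be sketched in a few lines of Python. This is an illustration only, not a real SGML parser: it handles only paired tags that are not nested, and the ENTITIES table and the function names are hypothetical. An actual SGML system would take its entities from declared entity sets rather than from a hand-written table.

```python
import re

# Hypothetical entity table for this sketch only.
ENTITIES = {"shy": "\u00ad", "emacr": "\u0113"}

def expand_entities(text):
    """Replace &name; with its string; unknown entities are left untouched."""
    return re.sub(r"&([A-Za-z][A-Za-z0-9.]*);",
                  lambda m: ENTITIES.get(m.group(1), m.group(0)), text)

# An opening tag (with optional attributes), the enclosed text, and the
# matching closing tag with its forward slash.
ELEMENT = re.compile(
    r"<(?P<name>[\w.]+)(?P<attrs>[^<>]*)>(?P<content>.*?)</(?P=name)>", re.S)

def tagged_values(text):
    """Return (tag name, enclosed text) pairs for non-nested paired tags."""
    return [(m.group("name"), expand_entities(m.group("content")).strip())
            for m in ELEMENT.finditer(text)]

print(tagged_values("Compiled by <author> RANDLE COTGRAVE </author>."))
```

Here the value attached to the author tag is read out of the text between the opening and closing tags; nothing is supplied by the editor inside the tag itself.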

3.2. Structure

COCOA-style tags in a text have no structure as a group; they can go anywhere, in any sequence, and are relevant only to the words that follow. Often such tags are never explained within an electronic text. In contrast, SGML tags are normally declared in a "document-type definition" (DTD) at the start of the text. This declaration arranges those tags in hierarchies, trees or groups. A DTD normally expects that a text is structured like a Ukrainian doll or a Chinese box, one smaller piece or tag division completely inside another. 'Higher' or 'outer' structural units never overlap with 'lower' or 'inner' units. For example, a DTD for a novel might tell us that paragraphs always fall inside chapters, and chapters always inside books, i.e. that a paragraph is never split between two chapters. Or a DTD for a play might tell us that speeches always occur inside scenes, and scenes inside acts; a speech that carries on over two scenes would break the hierarchy. Texts that have flat 'lattice' structures, or several different structures, may also be represented within this formalism.
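
The strict nesting that a DTD normally expects can be checked with a simple stack. The sketch below is an assumption-laden illustration, not DTD syntax: the PARENT table is a hypothetical stand-in for a play's DTD, and check_nesting is an invented name.

```python
import re

# Hypothetical toy "DTD" for a play: each element names the parent it must sit in.
PARENT = {"act": None, "scene": "act", "speech": "scene"}

# Any opening or closing tag; attributes are skipped over.
TAG = re.compile(r"<(/?)([\w.]+)[^<>]*>")

def check_nesting(text):
    """Return a list of problems; an empty list means the tags nest properly."""
    stack, problems = [], []
    for m in TAG.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        innermost = stack[-1] if stack else None
        if not closing:
            if name in PARENT and PARENT[name] != innermost:
                problems.append(
                    f"<{name}> opened inside {innermost!r}, expected {PARENT[name]!r}")
            stack.append(name)
        elif innermost == name:
            stack.pop()
        else:
            problems.append(f"</{name}> closes while {innermost!r} is open")
    problems.extend(f"<{name}> never closed" for name in stack)
    return problems

nested = "<act><scene><speech>Hello</speech></scene></act>"
overlapping = ("<act><scene><speech>To be</scene>"
               "<scene>or not</speech></scene></act>")
print(check_nesting(nested))
print(check_nesting(overlapping))
```

The second sample is the case mentioned above, a speech that carries on over two scenes: the scene closes while the speech is still open, so the boxes no longer sit one inside another.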



Notes

[8] I emphasize that SGML is a much more complex system than I am able to describe here. I touch only on certain characteristics obvious to the scholar doing the tagging.