CHWP B.13 Tompa, "Experiences with the OED"

2. Text data modelling

Conventionally, a database is interpreted as a repository of data that, taken as a whole, constitutes a model of (some aspects of) an enterprise. A text database, on the other hand, is a model of one or more texts, which in turn model some aspects of reality (Figure 2). Text databases are therefore used not only for information retrieval (e.g., "What types of monkeys are found in Brazil?"), but also for editorial work and lexical analysis (e.g., "Which words are defined using 'of or pertaining to'?"). Some queries ask about reality, while others ask about the text itself (Tompa & Raymond, 1991). Furthermore, in addition to supporting retrieval, a text database must provide mechanisms for update and revision, and for formal publication and other forms of dissemination.

We need to preserve text 'as written' and to transmit such text from process to process and from machine to machine. Therefore, to indicate the significant units within a text (e.g., the textual extent of an etymology), we have chosen to represent the data using text markup (Coombs, Renear & DeRose, 1987). Three distinct forms of markup are possible: presentational, procedural, and descriptive.

2.1. Presentational markup

This form of text representation, also known as "what you see is what you get" or WYSIWYG, uses typography and layout to indicate textual sub-units.

Ironically, through the adoption of standard printing conventions, this form of markup makes it difficult to distinguish types of text units algorithmically. For example, within the citation to Sporting Mag., where does the location information end and the text itself begin? Consider the difficulty when the last piece of location information is the roman numeral I or the first word in the text is the pronoun I. Furthermore, the string "MACAO" and the string "BYRON" have similar form, but the former is a cross-reference to a dictionary entry whereas the latter is the name of a cited author. The system would find it difficult to satisfy a user who wished to retrieve all citations for Lord Bacon without accidentally retrieving cross-references to pork.
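The ambiguity can be illustrated with a minimal Python sketch. The fragment below is invented for illustration (it is not taken from the OED), and it assumes that small capitals have been flattened to upper case, so the only surface cue left to a program is "a run of capital letters".

```python
import re

# Invented plain-text fragment (not OED data). Small capitals are assumed
# to have been flattened to upper case, with no tags to record their role.
text = "Cf. MACAO. 1819 BYRON Juan ii. 1625 BACON Ess. See also BACON."

# The only cue available is typographic form: two or more capitals in a row.
candidates = re.findall(r"\b[A-Z]{2,}\b", text)

print(candidates)
```

The pattern returns cross-references and author names in one undifferentiated list, and the two occurrences of "BACON" cannot be told apart by form alone. Any further separation would have to rely on guesses about the surrounding context.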

2.2. Procedural markup

An alternative representation uses tags in the text to indicate font shifts and spacing: interpreting the procedural markup converts the tagged text into a corresponding presentational form. This form of tagging is used internally in most word processors, on typesetting tapes, and to control mainframe typesetting systems. The following example is adapted from the keying conventions used for the OED.

Unfortunately, we still have the same difficulties as before. Furthermore, although each typographically distinct field is marked at its start, the extent of the field now has to be deduced from the starting point of the next field (e.g., the end of the date field for the first citation is indicated by +SC whereas the end of the date field for the next citation is indicated by +I). This places a potentially complicated pattern-matching burden on all programs that must extract fields from the text.
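A rough Python sketch shows the kind of pattern matching this implies. The sample string is invented, and the tag set is an assumption made for illustration: only +SC and +I are mentioned above, while +B and +R (and the reading of the four tags as bold, small capitals, italic, and roman) are hypothetical.

```python
import re

# Invented sample. +B, +SC, +I and +R are treated here as hypothetical
# font-shift tags (bold, small capitals, italic, roman).
sample = "+B1819 +SCByron +IJuan +Rii. 5 +B1823 +ISporting Mag. +Rxi. 3"

TAG = re.compile(r"\+(SC|B|I|R)")

def fields(text):
    """Split procedurally tagged text into (tag, content) pairs.

    No tag marks the end of a field, so each field is taken to run
    up to the start of the next tag, or to the end of the text.
    """
    matches = list(TAG.finditer(text))
    result = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        result.append((m.group(1), text[m.end():end].strip()))
    return result

def dates(text):
    # Assumption for this sketch: dates are the fields keyed in bold.
    return [content for tag, content in fields(text) if tag == "B"]

print(fields(sample))
print(dates(sample))
```

In this sample the first date is terminated by +SC and the second by +I, so a program cannot look for a single closing marker. It must know the full tag set, and it still identifies a date only by its typeface rather than by its role.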

2.3. Descriptive markup

As with procedural markup, the third form of text markup uses tags to delimit units of text. However, the name of each tag is chosen to indicate the role of the unit in the text rather than how it is to appear in print.

Notice that each field is delimited at both ends, and that the use of cross-reference tags (XR and XL) versus author tags (A) distinguishes the role of "MACAO" from the role of "BYRON". This is the form of markup chosen for the OED: a partial list of tags is given in Figure 3.
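To suggest how descriptive tags simplify retrieval, here is a minimal Python sketch. The fragment is invented, and several details are assumptions: the angle-bracket syntax, the nesting of XL inside XR, and the Q tag used to enclose a citation are all hypothetical choices for illustration. Only the tag names A, XR and XL come from the discussion above.

```python
import re

# Invented fragment. The angle-bracket syntax, the nesting of XL within XR,
# and the Q (citation) tag are assumptions made for this sketch.
sample = (
    "<XR>see <XL>BACON</XL></XR> "
    "<Q><A>BACON</A> Ess. Some quoted text.</Q> "
    "<Q><A>BYRON</A> Juan I want a hero.</Q>"
)

def elements(text, tag):
    """Return the content of every <tag>...</tag> element.

    Assumes an element never contains another element of the same name.
    """
    name = re.escape(tag)
    return re.findall(r"<{0}>(.*?)</{0}>".format(name), text, re.S)

authors = elements(sample, "A")
cross_refs = elements(sample, "XL")

# Citations attributed to BACON, found by role rather than by typeface.
bacon_citations = [q for q in elements(sample, "Q")
                   if "BACON" in elements(q, "A")]

print(authors, cross_refs, len(bacon_citations))
```

Because both ends of every unit are marked and the tag names record roles, the same string "BACON" is retrieved once as an author and once as a cross-reference target, with no contextual guessing.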
