CHWP B.12 Lancashire, "English Renaissance Knowledge Base"

5. Expectations of a Text Retrieval System

SGML editors already exist for most platforms. Text-retrieval systems such as Pat, from Open Text Systems at Waterloo, can already manage the new encoding format under UNIX, and the Oxford English Dictionary, which Frank Tompa and his colleagues have structured into a database, already has SGML-like tags.

While existing software such as the Oxford Concordance Program and TACT will not easily be rewritten to interpret the new TEI-SGML interchange format, it should be possible for the Text Encoding Initiative itself, or for software developers, to write independent programs that transform TEI-SGML-encoded texts into formats suitable for local processing. Translating syntactic differences between the interchange and local formats should not be difficult. For instance, paired tags like <sig>...</sig> could be replaced by COCOA-style tags such as <texttype sig>...<texttype main>. Attributes of a TEI-SGML tag can be rewritten as separate tags: <page.break no="23" sig="C1v" fol="11v">, for example, could be translated into <page 23> <sig C1v> <fol 11v>.
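The two translations just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a TEI-sanctioned converter: the mapping of the attribute name "no" to the COCOA tag "page" is inferred from the example above, the function and table names are hypothetical, and the regular expressions handle only the simple, non-nested tags shown here.

```python
import re

# Assumed mapping from TEI-SGML attribute names to COCOA tag names,
# inferred from the <page.break ...> example in the text.
ATTR_TO_COCOA = {"no": "page", "sig": "sig", "fol": "fol"}

ATTR_RE = re.compile(r'([\w.]+)\s*=\s*"([^"]*)"')
PAGE_BREAK_RE = re.compile(r"<page\.break\b([^>]*)>")
PAIRED_RE = re.compile(r"<(sig)>(.*?)</\1>", re.DOTALL)


def split_attributes(match):
    """Rewrite each attribute of one tag as a separate COCOA-style tag."""
    attrs = ATTR_RE.findall(match.group(1))
    return " ".join(
        f"<{ATTR_TO_COCOA.get(name, name)} {value}>" for name, value in attrs
    )


def paired_to_state(match):
    """Replace <sig>...</sig> by a state change and a return to 'main'."""
    return f"<texttype {match.group(1)}>{match.group(2)}<texttype main>"


def tei_to_cocoa(text):
    """Apply both translations to a string of TEI-SGML-encoded text."""
    text = PAGE_BREAK_RE.sub(split_attributes, text)
    return PAIRED_RE.sub(paired_to_state, text)


print(tei_to_cocoa('<page.break no="23" sig="C1v" fol="11v">'))
print(tei_to_cocoa("<sig>A2</sig> text"))
```

A real converter would need a proper SGML parser rather than regular expressions, since attribute order, quoting and tag minimization can all vary.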

Other differences may not disappear so quickly. Simple variant readings are readily encoded in COCOA format, but not variants that replace one sequence of words with a shorter or longer one. To match variant and lemma, the variant reading must be 'anchored' to the exact phrase in the main text. Anchoring other pieces of text, such as marginal glosses, poses a similar problem. Contractions and brevigraphs, which do not belong to the collating sequence and are not diacritics, may also not be easily translatable. Textual editors and students of the language will want to recover from these dictionaries all occurrences of a contraction to see in how many different ways it has been expanded.
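One way to picture the anchoring of a variant to its lemma is as a span of token positions in the main text. The sketch below is an assumption for illustration only: the Variant record, its field names and the siglum are hypothetical and are not taken from COCOA or from the TEI guidelines.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    start: int          # index of the first main-text token replaced
    end: int            # index one past the last token replaced
    reading: List[str]  # variant tokens; may be shorter or longer than the lemma
    witness: str        # hypothetical siglum of the witness


def lemma_of(base_tokens, variant):
    """Return the exact phrase in the main text that the variant is anchored to."""
    return base_tokens[variant.start:variant.end]


def apply_variant(base_tokens, variant):
    """Return the main text as it reads in the variant's witness."""
    return base_tokens[:variant.start] + variant.reading + base_tokens[variant.end:]


base = "the quick brown fox".split()
v = Variant(start=1, end=3, reading=["swift"], witness="B")
print(lemma_of(base, v))
print(apply_variant(base, v))
```

Because the anchor is a span and not a single word, a two-word lemma can be matched with a one-word variant, which is the case a one-for-one encoding cannot express.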

A concordance program that can retrieve words, word groups, word patterns and tags, either by themselves or in combination, should be able to extract most of what matters in these dictionaries: proper names and titles, inflections or parts of speech, labels or citations of a source language, and all examples of a given head-lemma (in its entry, in explicit cross-references and in phrases or sentences used under other head-lemmas).

Ideally, software should also be able to retrieve words, word groups, word patterns and tags not just with the immediate context but also with chunks of text nearby that are explicitly associated with a given tag. For the purposes of RKB, for example, it would be desirable to recall all translations, followed by the sample quotations they translate, which have specified words in them; or a listing of all occurrences of a group of English words with both the French head-lemma after which they appear and the immediate context. Because some English words are located many lines after their head-lemmas (e.g., in one of many phrases that are subsumed under the head-lemma), this context selection is not easily done by most text-retrieval systems. Such systems need to be able to retrieve entire passages encoded by a single tag.
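The second kind of listing, English words reported together with their French head-lemma, might be sketched as follows. Everything specific here is an assumption: the tag names <entry>, <lemma>, <phrase> and <trans> are hypothetical and not RKB's tag set, the sample entries are invented for illustration, and the regular expressions assume well-formed, non-nested tags.

```python
import re

# Invented sample with hypothetical tag names; not drawn from any RKB text.
SAMPLE = """<entry><lemma>chien</lemma> <trans>a dog</trans>
<phrase>chien de mer</phrase> <trans>a dogfish, or sea dog</trans></entry>
<entry><lemma>chat</lemma> <trans>a cat</trans></entry>"""

ENTRY_RE = re.compile(r"<entry>(.*?)</entry>", re.DOTALL)
LEMMA_RE = re.compile(r"<lemma>(.*?)</lemma>", re.DOTALL)
TRANS_RE = re.compile(r"<trans>(.*?)</trans>", re.DOTALL)


def find_with_head_lemma(text, word):
    """Return (head_lemma, translation) pairs whose translation contains word."""
    word_re = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    hits = []
    for entry in ENTRY_RE.finditer(text):
        body = entry.group(1)
        lemma = LEMMA_RE.search(body)
        head = lemma.group(1) if lemma else None
        # Every translation inside the entry inherits the entry's head-lemma,
        # however many lines separate the two.
        for trans in TRANS_RE.finditer(body):
            if word_re.search(trans.group(1)):
                hits.append((head, trans.group(1)))
    return hits


for head, passage in find_with_head_lemma(SAMPLE, "dog"):
    print(head, "|", passage)
```

The point of the design is that the unit of retrieval is the whole tagged entry, not a fixed window of lines around the hit, so a translation buried under a subsumed phrase is still reported with its head-lemma.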
