Siemens, Lemmatization and parsing: part 4

4. Methodology: Parsing

Should one choose to parse as well as lemmatize -- actions which the preprocessing programs perform at the same time -- some consideration must also be given to the way in which a grammar will be applied to the text. As with lemmatization, opinion ranges between parsing as an ideal and parsing as a practical reality; consequently, prominent examples likewise differ in what it means to parse in a scholarly context.

When parsing the York Computer Inventory of Prose Style for the study published in Prose Style and Critical Reading, Cluett based his system, called the York Syntactic Code, upon the grammar laid out in Fries' Structure of English. While adopting Milic's own revisions of Fries' grammar,[15] Cluett made several further alterations to allow him to parse his group of texts in greater detail. His method was to apply a three-digit numeric code to each word to represent its part-of-speech and then to analyze the distribution of those numeric codes (see 16-22); for example, the phrase after leaving the ship would appear in the parsed text as the numeric string 513 071 311 011 (Cluett 1976: 19). Ross and Rasche's program, EYEBALL, offers a different approach, employing a small built-in dictionary to assign each word a single-letter code which represents its lexical category; for example, a noun is assigned N, a verb V, an adjective J, and an unknown word ?.[16]

Those parsing today, however, are not limited by the technology available to Cluett and to Ross and Rasche roughly twenty years ago, when studies were constrained to some degree by the means of data storage and retrieval. When deciding upon a tagging methodology to guide parsing, then, one need neither take as minimalist an approach as that taken in EYEBALL nor resort to a numeric system to represent a detailed parsing grammar. Today's technology allows considerable flexibility but, as with lemmatizing, one must decide upon a practical parsing grammar which takes into account the intended use of the text and, of course, one must be prepared for problems arising from homography.
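The two older coding approaches described above can be illustrated with a minimal sketch in Python. It is not a reconstruction of EYEBALL or of the York Syntactic Code: the dictionary entries, the function names, and the sample phrase are hypothetical, and only the letter codes N, V, J, and ? and the numeric string 513 071 311 011 come from the sources cited.

```python
from collections import Counter

# Hypothetical sample dictionary. The letter codes (N, V, J) are those the
# article attributes to EYEBALL; the word entries are invented for illustration.
SAMPLE_DICTIONARY = {"ship": "N", "arrived": "V", "old": "J"}


def tag_with_dictionary(words, dictionary, unknown="?"):
    """Assign each word its dictionary code, or the unknown marker if absent."""
    return [dictionary.get(word.lower(), unknown) for word in words]


def code_distribution(codes):
    """Count how often each code occurs in a sequence of codes."""
    return Counter(codes)


if __name__ == "__main__":
    # "the" is not in the sample dictionary, so it receives the unknown marker.
    print(tag_with_dictionary("the old ship arrived".split(), SAMPLE_DICTIONARY))
    # The numeric string quoted from Cluett, split into its three-digit codes.
    print(code_distribution("513 071 311 011".split()))
```

A lookup of this kind assigns the same code to a word form wherever it occurs, which is why homographs need separate attention.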

Ideally, one's target should be [a] a grammar detailed enough to accurately represent the structure of the text and [b] a tagset which reflects an accepted system. Figure 4 outlines a tagset based upon that employed in two recent projects using the preprocessing programs, again for application to English texts.[17] The grammar it reflects is based, as far as is practical, upon that presented by Quirk and others in their Comprehensive Grammar of the English Language (Quirk 1985).

This tagset, which accounts for word form information alone, attempts to capture the detail of Cluett's system but represents lexical information in a more visually identifiable form akin to EYEBALL's system; instead of using a numeric system, it employs alphabetic abbreviations of recognizable lexical categories. Thus, the phrase the dog jumped, which Cluett would represent as 311 011 021 and EYEBALL as D N V, would be tagged as <DETce-art> <Ns> <LEXV3spa>. Unlike numeric codes, and like EYEBALL's letter codes, these tags are easily and immediately recognizable while proofreading, editing, and analyzing the parsed text.[18]
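The comparison of the three representations of the dog jumped can be laid out side by side with a short Python sketch. It assumes only that tags are written between angle brackets, as in the example quoted above; the function name and the regular expression are hypothetical, and the tag names are treated as opaque labels because their definitions belong to Figure 4.

```python
import re

# The phrase and its three encodings, exactly as quoted in the article.
WORDS = "the dog jumped".split()
CLUETT = "311 011 021".split()
EYEBALL = "D N V".split()
TAGGED = "<DETce-art> <Ns> <LEXV3spa>"

# Assumption: each tag is the text between one "<" and the next ">".
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def side_by_side(words, *schemes):
    """Return one row per word: the word followed by its code in each scheme."""
    for scheme in schemes:
        if len(scheme) != len(words):
            raise ValueError("each scheme needs exactly one code per word")
    return list(zip(words, *schemes))


if __name__ == "__main__":
    tags = TAG_PATTERN.findall(TAGGED)
    for row in side_by_side(WORDS, CLUETT, EYEBALL, tags):
        print("\t".join(row))
```

Printed as rows, the alphabetic tags carry the detail of the numeric codes while remaining readable in the way the letter codes are.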

Notes

[15] A revised version of Fries' grammar was employed by Milic for his study of Swift's style (Milic 1967).

[16] Refer to Hockey for a descriptive overview of the encoding system used by EYEBALL (Hockey 1980: 108-14).

[17] My electronic text of four editions of an early modern dictionary of English, Robert Cawdrey's A Table Alphabeticall, is parsed with a similar tagset and lemmatized with a concentration on graphical forms. Ian Lancashire's forthcoming electronic edition of a collection of Elizabethan homilies, Certaine Sermons or Homilies 1547-1571, contains tags denoting each word's lemma and part-of-speech based on a similar methodology.

[18] Parsing often includes acknowledgement of syntactic relationships as well; although the tagset in Figure 4 does not account for these, they can be included in the tagset and, thus, applied to the text in the same manner. However, because the preprocessing programs are not context-sensitive in applying tags, tagging at this level can be quite time-consuming.