CHWP A.1 Siemens, "Lemmatization and parsing"

1. The role of the computer

Performing each of the two acts of lemmatizing and parsing requires a great deal of close, comparative, and repetitive work; because those engaged in such work often find the computer a valuable aid, it comes as no surprise that recent research has shown lemmatization and parsing to be operations carried out most efficiently with the assistance of a computer. Programs such as EYEBALL, MicroEYEBALL, and the Micro English Parser for English, LexiTex and Lemmatiseur for French, and LEMMA2 for German, to name only a few, exemplify advances made in this area.[4] Even with these developments, however, the two tasks are still some distance from being characterised as fully and accurately automated processes.

The complexities inherent in written language -- chiefly, the ambiguity of similar graphic forms (homographs) -- mean that manual interaction, to varying but considerable degrees, is necessary when preparing any lemmatized and/or parsed text by computer-assisted means. Because levels of homography vary between languages, however, the amount of success one may hope to achieve when using software to assist in these processes is not determined solely by the software one employs: it is also determined to a large degree by the characteristics of the language with which one is working. While French, English, and German texts can fall victim to substantial errors in parsing and lemmatization because of homography, texts in languages such as Latin and Hebrew, both of which are much more affected by this phenomenon, yield far more erroneous results from automated processes. Such errors must be corrected manually, and the possibility of error requires that all texts which have been processed with the aid of the computer be thoroughly proofread. Thus, while many benefits can come from having and using a parsed and/or lemmatized text,[5] these benefits are reaped only after an investment of time and manual intervention that causes many to question the very role of the computer in either process.
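
To make the problem concrete: the English graphic form "saw", for instance, may represent the noun "saw" or a past form of the verb "see", and no look-up table can choose between these analyses without context or a human reader. The fragment below is a minimal, hypothetical sketch (in Python, with an invented lexicon) of how such ambiguity surfaces in dictionary-based lemmatization; it is illustrative only and does not describe any of the programs named above.

    # A minimal, hypothetical illustration of why homographs defeat
    # simple dictionary-based lemmatization; the lexicon is invented.
    LEXICON = {
        "saw":    [("saw", "noun"), ("see", "verb, past")],
        "rose":   [("rose", "noun"), ("rise", "verb, past")],
        "ladies": [("lady", "noun, plural")],
    }

    def lemmatize(form):
        """Return every analysis recorded for a graphic form."""
        return LEXICON.get(form.lower(), [])

    for word in ("saw", "rose", "ladies"):
        analyses = lemmatize(word)
        if len(analyses) > 1:
            # The program alone cannot choose; context or a human must decide.
            print(word, "is ambiguous:", analyses)
        else:
            print(word, "->", analyses)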

In summarizing computer-based lemmatization methods several years ago -- a summary which is equally appropriate to computer-assisted parsing -- Susan Hockey noted that the computer can be employed in three ways (Hockey 1980: 73). Firstly, the computer can use a set of pre-determined algorithms when attempting to establish the form of a word. Secondly, a text can be tagged with lemma and part-of-speech information at the same time as it is input into the computer. Thirdly, the computer can interact with a computer-readable dictionary, retrieve information on a specific word, and then apply that information to an electronic text. The advantages of each method, however, are accompanied by limitations.
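
The third of these methods can be pictured quite simply: tokenize the text, consult the machine-readable dictionary for each form, and write the retrieved information back into the electronic text as tags. The following Python fragment is a hypothetical sketch of that cycle; the dictionary, tag format, and function name are assumptions made for illustration, not features of any particular system.

    import re

    # Hypothetical machine-readable dictionary: graphic form -> lemma.
    DICTIONARY = {"loves": "love", "loved": "love", "love": "love",
                  "ladies": "lady", "lady": "lady"}

    def tag_text(text):
        """Apply dictionary information to an electronic text as inline tags."""
        tagged = []
        for token in re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text):
            lemma = DICTIONARY.get(token.lower())
            # Unknown forms are tagged "<?>" and left for manual attention.
            tagged.append(f"{token}<{lemma or '?'}>")
        return " ".join(tagged)

    print(tag_text("The lady loved; the ladies love."))
    # -> The<?> lady<lady> loved<love> ;<?> the<?> ladies<lady> love<love> .<?>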

Inputting all the information manually is quite labour-intensive, and to do so is to make only limited use of the capabilities of the computer as a labour-saving device; as well, tagging manually as the text is entered into the computer increases the likelihood of introducing errors and inconsistency into the text. Though the algorithmic approach to determining lemma forms and parts of speech is promising, those using this method are generally unable to alter the algorithms and must, then, accept language- and period-specific limitations; algorithms designed specifically for Latin, for example, would not work with German, nor would algorithms for contemporary Spanish completely suit an Old Spanish text. As well, the users of systems which rely on algorithms typically must allow for many exceptions (the most prominent of these being errors resulting from situations not accounted for in the algorithms); hence, the output often requires significant correction. The high level of emendation required in texts processed by such systems once made scholars question the role of the computer as an automator of lemmatization and parsing processes;[6] significant advances, however, have since been made in this area.[7]
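
The language-specificity of such algorithms is easily demonstrated. The deliberately crude Python sketch below strips a handful of Latin endings to guess a stem; the rules are invented for illustration, and applying them to German words produces exactly the kind of erroneous output that must afterwards be corrected by hand.

    # A deliberately crude, hypothetical suffix-stripping routine for Latin;
    # the rule set is invented and far from complete.
    LATIN_ENDINGS = ["ibus", "orum", "arum", "us", "um", "ae", "is", "as",
                     "os", "es", "a", "o", "i", "e"]

    def guess_stem(form):
        """Strip the longest matching Latin ending to approximate a stem."""
        for ending in sorted(LATIN_ENDINGS, key=len, reverse=True):
            if form.endswith(ending) and len(form) > len(ending) + 1:
                return form[:-len(ending)]
        return form

    print(guess_stem("rosarum"))   # 'ros'   -- plausible for Latin
    print(guess_stem("dominus"))   # 'domin' -- plausible for Latin
    print(guess_stem("Sonne"))     # 'Sonn'  -- wrong for German ('Sonne')
    print(guess_stem("Auto"))      # 'Aut'   -- wrong for German ('Auto')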

Employing the computer as a look-up and tagging device in a process which would still involve much manual editing seems time-consuming, especially with the promise of more fully automatic lemmatization and parsing procedures on the horizon. This last scenario, however, is not so bleak, provided that look-up and tagging procedures are efficient, that the system itself -- in terms of both computer and human labour -- is non-redundant, and that computer procedures automate all that can be effectively and accurately automated.
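
One way of meeting these conditions, sketched hypothetically in Python below, is to resolve each distinct graphic form only once: forms the dictionary analyses unambiguously are tagged automatically, while each ambiguous or unknown form is referred to the human editor a single time and that decision is reused wherever the form recurs, so that neither the computer's look-ups nor the editor's interventions are duplicated. The dictionary and function names are invented for this sketch.

    # Hypothetical sketch of a non-redundant look-up-and-tag workflow:
    # each distinct form is resolved once and the result reused throughout.
    DICTIONARY = {
        "loves": ["love"],        # unambiguous: tag automatically
        "saw":   ["see", "saw"],  # ambiguous:   refer to the editor once
    }

    def build_decisions(tokens, ask_editor):
        """Resolve each distinct form once; return a form -> lemma table."""
        decisions = {}
        for form in sorted({t.lower() for t in tokens}):
            candidates = DICTIONARY.get(form, [])
            if len(candidates) == 1:
                decisions[form] = candidates[0]                 # automated
            else:
                decisions[form] = ask_editor(form, candidates)  # manual, once
        return decisions

    def tag(tokens, decisions):
        return [f"{t}<{decisions[t.lower()]}>" for t in tokens]

    tokens = "She saw the saw he loves".split()
    # In practice ask_editor would prompt the person preparing the text;
    # this stub takes the first candidate, or '?' when there is none.
    decisions = build_decisions(tokens, lambda form, cands: cands[0] if cands else "?")
    print(tag(tokens, decisions))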



Notes

[4] For EYEBALL and MicroEYEBALL, refer to Ross 1981 as well as Ross and Rasche 1972. The Micro English Parser was developed by Michael Stairs and Graeme Hirst at the Centre for Computing in the Humanities (University of Toronto). LexiTex, an expanded version of Lemmatiseur (the Lemmatiser), was developed at the Centre de traitement de l'information (Université Laval); see Lancashire and McCarty 1988 for a description of Lemmatiseur (221-2) and, for LexiTex, see Lancashire 1991 (505-6). Also to be considered is THEME; refer to Bratley and Fortier 1983. For LEMMA2, see Klein 1990 as well as Krause and Willée 1981. Laurent Catach's program GRAPHIST, to be released in the near future, promises to offer much to those working with French.

[5] Often cited are Ross 1974, Oakman 1975, Cluett 1976, and Milic 1967. As well, there are numerous more recent works, such as those by Birch 1985, McColly 1987, and others.

[6] For example, refer to Aitken 1971.

[7] Refer to Venezky 1973, Dawson 1974, and Hann 1975, among others. More recent work can be found in Koskenniemi 1983 and Ritchie and Russell 1992. Of interest also are current products of this nature, including PC-KIMMO (see its review in Computers and the Humanities 26.2 [1992]), those produced by Lingsoft (Finland), and IBM's morphological processing program, OEM.