CHWP A.1 Siemens, "Lemmatization and parsing"

3. Methodology: Lemmatization

Discussion regarding a lemmatization methodology is warranted, for there has been much recent debate about what specifically the act of lemmatization entails. Though acknowledging that the term itself is ambiguous, in his article "Homography and Lemmatization in Dutch Texts", Boot begins his discussion of lemmatization with the notion that it is "generally defined as the transformation of all inflected word forms contained in a text to their dictionary look-up form" (Boot 1980: 175). To lemmatize a text by this definition, then, is to reduce each word in that text from its inflectional and variant forms to its base form. Questions remain, however, as to whether this theoretical definition of lemmatization is something that can be practicably carried out at the level of application. To illustrate this point, we can turn to Hann who, in his article "Towards an Algorithmic Methodology of Lemmatization," differentiates between a theoretical and a practical approach to the act; lemmatization, he states, can be on one level "the transformation of a corpus of raw textual data to a series of lemmata" or, on another level, "simply the masking of input word-forms . . . in order to treat equivalent forms of the same term as such" (Hann 1975: 140).[10]

Does lemmatization, then, require a full and complete reduction of words to lemmas based on their individual forms and distinct meanings within those forms, or does it entail only a contraction and masking of words, to what may only possibly be their lemma, based on form alone? If each word type could be assigned a unique lemma form, lemmatizing a text would be comparatively uncomplicated; however, because of homography, wherein words are like one another in form but have distinctly different meanings, this is not so.

In illustration of the larger problem, consider the English word quail, which could mean, as a noun, the bird that we eat or, as a verb, to cower. To conflate both words to the same lemma form would be to lose the distinction in meaning that each separate word has. Improperly treated, this homograph (like others) creates ambiguity in the lemmatized text. Such occurrences can be disambiguated by appending lexical information to the lemmatized form. For example, instead of reducing both of the above forms to the lemma quail, the two could be represented by the separate lemma forms quail<N> and quail<V>, where the contents of the angle brackets denote the part-of-speech for each lemma; when one is parsing as well as lemmatizing, this type of tag can be created automatically by combining elements of each process with the assistance of a word-processor or text editor.
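
The tagging scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in this study: the function name tag_lemma, the sample tokens, and the single-letter part-of-speech codes are invented for the example, and the part-of-speech values are supplied by hand where a parser would normally provide them.

```python
def tag_lemma(lemma, pos):
    """Append a part-of-speech code in angle brackets to a lemma."""
    return f"{lemma}<{pos}>"


# Hypothetical parsed tokens: (form in the text, base form, part of speech).
parsed = [
    ("quails", "quail", "N"),   # the bird
    ("quailed", "quail", "V"),  # to cower
]

# Combine the output of lemmatizing (base form) and parsing (part of speech).
tagged = [tag_lemma(lemma, pos) for _form, lemma, pos in parsed]
print(tagged)
```

Because the part-of-speech code becomes part of the lemma string, any later count or concordance keyed on the lemma keeps the noun and the verb apart without further work.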

When homographs are of the same part-of-speech, however, the issue becomes more complex. Boot illustrates the problem of homonymy using the word play as it occurs in the two following sentences (Boot 1980: 175):

  1. I watched the play.
  2. I like fair play.

While tagging for part-of-speech would differentiate between the use of play as a noun and as a verb (which is not used in this example), both forms above are nouns, each use carrying with it significant differences in meaning; the first use refers to a drama (OED, cf. play [III:15]) and the second refers to conduct or dealing (OED, cf. play [II:12]). Thus, the same graphic form of the same lexical class carries quite differing meanings, and to reduce each to a common lemma would be to ignore their semantic difference.[11]

To be true to the text semantically, then, homographs cannot be simply assigned one lemma or, in complex cases such as that above, one lemma augmented by basic part-of-speech information; rather, each distinct meaning within a common homographic form must be assigned a fully individual lemma to reflect its difference. One solution to this problem is to lemmatize the text following the example of the dictionary;[12] homographs of the same lexical category could be assigned a common lemma stem, but with further tag suffixes to indicate difference. For the above example of the word play, one might tag according to the distinctions offered by the Oxford English Dictionary and associate the word play in the first occurrence with the lemma play<N-III:15> and the second with play<N-II:12>. This would require significantly more manual interaction on the part of the person performing the lemmatization than is typical in such projects, but those using lemmatized texts for the purposes of semantic analysis would likely wish distinctions such as these to be noted, if it is at all practical to do so. Those with other concerns may choose not to be so particular.
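
A dictionary-based scheme of this kind might be sketched as follows. The sketch assumes a small, hand-built sense inventory; the names SENSES and sense_lemma and the gloss keywords are hypothetical, and only the two sense labels are taken from the OED references given above. The choice of gloss is made by the person lemmatizing and is not inferred from the text.

```python
# Hypothetical sense inventory: (lemma, part of speech) -> {gloss: sense label}.
# The two labels follow the OED references cited in the discussion of play.
SENSES = {
    ("play", "N"): {"drama": "III:15", "conduct": "II:12"},
}


def sense_lemma(lemma, pos, gloss):
    """Return lemma<POS-sense> when a sense is recorded, else lemma<POS>."""
    label = SENSES.get((lemma, pos), {}).get(gloss)
    return f"{lemma}<{pos}-{label}>" if label else f"{lemma}<{pos}>"


print(sense_lemma("play", "N", "drama"))    # I watched the play.
print(sense_lemma("play", "N", "conduct"))  # I like fair play.
```

Falling back to the plain part-of-speech tag when no sense is recorded lets a project disambiguate only those homographs that matter to its analysis.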

Because it requires considerably more effort to remove homographic ambiguity than is necessary to simply mask word forms, lemmatization which fully reflects the semantic difference in homographic forms is infrequently done. Those involved in the creation of the Royal Irish Academy's Dictionary of Medieval Latin from Celtic Sources found that "full lemmatization (with removal of ambiguity) . . . demanded more effort on the part of the lexicographer than it was feasible to invest" (Devine 1987: 56; 61). Meunier, who himself has drawn a distinction between graphical and semantic lemmatization, favoured a concentration on graphical forms in his own project (Meunier 1976: 210). Krause and Willée, in their work with German newspaper texts, found that a highly accurate lemmatization, with little ambiguity, came at a cost which exceeded their budget (Krause and Willée 1981: 101). It is no wonder, then, that one criticism of the practice of lemmatization is that it is at most an "operation concealing the semantic problem of homonymy" (Boot 1980: 175). This may very well be the case but, though a lemmatized text may not fully reflect semantic difference in homographic forms, homographs can be disambiguated manually as they arise in data retrieved from that text.
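
The alternative of masking by form and disambiguating at retrieval can be illustrated with a short sketch. It assumes a hand-made table of word forms (MASK) and a hypothetical function retrieve; neither is drawn from the projects cited above. Every sentence containing a form masked to the requested lemma is returned, and the distinction between senses is left to the reader of the results.

```python
import re

# Hypothetical form-to-lemma mask, built from word forms alone.
MASK = {"play": "play", "plays": "play", "played": "play"}


def retrieve(lemma, sentences):
    """Return (sentence number, sentence) pairs containing a form masked to lemma."""
    hits = []
    for number, sentence in enumerate(sentences, start=1):
        words = re.findall(r"[a-z']+", sentence.lower())
        if any(MASK.get(word) == lemma for word in words):
            hits.append((number, sentence))
    return hits


sentences = ["I watched the play.", "I like fair play.", "The quail fled."]
for number, sentence in retrieve("play", sentences):
    print(number, sentence)  # each hit is then checked for sense by a reader
```

Both uses of play are retrieved under the one masked lemma, so the semantic difference between them is deferred, not resolved, until someone inspects the hits.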

By its nature, then, lemmatization is itself an ambiguous act, and one which significantly transforms a text in ways that can profoundly alter the results of studies performed on it. Because of this, the degree of lemmatization which is chosen for a specific text, and the principles which will be applied in the process of lemmatization, should reflect the intended final use of that text. Figure 3 outlines some general principles of lemma form reduction in English.[13] In addition to these principles, one should take into account a number of further issues which are commonly considered when reducing a word to its lemma. Are adverbs which are derived from adjectives best lemmatized under the adverbial form, as is common lexicographical practice, or, as Devine has practised, under the adjective (Devine 1987: 58)?[14] Should a gerund functioning as a noun be classified under its original verb form? In some studies it is also useful if words which are derived from the same root-form are lemmatized under the stem of that form; under this principle, the verb or adjective live and the noun life are conflated to the same lemma (Boot 1980: 175). Furthermore, in many cases it is useful to retain information about the form of the word as it originally appeared in the text, for much data is lost when the past participle of a verb is represented by its infinitive form, or a genitive noun represented by the nominative; for this reason, one should consider the benefits of retaining the original form of a word along with its lemma, and also weigh the advantages of having lexical information available for each word in that text.
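
One way of retaining the original form and its lexical information alongside the lemma is to store all of them in a single record for each word. The sketch below is an assumption about how such a record might look, not a description of the format used in this study; the class Token, its field names, and the sample words are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str      # the word as it appears in the text
    lemma: str     # the reduced form
    pos: str       # part-of-speech code
    features: str  # inflectional information lost in the reduction


tokens = [
    Token("lived", "live", "V", "past participle"),
    Token("lives", "live", "V", "third person singular present"),
    Token("life's", "life", "N", "genitive singular"),
]

# Group by lemma for retrieval while keeping every original form available.
forms_by_lemma = defaultdict(list)
for token in tokens:
    forms_by_lemma[token.lemma].append(token.form)

print(dict(forms_by_lemma))
```

Since the lemma is only one field among several, a study that wishes to conflate live and life under a common root, or to separate them, can do so later without returning to the unlemmatized text.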



Notes

[10] Meunier, further, distinguishes between the semantic and graphical elements of lemmatization (Meunier 1976: 210).

[11] As a more complex example yet, consider the word present. It functions commonly as a noun, in the senses of gift (a present) and now (the present [time]), as a verb, in the sense of giving something (I present this to you), and as an adjective, in the sense of here (I am present).

[12] Refer to Krause's discussion of Sture Allén's thoughts regarding the types of information which may be obtained from dictionaries, and how that information can be used to help resolve the problem of ambiguity (Krause and Willée 1981: 102).

[13] This table is adapted from Devine 1987: 58.

[14] For example, is the adverb quickly most accurately lemmatized as itself or as the adjective quick?